Even after carefully tuning the memory characteristics to the application properties and the processor speed, during the execution of real applications there are times when the processor stalls, waiting for data from the memory. Processor stall can be used to increase the throughput by temporarily switching to a different thread of execution, or reduce the power and energy consumption by temporarily switching the processor to low-power mode. However, any such technique has a performance overhead in terms of switching time. Even though over the execution of an application the processor is stalled for a considerable amount of time, each stall duration is too small to profitably perform any state switch. In this paper, we present code transformations to aggregate processor free time. Our experiments on the Intel XScale and Stream kernels show that up to 50,000 processor cycles can be aggregated, and used to profitably switch the processor to low-power mode. We further show that our co...
Aviral Shrivastava, Eugene Earlie, Nikil D. Dutt,