Techniques for aggressive optimization and parallelization of applications can have the side-effect of introducing copy instructions, register-to-register move instructions, into the generated code. This preserves program correctness while avoiding the need for global searchand-update of registers. However, copy instructions only transfer data between registers while requiring the use of system resources (ALUs) and are essentially overhead operations which can potentially limit performance. Conventional copy propagation and copy removal techniques are not powerful enough to remove these copies as, during loop parallelization, the lifetimes of the values copied may span over loop boundaries. In this paper, we present a technique for copy removal that incrementally unrolls a loop body and re-allocates registers to values so that no copy operations are required. We also present a heuristic version that limits the amount of unrolling and present experimentation that demonstrates the necess...
David J. Kolson, Alexandru Nicolau, Nikil D. Dutt