Abstract. Embedded devices designed for various real-time multimedia and telecom applications, have a bottleneck in energy consumption and performance that becomes day by day more crucial. This is imposed by the increasing gap between processor and memory speed. Many authors have addressed this problem, but all existing techniques either consider only performance without any other trade-off, or they operate at the level of individual loops. We fill this gap, by presenting a technique which achieves parallelization in the memory accesses through four loop transformations. Our estimations from two real-life applications from the multimedia and telecom domain, reveal that using our technique, we can either increase the performance (up to 35%) or lower the energy consumption (up to 20%) for the same cost.