Architectural resources and program recurrences are the main limits on the amount of Instruction-Level Parallelism (ILP) exploitable from loops, the most time-consuming part of numerical computations. To increase the number of operations per second, current designs replicate memory ports and functional units to growing degrees. But the high cost of this technique in terms of power and cycle time precludes the use of high degrees of replication: long cycle times may result in diminishing returns, while excessive power consumption may lead to unreliable operation. Clustering is a technique aimed at decentralizing the design of future wide-issue cores, enabling them to meet technology constraints on cycle time, area, and power. Another way to reduce the area of recent cores is the use of wide functional units. This technique requires only minor modifications to the underlying hardware, but imposes a penalty on the exploitable parallelism....