Clustered architectures, which aim to process data within a localized processing element (PE), are one approach to increasing performance despite wire delay problems. The performance of a clustered architecture depends on how many instructions execute in parallel and on how much inter-PE communication is needed to synchronize dependent instructions. In this paper, we propose an arrangement in which each PE cooperates with its adjacent PEs through communication structures added between them, in order to alleviate inter-PE communication overhead and workload imbalance effectively. We evaluate the proposed configurations and compare them with the existing configuration considered so far. The results show that the proposed adjacent forwarding network configuration, combined with an instruction steering scheme that considers both register fanout and the number of available free registers, can achieve higher instructions per clock (IPC) with a small number of registers per PE than the other confi...