This paper describes a new parallel algorithm for Minimum Cost Path computation on the Polymorphic Processor Array, a massively parallel architecture based on a reconfigurable mesh...
Vector prefix and reduction are collective communication primitives in which all processors must cooperate. We present two parallel algorithms, the direct algorithm and the split ...
Fully-populated tori, where every node has a processor attached, do not scale well since load on edges increases superlinearly with network size under heavy communication, resulti...
We study the problem of mapping the N nodes of a complete t-ary tree on M memory modules so that they can be accessed in parallel by templates, i.e. distinct sets of nodes. Typica...
Vincenzo Auletta, Sajal K. Das, Amelia De Vivo, Ma...
The increasing gap in processor and memory speeds has forced microprocessors to rely on deep cache hierarchies to keep the processors from starving for data. For many applications...
A non-redundant fault-tolerant broadcasting algorithm in a faulty k-ary n-cube is designed. The algorithm can adapt up to 2n,2 node failures. Compared to the optimal algorithm in a...
Abstract. Programmability is an important capability that provides flexible computing devices, but it incurs significant performance and power penalties. We have proposed an archit...
Arthur Abnous, Katsunori Seno, Yuji Ichikawa, Marl...
Characterizing shared-memory applications provides insight to design efficient systems, and provides awareness to identify and correct application performance bottlenecks. Configu...
Software fine-grain distributed shared memory (FGDSM) provides a simplified shared-memory programming interface with minimal or no hardware support. Originally software FGDSMs tar...
Ioannis Schoinas, Babak Falsafi, Mark D. Hill, Jam...
Register coalescing is used, as part of register allocation, to reduce the number of register copies. Developing efficient register coalescing heuristics is particularly important ...