We present a case study parallelizing streaming aggregation on three different parallel hardware architectures. Aggregation is a performance-critical operation for data summarizat...
Scott Schneider, Henrique Andrade, Bugra Gedik, Ku...
In this paper, we describe a generalized approach to deriving a custom data layout in multiple memory banks for array-based computations, to facilitate high-bandwidth parallel mem...
There are two important hurdles that restrict the scalability of directory-based shared-memory multiprocessors: the directory memory overhead and the long L2 miss latencies due to ...
Utilizing the nonuniform latencies of SDRAM devices, access reordering mechanisms alter the sequence of main memory access streams to reduce the observed access latency. Using a r...
It is possible to implement the parallel random access machine (PRAM) on a chip multiprocessor (CMP) efficiently with an emulated shared memory (ESM) architecture to gain easy par...