Transactional Coherence and Consistency (TCC) is a novel coherence scheme for shared memory multiprocessors that uses programmer-defined transactions as the fundamental unit of p...
Scratchpad memory (SPM), a fast software-managed onchip SRAM, is now widely used in modern embedded processors. Compared to hardware-managed cache, it is more efficient in perfor...
Hiding communication latency is an important optimization for parallel programs. Programmers or compilers achieve this by using non-blocking communication primitives and overlappi...
Global address space languages like UPC exhibit high performance and portability on a broad class of shared and distributed memory parallel architectures. The most scalable applic...
In this paper we compare the performance of area equivalent small, medium, and large-scale multithreaded chip multiprocessors (CMTs) using throughput-oriented applications. We use...
Current high-performance computer systems are unable to saturate the latest available high-bandwidth networks such as 10 Gigabit Ethernet. A key obstacle in achieving 10 gigabits ...
Nathan L. Binkert, Lisa R. Hsu, Ali G. Saidi, Rona...
The continual demand for greater performance and growing concerns about the power consumption in highperformance microprocessors make the branch predictor a critical component of ...
This paper presents a new technique for efficient usage of small trace caches. A trace cache can significantly increase the performance of wide out-oforder processors, but to be e...
This paper proposes a new hardware technique for using one core of a CMP to prefetch data for a thread running on another core. Our approach simply executes a copy of all non-cont...