The cache hierarchy design in existing SMT and superscalar processors is optimized for latency, but not for bandwidth. The size of the L1 data cache did not scale over the past dec...
Large datasets, on the order of GB and TB, are increasingly common as abundant computational resources allow practitioners to collect, produce and store data at higher rates. As d...
This paper describes a novel approach to generate an optimized schedule to run threads on distributed shared memory (DSM) systems. The approach relies upon a binary instrumentatio...
In enterprise and data center networks, the scalability of the data plane becomes increasingly challenging as forwarding tables and link speeds grow. Simply building switches with...
The use of large instruction windows coupled with aggressive out-oforder and prefetching capabilities has provided significant improvements in processor performance. In this paper...