In this work, we propose a new FPGA design flow that combines the CUDA programming model from Nvidia with the state of the art high-level synthesis tool AutoPilot from AutoESL, to...
Future single-board multi-socket systems may be unable to deliver the needed memory bandwidth electrically due to power limitations, which will hurt their ability to drive perform...
Scott Beamer, Krste Asanovic, Christopher Batten, ...
Several existing compiler transformations can help improve communication-computation overlap in MPI applications. However, traditional compilers treat calls to the MPI library as ...
Anthony Danalis, Lori L. Pollock, D. Martin Swany,...
General purpose programming on the graphics processing units (GPGPU) has received a lot of attention in the parallel computing community as it promises to offer the highest perfo...
M. Suhail Rehman, Kishore Kothapalli, P. J. Naraya...
Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture...
The speed gap between processor and memory continues to limit performance. To address this problem, we explore the potential of eliminating Zero Loads—loads accessing memory loc...
Md. Mafijul Islam, Sally A. McKee, Per Stenstr&oum...
Matching regular expressions (regexps) is a very common workload. For example, tokenization, which consists of recognizing words or keywords in a character stream, appears in ever...
Tiling is a crucial loop transformation for generating high performance code on modern architectures. Efficient generation of multilevel tiled code is essential for maximizing da...
Albert Hartono, Muthu Manikandan Baskaran, C&eacut...