This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multi-core architectures. Two implementations of algorithms-by-...
—We present Huckleberry, a tool for automatically generating parallel implementations for multi-core platforms from sequential recursive divide-and-conquer programs. The recursiv...
Rebecca L. Collins, Bharadwaj Vellore, Luca P. Car...
The significant growth in computational power of modern Graphics Processing Units(GPUs) coupled with the advent of general purpose programming environments like NVIDA's CUDA,...
Kishore Kothapalli, Rishabh Mukherjee, M. Suhail R...
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of searchbased performance optimizatio...
Samuel Williams, Jonathan Carter, Leonid Oliker, J...
—The problem of automatically generating hardware modules from a high level representation of an application has been at the research forefront in the last few years. In this pap...