Sciweavers

260 search results - page 23 / 52
» Performance Modelling and Optimization of Memory Access on C...
Sort
View
PPOPP
2010
ACM
14 years 4 months ago
Data transformations enabling loop vectorization on multithreaded data parallel architectures
Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memo...
Byunghyun Jang, Perhaad Mistry, Dana Schaa, Rodrig...
EUROPAR
1997
Springer
13 years 11 months ago
Modulo Scheduling with Cache Reuse Information
Instruction scheduling in general, and software pipelining in particular face the di cult task of scheduling operations in the presence of uncertain latencies. The largest contrib...
Chen Ding, Steve Carr, Philip H. Sweany
HPCA
2009
IEEE
14 years 8 months ago
Design and implementation of software-managed caches for multicores with local memory
Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have caches for their accelerator cores because coherence traffic, cache misses, and latencies fr...
Sangmin Seo, Jaejin Lee, Zehra Sura
IPPS
1997
IEEE
13 years 11 months ago
DPF: A Data Parallel Fortran Benchmark Suite
We present the Data Parallel Fortran (DPF) benchmark suite, a set of data parallel Fortran codes forevaluatingdata parallel compilers appropriatefor any target parallel architectu...
Y. Charlie Hu, S. Lennart Johnsson, Dimitris Kehag...
PC
2010
190views Management» more  PC 2010»
13 years 6 months ago
High-performance cone beam reconstruction using CUDA compatible GPUs
Compute unified device architecture (CUDA) is a software development platform that allows us to run C-like programs on the nVIDIA graphics processing unit (GPU). This paper prese...
Yusuke Okitsu, Fumihiko Ino, Kenichi Hagihara