Sciweavers

» Hiding Communication Latency in Data Parallel Applications
CCGRID
2004
IEEE
13 years 11 months ago
High performance LU factorization for non-dedicated clusters
This paper describes an implementation of parallel LU factorization. The focus is to achieve high performance on non-dedicated clusters, where the number of available computing re...
Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori ...
HPCA
2007
IEEE
14 years 8 months ago
Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications
Chip multiprocessors with multiple simpler cores are gaining popularity because they have the potential to drive future performance gains without exacerbating the problems of powe...
Hongtao Zhong, Steven A. Lieberman, Scott A. Mahlk...
IEEECIT
2010
IEEE
13 years 6 months ago
Exploiting More Parallelism from Applications Having Generalized Reductions on GPU Architectures
Reduction is a common component of many applications, but can often be the limiting factor for parallelization. Previous reduction work has focused on detecting reduction idioms a...
Xiao-Long Wu, Nady Obeid, Wen-Mei Hwu
IPPS
2007
IEEE
14 years 2 months ago
Improving Data Access Performance with Server Push Architecture
Data prefetching, where data is fetched before the CPU demands it, has been considered an effective solution to mask data access latency. However, the current client-initiated ...
Xian-He Sun, Surendra Byna, Yong Chen
ICS
2007
Tsinghua U.
14 years 2 months ago
Performance driven data cache prefetching in a dynamic software optimization system
Software or hardware data cache prefetching is an efficient way to hide cache miss latency. However, the effectiveness of the issued prefetches has to be monitored in order to maximi...
Jean Christophe Beyler, Philippe Clauss