Sciweavers

45 search results - page 2 / 9
» Performance Evaluation of Tiling for the Register Level
Sort
View
CF
2009
ACM
14 years 4 months ago
Mapping the LU decomposition on a many-core architecture: challenges and solutions
Recently, multi-core architectures with alternative memory subsystem designs have emerged. Instead of using hardwaremanaged cache hierarchies, they employ software-managed embedde...
Ioannis E. Venetis, Guang R. Gao
ICS
1995
Tsinghua U.
14 years 1 months ago
Optimum Modulo Schedules for Minimum Register Requirements
Modulo scheduling is an e cient technique for exploiting instruction level parallelism in a variety of loops, resulting in high performance code but increased register requirement...
Alexandre E. Eichenberger, Edward S. Davidson, San...
HPCA
2008
IEEE
14 years 10 months ago
An OS-based alternative to full hardware coherence on tiled CMPs
The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling...
Christian Fensch, Marcelo Cintra
MICRO
2000
IEEE
176views Hardware» more  MICRO 2000»
13 years 9 months ago
An Advanced Optimizer for the IA-64 Architecture
level of abstraction, compared with the program representation for scalar optimizations. For example, loop unrolling and loop unrolland-jam transformations exploit the large regist...
Rakesh Krishnaiyer, Dattatraya Kulkarni, Daniel M....
EUROPAR
2010
Springer
13 years 11 months ago
Optimized Dense Matrix Multiplication on a Many-Core Architecture
Abstract. Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures, like the IBM Cyclops-64 (C64), b...
Elkin Garcia, Ioannis E. Venetis, Rishi Khan, Guan...