Sciweavers

14 search results - page 2 / 3
» High-performance implementation of the level-3 BLAS
JPDC
2008
13 years 8 months ago
Parallel block tridiagonalization of real symmetric matrices
Two parallel block tridiagonalization algorithms and implementations for dense real symmetric matrices are presented. Block tridiagonalization is a critical pre-processing step for...
Yihua Bai, Robert C. Ward
ARCS
2008
Springer
13 years 10 months ago
An Optimized ZGEMM Implementation for the Cell BE
The architecture of the IBM Cell BE processor represents a new approach for designing CPUs. The fast execution of legacy software has to stand back in order to achieve very high ...
Timo Schneider, Torsten Hoefler, Simon Wunderlich,...
SPAA
1998
ACM
14 years 24 days ago
Elimination Forest Guided 2D Sparse LU Factorization
Sparse LU factorization with partial pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machin...
Kai Shen, Xiangmin Jiao, Tao Yang
PVM
2009
Springer
14 years 1 month ago
Multiple-Level MPI File Write-Back and Prefetching for Blue Gene Systems
This paper presents the design and implementation of an asynchronous data-staging strategy for file accesses based on ROMIO, the most popular MPI-IO distribution, and ZeptoOS, an ...
Javier García Blas, Florin Isaila, Jesú...
EUROPAR
2009
Springer
14 years 1 month ago
Using Hybrid CPU-GPU Platforms to Accelerate the Computation of the Matrix Sign Function
We investigate the performance of two approaches for matrix inversion based on Gaussian (LU factorization) and Gauss-Jordan eliminations. The target architecture is a cur...
Peter Benner, Pablo Ezzatti, Enrique S. Quintana-O...