This site uses cookies to deliver our services and to ensure you get the best experience. By continuing to use this site, you consent to our use of cookies and acknowledge that you have read and understand our Privacy Policy, Cookie Policy, and Terms
In modern clustering environments where the memory hierarchy has many layers (distributed memory, shared memory layer, cache, ¡ ¢ ), an important question is how to fully u...
Abstract. This paper presents a study of performance optimization of dense matrix multiplication on IBM Cyclops-64(C64) chip architecture. Although much has been published on how t...
Ziang Hu, Juan del Cuvillo, Weirong Zhu, Guang R. ...
Moore’s Law suggests that the number of processing cores on a single chip increases exponentially. The future performance increases will be mainly extracted from thread-level par...
Nan Yuan, Yongbin Zhou, Guangming Tan, Junchao Zha...
We present the preliminary design for a C++ template library to enable the compositional construction of matrix classes suitable for high performance numerical linear algebra comp...
We present a new parallel algorithm to compute an exact triangularization of large square or rectangular and dense or sparse matrices in any field. Using fast matrix multiplicatio...