The objective of this paper is to extend, in the context of multicore architectures, the concepts of tile algorithms [Buttari et al., 2007] for Cholesky, LU, QR factorizations to t...
Abstract--On NVIDIA's many-core GPUs, there is no synchronization function among parallel thread blocks. When fine granularity of data communication and synchronization is requ...
Jianmin Chen, Zhuo Huang, Feiqi Su, Jih-Kwon Peir,...
Address translation often emerges as a critical performance bottleneck for virtualized systems and has recently been the impetus for hardware paging mechanisms. These mechanisms ap...
Giang Hoang, Chang Bae, Jack Lange, Lide Zhang, Pe...
Memory models like SC, TSO, and PC enforce load-load ordering, requiring that loads from any single thread appear to occur in program order to all other threads. Out-of-order execu...
Graphics processing units (GPUs) provide a low cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL...
Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor N. ...