In LAPACK, many matrix operations are cast as block algorithms, which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high perf...
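The blocked pattern this abstract describes (factor a narrow panel with an unblocked kernel, then fold its effect into the trailing submatrix with matrix-matrix operations) can be outlined as follows. This is a minimal sketch: `unblocked_factor` and `trailing_update` are hypothetical stand-ins for the Level-2 panel kernel and Level-3 trailing update of a real LAPACK routine, not actual LAPACK calls.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the blocked pattern: sweep over the matrix in panels of width NB,
// factor each panel with an unblocked kernel, then update the trailing
// submatrix with matrix-matrix operations. The two kernels below are
// placeholders, not real LAPACK/BLAS routines.

constexpr std::size_t NB = 64;  // panel (block) width

void unblocked_factor(std::vector<double>& A, std::size_t n,
                      std::size_t k, std::size_t kb) {
    // ... factor the kb-wide panel starting at column k (Level-2 BLAS work)
}

void trailing_update(std::vector<double>& A, std::size_t n,
                     std::size_t k, std::size_t kb) {
    // ... apply the panel's transformations to columns k+kb .. n-1
    //     using matrix-matrix (Level-3 BLAS) kernels
}

void blocked_factor(std::vector<double>& A, std::size_t n) {
    for (std::size_t k = 0; k < n; k += NB) {
        std::size_t kb = std::min(NB, n - k);  // last panel may be narrower
        unblocked_factor(A, n, k, kb);         // memory-bound, but small
        trailing_update(A, n, k, kb);          // compute-bound, dominates runtime
    }
}
```

The standard rationale for this structure is that nearly all of the arithmetic lands in the trailing update, so the blocked form runs at near matrix-multiply speed even though the panel kernel itself is memory-bound.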
Exploiting the emerging reality of affordable multi-core architectures by providing programmers with simple abstractions that would enable them to easily turn their sequential p...
This paper presents and validates performance models for a variety of high-performance collective communication algorithms for systems with Cell processors. The systems modeled in...
Adaptive, or self-aware, computing has been proposed to help application programmers confront the growing complexity of multicore software development. However, existing approache...
Henry Hoffmann, Jonathan Eastep, Marco D. Santambr...
Transactional memory promises to make parallel programming easier than with fine-grained locking, while performing just as well. This performance claim is not always borne out bec...
We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager Binary Splitting work-stealing...
Alexandros Tzannes, George C. Caragea, Rajeev Baru...
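For context on the baseline this abstract mentions: Eager Binary Splitting, as used by work-stealing loop schedulers such as TBB's, recursively halves a parallel loop's iteration range down to a grain size so that idle workers can steal the untouched half. The sketch below illustrates only that eager variant, not the authors' lazy scheme, and uses std::async as a crude stand-in for a real work-stealing scheduler.

```cpp
#include <cstddef>
#include <future>

// Eager Binary Splitting: keep halving the iteration range until it is no
// larger than GRAIN, eagerly exposing one half as a stealable task at every
// split. std::async stands in for a work-stealing scheduler (which would reuse
// a fixed pool of workers); LBS instead delays the split until a steal is
// actually likely.

constexpr std::size_t GRAIN = 1024;   // stop splitting below this many iterations

template <typename Body>
void parallel_for_ebs(std::size_t lo, std::size_t hi, Body body) {
    if (hi - lo <= GRAIN) {                       // small range: run serially
        for (std::size_t i = lo; i < hi; ++i) body(i);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    auto right = std::async(std::launch::async,   // expose the right half immediately
                            parallel_for_ebs<Body>, mid, hi, body);
    parallel_for_ebs(lo, mid, body);              // keep descending into the left half
    right.get();                                  // join the spawned half
}

// Example use:
//   std::vector<double> v(1 << 20);
//   parallel_for_ebs(0, v.size(), [&](std::size_t i) { v[i] *= 2.0; });
```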
Many large-scale parallel programs follow a bulk synchronous parallel (BSP) structure with distinct computation and communication phases. Although the communication phase in such ...
Torsten Hoefler, Christian Siebert, Andrew Lumsdai...
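Since the abstract's argument rests on the alternation of compute and communicate phases, a skeletal BSP superstep loop is shown below. It is a generic illustration under stated assumptions (C++20 threads and a barrier stand in for a real message-passing layer such as MPI, and the phase bodies are placeholders), not the paper's scheme.

```cpp
#include <barrier>
#include <cstddef>
#include <thread>
#include <vector>

// Skeleton of a bulk synchronous parallel (BSP) program: every process
// (modelled here as a thread) alternates a local computation phase with a
// communication phase, separated by a global barrier.

int main() {
    const std::size_t nworkers = 4;
    const std::size_t supersteps = 10;
    std::barrier sync(static_cast<std::ptrdiff_t>(nworkers));

    std::vector<std::jthread> workers;
    for (std::size_t id = 0; id < nworkers; ++id) {
        workers.emplace_back([&, id] {
            for (std::size_t step = 0; step < supersteps; ++step) {
                // --- computation phase: purely local work ---
                // compute(id, step);
                sync.arrive_and_wait();
                // --- communication phase: exchange data with peers ---
                // communicate(id, step);
                sync.arrive_and_wait();
            }
        });
    }
}   // jthreads join automatically on destruction
```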
As concurrent programming becomes prevalent, software providers are investing in concurrency libraries to improve programmer productivity. Concurrency libraries improve productivi...
Katherine E. Coons, Sebastian Burckhardt, Madanlal...
Drawing inspiration from several previous projects, we present an ownership-record-free software transactional memory (STM) system that combines extremely low overhead with unusua...
Luke Dalessandro, Michael F. Spear, Michael L. Sco...
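The ownership-record-free design named here (realized in the authors' NOrec system) dispenses with per-location metadata in favor of a single global sequence lock plus value-based validation of a private read log. The following is a heavily simplified, word-granularity sketch of that idea; it omits contention management, nesting, and the other machinery a production STM needs, so treat it as an illustration rather than the paper's full algorithm. Shared words are modelled as std::atomic<uint64_t> to keep the sketch free of data races.

```cpp
#include <atomic>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Ownership-record-free STM sketch: one global sequence lock (odd while a
// writer is committing) plus value-based validation of a private read log.

static std::atomic<uint64_t> global_seq{0};   // even = unlocked, odd = writer committing

struct TxAbort {};                            // thrown to restart a doomed transaction

struct Tx {
    uint64_t snapshot = 0;                    // sequence value this transaction started from
    std::vector<std::pair<std::atomic<uint64_t>*, uint64_t>> reads;  // (addr, value seen)
    std::unordered_map<std::atomic<uint64_t>*, uint64_t> writes;     // redo log

    static uint64_t wait_even() {             // spin until no writer holds the lock
        uint64_t s;
        do { s = global_seq.load(); } while (s & 1);
        return s;
    }

    void begin() { snapshot = wait_even(); reads.clear(); writes.clear(); }

    // Value-based validation: re-read every logged location and require the
    // same value; on success, advance the snapshot to the current sequence.
    bool revalidate() {
        for (;;) {
            uint64_t s = wait_even();
            for (auto& [addr, val] : reads)
                if (addr->load() != val) return false;
            if (global_seq.load() == s) { snapshot = s; return true; }
        }   // a writer intervened while we were checking; validate again
    }

    uint64_t read(std::atomic<uint64_t>* addr) {
        if (auto it = writes.find(addr); it != writes.end()) return it->second;
        uint64_t val = addr->load();
        while (global_seq.load() != snapshot) {   // someone committed since we started
            if (!revalidate()) throw TxAbort{};
            val = addr->load();
        }
        reads.emplace_back(addr, val);
        return val;
    }

    void write(std::atomic<uint64_t>* addr, uint64_t val) { writes[addr] = val; }

    bool commit() {
        if (writes.empty()) return true;          // read-only transactions commit for free
        while (!global_seq.compare_exchange_weak(snapshot, snapshot + 1))
            if (!revalidate()) return false;      // a logged value changed: abort
        for (auto& [addr, val] : writes) addr->store(val);
        global_seq.store(snapshot + 2);           // publish writes, release the lock
        return true;
    }
};
```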