This paper presents and validates performance models for a variety of high-performance collective communication algorithms for systems with Cell processors. The systems modeled in...
Adaptive, or self-aware, computing has been proposed to help application programmers confront the growing complexity of multicore software development. However, existing approache...
Henry Hoffmann, Jonathan Eastep, Marco D. Santambr...
Transactional memory promises to make parallel programming easier than with fine-grained locking, while performing just as well. This performance claim is not always borne out bec...
We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager Binary Splitting work-stealing...
Alexandros Tzannes, George C. Caragea, Rajeev Baru...
Many large-scale parallel programs follow a bulk synchronous parallel (BSP) structure with distinct computation and communication phases. Although the communication phase in such ...
Torsten Hoefler, Christian Siebert, Andrew Lumsdai...
As concurrent programming becomes prevalent, software providers are investing in concurrency libraries to improve programmer productivity. Concurrency libraries improve productivi...
Katherine E. Coons, Sebastian Burckhardt, Madanlal...
Drawing inspiration from several previous projects, we present an ownership-record-free software transactional memory (STM) system that combines extremely low overhead with unusua...
Luke Dalessandro, Michael F. Spear, Michael L. Sco...
Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memo...
Byunghyun Jang, Perhaad Mistry, Dana Schaa, Rodrig...
We present a static analysis for the automatic generation of symbolic prefetches in a transactional distributed shared memory. A symbolic prefetch specifies the first object to be...