Abstract--With General Purpose programmable GPUs becoming more and more popular, automated tools are needed to bridge the gap between achievable performance from highly parallel ar...
In chip-multiprocessors (CMPs), the number of cores and the issue width of each core presents an important design trade-off to balance the amount of TLP and ILP between multi-thre...
nt, user-defined objects present an attractive abstraction for working with non-volatile program state. However, the slow speed of persistent storage (i.e., disk) has restricted ...
Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laur...
Practical Recurrent Learning (PRL) has been proposed as a simple learning algorithm for recurrent neural networks[1][2]. This algorithm enables learning with practical order O(n2 )...
Hardware prefetching is a simple and effective technique for hiding cache miss latency and thus improving the overall performance. However, it comes with addition of prefetch buff...