Cache oblivious parallelograms in iterative stencil computations


Download www.mpi-inf.mpg.de

We present a new cache oblivious scheme for iterative stencil computations that performs beyond system bandwidth limitations, as though gigabytes of data could reside in an enormous on-chip cache. We compare execution times for 2D and 3D spatial domains with up to 128 million double-precision elements, for constant and variable stencils, against hand-optimized naive code and against PluTo, the automatic polyhedral parallelizer and locality optimizer, and demonstrate the clear superiority of our results. The performance benefits stem from a tiling structure that caters for data locality, parallelism, and vectorization simultaneously. Rather than tiling the iteration space from inside, we take an exterior approach with a predefined hierarchy, simple regular parallelogram tiles, and a locality-preserving parallelization. These advantages come at the cost of an irregular workload distribution, but a tightly integrated load balancer ensures high utilization of all resources.
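The abstract does not spell out the tiling itself. As a rough illustration of what parallelogram-shaped tiles in space-time mean for an iterative stencil, the sketch below compares a naive sweep with classic time skewing on a 1D three-point averaging stencil. This is not the authors' hierarchical cache oblivious scheme, and it includes no parallelization or load balancing. The function names `naive_stencil` and `skewed_stencil`, the parameter `tile_width`, and the choice of stencil are all assumptions made for illustration.

```python
def naive_stencil(u0, steps):
    """Sweep the whole domain once per time step (two ping-pong buffers)."""
    cur, nxt = list(u0), list(u0)
    n = len(cur)
    for _ in range(steps):
        for i in range(1, n - 1):  # boundary cells stay fixed
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur, nxt = nxt, cur
    return cur


def skewed_stencil(u0, steps, tile_width):
    """Same computation, traversed in parallelogram tiles (time skewing).

    With the skewed index j = i + t, every dependency of point (t, j)
    lies at time t - 1 and at skewed index j - 2, j - 1, or j. Tiles of
    fixed width in j can therefore be processed left to right, each one
    running through all time steps before the next tile starts.
    """
    n = len(u0)
    buf = [list(u0), list(u0)]  # buf[t % 2] holds values of time step t
    j_first, j_last = 2, (n - 2) + steps  # range of j over interior points
    for j0 in range(j_first, j_last + 1, tile_width):
        j1 = min(j0 + tile_width, j_last + 1)  # exclusive tile end
        for t in range(1, steps + 1):
            src, dst = buf[(t - 1) % 2], buf[t % 2]
            # Map the tile's j-range back to spatial indices i = j - t
            # and clip to the interior of the domain.
            lo = max(j0 - t, 1)
            hi = min(j1 - t, n - 1)
            for i in range(lo, hi):
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return buf[steps % 2]
```

Each tile slides one cell to the left per time step, so in a space-time diagram it forms a parallelogram. A tile touches roughly `tile_width + steps` cells, which suggests why such tiles can keep their working set in cache when the time range is also blocked; the sketch leaves that blocking out for brevity.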

Robert Strzodka, Mohammed Shaheen, Dawid Pajak, Hans-Peter Seidel

Cache Oblivious Scheme | Enormous On-chip Cache | ICS 2010 | Locality Optimizer Pluto |

Post Info

Added	19 Jul 2010
Updated	19 Jul 2010
Type	Conference
Year	2010
Where	ICS
Authors	Robert Strzodka, Mohammed Shaheen, Dawid Pajak, Hans-Peter Seidel
