This paper presents a two-part study on managing distributed NUCA (Non-Uniform Cache Architecture) L2 caches in a future manycore processor to obtain high single-thread program performance. The first part of our study is a limit study in which we determine data-to-cache-slice mappings at the memory page granularity, based on detailed inter-page conflict information derived from the program’s memory reference trace. By considering cache access latency and cache miss rate simultaneously when mapping data to L2 cache slices, this “oracle” scheme outperforms the conventional shared caching scheme by up to 208%, with an average of 45%, on a sixteen-core processor. In the second part of the study, we propose and evaluate a dynamic cache management scheme that determines the home cache slice and cache bin for memory pages without any static program information. The dynamic scheme outperforms the shared caching scheme by up to 191%, with an average of 32%, achieving much of the performance we observed in the limit study.
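To make the latency/miss-rate trade-off concrete, the following C sketch shows one way a page-to-slice cost model could be evaluated. It is only an illustration of the general idea, not the paper’s actual algorithm: the cost weights, per-slice latencies, and conflict-miss estimates (alpha, miss_penalty, hop_latency, conflict_miss) are hypothetical placeholders.

/* Illustrative sketch (assumed cost model, not the paper's formulation):
 * pick a home L2 cache slice for a memory page by minimizing a combined
 * cost that weighs on-chip access latency against the extra misses
 * expected from inter-page conflicts in the candidate slice. */
#include <stdio.h>

#define NUM_SLICES 16

/* Hypothetical per-page inputs, one entry per candidate slice. */
typedef struct {
    double hop_latency[NUM_SLICES];   /* cycles from the requesting core to each slice */
    double conflict_miss[NUM_SLICES]; /* estimated extra miss rate if the page maps there */
} page_profile_t;

/* Assumed cost model: cost = latency + alpha * miss_penalty * miss_rate. */
static int choose_home_slice(const page_profile_t *p,
                             double alpha, double miss_penalty)
{
    int best = 0;
    double best_cost = p->hop_latency[0] + alpha * miss_penalty * p->conflict_miss[0];
    for (int s = 1; s < NUM_SLICES; s++) {
        double cost = p->hop_latency[s] + alpha * miss_penalty * p->conflict_miss[s];
        if (cost < best_cost) {
            best_cost = cost;
            best = s;
        }
    }
    return best;
}

int main(void)
{
    page_profile_t page = { {0}, {0} };
    for (int s = 0; s < NUM_SLICES; s++) {
        page.hop_latency[s] = 10.0 + 2.0 * s;            /* placeholder latencies */
        page.conflict_miss[s] = 0.01 * (NUM_SLICES - s); /* placeholder miss estimates */
    }
    printf("home slice = %d\n", choose_home_slice(&page, 1.0, 300.0));
    return 0;
}

Under this kind of cost model, a slice that is physically farther from the requesting core can still be chosen as the home if mapping the page nearby would create enough inter-page conflicts to outweigh the latency advantage, which is the trade-off the oracle mapping exploits.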