Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality

We present sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve, for large-scale backup (e.g., hundreds of terabytes), the chunk-lookup disk bottleneck problem that inline, chunk-based deduplication schemes face. The problem is that these schemes traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored. Unfortunately, at scale it is impractical to keep such an index in RAM, and a disk-based index with one seek per incoming chunk is far too slow. We perform stream deduplication by breaking up an incoming stream into relatively large segments and deduplicating each segment against only a few of the most similar previous segments. To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the ...
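
The idea in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation: the class name `SparseIndexStore`, the sampling rule (keep a chunk hash when its low bits are zero), the simple vote count used to rank similar segments, and the parameters `sample_mask` and `max_champions` are all assumptions made for illustration. A real system would keep segment manifests on disk and only the sparse index in RAM.

```python
import hashlib
from collections import Counter


def chunk_hash(chunk: bytes) -> bytes:
    """Identify a chunk by its SHA-1 digest."""
    return hashlib.sha1(chunk).digest()


class SparseIndexStore:
    """Toy model of segment-level deduplication with a sparse index (hypothetical)."""

    def __init__(self, sample_mask: int = 0x3, max_champions: int = 2):
        self.sample_mask = sample_mask      # assumption: sample when (hash & mask) == 0
        self.max_champions = max_champions  # assumption: segments compared per lookup
        self.sparse_index = {}              # sampled hash -> ids of segments containing it
        self.manifests = {}                 # segment id -> set of chunk hashes
        self.chunks_written = 0             # chunks that were not recognized as duplicates

    def is_sample(self, h: bytes) -> bool:
        return (int.from_bytes(h[-4:], "big") & self.sample_mask) == 0

    def store_segment(self, chunks) -> int:
        """Deduplicate one segment; return how many of its chunks had to be written."""
        hashes = [chunk_hash(c) for c in chunks]
        samples = {h for h in hashes if self.is_sample(h)}

        # Rank earlier segments by how many sampled hashes they share with this one.
        votes = Counter()
        for s in samples:
            for seg_id in self.sparse_index.get(s, ()):
                votes[seg_id] += 1
        champions = [seg_id for seg_id, _ in votes.most_common(self.max_champions)]

        # Deduplicate only against the chunks listed in the chosen segments.
        known = set()
        for seg_id in champions:
            known |= self.manifests[seg_id]

        written = 0
        for h in hashes:
            if h not in known:
                known.add(h)  # also catches repeats inside this segment
                written += 1
        self.chunks_written += written

        # Record the new segment and add its samples to the sparse index.
        new_id = len(self.manifests)
        self.manifests[new_id] = set(hashes)
        for s in samples:
            self.sparse_index.setdefault(s, []).append(new_id)
        return written
```

A short usage example under the same assumptions: storing a segment, then storing a segment that shares half its chunks with the first.

```python
store = SparseIndexStore(sample_mask=0)  # mask 0 samples every chunk, for a predictable demo
first = [f"chunk-{i}".encode() for i in range(10)]
second = first[:5] + [f"fresh-{i}".encode() for i in range(5)]
print(store.store_segment(first), store.store_segment(second))
```

Because only the chosen segments are consulted, a duplicate chunk that lives in a segment that was not selected is stored again. The sketch accepts that loss in exchange for never consulting a full per-chunk index.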

Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat,

Chunk Index | Chunk-lookup Disk Bottleneck | FAST 2009 | Operating System | Sparse

Added	17 Feb 2011
Updated	17 Feb 2011
Type	Conference
Year	2009
Where	FAST
Authors	Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, Peter Camble
