Column-Oriented Storage Techniques for MapReduce


Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs. We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can...
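To make the contrast in the abstract concrete, the following is a minimal sketch of row-oriented text storage versus column-oriented binary storage. It is an illustration only, not the format introduced in the paper and not a Hadoop API: the sample records, the field names (user_id, url, latency_ms), and all function names are hypothetical, and the sketch ignores replication, scheduling, and complex column types.

```python
import struct

# Hypothetical sample records: (user_id, url, latency_ms).
ROWS = [
    (1, "a.example/index", 120),
    (2, "b.example/home", 45),
    (3, "c.example/cart", 300),
]


def encode_text(rows):
    """Row-oriented text: one tab-separated line per record."""
    lines = (f"{uid}\t{url}\t{lat}\n" for uid, url, lat in rows)
    return "".join(lines).encode("utf-8")


def sum_latency_text(blob):
    """Every line is split and the latency parsed from text, even though
    only one field is needed."""
    total = 0
    for line in blob.decode("utf-8").splitlines():
        _uid, _url, lat = line.split("\t")
        total += int(lat)
    return total


def encode_columns(rows):
    """Column-oriented binary: each column is stored contiguously in its
    own buffer (little-endian 32-bit ints, length-prefixed strings)."""
    n = len(rows)
    encoded_urls = [r[1].encode("utf-8") for r in rows]
    return {
        "user_id": struct.pack(f"<{n}i", *(r[0] for r in rows)),
        "url": b"".join(struct.pack("<H", len(u)) + u for u in encoded_urls),
        "latency_ms": struct.pack(f"<{n}i", *(r[2] for r in rows)),
    }


def sum_latency_columns(columns):
    """Only the latency buffer is read; the url column is never touched."""
    buf = columns["latency_ms"]
    return sum(struct.unpack(f"<{len(buf) // 4}i", buf))


if __name__ == "__main__":
    print(sum_latency_text(encode_text(ROWS)))
    print(sum_latency_columns(encode_columns(ROWS)))
```

In this sketch, a job that aggregates a single field decodes one fixed-width buffer in the columnar layout, whereas the text layout forces it to scan and parse every field of every record.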

Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, Sandeep Tata

Column-oriented Storage | Computer Networks | Hadoop | PVLDB 2011 | Storage Format |

Added	14 May 2011
Updated	14 May 2011
Type	Journal
Year	2011
Where	PVLDB
Authors	Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, Sandeep Tata

Sciweavers
