PVLDB
2011

CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop

Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its inability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoHadoop retains the flexibility of Hadoop in that it does not require users to convert their data to a certain format (e.g., a relational database or a specific file format). Instead, applications give hints to CoHadoop that a set of files is related and may be processed jointly; CoHadoop then tries to colocate these files for improved efficiency. Our approach is designed so that the strong fault tolerance properties of Hadoop are retained. Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessi...
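The hint-based colocation idea in the abstract can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not CoHadoop's implementation or API: the class name ColocatingPlacer, its methods, the group hint parameter, and the least-loaded node-selection rule are all hypothetical and chosen only for illustration. Files that share a hint are assigned to the same set of distinct nodes, so replication (and with it fault tolerance) is kept while related files end up together.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class ColocatingPlacer:
    """Toy placement policy (hypothetical; not CoHadoop's actual API)."""

    def __init__(self, nodes: List[str], replication: int = 3) -> None:
        if replication > len(nodes):
            raise ValueError("replication factor exceeds number of nodes")
        self.nodes = list(nodes)
        self.replication = replication
        self.load: Dict[str, int] = defaultdict(int)        # files per node
        self.group_nodes: Dict[str, Tuple[str, ...]] = {}   # hint -> node set
        self.placements: Dict[str, Tuple[str, ...]] = {}    # file -> node set

    def _least_loaded(self) -> Tuple[str, ...]:
        # Assumed default policy: pick the least-loaded distinct nodes,
        # breaking ties by node name so the result is deterministic.
        ranked = sorted(self.nodes, key=lambda n: (self.load[n], n))
        return tuple(ranked[: self.replication])

    def place(self, filename: str, group: Optional[str] = None) -> Tuple[str, ...]:
        """Choose replica nodes for a file; `group` is the colocation hint."""
        if group is None:
            # No hint given: fall back to the default policy.
            targets = self._least_loaded()
        elif group in self.group_nodes:
            # Hint seen before: reuse the node set of the related files.
            targets = self.group_nodes[group]
        else:
            # First file with this hint: pick a node set and remember it.
            targets = self._least_loaded()
            self.group_nodes[group] = targets
        for node in targets:
            self.load[node] += 1
        self.placements[filename] = targets
        return targets
```

In this sketch, two files placed with the same hint (for example, a log partition and its matching reference partition) receive identical replica node sets, so a join over them could read both locally. A file placed without a hint simply goes to whichever nodes are currently least loaded.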
Added 17 Sep 2011
Updated 17 Sep 2011
Type Journal
Year 2011
Where PVLDB
Authors Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan, Rainer Gemulla, Aljoscha Krettek, John McPherson