Distributed aggregation for data-parallel computing: interfaces and implementations

15 years 11 months ago

Download www.sigops.org

Data-intensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require nonstandard aggregations that are more sophisticated than traditional built-in database functions such as Sum and Max. As a result, the ease of programming user-defined aggregations, and the efficiency of their implementation, is of great current interest. This paper evaluates the interfaces and implementations for user-defined aggregation in several state of the art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that: the degree of language integration between user-defined functions and the high-level query language has an impact on code legibility and simplicity; the choice of programming interface...

Yuan Yu, Pradeep Kumar Gunda, Michael Isard
