
IPPS
2010
IEEE

Scalable failure recovery for high-performance data aggregation

Many high-performance tools, applications and infrastructures, such as Paradyn, STAT, TAU, Ganglia, SuperMon, Astrolabe, Borealis, and MRNet, use data aggregation to synthesize large data sets and reduce data volumes while retaining relevant information content. Hierarchical or tree-based overlay networks (TBONs) are often used to execute data aggregation operations in a scalable, piecewise fashion. In this paper, we present state compensation, a scalable failure recovery model for high-bandwidth, low-latency TBON computations. By leveraging inherently redundant state information found in many TBON computations, state compensation avoids explicit state replication (for example, process checkpoints and message logging) and incurs no overhead in the absence of failures. Further, when failures do occur, state compensation uses a weak data consistency model and localized protocols that allow processes to recover from failures independently and responsively. Based on a formal specification o...
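The piecewise aggregation described in the abstract can be illustrated with a small sketch. This is not the paper's state compensation protocol; it is a toy tree reduction, assuming an associative and commutative combine operator (integer addition) and internal nodes that hold no data of their own. The names `Node`, `aggregate`, and `drop_internal_node` are hypothetical and chosen only for this illustration. It shows one reason tree-based aggregation can tolerate the loss of an internal process: re-parenting the orphaned children yields the same final result.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    value: int = 0  # local input; internal nodes keep 0, the identity for addition
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node, combine: Callable[[int, int], int]) -> int:
    """Reduce the subtree rooted at `node`, combining child results level by level."""
    result = node.value
    for child in node.children:
        result = combine(result, aggregate(child, combine))
    return result


def drop_internal_node(root: Node, failed_name: str) -> None:
    """Simulate a failure (assumed model): remove the named node and
    attach its children directly to its former parent."""
    for child in list(root.children):  # iterate over a copy while mutating
        if child.name == failed_name:
            root.children.remove(child)
            root.children.extend(child.children)
        else:
            drop_internal_node(child, failed_name)


# A two-level tree: root -> {A, B} -> four leaves carrying the data.
leaves_a = [Node("a1", 3), Node("a2", 5)]
leaves_b = [Node("b1", 7), Node("b2", 11)]
root = Node("root", children=[Node("A", children=leaves_a),
                              Node("B", children=leaves_b)])

before = aggregate(root, lambda x, y: x + y)
drop_internal_node(root, "A")  # internal node A fails; a1 and a2 re-attach to root
after = aggregate(root, lambda x, y: x + y)
print(before, after)
```

Both aggregates come out equal because addition does not depend on how the inputs are grouped. A real TBON also has data in flight and filter state inside each process, which is the harder problem the abstract refers to and which this sketch does not model.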
Dorian C. Arnold, Barton P. Miller
Added 13 Feb 2011
Updated 13 Feb 2011
Type Journal
Year 2010
Where IPPS
Authors Dorian C. Arnold, Barton P. Miller