Search Sciweavers | Sciweavers

DSN
2006
IEEE

A large-scale study of failures in high-performance computing systems

16 years 22 days ago

Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly avai...

Bianca Schroeder, Garth A. Gibson

CORR
2006
Springer

Exact Failure Frequency Calculations for Extended Systems

15 years 6 months ago

Download hal.archives-ouvertes.fr

This paper shows how the steady-state availability and failure frequency can be calculated in a single pass for very large systems, when the availability is expressed as a product...

Annie Druault-Vicard, Christian Tanguy

ICPP
2007
IEEE

A Meta-Learning Failure Predictor for Blue Gene/L Systems

16 years 29 days ago

Download www.mcs.anl.gov

The demand for more computational power in science and engineering has spurred the design and deployment of ever-growing cluster systems. Even though the individual components use...

Prashasta Gujrati, Yawei Li, Zhiling Lan, Rajeev T...

EDCC
2005
Springer

Failure Detection with Booting in Partially Synchronous Systems

16 years 6 days ago

Download www-rocq.inria.fr

Unreliable failure detectors are a well-known means of enriching asynchronous distributed systems with time-free semantics that allow consensus to be solved in the presence of crash failu...

Josef Widder, Gérard Le Lann, Ulrich Schmid

PODC
2009
ACM

The weakest failure detector for solving k-set agreement

15 years 11 months ago

Download www.net.t-labs.tu-berlin.de

A failure detector is a distributed oracle that provides processes in a distributed system with hints about failures. The notion of a weakest failure detector captures the exact a...

Eli Gafni, Petr Kuznetsov
