The paper presents and evaluates Nysiad, a system that implements a new technique for transforming a scalable distributed system or network protocol tolerant only of crash failur...
Chi Ho, Robbert van Renesse, Mark Bickford, Danny ...
We develop a machine-learned similarity metric for Windows failure reports using telemetry data gathered from clients describing the failures. The key feature is a tuned callstack...
Kevin Bartz, Jack W. Stokes, John C. Platt, Ryan K...
In a distributed system, replication of components, such as objects, is a well-known way of achieving availability. For increased availability, crashed and disconnected components...
Proactive handling of faults requires that the risk of upcoming failures be continuously assessed. One of the promising approaches is online failure prediction, which means that...