Several recent studies identify the memory system as the most frequent source of hardware failures in commercial servers. Techniques to protect the memory system from failures mus...
Jangwoo Kim, Jared C. Smolens, Babak Falsafi, Jame...
Fault-tolerance techniques based on checkpointing and message logging have been increasingly used in real-world applications to reduce service down-time. Most industrial applicati...
Border Gateway Protocol (BGP) is the standard routing protocol used in the Internet for routing packets between the Autonomous Systems (ASes). It is known that BGP can take hundre...
RAID architectures have been used for more than two decades to recover data upon disk failures. Disk failure is just one of the many causes of damaged data. Data can be damaged by...
As software Distributed Shared Memory(DSM) systems become attractive on larger clusters, the focus of attention moves toward improving the reliability of systems. In this paper, w...