Failure detectors — oracles that provide information about process crashes — are an important abstraction for crash tolerance in distributed systems. Although current failure-detector...
Alejandro Cornejo, Nancy A. Lynch, Srikanth Sastry
Network Tomography (or network monitoring) uses end-to-end path-level measurements to characterize the network, such as topology estimation and failure detection. This work prov...
The demand for more computational power in science and engineering has spurred the design and deployment of ever-growing cluster systems. Even though the individual components use...
We present a fault-tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tr...
James Dinan, Arjun Singri, P. Sadayappan, Sriram K...
A probabilistic analytical framework for decentralized load balancing (LB) strategies for heterogeneous distributed-computing systems (DCSs) is presented with the overal...