Sciweavers

1119 search results - page 15 / 224
» Computing in the Presence of Timing Failures
PODC
2012
ACM
13 years 8 months ago
Asynchronous failure detectors
Failure detectors, oracles that provide information about process crashes, are an important abstraction for crash tolerance in distributed systems. Although current failure-detector...
Alejandro Cornejo, Nancy A. Lynch, Srikanth Sastry
INFOCOM
2010
IEEE
15 years 4 months ago
Network Coding Tomography for Network Failures
Network Tomography (or network monitoring) uses end-to-end path-level measurements to characterize the network, such as topology estimation and failure detection. This work prov...
Hongyi Yao, Sidharth Jaggi, Minghua Chen
ICPP
2007
IEEE
16 years 1 day ago
A Meta-Learning Failure Predictor for Blue Gene/L Systems
The demand for more computational power in science and engineering has spurred the design and deployment of ever-growing cluster systems. Even though the individual components use...
Prashasta Gujrati, Yawei Li, Zhiling Lan, Rajeev T...
CCGRID
2010
IEEE
15 years 6 months ago
Selective Recovery from Failures in a Task Parallel Programming Model
We present a fault-tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tr...
James Dinan, Arjun Singri, P. Sadayappan, Sriram K...
ICPPW
2009
IEEE
16 years 11 days ago
Decentralized Load Balancing for Improving Reliability in Heterogeneous Distributed Systems
A probabilistic analytical framework for decentralized load balancing (LB) strategies for heterogeneous distributed-computing systems (DCSs) is presented with the overal...
Jorge E. Pezoa, Sagar Dhakal, Majeed M. Hayat