Sciweavers

1119 search results - page 15 / 224
» Computing in the Presence of Timing Failures
PODC
2012
ACM
11 years 10 months ago
Asynchronous failure detectors
Failure detectors (oracles that provide information about process crashes) are an important abstraction for crash tolerance in distributed systems. Although current failure-detector...
Alejandro Cornejo, Nancy A. Lynch, Srikanth Sastry
INFOCOM
2010
IEEE
13 years 6 months ago
Network Coding Tomography for Network Failures
Network Tomography (or network monitoring) uses end-to-end path-level measurements to characterize the network, such as topology estimation and failure detection. This work prov...
Hongyi Yao, Sidharth Jaggi, Minghua Chen
ICPP
2007
IEEE
14 years 1 month ago
A Meta-Learning Failure Predictor for Blue Gene/L Systems
The demand for more computational power in science and engineering has spurred the design and deployment of ever-growing cluster systems. Even though the individual components use...
Prashasta Gujrati, Yawei Li, Zhiling Lan, Rajeev T...
CCGRID
2010
IEEE
13 years 8 months ago
Selective Recovery from Failures in a Task Parallel Programming Model
We present a fault tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tr...
James Dinan, Arjun Singri, P. Sadayappan, Sriram K...
ICPPW
2009
IEEE
14 years 2 months ago
Decentralized Load Balancing for Improving Reliability in Heterogeneous Distributed Systems
A probabilistic analytical framework for decentralized load balancing (LB) strategies for heterogeneous distributed-computing systems (DCSs) is presented with the overal...
Jorge E. Pezoa, Sagar Dhakal, Majeed M. Hayat