Sciweavers

» Fault-Tolerant Distributed Simulation
ICPP
2007
IEEE
14 years 3 months ago
A Meta-Learning Failure Predictor for Blue Gene/L Systems
The demand for more computational power in science and engineering has spurred the design and deployment of ever-growing cluster systems. Even though the individual components use...
Prashasta Gujrati, Yawei Li, Zhiling Lan, Rajeev T...
ICPP
2007
IEEE
14 years 3 months ago
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience
The productivity of HPC systems is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing...
Yawei Li, Prashasta Gujrati, Zhiling Lan, Xian-He ...
ICPP
2007
IEEE
14 years 3 months ago
Mercury: Combining Performance with Dependability Using Self-virtualization
There has recently been increasing interest in using system virtualization to improve the dependability of HPC cluster systems. However, it is not cost-free and may come with som...
Haibo Chen, Rong Chen, Fengzhe Zhang, Binyu Zang, ...
IPPS
2007
IEEE
14 years 3 months ago
RI2N/UDP: High bandwidth and fault-tolerant network for a PC-cluster based on multi-link Ethernet
PC-clusters with high performance/cost ratio have been one of the typical platforms for high performance computing. To lower costs, Gigabit Ethernet is often used for intercommuni...
Takayuki Okamoto, Shin'ichi Miura, Taisuke Boku, M...
SRDS
2007
IEEE
14 years 3 months ago
The Fail-Heterogeneous Architectural Model
Fault-tolerant distributed protocols typically utilize a homogeneous fault model, either fail-crash or fail-Byzantine, where all processors are assumed to fail in the same manner....
Marco Serafini, Neeraj Suri