Sciweavers

482 search results - page 28 / 97
» A large-scale study of failures in high-performance computin...
IPPS
2006
IEEE
15 years 9 months ago
A distributed paging RAM grid system for wide-area memory sharing
Memory-intensive applications often suffer from the poor performance of disk swapping when memory is inadequate. Remote memory sharing schemes, which provide a remote memory that ...
Rui Chu, Nong Xiao, Yongzhen Zhuang, Yunhao Liu, X...
IPPS
2008
IEEE
15 years 10 months ago
Enhancing application robustness through adaptive fault tolerance
As the scale of high performance computing (HPC) continues to grow, application fault resilience becomes crucial. To address this problem, we are working on the design of an adapt...
Zhiling Lan, Yawei Li, Ziming Zheng, Prashasta Guj...
SAC
2006
ACM
15 years 3 months ago
Combining supervised and unsupervised monitoring for fault detection in distributed computing systems
Fast and accurate fault detection is becoming an essential component of management software for mission-critical systems. A good fault detector makes it possible to initiate repair a...
Haifeng Chen, Guofei Jiang, Cristian Ungureanu, Ke...
GI
2003
Springer
15 years 8 months ago
Policy Based Management for Critical Infrastructure Protection
Our current societies are fully dependent on large complex critical infrastructures (LCCIs). These LCCIs are large scale distributed systems that are highly interdependent, both ...
Gwendal Le Grand, Franck Springinsfeld, Michel Rig...
IPPS
1998
IEEE
15 years 7 months ago
Measuring the Vulnerability of Interconnection Networks in Embedded Systems
Studies of the fault-tolerance of graphs have tended to largely concentrate on classical graph connectivity. This measure is very basic, and conveys very little information for des...
Vijay Lakamraju, Zahava Koren, Israel Koren, C. Ma...