Sciweavers

81 search results - page 15 / 17
» Challenging the Mean Time to Failure: Measuring Dependabilit...
CCGRID
2006
IEEE
14 years 1 month ago
Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
With the increasing number of processors in modern HPC (High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
Yuan Tang, Graham E. Fagg, Jack Dongarra
ICPP
2009
IEEE
14 years 2 months ago
Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems
Clusters and applications continue to grow in size while their mean time between failure (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for lar...
Xiangyong Ouyang, Karthik Gopalakrishnan, Dhabales...
CLUSTER
2004
IEEE
13 years 11 months ago
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
As high performance clusters continue to grow in size, the mean time between failure shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challengi...
Gengbin Zheng, Lixia Shi, Laxmikant V. Kalé
INFOCOM
2005
IEEE
14 years 1 month ago
Topology aware overlay networks
Recently, overlay networks have emerged as a means to enhance end-to-end application performance and availability. Overlay networks attempt to leverage the inherent redundancy ...
Junghee Han, David Watson, Farnam Jahanian
MAGS
2010
13 years 6 months ago
Towards reliable multi-agent systems: An adaptive replication mechanism
Distributed cooperative applications (e.g., e-commerce) are now increasingly being designed as a set of autonomous entities, named agents, which interact and coordinate (...
Zahia Guessoum, Jean-Pierre Briot, Nora Faci, Oliv...