Sciweavers

66 search results - page 7 / 14
» The Checkpoint Problem
IPPS
2008
IEEE
14 years 2 months ago
Enhancing application robustness through adaptive fault tolerance
As the scale of high performance computing (HPC) continues to grow, application fault resilience becomes crucial. To address this problem, we are working on the design of an adapt...
Zhiling Lan, Yawei Li, Ziming Zheng, Prashasta Guj...
GI
2004
Springer
14 years 1 months ago
Crash Management for Distributed Parallel Systems
With the growing complexity of parallel architectures, the probability of system failures grows, too. One approach to cope with this problem is the self-healing, one of the organ...
Jan Haase, Frank Eschmann
ISCC
2002
IEEE
14 years 20 days ago
Session level rollback recovery
The problem of rollback recovery is traditionally approached using a model oriented to packet delivery. Instead, we introduce a model centered around complex sessions, and we expl...
Augusto Ciuffoletti
CORR
2010
Springer
13 years 7 months ago
A Multi-agent Framework for Performance Tuning in Distributed Environment
This paper presents the overall design of a multi-agent framework for improving the performance of an application executing in a distributed environment. The multi-agent framewor...
Sarbani Roy, Saikat Halder, Nandini Mukherjee
ICDE
2007
IEEE
14 years 9 months ago
A Cooperative, Self-Configuring High-Availability Solution for Stream Processing
We present a collaborative, self-configuring high availability (HA) approach for stream processing that enables low-latency failure recovery while incurring small run-time overhea...
Jeong-Hyon Hwang, Ying Xing, Ugur Çetinteme...