As the scale of high-performance computing (HPC) continues to grow, application fault resilience becomes crucial. To address this problem, we are working on the design of an adapt...
With the growing complexity of parallel architectures, the probability of system failures grows as well. One approach to coping with this problem is self-healing, one of the organ...
The problem of rollback recovery is traditionally approached using a model oriented to packet delivery. Instead, we introduce a model centered around complex sessions, and we expl...
This paper presents the overall design of a multi-agent framework for improving the performance of an application executing in a distributed environment. The multi-agent framewor...
We present a collaborative, self-configuring high availability (HA) approach for stream processing that enables low-latency failure recovery while incurring small run-time overhea...