Parallel and distributed computing infrastructures are increasingly being embraced in the context of manufacturing applications, including real-time scheduling. In this paper, we pr...
This paper develops some control structures suitable for composing fault-tolerant distributed applications using atomic actions (atomic transactions) as building blocks, and then...
Record and Replay (RR) is a software-based state replication solution designed to support recording and subsequent replay of the execution of unmodified applications running on mu...
Philippe Bergheaud, Dinesh Subhraveti, Marc Vertes
Large scale compute clusters continue to grow to ever-increasing proportions. However, as clusters and applications continue to grow, the Mean Time Between Failures (MTBF) has redu...
For hypercube networks which have faulty nodes, a few efficient dynamic routing algorithms have been proposed by allowing each node to hold the status of neighbors. We propose two im...