— Various mechanisms for fault-tolerance (FT) are used today in order to reduce the impact of failures on application execution. In the case of system failure, standard FT mechan...
The execution of an application on a high performance system requires parameters concerning the problem in hand, and those that determine the system mapping, to be specified by a ...
Darren J. Kerbyson, Efstathios Papaefstathiou, Gra...
NAMD† is a portable parallel application for biomolecular simulations. NAMD pioneered the use of hybrid spatial and force decomposition, a technique now used by most scalable pr...
Abhinav Bhatele, Sameer Kumar, Chao Mei, James C. ...
Despite using high-speed network interconnection systems like InfiniBand, the communication overhead for parallel applications is still high. In this paper we show, how such costs...
Robert Rex, Frank Mietke, Wolfgang Rehm, Christoph...
—Accurate fault detection is a key element of resilient computing. Syslogs provide key information regarding faults, and are found on nearly all computing systems. Discovering ne...