What Supercomputers Say: A Study of Five System Logs

If we hope to automatically detect and diagnose failures in large-scale computer systems, we must study real deployed systems and the data they generate. Progress has been hampered by the inaccessibility of empirical data. This paper addresses that dearth by examining system logs from five supercomputers, with the aim of providing useful insight and direction for future research into the use of such logs. We present details about the systems, methods of log collection, and how alerts were identified; propose a simpler and more effective filtering algorithm; and define operational context to encompass the crucial information that we found to be currently missing from most logs. The machines we consider (and the number of processors) are: Blue Gene/L (131072), Red Storm (10880), Thunderbird (9024), Spirit (1028), and Liberty (512). This is the first study of raw system logs from multiple supercomputers.

Adam J. Oliner, Jon Stearley

Computer Networks | Define Operational Context | DSN 2007 | Effective filtering Algorithm | Real Deployed Systems |

Added	02 Jun 2010
Updated	02 Jun 2010
Type	Conference
Year	2007
Where	DSN
Authors	Adam J. Oliner, Jon Stearley

Sciweavers
