Issues in applying data mining to grid job failure detection and diagnosis

16 years 1 month ago


As grid computation systems become larger and more complex, manually diagnosing failures in jobs becomes impractical. Recently, machine-learning techniques have been proposed to detect a variety of application failures in grids. While this is a promising approach, there are many options as to how to apply machine learning to this problem, and it is not always obvious which approaches are feasible or effective. We explore some issues that arise when we try to apply existing implementations of data mining algorithms to diagnose as well as predict job failures in grids. We demonstrate that a) it is feasible to gather enough data in real time to train useful classifier algorithms, using only a small fraction of the grid’s computational resources, b) it is important to choose the features used for classification with care, and c) it is useful to have both per-user and system-wide classifiers, as they diagnose different kinds of problems. We illustrate all these issues using a prototype sy...

Lakshmikant Shrinivas, Jeffrey F. Naughton
