Through massive parallelism, distributed systems enable the multiplication of productivity. Unfortunately, increasing the scale of available machines to users will also multiply debugging when failure occurs. Data mining allows the extraction of patterns within large amounts of data and therefore forms the foundation for a useful method of debugging, particularly within such distributed systems. This paper outlines a successful application of data mining in troubleshooting distributed systems, proposes a framework for further study, and speculates on other future work.
David A. Cieslak, Douglas Thain, Nitesh V. Chawla