Debugging Grids with Machine Learning Techniques

PIs: Nitesh Chawla, Xiaohui Song, Shaowen Wang, and Douglas Thain. This work is supported by the National Science Foundation under grant CNS-07-20813.

Debugging distributed systems is notoriously hard. Not only do distributed systems fail frequently, they fail strangely. System failure can arise from physical failures, from network outages, from misconfiguration, or simply from program inputs. Not only is distinguishing between these cases difficult, but users are often unable to even extract the necessary information from the system. The state of the art in distributed debugging is to use ssh to log into a node and use grep to search server logs.

Of course, we have good tools such as gdb and purify for debugging standalone programs, but these models of debugging do not apply to distributed systems. We cannot simply stop processes in a distributed system in order to examine their memory contents: we may not have permission to do so, and the process in question may not even be accessible at the moment. New models of debugging are required.

Data mining techniques can be applied to the problem of large-scale troubleshooting. Computing grids expose a large amount of structured information about both jobs and the resources that they consume. If these items are well described, then classification algorithms can be used to find properties of jobs and resources that correlate with success or failure. An ideal troubleshooter would report to the user something like: "Your jobs always fail on Linux 2.8 machines, always fail on cluster X between midnight and 6 A.M., and fail with 50 percent probability on machines owned by user Y." Further, these discoveries may be used to automatically avoid bad placement decisions that waste time and resources.
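To make the idea concrete, here is a minimal sketch in Python of the simplest form of this analysis: for each attribute value seen in a set of job records, compute how often jobs with that value failed, and list the worst offenders first. This is an illustration only, not the method used by the prototype. The record fields (`os`, `cluster`, `owner`, `failed`), the sample data, and the `failure_rates` helper are all hypothetical, and a real troubleshooter would more likely use a proper classifier such as a decision tree or rule learner.

```python
from collections import defaultdict

# Hypothetical job records: each one joins a job's outcome with attributes
# of the machine it ran on. All field names and values are invented.
JOBS = [
    {"os": "A", "cluster": "X", "owner": "Y", "failed": True},
    {"os": "A", "cluster": "X", "owner": "Z", "failed": True},
    {"os": "A", "cluster": "W", "owner": "Y", "failed": True},
    {"os": "B", "cluster": "X", "owner": "Y", "failed": False},
    {"os": "B", "cluster": "W", "owner": "Z", "failed": False},
    {"os": "B", "cluster": "W", "owner": "Y", "failed": True},
    {"os": "B", "cluster": "X", "owner": "Z", "failed": False},
    {"os": "B", "cluster": "W", "owner": "Z", "failed": False},
]


def failure_rates(jobs, attributes, min_support=2):
    """Return (attribute, value, failure_rate, job_count) tuples.

    Results are sorted by failure rate (highest first), then by job count.
    Values seen in fewer than min_support jobs are dropped, since a rate
    computed from one or two jobs is not meaningful evidence.
    """
    counts = defaultdict(lambda: [0, 0])  # (attr, value) -> [failures, total]
    for job in jobs:
        for attr in attributes:
            entry = counts[(attr, job[attr])]
            entry[0] += 1 if job["failed"] else 0
            entry[1] += 1
    report = [
        (attr, value, fails / total, total)
        for (attr, value), (fails, total) in counts.items()
        if total >= min_support
    ]
    report.sort(key=lambda row: (-row[2], -row[3]))
    return report


if __name__ == "__main__":
    for attr, value, rate, total in failure_rates(JOBS, ["os", "cluster", "owner"]):
        print(f"{attr}={value}: {rate:.0%} of {total} jobs failed")
```

With this sample data, the top line of the report says that every job on `os=A` machines failed, which is the kind of statement the ideal troubleshooter above would make. The same ranking could feed a placement policy that steers jobs away from attribute values with high failure rates.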

We have implemented a prototype of such a debugging system. It examines log data produced by the Condor distributed batch system and has diagnosed several previously unknown problems on the TeraGrid and the Northwest Indiana Computational Grid.

Publications

David Cieslak, Nitesh Chawla, and Douglas Thain,
Troubleshooting Thousands of Jobs on Production Grids Using Data Mining Techniques,
IEEE Grid Computing, pages 217-224, August, 2008. DOI: 10.1109/GRID.2008.4662802

David Cieslak, Douglas Thain, Nitesh Chawla,
Short Paper: Troubleshooting Distributed Systems via Data Mining,
IEEE Symposium on High Performance Distributed Computing (HPDC), pages 309-312, June, 2006.