Digging deeper into cluster system logs for failure prediction and root cause diagnosis
Paper in proceedings, 2014
© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become normal. System logs play a critical role in the increasingly complex tasks of automatic failure prediction and diagnosis. Many methods for failure prediction are based on analyzing event logs for large scale systems, but there is still neither a widely used one to predict failures based on both non-fatal and fatal events, nor a precise one that uses fine-grained information (such as failure type, node location, related application, and time of occurrence). A deeper and more precise log analysis technique is needed. We propose a three-step approach to draw out event dependencies and to identify failure-event generating processes. First, we cluster frequent event sequences into event groups based on common events. Then we infer causal dependencies between events in each event group. Finally, we extract failure rules based on the observation that events of the same event types, on the same nodes or from the same applications have similar operational behaviors. We use this rich information to improve failure prediction. Our approach semi-automates diagnosing the root causes of failure events, making it a valuable tool for system administrators.
root cause diagnosis
large-scale cluster systems
event causal dependency inference