Gretel: Lightweight Fault Localization for OpenStack


Ayush Goel*, Sukrit Kalra*, Mohan Dhawan. Conference on emerging Networking EXperiments and Technologies (CoNEXT), 2016


Like any other distributed system, cloud management stacks such as OpenStack are susceptible to faults whose root cause is often hard to diagnose and may take hours or days to fix. We present GRETEL, a system that leverages non-intrusive system monitoring to expedite root cause analysis of both operational and performance faults manifesting in OpenStack operations. GRETEL uses unique operational fingerprints to quickly identify faulty operations at runtime. GRETEL is accurate in its diagnosis: it achieves > 98% precision in identifying the faulty operation, with very few false positives and false negatives, even under stress. GRETEL is also lightweight and orders of magnitude faster than prior work, sustaining a throughput of ~77 Mbps.