Improving automated fault injection

Automated failure analysis is hard, manual failure analysis requires great expertise.

Consider:

An ideal top-down solution for ensuring that distributed data management systems operate correctly under fault would enrich the fault injection methodology with the best features of formal component verification. In addition to identifying bugs, a principled fault injector should provide assurances. The analysis should be sound: any generated counterexamples should correspond to meaningful fault tolerance bugs. When possible, it should also be complete: when analysis completes without finding counterexamples for a particular input and execution bound, it should guarantee that no bugs exist for that configuration, even if the space of possible executions is enormous.

Authors Peter Alvaro, Joshua Rosen, and Joseph M. Hellerstein in their 2015 paper, which I was happy to learn about at o11ycon.io. They propose:

Rather than generating faults at random (or using application-specific heuristics), a lineage-driven fault injector chooses only those failures that could have affected a known good outcome, exercising fault-tolerance code at increasing levels of complexity. Injecting faults in this targeted way allows LDFI to provide completeness guarantees like those achievable with formal methods such as model checking [39, 63,85,86], which have typically been used to verify small protocols in a bottom-up fashion. When bugs are encountered, LDFI’s topdown approach provides—in addition to a counterexample trace— fine-grained data lineage visualizations to help programmers understand the root cause of the bad outcome and consider possible remediation strategies.

The holy grail, of course, is automated Jepsen testing (but nobody can automate Kyle Kingsbury).