Failures

Common root causes of intra data center network incidents at Facebook from 2011 to 2018

"A Large Scale Study of Data Center Network Reliability" by Justin Meza, Tianyin Xu, Kaushik Veeraraghavan, and Onur Mutlu categorizes the root causes of intra data center network incidents at Facebook from 2011 to 2018:

| Category          | Fraction | Description |
|-------------------|----------|-------------|
| Maintenance       | 17%      | Routine maintenance (for example, upgrading the software and firmware of network devices). |
| Hardware          | 13%      | Failing devices (for example, faulty memory modules, processors, and ports). |
| Misconfiguration  | 13%      | Incorrect or unintended configurations (for example, routing rules blocking production traffic). |
| Bug               | 12%      | Logical errors in network device software or firmware. |
| Accidents         | 11%      | Unintended actions (for example, disconnecting or power cycling the wrong network device). |
| Capacity planning | 5%       | High load due to insufficient capacity planning. |
| Undetermined      | 29%      | Inconclusive root cause. |
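
Because 29% of incidents have no determined root cause, it can help to look at the remaining categories as shares of the 71% with a known cause. The following is a minimal sketch of that arithmetic, not something taken from the paper: the `ROOT_CAUSES` dictionary and the `determined_share` helper are names made up for illustration, and since the published fractions are rounded percentages, the renormalized values are only approximate.

```python
# Fractions copied from the table above (percent of incidents).
# The dictionary name and the helper below are illustrative, not from the paper.
ROOT_CAUSES = {
    "Maintenance": 17,
    "Hardware": 13,
    "Misconfiguration": 13,
    "Bug": 12,
    "Accidents": 11,
    "Capacity planning": 5,
    "Undetermined": 29,
}


def determined_share(causes):
    """Return each category's share among incidents with a known root cause."""
    known = {name: pct for name, pct in causes.items() if name != "Undetermined"}
    total = sum(known.values())
    return {name: pct / total for name, pct in known.items()}


# Sanity check: the published fractions should account for every incident.
total = sum(ROOT_CAUSES.values())

shares = determined_share(ROOT_CAUSES)
# Print categories from largest to smallest share of determined incidents.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<18}{share:6.1%}")
```

Under this reading, maintenance accounts for 17 of the 71 percentage points with a determined cause, or roughly 24% of them.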

Two notes from the paper worth considering:

We use “failures” to refer to any network device misbehavior. The root cause of a failure includes not only hardware faults, but also misconfigurations, maintenance mistakes, firmware bugs, and other issues.

And:

We use Govindan et al.’s definition of root cause: “A failure event’s root-cause is one that, if it had not occurred, the failure event would not have manifested.”