failures

Common root causes of intra data center network incidents at Facebook from 2011 to 2018

From “A Large Scale Study of Data Center Network Reliability” by Justin Meza, Tianyin Xu, Kaushik Veeraraghavan, and Onur Mutlu, the categorized root causes of intra data center incidents at Facebook from 2011 to 2018:

| Category | Fraction | Description |
|---|---|---|
| Maintenance | 17% | Routine maintenance (for example, upgrading the software and firmware of network devices). |
| Hardware | 13% | Failing devices (for example, faulty memory modules, processors, and ports). |
| Misconfiguration | 13% | Incorrect or unintended configurations (for example, routing rules blocking production traffic). |
| Bug | 12% | Logical errors in network device software or firmware. |
| Accidents | 11% | Unintended actions (for example, disconnecting or power cycling the wrong network device). |
| Capacity planning | 5% | High load due to insufficient capacity planning. |
| Undetermined | 29% | Inconclusive root cause. |
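
Because 29% of incidents have no conclusive root cause, it can help to see how the remaining categories compare among incidents where a cause was found. The following is a minimal sketch of that arithmetic, not something taken from the paper: the percentages are copied from the table above, and the names `ROOT_CAUSE_PCT` and `renormalize_determined` are hypothetical. Since the published figures are rounded, the renormalized shares are only approximate.

```python
from fractions import Fraction

# Percentages copied from the table above (rounded values as published).
ROOT_CAUSE_PCT = {
    "Maintenance": 17,
    "Hardware": 13,
    "Misconfiguration": 13,
    "Bug": 12,
    "Accidents": 11,
    "Capacity planning": 5,
    "Undetermined": 29,
}


def renormalize_determined(pct):
    """Return each category's share among incidents with a determined cause.

    Hypothetical helper: drops the "Undetermined" row and divides each
    remaining percentage by the sum of the remaining percentages.
    """
    determined = {k: v for k, v in pct.items() if k != "Undetermined"}
    total = sum(determined.values())
    return {k: Fraction(v, total) for k, v in determined.items()}


shares = renormalize_determined(ROOT_CAUSE_PCT)

# Sanity check: the published percentages cover all incidents.
print("Sum of all categories:", sum(ROOT_CAUSE_PCT.values()), "%")

# Print the renormalized shares, largest first.
for category, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<18} {float(share) * 100:5.1f}% of determined incidents")
```

Under these assumptions, maintenance accounts for 17 of the 71 percentage points with a determined cause, or roughly 24% of those incidents.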

Two notes worth considering:

We use “failures” to refer to any network device misbehavior. The root cause of a failure includes not only hardware faults, but also misconfigurations, maintenance mistakes, firmware bugs, and other issues.

And:

We use Govindan et al.’s definition of root cause: “A failure event’s root-cause is one that, if it had not occurred, the failure event would not have manifested.”