software engineering

In praise of refactoring

Under the right conditions refactoring provides a sort of express lane to becoming a master developer. […] Through refactoring, a developer can develop insights, skills, and techniques more quickly by addressing a well understood problem from a more experienced perspective. Practice make perfect. If not the code, maybe the coder.

From Patrick Goddi, who argues refactoring is about more than code quality.

Apple CloudKit uses FoundationDB Record Layer

Together, the Record Layer and FoundationDB form the backbone of Apple’s CloudKit. We wrote a paper describing how we built the Record Layer to run at massive scale and how CloudKit uses it. Today, you can read the preprint to learn more.

From an anonymous FoundationDB blog post introducing relational database capabilities built atop FoundationDB’s key-value store. The paper about CloudKit (PDF) is also worth a read. CloudKit is Apple’s free at any legitimate scale back-end as a service for all iOS and MacOS apps.

Technology choices, belonging, and contempt

I was taught to be contemptuous of the non-blessed narratives, and I was taught to pay for my continued access to the technical communities through perpetuating that contempt. I was taught to have an elevated sense of self-worth, driven by the elitism baked into the hacker ethos as I learned to program. By adopting the same patterns that other, more knowledgable people expressed I could feel more credible, more like a real part of the community, more like I belonged.

I bought my sense of belonging, with contempt, and paid for it with contempt and exclusionary behaviour.

And now, I realise how much of it is an anxiety response. What if I chose the wrong thing? What if other people judge me for my choices and assert that my hard-earned skills actually aren’t worth anything?

From Aurynn Shaw on cultures of exclusion and contempt in technology professions.

Rollback buttons and time machines

Adding a rollback button is not a neutral design choice. It affects the code that gets pushed. If developers incorrectly believe that their mistakes can be quickly reversed, they will tend to take more foolish risks. […]

Mounting a rollback button within easy reach […] means that it’s more likely to be pressed carelessly in an emergency. Panic buttons are for when you’re panicking.

From Dan McKinley, speaking about the complications and near impossibility of rolling back a deployment.

Common root causes of intra data center network incidents at Facebook from 2011 to 2018

From A Large Scale Study of Data Center Network Reliability by Justin Meza, Tianyin Xu, Kaushik Veeraraghavan, and Onur Mutlu, the categorized root causes of intra data center incidents at Fabook from 2011 to 2018:

CategoryFractionDescription
Maintenance17%Routine maintenance (for example, upgrading the software and firmware of network devices).
Hardware13%Failing devices (for example, faulty memory modules, processors, and ports).
Misconfiguration13%Incorrect or unintended configurations (for example, routing rules blocking production traffic).
Bug12%Logical errors in network device software or firmware.
Accidents11%Unintended actions (for example, disconnecting or power cycling the wrong network device).
Capacity planning5%High load due to insufficient capacity planning.
Undetermined29%Inconclusive root cause.

Two notes worth considering:

We use “failures” to refer to any network device misbehavior. The root cause of a failure includes not only hardware faults, but also misconfigurations, maintenance mistakes, firmware bugs, and other issues.

And:

We use Govindan et al.’s definition of root cause: “A failure event’s root-cause is one that, if it had not occurred, the failure event would not have manifested.”

Time synchronization is rough

CloudFlare on the frustrations of clock skew:

It may surprise you to learn that, in practice, clients’ clocks are heavily skewed. A recent study of Chrome users showed that a significant fraction of reported TLS-certificate errors are caused by client-clock skew. During the period in which error reports were collected, 6.7% of client-reported times were behind by more than 24 hours. (0.05% were ahead by more than 24 hours.) This skew was a causal factor for at least 33.5% of the sampled reports from Windows users, 8.71% from Mac OS, 8.46% from Android, and 1.72% from Chrome OS.

They’re proposing Roughtime as a solution.