Under the right conditions refactoring provides a sort of express lane to becoming a master developer. […] Through refactoring, a developer can develop insights, skills, and techniques more quickly by addressing a well understood problem from a more experienced perspective. Practice makes perfect. If not the code, maybe the coder.
Together, the Record Layer and FoundationDB form the backbone of Apple’s CloudKit. We wrote a paper describing how we built the Record Layer to run at massive scale and how CloudKit uses it. Today, you can read the preprint to learn more.
From an anonymous FoundationDB blog post introducing relational database capabilities built atop FoundationDB’s key-value store. The paper about CloudKit (PDF) is also worth a read. CloudKit is Apple’s back-end as a service for all iOS and macOS apps, free at any legitimate scale.
I was taught to be contemptuous of the non-blessed narratives, and I was taught to pay for my continued access to the technical communities through perpetuating that contempt. I was taught to have an elevated sense of self-worth, driven by the elitism baked into the hacker ethos as I learned to program. By adopting the same patterns that other, more knowledgable people expressed I could feel more credible, more like a real part of the community, more like I belonged.
I bought my sense of belonging, with contempt, and paid for it with contempt and exclusionary behaviour.
And now, I realise how much of it is an anxiety response. What if I chose the wrong thing? What if other people judge me for my choices and assert that my hard-earned skills actually aren’t worth anything?
Adding a rollback button is not a neutral design choice. It affects the code that gets pushed. If developers incorrectly believe that their mistakes can be quickly reversed, they will tend to take more foolish risks. […]
Mounting a rollback button within easy reach […] means that it’s more likely to be pressed carelessly in an emergency. Panic buttons are for when you’re panicking.
In practice, we have fixed whole classes of reliability problems by forcing engineers to define deadlines in their service definitions.
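One way to picture "forcing" a deadline is to make it a required field of the service definition, so that a method cannot be declared without one. The sketch below is only an illustration of that idea in Python; `MethodDefinition` and `remaining_budget` are hypothetical names, not part of any real RPC framework or of the quoted author's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodDefinition:
    """Hypothetical service-method definition with a mandatory deadline."""

    name: str
    # No default value: leaving the deadline out raises TypeError at
    # definition time rather than surfacing as a hang in production.
    deadline_seconds: float

    def __post_init__(self) -> None:
        if self.deadline_seconds <= 0:
            raise ValueError(f"{self.name}: deadline must be positive")


def remaining_budget(definition: MethodDefinition, elapsed_seconds: float) -> float:
    """Time left for downstream calls, never negative."""
    return max(0.0, definition.deadline_seconds - elapsed_seconds)


get_user = MethodDefinition(name="GetUser", deadline_seconds=2.0)
print(remaining_budget(get_user, elapsed_seconds=0.5))
```

Passing the remaining budget to downstream calls, instead of a fresh timeout, keeps a slow dependency from consuming more time than the original caller was willing to wait.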
How do we learn from incidents, and how do we rebuild customer trust after an incident? Customer-facing postmortems are critical to this, but they have to answer the right questions.
PID controllers are all around us. Their straightforward, closed-loop cycle of set-point, observation, and response is the basis for almost every control system in our everyday world, but the Wikipedia article obscures their simple beauty and ubiquity.
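The set-point, observation, and response cycle fits in a few lines of code. What follows is a minimal textbook-style sketch in Python, not taken from the linked post; the gains, the time step, and the toy "plant" in `simulate` are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Discrete PID controller; the gain values are chosen by the caller."""

    kp: float  # proportional gain
    ki: float  # integral gain
    kd: float  # derivative gain
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, set_point: float, observed: float, dt: float) -> float:
        """One loop iteration: compare, accumulate, respond."""
        error = set_point - observed
        self.integral += error * dt
        # No derivative on the first call, since there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 600, dt: float = 0.1) -> float:
    """Drive a hypothetical plant toward a set-point of 20.0."""
    controller = PID(kp=2.0, ki=0.5, kd=0.1)
    value = 0.0
    for _ in range(steps):
        response = controller.update(20.0, value, dt)
        # Assumed plant: the value changes at a rate equal to the response.
        value += response * dt
    return value


print(round(simulate(), 2))
```

With these assumed gains the simulated value settles close to the set-point; real systems need tuning against the actual process being controlled.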
From A Large Scale Study of Data Center Network Reliability by Justin Meza, Tianyin Xu, Kaushik Veeraraghavan, and Onur Mutlu, the categorized root causes of intra-data-center incidents at Facebook from 2011 to 2018:
| Category | Fraction | Description |
|---|---|---|
| Maintenance | 17% | Routine maintenance (for example, upgrading the software and firmware of network devices). |
| Hardware | 13% | Failing devices (for example, faulty memory modules, processors, and ports). |
| Misconfiguration | 13% | Incorrect or unintended configurations (for example, routing rules blocking production traffic). |
| Bug | 12% | Logical errors in network device software or firmware. |
| Accidents | 11% | Unintended actions (for example, disconnecting or power cycling the wrong network device). |
| Capacity planning | 5% | High load due to insufficient capacity planning. |
| Undetermined | 29% | Inconclusive root cause. |
Two notes worth considering:
We use “failures” to refer to any network device misbehavior. The root cause of a failure includes not only hardware faults, but also misconfigurations, maintenance mistakes, firmware bugs, and other issues.
We use Govindan et al.’s definition of root cause: “A failure event’s root-cause is one that, if it had not occurred, the failure event would not have manifested.”
It may surprise you to learn that, in practice, clients’ clocks are heavily skewed. A recent study of Chrome users showed that a significant fraction of reported TLS-certificate errors are caused by client-clock skew. During the period in which error reports were collected, 6.7% of client-reported times were behind by more than 24 hours. (0.05% were ahead by more than 24 hours.) This skew was a causal factor for at least 33.5% of the sampled reports from Windows users, 8.71% from Mac OS, 8.46% from Android, and 1.72% from Chrome OS.
The title is a quote from Nikita Prokopov, who is wallowing in disenchantment.