Andy Troutman’s talk is useful for explaining complex deployment workflows to management types.
His slides are on SlideShare.
My brief notes:
The formal “process” for changes to a service starts with a pull request and significant code review. Both the engineer who makes the change and the reviewer are held accountable for failures caused by the change.
When reviewing a change, assume the service will fail, and use pre-mortem analysis to figure out why and how to reduce risks.
Anybody can pull the Andon cord to stop the release process.
Rolling releases are common. They sometimes use canaries. “Release” happens after the canaries are tested.
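To make the canary-then-roll idea concrete, here is a minimal sketch of my own, not anything shown in the talk. The names `rolling_release`, `deploy`, `healthy`, `canary_count` and `batch_size` are all hypothetical, and the health check is just a callback you would supply.

```python
from typing import Callable, List, Tuple


def rolling_release(
    hosts: List[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    canary_count: int = 1,   # assumed default, not from the talk
    batch_size: int = 2,     # assumed default, not from the talk
) -> Tuple[List[str], bool]:
    """Deploy to canaries first, then to the remaining hosts in batches.

    Returns (hosts updated so far, whether the release completed).
    Stops as soon as any freshly updated host fails its health check.
    """
    updated: List[str] = []
    canaries, rest = hosts[:canary_count], hosts[canary_count:]

    # Canary phase: update a small subset and test it before going wider.
    for host in canaries:
        deploy(host)
        updated.append(host)
    if not all(healthy(h) for h in canaries):
        return updated, False

    # Rolling phase: update the remaining hosts one batch at a time.
    for i in range(0, len(rest), batch_size):
        batch = rest[i:i + batch_size]
        for host in batch:
            deploy(host)
            updated.append(host)
        if not all(healthy(h) for h in batch):
            return updated, False

    return updated, True
```

With a health check that fails on the canary, only the canary is ever touched, which is the point of testing the canaries before the wider release.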
Service health tracking is tied to deployment systems. With or without canaries, automated rollbacks are common; they are implemented as a roll-forward to a previous release and triggered by relatively low alarm thresholds.
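A rough sketch of how a health signal could drive an automated rollback, where the rollback is itself a new deployment of the last known-good release. This is my illustration, not AWS's tooling: the `Deployment` class, the `evaluate` function, the error-rate metric and the 1% threshold are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of what has been deployed to a service."""
    history: List[str] = field(default_factory=list)
    last_good: Optional[str] = None

    @property
    def current(self) -> Optional[str]:
        return self.history[-1] if self.history else None

    def deploy(self, version: str) -> None:
        self.history.append(version)

    def mark_good(self) -> None:
        # Remember the current release as the rollback target.
        self.last_good = self.current

    def rollback(self) -> str:
        # "Rollback" is a roll-forward: deploy the last good release again.
        if self.last_good is None:
            raise RuntimeError("no known-good release to roll back to")
        self.deploy(self.last_good)
        return self.last_good


def evaluate(deployment: Deployment, error_rate: float,
             threshold: float = 0.01) -> str:
    """Roll back when the error rate crosses a (deliberately low) threshold."""
    if error_rate > threshold:
        deployment.rollback()
        return "rolled_back"
    deployment.mark_good()
    return "healthy"
```

Note that the history only ever grows: a rollback appends the previous release rather than rewinding, so the deployment log still shows what happened.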
And then there’s how AWS runs their weekly operations meetings.