If you read between the lines, it's pretty simple. Cloudflare has a very sophisticated setup of automated continuous deployment and canaries/progressive deploys. Plus a virtual mode for syntax checking.
A human bypassed the former two to run the latter globally by hand.
Which they had probably done countless times before, to no ill effect. It is undoubtedly MUCH faster than waiting for all the automated systems to do their thing, safely and cautiously.
(blog referenced: https://blog.cloudflare.com/cloudflare-outage/ …)
Who would ever deploy a config change globally, all at once?! EVEN IF it's supposed to be a no-op?!!!?
Someone who hasn't internalized that config files are code. That's most of us, btw.
It further shines a light on one of the cardinal laws of release engineering: running any process to deploy config or code other than the blessed path is heresy. No tolerance. No exceptions.
Corollary: you should invest a lot into making the default PATH fast and easy.
No one-off scripts. No shell snippets. No old code that is kept around just for this one weird edge case, or in case everything is already hosed or if you need it up quickly but destructively.
One tool. One very well tested, documented, highly optimized, trustworthy tool.
You test the code path every time someone runs it, or the machines run it, because 🌟everybody runs the same shit🌟. You test it many times a day. There are no other "shortcuts" out there, lurking, rotting.
But again: it MUST be fast. Or shortcuts will bloom to get around it.
They will be completely well meaning.
"Wait 30 min just to test a config change that doesn't even DO anything? That would blow the rest of my afternoon!"
... aaand now you're in the NYT for taking down the internet.
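To make the "one tool" idea concrete, here is a minimal sketch of a single deploy entry point that always rolls out in stages, whether a human or a machine calls it. Everything in it is hypothetical: the stage fractions, the `progressive_deploy` name, and the `apply_change` / `is_healthy` / `rollback` callbacks are illustrative assumptions, not Cloudflare's tooling or anyone else's.

```python
from typing import Callable, Sequence

# Hypothetical stages: fraction of the fleet that has the change after each step.
STAGES = (0.01, 0.10, 0.50, 1.00)


def progressive_deploy(
    hosts: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stages: Sequence[float] = STAGES,
) -> bool:
    """Roll a change out stage by stage; undo it everywhere on failure.

    Returns True if the change reached every host, False if it was rolled back.
    """
    done = []  # hosts that currently have the change applied
    for fraction in stages:
        # Touch at least one host per stage, never more than the whole fleet.
        target = min(len(hosts), max(1, int(len(hosts) * fraction)))
        for host in hosts[len(done):target]:
            apply_change(host)
            done.append(host)
        # Check every host touched so far before widening the blast radius.
        if not all(is_healthy(h) for h in done):
            for h in reversed(done):
                rollback(h)
            return False
    return True
```

Note what's missing: there is no "skip the canary" or "all at once" argument. If the staged path is too slow, the fix is to make the stages faster, not to add a bypass.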
Optimizing build and deploy speeds is one of those very accessible engineering tasks that engineers either love or hate (a Sorting Hat might well sort engineers into infra and product, respectively), but higher-level management chafes at it because they don't understand its impact.
It is absolutely the responsibility of line managers and senior engineers to push back forcefully. Educate them.
Every time there's an outage, explain how deploy tooling caused it or contributed to the impact.
Forward them post mortems like this one and explain how it will happen to you. Not might; will.
Amplify some less dramatic costs, too. Like the time someone spends babysitting a deploy, or waiting on it, or queueing up, or what happens when multiple changesets go out at once (😱😱😱).
You can't blame upper management for making decisions based on the information they have. And you forget how invisible things like this can be.
So make them visible. Help your management make better informed decisions.
One of my favorite blog posts on continuous deploys from one of my favorite teams, @IntercomEng: https://www.intercom.com/blog/why-continuous-deployment-just-keeps-on-giving/ …
... found a couple of other gems while searching: https://www.intercom.com/blog/videos/reaching-the-right-balance-between-speed-and-safety/ …, and https://www.intercom.com/blog/moving-faster-with-smaller-steps/ …
Fun fact: Intercom used Honeycomb to instrument their pipeline, using spans so they can visualize it like a trace and see where the time goes, and got it down to 3-4 min to run tests and deploy code.
This is a Ruby shop, y'all.
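For a feel of what "spans for your pipeline" means, here is a rough stdlib-only sketch. It is not Honeycomb's API or Intercom's setup; the `span`, `run_pipeline`, and `slowest` names are made up for illustration. Each pipeline step is timed as a child of one root "deploy" span, which is the shape a tracing tool would draw as a waterfall.

```python
import time
from contextlib import contextmanager

spans = []  # finished spans, appended in completion order


@contextmanager
def span(name, parent=None):
    """Time a block of work and record it as a span with an optional parent."""
    start = time.perf_counter()
    try:
        yield name
    finally:
        spans.append({
            "name": name,
            "parent": parent,
            "duration_s": time.perf_counter() - start,
        })


def run_pipeline(steps):
    """Run each (name, callable) step as a child span of a root 'deploy' span."""
    with span("deploy") as root:
        for name, step in steps:
            with span(name, parent=root):
                step()


def slowest(recorded, parent):
    """Return the child span of `parent` that took the longest."""
    children = [s for s in recorded if s["parent"] == parent]
    return max(children, key=lambda s: s["duration_s"])
```

Call `run_pipeline` with your real build, test, and ship steps, then ask `slowest(spans, "deploy")` where the time went. That's the number to put in front of your management.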
You can follow @mipsytipsy.