Charity Majors @mipsytipsy — CTO @honeycombio; co-wrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤Black Lives Matter🖤 — May 04, 2020 · 3 min read

This is a fun and informative thread on matters such as progressive deployments, serverless, canaries, and chaos.

And at least one useful point I don't think I've ever thought to articulate before, on the subject of testing in production ...

... which is that canaries are useful for testing code correctness and perf of the service itself, while rolling upgrades are useful for testing the impact of your changes on other services or storage systems.

You really want the ability to do both, as someone shipping code.

Actually, this is a great example of sociotechnical problem-solving.

Are you frustrated by too many failed deploys, or days elapsing before bugs are noticed and reverted? Do you struggle with people getting paged and losing time debugging changes that weren't theirs?

What won't help: getting upset with people, begging them to be more careful, endless retrospectives, blaming and shaming.

What might: assuming your team already wants to do a good job, and building tooling to boost visibility, create feedback loops, and give fine-grained control.

Some examples. Is your problem that it's difficult and time consuming to figure out which diff is causing a problem, and who owns it?

Fix your deploys so that each deploy contains a single changeset. Generate a new, tested artifact after each merge, and deploy them in order.

This, fwiw, is the single most important piece of advice I would give to ANYONE. Unbundling the snarl of merges and autodeploying each diff after tests run and produce an artifact -- this is THE key to unfucking deploys and hooking up the right feedback loops.
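A minimal sketch of that idea — one tested artifact per merge, deployed strictly in merge order. All names here (`Artifact`, `run_tests`, `pipeline`) are hypothetical stand-ins for your CI system, not a real API:

```python
# Sketch: unbundle the snarl of merges -- each merge to master becomes
# exactly one tested artifact, queued for deploy in merge order.
from dataclasses import dataclass

@dataclass
class Artifact:
    sha: str
    tested: bool = False

def run_tests(artifact):
    # Stand-in for a real CI test run against this single artifact.
    return True

def pipeline(merged_shas):
    """For each merge, build and test one artifact; never batch diffs.
    Artifacts that pass tests are deployed in the order they merged."""
    queue = []
    for sha in merged_shas:
        artifact = Artifact(sha=sha)
        artifact.tested = run_tests(artifact)  # gate: tests must pass
        if artifact.tested:
            queue.append(artifact)             # deployed strictly in order
    return queue
```

The point of the one-artifact-per-merge constraint is that when something breaks, the suspect list has exactly one entry.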


Once you do that, you are on to more exciting options.

Struggle with ownership? Modify your paging alerts so that if it's within an hour of a deploy to the complaining service, it pages whoever wrote and merged the diff that just rolled out.
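That routing rule might look something like this — a hypothetical sketch, assuming you have a deploy log with timestamps and authors (the data shapes and the one-hour window are taken from the rule above, everything else is illustrative):

```python
from datetime import datetime, timedelta

def choose_pagee(alert_time, deploys, default_oncall):
    """If the alert fired within an hour of a deploy to the complaining
    service, page the engineer who merged that diff; otherwise fall back
    to the regular on-call rotation."""
    # Check the most recent deploy first.
    for deploy in sorted(deploys, key=lambda d: d["time"], reverse=True):
        age = alert_time - deploy["time"]
        if timedelta(0) <= age <= timedelta(hours=1):
            return deploy["author"]
    return default_oncall
```

Usage: feed it the alert timestamp and the service's recent deploys; a deploy 30 minutes old routes the page to its author, anything older falls through to on-call.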

Struggle with a team that keeps lobbing a deploy out and not finding errors until days or weeks later? Aha, now we are deep in social/technical experimentation territory. ☺️

Likely your team is very weak at practicing instrumentation, for starters.

You might want to devote some real cycles to standardizing your instrumentation and observability, and create social and technical pressures to do the right thing.

Like, make it expected for devs to watch as their own diffs roll out, synchronously/in real time.

Pre-generate a url that compares the current/stable version and their new version, and graphs both versions side by side. Have the deploy process spit it out and instruct them: "GO HERE".
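A sketch of that pre-generated URL — the dashboard host and query parameters here are invented for illustration; substitute whatever your graphing tool actually accepts:

```python
from urllib.parse import urlencode

def comparison_url(service, stable_sha, new_sha,
                   base="https://graphs.example.com/compare"):
    """Build the side-by-side dashboard link the deploy process prints
    ("GO HERE"): current/stable version vs. the version rolling out."""
    params = urlencode({
        "service": service,
        "baseline": stable_sha,   # the known-good version
        "candidate": new_sha,     # the diff that just rolled out
        "window": "1h",           # assumed default comparison window
    })
    return f"{base}?{params}"
```

Having the deploy tooling emit this link removes the excuse of "I didn't know where to look" — the graph is one click away the moment the diff ships.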

Maybe it deploys to a canary or 10% of hosts and then requires confirmation to proceed.

I could keep going... there are endless things you can do or build to incentivize engineers to actively explore and engage with their code in prod. ☺️

Or maybe you have the opposite problem: people ship too recklessly, and prod goes down or the deploys fail or roll back all day.

In that case maybe you need to build something like this:

* automate the process of deploying each CI/CD artifact ...
* ... to a single canary host
* then monitor a number of health checks and thresholds over the next 30 min
* if OK, promote 10% of hosts at a time to the new version
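The steps above can be sketched as a promotion loop — assuming hypothetical `deploy()` and `healthy()` helpers that update a set of hosts and evaluate your health checks/thresholds:

```python
import time

def progressive_rollout(hosts, deploy, healthy,
                        batch_frac=0.10, soak_secs=1800):
    """Deploy to a single canary, soak while monitoring health checks
    (~30 min), then promote 10% of hosts at a time. Returns False to
    signal a halt (and rollback) as soon as health checks fail."""
    canary, rest = hosts[:1], hosts[1:]
    deploy(canary)
    time.sleep(soak_secs)              # monitor the canary's soak window
    if not healthy(canary):
        return False                   # sick canary: stop here
    batch = max(1, int(len(hosts) * batch_frac))
    for i in range(0, len(rest), batch):
        chunk = rest[i:i + batch]
        deploy(chunk)
        if not healthy(chunk):         # each batch must pass before the next
            return False
    return True
```

A confirmation gate (the "requires confirmation to proceed" idea earlier in the thread) would slot in naturally between the canary check and the batch loop.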

I believe that the deployment process is a criminally undertapped resource, the most powerful tool you have for understanding the strengths and weaknesses of a team.

Carefully considered changes to deploys can improve the overall function of the team in one fell swoop.

Imagine a world where your team is the team you have today, except:

- if you merge to master, your changes will automatically go live in the next 15 minutes

- deploys are almost entirely nonevents, because the behavioral changes are governed through feature flags anyway
