Charity Majors @mipsytipsy CTO @honeycombio, ex-Parse, Facebook, Linden Lab; cowrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤 Mar. 05, 2019 4 min read

Yay, @lizthegrey taking the stage at @qconlondon in the huge keynote auditorium to tell us morality tales about complex systems!!

telling horror stories right off the bat. "So I spun up an infra, I did all the right things... now I have a bazillion dashboards and outages take forever and only one person actually knows how to debug anything and holy shit are my engineers getting cranky"

You're drowning in operational overload, and *no amount of tooling* is going to help if you don't have the right mental model and the right plan.

Tools can only help if you know what you're doing.

You forgot about who runs your software: people do. Your plan has to be human-centered and people-focused. This requires ... dun-dun-dun ... ✨Production Excellence✨.

You have to invest in making your systems more reliable and friendly. It's not okay to feed your systems with the blood of your humans.

Culture is _everything_. Changing the culture of the team is an intentional effort that takes everyone on the team and everyone adjacent.

Engineers: when was the last time you invited sales, marketing, execs, your customer support folks to your meetings?

Production excellence must involve all of these stakeholders, or you will leave folks out and it will be unsustainable.

A big part of production excellence is building people up: increasing their confidence so they are willing to touch prod. You have to encourage asking questions, and make people feel safe taking some time to think.

(call out to @sarahjwells keynote where developers wouldn't touch mysql to restart the db for 20 min because they were so traumatized by their ops team)

so where do we start?

* know where to start
* and be able to debug
* ... debug together, if you span services
* and pay down complexity, reduce duplication and drudgery

our systems are ✨always failing.✨ your systems should be resilient to a million different types of errors.

[ed: i feel bad about not posting all this terrific art, but i could not keep up live tweeting from my phone. THIS IS WAY HARDER THAN IT LOOKS]

We cannot and should not have to care about every failure or error in our systems. So how do we decide *which* to care about?

Enter Service Level Indicators!

How do you establish an SLI? Well, this is where collaboration comes in. Ask your product managers -- what delights users? What annoys them? Ask around -- ask all your stakeholders. They will have Opinions!

(ugh still no way to remove tagged people from tweet threads, sorry sarah)

You have to establish some arbitrary thresholds, so you can bucket your events into good and bad. (Non-user-impacting events are *excluded* from these calculations.)
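(ed: a tiny sketch of what that bucketing might look like; the 500 ms threshold and the event field names are mine, invented for illustration, not from the talk)

```python
# Bucket each event as good or bad against an arbitrary threshold.
# Threshold and field names are illustrative, not from the keynote.

def is_good(event, latency_threshold_ms=500):
    """An event is good if it succeeded AND came back fast enough."""
    return event["status"] < 500 and event["latency_ms"] <= latency_threshold_ms

events = [
    {"status": 200, "latency_ms": 120},  # fast success -> good
    {"status": 200, "latency_ms": 900},  # too slow -> bad
    {"status": 503, "latency_ms": 80},   # server error -> bad
]

good = sum(is_good(e) for e in events)
print(f"{good}/{len(events)} events were good")  # 1/3 events were good
```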

Now you can compute your SLO. Use a window (1 month, 3 months) and a target percentage. If you reset the window every month, you will run into a very serious problem: users have memories, and they will remember that you were down yesterday even if your monthly uptime is a shiny new 100%.
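(ed: here's a rough sketch of why a *rolling* window matters; the daily counts and 99.9% target are numbers i made up)

```python
# Rolling-window SLO compliance: one bad day keeps dragging a trailing
# 30-day window below target, where a calendar-month reset would hide it.
# All counts and the target are invented for illustration.

# (good_events, total_events) per day, most recent last
daily = [(999_900, 1_000_000)] * 29 + [(950_000, 1_000_000)]  # bad day yesterday

window_good = sum(g for g, _ in daily)
window_total = sum(t for _, t in daily)
availability = window_good / window_total  # trailing 30-day SLI
slo_target = 0.999

print(f"30-day availability: {availability:.5f}")
print("meeting SLO" if availability >= slo_target else "burning budget")
```

Run it and yesterday's outage shows up as a budget deficit for the next 30 days, which matches how users actually remember you.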

A good SLO barely keeps users happy.
A good SLO barely keeps users happy.
A good SLO barely keeps users happy.
A good SLO barely keeps users happy.
A good SLO barely keeps users happy.
A good SLO barely keeps users happy.
A good SLO barely keeps users happy.

You can and should drive alerting with SLOs, and drive business decision-making with SLOs.

If you are going to run out of SLO within minutes, maybe you want to wake someone up. If you aren't going to run out for days, let them sleep ffs.
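(ed: the "minutes vs days" decision is basically a burn-rate calculation; this sketch and its thresholds are my own gloss, not Liz's slides)

```python
# Burn-rate sketch: estimate hours until the error budget is exhausted,
# and only page a human if that's soon. All thresholds are illustrative.

def hours_until_budget_exhausted(budget_remaining, error_rate, request_rate):
    """budget_remaining: errors we can still afford this window;
    error_rate: fraction of requests currently failing;
    request_rate: requests per hour."""
    errors_per_hour = error_rate * request_rate
    if errors_per_hour == 0:
        return float("inf")  # not burning budget at all
    return budget_remaining / errors_per_hour

hours = hours_until_budget_exhausted(
    budget_remaining=10_000, error_rate=0.05, request_rate=100_000
)
if hours < 1:
    print(f"page someone: budget gone in {hours:.1f}h")
elif hours < 72:
    print(f"file a ticket: budget gone in {hours:.0f}h")
else:
    print("let them sleep")
```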

If you are bleeding error budget, maybe you need to invest in more reliability (dollars, engineering effort, process)

You cannot act on what you don't measure. Just start with something and iterate; something is always better than nothing.

perfect SLO > good SLO >>> no SLO

If you have a significant outage and nobody complains, maybe you are calibrated too high.

The job of calibrating an SLO is never done, it will need to be continually revised through conversations with stakeholders.

But SLIs and SLOs are only half the picture. They will tell you when something is wrong, but will never tell you what to do about it.

Our outages are never wholly predictable; the exact same thing never happens twice.

We have to build observable systems, and collect data at a level that will let us question our systems' inner state without shipping new code to handle that question.

It's the only way to deal with new and complex failures.

But let's take it a step further. Can you mitigate the impact immediately, and debug later? Focus on your SLI and SLO, then someone can look at the data the next day and figure out what went wrong ... at their leisure, during daytime hours.
