Charity Majors @mipsytipsy CTO @honeycombio, ex-Parse, Facebook, Linden Lab; cowrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤 Mar. 04, 2019 2 min read

When people ask me how they can convince their bosses to shell out for observability, I often toss them two links -- the DORA report by @nicolefv et al. and the @stripe developer productivity paper.

And now, thanks to @jasonallen206, I have a third link:  https://m.subbu.org/incidents-trends-from-the-trenches-e2f8497d52ed 

The link is a roundup by @sallamar of several hundred production outages, with some fascinating (and well-executed) attempts at grouping them by proximate cause and breaking them down by impact to users.

Pie-chart-induced eye trauma aside, his findings are blunt, and resonate completely with my experience running systems. To wit:

1) change is the trigger in 2/3 of outages
2) config drift is deadly
3) we don't know why things fail
4) infra changes are a shrinking %
5) certs lol

Due to hyperconnectedness, ripple effects, and "hope-driven releases", none of these trends are going away any time soon, and in fact *they are all going to accelerate*. Sorry.

The distributedness of systems is increasingly their most salient characteristic.

And check out the point about how the share of incidents that are infra-related is small and shrinking!

Hardware isn't failing any less, it's just been successfully made into someone else's job. Ops is moving up the stack.

He has a few very sensible and obvious recommendations for how to deal with this. I will summarize them as "make your system tolerant to faults and resilient to failures."

"..except that's hard? So invest in observability and spend more time ACTUALLY UNDERSTANDING your systems."

And change safety. Apply slightly less duct tape and slightly more rigor to your change control.

Or as I keep barking, invest real developer hours into instrumentation, your CI/CD pipeline, and deployment code, and practice observability-driven development.
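(For instance -- and this is just a sketch, with a placeholder endpoint and field names, not any vendor's actual API -- "investing in deployment code" can be as small as recording every deploy as a structured event, so you can later line changes up against the behavior they triggered. Useful when change is the trigger in 2/3 of your outages.)

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"
)

// One small piece of "deployment code worth investing in": record every
// deploy as a structured event so changes can be correlated with system
// behavior afterward. The endpoint and field names below are placeholders.
type deployEvent struct {
	Service    string    `json:"service"`
	GitSHA     string    `json:"git_sha"`
	DeployedBy string    `json:"deployed_by"`
	Timestamp  time.Time `json:"timestamp"`
}

func main() {
	ev := deployEvent{
		Service:    "checkout-api",           // hypothetical service name
		GitSHA:     os.Getenv("GIT_SHA"),     // typically set by the CI/CD pipeline
		DeployedBy: os.Getenv("DEPLOY_USER"), // hypothetical env var
		Timestamp:  time.Now().UTC(),
	}

	body, err := json.Marshal(ev)
	if err != nil {
		log.Fatal(err)
	}

	// EVENTS_URL is whatever ingests your structured events; an assumption here.
	resp, err := http.Post(os.Getenv("EVENTS_URL"), "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	log.Printf("recorded deploy event: %s", body)
}
```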

As requested: the DORA report  https://devops-research.com/  and stripe developer productivity paper  https://stripe.com/reports/developer-coefficient-2018 

One last thing. You notice what isn't mentioned? Better monitoring. Monitoring is ~useless to developers shipping services and debugging code.

Monitor a few high-level metrics and end-to-end checks, absolutely. But aggregates and counters and metrics won't do ✨jack shit✨ when it comes to understanding emergent behaviors or high-granularity outliers.

You need to debug in the language you develop. You need request-level instrumentation with ordering so you can flip back and forth between request traces and request aggregates.
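(Concretely -- and this is a from-scratch Go sketch with illustrative field names and stdout as the sink, not @honeycombio's SDK -- request-level instrumentation means something like a middleware that emits one wide, structured event per request, carrying a request ID so events can be ordered and stitched back into traces.)

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"
)

// statusRecorder captures the status code a handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler and emits one wide, structured event per
// request. Here it writes JSON to stdout; in real life you'd send it to
// whatever backend you use. Field names are illustrative.
func instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()

		// Propagate or mint a request ID so events can be ordered and
		// flipped between per-request traces and cross-request aggregates.
		reqID := req.Header.Get("X-Request-Id")
		if reqID == "" {
			b := make([]byte, 8)
			rand.Read(b)
			reqID = hex.EncodeToString(b)
		}

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)

		event := map[string]interface{}{
			"timestamp":   start.UTC().Format(time.RFC3339Nano),
			"request_id":  reqID,
			"method":      req.Method,
			"path":        req.URL.Path,
			"status":      rec.status,
			"duration_ms": float64(time.Since(start).Microseconds()) / 1000.0,
		}
		json.NewEncoder(os.Stdout).Encode(event)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hi"))
	})
	log.Fatal(http.ListenAndServe(":8080", instrument(mux)))
}
```

Point those events at a store that can slice on any field and you can hop between aggregates over millions of requests and the exact trace of one weird outlier.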

Only @honeycombio gives you that. That's why our customers give such breathless happy quotes. 💖🐝

Observability is never going to be as turnkey easy as the old school black box monitoring agents, because it has to come from your code.

That said, it's pretty damn easy -- just install the gem or go get the package or whatever. If you wanna try us, here are three links:

🌈Our white papers and ebooks on observability,  https://www.honeycomb.io/resources/white-papers/ 

🌈Play in a sandbox,  http://honeycomb.io/play 

🌈Sign up for a trial,  https://ui.honeycomb.io/signup 

Onward, unto the undebuggable breaches of tomorrow. 🐝📈❤️

