When people ask me how they can convince their bosses to shell out for observability, I often toss them two links -- the DORA report by @nicolefv et al and the @stripe developer productivity paper.
And now, thanks to @jasonallen206, I have a third link: https://m.subbu.org/incidents-trends-from-the-trenches-e2f8497d52ed
It's a roundup by @sallamar of several hundred production outages, with some fascinating (and well-executed) attempts at grouping by proximate cause and breaking down by impact to users.
Pie-chart-induced eye trauma aside, his findings are blunt and resonate completely with my experience running systems. To wit:
1) change is the trigger in 2/3 of outages
2) config drift is deadly
3) we don't know why things fail
4) infra changes are a shrinking %
5) certs lol
Due to hyperconnectedness, ripple effects, and "hope-driven releases", none of these trends are going away any time soon, and in fact *they are all going to accelerate*. Sorry.
The distributedness of systems is increasingly their most salient characteristic.
And check out that point about how the share of infra-related incidents is small and shrinking!
Hardware isn't failing any less, it's just been successfully made into someone else's job. Ops is moving up the stack.
He has a few very sensible and obvious recommendations on how to deal with this. I will summarize them as "make your system tolerant to faults and resilient to failures."
"..except that's hard? So invest in observability and spend more time ACTUALLY UNDERSTANDING your systems."
And change safety. Apply slightly less duct tape and slightly more rigor to your change control.
Or as I keep barking, invest real developer hours into instrumentation, your CI/CD pipeline, and deployment code, and practice observability-driven development.
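In case "instrumentation" sounds abstract, here's a minimal sketch of what it looks like at the code level: one wide, structured event per request. (The field names and the handler are made up for illustration, not any particular vendor's schema.)

```python
# Hypothetical sketch: emit one wide structured event per request,
# instead of incrementing a pile of disconnected counters.
import json
import time
import uuid


def do_work(endpoint):
    # Stand-in for your actual handler logic.
    return f"ok: {endpoint}"


def handle_request(user_id, endpoint):
    event = {
        "request_id": str(uuid.uuid4()),  # lets you find THIS request later
        "endpoint": endpoint,
        "user_id": user_id,
        "build_id": "abc123",             # tie behavior back to the deploy
    }
    start = time.monotonic()
    try:
        result = do_work(endpoint)
        event["status"] = 200
        return result
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        print(json.dumps(event))          # ship to your event store
```

The point is that every interesting piece of context rides along on the same event, so you can slice by any of it later.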
As requested: the DORA report https://devops-research.com/ and the Stripe developer productivity paper https://stripe.com/reports/developer-coefficient-2018
One last thing. You notice what isn't mentioned? Better monitoring. Monitoring is ~useless to developers shipping services and debugging code.
Monitor a few high level metrics and end-to-end checks, absolutely. But aggregates and counters and metrics won't do ✨jack shit✨ when it comes to understanding emergent behaviors or high granularity outliers.
You need to debug in the language you develop. You need request-level instrumentation with ordering so you can flip back and forth between request traces and request aggregates.
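To make "flip back and forth" concrete, here's a toy sketch with a handful of hypothetical per-request events (made-up fields): the same raw events power both the aggregate view and the drill-down to the exact outlier requests behind it.

```python
# Illustrative only: per-request events let you compute an aggregate,
# then pivot straight back to the individual requests that produced it.
events = [
    {"request_id": "a", "endpoint": "/cart", "duration_ms": 12},
    {"request_id": "b", "endpoint": "/cart", "duration_ms": 950},
    {"request_id": "c", "endpoint": "/home", "duration_ms": 8},
]

# Aggregate view: worst-case latency per endpoint.
worst = {}
for e in events:
    worst[e["endpoint"]] = max(worst.get(e["endpoint"], 0), e["duration_ms"])

# Flip back to raw requests: which exact requests are slow on /cart?
slow = [e["request_id"] for e in events
        if e["endpoint"] == "/cart" and e["duration_ms"] > 500]

print(worst)  # {'/cart': 950, '/home': 8}
print(slow)   # ['b']
```

A pre-aggregated counter can tell you "/cart got slow"; it can never hand you request "b" to go debug.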
Only @honeycombio gives you that. That's why our customers give such breathless happy quotes. 💖🐝
Observability is never going to be as turnkey easy as the old school black box monitoring agents, because it has to come from your code.
That said, it's pretty damn easy -- just install the gem or go get the package or whatever. If you wanna try us, here are three links:
🌈Our white papers and ebooks on observability, https://www.honeycomb.io/resources/white-papers/ …
🌈Play in a sandbox, http://honeycomb.io/play
🌈Sign up for a trial, https://ui.honeycomb.io/signup
Onward, unto the undebuggable breaches of tomorrow. 🐝📈❤️
You can follow @mipsytipsy.