Charity Majors (@mipsytipsy): CTO @honeycombio; co-wrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤Black Lives Matter🖤
Mar. 15, 2020

Do I *ever* have a treat for you. 😍 @Mads_Hartmann is back with another installment in the exciting saga of how Glitch embraced observability, and their experiences rolling it out. 

I can talk about this shit all day long, it will never be as valuable to y'all as these first hand stories from users who are wrestling with it in the trenches. 📈

Technical progress is made through stories like these, so ☀️thank you☀️ -- on behalf of the community.

We start with a quick overview of the Glitch systems, then zero in on why observability matters to them: in constantly changing systems, unknown-unknowns are key.

If they are easy, you're in good shape. If you struggle to ask or answer or deal with them, you are kinda fucked.

This can be tricky to diagnose from the outside because it can manifest in so many different ways:

* hiring an ops or SRE team
* over-hiring in general
* premature specialization
* on call is a nightmare
* deploys are scary or messy
* deploys remain stubbornly manual

... and the biggest one of all, which is that nobody can actually answer specific questions about the behavior or performance for individual customers, and trying to do so remains a stubborn mix of guesswork and gluing together bits of answers from components by hand.

Basically, lots of very smart people have been creatively attacking the unknown-unknowns problem for a long time using whatever organizational, technical and financial resources they have had at hand.

And kudos to them. 👏

However, many of these solutions are very costly and add lots of overhead.

(Ever been mystified by a company that employs 1000 engineers to build a product that still looks basically the same as it did with 100, or even 10? Sadly, this is often why.)

As Mads notes, observability wraps its tentacles into every corner of how you interact with production, but incident response is often where it starts, because that's where lack of o11y will fucking *kill you*.

... I'm being informed it is time for bed, and will resume this recap + book review in the morning. 😵🌈🌛😴
