Charity Majors (@mipsytipsy): CTO @honeycombio; co-wrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤Black Lives Matter🖤 Nov. 04, 2019

I think it's the sheer number of social, technical and organizational problems hiding under this rock. It's not just deploys, it's:

* not communicating changes well to users
* relying on users to report issues
* widespread production illiteracy
* big bang deploys, poor tooling
* the accumulated debt of years of building poorly understood features and fixes atop poorly understood systems

Last week I wrote that piece on developing at Honeycomb vs Parse, and I can't stop thinking about how I might communicate this better.

It's not perfect, but one possible analogy is going to the doctor or the dentist.

Imagine going to the dentist for the first time in 10 years, vs going in for your regular 6-month checkup. Those are likely to be very different conversations, right?

Or imagine going to the emergency room with the following complaints: high blood pressure, heart arrhythmia, diabetes, a broken arm, an ingrown toenail and smallpox.

What are the chances they ever get around to giving a fuck about some of those?

Yet that is how we run our production systems. We are so busy lurching from broken arm to smallpox, or at least outage to outage, that we never tend to the small problems while they're small.

If you brush your teeth regularly and get your checkups, then small problems never get the chance to grow up into big problems.

Now this is hardly a novel insight. Many of you would protest: you already do this -- you have for years!

(oopsie.. fell asleep 😑)

And yes: absolutely. The intent was there. But the *tooling* wasn't.

What you needed was what I have called "observability". And if you think about it for a minute you will see why it matters, and why I am such a stickler for the definition.

To get observability the way I've defined it, you need:

...arbitrarily wide events, emitting all the context per request, such that you get to group by any dimensions, or see what attributes any set of errors have in common, etc. All in near real time.
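For a rough idea of what "one wide event per request" can look like, here is a minimal sketch in Python. It is not Honeycomb's SDK or anyone's real API: `handle_request`, the `sink` callback, the field names, and the "missing payload" failure are all made up for illustration. The only point is that context piles up in a single dict for the life of the request and gets emitted once at the end.

```python
import json
import time


def handle_request(request, sink):
    """Build one wide event for a request and emit it when the work is done.

    `request` is a plain dict and `sink` is any callable that accepts a
    string. Both are hypothetical stand-ins for a real framework and a
    real event pipeline.
    """
    # Start with every piece of context we already know about the request.
    event = {
        "timestamp": time.time(),
        "build_id": request.get("build_id"),
        "app_id": request.get("app_id"),
        "user_id": request.get("user_id"),
        "endpoint": request.get("endpoint"),
    }
    start = time.perf_counter()
    try:
        # Stand-in for the real work; the failure condition is invented.
        if request.get("payload") is None:
            raise ValueError("missing payload")
        event["status"] = 200
    except ValueError as exc:
        # Errors go onto the same event, next to all the other context.
        event["status"] = 400
        event["error"] = str(exc)
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000
        sink(json.dumps(event))  # exactly one event per request
    return event


emitted = []
ok_event = handle_request(
    {"build_id": "b42", "app_id": "a1", "user_id": "u1",
     "endpoint": "/query", "payload": {"q": "count"}},
    emitted.append,
)
bad_event = handle_request(
    {"build_id": "b42", "app_id": "a1", "user_id": "u2", "endpoint": "/query"},
    emitted.append,
)
```

Nothing stops you from adding fifty more fields; the width is the point, because every field is something you can later group or filter by.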

In the most practical terms, this gives you the ability to

* break down by build ID, app ID, user ID, etc., and any combination thereof
* trace the request
* know the full context of every request or error at every hop
* find out exactly what any group of errors had in common
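To make a couple of those bullets concrete, here is a toy sketch in Python over a handful of in-memory events. The events, field names, and the helpers `break_down` and `common_attributes` are all hypothetical, and a real system would run this kind of query in a purpose-built store rather than over a list, but the shape of the question is the same. It assumes event values are hashable (strings, numbers).

```python
from collections import Counter


def break_down(events, *dimensions):
    """Count events for each combination of the given dimensions."""
    return Counter(tuple(e.get(d) for d in dimensions) for e in events)


def common_attributes(events):
    """Return the key/value pairs shared by every event in the set."""
    if not events:
        return {}
    shared = set(events[0].items())
    for e in events[1:]:
        shared &= set(e.items())
    return dict(shared)


# Invented sample data: four requests across two builds and two apps.
events = [
    {"build_id": "b41", "app_id": "a1", "user_id": "u1", "status": 200},
    {"build_id": "b42", "app_id": "a1", "user_id": "u2", "status": 500},
    {"build_id": "b42", "app_id": "a2", "user_id": "u3", "status": 500},
    {"build_id": "b42", "app_id": "a2", "user_id": "u4", "status": 200},
]

by_build_and_status = break_down(events, "build_id", "status")
errors = [e for e in events if e["status"] >= 500]
shared_by_errors = common_attributes(errors)
```

Here `shared_by_errors` comes out as `{"build_id": "b42", "status": 500}`: every error came from build b42, while app and user vary. That is the kind of answer you want in seconds, not after a heroic grep session.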

I've said this: most of us have no idea what we are shipping most of the time. Nor can we determine where an error comes from without heroic effort.

Shipping code every day under these circumstances is like going day after day, meal after meal, without brushing your teeth.

To tie it all back together (and wind it up, because I really do keep falling asleep), a last thought.

The Stripe developer report says that developers spend 41% of their time on bullshit. OK, that's not great.

Sometimes people ask me, "well how much time does YOUR team spend on toil and other bullshit that doesn't move the product forward?"

And I confess that we still spend maybe 25-30% of our time on this crap.

"That's not SO much of a difference, is it?" they say, frowning.

But it is. For two reasons.

First, we have a tiny fucking team for how much surface area we build and maintain. A storage engine, query planner, API, many microservices, SDKs and beelines in half a dozen languages, a UI, billing, an on-prem crypto proxy...

7 engineers.

Typically when an eng team starts to have >50% of their work go to toil, they bitch and moan until the problem gets solved by hiring more engineers to rebalance the toil.

We haven't had to do that.

Second, the Stripe report lumps a lot of different kinds of toil into a single bucket. That's unfortunate.

There is a huge, huge difference between a team that is lurching breathlessly from crisis to crisis -- the technical equivalent of broken arms and smallpox --

and a team that calmly estimates, evaluates, and plans to pay down their technical debt.

My team puts 25-30% of their time into maintenance-type work that doesn't push the product forward. But the time is planned and allocated just like any other work.

And we work on things while they're small. It's every-six-month cleaning type tech debt work, not oh-shit-ten-years-where-do-i-even-start.

We spend our cleanup cycles on hangnails and high blood pressure, so that problems aren't allowed to *become* big.

Anyway. I need to think on explaining how key observability is to a better way of living and developing software.

It isn't the only thing, of course. Observability is necessary but not sufficient. But it's the main one that most of you haven't got. ☺️😴🐝
