Charity Majors @mipsytipsy CTO @honeycombio, ex-Parse, Facebook, Linden Lab; cowrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤 Aug. 13, 2019 2 min read

A good question! Monitoring checks and unit tests perform exactly the same function: they regularly and automatically check that the code or system is operating within "normal" bounds.

A laundry list of known unknowns, in other words.
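
To make that parallel concrete, here is a minimal sketch in Python. All of it is made up for illustration: the function names, the 200 ms limit, and the simulated latency value are assumptions, not anything from a real test suite or monitoring tool. The point is only that the same "is this within normal bounds?" assertion can sit behind both a test and a check.

```python
def within_normal_bounds(latency_ms: float, limit_ms: float = 200.0) -> bool:
    """The shared assertion: is this value inside the 'normal' range?

    The 200 ms default is an arbitrary, assumed limit.
    """
    return 0.0 <= latency_ms <= limit_ms


def unit_test_handler_latency() -> bool:
    # Test time: a fixed, known input, exercised before deploy.
    simulated_latency_ms = 35.0  # hypothetical value
    return within_normal_bounds(simulated_latency_ms)


def monitoring_check(observed_latency_ms: float) -> str:
    # Run time: the same assertion, applied to a live measurement.
    return "OK" if within_normal_bounds(observed_latency_ms) else "ALERT"
```

Either way, the bound was picked in advance by someone who already knew what to look for.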

So do we still need all these tests and checks, in a post-o11y world? The answer is yes...and no.

Yes, you still want to write tests and monitoring checks: to catch regressions, to catch or rule out all the dumb problems before you waste your precious curiosity on them.

But here's where tests and monitoring diverge. Tests don't (usually) wake you up when they fail, whereas the whole raison d'etre of monitoring is alerts, those every-alert-must-be-actionable fucking alerts.

So there's a cost to be borne. Is it worth it? 🤔

Here is where I would argue that in the absence of o11y tooling, teams have been horribly overloading their usage of monitoring tools and alerts.

Instead of just a few top-level service and e2e alerts that clearly reflect user pain, many shops have accumulated decades of sedimentary layers of warnings and alerts and monitoring notifications. Not just to alert a human to investigate, but to *try to debug for them.*

They don't have tools to follow the bread crumbs. So they set off fireworks and town criers shouting clues on every affected block.

In a densely interconnected system, it's nearly impossible to issue a single, clean alert that is also correct about the root cause. (First of all, there is rarely "a root cause.")

Instead what you get is a few hundred things squalling about getting slower -- none of which are the cause. However, your experienced sysadmin will roll over in bed, groan, skim a handful of the alerts at random, pronounce "redis again" and go back to sleep.

These squalling alerts -- that tell you details about the things you shouldn't have to care about, but you leave them up because it's the only heuristic you have for diagnosing complex system states -- these monitoring checks can and should die off once you have observability.

With extreme prejudice. They burn you out, make you reactive, and they make you a worse engineer.

Use o11y for what it's great at -- swiftly understanding and diagnosing complex systems, from the perspective of your users.

Use monitoring for what it's great at -- errors, latency, req/sec, and e2e checks.
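
As a rough illustration of how short that list can be, here is a minimal sketch of a top-level paging check in Python. Everything in it is hypothetical: the `ServiceStats` fields, the threshold values, and the `page_worthy` function are assumptions invented for this example, not the API of any real monitoring tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceStats:
    """One window of top-level service metrics (hypothetical shape)."""
    requests: int           # total requests seen in the window
    errors: int             # requests that failed
    p99_latency_ms: float   # 99th percentile latency in the window
    window_seconds: int     # length of the window, must be > 0
    e2e_check_passed: bool  # result of the latest end-to-end probe


# Assumed thresholds; real values depend on the service and its users.
MAX_ERROR_RATE = 0.01
MAX_P99_LATENCY_MS = 500.0
MIN_REQ_PER_SEC = 1.0


def page_worthy(stats: ServiceStats) -> List[str]:
    """Return the user-facing reasons to page a human, if any."""
    reasons = []
    error_rate = stats.errors / stats.requests if stats.requests else 0.0
    req_per_sec = stats.requests / stats.window_seconds
    if error_rate > MAX_ERROR_RATE:
        reasons.append("error rate")
    if stats.p99_latency_ms > MAX_P99_LATENCY_MS:
        reasons.append("latency")
    if req_per_sec < MIN_REQ_PER_SEC:
        reasons.append("req/sec")
    if not stats.e2e_check_passed:
        reasons.append("e2e check")
    return reasons
```

Four conditions, all of them about what users experience. Nothing in here tries to guess which downstream dependency is to blame; that part is left to whoever (and whatever tooling) does the investigating.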

