Charity Majors @mipsytipsy CTO @honeycombio, ex-Parse, Facebook, Linden Lab; cowrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤 Jul. 27, 2019 3 min read

(pulling this out bc it has been coming up a lot recently.

also! watch out for the @o11ycast recorded today with @rachelmyers and @eanakashima, wherein @lizthegrey and I debated to-log-or-not-to-log in 🌟exhaustive🌟 detail)

First of all: observability is not for debugging logic in your code, it's for finding _where in your system_ is the code you need to debug.

and ideally for giving you enough of the context and system state from the moment the error occurred that you can repro it locally.

Old-school logging in the way you describe, spraying out lots of lines per request, printing out your execution logic, is simply unworkable at scale.

It is at least as likely to TAKE YOUR SYSTEM DOWN as it is to explain your problem.

How? Let's see:

* filling up the local volume
* saturating your iops or nas
* ddos'ing your central log store
* saturating your nic
* causing new or exacerbating existing race conditions
* filling up any number of buffers and eventually your RAM
* ... shall I go on?
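To put rough numbers on the list above, here is a back-of-the-envelope sketch. Every figure in it (request rate, lines per request, bytes per line, volume size) is a made-up assumption for illustration, not a measurement from any real system; plug in your own.

```python
# Back-of-the-envelope: how fast does per-request log spraying add up?
# All inputs below are hypothetical; substitute your own numbers.

def log_bytes_per_second(requests_per_second, lines_per_request, bytes_per_line):
    """Raw log volume produced by one host, in bytes per second."""
    return requests_per_second * lines_per_request * bytes_per_line

def seconds_to_fill(volume_bytes, bytes_per_second):
    """How long until a volume of the given size is full at that write rate."""
    return volume_bytes / bytes_per_second

# Assumed workload: 5,000 req/s, 40 log lines per request, 200 bytes per line.
rate = log_bytes_per_second(5_000, 40, 200)

# Assumed local volume: 100 GB (decimal) reserved for logs.
fill_seconds = seconds_to_fill(100 * 10**9, rate)

# The same bytes also have to leave the box to reach a central log store.
network_megabits_per_second = rate * 8 / 10**6

print(f"{rate / 10**6:.0f} MB/s of logs")
print(f"volume full in {fill_seconds / 60:.1f} minutes")
print(f"{network_megabits_per_second:.0f} Mbit/s of NIC traffic just for logs")
```

Under those assumptions a single host writes 40 MB/s, fills the volume in under 42 minutes if nothing rotates or ships the files, and spends 320 Mbit/s of its NIC on log lines. Multiply by your host count to see what the central log store is being asked to absorb.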

Logs smoosh together two very different actions and degrees of scale: debugging code and debugging systems. For one you need a microscope, for the other you need a telescope.

Now here is where liz and I begin arguing vigorously...

because I say falling back to logs is an anti pattern;

she agrees, but says that based on her experience running large multi tenant systems, logs are sometimes the only thing that will shed light on contention for shared resources.

"nonsense," I say, "based on MY experience running large multi tenant systems, logs are useless for diagnosing bottlenecks and lock contention"

The full argument is on the podcast. ☺️ The difference in our experience turned out to hinge on -- actually you know what, I won't spoil it. 😈

By the end we were in vigorous agreement: resorting to logs is like resorting to ssh'ing in to a host and poking around by hand.

Deprecated. Smelly. Means there is something missing, or something wrong. But useful as all fuck if you're out of clues.

If you take a step back, the argument over logs (or ssh) is just the latest incarnation of pets vs cattle.

You should be using your tools to explore and understand your systems. Inspecting a single host means your tools failed you. When you smell this, you should fix them.

Inspecting a host is often a sign that you are not actually debugging; instead you are either stabbing wildly in the dark, or leaning on intuition or scars of past trauma.

Hrm. Does that make sense to you? 🤔 I can't tell how many folks this distinction resonates with.

Debugging should be an iterative process of reducing the search space and relentlessly homing in on the answer. Ask a question, inspect the answer, ask another question based on the result.

Implicit: you do not know the answer when you set out, you go where the data takes you.
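A minimal sketch of that loop, assuming you have one structured record per request to slice. The field names (`endpoint`, `build`, `status`), the sample data, and the helpers `error_rate_by` and `worst` are all hypothetical; this is not any particular tool's API, just the shape of "ask, inspect, ask again."

```python
from collections import defaultdict

def error_rate_by(events, field):
    """Group events by one field and return the 5xx rate for each value."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event[field]
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}

def worst(rates):
    """Return the value with the highest error rate."""
    return max(rates, key=rates.get)

# Hypothetical per-request records; real ones would carry far more fields.
events = [
    {"endpoint": "/export", "build": "b42", "status": 500},
    {"endpoint": "/export", "build": "b41", "status": 200},
    {"endpoint": "/export", "build": "b42", "status": 500},
    {"endpoint": "/home", "build": "b42", "status": 200},
    {"endpoint": "/home", "build": "b41", "status": 200},
    {"endpoint": "/home", "build": "b41", "status": 200},
]

# Question 1: where are the errors concentrated?
by_endpoint = error_rate_by(events, "endpoint")
suspect_endpoint = worst(by_endpoint)

# Question 2, chosen based on the answer to question 1:
# within that endpoint, does the error rate differ by build?
narrowed = [e for e in events if e["endpoint"] == suspect_endpoint]
by_build = error_rate_by(narrowed, "build")
suspect_build = worst(by_build)

print(suspect_endpoint, by_endpoint)
print(suspect_build, by_build)
```

The second question is not written down in advance. It only makes sense once the first answer has shrunk the search space, which is the part a pre-built dashboard or a remembered grep string skips.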

This is why I fucking hate static dashboards; you're just flipping through a pile of possible answers. You jump right past the part where you should have had an open mind.

We are far too used to solving operational problems with guesses and pattern matching.

And logs just reinforce all of our worst habits and impulses.

Instead of iterating, you search for that one magic string you remember, or grep for a user ID you remember being a good canary for the problem you suspect is happening.

