If your tool aggregates at write time and strips context, it is not a debugging tool, because you cannot answer specific questions.
At best, it may get you close enough that a few lucky guesses or intuitive leaps can land you on the right answer (or some part of it).
This role of sitting between devs and their code, interpreting low level systems graphs and translating them into the language of services and endpoints and what your code is /actually/ doing, has long been filled by your harried ops professional. Ops fingers in the o11y dikes.
Ops builds up enough scar tissue over time that we can make those intuitive leaps to almost *anything*, because we've seen everything.
Not gonna lie, it feels great to be a wizard. 🔮🧹🌚
But then it all falls apart once the system is too complex and ephemeral to fit in your head and reason about, and unknown unknowns start to outpace the known unknowns.
I'll miss being a wizard, but I'm ready to stop being a translation layer. Better shit to do with our time.
Is it clear what I mean when I say you can't build a debugging tool with metrics? 🤔 Hmm.. maybe an example will help. Let's see. 🤔🤔
Imagine you're looking at your dashboards and you see a big spike in errors. So you start investigating.
You start flipping through your other dashboards, looking for other shapes that correlate to the error spike. (Right?)
You dismiss some of them because your intuition or knowledge of the system tells you they are likely effects of the errors, not causes of them.
But some look suspicious.
You zero in on a few in particular: it looks like the errors are only to a particular shard, and from there you can narrow it down further: the errors are only to the primary node, there was a spike in SELECT queries and queue length shortly before, and
your disk IOPS and nscanned are consistent with a pattern you've seen before where a user launches a bot and it takes a couple minutes to get auto throttled. You check the log and this user did get throttled. Satisfied, you move on.
Except, that wasn't the problem.
You jumped to the log and looked for one thing. So you didn't notice that actually LOTS of users got throttled. The actual culprit was an index build running or some other write lock being held, which caused everyone's queries to back up and clients to issue retries,
and the autothrottle went on a rampage.
With metrics and dashboards you are always looking for whatever you managed to predict. If instead you start with the error spike, then break down by user or app etc, and follow the data where it takes you, you don't have to
lean on guesses anymore.
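Here's a minimal sketch of that difference in Python. Everything in it is made up for illustration: the event fields (`user`, `shard`, `status`, `throttled`) and the `breakdown` helper are assumptions, not any real tool's schema or API. The point is just that a counter aggregated at write time keeps one number, while raw events let you break down by whatever field you want, after the fact.

```python
from collections import Counter

# Hypothetical request events. Field names are illustrative only.
events = [
    {"user": "u1", "shard": 2, "status": "error", "throttled": True},
    {"user": "u2", "shard": 2, "status": "error", "throttled": True},
    {"user": "u3", "shard": 2, "status": "error", "throttled": True},
    {"user": "u1", "shard": 2, "status": "error", "throttled": True},
    {"user": "u4", "shard": 1, "status": "ok", "throttled": False},
    {"user": "u2", "shard": 1, "status": "ok", "throttled": False},
]


def is_error(event):
    return event["status"] == "error"


def breakdown(events, field, where=lambda e: True):
    """Count the events matching `where`, grouped by the value of `field`."""
    return Counter(e[field] for e in events if where(e))


# What a write-time counter keeps: a single number, context already gone.
error_total = sum(1 for e in events if is_error(e))

# What the events let you ask later, without having predicted the question.
errors_by_user = breakdown(events, "user", is_error)
errors_by_shard = breakdown(events, "shard", is_error)
throttled_users = len({e["user"] for e in events if e["throttled"]})

print(error_total, dict(errors_by_user), dict(errors_by_shard), throttled_users)
```

If you only grep the log for the one user you suspected, you find them and stop. Breaking down by user shows the errors and throttling are spread across several users, which is the hint that the bot theory is wrong.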
Another super common version of this dashboard blindness is when you have a clump of errors and you check your logs to see if they all have $x in common, and they do! ... but they also have ten other dimensions in common.
It is literally mathematically impossible to derive this information from a metrics-based system. Using events, it can be done. (Using Honeycomb it's a breeze: just draw a bubble around the spike, we precompute the baseline vs bubble for all dimensions, and you can see
what the differences are in a second or two. Feels like magic.)
So if all the errors happen to be for requests by user ID 555, shard 2, shopping cart id 10565, item 23, region us-west-1, client type iOS, version 13, app version 9099, timezone GMT...
... you can either pattern match enough from metrics dashboards to guess that you need to jump into your logs and start grepping around to find the culprit,
or you can draw a bubble around the spike and get a list of the dimensions that differ.
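To make the "bubble vs baseline" idea concrete, here's a rough sketch in Python. This is NOT how Honeycomb implements it; the `share` and `diff_dimensions` functions and the sample events are assumptions invented for this example. It just compares, for every field, how often each value shows up inside the selection versus outside it, and ranks the values that are most over-represented in the selection.

```python
from collections import Counter


def share(events, field):
    """Fraction of events carrying each value of `field`."""
    counts = Counter(e.get(field) for e in events)
    total = sum(counts.values()) or 1
    return {value: n / total for value, n in counts.items()}


def diff_dimensions(selected, baseline):
    """Rank (field, value) pairs by how over-represented they are in `selected`."""
    fields = {key for e in selected + baseline for key in e}
    rows = []
    for field in fields:
        sel, base = share(selected, field), share(baseline, field)
        for value, frac in sel.items():
            rows.append((frac - base.get(value, 0.0), field, value))
    # Largest difference first; ties broken by field name, then value.
    return sorted(rows, key=lambda r: (-r[0], r[1], str(r[2])))


# Hypothetical events: `selected` is the error spike, `baseline` is the rest.
selected = [
    {"user": 555, "client": "iOS", "region": "us-west-1"},
    {"user": 555, "client": "iOS", "region": "us-west-1"},
    {"user": 555, "client": "iOS", "region": "us-west-1"},
]
baseline = [
    {"user": 1, "client": "iOS", "region": "us-west-1"},
    {"user": 2, "client": "android", "region": "us-west-1"},
    {"user": 3, "client": "iOS", "region": "us-east-1"},
    {"user": 4, "client": "android", "region": "us-west-1"},
]

ranked = diff_dimensions(selected, baseline)
for delta, field, value in ranked:
    print(f"{field}={value}: {delta:+.2f}")
```

All three fields are "in common" across the errors, but only the user ID is absent from the baseline, so it floats to the top. The client type and region are mostly just what normal traffic looks like.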
You always have to jump from metrics dashboards into something, to narrow it down to a real answer. Metrics only get you within guessing distance.
If you're lucky. So good luck.
You can follow @mipsytipsy.