Charity Majors @mipsytipsy CTO @honeycombio, ex-Parse, Facebook, Linden Lab; cowrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤 Nov. 03, 2019

i want to start a series called "That's It?" where we walk through horrendous outages that took many people many hours of debugging time to resolve, with all the metrics and clues they used

then show how they would have found it in honeycomb in 1-2 clicks, every goshdarn time.

a dear friend visited last week. they run a tight shop with good engineers, but have ... outgrown their tools.

he was describing this thundering herd problem, where thousands of workers would spin up and hammer the single redis cpu. it took them a long time to figure out that this was even happening, let alone why.

he was describing all the truly impressive heroics they pulled off, and then he said, skeptically, "and honeycomb would help me with this ... how?"

me "oh god so easy. just sum up all the time spent by the workers, break down by backend or userid, either way it's *right there*."

me "and that's, like, the slow and manual old fashioned way! nowadays we'd say 'start with a heatmap of the latencies, then draw bubbleup around the thing you want to understand.' it computes ALL the dimensions and sifts the ones that differ to the top, no guessing necessary!"

which is super powerful, because when you're just guessing with your human brain, you might land on one or two of the causes, but almost never all of them.

e.g. when the errors are all a particular version of ios, device, language pack, region, hitting a certain endpoint, etc.
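(bubbleup itself is fancier than this, but the shape of the idea fits in a few lines of python: compare how often each value shows up in the events you selected versus the baseline, and float the biggest gaps to the top. everything below, field names, events, scoring, is a toy sketch, not honeycomb's actual algorithm.)

```python
from collections import Counter

def bubbleup(selected, baseline, dimensions):
    """rank (dimension, value) pairs by how over-represented the value is
    among the selected events compared to the baseline."""
    ranked = []
    for dim in dimensions:
        sel = Counter(e[dim] for e in selected)
        base = Counter(e[dim] for e in baseline)
        for value, count in sel.items():
            gap = count / len(selected) - base.get(value, 0) / len(baseline)
            ranked.append((gap, dim, value))
    return sorted(ranked, reverse=True)

# hypothetical events: the errors you selected vs. everything else.
errors = [
    {"ios_version": "13.1", "endpoint": "/sync", "region": "eu"},
    {"ios_version": "13.1", "endpoint": "/sync", "region": "eu"},
    {"ios_version": "13.1", "endpoint": "/sync", "region": "us"},
]
baseline = [
    {"ios_version": "12.4", "endpoint": "/feed", "region": "us"},
    {"ios_version": "13.1", "endpoint": "/feed", "region": "eu"},
    {"ios_version": "12.4", "endpoint": "/sync", "region": "us"},
    {"ios_version": "13.0", "endpoint": "/feed", "region": "eu"},
]

for gap, dim, value in bubbleup(errors, baseline, ["ios_version", "endpoint", "region"]):
    print(f"{dim}={value}: {gap:+.0%}")
# ios_version=13.1 and endpoint=/sync float straight to the top.
```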

another edition of "That's It?" is what i think of as the @github problem. you have many users, and suddenly one of them gets hacked and starts emitting a stream of bot traffic.

...not enough to get up into your top 10 or 100 users, but enough to put strain on a shared service.

ok. so you could hire a team of ML or AI experts, have them train models on massive datasets to build tools that "learn" what "normal" looks like, and then drive your ops team bananas with false positives every time a human confuses the AI...

(i said you *could*. i wouldn't.)

i would fucking sum up the resources used, then break down by user id.

"oh look, that guy's consuming 90% of the processing time and he pays us $20/month." block the fucker and go for a drink.

jesus people this isn't rocket science, just good clean fun with high cardinality.

debugging doesn't have to be that hard. we have MADE it hard by scattering all the relevant detail to the four winds.

of course it's complicated if you're trying to hop from tool to tool to tool, just to recreate what happened from log spew and metrics and traces.

it's as though you're a detective, and your file folder has been shredded and deliberately scattered around the house. and before you can read the fucking folder you have to reconstruct it. oh, and it's in invisible ink, so you can't even carefully re-attach the shreds by matching the patterns.

(metaphor note: the shreds with invisible ink are metrics, and the shreds with ink that you COULD use to re-assemble them are log lines, at least if you were disciplined enough while emitting them.)

if you just had the fucking folder of data, it probably wouldn't be that hard.

