Let's talk... about raw events, logs, aggregates and data structures. The meat and potatoes of computering.
What *is* an event? For the purposes of this thread, let's define an event as one hop in the lifecycle of a request. (Oodles of details here: https://www.honeycomb.io/blog/2018/06/how-are-structured-logs-different-from-events/)
So if a request hits your edge, then API service, then calls out to 4 other internal services and makes 5 db queries, that request generated 11 events (1 edge + 1 API + 4 services + 5 queries).
(If it made 20 cache lookups and 10 http calls to services we don't control and can't observe, those don't count as events...
... because this is all about *perspective*, observing from inside of the code. You can flip to the perspective of your internal services but not the external ones. And it probably isn't useful or efficient to instrument your memcache lookups. So those aren't "events")
OK. Now part of the reason people think structured data is cost-prohibitive is that they're doing it wrong. Spewing log lines from within functions, constructing log lines with just a couple nouns of data, logging the same nouns 10-100x to correlate across the life cycle.
Then they hear they should structure their logs, so they add structure to their existing shitty logs, which adds a few bytes, and then wonder why they aren't getting any benefit -- just paying more.
You need a fundamentally different approach to reap the max benefits.
(Oops meeting -- to be continued!)
<10 hours later>
OK LET'S DO THIS
So let's talk about the correct level of granularity/abstraction for collecting service-level information. This is not a monitoring use case, but it sure af ain't gdb either. This is *systems-level introspection* or plain ol' systems debugging.
In distributed systems, the hardest part is often not finding the bug in your code, but tracking down which component is actually the source of the problem so you know what code to look at.
Or finding the requests that exhibit the bug, and deducing what they all have in common.
Observability isn't about stepping through all the functions in your code. You can do that on your laptop, once you have a few sample requests that manifest the problem. Observability is about swiftly isolating the source of the problem.
The most effective way to structure your instrumentation, so you get the maximum bang for your buck, is to emit a single arbitrarily wide event per request per service hop.
We're talking wiiiide. We usually see 200-500 dimensions in a mature app. But just one write.
Initialize the empty debugging event when the request enters the service. Stuff any and all interesting details while executing into that event. Ship your phat event off right before you exit or error the service. (Catch alll the signals.)
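Here is a minimal sketch of that init / stuff / ship lifecycle in Python, assuming a context manager is an acceptable way to wrap a request. `wide_event`, `handle_request`, the field names, and the dict-shaped `request` are all hypothetical, made up for illustration. This is not the beeline API, just the shape of the idea: one event opened at entry, one write at exit, error or not.

```python
import json
import sys
import time
import uuid
from contextlib import contextmanager


@contextmanager
def wide_event(service, sink=sys.stdout):
    """Open one event for this request and emit it exactly once on the way out."""
    event = {"service": service, "event_id": str(uuid.uuid4())}
    start = time.monotonic()
    try:
        yield event                      # the handler stuffs fields in as it goes
    except BaseException as exc:         # record the failure, then let it propagate
        event["error"] = type(exc).__name__
        event["error_detail"] = str(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        sink.write(json.dumps(event, default=str) + "\n")   # the single write


def handle_request(request, sink=sys.stdout):
    """Hypothetical handler; `request` is a plain dict standing in for a real one."""
    with wide_event("api", sink) as ev:
        ev["request_id"] = request["id"]
        ev["user_id"] = request["user_id"]
        ev["endpoint"] = request["path"]
        if request["path"] == "/boom":
            raise RuntimeError("downstream exploded")
        ev["status"] = 200
        return "ok"
```

Anything deeper in the call stack that learns something interesting adds a key to the same dict instead of writing its own log line. A hard kill (SIGKILL, power loss) still skips the `finally`, so this sketch only covers exits the process gets to see.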
(This is how all the honeycomb beeline integrations work, btw. Plus a little magic to get you started easy with some prepopulated stuff.)
Stuff you're gonna want to track is stuff like:
🎀 Metadata like src, dst, headers
🎀 The timing stats and contents of every network call out
🎀 Every db query, normalized query, execution time etc
🎀 Infra details like AZ, instance type, provider
🎀 Language/env details like $lang version
🎀 Any and all unique identifying bits you can get your paws on: UUID, request ID, shopping cart ID, any other ID <<- HIGHEST VALUE DETAILS
🎀 Any other useful application context, starting with service name
🎀 Possibly system resource state at point in time e.g. /proc/net/ipv4
All of it. In one fat structured blob. Not sprinkled around your code in functions like satanic fairy dust. You will crush your logging system that way, and you'd need to do exhaustive post-processing to recreate the shared context by joining on request-id (if you're lucky).
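For a rough sense of what one such blob might look like, here is an illustrative event as a flat Python dict, serialized to a single JSON line. Every key name and value below is an assumption invented for the example (a real mature app would carry hundreds of fields, not twenty), and the dotted-key naming is just one possible convention.

```python
import json
import platform

# One request, one service hop, one flat record.
# All field names and values are made up for illustration.
event = {
    # request metadata
    "service": "api",
    "src_ip": "203.0.113.7",
    "dst_host": "api-7",
    "http.method": "POST",
    "http.path": "/cart/checkout",
    "http.user_agent": "example-client/1.0",
    "status": 500,
    "duration_ms": 412.6,
    # network calls out
    "call.payments.duration_ms": 380.2,
    "call.payments.status": 503,
    # db queries
    "db.query_count": 5,
    "db.total_ms": 21.4,
    "db.slowest_normalized": "SELECT * FROM carts WHERE id = ?",
    # infra details
    "infra.az": "us-east-1b",
    "infra.instance_type": "m5.large",
    "infra.provider": "aws",
    # language/env details
    "lang.version": platform.python_version(),
    "build_id": "example-build-123",
    # unique identifying bits: the highest-value fields
    "request_id": "req-0001",
    "user_id": "user-42",
    "cart_id": "cart-9001",
}

# Serialized as exactly one line, so it is exactly one write.
line = json.dumps(event, sort_keys=True)
```

Keeping the record flat (scalars only, no nesting) is a design choice in this sketch: it means every field can be filtered or grouped on directly at read time.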
And don't even with unstructured logs, you deserve what you get if you're logging strings.
The difference between messy strings and one rich, dense structured event is the difference between grep and all of computer science. (Can't believe I have to explain this to software engineers.)
You're rich text searching when you should be tossing your regexps and doing read-time computations and calculations and breakdowns and filters. Are ye engineers or are ye English majors?*
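To make "read-time breakdowns and filters" concrete, here is a small sketch over an in-memory list of structured events. The `breakdown` and `p95` helpers, the field names, and the sample data are all hypothetical; a real event store would do this at scale, but the operations are the same kind of thing: filter, group, compute.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def breakdown(events, by, where=lambda e: True):
    """Group matching events by one field; report count and p95 latency per group."""
    groups = defaultdict(list)
    for e in events:
        if where(e):
            groups[e.get(by)].append(e["duration_ms"])
    return {k: {"count": len(v), "p95_ms": p95(v)} for k, v in groups.items()}


# Made-up events, trimmed to four fields each.
events = [
    {"endpoint": "/checkout", "az": "us-east-1a", "status": 200, "duration_ms": 40},
    {"endpoint": "/checkout", "az": "us-east-1b", "status": 500, "duration_ms": 900},
    {"endpoint": "/checkout", "az": "us-east-1b", "status": 500, "duration_ms": 870},
    {"endpoint": "/search", "az": "us-east-1a", "status": 200, "duration_ms": 15},
    {"endpoint": "/search", "az": "us-east-1b", "status": 200, "duration_ms": 18},
]

# "Where are the errors coming from?" asked after the fact, no new code shipped.
errors_by_az = breakdown(events, by="az", where=lambda e: e["status"] >= 500)
latency_by_endpoint = breakdown(events, by="endpoint")
```

Neither question had to be anticipated when the events were written. Try getting "p95 of failed checkouts, broken down by AZ" out of a regexp.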
(*all of the English major engineers that i know definitely know better than this)
(though if you ARE in the market for a nifty post-processor that takes your shitty strings and munges them into proper computer science, check out @cribl_io from @clintsharp. bonus: you can fork the output straight into honeycomb)
Lastly, I hope this makes it plain why your observability needs require 1) NOT pre-aggregating at the source, but rather sampling to control costs, and 2) the ability to drill down to at least a sample of the original raw events for your aggregations and calculations at read time
Because observability is about giving yourself the ability to ask new questions -- to debug weird behavior, describe outliers, to correlate a bunch of events that all manifest your bug -- without shipping new custom code for every question.
And aggregation is a one way trip.
You can always, always derive aggregates and rollups and pretty dashboards from your events. You can never derive raw events from your metrics. So you need to keep some sample of your raw events around for as long as you expect to need to debug in production.
(weighted samples of course -- keep all of the uncommon events, a fraction of the ultra-common events, and tweak the rest somewhere in between. it's EXTRAORDINARY how powerful this is for operational data, tho not to be used for transactions or replication data.)
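A sketch of what weighted sampling can look like, under stated assumptions: the policy in `sample_rate_for` (keep every error and every slow request, 1 in 10 client errors, 1 in 100 boring successes) and all the names are hypothetical, picked only to illustrate the idea. The important trick is stamping each kept event with its sample rate, so read-time math can weight it back up.

```python
import random


def sample_rate_for(event):
    """Hypothetical policy: keep the rare stuff, thin out the ultra-common stuff."""
    if event["status"] >= 500 or event["duration_ms"] > 1000:
        return 1        # uncommon and interesting: keep every one
    if event["status"] >= 400:
        return 10       # somewhere in between: keep about 1 in 10
    return 100          # fast, successful, ultra-common: keep about 1 in 100


def sample(events, rng):
    """Keep each event with probability 1/rate and record the rate on it."""
    kept = []
    for e in events:
        rate = sample_rate_for(e)
        if rng.random() < 1 / rate:
            kept.append({**e, "sample_rate": rate})
    return kept


def estimated_count(kept, where=lambda e: True):
    """Each kept event stands in for `sample_rate` original events."""
    return sum(e["sample_rate"] for e in kept if where(e))
```

With this policy, counts of errors stay exact because nothing with a rate of 1 is ever dropped, while counts of common events become estimates. That is fine for operational questions and, as noted above, not fine for transactions or replication data.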
... and now that we've covered logs, events, data structures, sampling, instrumentation, and debugging systems vs debugging code, i do believe that we are done. apologies for the prolonged delay!
.. fuck i shoulda done a blog post shouldn't i
You can follow @mipsytipsy.