Charity Majors @mipsytipsy, CTO @honeycombio; co-wrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤Black Lives Matter🖤 | Jul. 10, 2020 | 3 min read

if you're looking for some meaty technical and philosophical material on sampling, look no further:  https://www.honeycomb.io/sampling/ 

but i will also give a quick tldr, because i just caught up on twitter for the first time in weeks and i am feeling... benevolent ☀️

FACT: systems generate reams of telemetry. but strangely nobody wants to pay more for observability than for their actual infra.

Every solution handles this by discarding, deleting, or aggregating the overwhelming majority of this data in some way before storing it.

If you're using old-school metrics a la prometheus or datadog, they have an arsenal of tsdb tricks up their sleeve -- pre-aggregation, fixed-size databases, loss of fidelity as the data ages, etc.
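to make "loss of fidelity as the data ages" concrete, here's a toy rollup in python -- invented rates and values, not any particular vendor's implementation:

```python
# Hypothetical illustration of age-based rollup in a metrics store:
# raw per-second points are averaged into per-minute buckets as they age,
# trading fidelity for a bounded storage footprint.
from statistics import mean

raw_points = [(t, 100 + (t % 7)) for t in range(120)]  # (second, value)

def rollup(points, bucket_seconds=60):
    """Collapse per-second samples into per-bucket averages."""
    buckets = {}
    for t, v in points:
        buckets.setdefault(t // bucket_seconds, []).append(v)
    return [(b * bucket_seconds, mean(vs)) for b, vs in sorted(buckets.items())]

# 120 raw points become 2 aggregates; the individual values are gone for good.
print(rollup(raw_points))
```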

Some of them will brag about doing no pre-aggregation -- signalfx used to do this a lot -- but that's just language trickery, because ALL of those solutions discard the most important information of all -- the connective tissue of the event -- before ever writing to disk.

yeah, maybe they sync'd to disk after every counter increment, but you still can't correlate a spike in one metric to a spike in another metric and see if those were actually the same events or not.

ever.

which makes it all but useless for asking new questions, trying to understand outliers, and other pedestrian use cases.
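a toy illustration of the problem (invented events, nobody's actual schema): once requests are collapsed into counters, the join across dimensions is gone.

```python
# Toy illustration (invented data): after events are collapsed into counters,
# you can no longer ask whether two spikes came from the same requests.
events = [
    {"endpoint": "/payments", "status": 500, "duration_ms": 2300},
    {"endpoint": "/payments", "status": 200, "duration_ms": 40},
    {"endpoint": "/", "status": 500, "duration_ms": 35},
]

# What a metrics system stores: disconnected counters.
errors_total = sum(1 for e in events if e["status"] == 500)     # 2
slow_total = sum(1 for e in events if e["duration_ms"] > 1000)  # 1
# From errors_total and slow_total alone, "were the slow requests
# the errors?" is unanswerable.

# What an event store can still answer: the join across dimensions.
slow_errors = [e for e in events
               if e["status"] == 500 and e["duration_ms"] > 1000]
print(len(slow_errors))  # 1 -- only the /payments 500 was slow
```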

so metrics products delete data horizontally, and products built on traces/spans/events delete data vertically, letting you sample by trace (so you retain all the service hops for the traces you choose to keep).

you can do this sampling head-based or tail-based: head-based means you decide which traces to keep at the start of the request; tail-based means you decide sometime shortly after the trace has completed, which means you can do a much better job of spotting and retaining only the most *interesting* traces (they are slow or whatever).
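a minimal head-based sampler might look like this -- hashing the trace id instead of flipping a coin is the usual trick for making every service in the trace reach the same keep/drop decision (a sketch, not any specific vendor's code):

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: int = 10) -> bool:
    """Head-based sampling: decide at the start of the request.

    Hashing the trace id (rather than calling random()) makes the decision
    deterministic, so every service that sees this trace makes the same
    keep/drop choice and you retain whole traces, not orphaned spans.
    """
    digest = hashlib.sha1(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % sample_rate
    return bucket == 0  # keep ~1 in sample_rate traces

print(keep_trace("7f2c9a1b"))  # same answer on every service hop
```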

obvi running tail-based sampling means you have to run something near your app to buffer the traces and make those decisions.
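roughly, that buffering component applies rules like these once a trace completes (the rules here are invented, purely for illustration):

```python
import random

def tail_sample(completed_trace, base_rate: int = 20) -> bool:
    """Tail-based sampling: decide after the trace has finished.

    Because the whole trace has been buffered, the sampler can see the
    total duration and every span's status, and keep the interesting
    traces at full fidelity. (Invented rules, for illustration only.)
    """
    spans = completed_trace["spans"]
    if any(s["status_code"] >= 500 for s in spans):
        return True                          # always keep errors
    if completed_trace["duration_ms"] > 1000:
        return True                          # always keep slow traces
    return random.random() < 1 / base_rate   # thin sample of the boring rest

trace = {"duration_ms": 2300, "spans": [{"status_code": 200}]}
print(tail_sample(trace))  # True: slow, so it's kept
```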

nobody wants to pay to keep all the traces for all the requests (past a medium-size request volume), so you can assume all observability tools perform sampling.

even if they say they don't. again, that's sophistry (they may ANALYZE every single trace before discarding the majority of them 🙄)

sampling is an incredibly powerful scientific tool, and i am pleased that engineers seem to be warming back up to it, despite years of log vendors fearmongering about KEEPING EVERY LOG LINE. bullshit. events are not all created equal.

- are your health checks as valuable to you as your users' requests?
- are your cache hits as valuable as your db CRUD operations?
- are 200 OK requests to / as valuable as 500s to /payments?

with honeycomb, every event has a sample_rate attribute. if you are sampling 1/10 of health checks, your sample rate is 10, and this lets the UI do the math so the volume looks right. (if you hover over it you can see the sample rate!)
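the arithmetic is simple: a kept event stands in for sample_rate events. here's a sketch with made-up per-event-type rates (the sample_rate attribute is honeycomb's; the rule table and counting code are invented for illustration):

```python
# Illustration of sample_rate reweighting: each kept event stands in for
# sample_rate original events, so counts come out right after sampling.
import random

RULES = {  # invented per-event-type rates
    "health_check": 100,  # keep 1 in 100
    "cache_hit": 20,      # keep 1 in 20
    "error_500": 1,       # keep every single one
}

def maybe_send(event):
    rate = RULES.get(event["type"], 10)
    if random.random() < 1 / rate:
        event["sample_rate"] = rate  # what this stored event represents
        return event
    return None

kept = [e for e in (maybe_send({"type": "health_check"})
                    for _ in range(10_000)) if e]
estimated_total = sum(e["sample_rate"] for e in kept)
print(estimated_total)  # ~10,000, reconstructed from ~100 stored events
```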

that said, we sample _nothing_ by default.

ever. you do not HAVE to sample using honeycomb. plenty of folks run sizable volume (tens of thousands of requests/second) and simply retain everything. it's not that important or expensive until you get Very Large (or if you are a real tightwad).

in conclusion: most people recoil viscerally at the idea of sampling -- that it's bad, or cheating, or that you won't have the unknown-unknowns when you need them.

i know i did. it took me a while at facebook to ease into it, to realize how much trash we were collecting.

this is why all vendors hasten to reassure you that they would NEVER sample, gasp!

it's sophistry. at a certain scale, everyone makes choices about what to sacrifice. we choose to store the interconnectedness that enables you to ask new questions, ones you couldn't predict.

we are simply choosing to be straight with our users, and explain why sampling is a) inevitable, b) a superpower, c) an affirmative good.

(i mean...contrarianism, education, and hard-core engineering; that's kind of the honeycomb brand, right?) ☺️  https://www.honeycomb.io/sampling/ 

