Charity Majors @mipsytipsy CTO @honeycombio, ex-Parse, Facebook, Linden Lab; cowrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤 Aug. 02, 2019 4 min read

i have a theory, which is that we struggle to get the time allocated to pay down technical debt (or improve deploys, etc) because to biz types we basically sound like the underpants gnomes.

step 1 ... pay down technical debt
step 2 ...
step 3 .. PROFIT

we spend a lot of time thinking about this because it's the same story for instrumenting properly, or learning a new tool, or investing in observability.

i.e. the engineers are usually dying to do it, even their managers may be on board, yet it can get bumped indefinitely.

which is why you should read this new blog post by @lizthegrey, in which she connects the dots from start to finish: how infrastructure work translates directly into dollars, and every hop between.

another thing i keep thinking about is how observability unlocks this whole new universe of data-driven thinking. it's a utility tool you can apply not just to production, but to your processes, pipeline, capacity, users, etc.

if you're trying to win an argument, bring graphs.

(i know this sounds like such a cliche. but watching our users turn honeycomb on problem after problem that we never could have predicted is *so cool*.)

you know your own problems better than any vendor. trust the vendors who aren't trying to take you out of the driver's seat.

so much of the current enthusiasm for AI, ML, etc seems to be predicated on this desire to remove humans from the equation, to replace them with automation and machines that know better.

allspaw said it best: 

my favorite part will always be this. "anomaly detection in software is, AND ALWAYS WILL BE, an unsolved problem". 🛎🛎🛎

furthermore, problems are solved by teams, not individuals, and history is of untold value because while history may not repeat, it definitely rhymes.

this may be self-interested advice, but that doesn't mean it's not true:

if you're trying to modernize your architecture, make on call suck less, put software engineers on call, or make your systems more resilient: your very first step should be observability.

no, that doesn't mean toss some tracing onto your logs and metrics. there are no three fucking pillars, that's nonsense talk: those are just three rando data types. (sorry.. two data types and whatever "logs" is).

@el_bhs refutes it best: 

it means you need the ability to ask any question about what's happening on the inside of your systems, whether you've seen it before or not, and understand what's happening. just by using your tools from the outside.

this means read-time querying over raw events, yada yada.
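
to make "read-time querying over raw events" a bit more concrete, here is a minimal sketch in python. everything in it is an assumption for illustration: the event fields, the EVENTS list and the query() helper are made up, and this is not how honeycomb or any other tool actually works. the only point is that you keep the raw, wide events around and decide what to filter and group by when you ask the question, not when you write the data.

```python
from collections import defaultdict
from statistics import mean

# hypothetical raw events: one wide dict per request.
# field names and values are invented for illustration.
EVENTS = [
    {"endpoint": "/export", "user_id": "u1", "build": "v41", "duration_ms": 120, "status": 200},
    {"endpoint": "/export", "user_id": "u2", "build": "v42", "duration_ms": 2300, "status": 200},
    {"endpoint": "/home", "user_id": "u2", "build": "v42", "duration_ms": 35, "status": 200},
    {"endpoint": "/export", "user_id": "u2", "build": "v42", "duration_ms": 2500, "status": 500},
    {"endpoint": "/home", "user_id": "u1", "build": "v41", "duration_ms": 30, "status": 200},
]


def query(events, where, group_by, field):
    """Filter raw events, group by any field(s), and average `field`.

    Nothing is pre-aggregated: the filter and the grouping are both
    chosen at read time, so a question nobody anticipated still works.
    """
    groups = defaultdict(list)
    for event in events:
        if where(event):
            key = tuple(event[name] for name in group_by)
            groups[key].append(event[field])
    return {key: mean(values) for key, values in groups.items()}


# a question we did not plan for when the events were written:
# "is /export slow for everyone, or only for some build + user combo?"
slow_exports = query(
    EVENTS,
    where=lambda e: e["endpoint"] == "/export",
    group_by=["build", "user_id"],
    field="duration_ms",
)
print(slow_exports)
```

compare that with a pre-aggregated metric like "average export latency": once the user_id and build dimensions have been thrown away at write time, you can't get them back to ask the question.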

the reason o11y comes first is because it's like turning the light on in the room.

before you get all fancy with your chaos engineering, or your microservices, or your orchestration and meshes, whatever -- wouldn't it be cool if you could just, like, see what you're doing?

to wrap this around to my original point: we (engineering teams) need to get better at explaining ourselves to other departments.

this means translating our needs and priorities into other languages, specifically the universal corporate language of dollars and cents.

engineering managers, in particular, need to get much better at translating their team's inputs and dependencies into financial terms.

need proof? look at how much easier it is to get a headcount than sign a $200k vendor bill, despite this being flamingly irrational.

other teams aren't malicious; they want what's best for us too. we all have a shared goal -- a whole boatload of them, presumably -- they just don't know how to evaluate statements like "we need to stop shipping features to containerize our build system".

and when your justification for doing so is basically "i have a hunch it will be better", the normal inclination will be to say "sorry, please keep shipping features."

data wins arguments. graphs win arguments. observability *gets you* the data and graphs to win the arguments.

if you haven't watched this talk by @lyddonb yet, you absolutely must. 

it's about the catastrophic consequences of our industry's failure to communicate about what it is we do and why it matters.

we need to do better. we *must* do better. but we cannot explain what we do not ourselves understand.

and that, my friends, is why you should stop putting it off.
stop thinking monitoring is the same, or good enough. and make observability a priority for your org.

oh p.s. here is a thing i have learned working w/ other teams:

stop taking numbers so seriously. other teams treat corporate-money numbers like the handwavey guesstimates that they are; engineers get like biblical literalists, all hung up on how many animals the Ark could hold.
