Charity Majors @mipsytipsy CTO @honeycombio, ex-Parse, Facebook, Linden Lab; cowrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤
Nov. 05, 2018

If you read the responses to these dev oncall threads, you might think that oncall is an onerous burden, that software engineers are FAR too valuable to have their time and energy squandered (but other engineers aren't), etc.

Let's rewind: what is this actually *about*?

Turning service developers into service owners isn't something we dreamed up to punish wayward engineers.

The evidence simply shows that it results in dramatically better quality of service for users. Not a little better: *massively* better.

Stop thinking about just the upfront costs, and start thinking about the amortized benefits.

Would you rather get woken up at 3 am once a year to watch the bug when it manifests, or spend months chasing it and tilting at windmills because you never can catch it live?

The virtuous feedback loop of software ownership: she who writes the code must be able to deploy the code, and debug the code in prod.

When you break this loop up into disconnected roles -- some can write, some deploy, some debug -- things go to shit, real fast.

If you write some code, you should get it into prod as fast as possible, and then you should LOOK AT IT through the lens of your instrumentation.

Did you ship what you think you shipped? Is it doing what you thought it would? Does anything else look weird?
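
For a concrete picture of "looking at it," here is a minimal sketch, assuming your instrumentation emits one event per request and tags each with the build that served it. The `build_id` and `status` field names and the sample events are hypothetical, not taken from any particular tool; the idea is only to compare the build you just shipped against the one before it.

```python
from collections import defaultdict


def error_rate_by_build(events):
    """Return {build_id: fraction of events with status >= 500}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        build = event["build_id"]  # hypothetical field: which deploy served this request
        totals[build] += 1
        if event["status"] >= 500:  # count server-side failures only
            errors[build] += 1
    return {build: errors[build] / totals[build] for build in totals}


# Hypothetical events, shaped the way your instrumentation might emit them.
events = [
    {"build_id": "build-41", "status": 200},
    {"build_id": "build-41", "status": 200},
    {"build_id": "build-42", "status": 200},
    {"build_id": "build-42", "status": 500},
]

rates = error_rate_by_build(events)
for build, rate in sorted(rates.items()):
    print(f"{build}: {rate:.0%} errors")
```

If the new build's error rate is visibly worse than the old one's, you found out minutes after shipping, not from a user report.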

If you make this muscle memory, you'll find 80% of the bugs before your users do.

If you're on call regularly and getting user reports about the code you recently shipped, you'll find most of the rest -- long before the context pages out of your brain.

There's nothing easier than debugging code you just wrote. Compared to debugging code you wrote weeks or months ago, that is.

Which is still easier than debugging code you never wrote or reviewed or knew was shipping...which is what most ops teams are trying to work with.

You wanna know why most companies have miserable on call shifts? Not because it's a fact of life. Because they have years and years of shipping software without software ownership. Literally nobody knows what the fuck is going on, and the pain just mounts.

I'm not trying to tell you that unwinding all that pain is super easy or fun.

But leaning on ops teams to absorb the crippling pager load is like developing an addiction to fentanyl to manage the pain of the gum disease you got by not brushing your teeth for the past decade.

Unwinding it is hard, interesting work. It's super fun and rewarding because you get to see your fixes improve people's lives. Your ops teams should be your strongest allies and expert consultants in this fight.

It's hard, but come on. We're engineers. We love this shit. 🌷

Finally, I want to repeat that *on call shouldn't be awful*. I was just talking to a company with 100 engineers, growing fast, that has always had devs on call.

Recently their pager rate soared from 2/month to 2/week, so they're looking to aggressively pay that down.

This is ~typical.

If you can't take getting paged a few times a year, I repeat, there is lots of software that needs to be written that isn't for 24x7 services that users rely on.

Go forth and write that.

