Charity Majors @mipsytipsy CTO @honeycombio, ex-Parse, Facebook, Linden Lab; cowrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤 Apr. 24, 2019 5 min read

alright, this is a damn good question. and tbh i am surprised it doesn't come up more often, because it gets right to the beating heart of what makes any microservices architecture good or bad.

i will take a swing at answering this (and please chime in, everyone!) with the giant caveat that if you were to ask me this in real life, i would immediately turn around and pepper you with very specific questions about your team, your purpose, your stage of life and challenges.

in my experience, when people say "i tried putting my developers on call and it was a nightmare" these days, it is usually because their architecture exists in a partial state of microservice-lite decay.

like they adopted some microservices-ish solutions, but not really.

as a result, whenever anything goes wrong in ANY part of their infra, EVERY engineer gets paged due to cascading dependencies and poor edge hygiene.

this is *not* how you model software ownership.

the ideal of microservices is that each small team can be responsible for a small set of services, and they can ship code independently of each other, respond to outages independently of each other.

cool story bro. this is *hard* to roll out in real life. (sorry, </rebooted>)

independence is just a layer of abstraction. it isn't real. of *course* you depend on each other. it's not very different from the way you depend on your infrastructure providers.

the goal isn't lack of interdependency, it is resiliency and designing human-centered systems.

the hardest thing in distributed systems is not figuring out what the bug is, but where it lives. it's not figuring out why the latency is rising, but what is causing the latency to rise.

this is why tracing and event-oriented debugging tools like @honeycombio are non-optional.

all you kids who are trying to explain and debug your microservices and distributed systems using ordinary logs and monitoring tools, sigh. there is honestly only so much i can do to help you.

you *have* to upgrade your toolkit.

this game is about understanding the behavior users are experiencing. so you need to be able to start with a big fuzzy picture, slice and dice to reduce the search space, find an example, zoom in and trace that example; then zoom back out to see who else was impacted.
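that loop can be sketched in a few lines over raw request events. (this is a toy, not how any vendor actually does it; field names like `endpoint`, `duration_ms`, `region`, and `trace_id` are hypothetical.)

```python
# slice-and-dice over raw request events: fuzzy picture -> narrow -> one trace -> blast radius
from collections import Counter

events = [
    {"trace_id": "a1", "endpoint": "/payments", "duration_ms": 950, "region": "us-east"},
    {"trace_id": "b2", "endpoint": "/payments", "duration_ms": 40,  "region": "us-west"},
    {"trace_id": "c3", "endpoint": "/login",    "duration_ms": 35,  "region": "us-east"},
    {"trace_id": "d4", "endpoint": "/payments", "duration_ms": 980, "region": "us-east"},
]

# 1. big fuzzy picture: which requests are slow?
slow = [e for e in events if e["duration_ms"] > 500]

# 2. slice to reduce the search space: what do the slow ones have in common?
by_region = Counter(e["region"] for e in slow)  # us-east dominates

# 3. zoom in: grab one concrete example to trace
example = slow[0]["trace_id"]

# 4. zoom back out: who else was impacted?
impacted = [e["trace_id"] for e in events
            if e["region"] == "us-east" and e["duration_ms"] > 500]
```

the point isn't these four lines of python; it's that your tooling has to support this back-and-forth between aggregates and single raw events.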

(obligatory side note: this is a workflow that **only** @honeycombio can do right now. other vendors are telling their users it is impossible, can't be done. 🙄 but it is so fucking key to everything that they must be racing to catch up behind the scenes. they *must*.)

in terms of operability, you need loose coupling between services so that you can deploy independently and debug independently.

in terms of data changes, you need forward-backward compatibility and non-breaking schema changes. you need a decent understanding of your data store.
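one way to get that forward-backward compatibility is the tolerant-reader pattern: consumers default any field they don't see, so old producers and new producers can run side by side during a rollout. (a hypothetical sketch; the field names are made up.)

```python
# tolerant reader: events from any schema version parse cleanly,
# because newer fields are additive and defaulted rather than required

def parse_order_event(raw: dict) -> dict:
    return {
        "order_id": raw["order_id"],             # present in every version
        "currency": raw.get("currency", "USD"),  # added in v2: default keeps v1 events valid
        "coupon":   raw.get("coupon"),           # added in v3: optional, may be None
    }

v1_event = {"order_id": "o-123"}
v3_event = {"order_id": "o-456", "currency": "EUR", "coupon": "SPRING"}

parsed_v1 = parse_order_event(v1_event)
parsed_v3 = parse_order_event(v3_event)
```

the same idea applies to database schemas: add nullable columns, backfill, then flip readers over; never ship a change that requires every service to deploy at once.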

the most interesting challenges are at the edges. your services need to have protocols for talking to each other. your services need to be built for resiliency. you can no longer assume your databases are in a "zone of trust" and will always be available.

everything WILL die.

every {service, db, cable, user, etc} will die, and what will you do when it does?

this question is at the core of microservices observability/reliability, and this mandatory acceptance of our shared doom is why it's such an exciting opportunity.

it's not about how many nines can i prop my system up for at all costs.

it's about how many systems can be partially degraded or completely down before our users even notice, or before it's bad enough to interrupt an engineer's sleep or weekend fun? (goal being, a LOT)

but this isn't quite what you asked. you asked how to achieve ownership over parts of a microservices-based system, without drowning everyone in others' alerts -- the ones you aren't responsible for and don't own.

the answer starts with health checks, instrumentation, and SLOs,

as well as architectural choices that empower individual services to register and deregister themselves (and ascertain the health of each other), and client libraries with a robust, consistent approach to errors and retries, and sane limits on resources.
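the register/deregister piece often boils down to heartbeats with a TTL: a service that stops checking in is presumed dead and taken out of rotation. here's a toy in-memory version (real systems use consul, etcd, or your orchestrator's built-in discovery):

```python
import time

class Registry:
    """Toy service registry: a service is healthy iff it heartbeated within the TTL."""

    def __init__(self, ttl=10.0):
        self.ttl = ttl
        self._beats = {}  # service name -> last heartbeat timestamp

    def heartbeat(self, name, now=None):
        self._beats[name] = now if now is not None else time.monotonic()

    def healthy(self, name, now=None):
        now = now if now is not None else time.monotonic()
        last = self._beats.get(name)
        return last is not None and (now - last) <= self.ttl

reg = Registry(ttl=10.0)
reg.heartbeat("payments", now=100.0)
# at t=105 payments is healthy; at t=120 it has missed its TTL and drops out
```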

the business of connecting and disconnecting, serving and pausing traffic, all of this is hopefully abstracted away from your application code and you do not need to think too much about it.

e.g. data services need to be separated from stateless services, for obvious reasons.

the entire 20-year knowledge bank we have built up around "how to monitor systems" should basically be chucked with the week-old leftovers.

fuck a cpu load average, fuck allllll that shit. it is not useful to you. you don't give a shit, and you shouldn't.

what you care about are end-to-end health checks that traverse your critical code endpoints, and event-oriented instrumentation that lets you swiftly identify chokepoints or which component(s) are at fault, or what those faulty components have in common.

you also care about requests/sec, latency, and errors/sec; per high-level system and per microservice or logical service. per whatever makes you money. check /payments, check /login, check /account.
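those three signals per endpoint are cheap to derive from the same raw events you already need for debugging. a hedged sketch (hypothetical field names again):

```python
# requests, errors, and latency per endpoint, computed from raw request events
from collections import defaultdict

events = [
    {"endpoint": "/payments", "status": 200, "duration_ms": 120},
    {"endpoint": "/payments", "status": 500, "duration_ms": 900},
    {"endpoint": "/login",    "status": 200, "duration_ms": 30},
]

stats = defaultdict(lambda: {"requests": 0, "errors": 0, "durations": []})
for e in events:
    s = stats[e["endpoint"]]
    s["requests"] += 1
    s["errors"] += int(e["status"] >= 500)  # count server errors only
    s["durations"].append(e["duration_ms"])
```

alert on these per money-making endpoint, not on host-level vitals.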

you can chuck the rest of your alerts in the garbage.

as for ameliorating the situation where "the database" gets slow and suddenly everyone gets paged at once ... this is why you slap a data service in front of the storage source. then it is subject to the same governance policies as any other service.
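the data-service idea is just a facade: callers never talk to the database directly, so the owning team can enforce limits and shed load like any other service instead of letting every caller pile on. a toy sketch (the class and its knobs are invented for illustration):

```python
# toy data-service facade in front of a storage backend
class DataService:
    def __init__(self, backend, max_inflight=2):
        self.backend = backend          # anything with a .get(key) -- here, a dict
        self.max_inflight = max_inflight
        self.inflight = 0

    def get(self, key):
        if self.inflight >= self.max_inflight:
            # shed load at the facade instead of melting the database
            raise RuntimeError("overloaded")
        self.inflight += 1
        try:
            return self.backend.get(key)
        finally:
            self.inflight -= 1

svc = DataService({"user:1": "charity"})
value = svc.get("user:1")
```

now "the database is slow" pages the data-service owners, and everyone else sees an ordinary degraded dependency.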

yes ma'am: averages are utter flaming bullshit. you care about mean, 90th, 95th, 99th, 99.9th, 99.99th, MAX. (watch out, if you're using metrics you are likely getting averages OF AVERAGES 😭)

(honeycomb ofc computes these on the fly, using read-time aggregation that fans out over raw events.)
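to make the averages point concrete, here's a nearest-rank percentile (one convention among several) over raw latencies where the average looks fine and the tail is on fire:

```python
# two awful outliers barely move the average, but they ARE the p99
def percentile(values, p):
    """Nearest-rank percentile over raw values (one simple convention of several)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies = [10] * 98 + [2000, 2000]   # mostly fast, two 2-second horrors

avg = sum(latencies) / len(latencies)  # ~49.8 ms: looks healthy
p99 = percentile(latencies, 99)        # 2000 ms: the actual user pain
mx = max(latencies)
```

and averaging per-host averages loses even this much: you can no longer recover the tail at all, which is why you want raw events (or at least real histograms) underneath.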

where was i..

i'm about out of random effluvia on this topic, so let's wind down. don't page people about shit that isn't urgent. auto-remediate what you can. be tolerant of most failures "til morning".

don't page all potentially-relevant humans, let humans route/escalate to each other.

and eventually, this is why most sufficiently large and complex systems end up evolving some sort of first-contact SRE team that does triage, supports mature/stable systems, and loves this kind of high-wire debugging mission.

get rid of as much state as you possibly can, because state causes problems. auto-remediate now, leave it for a human to investigate in the morning. sleep is sanctity.

monitor your humans and how often they are getting interrupted or woken up, and treat this metric as p0.

find out what your users actually care about (hint, it's not $allthethings) and invest extra into their resiliency and into the "degraded" experience.

remember that you cannot care about everything equally. so rank them, so everyone knows what to prioritize independently.
