Charity Majors
@mipsytipsy, cofounder/CTO @honeycombio, co-wrote Database Reliability Engineering, loves whiskey, rainbows, and Friday deploys. I test in production and so do you. 🌈 Nov. 29, 2018

Helloooo, my sweet little gumdrops. Let's talk about platforms and cotenancy problems.

And by this I mean systems where a single user or app (or whatever the customer unit is) can saturate a shared resource and deny service to everyone, and this happens on a regular basis.

Yes, in theory any multi-user system is vulnerable to this. But the risk and impact are amplified if:

* the clients are other computers, not humans
* the users are other businesses, not people
* those businesses are building other businesses on top of you, and reselling access

* your platform involves storing and serving data, and scales up linearly with those users
* you allow your users and their users to write custom code and/or queries to run on *your* platform,
* especially (but not only) if their code or queries run on shared resources

oopsie sorry have a meeting -- story time will return in 30 min


I AM RETURNED. Now where was I...

Platforms. Right. So to me, as a backend engineer, a platform is basically defined by this practice of *inviting a user's chaos to reside on your systems instead of theirs*. Making their unknown-unknowns YOUR ops problem, not theirs.

Now you might think that as a platform, your ops burden should scale up linearly with each customer who outsources their operations to you.

HAHAHAHA. Sweet child, if it does you have bigger problems.

No, platforms don't work that way. More likely you have two kinds of customers: the "Effectively Free" category and the "You Died For Their Sins" category.

One app, two apps, hundreds and thousands of apps... "EF" apps are ones you never even have to think about as individuals.

This, btw, is the dream of Platformlandia. That you can put forth a framework and composable bits, then leverage automation and abstraction and economies of scale so the marginal cost of each new customer bends down to 0.**

(**0 extra time and attention, above all.)

Hopefully, the majority of your customers are EF apps and you can treat them all naively. What works for one, works for all equally.

More likely you have some of these in your code:

if (uuid.equals("CA761232-ED42-11CE-BACD-00AA0057B223")){ do ..

These are the "YDFTS" customers. They are disproportionately likely to be larger, more unusual, and (one hopes) paying you buckets of money.

Which doesn't change the fact that every time you do custom work for one of them, you are mortgaging your future. It's not sustainable.

What were we talking about? ... Oh right! A core difficulty of these large multi-tenant systems -- perhaps THE core difficulty -- is the problem of cotenancy on shared resources.

When you have unpredictable traffic patterns and users prone to independent burstiness,

you're also gonna have a situation where any one of your bazillions of users can likely take the entire system down for ALL users, just by saturating any one of the many shared components.

(assuming you have shared components; but not sharing any is hard and $$expensive$$)

This is why I care so much about investing in various throttles, blacklists and filters; by user, component, service and db: to protect your platform for the 99% who are good actors from the 1% who aren't.
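For flavor, here's a minimal sketch of one such throttle: a per-tenant token bucket, where each tenant spends from their own budget, so one noisy tenant gets rejected without touching anyone else's traffic. All names and numbers here are illustrative, not any real platform's implementation.

```python
import time

class TokenBucket:
    """One bucket per tenant: refills at `rate` tokens/sec, banks up to `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # tenant id -> its own bucket

def admit(tenant_id, rate=50, burst=100):
    """Reject (rather than queue) requests from tenants over their budget."""
    return buckets.setdefault(tenant_id, TokenBucket(rate, burst)).allow()
```

Rejecting over-budget requests outright matters: queueing them just moves the saturation somewhere else.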

And it's why your platform should be written in a multithreaded language.

Otherwise, if you're stuck using the request-per-process model, any time ANY component behind that service gets slow the pool of workers will begin to fill up with requests waiting for that slow component.

It can take your entire system down in seconds flat.. for *everyone.*
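One common guardrail here (a sketch, not a complete fix; real services also need timeouts and queue limits) is a bulkhead: cap the number of in-flight calls to any one dependency, and fail fast past the cap, so a single slow component can never absorb the entire worker pool.

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one downstream dependency."""
    def __init__(self, max_in_flight):
        self.sem = threading.Semaphore(max_in_flight)

    def call(self, fn, *args):
        # fail fast instead of parking a scarce worker behind a slow dependency
        if not self.sem.acquire(blocking=False):
            raise RuntimeError("dependency saturated, failing fast")
        try:
            return fn(*args)
        finally:
            self.sem.release()
```

A request that fails fast breaks one feature for one user; a request that queues behind a slow dependency eventually breaks every feature for everyone.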

I'm not done. It gets worse. 😁

It can be insanely difficult to figure out which of your thousands of active users actually took down the site. (Raise your hand if you've ever just started bisecting users or shards or services to narrow down the culprit... ✋)

It may very well not be one of your top users by request count. It may very well not be one of your slowest users by latency. It may very well not be one of the top users by error count.

It might not even be the fault of any *users*; it could be networking or db hardware or..

The only way I know of to diagnose this category of problem rapidly and reliably is the honeycomb method. You sum up the requests that are in flight, then break down by a dimension, then another, until you find the predominant thing all the requests timing out have in common.
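That sum-then-break-down loop is simple to sketch over raw request events (the event fields below are invented for illustration):

```python
from collections import Counter

def breakdown(events, dimension):
    """Count timed-out requests grouped by one dimension; the dominant value pops out on top."""
    return Counter(e[dimension] for e in events if e["timed_out"]).most_common()

events = [
    {"user": "acme",  "shard": "s1", "timed_out": True},
    {"user": "acme",  "shard": "s2", "timed_out": True},
    {"user": "bigco", "shard": "s1", "timed_out": True},
    {"user": "bigco", "shard": "s3", "timed_out": False},
]

# break down by one dimension, then another, until something dominates:
# breakdown(events, "user"), then breakdown(events, "shard"), ...
```

The point is iterating through dimensions quickly: you don't know the culprit dimension in advance, so each breakdown has to be cheap.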

A related problem is the "ugh there's a new bot" for databases. Your db is groaning but the problem isn't the apps with the longest queries, or the most queries by count, etc.

You just need to sum up the lock time held, then break down by app id to see what's holding the lock.
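In query-log terms (field names invented for illustration), that's a single aggregation:

```python
from collections import defaultdict

def lock_time_by_app(query_log):
    """Sum total lock time held per app id, worst offender first."""
    totals = defaultdict(float)
    for q in query_log:
        totals[q["app_id"]] += q["lock_ms"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# "bot" holds the most lock time in total, even though "web" has the
# single longest query -- which is why per-query stats miss it
query_log = [
    {"app_id": "bot", "lock_ms": 50},
    {"app_id": "web", "lock_ms": 150},
    {"app_id": "bot", "lock_ms": 60},
    {"app_id": "bot", "lock_ms": 70},
]
```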

The one characteristic of platforms that matters more than any other is that the health of the "system" doesn't mean shit. It literally doesn't matter.

All that matters is each and every customer's experience. No matter how large or small they are, or how weird their use case.

And every single one of their experiences matter. You have to care, and you have to be able to **see** exactly what their performance and experience is like. This isn't like mass media or ads, where you just spray and pray: platforms are the bedrock of your users' businesses.

So traditional monitoring tools that gather or aggregate metrics at the top are anywhere from useless to actively misleading for platforms. And the hacks necessary to get around this are appalling. Pre-generating dashboard mocks for individual users, etc. shudder.

Back when we started @honeycombio, we thought we were building a toolset for platform problems. It was only much later that we realized it was useful for other things, too.

But if you run a platform, you should try it. You'll wonder how you ever computered without it.

Okay this thread is already 36 hours old, not too late to pile a few more on right? 🥰 So back to the part about one-offs, and how they are the kiss of death for any platform.

A one-off is never a one-off: never.


Any one-off points to a break in reality; a jagged edge that marks the discontinuous drop between the flexibility your users crave and the limited, cautiously bounded model you can naively support.

If nothing else... your product team should keep an eagle eye on those one-offs.

It's really really hard to get the level of abstraction right. You have to be willing to say no, no I'm sorry but we can't support this feature as a platform component right now. Sometimes.

Either that or build it and hemorrhage your engineering lifeblood trying to prop it up.
