I've begun to see the inexorable sprawl of alerts, monitoring checks and dashboards as a deep well of technical debt.
you have an outage, or some system-impacting event. you resolve it. you call a postmortem or retrospective. at the end, someone asks:
"can we make a dashboard so we can find the problem immediately next time?" and
"what alert can we set up, to notify us when this happens?"
fast-forward a year. how many dashboards does your team have to wade through? how many alerts wake you up in the dark of night? how much time do you spend tending and curating them or tweaking thresholds?
does everyone even use the same ones, or are you fragmented?
monitoring checks, alerts on symptoms, pane-of-glass dashboards -- the trusty-rusty tools of yore are powerful tools for stable systems of knowable scope.
the frame you want for modern systems is debugging, not monitoring. if you can't spot a problem at a glance, you shouldn't try.
think about BI tools. their users would think you batshit crazy if you said "here's a pile of dashboards, which one represents the user behavior you are currently trying to understand?"
it's not hard. but it's an exploration game. you follow the breadcrumbs where they take you.
no, you should not add a paging alert for every symptom that may or may not signal a problem worth escalating about.
no, you should not add a monitoring check for every system state that sometimes represents a problem worth escalating about.
no, not another dashboard. just no.
here's the truth about alerts: the overwhelming majority of problems that ever happen in a system do not AND SHOULD NOT generate an alert. esp during off hours. esp paging alerts.
in a distributed system, innumerable bugs and catastrophic states exist at any time. i.e. in your system, right now.
but you aren't going to notice many (if not most) of the bugs or problems, and the badness needs to rise to a certain level to even be worth your time.
the only paging alerts you really need are request rate, errors, latency, and some end-to-end checks that traverse the critical code paths, probably around what makes you money. (if you're larger, you need this set for each service.)
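to make that short list concrete, here's a minimal sketch of a paging decision that looks at only those four signals. the `ServiceHealth` type, the `paging_reasons` function, and every threshold are made up for illustration; they are assumptions, not values from any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceHealth:
    requests_per_sec: float
    error_rate: float        # fraction of requests failing, 0.0 to 1.0
    p99_latency_ms: float
    e2e_check_passed: bool   # did the probe of the critical code path succeed?


# hypothetical thresholds; tune these for your own service
MIN_REQUESTS_PER_SEC = 1.0
MAX_ERROR_RATE = 0.05
MAX_P99_LATENCY_MS = 2000.0


def paging_reasons(health: ServiceHealth) -> List[str]:
    """Return the reasons to page a human; an empty list means stay quiet."""
    reasons = []
    if health.requests_per_sec < MIN_REQUESTS_PER_SEC:
        reasons.append("request rate dropped")
    if health.error_rate > MAX_ERROR_RATE:
        reasons.append("error rate too high")
    if health.p99_latency_ms > MAX_P99_LATENCY_MS:
        reasons.append("latency too high")
    if not health.e2e_check_passed:
        reasons.append("end-to-end check failed")
    return reasons
```

anything that doesn't show up in one of these four signals doesn't page anyone in this sketch.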
all your other paging alerts are technical debt. they're a symptom of your inability to explore your systems and ask simple questions in an effective and timely way. they're a bandage over your archaic tooling.
(god, the number of times i remember relying on a cluster of paging alerts to go off ... to signal a problem in a COMPLETELY UNRELATED COMPONENT. glaarrrgh.)
likewise debugging with static dashboards isn't debugging. it's pattern-matching with your eyeballs.
dashboard-flipping isn't science. with science you ask questions -- you formulate a hypothesis, you test it. you follow your bread crumbs where they lead you.
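a rough sketch of what "ask a question, test a hypothesis" can look like against raw events instead of a pre-built dashboard. the events, the field names, and the `error_rate_by` helper are all invented for this example; real tooling would run the same kind of breakdown over far more data.

```python
from collections import defaultdict

# made-up request events; in practice these come from your event store
events = [
    {"endpoint": "/checkout", "build": "v41", "status": 200},
    {"endpoint": "/checkout", "build": "v42", "status": 500},
    {"endpoint": "/search", "build": "v42", "status": 200},
    {"endpoint": "/checkout", "build": "v42", "status": 500},
    {"endpoint": "/search", "build": "v41", "status": 200},
    {"endpoint": "/checkout", "build": "v41", "status": 200},
]


def error_rate_by(events, field):
    """Break the 5xx error rate down by whichever field you want to test."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event[field]
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# hypothesis 1: "it's the new build" -> break down by build
by_build = error_rate_by(events, "build")
# hypothesis 2: "it's one endpoint" -> break down by endpoint
by_endpoint = error_rate_by(events, "endpoint")
```

each breakdown is one question. the answer tells you which field to slice by next, which is the breadcrumb-following part.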
and once you have forty thousand static dashboards, you're just drowning in them.
every dashboard is an artifact of some past failure, and the data sources may or may not even be working, and your team's entire view of the world has fragmented. so just fuck dashboards.
you can't model the system in your head any more, and you shouldn't try. get that shit out of your head and into a tool, where you can interact with it. ... and more importantly so can your team.
have a few blessed entry points that are maintained and shared by the team. make exploration the expectation, debug by interacting with the problem not by flipping through dashboards.
send nearly all "alerts" to a non-paging destination with an SLA of hours, not minutes.
(honestly just set all your alerts to email only and see what happens 😈
it probably won't be worse. it might be better.)
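one way to sketch that routing rule: a tiny allowlist of signals that page, and everything else goes to email with an SLA measured in hours. the signal names, the `route_alert` function, and the specific SLA values are assumptions for illustration only.

```python
from datetime import timedelta

# hypothetical allowlist: the only signals allowed to wake someone up
PAGING_SIGNALS = {"request_rate", "error_rate", "latency", "e2e_check"}

# hypothetical response-time targets
PAGE_SLA = timedelta(minutes=5)
EMAIL_SLA = timedelta(hours=4)


def route_alert(signal: str):
    """Return (destination, response SLA) for an alert on the given signal."""
    if signal in PAGING_SIGNALS:
        return ("pager", PAGE_SLA)
    # default: non-paging, looked at during working hours
    return ("email", EMAIL_SLA)
```

the default branch is the point: a new alert has to earn its way onto the paging list instead of starting there.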
You can follow @mipsytipsy.