honestly, i'm gonna disagree with you here. ☺️ the right tool for the job of *infrastructure* monitoring is ... metrics-based, datadog-style aggregate monitoring.

if you are tasked with "the health of the infra" then you care about aggregate errors and trending capacity.

if you are tasked with "the health of the infra" then the last thing you want is to have your errors dashboard cluttered with all the self-inflicted 504s and 500s that app developers generate. that's on them, not on you.

you need to care about things like ... we're about to run out of capacity for this instance type, let's bring up more racks

you need to care about things like ... is this switch saturated, are packets dropping and latency rising, and what is it clustered around

monitoring tools like datadog and prometheus are the right tools for the job. i'm serious.

observability is for people who are writing code and shipping code and trying to understand their code in production. thus everything is oriented around the request context. it's different.

which is not to say that observability is completely unconcerned with system resources.

if you just shipped a change that caused memory usage to balloon by 3x... you need to know that. but you don't actually care about whether your infra is about to be out of memcache capacity!

you care about the ability of each request to acquire the resources it needs to execute. and if it couldn't, you want to know what resource and why. that is _it_.
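
for contrast, here's a minimal sketch of what "oriented around the request context" can look like. this is not any particular vendor's API; `handle_request`, `PoolExhausted`, `fake_acquire` and the field names are all hypothetical. the idea is one event per request, and if the request couldn't get a resource, the event says which one and why.

```python
import json
import time


class PoolExhausted(Exception):
    """Hypothetical error raised when a resource can't be acquired."""


def handle_request(request_id, acquire):
    """Build one event per request, recording any resource it failed to get."""
    event = {"request_id": request_id, "resource_errors": []}
    start = time.monotonic()
    for resource in ("db_conn", "memcache"):  # made-up resource names
        try:
            acquire(resource)
        except PoolExhausted as exc:
            # record which resource failed and why, on the request's own event
            event["resource_errors"].append({"resource": resource, "why": str(exc)})
    event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
    event["ok"] = not event["resource_errors"]
    return event


def fake_acquire(resource):
    """Stand-in for a real pool: memcache always fails here, for illustration."""
    if resource == "memcache":
        raise PoolExhausted("no free connections within 50ms")


event = handle_request("req-123", fake_acquire)
print(json.dumps(event))
```

the event tells you this request couldn't get memcache and why. it says nothing about how much memcache capacity the fleet has left, and it doesn't need to.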

fixing memcache capacity issues, that's up to AWS or whoever runs your infra for you.

does that make sense to people? the separation of concerns, the different tools for infra teams concerned with overall infra health, vs people writing and shipping code?

