A lot of comments on our low CPU usage at Stack Overflow the past few days. But I haven't seen something critically important to understand come up at all.
Let's talk about CPU usage and how it's measured. It's often not what you think.
When we want to get a metric from a system, "% CPU" isn't a metric you can read at an instant. There is no way to get this data at a single point in time, ever. A CPU is a complex construct of parallel pipelines and stages, and things are always at various points along the way.
We can't get point-in-time data. It's not possible. So what we do is look at a window of data. You take some slice of time (let's pick one and say a second) and you measure how long the window was and how much of it was spent doing work.
These are counters in every modern OS. Then we divide.
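To make "then we divide" concrete, here's a minimal sketch of that math. It assumes two readings of a cumulative busy-time counter and a cumulative total-time counter taken at the start and end of a window. The function and parameter names are made up for illustration and aren't any particular OS's API.

```python
def cpu_percent(busy_start, busy_end, total_start, total_end):
    """Average CPU usage over a window, from two cumulative counter readings.

    All four values are hypothetical counter readings in the same unit
    (ticks, milliseconds, whatever the OS exposes).
    """
    total_delta = total_end - total_start
    if total_delta <= 0:
        raise ValueError("the end reading must come after the start reading")
    busy_delta = busy_end - busy_start
    # This is the "divide": busy time in the window over the window's length.
    return 100.0 * busy_delta / total_delta


# 500 busy ticks out of a 1000 tick window.
print(cpu_percent(1000, 1500, 10000, 11000))  # 50.0
```

Note the result says nothing about when in the window those 500 busy ticks happened.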
Why does this matter? Because that time slice matters. Using us as an example, we render pages in < 20ms. Did we have a pegged CPU for half a second and an idle one for the other half, averaging 50% CPU after that divide? Or did we have a constant ~50% with headroom the whole time?
You can't tell.
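Here's a small sketch of why you can't tell. It fakes two one-second timelines at 1 ms resolution (made-up data, not real measurements): one pegged for half a second then idle, one steady at 50%. Over the full second they're identical. Only a smaller window shows the difference.

```python
def average(samples):
    return sum(samples) / len(samples)


def window_averages(samples, window):
    """Average each consecutive chunk of `window` samples."""
    return [average(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical per-millisecond CPU usage over one second.
bursty = [100] * 500 + [0] * 500  # pegged for 500 ms, then idle
steady = [50] * 1000              # constant 50% the whole time

# One-second window: both timelines look the same.
print(average(bursty), average(steady))  # 50.0 50.0

# 100 ms windows: now they look nothing alike.
print(window_averages(bursty, 100))  # five 100.0 values, then five 0.0 values
print(window_averages(steady, 100))  # ten 50.0 values
```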
The important thing to remember about counters and data collection is that they are only valid observations down to how fine-grained the counter is. What you get within that window is *an average*, with all the caveats and mysteries that an average comes with.
Now, on recording...
Counters aren't free: there's some cost to accessing them. Recording them (and sending their data somewhere) takes:
- CPU on the host to read
- network bandwidth to send
- storage to store
- more CPU capacity to process and view
So more often: more expensive. It's a balance.
At Stack Overflow we default to 15 second intervals on system metrics collection. It's the best balance for us. Monitoring is great, but if you dial it to 11 and the CPU you're wanting to monitor is now being eaten primarily by the monitoring...well, yeah...don't do that.
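To put rough numbers on that balance, here's a sketch of how the sample count for a single counter on a single host grows as the interval shrinks. The function name is made up for illustration, and the real cost also depends on how many counters and hosts you collect from, which this ignores.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(interval_seconds):
    """Samples recorded per day for one counter at a given collection interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return SECONDS_PER_DAY / interval_seconds


for interval in (15, 1, 0.1):
    print(f"{interval}s interval: {samples_per_day(interval):,.0f} samples/day")
```

Going from 15 seconds to 1 second is 15x the reads, the bandwidth, and the storage for every counter you track.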
When I try to explain intervals, I find it helps to explain bandwidth instead.
Is your network connection 1 Gbps?
Is it 100 Mb/100 ms?
Is it 1 Mb/1 ms?
The answer is yes. While we consider a second a reasonable window, the limits and throughput are much more fine-grained in practice.
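The arithmetic behind "the answer is yes" is just the same rate expressed over different windows. A quick sketch (the function name is hypothetical, and it assumes decimal units, so 1 Gb is 1,000,000,000 bits):

```python
def bits_in_window(rate_bits_per_second, window_ms):
    """How many bits fit through a link of the given rate in a window of window_ms."""
    return rate_bits_per_second * window_ms // 1000


ONE_GBPS = 1_000_000_000

print(bits_in_window(ONE_GBPS, 1000))  # 1000000000 -> 1 Gb per second
print(bits_in_window(ONE_GBPS, 100))   # 100000000  -> 100 Mb per 100 ms
print(bits_in_window(ONE_GBPS, 1))     # 1000000    -> 1 Mb per 1 ms
```

Same link, same limit. What changes is how short a burst has to be before it hits that limit without showing up in the per-second number.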
Modern CPUs are starting to push 5 BILLION operations per second per core. For us to measure it in seconds is in some ways laughably silly. But, it's also reasonable. Measuring in billionths of a second is far sillier.
Anyway, keep in mind: you're often looking at an average.
To wrap up:
Having headroom *in an average* doesn't mean you can get away with less CPU and maintain the same performance. When your units of work are small, those tiny-time 100% spikes are averaging out.
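As a rough illustration, here's a deliberately simplified model. It assumes a single core, requests handled one after another with no queueing, and work that takes exactly twice as long on half the CPU. None of that is how a real web tier behaves, and the 25 requests per second figure is invented. It only shows the direction of the effect.

```python
def utilization_and_latency(requests_per_second, work_ms, capacity=1.0):
    """Toy model: returns (average CPU % over one second, per-request latency in ms).

    capacity is a hypothetical scaling factor: 1.0 is the current CPU, 0.5 is half of it.
    """
    latency_ms = work_ms / capacity           # less CPU means each request takes longer
    busy_ms = requests_per_second * latency_ms
    utilization = min(100.0, 100.0 * busy_ms / 1000.0)
    return utilization, latency_ms


# 25 requests per second at 20 ms each: the average says 50%.
print(utilization_and_latency(25, 20))                # (50.0, 20.0)

# "We only use 50%, so halve the CPU": every page now takes twice as long.
print(utilization_and_latency(25, 20, capacity=0.5))  # (100.0, 40.0)
```

The average said there was room to spare. The per-request time says otherwise.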
It's good to trust metrics, but only once you understand what they mean.
You can follow @Nick_Craver.