Utilization, Saturation, and Errors
| Metric | Definition | Target |
|---|---|---|
| Utilization | % time resource is busy | <80% |
| Saturation | Queued work beyond capacity | 0 |
| Errors | Error count/rate | 0 |
The USE method (Utilization, Saturation, Errors) was created by Brendan Gregg for systematic resource analysis.
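As a rough illustration, the targets in the table above can be turned into a simple check. This is a minimal sketch: the `UseSample` structure and the `use_findings` function are hypothetical names introduced here, and the 80% limit is the table's target rather than a universal threshold.

```python
from dataclasses import dataclass


@dataclass
class UseSample:
    """One USE reading for a single resource (hypothetical structure)."""
    resource: str
    utilization_pct: float  # % of time the resource was busy
    saturation: float       # queued work beyond capacity, e.g. queue length
    errors: int             # error count in the sampling window


def use_findings(sample: UseSample, util_limit_pct: float = 80.0) -> list[str]:
    """Return the USE targets this sample violates, in U, S, E order."""
    findings = []
    if sample.utilization_pct >= util_limit_pct:
        findings.append(
            f"{sample.resource}: utilization "
            f"{sample.utilization_pct:.0f}% >= {util_limit_pct:.0f}%"
        )
    if sample.saturation > 0:
        findings.append(f"{sample.resource}: saturation {sample.saturation:g} > 0")
    if sample.errors > 0:
        findings.append(f"{sample.resource}: errors {sample.errors} > 0")
    return findings
```

Calling `use_findings(UseSample("disk", 92.0, 3, 1))` returns three findings, one per violated target, while a healthy sample returns an empty list.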
| Resource | Utilization metric | Saturation metric |
|---|---|---|
| CPU | % busy | Run queue length |
| Memory | % used | OOM events, swap |
| Disk I/O | % busy | Queue depth |
| Network | % of link bandwidth | Dropped packets, retransmits |
| Storage | % capacity | Out of space |
| Threads | Pool usage | Blocked threads |
| File Handles | Open FDs | FD exhaustion |
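For the CPU row, both metrics can be derived on Linux from `/proc/stat` and `/proc/loadavg`. The sketch below assumes the Linux text formats of those two files and takes their contents as strings so it can be run anywhere. The function names are illustrative, and counting `iowait` as idle time is one common convention, not the only one.

```python
def parse_cpu_line(line: str) -> tuple[int, int]:
    """Return (busy_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Fields: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted inside user, so it is not added again.
    ticks = [int(v) for v in fields[1:9]]
    total = sum(ticks)
    idle = ticks[3] + ticks[4]  # idle + iowait
    return total - idle, total


def cpu_utilization_pct(before: str, after: str) -> float:
    """Utilization: % of CPU time spent busy between two /proc/stat samples."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    return 100.0 * (busy1 - busy0) / elapsed


def cpu_saturation(loadavg_text: str, cpu_count: int) -> int:
    """Saturation: runnable tasks beyond the number of CPUs.

    The fourth field of /proc/loadavg looks like 'runnable/total'.
    """
    runnable = int(loadavg_text.split()[3].split("/")[0])
    return max(0, runnable - cpu_count)
```

On a Linux host, the inputs would come from reading the first line of `/proc/stat` twice, a few seconds apart, and from the contents of `/proc/loadavg`.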
| Method | Focus | Best For |
|---|---|---|
| USE | Resources (utilization, saturation, errors) | Infrastructure, VMs |
| RED | Requests (rate, errors, duration) | Services, APIs |
| Golden Signals | User experience (latency, traffic, errors, saturation) | Customer-facing |
For every resource, check utilization, saturation, and errors. Start here for performance issues.
- Brendan Gregg
| Metric | Query Pattern (PromQL-style, placeholder metric names) |
|---|---|
| CPU Util | 1 - rate(cpu_idle_seconds[5m]) |
| Mem Util | used / total * 100 |
| Disk Util | rate(io_time[5m]) |
| Net Util | rate(bytes[5m]) * 8 / bw (bw in bits/s) |
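The rate-based patterns above all reduce to the same arithmetic: the increase of a counter divided by the elapsed time. A minimal sketch of that calculation for network utilization follows, assuming a byte counter and a link speed expressed in bits per second. The function names are illustrative. Unlike a real `rate()` in a monitoring system, this version does not handle counter resets and simply rejects them.

```python
def counter_rate(v0: float, t0: float, v1: float, t1: float) -> float:
    """Per-second increase of a monotonically increasing counter."""
    if t1 <= t0:
        raise ValueError("second sample must be later than the first")
    if v1 < v0:
        raise ValueError("counter went backwards (reset?); not handled here")
    return (v1 - v0) / (t1 - t0)


def network_utilization_pct(
    bytes0: float, t0: float, bytes1: float, t1: float, link_bits_per_sec: float
) -> float:
    """Share of link bandwidth used between two byte-counter samples, in percent."""
    bytes_per_sec = counter_rate(bytes0, t0, bytes1, t1)
    # Counters are in bytes, link speeds in bits: convert before dividing.
    return 100.0 * bytes_per_sec * 8 / link_bits_per_sec
```

For example, 3.75 GB transferred over a 300-second window on a 1 Gbit/s link is 12.5 MB/s, or 100 Mbit/s, which is 10% utilization.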
Tools: perf, bcc, bpftrace, async-profiler, pprof
| Issue | Symptom |
|---|---|
| Resource leak | Gradual degradation |
| Lock contention | Low throughput with idle CPU (or high CPU when threads spin) |
| Thundering herd | Bursty overload |
| N+1 queries | Database calls grow linearly with rows returned |
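The N+1 pattern in the last row is easiest to see by counting queries. The sketch below uses an in-memory SQLite database with a hypothetical `authors`/`books` schema. `CountingConnection` is a small illustrative wrapper that counts the statements it issues and is not part of any library.

```python
import sqlite3


class CountingConnection:
    """Wraps a sqlite3 connection and counts the queries issued through it."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.queries = 0

    def query(self, sql: str, params: tuple = ()) -> list:
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def make_db(n_authors: int) -> CountingConnection:
    """Create an in-memory database with one book per author."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
    )
    for i in range(1, n_authors + 1):
        conn.execute("INSERT INTO authors VALUES (?, ?)", (i, f"author-{i}"))
        conn.execute(
            "INSERT INTO books (author_id, title) VALUES (?, ?)", (i, f"book-{i}")
        )
    return CountingConnection(conn)


def titles_n_plus_one(db: CountingConnection) -> dict:
    """One query for the authors, then one more query per author."""
    result = {}
    for author_id, name in db.query("SELECT id, name FROM authors ORDER BY id"):
        rows = db.query(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_single_query(db: CountingConnection) -> dict:
    """The same data fetched with a single JOIN."""
    result = {}
    rows = db.query(
        "SELECT a.name, b.title FROM authors a "
        "LEFT JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    )
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:
            result[name].append(title)
    return result
```

With five authors, the first function issues six queries and the second issues one, and both return the same mapping. The query count of the first grows with the number of authors, which is the linear growth described in the table.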
Measure First
Never guess; always profile before optimizing.
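As one way to act on this in Python, the standard library's `cProfile` and `pstats` modules can show where time goes before any optimization is attempted. Python's profiler is an assumption here, since the tools listed above target other runtimes, and `profile_call` is an illustrative helper name.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top: int = 5, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()
```

A call such as `result, report = profile_call(my_function, 1000)` returns the function's normal result together with a report that can be printed or logged. Here `my_function` stands for whatever code is under suspicion.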