Observability 2.0 | Bot Army SRE

Fields Per Event: 100s
High Cardinality: 10⁶ unique values
Unified Format: 1
Query Response: <10s

Observability 1.0 vs 2.0

Aspect	1.0	2.0
Data	3 pillars (siloed)	Wide structured events
Cardinality	Low (pre-aggregated)	High (millions)
Questions	Known unknowns	Unknown unknowns
Debug	Correlate across tools	Single pane of glass

Wide Structured Events

Emit one wide event per unit of work, with all relevant context attached.

- Charity Majors

Request context: user_id, tenant_id, request_id
Timing: duration, queue_time, db_time
Result: status, error_type, cache_hit
Environment: version, host, region, pod
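
As a rough illustration of the field groups above, the following Python sketch accumulates context during one request and emits a single JSON line at the end. The `WideEvent` class, the `handle_request` function, and all sample values (version, host, region, pod) are hypothetical names chosen for this example, not part of any specific library.

```python
import json
import time
import uuid


class WideEvent:
    """Collects fields for one unit of work and emits them as one JSON line."""

    def __init__(self, **context):
        self.fields = dict(context)
        self._start = time.monotonic()

    def add(self, **fields):
        # Attach more context as the request progresses.
        self.fields.update(fields)

    def emit(self):
        # Record the total duration, then serialize the whole event once.
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        line = json.dumps(self.fields, sort_keys=True)
        print(line)
        return line


def handle_request(user_id, tenant_id):
    # Request and environment context (sample values are made up).
    event = WideEvent(
        request_id=str(uuid.uuid4()),
        user_id=user_id,
        tenant_id=tenant_id,
        version="1.4.2",
        host="web-01",
        region="us-east-1",
        pod="api-7f9c",
    )
    # Timing: measure a (placeholder) database step.
    db_start = time.monotonic()
    # ... database work would happen here ...
    db_ms = (time.monotonic() - db_start) * 1000
    event.add(db_time_ms=round(db_ms, 3), queue_time_ms=0.0)
    # Result fields.
    event.add(status=200, error_type=None, cache_hit=False)
    return event.emit()


line = handle_request("u_abc123", "tenant_7")
```

The point of the design is that every field lands on the same record, so any of them can later be used to filter or group without joining separate logs, metrics, and traces.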

High Cardinality Fields

Field	Cardinality
user_id	Millions
trace_id	Billions
request_id	Billions
build_id	Thousands
endpoint	Hundreds

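Traditional metrics systems store one time series per unique combination of label values, so the series count explodes when a high-cardinality field is used as a label.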
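
The arithmetic behind that explosion can be sketched in a few lines of Python. The label cardinalities below are illustrative assumptions in the spirit of the table above, and the product is an upper bound that assumes every label combination actually occurs.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# Assumed cardinalities for a single request-latency metric.
low = {"endpoint": 200, "status_code": 5, "region": 4}
high = dict(low, user_id=1_000_000)  # add one high-cardinality label

print(series_count(low))   # 4000
print(series_count(high))  # 4000000000
```

Adding a single `user_id` label multiplies the worst-case series count by a million, which is why such fields are usually kept out of metric labels and attached to events instead.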

Charity Majors Principles

Observability is about understanding new problems
Debug from production, not staging
Instrument at the code level, not infrastructure
Exploratory investigation over dashboards

Core Practices

Instrument Everything

Every service emits structured events on every request

Query Interactively

Ad-hoc questions, slice and dice by any field
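
A minimal in-memory sketch of "slice and dice by any field", assuming events are plain dictionaries. The `summarize` helper and the sample events are hypothetical; a real event store would run the equivalent query server-side.

```python
from collections import defaultdict
from statistics import median

# Made-up sample events for illustration.
events = [
    {"endpoint": "/v2/users/:id", "region": "us-east-1", "build_id": "b101",
     "status_code": 200, "duration_ms": 40.0},
    {"endpoint": "/v2/users/:id", "region": "us-east-1", "build_id": "b102",
     "status_code": 500, "duration_ms": 900.0},
    {"endpoint": "/v2/users/:id", "region": "eu-west-1", "build_id": "b102",
     "status_code": 500, "duration_ms": 950.0},
    {"endpoint": "/v2/orders", "region": "eu-west-1", "build_id": "b101",
     "status_code": 200, "duration_ms": 60.0},
]


def summarize(events, field, where=None):
    """Group events by an arbitrary field, optionally filtered, and summarize each group."""
    groups = defaultdict(list)
    for event in events:
        if where is None or where(event):
            groups[event.get(field)].append(event)
    return {
        key: {
            "count": len(group),
            "median_ms": median(e["duration_ms"] for e in group),
        }
        for key, group in groups.items()
    }


# Which builds are producing server errors?
errors_by_build = summarize(events, "build_id", where=lambda e: e["status_code"] >= 500)
print(errors_by_build)  # {'b102': {'count': 2, 'median_ms': 925.0}}

# Same data, different question: latency by region.
print(summarize(events, "region"))
```

Because the grouping field is just a parameter, a new question needs no new instrumentation, only a different call.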

SLO Integration

Events feed SLI calculations directly
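
One way to read "events feed SLI calculations directly" is as a ratio of good events to total events. The sketch below assumes a request is "good" when it is not a server error and finishes within a latency threshold; the 300 ms threshold, the 99% target, and the sample events are all illustrative assumptions.

```python
def good_request(event, latency_threshold_ms=300.0):
    """Assumed definition of a good event: no 5xx and within the latency threshold."""
    return event["status_code"] < 500 and event["duration_ms"] <= latency_threshold_ms


def sli(events, is_good=good_request):
    """Fraction of good events, or None when there are no events."""
    if not events:
        return None
    good = sum(1 for event in events if is_good(event))
    return good / len(events)


# Made-up sample events.
events = [
    {"status_code": 200, "duration_ms": 47.3},
    {"status_code": 200, "duration_ms": 120.0},
    {"status_code": 500, "duration_ms": 30.0},   # fails: server error
    {"status_code": 200, "duration_ms": 450.0},  # fails: too slow
]

slo_target = 0.99  # assumed objective
value = sli(events)
print(value)                # 0.5
print(value >= slo_target)  # False
```

Since the SLI is computed from the same events used for debugging, a breached objective can be investigated by filtering to the events that failed `good_request`.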

Event Schema Example

Field	Example Value
service	api-gateway
endpoint	/v2/users/:id
duration_ms	47.3
status_code	200
user_id	u_abc123
cache_hit	true
db_queries	3
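
The schema in the table could be expressed as a typed record, for example with a Python dataclass. The `RequestEvent` class name and the `to_json` method are hypothetical; the fields and values are taken from the table above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RequestEvent:
    service: str
    endpoint: str
    duration_ms: float
    status_code: int
    user_id: str
    cache_hit: bool
    db_queries: int

    def to_json(self):
        # One event becomes one JSON line.
        return json.dumps(asdict(self), sort_keys=True)


event = RequestEvent(
    service="api-gateway",
    endpoint="/v2/users/:id",
    duration_ms=47.3,
    status_code=200,
    user_id="u_abc123",
    cache_hit=True,
    db_queries=3,
)
print(event.to_json())
```

Note that `endpoint` holds the route template rather than the concrete path, which keeps that field at low cardinality while `user_id` carries the high-cardinality detail.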

Tools for Observability 2.0

Tool	Strength
Honeycomb	Query-first, high cardinality
Grafana + Loki	Open-source ecosystem
OpenTelemetry	Vendor-neutral instrumentation

Key Question

Can you debug problems you've never seen before, without adding new instrumentation?

Ask New Questions

True observability answers questions you haven't thought to ask yet.