Observability Is Not Monitoring — Here's the Difference That Matters
"We have observability — we use Grafana dashboards."
No. You have monitoring. And there's a critical difference that most teams miss until their first major incident.
Monitoring vs Observability
Monitoring answers: "Is this thing working?"
- CPU at 80%? Alert.
- Response time > 2s? Alert.
- Error rate > 1%? Alert.
Observability answers: "WHY is this thing broken?"
- Which specific requests are slow?
- What changed between yesterday (working) and today (broken)?
- Which downstream dependency is causing the cascade?
Monitoring tells you there's a fire. Observability helps you find the room it's in.
The Three Pillars (And Why They're Not Enough)
You've heard it: Logs, Metrics, Traces. The three pillars. But having all three doesn't automatically give you observability.
Logs Without Structure = Noise
INFO: Processing request
ERROR: Something went wrong
vs
{"level": "error", "service": "payment-api", "trace_id": "abc123", "user_id": "u456", "error": "timeout calling inventory-service", "latency_ms": 5002}
Structured logs with correlation IDs. That's the difference between searchable context and a wall of text.
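One way to produce entries like the structured example above is a small JSON formatter on top of Python's standard `logging` module. This is a minimal sketch, not a prescribed setup: `JsonFormatter`, `build_logger`, and the choice of context fields are illustrative names assumed here, and the sample values are copied from the log line above.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Context fields copied onto the entry when the caller supplies them via `extra`.
    # (Assumed field set, mirroring the example log line in this post.)
    CONTEXT_FIELDS = ("service", "trace_id", "user_id", "latency_ms")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes one JSON entry per line to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]   # replace any handlers from earlier calls
    logger.propagate = False      # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("payment-api", sys.stdout)
    log.error(
        "timeout calling inventory-service",
        extra={
            "service": "payment-api",
            "trace_id": "abc123",
            "user_id": "u456",
            "latency_ms": 5002,
        },
    )
```

Because every entry carries `trace_id`, a single query on that value can pull the matching lines from every service that handled the request.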
Metrics Without Percentiles = Blind Spots
Average response time: 200ms. Looks fine. But P99 is 8 seconds. That means 1% of requests take 8 seconds or longer, and a dashboard that only plots the average still says everything's green.
Always track: P50, P95, P99. Averages lie.
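To see how an average can hide a slow tail, here is a small sketch using only the standard library. The latency sample is invented for illustration (985 fast requests and 15 very slow ones), and `latency_summary` is a hypothetical helper name.

```python
import statistics


def latency_summary(latencies_ms):
    """Return the mean plus P50, P95, and P99 for a list of latencies."""
    # quantiles(n=100) returns the 99 cut points P1..P99, so index 49 is P50,
    # index 94 is P95, and index 98 is P99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Invented sample: 98.5% of requests are fast, 1.5% hit an 8-second timeout.
sample = [80.0] * 985 + [8000.0] * 15

if __name__ == "__main__":
    for name, value in latency_summary(sample).items():
        print(f"{name}: {value:.1f} ms")
```

With this sample the mean comes out just under 200 ms while P99 sits at 8000 ms, which is the gap described above: P50 and P95 both look healthy, and only the tail percentile exposes the problem.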
Traces Without Propagation = Fragments
A trace that covers one service but stops at the boundary is useless for debugging distributed issues. Context propagation across every service boundary is essential.
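The mechanics of propagation can be sketched with the W3C Trace Context `traceparent` header, whose value has the form `version-trace_id-parent_id-flags`. The code below is a hand-rolled illustration under that assumption, with hypothetical helper names and no validation of incoming values. In a real service the OpenTelemetry SDK's propagators would normally do this for you.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: fresh 16-byte trace id and 8-byte span id."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # version 00, sampled flag set


def child_traceparent(incoming: str) -> str:
    """Continue an existing trace: keep the trace id, mint a new span id."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers: dict) -> dict:
    """Build the headers a service should attach to its downstream calls."""
    incoming = incoming_headers.get("traceparent")
    value = child_traceparent(incoming) if incoming else new_traceparent()
    return {"traceparent": value}


if __name__ == "__main__":
    # Simulate three hops: edge service -> payment-api -> inventory-service.
    edge = outgoing_headers({})
    payment = outgoing_headers(edge)
    inventory = outgoing_headers(payment)
    for hop in (edge, payment, inventory):
        print(hop["traceparent"])
```

Every hop shares the same trace id while the span id changes, which is what lets a tracing backend stitch the hops into one trace. A service that drops the header starts a new trace, and the fragment problem described above appears.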
What I Set Up for Every Platform
- OpenTelemetry SDK baked into every service template — not optional
- Structured logging with mandatory fields: trace_id, service, environment
- RED metrics for every service: Rate, Errors, Duration
- Alerting on symptoms (error rate, latency), not causes (CPU, memory)
- Runbooks linked to alerts — every alert has a "what to do" document
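As a rough sketch of how the RED metrics and symptom-based alerting items in the list above fit together, the example below computes Rate, Errors, and Duration from a window of request records and checks them against thresholds. The thresholds reuse the 1% error rate and 2s latency figures from earlier in this post. The `Request` type, function names, and runbook URLs are hypothetical placeholders, and a real setup would express these rules in its metrics backend rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_metrics(requests, window_seconds):
    """Compute Rate, Errors, and Duration (P99) for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank P99: rank = ceil(0.99 * total), done in integer arithmetic.
    rank = (99 * total + 99) // 100
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p99_ms": durations[rank - 1],
    }


def symptom_alerts(metrics, max_error_ratio=0.01, max_p99_ms=2000.0):
    """Alert on what users feel (errors, latency), each with a runbook link."""
    alerts = []
    if metrics["error_ratio"] > max_error_ratio:
        # Hypothetical runbook URL, for illustration only.
        alerts.append(("high_error_rate", "https://example.com/runbooks/high-error-rate"))
    if metrics["p99_ms"] > max_p99_ms:
        # Hypothetical runbook URL, for illustration only.
        alerts.append(("high_latency", "https://example.com/runbooks/high-latency"))
    return alerts


if __name__ == "__main__":
    window = [Request(100.0, True)] * 97 + [Request(3000.0, False)] * 3
    metrics = red_metrics(window, window_seconds=10)
    print(metrics)
    for name, runbook in symptom_alerts(metrics):
        print(f"ALERT {name}: see {runbook}")
```

Neither rule looks at CPU or memory. Those stay useful as diagnostic signals once a symptom alert has fired, but they do not page anyone on their own.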
The Culture Part
The best observability setup is useless if engineers don't use it. Make it part of the development workflow: "Before you merge, can you trace a request through your change?"
How mature is your team's observability practice? Let's discuss on LinkedIn.
