Monitoring vs. Observability: The Real Tradeoffs

Observability isn’t about asking "is the system up?"; it’s about asking "why is the system behaving this way?"

Let’s see this in action. Imagine we have a microservice architecture. We’ve got a user-service that talks to an order-service, which in turn talks to a payment-service.

Here’s a simplified view of the logs from user-service when a user requests their order history:

[2023-10-27 10:00:01 INFO] User 123 requested order history.
[2023-10-27 10:00:02 DEBUG] Calling order-service for user 123.
[2023-10-27 10:00:05 ERROR] order-service responded with status 503.
[2023-10-27 10:00:06 INFO] Order history request failed for user 123.

With monitoring alone, we’d see a spike in 5xx errors for user-service, and an alert might fire. We’d know something is wrong, but not where or why.
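To make the monitoring side concrete, here is a minimal sketch of a threshold check on the 5xx rate. The function names, the 5% threshold, and the sample window of status codes are all assumptions for illustration, not part of any particular monitoring tool.

```python
def error_rate(status_codes):
    """Fraction of responses in the window that are 5xx."""
    if not status_codes:
        return 0.0
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return server_errors / len(status_codes)


def should_alert(status_codes, threshold=0.05):
    """Fire when the 5xx rate exceeds the (hypothetical) threshold."""
    return error_rate(status_codes) > threshold


# Hypothetical window of recent user-service responses.
window = [200, 200, 503, 200, 503, 503, 200, 200, 503, 200]

print(error_rate(window))    # 0.4
print(should_alert(window))  # True
```

The check answers a question we decided on in advance ("is the 5xx rate too high?"). It says nothing about which downstream call caused the errors.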

Observability lets us dig deeper. In a distributed trace, we’d see the request originate in user-service and flow to order-service, and we might find that order-service’s call to payment-service is timing out or failing. The trace shows the latency and errors at each hop.
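As a rough illustration of what a trace contains, the sketch below models one trace as a list of spans and renders it as an indented tree. The Span fields, the operation names, and the timings are hypothetical; real tracing systems record similar parent/child relationships, but with their own schemas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start_ms: int
    duration_ms: int
    error: bool = False


# Hypothetical spans for the failed order-history request.
SPANS = [
    Span("a", None, "user-service", "GET /orders", 0, 4000, error=True),
    Span("b", "a", "order-service", "list_orders", 1000, 3000, error=True),
    Span("c", "b", "payment-service", "get_payment_status", 1100, 2900, error=True),
]


def render_trace(spans):
    """Return one indented line per span, children nested under parents."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            status = "ERROR" if span.error else "ok"
            lines.append(
                f"{'  ' * depth}{span.service} {span.operation} "
                f"{span.duration_ms}ms {status}"
            )
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


def deepest_failing_spans(spans):
    """Failing spans with no failing child: a first place to look."""
    parents_of_failures = {s.parent_id for s in spans if s.error}
    return [s for s in spans if s.error and s.span_id not in parents_of_failures]


for line in render_trace(SPANS):
    print(line)
print(deepest_failing_spans(SPANS)[0].service)  # payment-service
```

The "deepest failing span" heuristic is only a starting point for investigation; an upstream service can also fail on its own while its children succeed.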

We can also correlate this with metrics. We’d look at the order-service’s error rate, its CPU and memory usage, and maybe the latency of its calls to payment-service. If order-service suddenly shows high CPU and is erroring out, we’ve got a strong signal.

The core problem observability solves is debugging complex, distributed systems where the root cause of a failure isn’t always obvious. In a monolithic application, you could often tail -f a single log file and see the whole story. In microservices, a request might touch dozens of services, each with its own logs, metrics, and traces. Without a way to tie these together, finding the source of a problem becomes like finding a needle in a haystack.

Observability provides the tools to ask arbitrary questions about your system’s state without knowing in advance what you’re looking for. It’s built on three pillars:

Logs: Timestamped records of events. Think of them as the journal entries of your services.
Metrics: Numerical measurements over time. These are the system’s vital signs (e.g., request rate, error rate, CPU usage).
Traces: A representation of the journey a request takes through your distributed system. This shows the path, timing, and dependencies.

By combining these, we can build a comprehensive picture. For example, if we see a spike in latency for order-service (metric), we can then query its logs for errors occurring around that time, and if the latency spike correlates with a specific type of database query, we can examine traces to see if that query is the bottleneck.
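The metric-to-log step described above can be sketched as a time-window join. The latency samples, log entries, threshold, and window size here are made-up illustrative data; in practice this kind of query would run in whatever log and metrics backend you use.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical order-service latency samples: (timestamp, latency in ms).
LATENCY = [
    ("2023-10-27 10:00:00", 120),
    ("2023-10-27 10:00:02", 135),
    ("2023-10-27 10:00:04", 2900),
    ("2023-10-27 10:00:06", 3100),
    ("2023-10-27 10:00:08", 140),
]

# Hypothetical order-service log entries: (timestamp, level, message).
LOGS = [
    ("2023-10-27 09:59:30", "ERROR", "cache refresh failed"),
    ("2023-10-27 10:00:03", "INFO", "handling list_orders"),
    ("2023-10-27 10:00:05", "ERROR", "payment-service call timed out"),
    ("2023-10-27 10:00:20", "ERROR", "unrelated failure"),
]


def parse(ts):
    return datetime.strptime(ts, FMT)


def spike_times(samples, threshold_ms):
    """Timestamps at which latency exceeded the threshold."""
    return [parse(ts) for ts, value in samples if value > threshold_ms]


def logs_near(logs, moments, window_s=2, level="ERROR"):
    """Log entries of the given level within window_s of any spike."""
    window = timedelta(seconds=window_s)
    return [
        (ts, lvl, msg)
        for ts, lvl, msg in logs
        if lvl == level and any(abs(parse(ts) - m) <= window for m in moments)
    ]


spikes = spike_times(LATENCY, threshold_ms=1000)
for entry in logs_near(LOGS, spikes):
    print(entry)
```

With this sample data, only the timeout entry at 10:00:05 falls inside the window around the spike, which narrows the search before we open any traces.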

The mental model is that you’re not just checking if the lights are on (monitoring), you’re actively diagnosing why a light might be flickering or dim (observability). You can ask "What percentage of requests to order-service are failing when the payment-service latency is above 500ms?" and get an answer.
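Here is a small sketch of answering that kind of ad hoc question over per-request data. The record fields and values are hypothetical; the point is that the filter is an arbitrary predicate chosen at query time, not a dashboard built in advance.

```python
# Hypothetical per-request records for order-service, each carrying the
# latency of its payment-service call and whether the request failed.
REQUESTS = [
    {"trace_id": "t1", "payment_latency_ms": 120, "failed": False},
    {"trace_id": "t2", "payment_latency_ms": 650, "failed": True},
    {"trace_id": "t3", "payment_latency_ms": 900, "failed": True},
    {"trace_id": "t4", "payment_latency_ms": 510, "failed": False},
    {"trace_id": "t5", "payment_latency_ms": 80, "failed": False},
    {"trace_id": "t6", "payment_latency_ms": 700, "failed": True},
]


def failure_rate_when(requests, predicate):
    """Percentage of matching requests that failed, or None if none match."""
    matching = [r for r in requests if predicate(r)]
    if not matching:
        return None
    failed = sum(1 for r in matching if r["failed"])
    return 100.0 * failed / len(matching)


slow = failure_rate_when(REQUESTS, lambda r: r["payment_latency_ms"] > 500)
fast = failure_rate_when(REQUESTS, lambda r: r["payment_latency_ms"] <= 500)
print(slow)  # 75.0
print(fast)  # 0.0
```

This only works if each record keeps high-cardinality detail such as the downstream latency per request. Pre-aggregated counters alone cannot answer a question they were not designed for.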

A crucial, often overlooked aspect of effective tracing is propagating trace IDs correctly across all network calls, asynchronous queues, and background job processing. If a trace ID is lost when a request is handed off from an HTTP server to a message queue consumer, the trace fragments, and you can no longer follow the full request lifecycle or pinpoint cross-service bottlenecks. This means not just injecting the ID into outgoing HTTP headers, but also carrying it in the message (its headers or payload) for queues and in the context passed to background tasks.
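A minimal sketch of that propagation, using only the Python standard library, might look like the following. The header name x-trace-id, the message envelope shape, and the helper functions are assumptions for illustration; production systems typically rely on a tracing library and a standard format such as the W3C Trace Context traceparent header rather than hand-rolled code like this.

```python
import contextvars
import queue
import uuid

TRACE_HEADER = "x-trace-id"  # hypothetical header name for this sketch

# Holds the trace ID for whatever unit of work is currently running.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def start_or_continue_trace(headers):
    """Reuse the incoming trace ID if present, otherwise start a new trace."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    current_trace_id.set(trace_id)
    return trace_id


def outgoing_headers():
    """Headers to attach to any outbound HTTP call or queue message."""
    return {TRACE_HEADER: current_trace_id.get()}


def enqueue(q, body):
    """Wrap the payload in an envelope that carries the trace ID."""
    q.put({"headers": outgoing_headers(), "body": body})


def consume(q):
    """Restore the trace context before handling the message."""
    message = q.get()
    trace_id = start_or_continue_trace(message["headers"])
    return trace_id, message["body"]


# HTTP handler side: no incoming trace header, so a new trace starts.
work_queue = queue.Queue()
request_trace_id = start_or_continue_trace({})
enqueue(work_queue, {"order_id": 42})

# Consumer side: simulate a worker that starts with no trace context.
current_trace_id.set(None)
consumer_trace_id, body = consume(work_queue)
print(consumer_trace_id == request_trace_id)  # True
```

If enqueue dropped the headers, the consumer would silently start a fresh trace, which is exactly the fragmentation described above.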

The next step is often understanding how to set up effective alerting based on these observable signals.