Monitoring vs. Observability: The "Why" Explained

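Monitoring tells you that something is broken; observability lets you ask why. OpenTelemetry is a vendor-neutral standard for collecting the traces, metrics, and logs you need to answer that question once your system is already misbehaving.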

Let’s see it in action. Imagine a simple microservice, user-service, that fetches user data and then calls an order-service to get their recent orders.

Here’s a snippet of the user-service code, instrumented with OpenTelemetry:

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.propagate import inject
import requests
import time

# Configure Tracer
trace.set_tracer_provider(TracerProvider(
    resource=Resource(attributes={"service.name": "user-service"})
))
tracer = trace.get_tracer(__name__)
span_exporter = OTLPSpanExporter(endpoint="http://localhost:4317") # OTLP collector
trace_provider = trace.get_tracer_provider()
trace_provider.add_span_processor(BatchSpanProcessor(span_exporter))

# Configure Meter
# Metric readers are passed to the MeterProvider constructor; they cannot be attached afterwards.
metric_reader = PeriodicExportingMetricReader(OTLPMetricExporter(endpoint="http://localhost:4317"))
metrics.set_meter_provider(MeterProvider(
    resource=Resource(attributes={"service.name": "user-service"}),
    metric_readers=[metric_reader],
))
meter = metrics.get_meter(__name__)
request_count = meter.create_counter("http_requests_total", description="Total HTTP requests")
request_latency = meter.create_histogram("http_request_latency_seconds", description="HTTP request latency in seconds")

def get_user_data(user_id):
    with tracer.start_as_current_span("get_user_data") as span:
        span.set_attribute("user.id", user_id)
        try:
            # Simulate fetching user data
            time.sleep(0.1)
            user_data = {"id": user_id, "name": f"User {user_id}"}
            request_count.add(1, attributes={"operation": "get_user_data"})
            return user_data
        except Exception as e:
            span.record_exception(e)
            raise

def get_user_orders(user_id):
    start_time = time.perf_counter()
    with tracer.start_as_current_span("get_user_orders") as span:
        span.set_attribute("user.id", user_id)
        # Inject the current trace context (which includes this span) into the outgoing headers.
        # inject() takes the carrier as its first argument and defaults to the current context.
        headers = {}
        inject(headers)
        try:
            response = requests.get(f"http://localhost:8081/orders/{user_id}", headers=headers) # Call order-service
            response.raise_for_status()
            request_count.add(1, attributes={"operation": "get_user_orders", "status_code": response.status_code})
            return response.json()
        except requests.exceptions.RequestException as e:
            span.record_exception(e)
            request_latency.record(time.perf_counter() - start_time, attributes={"operation": "get_user_orders", "status": "error"})
            raise

def get_user_profile(user_id):
    start_time = time.perf_counter()
    with tracer.start_as_current_span("get_user_profile") as span:
        span.set_attribute("user.id", user_id)
        try:
            user_data = get_user_data(user_id)
            # get_user_orders runs inside this span, so its span becomes a child automatically
            user_orders = get_user_orders(user_id)
            span.set_attribute("user.orders_count", len(user_orders))
            request_latency.record(time.perf_counter() - start_time, attributes={"operation": "get_user_profile"})
            return {"user": user_data, "orders": user_orders}
        except Exception as e:
            span.record_exception(e)
            raise

if __name__ == "__main__":
    # This is a simplified example. In a real app, this would be an HTTP server.
    try:
        profile = get_user_profile(123)
        print(profile)
    except Exception as e:
        print(f"Error: {e}")

The user-service sends its spans and metrics to localhost:4317, where an OpenTelemetry Collector is listening. The collector then forwards these to a backend like Jaeger for traces and Prometheus for metrics.

The magic happens when you look at a trace. If order-service is unreachable, you’ll see a trace like this:

Trace ID: a1b2c3d4e5f67890
  - Span: get_user_profile (user-service) [5.1s]
    - Span: get_user_data (user-service) [100ms]
    - Span: get_user_orders (user-service) [5s] <--- Failed!
      - Error: ConnectionRefusedError: [Errno 111] Connection refused

You immediately see that user-service tried to call order-service and the call failed. The Error: ConnectionRefusedError is directly attached to the span representing that call, and the parent span’s duration covers both of its children. This is unified observability: the trace shows the what (failed call), the when (duration), and the why (error message), all linked.
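To make the linkage concrete, here is a minimal sketch of how a tracing backend could rebuild that tree from flat span records. The field names (span_id, parent_id, error) and the short span IDs are illustrative assumptions, not the OTLP schema or Jaeger's storage model.

```python
from collections import defaultdict

# Hypothetical flat span records, as a backend might receive them.
spans = [
    {"span_id": "s1", "parent_id": None, "name": "get_user_profile", "error": None},
    {"span_id": "s2", "parent_id": "s1", "name": "get_user_data", "error": None},
    {"span_id": "s3", "parent_id": "s1", "name": "get_user_orders",
     "error": "ConnectionRefusedError"},
]

def render_trace(spans):
    """Return the span tree as indented lines, children listed under their parent."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent_id"]].append(s)

    lines = []

    def walk(parent_id, depth):
        for s in children[parent_id]:
            suffix = f" [error: {s['error']}]" if s["error"] else ""
            lines.append("  " * depth + f"- {s['name']}{suffix}")
            walk(s["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines

print("\n".join(render_trace(spans)))
```

The only thing tying the records together is the parent reference, which is why the error ends up displayed next to the exact call that raised it.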

The problem this solves is the "distributed tracing dark ages." Without a shared tracing standard, correlating requests across services is a painful, manual process: you log a request ID in user-service, then search the logs in order-service for that ID, and any metrics you have sit in a separate silo. OpenTelemetry’s context propagation passes the trace_id and span_id along in HTTP headers (or gRPC metadata) with every outgoing request. In the code above, the inject call in get_user_orders does this by hand; instrumentation libraries for HTTP clients can do it automatically. When order-service receives the request, it can extract this context and start its own spans as children of the user-service span.
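As a rough illustration of what travels in those headers, the sketch below builds and parses a W3C Trace Context traceparent header by hand, which is the format OpenTelemetry's default propagator uses. The helper names make_traceparent and parse_traceparent are hypothetical, and the parser only accepts version 00; in real code, inject and extract from opentelemetry.propagate do this work.

```python
import re
import secrets

# traceparent layout: version-trace_id-parent_span_id-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def make_traceparent(trace_id, span_id, sampled=True):
    """Caller side: serialize the current span's identity into a header value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(value):
    """Callee side: recover (trace_id, parent_span_id), or None if malformed."""
    match = TRACEPARENT_RE.match(value)
    if match is None:
        return None
    trace_id, parent_span_id, _flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, parent_span_id

# user-service: outgoing request
trace_id = secrets.token_hex(16)        # 32 hex characters
caller_span_id = secrets.token_hex(8)   # 16 hex characters
headers = {"traceparent": make_traceparent(trace_id, caller_span_id)}

# order-service: incoming request
received_trace_id, parent_span_id = parse_traceparent(headers["traceparent"])
child_span_id = secrets.token_hex(8)    # new span, same trace, parent = caller's span
```

The downstream service never generates a trace ID of its own here: it reuses the one it received and records the caller's span ID as its parent, which is what lets a backend join the two services into one trace.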

Internally, OpenTelemetry works by:

Instrumentation: Libraries or manual code add hooks to your application. These hooks capture events (like function calls, HTTP requests, database queries).
Context Propagation: A unique trace_id is generated for the initial request. This ID, along with a span_id for the current operation, is injected into outgoing requests. Downstream services extract this context, ensuring all related operations share the same trace_id.
Exporting: Captured data (spans, metrics, logs) is batched and sent to an OpenTelemetry Collector or directly to a backend via an exporter.

The exact levers you control are primarily in the instrumentation. For Python, you can:

Configure the TracerProvider and MeterProvider: This sets up the resource attributes (like service.name) and the exporters.
Get Tracer and Meter instances: You use these to create spans and record metrics.
Start spans: tracer.start_as_current_span("operation_name") creates a new span and makes it the current one. Using it in a with statement ensures the span is ended correctly, even if an error occurs.
Add attributes: span.set_attribute("key", value) adds metadata to a span.
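Record exceptions: span.record_exception(exception_object) attaches error details to a span. By default, start_as_current_span also records any exception that escapes its with block, so the explicit call matters most for exceptions you catch and handle yourself.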
Record metrics: counter.add(value, attributes={}) or histogram.record(value, attributes={}).

The most surprising thing is how much automatic correlation happens with minimal code. You don’t need to manually stitch together trace_ids. The context propagation mechanism handles it. The key is ensuring that every service participating in a request chain is instrumented and that the context is propagated. If one service in the middle doesn’t propagate the context, the trace breaks, and you lose end-to-end visibility.
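A small simulation shows this failure mode. It assumes three hypothetical services (gateway, middle, orders) that pass a plain headers dict along; the handle function and the bare trace_id header are illustrative stand-ins, not OpenTelemetry APIs or the real header format.

```python
import secrets

def handle(incoming_headers, propagate=True):
    """Simulate one hop: join the incoming trace if present, else start a new one.

    Returns (trace_id used by this service, headers for the next hop).
    """
    trace_id = incoming_headers.get("trace_id") or secrets.token_hex(16)
    outgoing = {"trace_id": trace_id} if propagate else {}
    return trace_id, outgoing

def run_chain(middle_propagates):
    """Send one request through gateway -> middle -> orders and collect trace IDs."""
    seen = {}
    seen["gateway"], headers = handle({})
    seen["middle"], headers = handle(headers, propagate=middle_propagates)
    seen["orders"], headers = handle(headers)
    return seen

intact = run_chain(middle_propagates=True)
broken = run_chain(middle_propagates=False)

print(len(set(intact.values())))  # 1: every service shares one trace
print(len(set(broken.values())))  # 2: orders started a fresh trace
```

In the broken run the middle service still joins the gateway's trace, but because it forwards nothing, the orders service has no way to know it belongs to the same request.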

The next thing you’ll run into is managing the sheer volume of data and ensuring your collector and backend can handle it efficiently.