Observability isn’t just about seeing what your system is doing; it’s about understanding why it’s doing it, even when you’ve never seen that specific behavior before.

Imagine a distributed system as a bustling city. Logs are like individual police reports from specific intersections, telling you about a single incident. Metrics are like the city’s overall traffic flow data – how many cars are on the road, average speed. Traces are like following a single car’s journey through the city, from its starting point to its destination, showing every street it took and every stop it made. Observability is the ability of the city’s chief of police to understand, without prior warning, why traffic is snarled in a specific district by correlating reports from various intersections, traffic camera feeds, and even the GPS data of individual vehicles.

Let’s say you’re running a microservices application. A user reports that their order is taking an unusually long time to process.

Service A (Order Creation) receives the request. Service B (Inventory Check) is called. Service C (Payment Processing) is called. Service D (Notification Service) is called.

Metrics might show that Service B’s request latency has spiked to 500ms, while normally it’s 50ms. This tells you something is slow, but not why.

Logs from Service B might show repeated "Connection refused" errors when trying to reach a downstream database. This is getting closer, but it doesn’t tell you the full picture of the user’s request.

Distributed tracing would show the entire journey of that specific user’s order. You’d see Service A call Service B, Service B call the database, and the database call accounting for 450ms of Service B’s 500ms. You’d then see the call to Service C, which is also slow, but for a different reason – perhaps a timeout waiting for an external API. The trace connects these events, showing the causal chain for that single request.

Here’s how you’d typically set this up: you instrument your code to emit each of these signals.

For metrics, you might use the Prometheus client library. In your Go service:

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	orderProcessingDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "order_processing_duration_seconds",
			Help: "Duration of order processing in seconds.",
		},
		[]string{"service"},
	)
)

func init() {
	prometheus.MustRegister(orderProcessingDuration)
}

func processOrder(ctx context.Context) {
	start := time.Now()
	// ... actual processing ...
	duration := time.Since(start)
	orderProcessingDuration.WithLabelValues("order_service").Observe(duration.Seconds())
}

This metric, order_processing_duration_seconds, would be scraped by Prometheus and visualized in Grafana, allowing you to see trends and anomalies over time.
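To make the histogram less of a black box, here is a minimal, stdlib-only Python sketch of how cumulative "le" (less than or equal) buckets accumulate observations. The `Histogram` class is a toy written for illustration, not the client library's implementation; the bucket bounds are assumed to match the Prometheus client defaults.

```python
import bisect

# Upper bounds in seconds, assumed to match the Prometheus client defaults.
DEFAULT_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)


class Histogram:
    """Toy cumulative histogram, for illustration only."""

    def __init__(self, buckets=DEFAULT_BUCKETS):
        self.bounds = tuple(sorted(buckets))
        # One counter per bound, plus a final +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, seconds: float) -> None:
        # bisect_left finds the first bound >= seconds, i.e. its "le" bucket.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += seconds

    def cumulative(self):
        """Return (upper_bound, observations <= upper_bound) pairs."""
        running, out = 0, []
        for bound, count in zip(self.bounds + (float("inf"),), self.counts):
            running += count
            out.append((bound, running))
        return out


if __name__ == "__main__":
    h = Histogram()
    for latency in (0.04, 0.05, 0.06, 0.45, 0.5):
        h.observe(latency)
    for bound, count in h.cumulative():
        print(f'le="{bound}" {count}')
```

Because each bucket counts everything at or below its bound, a jump from 50ms to 500ms shows up as observations moving out of the 0.05 bucket and into the 0.5 bucket, which is what a latency panel in Grafana is summarizing.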

For logs, you’d want structured logging, typically JSON, with correlation IDs. In Python using structlog:

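import uuid

import structlog

# structlog's default output is a human-readable console format,
# so configure a JSON renderer explicitly.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()

def process_order(request_id: str):
    logger.info("Starting order processing", request_id=request_id, user_id=123)
    try:
        # ... process ...
        logger.info("Order processed successfully", request_id=request_id)
    except Exception:
        # exc_info=True attaches the traceback to the log event
        logger.error("Error processing order", request_id=request_id, exc_info=True)
        raise  # let the caller decide how to handle the failure

# When a request comes in:
request_id = str(uuid.uuid4())
process_order(request_id)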

This request_id is crucial. It’s passed from service to service, allowing you to filter logs for a specific request across your entire distributed system.
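A minimal sketch of what "passed from service to service" can look like, using only the standard library. The header name `X-Request-ID` is a common convention rather than a standard, and the helper names (`get_or_create_request_id`, `with_request_id`, `filter_logs`) are hypothetical, introduced here for illustration.

```python
import json
import uuid
from typing import Dict, Iterable, List

# Assumed header name; a widespread convention, not a formal standard.
REQUEST_ID_HEADER = "X-Request-ID"


def get_or_create_request_id(incoming_headers: Dict[str, str]) -> str:
    """Reuse the caller's ID if one was sent, otherwise start a new one."""
    for name, value in incoming_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == REQUEST_ID_HEADER.lower() and value:
            return value
    return str(uuid.uuid4())


def with_request_id(headers: Dict[str, str], request_id: str) -> Dict[str, str]:
    """Return a copy of the outgoing headers carrying the correlation ID."""
    out = dict(headers)
    out[REQUEST_ID_HEADER] = request_id
    return out


def filter_logs(json_lines: Iterable[str], request_id: str) -> List[dict]:
    """Keep only the JSON log events that belong to one request."""
    events = (json.loads(line) for line in json_lines)
    return [e for e in events if e.get("request_id") == request_id]


if __name__ == "__main__":
    rid = get_or_create_request_id({"x-request-id": "abc-123"})
    print(with_request_id({"Accept": "application/json"}, rid))
```

The important design choice is that only the edge service generates an ID; every downstream service reuses the one it received, so a single filter on `request_id` returns events from all of them.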

For distributed tracing, you’d typically use OpenTelemetry. In Java with Spring Boot:

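Add the OpenTelemetry Spring Boot starter dependency (io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter), which brings in the SDK, an OTLP exporter, and auto-configuration driven by Spring properties.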

In your application.properties:

otel.service.name=order-service
# Base OTLP/HTTP endpoint; the exporter appends /v1/traces itself
otel.exporter.otlp.endpoint=http://localhost:4318
otel.exporter.otlp.protocol=http/protobuf

The libraries automatically instrument common frameworks like Spring Web MVC. When Service A calls Service B via HTTP, the tracing information (trace ID, span ID) is propagated in the request headers. Each service then creates its own "span" within the overall trace.

The core problem observability solves is the inherent complexity of modern distributed systems. When you have dozens or hundreds of services, each with its own state and dependencies, a simple bug can manifest in a multitude of ways. You can’t just SSH into a server and grep logs anymore. You need to understand the interactions between services.

The most surprising thing about distributed tracing is how it fundamentally changes your debugging mindset. Instead of hypothesizing about what might be wrong and then digging for evidence, tracing often presents you with the answer directly. You see a specific span that’s red (indicating an error) or excessively long, and you drill down into that specific operation for that specific request. It shifts you from a detective looking for clues to a surgeon identifying a precise point of failure.
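The "drill down" step can be sketched in a few lines of stdlib Python: rank the spans of one trace so that errors come first, followed by the spans with the most self time (their own duration minus that of their children). The `Span` shape and the sample trace are hypothetical, loosely modeled on the order example, and the self-time calculation assumes child spans run sequentially rather than in parallel.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float
    error: bool = False


def self_time_ms(span: Span, spans: List[Span]) -> float:
    """Time spent in the span itself, excluding its direct children."""
    in_children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return max(span.duration_ms - in_children, 0.0)


def suspects(spans: List[Span]) -> List[Span]:
    """Errored spans first, then the remaining spans by self time, descending."""
    return sorted(spans, key=lambda s: (not s.error, -self_time_ms(s, spans)))


# Hypothetical trace for a single slow order.
EXAMPLE_TRACE = [
    Span("a", None, "order-service POST /orders", 620),
    Span("b", "a", "inventory-service check", 500),
    Span("db", "b", "inventory-db query", 450),
    Span("c", "a", "payment-service charge", 100, error=True),
]

if __name__ == "__main__":
    for s in suspects(EXAMPLE_TRACE):
        flag = "ERROR" if s.error else "ok"
        print(f"{s.name}: self={self_time_ms(s, EXAMPLE_TRACE):.0f}ms {flag}")
```

A tracing UI does essentially this ranking visually: the inventory service looks slow at 500ms, but its self time is only 50ms, which points at the database call beneath it.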

What most people don’t realize is that context propagation is the magic glue. When Service A makes an HTTP call to Service B, it doesn’t just send the payload; it injects headers like traceparent and tracestate (defined by the W3C Trace Context standard). The traceparent header carries the trace ID, the parent span ID, and a sampling flag, while tracestate carries optional vendor-specific data. Together they allow Service B to create its span as a child of Service A’s span, and so on down the line. Without this, each service would generate independent traces, and you’d lose the crucial causal relationship.
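To show what the header actually contains, here is a small stdlib-only sketch that parses a traceparent value and derives the child span context a receiving service would use. It handles only version 00 of the format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags, joined by hyphens) and ignores tracestate; the function names are hypothetical, and a real service would normally leave this to the OpenTelemetry SDK. Defaulting a brand-new trace to "sampled" is an assumption made for the sketch.

```python
import re
import secrets
from typing import NamedTuple, Optional, Tuple

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: SpanContext) -> str:
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


def start_span(incoming: Optional[str]) -> Tuple[SpanContext, Optional[str]]:
    """Return (context for the new span, parent span ID or None)."""
    parent = parse_traceparent(incoming) if incoming else None
    if parent is None:
        # No usable header: this service starts a new trace (assumed sampled).
        return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True), None
    # Same trace ID, fresh span ID, caller's span becomes the parent.
    child = SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
    return child, parent.span_id


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx, parent_id = start_span(header)
    print("parent span:", parent_id)
    print("header for next hop:", format_traceparent(ctx))
```

Note that the trace ID never changes along the call chain; only the span ID is replaced at each hop, which is what lets a backend stitch the spans into a single tree.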

Once you have this foundation, the next logical step is understanding how to effectively alert on anomalies within this rich data, rather than just reacting to incidents.
