Observability isn’t just about seeing what your system is doing; it’s about understanding why it’s doing it.

Let’s see the three pillars of observability (metrics, logs, and traces) in action. Imagine a microservice architecture. A user request comes in and hits Service A, which calls Service B, which in turn calls Service C; the response then travels back along the same path.

sequenceDiagram
    User->>Service A: GET /data
    Service A->>Service B: GET /process
    Service B->>Service C: POST /transform
    Service C-->>Service B: 200 OK
    Service B-->>Service A: 200 OK
    Service A-->>User: 200 OK

Metrics: The Pulse

Metrics are your system’s vital signs, aggregated numerical data points collected over time. Think of them as the dashboard gauges.

Example: A Prometheus server scraping metrics from our services.

# prometheus.yml
scrape_configs:
  - job_name: 'my-services'
    static_configs:
      - targets: ['service-a:9090', 'service-b:9090', 'service-c:9090']

In Service A, we might expose metrics like http_requests_total and request_duration_seconds.

// In Service A's main.go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// promauto registers each metric with the default registry when it is
// created, so no separate prometheus.MustRegister call is needed
// (registering the same collector a second time would panic).
var (
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests.",
		},
		[]string{"method", "path", "status"},
	)
	requestDurationSeconds = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "request_duration_seconds",
			Help: "Duration of HTTP requests.",
			Buckets: prometheus.DefBuckets, // Default buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
		},
		[]string{"path"},
	)
)

func main() {
	http.HandleFunc("/data", func(w http.ResponseWriter, r *http.Request) {
		// Time the whole handler and record the result in the histogram.
		timer := prometheus.NewTimer(requestDurationSeconds.WithLabelValues("/data"))
		defer timer.ObserveDuration()

		// Simulate work
		// ...

		httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("Data retrieved"))
	})

	// Expose the registered metrics at /metrics, the path Prometheus scrapes by default.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}

With these, we can ask: "What’s the average duration of requests to /data over the last hour?"

sum(rate(request_duration_seconds_sum{path="/data"}[1h])) / sum(rate(request_duration_seconds_count{path="/data"}[1h]))
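To make the arithmetic behind that question concrete: a histogram’s `_sum` and `_count` series are ever-growing counters, so the average over a window is the increase in the sum divided by the increase in the count. The following is a minimal sketch of that idea in plain Go, not how Prometheus itself is implemented; the `histogramSample` type and the numbers are made up for illustration.

```go
package main

import "fmt"

// histogramSample is a hypothetical snapshot of the two counters a
// histogram exposes next to its buckets: the running sum of observed
// durations and the running count of observations.
type histogramSample struct {
	sumSeconds float64
	count      float64
}

// averageLatency returns the mean request duration between two snapshots.
// It returns 0 when no requests were observed in the window.
func averageLatency(start, end histogramSample) float64 {
	requests := end.count - start.count
	if requests <= 0 {
		return 0
	}
	return (end.sumSeconds - start.sumSeconds) / requests
}

func main() {
	// Made-up values: 400 requests took 100 seconds in total during the window.
	start := histogramSample{sumSeconds: 50, count: 200}
	end := histogramSample{sumSeconds: 150, count: 600}
	fmt.Printf("average latency: %.3fs\n", averageLatency(start, end))
}
```

Because both values are counters, looking at their absolute values (or averaging them over time) says little; only the change across the window is meaningful.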

Or, "What’s the 95th percentile latency for requests to /data?"

histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{path="/data"}[5m])) by (le))

Metrics tell you that something is slow or failing, but not where or why.

Logs: The Narration

Logs are discrete, timestamped events that record what happened at a specific point in time. They’re the storytellers, providing context.

Example: In Service B, we log the incoming request and the call to Service C.

2023-10-27T10:00:01.123Z INFO [service-b] Received request for /process from Service A. TraceID: abcdef12345
2023-10-27T10:00:01.234Z INFO [service-b] Calling Service C: POST /transform. TraceID: abcdef12345
2023-10-27T10:00:01.456Z INFO [service-b] Received response from Service C: 200 OK. TraceID: abcdef12345
2023-10-27T10:00:01.567Z INFO [service-b] Responding to Service A with 200 OK. TraceID: abcdef12345

Notice the TraceID. This is crucial for linking logs across services. If Service C starts taking too long, we can filter logs for that specific TraceID in Service B to see exactly when the slowdown occurred and what Service B was doing.
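As a rough illustration of that kind of filtering, the sketch below takes log lines in the format shown above, keeps only those for one TraceID, and reports the step that followed the longest pause. The helper names and the extra line with a different TraceID are hypothetical; a real setup would run this as a query in the log backend rather than in application code.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// event is one parsed log line: its timestamp and the rest of the text.
type event struct {
	at      time.Time
	message string
}

// sampleLines returns the Service B lines from the example plus one
// made-up line that belongs to a different trace.
func sampleLines() []string {
	return []string{
		"2023-10-27T10:00:01.123Z INFO [service-b] Received request for /process from Service A. TraceID: abcdef12345",
		"2023-10-27T10:00:01.200Z INFO [service-b] Received request for /process from Service A. TraceID: 99999zzzzz",
		"2023-10-27T10:00:01.234Z INFO [service-b] Calling Service C: POST /transform. TraceID: abcdef12345",
		"2023-10-27T10:00:01.456Z INFO [service-b] Received response from Service C: 200 OK. TraceID: abcdef12345",
		"2023-10-27T10:00:01.567Z INFO [service-b] Responding to Service A with 200 OK. TraceID: abcdef12345",
	}
}

// eventsForTrace keeps the lines that end with the given TraceID and parses
// their leading RFC 3339 timestamp. Lines that do not parse are skipped.
func eventsForTrace(lines []string, traceID string) []event {
	var events []event
	for _, line := range lines {
		if !strings.HasSuffix(line, "TraceID: "+traceID) {
			continue
		}
		stamp, rest, ok := strings.Cut(line, " ")
		if !ok {
			continue
		}
		at, err := time.Parse(time.RFC3339Nano, stamp)
		if err != nil {
			continue
		}
		events = append(events, event{at: at, message: rest})
	}
	return events
}

// slowestStep returns the event that came after the longest pause, together
// with the length of that pause.
func slowestStep(events []event) (event, time.Duration) {
	var slowest event
	var longest time.Duration
	for i := 1; i < len(events); i++ {
		if gap := events[i].at.Sub(events[i-1].at); gap > longest {
			slowest, longest = events[i], gap
		}
	}
	return slowest, longest
}

func main() {
	events := eventsForTrace(sampleLines(), "abcdef12345")
	step, gap := slowestStep(events)
	fmt.Printf("longest pause (%v) ended at: %s\n", gap, step.message)
}
```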

A common setup is to use a log aggregation tool like Elasticsearch/Fluentd/Kibana (EFK) or Loki. You’d configure your services to send logs to a central point.

# Example fluentd configuration snippet
<source>
  @type tail
  path /var/log/my-app/service-b.log
  pos_file /var/log/td-agent/service-b.log.pos
  tag service.b
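  <parse>
    # json expects one JSON object per line; plain-text lines like the
    # sample above would need a regexp parser instead.
    @type json
  </parse>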
</source>

<match service.b>
  @type elasticsearch
  host elasticsearch.example.com
  port 9200
  logstash_format true
  logstash_prefix service-b-logs
  include_tag_key true
  tag_key @log_name
</match>
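A JSON parser on the collector side only works if the service writes one JSON object per line. Here is a minimal sketch of doing that with Go’s standard `log/slog` package (Go 1.21 or later). The field names `service` and `trace_id` and the helper functions are illustrative assumptions, and the example writes to an in-memory buffer so the result can be inspected; a real service would write to stdout or a log file.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// renderLine returns the JSON line a structured logger would write for one
// event. The "service" and "trace_id" field names are illustrative choices.
func renderLine(service, msg, traceID string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil)).With("service", service)
	logger.Info(msg, "trace_id", traceID)
	return buf.String()
}

// decodeFields parses one JSON log line into key/value pairs, roughly what a
// JSON-aware log shipper does before indexing the event.
func decodeFields(line string) (map[string]any, error) {
	fields := map[string]any{}
	err := json.Unmarshal([]byte(line), &fields)
	return fields, err
}

func main() {
	line := renderLine("service-b", "Calling Service C: POST /transform", "abcdef12345")
	fmt.Print(line)

	fields, err := decodeFields(line)
	if err != nil {
		panic(err)
	}
	fmt.Println("trace_id:", fields["trace_id"])
}
```

With the trace ID stored as its own field, filtering by it becomes an exact-match query in the log backend instead of a substring search over message text.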

Logs are great for digging into specific incidents. You can search for errors, filter by TraceID, or look at the sequence of events leading up to a problem. However, sifting through millions of log lines to find the needle in the haystack can be painful.

Traces: The Journey

Distributed tracing captures the end-to-end journey of a request as it travels through multiple services. It’s the map and the GPS, showing you the path and timing at each step.

Example: Using OpenTelemetry with a collector and a backend like Jaeger.

Each service needs to be instrumented to generate and propagate trace context (like the TraceID and a SpanID for each operation).

// In Service A, using OpenTelemetry Go SDK
import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace" // For demonstration, usually Jaeger/OTLP
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
	"go.opentelemetry.io/otel/trace"
)

var tracer trace.Tracer

func initTracer() {
	// For demonstration, export to stdout. In production, use OTLP exporter to a collector.
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		panic(err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("service-a"),
		)),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
	tracer = otel.Tracer("my-app/service-a")
}

func main() {
	initTracer()
	http.HandleFunc("/data", handleData)
	http.ListenAndServe(":9090", nil)
}

func handleData(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	// Extract trace context from incoming request headers (propagator does this)
	ctx = otel.GetTextMapPropagator().Extract(ctx, propagation.HeaderCarrier(r.Header))

	// Start a new span for this operation
	ctx, span := tracer.Start(ctx, "handleData")
	defer span.End()

	span.SetAttributes(attribute.String("http.method", r.Method), attribute.String("http.url", r.URL.Path))

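	// Simulate calling Service B
	req, err := http.NewRequestWithContext(ctx, "GET", "http://service-b:9090/process", nil)
	if err != nil {
		span.RecordError(err)
		http.Error(w, "Error building request to service B", http.StatusInternalServerError)
		return
	}
	// Inject trace context into outgoing request headers
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))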

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		span.RecordError(err)
		http.Error(w, "Error calling service B", http.StatusInternalServerError)
		return
	}
	defer resp.Body.Close()

	// Process response from Service B...

	w.WriteHeader(http.StatusOK)
	w.Write([]byte("Data retrieved"))
}

When Service A calls Service B, it injects trace headers. Service B receives these headers, extracts the trace context, and starts its own spans, passing the context along when it calls Service C. The trace backend (like Jaeger) then stitches all these spans together.
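To show what actually travels between services, here is a simplified sketch of the W3C `traceparent` header that the TraceContext propagator reads and writes. It uses only the standard library; the type and function names are hypothetical, and validation is looser than the specification requires (for example, it does not reject all-zero IDs). In practice the OpenTelemetry SDK does this for you.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// spanContext holds the parts of a traceparent header.
type spanContext struct {
	traceID string // 32 hex characters, shared by every span in the trace
	spanID  string // 16 hex characters, unique to one span
	flags   string // 2 hex characters; "01" means sampled
}

// parseTraceparent reads a header of the form version-traceid-parentid-flags.
func parseTraceparent(header string) (spanContext, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" ||
		len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return spanContext{}, errors.New("malformed traceparent header")
	}
	for _, p := range parts[1:] {
		if _, err := hex.DecodeString(p); err != nil {
			return spanContext{}, errors.New("traceparent contains non-hex characters")
		}
	}
	return spanContext{traceID: parts[1], spanID: parts[2], flags: parts[3]}, nil
}

// newSpanID returns 8 random bytes as 16 hex characters.
func newSpanID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// childOf continues the caller's trace: same trace ID, fresh span ID.
func childOf(parent spanContext) spanContext {
	return spanContext{traceID: parent.traceID, spanID: newSpanID(), flags: parent.flags}
}

// header formats the context for the next outgoing request.
func (sc spanContext) header() string {
	return "00-" + sc.traceID + "-" + sc.spanID + "-" + sc.flags
}

func main() {
	// Example header value as Service B might receive it from Service A.
	incoming := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	parent, err := parseTraceparent(incoming)
	if err != nil {
		panic(err)
	}
	child := childOf(parent)
	fmt.Println("incoming:", incoming)
	fmt.Println("outgoing:", child.header())
}
```

The trace ID stays constant across every hop while the span ID changes, which is what lets the backend attach Service B’s span to Service A’s span as its parent.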

This allows you to visualize the entire request flow, see the duration of each service call, and pinpoint exactly which service or operation is causing latency. If Service B is slow, you see a long span for its /process operation. You can then drill into Service B’s trace to see if it’s waiting on Service C.
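The drill-down described above amounts to comparing each span’s total duration with the time spent in its children. The sketch below is a simplified take on that "self time" calculation, assuming a span’s children run one after another rather than in parallel. The `span` type and the durations are made up for illustration; a tracing backend such as Jaeger shows this visually rather than through code like this.

```go
package main

import (
	"fmt"
	"time"
)

// span is a minimal stand-in for what a tracing backend stores per operation.
type span struct {
	id       string
	parentID string // empty for the root span
	name     string
	duration time.Duration
}

// exampleTrace returns made-up spans for the A -> B -> C request.
func exampleTrace() []span {
	return []span{
		{id: "a", parentID: "", name: "service-a GET /data", duration: 450 * time.Millisecond},
		{id: "b", parentID: "a", name: "service-b GET /process", duration: 430 * time.Millisecond},
		{id: "c", parentID: "b", name: "service-c POST /transform", duration: 400 * time.Millisecond},
	}
}

// slowestSelfTime returns the span with the most time that is not explained
// by its direct children, along with that time. It assumes children of a
// span do not overlap.
func slowestSelfTime(spans []span) (span, time.Duration) {
	self := make(map[string]time.Duration, len(spans))
	for _, s := range spans {
		self[s.id] += s.duration
	}
	for _, s := range spans {
		if s.parentID != "" {
			self[s.parentID] -= s.duration
		}
	}

	var slowest span
	var longest time.Duration
	for _, s := range spans {
		if self[s.id] > longest {
			slowest, longest = s, self[s.id]
		}
	}
	return slowest, longest
}

func main() {
	s, d := slowestSelfTime(exampleTrace())
	fmt.Printf("%s spent %v on its own work\n", s.name, d)
}
```

In this made-up trace, Services A and B look slow by total duration, but almost all of their time is spent waiting on Service C.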

The most surprising thing about distributed tracing is how much instrumentation work it takes to make it useful. Many teams try to get by with minimal instrumentation and miss the crucial pieces that propagate context across network boundaries. You can’t just start a new trace for every service call; you need to continue the existing trace, link child spans to parent spans, and ensure the trace ID flows through every hop. This context propagation is the bedrock. Without it, you just have disconnected pieces of data.

The next logical step is understanding how to correlate these three pillars. If a metric shows high latency, you use traces to find the slow service, and then logs to debug the specific error within that service.
