Jaeger Tracing: Debugging Microservices Made Easy

OpenTelemetry tracing can actually make your system slower if you don’t configure it correctly, because every instrumented operation gets wrapped in a span, and each span costs time to create and export.

Let’s see how this all fits together. Imagine a request hits your web server. That’s the start of a trace. The web server then calls a user service, which calls a database. Each of these steps is a "span" within the trace. Jaeger is the backend that stores and visualizes these traces, showing you the entire journey of that request across your services.
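To make the vocabulary concrete, here is a simplified model of that request as data. The Span struct, its field names, and the helper functions are illustrative assumptions, not the OpenTelemetry or Jaeger types; real spans also carry timestamps, attributes, and status.

```go
package main

import "fmt"

// Span is a simplified, hypothetical stand-in for a real tracing span.
type Span struct {
	TraceID  string // shared by every span in one request
	SpanID   string // unique per operation
	ParentID string // empty for the root span
	Name     string
}

// depth counts how many ancestors a span has by following ParentID links.
func depth(s Span, byID map[string]Span) int {
	d := 0
	for s.ParentID != "" {
		s = byID[s.ParentID]
		d++
	}
	return d
}

// exampleTrace models web server -> user service -> database.
func exampleTrace() []Span {
	return []Span{
		{TraceID: "t1", SpanID: "a", ParentID: "", Name: "web-server"},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Name: "user-service"},
		{TraceID: "t1", SpanID: "c", ParentID: "b", Name: "database"},
	}
}

func main() {
	spans := exampleTrace()
	byID := make(map[string]Span)
	for _, s := range spans {
		byID[s.SpanID] = s
	}
	// Print each span indented by its depth, roughly what a trace view shows.
	for _, s := range spans {
		for i := 0; i < depth(s, byID); i++ {
			fmt.Print("  ")
		}
		fmt.Println(s.Name)
	}
}
```

All three spans share one trace ID, and the parent links are what let a backend draw the request as a tree.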

Here’s a simple Go application demonstrating this:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.20.0"
	"go.opentelemetry.io/otel/trace"
)

// tracer is shared by all handlers in this program.
var tracer trace.Tracer

func initTracer() {
	// Replace with your Jaeger collector's HTTP endpoint.
	url := "http://localhost:14268/api/traces"
	// To send over UDP to a Jaeger agent instead (e.g. "localhost:6831"),
	// use jaeger.WithAgentEndpoint in place of jaeger.WithCollectorEndpoint.

	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(url)))
	if err != nil {
		log.Fatal(err)
	}

	// The SDK package is aliased to sdktrace so it does not clash with the
	// API package go.opentelemetry.io/otel/trace, which is also named trace.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("my-app"),
		)),
	)
	otel.SetTracerProvider(tp)
	tracer = otel.Tracer("my-app-tracer")
}

func main() {
	initTracer()

	http.HandleFunc("/hello", helloHandler)
	fmt.Println("Server starting on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func helloHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	ctx, span := tracer.Start(ctx, "helloHandler")
	defer span.End()

	sleepDuration := 50 * time.Millisecond
	time.Sleep(sleepDuration) // Simulate work

	name := "World"
	if r.URL.Query().Get("name") != "" {
		name = r.URL.Query().Get("name")
	}

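	// Simulate a downstream call (here, a database query) as a child span.
	// The returned context is discarded because nothing is traced beneath it.
	_, dbSpan := tracer.Start(ctx, "databaseCall")
	time.Sleep(100 * time.Millisecond) // Simulate database latency
	dbSpan.End()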

	message := fmt.Sprintf("Hello, %s!", name)
	w.Write([]byte(message))
	span.AddEvent("Response sent", trace.WithTimestamp(time.Now()))
}

To run this, you’ll need Jaeger running. A simple way is using Docker:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  -p 14268:14268 \
  jaegertracing/all-in-one:latest

Then save the Go code as main.go, and build and run the application:

go mod init myapp
go mod tidy
go run main.go

Now, visit http://localhost:8080/hello?name=Jaeger in your browser. After a few seconds, go to your Jaeger UI at http://localhost:16686, select my-app from the service dropdown, and click "Find Traces". You should see a trace rooted at the helloHandler span, with the databaseCall span nested beneath it.

The core problem OpenTelemetry and Jaeger solve is the "black box" problem in microservices. When a request fails, or is slow, you don’t know where it failed or why it’s slow. This system lets you see the entire request path, pinpointing latency and errors.

The tracer.Start(ctx, "spanName") call is where the magic happens. It creates a new span, makes it a child of whatever span ctx already carries (or starts a new trace if ctx carries none), and returns an updated context that includes the new span. This context is then passed down to subsequent calls, linking them together. defer span.End() is crucial; it marks the span as complete and hands it off for export. A span that is never ended is never sent.
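The linking mechanism can be sketched with nothing but the standard library. This is a toy model, not the OpenTelemetry implementation: the span type, the start function, and the ctxKey type are hypothetical names chosen to mirror the shape of tracer.Start, and only the parent link is modeled.

```go
package main

import (
	"context"
	"fmt"
)

// span is a toy stand-in for a tracing span; only the parent link is modeled.
type span struct {
	name   string
	parent *span
}

// ctxKey is an unexported key type so no other package can collide with it.
type ctxKey struct{}

// start mimics the shape of tracer.Start: it reads the current span from ctx,
// creates a child of it, and returns a new context carrying the child.
func start(ctx context.Context, name string) (context.Context, *span) {
	parent, _ := ctx.Value(ctxKey{}).(*span) // nil when ctx carries no span
	s := &span{name: name, parent: parent}
	return context.WithValue(ctx, ctxKey{}, s), s
}

// path returns the chain of span names from the root down to s.
func path(s *span) string {
	if s.parent == nil {
		return s.name
	}
	return path(s.parent) + " > " + s.name
}

// databaseCall receives ctx from its caller, so its span nests under the caller's.
func databaseCall(ctx context.Context) *span {
	_, s := start(ctx, "databaseCall")
	return s
}

// demo returns the span path with and without a propagated context.
func demo() (linked, orphan string) {
	ctx, _ := start(context.Background(), "helloHandler")
	linked = path(databaseCall(ctx))
	orphan = path(databaseCall(context.Background()))
	return linked, orphan
}

func main() {
	linked, orphan := demo()
	fmt.Println(linked) // prints "helloHandler > databaseCall"
	fmt.Println(orphan) // prints "databaseCall"
}
```

The second line of output shows the common failure mode: a function that starts from a fresh context instead of the one it was handed produces a span with no parent, which shows up as a separate, disconnected trace.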

The resource.NewWithAttributes call describes your service to the backend. The semconv.ServiceName("my-app") attribute is how Jaeger groups traces from this particular application. Without it, spans are reported under a generic fallback name such as "unknown_service", which makes services hard to tell apart.

The WithBatcher(exporter) option configures OpenTelemetry to send spans to the Jaeger exporter in batches. This is a performance optimization: finished spans are queued and exported in groups in the background, so a handler does not wait on a network call for every span it ends.
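To see why batching matters, here is a simplified model of a batching processor. The batcher type, its fields, and exportCalls are hypothetical, and the batch size of 512 is only an illustrative choice; the real SDK processor also flushes on a timer and runs in the background, which this sketch leaves out.

```go
package main

import "fmt"

// batcher is a hypothetical, simplified model of a batching span processor:
// it buffers finished span names and exports them in groups.
type batcher struct {
	maxBatch int
	buf      []string
	export   func(batch []string) // stands in for a network call
}

// add queues one finished span and flushes when the buffer is full.
func (b *batcher) add(span string) {
	b.buf = append(b.buf, span)
	if len(b.buf) >= b.maxBatch {
		b.flush()
	}
}

// flush exports whatever is buffered; call it on shutdown so nothing is lost.
func (b *batcher) flush() {
	if len(b.buf) == 0 {
		return
	}
	b.export(b.buf)
	b.buf = nil
}

// exportCalls reports how many export calls n spans cost at a given batch size.
func exportCalls(n, maxBatch int) int {
	calls := 0
	b := &batcher{maxBatch: maxBatch, export: func([]string) { calls++ }}
	for i := 0; i < n; i++ {
		b.add(fmt.Sprintf("span-%d", i))
	}
	b.flush()
	return calls
}

func main() {
	fmt.Println("batch size 1:  ", exportCalls(1000, 1), "export calls")
	fmt.Println("batch size 512:", exportCalls(1000, 512), "export calls")
}
```

With a batch size of 1, a thousand spans cost a thousand export calls; with 512 they cost two. The final flush is the part that is easy to forget: spans still sitting in the buffer when a process exits are lost, which is why a real program should call the tracer provider's Shutdown method before exiting.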

The most surprising thing is how easily tracing itself can distort the performance picture if you’re not careful about sampling. Many tracing setups sample 100% of traces by default, including the OpenTelemetry Go SDK used here, so every single request adds overhead and generates data. If you have a high-throughput service, this can become a significant performance bottleneck itself, masking the very issues you’re trying to diagnose. You might see a spike in latency and attribute it to a database call, when in reality, it’s the tracing overhead on every request.

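If you want to reduce overhead, the usual starting point is head-based sampling: decide when a trace begins and keep only a fixed fraction, say 1%. Keeping 100% of errors but only 1% of successful requests is a different technique, tail-based sampling, because the outcome is known only after the request has finished. It is typically done in a collector sitting between your services and Jaeger, and it cuts stored data rather than in-process overhead, since the application still creates and exports every span. Jaeger's adaptive sampling is something else again: it adjusts per-operation sampling rates to hit a target trace volume. Combined sensibly, these give you visibility into problems without drowning in data or performance impact.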
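The rule "keep every error, but only 1% of successes" can be sketched as a decision made once a trace has finished. This is a standard-library illustration of the idea, not a Jaeger or OpenTelemetry API: finishedTrace, keep, and the hash-based selection are all assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// finishedTrace is a hypothetical summary of a completed trace.
type finishedTrace struct {
	ID       string
	HasError bool
}

// keep decides after the fact whether to store a trace: always keep errors,
// and keep a deterministic fraction (successRatio, 0..1) of everything else.
func keep(t finishedTrace, successRatio float64) bool {
	if t.HasError {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(t.ID))
	// Map the hash to [0, 1) and compare it against the ratio.
	return float64(h.Sum32())/float64(1<<32) < successRatio
}

func main() {
	kept := 0
	for i := 0; i < 10000; i++ {
		// Every 100th trace is marked as failed for the sake of the demo.
		t := finishedTrace{ID: fmt.Sprintf("trace-%d", i), HasError: i%100 == 0}
		if keep(t, 0.01) {
			kept++
		}
	}
	fmt.Printf("kept %d of 10000 traces\n", kept)
}
```

Hashing the trace ID rather than drawing a random number makes the decision repeatable: every component that sees the same trace ID reaches the same verdict.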

The next thing you’ll likely explore is adding metrics and logs to your traces, correlating them to get an even richer picture of your system’s behavior.