Microservice Observability: Beyond Basic Metrics

Prometheus doesn’t actually care if your service is "micro" or a monolith; it just wants data.

Let’s say you’ve got a couple of services, user-service and order-service, and they talk to each other. You’ve instrumented them with Prometheus client libraries, and now you’re staring at a dashboard. What you want to see is how user-service is performing, how order-service is doing, and crucially, how their interaction affects things.

Here’s a simplified setup:

user-service (Go)

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	usersProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "user_service_users_processed_total",
		Help: "Total number of users processed.",
	}, []string{"status"})
)

func processUser(w http.ResponseWriter, r *http.Request) {
	userID := r.URL.Query().Get("id")
	// Simulate processing
	if userID == "" {
		usersProcessed.WithLabelValues("error_no_id").Inc()
		http.Error(w, "User ID is required", http.StatusBadRequest)
		return
	}
	usersProcessed.WithLabelValues("success").Inc()
	w.Write([]byte("Processed user: " + userID))
}

func main() {
	http.HandleFunc("/process-user", processUser)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}

order-service (Python)

from flask import Flask, request
from prometheus_client import Counter
from prometheus_flask_exporter import PrometheusMetrics
import random

app = Flask(__name__)
# Serves /metrics and adds default HTTP request metrics for the Flask app.
metrics = PrometheusMetrics(app)

# Custom counter with a "status" label. It is created with prometheus_client
# directly (metrics.counter is a view decorator, not an object with .inc())
# and lands in the default registry, which the exporter serves by default.
user_requests_total = Counter(
    'order_service_user_requests_total', 'Total user requests received',
    ['status']
)

@app.route('/create-order', methods=['POST'])
def create_order():
    user_id = request.args.get('userId')
    if not user_id:
        user_requests_total.labels(status='error_no_user_id').inc()
        return "User ID is required", 400

    # Simulate calling user-service (this is where we'd see inter-service metrics)
    # For now, just simulate success/failure
    if random.random() < 0.1: # 10% chance of failure
        user_requests_total.labels(status='error_processing').inc()
        return "Failed to create order for user", 500
    else:
        user_requests_total.labels(status='success').inc()
        return f"Order created for user {user_id}", 200

if __name__ == '__main__':
    app.run(port=5000)

Now, let’s say Prometheus is configured to scrape http://localhost:8080/metrics and http://localhost:5000/metrics.

Your prometheus.yml might look like this:

scrape_configs:
  - job_name: 'user-service'
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'order-service'
    static_configs:
      - targets: ['localhost:5000']

When you request http://localhost:8080/process-user?id=123 and then send a POST to http://localhost:5000/create-order?userId=123 (the route only accepts POST), Prometheus will collect series like:

user_service_users_processed_total{status="success"}
order_service_user_requests_total{status="success"}

The core strategy here is instrumentation and labeling. Every metric you expose should have labels that allow you to slice and dice the data. For microservices, this means labels for:

Service name: Prometheus attaches this at scrape time as the job label, taken from job_name, so you rarely need to add it yourself.
Instance: The specific host:port of the target, attached automatically as the instance label. This is what distinguishes multiple instances of the same service.
Request/Operation type: e.g., http_method, endpoint.
Status codes: 2xx, 4xx, 5xx.
Internal identifiers: user_id, order_id. Avoid these as label values in almost all cases, because every distinct value creates a new time series.
Dependencies: If order-service calls user-service, order-service should expose metrics like order_service_user_service_requests_total{status="success"}.
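As a rough illustration of keeping label values bounded, the sketch below normalizes raw HTTP status codes and request paths before they are used as label values. The helper names status_class and route_label are hypothetical and not part of any Prometheus client library; the only point is that a label should take a small, fixed set of values.

```python
import re


def status_class(code: int) -> str:
    """Collapse an HTTP status code into a bounded class such as '2xx'."""
    if 100 <= code <= 599:
        return f"{code // 100}xx"
    return "unknown"


# Matches a path segment made up only of digits, e.g. "/123".
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def route_label(path: str) -> str:
    """Replace numeric path segments with ':id' so that each user or
    order does not produce its own label value."""
    return _NUMERIC_SEGMENT.sub("/:id", path)


if __name__ == "__main__":
    print(status_class(404))
    print(route_label("/users/123/orders/456"))
```

The return values would then be passed to the client library as label values, for example as the status and endpoint labels of a request counter.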

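The real power comes when you start combining these. Prometheus aggregates counts and cannot follow an individual request, so it cannot tell you which order-service failures were preceded by a 4xx from user-service. It can tell you what fraction of order-service’s calls to user-service fail, and whether order-service’s own error rate rises at the same time. This requires user-service to expose its status codes and order-service to record the status of each call it makes to user-service.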
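A question like this can be approximated by comparing two error ratios over the same window. The following is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (its default port, which the setup above does not state) and that order-service exposes the dependency counter order_service_user_service_requests_total mentioned earlier. The function names are illustrative; only the /api/v1/query endpoint belongs to Prometheus.

```python
import json
import urllib.parse
import urllib.request


def error_ratio_query(metric: str, window: str = "5m") -> str:
    """Build PromQL for the share of increments whose status is not 'success'."""
    errors = f'sum(rate({metric}{{status!="success"}}[{window}]))'
    total = f"sum(rate({metric}[{window}]))"
    return f"{errors} / {total}"


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(base_url: str, promql: str) -> list:
    """Execute the query and return the result vector (needs a live server)."""
    with urllib.request.urlopen(instant_query_url(base_url, promql), timeout=5) as resp:
        return json.load(resp)["data"]["result"]


if __name__ == "__main__":
    # Error ratio of order-service's calls to user-service (assumed metric).
    print(error_ratio_query("order_service_user_service_requests_total"))
    # Error ratio of order-service's own request handling.
    print(error_ratio_query("order_service_user_requests_total"))
```

Plotting both ratios on the same graph shows whether they move together. That is correlation between aggregates, not proof that any particular order failed because of a particular user-service response.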

A common pattern is to use request duration histograms for latency analysis.

user-service (updated Go)

package main

import (
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	usersProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "user_service_users_processed_total",
		Help: "Total number of users processed.",
	}, []string{"status"})

	processDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name: "user_service_process_duration_seconds",
		Help: "Histogram of user processing durations.",
		Buckets: prometheus.DefBuckets, // Default buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
	}, []string{"status"})
)

func processUser(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	userID := r.URL.Query().Get("id")

	status := "success"
	if userID == "" {
		status = "error_no_id"
		usersProcessed.WithLabelValues(status).Inc()
		processDuration.WithLabelValues(status).Observe(time.Since(start).Seconds())
		http.Error(w, "User ID is required", http.StatusBadRequest)
		return
	}

	// Simulate up to 500ms of processing time
	time.Sleep(time.Duration(rand.Intn(500)) * time.Millisecond)

	usersProcessed.WithLabelValues(status).Inc()
	processDuration.WithLabelValues(status).Observe(time.Since(start).Seconds())
	w.Write([]byte("Processed user: " + userID))
}

func main() {
	http.HandleFunc("/process-user", processUser)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}

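With this, you can query Prometheus for histogram_quantile(0.95, sum(rate(user_service_process_duration_seconds_bucket{status="success"}[5m])) by (le)). This estimates the 95th percentile latency for successful user processing over the last 5 minutes. The value is interpolated from the bucket boundaries, so its accuracy depends on how well the buckets match your real latency range.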
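To make the quantile estimate less of a black box, here is a simplified sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. It is not Prometheus's implementation: it assumes positive bucket bounds, at least one finite bucket, and input already sorted by upper bound, and it uses plain counts where Prometheus would use per-second rates.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the +Inf bucket, as in the _bucket series with the le label.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0  # the first bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: the best available
                # answer is the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    sample = [(0.1, 10), (0.25, 60), (0.5, 95), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

Because of the even-spread assumption, the estimate can be off by up to one bucket width, which is why bucket boundaries should be chosen around the latencies you actually care about.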

What is most surprising about this strategy is how much you can infer about system behavior from aggregated counters and histograms alone, without logging every single event.

The core mental model for microservices observability with Prometheus is to think of each service as a data producer that exposes a stream of time-series data. Prometheus acts as the collector and storage engine, and PromQL is your query language to analyze these streams. You control the granularity and richness of your insights by deciding which metrics to expose and how to label them.

A common pitfall is the temptation to expose unique identifiers for every request or user in your labels. Each distinct label value creates a new time series, so this leads to extremely high cardinality, which can overwhelm Prometheus (especially older versions) and make queries slow or impossible. Instead, focus on aggregatable dimensions like status codes, operation types, and resource categories. If you need to trace a specific request across services, that’s a job for distributed tracing systems (like Jaeger or Zipkin), which can be linked to Prometheus metrics through exemplars that carry a trace ID, rather than by putting the ID in a label.
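A back-of-the-envelope calculation shows how quickly this grows. The sketch below computes an upper bound on the number of series one metric can produce from the number of distinct values per label. The function name and the example figure of 100,000 users are illustrative assumptions; real usage is lower when not every label combination is actually observed.

```python
from math import prod


def series_per_metric(label_cardinalities: dict, histogram_buckets: int = 0) -> int:
    """Upper bound on time series for one metric on one instance.

    label_cardinalities maps label name -> number of distinct values.
    histogram_buckets is the number of finite buckets (0 for a counter or gauge).
    """
    combinations = prod(label_cardinalities.values())
    if histogram_buckets:
        # Each label combination exposes one series per finite bucket,
        # plus the +Inf bucket, plus the _sum and _count series.
        return combinations * (histogram_buckets + 3)
    return combinations


if __name__ == "__main__":
    # Histogram with the 11 default buckets and a 3-value status label.
    print(series_per_metric({"status": 3}, histogram_buckets=11))
    # The same histogram with a user_id label for 100,000 users.
    print(series_per_metric({"status": 3, "user_id": 100_000}, histogram_buckets=11))
```

Under these assumptions the first case is a few dozen series while the second runs into the millions, for a single metric on a single instance.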

The next logical step is to explore how to implement distributed tracing alongside your Prometheus metrics for end-to-end request visibility.