The most surprising thing about Rancher Monitoring with Prometheus and Grafana is that it’s not just about seeing metrics; it’s about controlling your Kubernetes cluster’s behavior based on those metrics.

Let’s see it in action. Imagine you have a deployment called my-app in the default namespace.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: nginx:latest
        ports:
        - containerPort: 80

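With Rancher Monitoring installed, Prometheus collects container-level metrics for these pods (CPU, memory, and so on) from the kubelet and object-state metrics from kube-state-metrics, with no extra configuration. Metrics exposed by the application itself need a scrape configuration, covered below. You can then query all of these metrics in Grafana.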

First, let’s check if Prometheus is scraping the my-app pods. In your Rancher UI, navigate to the Monitoring section, then Prometheus. You can use the Graph tab to run PromQL queries.

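To see every target Prometheus is scraping, query the up metric on its own; a value of 1 means the last scrape succeeded and 0 means it failed. To narrow the result to a single job, add a label matcher:

    up{job="kube-state-metrics"}

This shows whether the kube-state-metrics service, which exposes cluster-level object state, is being scraped.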

To see metrics for your specific application, you’d typically look for metrics prefixed with container_ or kube_. For example, to see the CPU usage of your my-app pods:

    rate(container_cpu_usage_seconds_total{namespace="default", pod=~"my-app-.*"}[5m])

This query calculates the per-second average rate of CPU time consumed over the last 5 minutes, one series per container, for all pods in the default namespace whose names start with my-app-. container_cpu_usage_seconds_total is a counter that accumulates the CPU seconds a container has used. The rate function computes the per-second increase of that counter, so the result is measured in CPU cores rather than as a percentage: if the counter grows by 30 CPU-seconds over 300 seconds, the rate is 0.1, or one tenth of a core.

Now, let’s build the mental model. Rancher Monitoring deploys Prometheus and Grafana as Kubernetes resources. Prometheus is the engine that collects time-series data (metrics) from various sources in your cluster, including nodes, pods, and Kubernetes components themselves. It does this by "scraping" HTTP endpoints exposed by these components. Grafana is the visualization layer; it queries Prometheus and presents the data in dashboards with graphs, tables, and other visual aids.

The problem this solves is the "black box" nature of Kubernetes. Without monitoring, you don’t know why your applications are slow, why pods are restarting, or what resources are being consumed. Rancher Monitoring provides the visibility needed to diagnose and optimize.

The exact levers you control are:

  1. Scrape Configuration: You define which targets Prometheus should scrape. This is often managed via ServiceMonitor and PodMonitor custom resources in Rancher. You can specify namespaces, labels to match, and endpoints to scrape. For example, a ServiceMonitor might look like this:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: my-app-monitor
      namespace: cattle-monitoring-system # Prometheus's namespace
    spec:
      selector:
        matchLabels:
          app: my-app # Matches the label on your Service
      namespaceSelector:
        matchNames:
          - default # Scrape services in the 'default' namespace
      endpoints:
      - port: web # Matches the port name in your Service
        interval: 30s # Scrape every 30 seconds
        path: /metrics # If your app exposes metrics on a /metrics endpoint
    
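  2. PromQL Queries: You write PromQL queries to select, aggregate, and transform the collected metrics. This is the core of data analysis. For instance, to calculate the memory working set of each my-app pod, summed over its containers:

    sum by (pod) (container_memory_working_set_bytes{namespace="default", pod=~"my-app-.*", container!=""})

    The container!="" matcher drops the pod-level aggregate series that is reported alongside the per-container series, which would otherwise be counted twice.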

  3. Grafana Dashboards: You build or import Grafana dashboards to visualize your PromQL queries. Rancher Monitoring often pre-populates useful dashboards for Kubernetes components. You can create custom dashboards by adding panels and configuring them to use your PromQL queries.

  4. Alerting Rules: You define alerting rules in Prometheus (managed via PrometheusRule custom resources) that trigger alerts when certain metric conditions are met. For example, an alert for high CPU:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: my-app-alerts
      namespace: cattle-monitoring-system
    spec:
      groups:
      - name: my-app.rules
        rules:
        - alert: HighCPUUsage
          expr: |
            sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{namespace="default", pod=~"my-app-.*", container!=""}[5m]))
            /
            sum by (namespace, pod) (kube_pod_container_resource_limits{namespace="default", pod=~"my-app-.*", resource="cpu"})
            * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on pod {{ $labels.pod }}"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is using {{ $value | printf \"%.2f\" }}% CPU."

    This rule fires if a my-app pod’s CPU usage exceeds 80% of its defined limit for 5 consecutive minutes. The expr divides each pod’s CPU usage by its CPU limit and multiplies by 100 to get a percentage; a pod with no CPU limit produces no result and so never triggers the alert. The for: 5m clause requires the condition to hold for that long before the alert fires, which prevents flapping alerts.
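
The ServiceMonitor in item 1 selects a Service by label and scrapes a named port on it, but the walkthrough never defines that Service. A minimal sketch of one follows, assuming the Service is named my-app and forwards a port named web to container port 80 of the Deployment above. Both names are illustrative choices, not something Rancher creates for you. The stock nginx image also does not serve Prometheus metrics at /metrics, so the scrape only succeeds once the container actually exposes metrics on that path.

```yaml
# Hypothetical Service for the my-app Deployment (not part of the original setup).
apiVersion: v1
kind: Service
metadata:
  name: my-app          # assumed name, chosen for illustration
  namespace: default    # must be listed in the ServiceMonitor's namespaceSelector
  labels:
    app: my-app         # the label the ServiceMonitor's spec.selector matches
spec:
  selector:
    app: my-app         # routes traffic to the pods created by the Deployment
  ports:
  - name: web           # the port name the ServiceMonitor endpoint refers to
    port: 80
    targetPort: 80      # containerPort of my-app-container
```

Once the Service exists, running the up query in the Prometheus Graph tab is a quick way to check whether the new target is being scraped. If it does not appear, the usual cause is a mismatch between the Service's label or port name and the ServiceMonitor.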

The most powerful aspect of this system, and what most people miss, is how Prometheus’s label-based data model changes how you query data compared to traditional databases. Each unique combination of metric name and label values identifies a separate time series. You don’t just query for "CPU usage"; you query for container_cpu_usage_seconds_total{namespace="default", pod="my-app-xyz", container="my-app-container"}. Because every label is indexed, you can filter and aggregate along any of them at query time, which enables complex analyses and dynamic dashboards without pre-defining specific views.

The next step after setting up basic dashboards and alerts is exploring distributed tracing integration, typically with Jaeger or Tempo, to correlate metrics with request traces.
