Prometheus Operator doesn’t just manage Prometheus; it fundamentally changes how Prometheus is deployed and configured within Kubernetes, treating Prometheus instances and their associated alerting and service discovery rules as first-class Kubernetes resources.
Let’s watch it in action. Imagine you have a Kubernetes cluster and you want to deploy Prometheus to scrape metrics from your applications.
    # prometheus-operator-deployment.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: main-prometheus
      namespace: monitoring
    spec:
      replicas: 2
      serviceAccountName: prometheus
      serviceMonitorSelector: {} # Scrape all ServiceMonitors in the namespace
      resources:
        requests:
          memory: 400Mi
          cpu: 100m
        limits:
          memory: 800Mi
          cpu: 200m
      # ... other Prometheus configuration ...
When you run kubectl apply -f prometheus-operator-deployment.yaml, the Prometheus Operator, which continuously watches for Prometheus custom resources, sees main-prometheus and creates a StatefulSet and a Service for it if they don’t already exist. It configures the Prometheus pods to run under the specified serviceAccountName and applies the resource requests and limits. Crucially, serviceMonitorSelector: {} tells this Prometheus instance to discover and use every ServiceMonitor in the monitoring namespace. Because serviceMonitorNamespaceSelector is not set, ServiceMonitors defined in other namespaces are ignored.
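The Prometheus resource references a service account named prometheus, but the operator does not create that account or its permissions for you. A minimal sketch of the supporting objects might look like the following. The ClusterRole and ClusterRoleBinding names are assumptions chosen for illustration, and the rule list is a common starting point for Kubernetes service discovery rather than a definitive policy.

```yaml
# prometheus-rbac.yaml (hypothetical file name)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus          # matches spec.serviceAccountName in the Prometheus resource
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus          # assumed name
rules:
  # Read access needed to discover scrape targets
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  # Only needed if EndpointSlice-based discovery is in use
  - apiGroups: ["discovery.k8s.io"]
    resources: ["endpointslices"]
    verbs: ["get", "list", "watch"]
  # Allows scraping /metrics on the API server itself
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus          # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
```

A cluster-wide role is used here because the targets may live in namespaces other than monitoring. If every target sits in one namespace, a namespaced Role and RoleBinding would be a tighter choice.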
The core problem Prometheus Operator solves is the boilerplate and manual management of Prometheus configurations in dynamic Kubernetes environments. Before the operator, you’d typically manage a monolithic Prometheus configuration file, manually update it, and redeploy Prometheus whenever you added or removed applications to monitor. This quickly becomes untenable.
The Operator introduces several Custom Resource Definitions (CRDs) that represent different aspects of Prometheus monitoring:
Prometheus: Defines an instance of Prometheus, its configuration, and which ServiceMonitors or PodMonitors it should scrape.
ServiceMonitor: Describes how to scrape metrics from a set of Kubernetes Services. It specifies the port, path, and interval, and uses label selectors to match against Services.
PodMonitor: Similar to ServiceMonitor, but targets Pods directly based on label selectors. This is useful for targets that don’t expose a Kubernetes Service or when you need to scrape directly from a pod’s IP.
PrometheusRule: Defines Prometheus alerting and recording rules. These rules are applied to the Prometheus instances managed by the operator.
Alertmanager: Defines an instance of Alertmanager, responsible for handling alerts sent by Prometheus.
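To make the PodMonitor entry concrete, here is a minimal sketch of one. The resource name, the app: my-worker label, and the metrics port name are all hypothetical. The Prometheus resource would also need a podMonitorSelector that matches this object, since an unset selector selects nothing.

```yaml
# my-worker-podmonitor.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-worker-metrics     # hypothetical name
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-worker          # selects Pods carrying this label (hypothetical)
  namespaceSelector:
    matchNames:
      - default               # only look for Pods in the 'default' namespace
  podMetricsEndpoints:
    - port: metrics           # name of a containerPort on the Pod (hypothetical)
      path: /metrics
      interval: 30s
```

The shape mirrors a ServiceMonitor. The main difference is that the endpoint list is called podMetricsEndpoints and the port name refers to a container port instead of a Service port.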
Here’s how ServiceMonitor works:
    # my-app-servicemonitor.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: my-app-metrics
      namespace: monitoring # Must match the Prometheus instance's namespace or be selected
    spec:
      selector:
        matchLabels:
          app: my-app # Selects Kubernetes Services with the label 'app: my-app'
      namespaceSelector:
        matchNames:
          - default # Only look for services in the 'default' namespace
      endpoints:
        - port: web # The name of the port in the Service definition
          interval: 30s
          path: /metrics # The HTTP path where metrics are exposed
When you apply this ServiceMonitor, the Prometheus Operator, which is already running and watching for these resources, sees my-app-metrics. It then regenerates the Prometheus configuration for the main-prometheus instance, adding a job to the scrape_configs section that targets the endpoints behind Services labeled app: my-app in the default namespace and scrapes their web port at /metrics every 30 seconds. The operator stores the generated configuration in a Secret, and a config-reloader sidecar in the Prometheus pod triggers a reload when it changes. You don’t touch Prometheus’s config file; the operator does it for you.
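For the ServiceMonitor to find anything, a Service with the right label and a named port has to exist. The sketch below shows one that would match. The Service name, the pod selector, and the port number 8080 are assumptions made for illustration; only the app: my-app label, the default namespace, and the port name web come from the ServiceMonitor.

```yaml
# my-app-service.yaml (hypothetical file name)
apiVersion: v1
kind: Service
metadata:
  name: my-app                # hypothetical name
  namespace: default          # matched by namespaceSelector.matchNames
  labels:
    app: my-app               # matched by the ServiceMonitor's selector.matchLabels
spec:
  selector:
    app: my-app               # assumed pod label
  ports:
    - name: web               # must equal endpoints[].port in the ServiceMonitor
      port: 8080              # assumed port number
      targetPort: 8080
```

Note that the ServiceMonitor selector matches labels on the Service object itself, not on the pods, and that endpoints[].port refers to the port name. A Service with an unnamed port would not be scraped by this ServiceMonitor.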
The PrometheusRule CRD allows you to define rules declaratively:
    # my-app-rules.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: my-app-alerts
      namespace: monitoring
    spec:
      groups:
        - name: my-app.rules
          rules:
            - alert: HighRequestLatency
              expr: job:request_latency_seconds:mean5m{job="my-app"} > 0.5
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High request latency detected for my-app"
                description: "{{ $value }}s of latency detected for {{ $labels.instance }} over the last 5 minutes."
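The operator renders the selected PrometheusRule objects into rule files, mounts them into the Prometheus pods, and references them from the rule_files section of the generated Prometheus configuration. Prometheus then loads and evaluates them. A Prometheus resource only picks up PrometheusRule objects matched by its spec.ruleSelector; if that field is left unset, no rules are selected.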
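Rule selection works the same way as ServiceMonitor selection: the Prometheus resource needs a ruleSelector, and an unset selector matches no PrometheusRule objects. The fragment below is a sketch of one way to wire the two together. The prometheus: main-prometheus label is a hypothetical convention, not something the operator requires, and the fields shown are meant to be merged into the existing manifests, not applied on their own.

```yaml
# Fragment of the Prometheus resource: only the rule-selection field is shown.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main-prometheus
  namespace: monitoring
spec:
  ruleSelector:
    matchLabels:
      prometheus: main-prometheus   # hypothetical label
  # ruleNamespaceSelector is left unset, so only PrometheusRule objects
  # in the Prometheus resource's own namespace are considered.
---
# Fragment of the PrometheusRule: the matching label goes in metadata.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: monitoring
  labels:
    prometheus: main-prometheus     # must match ruleSelector above
# spec.groups stays exactly as defined in my-app-rules.yaml
```

Using ruleSelector: {} instead would select every PrometheusRule in the namespace, which is simpler but gives up the ability to run several Prometheus instances with different rule sets.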
The most surprising aspect of Prometheus Operator is how it abstracts away the underlying Prometheus configuration files entirely, allowing you to manage Prometheus, its scraping targets, and its alerting rules using Kubernetes’ native API and kubectl. This means you can use GitOps workflows, RBAC, and other Kubernetes-native tooling to manage your entire monitoring stack. The operator runs a reconciliation loop, continuously comparing the desired state declared in your custom resources with the actual state of the Prometheus and Alertmanager deployments and correcting any drift.
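Alertmanager follows the same declarative pattern. The sketch below declares an Alertmanager instance and points Prometheus at it. The name main-alertmanager and the replica count are hypothetical. The alertmanager-operated Service name and its web port are assumed to be what the operator creates for Alertmanager instances, so it is worth confirming with kubectl get svc -n monitoring before relying on them.

```yaml
# alertmanager.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main-alertmanager     # hypothetical name
  namespace: monitoring
spec:
  replicas: 3                 # assumed replica count
---
# Fragment of the Prometheus resource: only the alerting section is shown.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main-prometheus
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager-operated   # assumed operator-created Service
        port: web                     # assumed port name on that Service
```

With this in place, alerts such as HighRequestLatency fire in Prometheus and are forwarded to Alertmanager, and both components are reconciled from resources that can live in the same Git repository.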
The next challenge you’ll likely encounter is configuring Prometheus to scrape metrics from targets outside of Kubernetes Services, such as external databases or third-party APIs.