A Prometheus StatefulSet doesn’t persist any data across pod restarts unless you explicitly give it persistent storage.

Let’s see how Prometheus stores its time-series data and why that’s a problem when its pods die.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-server
spec:
  serviceName: "prometheus-server"
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.30.3
        ports:
        - containerPort: 9090
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        volumeMounts:
        - name: config-volume
          mountPath: /etc/prometheus
        - name: prometheus-storage-volume
          mountPath: /prometheus
      # ... volumes section will follow

In this basic StatefulSet, prometheus-storage-volume is mounted at /prometheus but is not backed by persistent storage (typically it ends up as an emptyDir or a similar ephemeral volume). When the prometheus-server pod is deleted or rescheduled, the new pod starts with an empty /prometheus directory, and all historical metrics are lost. This matters because Prometheus’s core function is to retain metrics for analysis and alerting. Losing this data means losing your monitoring history, breaking alerting rules that rely on historical trends, and collecting data again from scratch.
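
To make the failure mode concrete, here is a minimal sketch of what the pod template’s volumes section often looks like in this situation. The emptyDir backing for prometheus-storage-volume is an assumption for illustration, since the manifest above elides its volumes section; the prometheus-config ConfigMap name is the one used in the example further down.

```yaml
# Sketch only: goes under spec.template.spec of the StatefulSet above.
volumes:
- name: config-volume
  configMap:
    name: prometheus-config
- name: prometheus-storage-volume
  # Assumed for illustration: an emptyDir lives and dies with the pod.
  # It survives a container crash, but it is deleted when the pod is
  # removed from its node, so a rescheduled pod starts with an empty
  # /prometheus directory.
  emptyDir: {}
```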

To fix this, we need to bind the prometheus-storage-volume to a PersistentVolumeClaim (PVC) that will, in turn, be dynamically provisioned or statically bound to a PersistentVolume (PV).

Common Causes and Fixes

  1. No PersistentVolumeClaim (PVC) defined: The most frequent issue is simply forgetting to define the PVC that the StatefulSet’s volume should use.

    • Diagnosis: Look at the volumeClaimTemplates section of your StatefulSet definition. If it’s missing, that’s your problem.

    • Fix: Add a volumeClaimTemplates section to your StatefulSet. This tells Kubernetes to create a PVC for each replica (in this case, one PVC for the single replica).

      # ... inside the StatefulSet spec
      volumeClaimTemplates:
      - metadata:
          name: prometheus-storage-volume # This name MUST match the volumeMounts name
        spec:
          accessModes: [ "ReadWriteOnce" ] # Common for single-node storage
          resources:
            requests:
              storage: 50Gi # Request 50 Gigabytes of storage
          # storageClassName: standard # Optional: specify a StorageClass if not using the default
      
    • Why it works: volumeClaimTemplates is a Kubernetes mechanism for StatefulSets that automatically creates a PVC for each pod. This PVC then requests storage from your cluster’s storage provisioner.
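
    As a rough illustration of what that mechanism produces, the claim created for the first pod of the StatefulSet above would look something like the sketch below. You never write this object yourself. The name follows the pattern <claim template name>-<StatefulSet name>-<ordinal>, and the remaining fields are copied from the template.

      ```yaml
      # Approximately what the StatefulSet controller creates for pod prometheus-server-0.
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        # <volumeClaimTemplates name>-<StatefulSet name>-<pod ordinal>
        name: prometheus-storage-volume-prometheus-server-0
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 50Gi
      ```

    Because the name is stable, a recreated prometheus-server-0 pod picks up the same claim instead of getting a fresh one. By default the claim is also left in place when the StatefulSet itself is deleted.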

  2. volumeMounts.name / volumeClaimTemplates.metadata.name Mismatch: The name used in the container’s volumeMounts must exactly match the name defined in volumeClaimTemplates.

    • Diagnosis: Check spec.template.spec.containers[0].volumeMounts[1].name and spec.volumeClaimTemplates[0].metadata.name. They must be identical.

    • Fix: Ensure the names match. For example, if your volumeMounts uses prometheus-data-volume, your volumeClaimTemplates must use prometheus-data-volume under metadata.name.

      # ... StatefulSet spec
      template:
        spec:
          containers:
          - name: prometheus
            # ... other container spec
            volumeMounts:
            - name: prometheus-storage-volume # Matches below
              mountPath: /prometheus
          volumes: # This 'volumes' section is for non-PVC backed volumes like config maps
          - name: config-volume
            configMap:
              name: prometheus-config
      volumeClaimTemplates:
      - metadata:
          name: prometheus-storage-volume # Matches above
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 50Gi
      
    • Why it works: Kubernetes uses these names to link the persistent storage defined by the PVC to the specific directory within the container where Prometheus expects to find its data.

  3. No StorageClass Configured or Specified: If your Kubernetes cluster doesn’t have a default StorageClass and you don’t explicitly specify one in your volumeClaimTemplates, dynamic provisioning won’t happen and the PVC stays Pending.

    • Diagnosis: Run kubectl get storageclass to see available StorageClasses. Check your volumeClaimTemplates for a storageClassName field.

    • Fix: Either set a default StorageClass in your cluster (if you have control) or specify an existing one in your volumeClaimTemplates.

      # ... inside volumeClaimTemplates spec
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 50Gi
        storageClassName: gp2 # Example for AWS EBS, or 'standard' for GKE, etc.
      
    • Why it works: The StorageClass tells Kubernetes how to provision storage (e.g., using AWS EBS, GCE Persistent Disks, Ceph, NFS). Without it, Kubernetes doesn’t know what type of underlying storage to create.
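
    If you do control the cluster, marking a StorageClass as the default is done with an annotation. The sketch below assumes an AWS cluster with the EBS CSI driver installed; the class name, provisioner, and parameters are illustrative and need to be swapped for whatever your environment actually provides.

      ```yaml
      apiVersion: storage.k8s.io/v1
      kind: StorageClass
      metadata:
        name: gp2                     # illustrative name, matching the example above
        annotations:
          # Marks this class as the cluster default for PVCs without storageClassName
          storageclass.kubernetes.io/is-default-class: "true"
      provisioner: ebs.csi.aws.com    # assumption: AWS EBS CSI driver is installed
      parameters:
        type: gp2                     # assumption: EBS volume type
      # Delay provisioning until a pod using the claim is scheduled
      volumeBindingMode: WaitForFirstConsumer
      ```

    With WaitForFirstConsumer, a PVC legitimately shows Pending until its pod is scheduled, so check the PVC events before treating that status as a failure.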

  4. Insufficient Available Storage Capacity: The requested storage size might be too large for the available capacity in your cluster’s storage provisioner.

    • Diagnosis: Check the PVC status with kubectl get pvc. Look for a STATUS other than Bound (e.g., Pending). Then, check the events for the PVC: kubectl describe pvc <pvc-name>. You’ll likely see messages about insufficient capacity.

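    • Fix: Reduce the requested storage size in volumeClaimTemplates to a value that can be satisfied by your provisioner, or add more capacity to your underlying storage system. Because volumeClaimTemplates can’t be edited on an existing StatefulSet, changing the size means deleting and recreating the StatefulSet, along with the Pending PVC.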

      # ... inside volumeClaimTemplates spec
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 20Gi # Reduced from 50Gi
        storageClassName: standard
      
    • Why it works: The storage provisioner cannot fulfill a request it doesn’t have the resources for. Adjusting the request to fit available capacity allows the PVC to bind.

  5. Incorrect accessModes for the Storage Provider: Some storage providers only support certain access modes. ReadWriteOnce (RWO) is typical for single-pod access, but if you were trying to use ReadWriteMany (RWX) with a provider that only supports RWO, it would fail.

    • Diagnosis: Check kubectl describe pvc <pvc-name> for events indicating access mode issues. Consult your cloud provider’s or storage system’s documentation for supported accessModes.

    • Fix: Ensure the accessModes in your volumeClaimTemplates are compatible with your chosen StorageClass and underlying storage. For a single Prometheus instance, ReadWriteOnce is almost always correct.

      # ... inside volumeClaimTemplates spec
      spec:
        accessModes: [ "ReadWriteOnce" ] # This is usually the correct mode for a single Prometheus pod
        resources:
          requests:
            storage: 50Gi
        storageClassName: standard
      
    • Why it works: accessModes dictate how the storage volume can be mounted by pods. Mismatched modes prevent the volume from being attached correctly.

  6. Storage Provisioner Not Ready or Misconfigured: The dynamic storage provisioner itself might be having issues.

    • Diagnosis: Check the logs of your cluster’s storage provisioner pods (e.g., aws-ebs-csi-driver, gcp-pd-csi-driver, nfs-client-provisioner). Look for errors related to authentication, network connectivity, or API calls to the underlying storage service.

    • Fix: Troubleshoot and reconfigure the storage provisioner according to its specific documentation. This is highly environment-dependent.

    • Why it works: The provisioner is the component that actually creates the PersistentVolume when a PVC requests it. If it’s broken, no storage will be created.

Once these steps are complete, your Prometheus StatefulSet will create a PVC, which will bind to a PersistentVolume, and Prometheus will then store its metrics data in that persistent location. When the pod restarts, it will reattach to the existing PersistentVolume, and its data will be intact.
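
Putting the fragments from this section together, a complete manifest might look like the following sketch. It only combines pieces shown above; the prometheus-config ConfigMap is assumed to exist already, and storageClassName is left commented out so that the cluster default is used.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-server
spec:
  serviceName: "prometheus-server"
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.30.3
        ports:
        - containerPort: 9090
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        volumeMounts:
        - name: config-volume
          mountPath: /etc/prometheus
        - name: prometheus-storage-volume   # matches the claim template below
          mountPath: /prometheus
      volumes:
      # Only non-PVC volumes are listed here; the storage volume
      # comes from volumeClaimTemplates.
      - name: config-volume
        configMap:
          name: prometheus-config           # assumed to exist already
  volumeClaimTemplates:
  - metadata:
      name: prometheus-storage-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 50Gi
      # storageClassName: standard          # optional; omit to use the default class
```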

The next problems you are likely to hit are in the Prometheus configuration itself, or in resource limits if your metrics load grows significantly.
