Prometheus StatefulSets, by default, don’t persist any data across pod restarts.
Let’s see how Prometheus stores its time-series data and why that’s a problem when its pods die.

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: prometheus-server
    spec:
      serviceName: "prometheus-server"
      replicas: 1
      selector:
        matchLabels:
          app: prometheus
      template:
        metadata:
          labels:
            app: prometheus
        spec:
          containers:
            - name: prometheus
              image: prom/prometheus:v2.30.3
              ports:
                - containerPort: 9090
              args:
                - "--config.file=/etc/prometheus/prometheus.yml"
              volumeMounts:
                - name: config-volume
                  mountPath: /etc/prometheus
                - name: prometheus-storage-volume
                  mountPath: /prometheus
          # ... volumes section will follow

In this basic StatefulSet, prometheus-storage-volume is mounted at /prometheus but is not backed by persistent storage. When the prometheus-server pod is replaced (for example, after it is deleted or rescheduled), the new pod starts with an empty /prometheus directory and all historical metrics are lost. That matters because Prometheus’s core function is to retain metrics for analysis and alerting. Losing this data means losing your monitoring history, breaking alerting rules that rely on historical trends, and re-collecting data from scratch.
To fix this, we need to bind the prometheus-storage-volume to a PersistentVolumeClaim (PVC) that will, in turn, be dynamically provisioned or statically bound to a PersistentVolume (PV).
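For the statically bound case, a pre-created PersistentVolume might look roughly like the sketch below. The PV name, the manual class name, the default namespace, and the hostPath location are all hypothetical and not taken from the setup above. hostPath is only reasonable on a single-node test cluster. The claim name follows the StatefulSet convention of volume claim template name, then StatefulSet name, then pod ordinal.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-pv-0                 # hypothetical PV name
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain # keep the data if the claim is deleted
  storageClassName: manual              # hypothetical; the claim must request the same class
  claimRef:                             # reserve this PV for the first replica's claim
    namespace: default                  # assumption: the StatefulSet runs in "default"
    name: prometheus-storage-volume-prometheus-server-0
  hostPath:
    path: /mnt/prometheus-data          # hypothetical path; test clusters only
```

For the claim to bind to this volume, the StatefulSet’s volume claim template would also need storageClassName: manual and a storage request of at most 50Gi.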
Common Causes and Fixes
- No PersistentVolumeClaim (PVC) defined: The most frequent issue is simply forgetting to define the PVC that the StatefulSet’s volume should use.
  - Diagnosis: Look at the volumeClaimTemplates section of your StatefulSet definition. If it’s missing, that’s your problem.
  - Fix: Add a volumeClaimTemplates section to your StatefulSet. This tells Kubernetes to create a PVC for each replica (in this case, one PVC for the single replica).

        # ... inside the StatefulSet spec
        volumeClaimTemplates:
          - metadata:
              name: prometheus-storage-volume  # This name MUST match the volumeMounts name
            spec:
              accessModes: [ "ReadWriteOnce" ]  # Mounted read-write by a single node
              resources:
                requests:
                  storage: 50Gi  # Request 50 GiB of storage
              # storageClassName: standard  # Optional: specify a StorageClass if not using the default

  - Why it works: volumeClaimTemplates is a Kubernetes mechanism for StatefulSets that automatically creates a PVC for each pod. This PVC then requests storage from your cluster’s storage provisioner.
- volumeMounts.name / volumeClaimTemplates.metadata.name Mismatch: The name used in the container’s volumeMounts must exactly match the name defined in volumeClaimTemplates.
  - Diagnosis: Check spec.template.spec.containers[0].volumeMounts[1].name and spec.volumeClaimTemplates[0].metadata.name. They must be identical.
  - Fix: Ensure the names match. For example, if your volumeMounts uses prometheus-data-volume, your volumeClaimTemplates must use prometheus-data-volume under metadata.name.

        # ... StatefulSet spec
        template:
          spec:
            containers:
              - name: prometheus
                # ... other container spec
                volumeMounts:
                  - name: prometheus-storage-volume  # Matches below
                    mountPath: /prometheus
            volumes:  # This 'volumes' section is for non-PVC backed volumes like config maps
              - name: config-volume
                configMap:
                  name: prometheus-config
        volumeClaimTemplates:
          - metadata:
              name: prometheus-storage-volume  # Matches above
            spec:
              accessModes: [ "ReadWriteOnce" ]
              resources:
                requests:
                  storage: 50Gi

  - Why it works: Kubernetes uses these names to link the persistent storage defined by the PVC to the specific directory within the container where Prometheus expects to find its data.
- No StorageClass Configured or Specified: If your Kubernetes cluster doesn’t have a default StorageClass set up and you don’t explicitly specify one in your volumeClaimTemplates, no volume is dynamically provisioned and the PVC stays Pending.
  - Diagnosis: Run kubectl get storageclass to see available StorageClasses. Check your volumeClaimTemplates for a storageClassName field.
  - Fix: Either set a default StorageClass in your cluster (if you have control) or specify an existing one in your volumeClaimTemplates.

        # ... inside volumeClaimTemplates spec
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 50Gi
          storageClassName: gp2  # Example for AWS EBS, or 'standard' for GKE, etc.

  - Why it works: The StorageClass tells Kubernetes how to provision storage (e.g., using AWS EBS, GCE Persistent Disks, Ceph, NFS). Without it, Kubernetes doesn’t know what type of underlying storage to create.
- Insufficient Available Storage Capacity: The requested storage size might be too large for the available capacity in your cluster’s storage provisioner.
  - Diagnosis: Check the PVC status with kubectl get pvc. Look for a STATUS other than Bound (e.g., Pending). Then, check the events for the PVC: kubectl describe pvc <pvc-name>. You’ll likely see messages about insufficient capacity.
  - Fix: Reduce the requested storage size in volumeClaimTemplates to a value that can be satisfied by your provisioner, or add more capacity to your underlying storage system.

        # ... inside volumeClaimTemplates spec
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 20Gi  # Reduced from 50Gi
          storageClassName: standard

  - Why it works: The storage provisioner cannot fulfill a request it doesn’t have the resources for. Adjusting the request to fit available capacity allows the PVC to bind.
- Incorrect accessModes for the Storage Provider: Some storage providers only support certain access modes. ReadWriteOnce (RWO), which allows the volume to be mounted read-write by a single node, is the typical choice for a single Prometheus pod. If you request ReadWriteMany (RWX) from a provider that only supports RWO, the claim will fail to bind.
  - Diagnosis: Check kubectl describe pvc <pvc-name> for events indicating access mode issues. Consult your cloud provider’s or storage system’s documentation for supported accessModes.
  - Fix: Ensure the accessModes in your volumeClaimTemplates are compatible with your chosen StorageClass and underlying storage. For a single Prometheus instance, ReadWriteOnce is almost always correct.

        # ... inside volumeClaimTemplates spec
        spec:
          accessModes: [ "ReadWriteOnce" ]  # This is usually the correct mode for a single Prometheus pod
          resources:
            requests:
              storage: 50Gi
          storageClassName: standard

  - Why it works: accessModes dictate how the storage volume can be mounted by pods. Mismatched modes prevent the volume from being attached correctly.
- Storage Provisioner Not Ready or Misconfigured: The dynamic storage provisioner itself might be having issues.
  - Diagnosis: Check the logs of your cluster’s storage provisioner pods (e.g., aws-ebs-csi-driver, gcp-pd-csi-driver, nfs-client-provisioner). Look for errors related to authentication, network connectivity, or API calls to the underlying storage service.
  - Fix: Troubleshoot and reconfigure the storage provisioner according to its specific documentation. This is highly environment-dependent.
  - Why it works: The provisioner is the component that actually creates the PersistentVolume when a PVC requests it. If it’s broken, no storage will be created.

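For the StorageClass cause above, setting a cluster default is done with an annotation on the StorageClass object. The sketch below assumes an AWS cluster running the EBS CSI driver. The class name, the gp3 volume type, and the policy choices are illustrative, so swap in the provisioner and parameters your environment actually uses.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard                         # illustrative name
  annotations:
    # Marks this class as the default for PVCs that omit storageClassName
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com             # assumption: AWS EBS CSI driver is installed
parameters:
  type: gp3                              # assumption: EBS volume type
reclaimPolicy: Retain                    # keep the volume if the PVC is deleted
volumeBindingMode: WaitForFirstConsumer  # provision once the pod has been scheduled
allowVolumeExpansion: true               # lets the PVC be grown later
```

With a default class in place, a volume claim template that leaves out storageClassName is provisioned from this class. Retain is a cautious choice for metrics data, at the cost of having to clean up released volumes by hand.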
Once these steps are complete, your Prometheus StatefulSet will create a PVC, which will bind to a PersistentVolume, and Prometheus will then store its metrics data in that persistent location. When the pod restarts, it will reattach to the existing PersistentVolume, and its data will be intact.
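Once the data lives on a fixed-size volume, it can help to make the storage path explicit and to bound retention so the TSDB does not outgrow the claim. The fragment below is a sketch for a fresh deployment. The flags are standard Prometheus options, but the retention values are illustrative assumptions chosen to sit under the 50Gi request used earlier.

```yaml
# Fragment of spec.template.spec.containers[0] in the StatefulSet
args:
  - "--config.file=/etc/prometheus/prometheus.yml"
  - "--storage.tsdb.path=/prometheus"      # write the TSDB into the PVC-backed mount
  - "--storage.tsdb.retention.time=15d"    # illustrative; 15d is also the Prometheus default
  - "--storage.tsdb.retention.size=40GB"   # illustrative; leaves headroom under 50Gi
```

Whichever retention limit is reached first takes effect, so the size limit acts as a safety net if the metrics load grows faster than expected.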
The next error you might encounter is related to Prometheus configuration itself, or potentially resource limits if your metrics load increases significantly.