The Prometheus server failed to write new exemplar data because the disk backing its exemplar storage filled up, which blocks further exemplar ingestion until space is freed.
Common Causes and Fixes
- Exemplar Storage Limit Too High for Disk Space:
  - Diagnosis: Check how exemplar storage is configured. Prometheus has no time-based `exemplar.retention` setting; exemplar storage is enabled with the `--enable-feature=exemplar-storage` flag and sized by count through `storage.exemplars.max_exemplars` in `prometheus.yml`. Then check the free space on the partition where Prometheus stores its data (often `/var/lib/prometheus` or similar) with `df -h`. If `max_exemplars` is set to a very large value and your disk is nearly full, this is a likely culprit.
  - Fix: Lower `max_exemplars` under the `storage.exemplars` block of `prometheus.yml` to match the number of exemplars you actually need (for instance `max_exemplars: 50000`, an illustrative value). After saving the configuration, restart Prometheus: `sudo systemctl restart prometheus`.
  - Why it works: Prometheus keeps exemplars in a fixed-size circular buffer in memory and also appends them to the WAL so they survive a restart. The buffer size bounds memory use, while the on-disk footprint comes from exemplar records in the WAL, which grows with the ingest rate. If the disk is still filling up after lowering the limit, reduce the exemplar generation rate as described in the next cause.
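The disk check from the diagnosis step can be scripted. Below is a minimal sketch in POSIX shell; the function names `disk_used_pct` and `disk_nearly_full` and the 90% default threshold are illustrative assumptions, not part of Prometheus.

```shell
# Print the used-space percentage (integer, without the % sign) of the
# filesystem that holds the given path. `df -P` forces the portable
# one-line-per-filesystem format, so column 5 is the capacity column.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is at or above the threshold ($2, default 90).
disk_nearly_full() {
  pct=$(disk_used_pct "$1")
  [ -n "$pct" ] && [ "$pct" -ge "${2:-90}" ]
}
```

For example, `disk_nearly_full /var/lib/prometheus 90 && echo "data partition is nearly full"` prints a warning only when the partition holding that path is at least 90% used.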
- Excessive Exemplar Generation Rate:
  - Diagnosis: Identify what is generating an unusually high volume of exemplars. Prometheus's own metrics help here: watch `rate(prometheus_tsdb_exemplar_exemplars_appended_total[5m])` for the ingest rate, and compare `prometheus_tsdb_exemplar_exemplars_in_storage` with `prometheus_tsdb_exemplar_max_exemplars` to see whether the exemplar buffer is saturated. Also check which scraped applications attach exemplars to high-cardinality metrics.
  - Fix: Stop collecting exemplars for high-cardinality metrics or for metrics that don't benefit from exemplar tracing. Prometheus does not document a per-job `exemplar_config` switch, so the change has to be made another way. To stop exemplars for a whole job, restrict the job to the classic text exposition format, which cannot carry exemplars (this assumes a release that supports `scrape_protocols`, 2.49 or later):

    ```yaml
    scrape_configs:
      - job_name: 'my_app'
        static_configs:
          - targets: ['localhost:9090']
        # The classic text format has no exemplar syntax,
        # so no exemplars are ingested for this job.
        scrape_protocols: ['PrometheusText0.0.4']
    ```

    Or, more granularly, drop a specific metric at scrape time. Note that this removes the series itself, not only its exemplars:

    ```yaml
    scrape_configs:
      - job_name: 'my_app'
        static_configs:
          - targets: ['localhost:9090']
        metric_relabel_configs:
          - source_labels: [__name__]
            regex: 'http_requests_total'  # Exclude this metric
            action: drop
    ```

    Restart Prometheus after applying changes. Where you control the application, removing exemplars from the instrumentation of those metrics is the cleanest option.
  - Why it works: Exemplars are most useful on low-cardinality metrics that are associated with specific traces. If exemplars are attached to metrics with thousands or millions of unique label combinations (e.g., `http_requests_total{path="/api/v1/users/<user_id>/..."}`), the sheer volume of exemplar data can overwhelm your storage. Cutting collection for these metrics dramatically reduces the write load.
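To see how fast exemplars arrive and how full the exemplar buffer is, you can query the Prometheus HTTP API from the shell. This is a sketch under assumptions: the helper name `prom_query` is hypothetical, `http://localhost:9090` is a placeholder address, and the metric names in the usage comments are the TSDB exemplar self-metrics exposed by recent Prometheus releases, so confirm them against your server's `/metrics` output.

```shell
# Run an instant PromQL query and print the raw JSON response.
# $1 = base URL of the Prometheus server, $2 = PromQL expression.
# -G sends the url-encoded data as a query string, -f makes curl exit
# non-zero on HTTP errors, --max-time bounds the wait on a dead server.
prom_query() {
  curl -sfG --max-time 10 "$1/api/v1/query" --data-urlencode "query=$2"
}

# Usage (not executed here):
#   prom_query http://localhost:9090 \
#     'rate(prometheus_tsdb_exemplar_exemplars_appended_total[5m])'
#   prom_query http://localhost:9090 \
#     'prometheus_tsdb_exemplar_exemplars_in_storage / prometheus_tsdb_exemplar_max_exemplars'
```

A ratio close to 1 in the second query suggests the buffer is saturated and old exemplars are being evicted continuously.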
- Insufficient Disk Space for Data Directory:
  - Diagnosis: Even if the exemplar settings are reasonable, the partition where Prometheus stores its TSDB data may simply be too small. The data directory defaults to `data/` relative to the working directory and is commonly `/var/lib/prometheus` on packaged installs. Use `df -h` to inspect the available space on the relevant partition. If it is consistently above 90-95% full, Prometheus will struggle to write any new data, including exemplars.
  - Fix: Increase the size of the disk partition or move Prometheus's data directory to a larger partition. This is often an infrastructure-level task: in cloud environments you might resize an EBS volume or attach a new, larger disk; on-premises you would add physical storage. After moving the data, point Prometheus at the new path with the `--storage.tsdb.path` command-line flag (it is a flag, not a `prometheus.yml` setting) and restart Prometheus.
  - Why it works: Prometheus writes all its time-series data, including WAL (Write-Ahead Log) files and compacted blocks, to its data directory. If the filesystem containing this directory runs out of space, no new data can be written, which leads to errors like this one. Providing more disk space lets Prometheus operate normally.
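Before resizing a volume, it helps to see what is actually consuming the data directory. The following is a minimal sketch that assumes a standard TSDB layout (a `wal` directory, a `chunks_head` directory, and one directory per compacted block); the function name `tsdb_usage` is illustrative.

```shell
# Print the size in KiB of each top-level entry of a TSDB data directory,
# largest first. $1 = data directory. Prints nothing if the directory is
# empty or does not exist.
tsdb_usage() {
  for entry in "$1"/*; do
    [ -e "$entry" ] || continue   # skip the unexpanded glob on no match
    du -sk "$entry"
  done | sort -rn
}
```

Running `tsdb_usage /var/lib/prometheus` shows whether the space is going to the WAL or to compacted blocks, which tells you whether the pressure comes from recent ingestion or from long-term retention.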
- WAL Corruption or Incomplete Compaction:
  - Diagnosis: Prometheus can run into trouble writing to its Write-Ahead Log (WAL) or completing background compaction, which can indirectly leave storage full or unavailable for new writes. Check the Prometheus logs for errors mentioning `tsdb`, `WAL`, or `compaction`, such as reports of a corrupted WAL segment or a failed compaction.
  - Fix: In rare cases, the WAL might need to be reset. This is a destructive operation: every sample that exists only in the WAL and has not yet been written to a block is lost, which can amount to the most recent few hours of data. Stop Prometheus, navigate to the data directory (e.g., `/var/lib/prometheus`), and delete the `wal` subdirectory: `rm -rf /var/lib/prometheus/wal`. Then start Prometheus again. It creates a fresh, empty WAL and continues to serve the data already persisted in blocks.
  - Why it works: The WAL is crucial for durability. If it becomes corrupted, Prometheus might refuse to start or to write new data. Removing it lets Prometheus come up from its last successfully persisted blocks and resume normal operation, at the cost of whatever had not yet been written to a block.
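Because deleting the WAL is irreversible, a more cautious variant is to move it aside first. The sketch below assumes Prometheus is already stopped; `set_aside_wal` is a hypothetical helper, and the optional second argument lets you park the old WAL on a different filesystem when the data partition itself is full.

```shell
# Move <data_dir>/wal to <dest_dir>/wal.bak.<timestamp> and print the new
# location. $1 = data directory, $2 = destination directory (defaults to
# the data directory). Fails if there is no wal directory to move.
set_aside_wal() {
  data_dir=$1
  dest_dir=${2:-$1}
  if [ ! -d "$data_dir/wal" ]; then
    echo "no wal directory in $data_dir" >&2
    return 1
  fi
  backup="$dest_dir/wal.bak.$(date +%Y%m%d%H%M%S)"
  mv "$data_dir/wal" "$backup" && echo "$backup"
}
```

Keeping the backup inside the same filesystem frees no space, so pass a destination on another volume if space is the problem, and delete the backup once Prometheus is confirmed healthy.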
- Incorrect Permissions on Data Directory:
  - Diagnosis: The user running the Prometheus process might not have write permissions to its data directory (e.g., `/var/lib/prometheus`). Check the ownership and permissions of the directory: `ls -ld /var/lib/prometheus`. If the user `prometheus` (or whatever user it runs as) doesn't have `w` permission, it won't be able to write new data.
  - Fix: Ensure the Prometheus user owns the data directory and has write permissions. If Prometheus runs as user `prometheus` and group `prometheus`:

    ```shell
    sudo chown -R prometheus:prometheus /var/lib/prometheus
    sudo chmod -R u+w /var/lib/prometheus
    ```

    Then restart Prometheus: `sudo systemctl restart prometheus`.
  - Why it works: Operating systems enforce file permissions. If the Prometheus process lacks write access to the directory where it stores data, every write fails, and the resulting storage errors can be mistaken for a full disk.
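Reading permission bits by eye is error-prone, so a direct write probe can confirm the diagnosis. This is a minimal sketch; `can_write_dir` is an illustrative name, and the probe is only meaningful when run as the same user as the Prometheus process.

```shell
# Succeed if the current user can create and write a file in directory $1.
# The probe file is removed again on success.
can_write_dir() {
  probe="$1/.write_probe.$$"
  if ( printf 'probe\n' > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    return 0
  fi
  return 1
}
```

To test as the service account, save the function in a script that calls `can_write_dir /var/lib/prometheus` and run that script with `sudo -u prometheus sh`.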
- External Storage Issues (e.g., NFS, Ceph):
  - Diagnosis: If Prometheus is configured to store its data on a network file system (NFS) or a distributed storage system (like Ceph), the issue might lie with the external storage's capacity, connectivity, or quotas. Check the available space and status of the underlying storage system. Look for errors in `dmesg` or system logs related to the mount point.
  - Fix: Address the issue on the external storage system. This could involve increasing quotas, freeing up space on the storage server, or resolving network connectivity problems to the storage. Once the external storage is healthy and has available space, Prometheus should be able to resume writing.
  - Why it works: Prometheus relies on the underlying filesystem to persist data. If that filesystem is unavailable, full, or experiencing errors, Prometheus cannot write its data, even if its own configuration is correct.
If this error is resolved but underlying issues persist, the next failure you are likely to see is an out-of-memory condition (when the exemplar volume is still too high for the available RAM) or general scrape failures (when the root cause was system-wide resource exhaustion).