Fix Prometheus Exemplar Storage Full Error (2026)

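This error means the Prometheus server could not store new exemplar data. With exemplar storage enabled (--enable-feature=exemplar-storage), exemplars are held in a fixed-size in-memory circular buffer and are also appended to the WAL in the data directory. The failure therefore usually traces back to how that storage is sized, how fast exemplars arrive, or whether the data directory can still be written to.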

Common Causes and Fixes

Exemplar Storage Limit Too High for Available Resources:
- Diagnosis: Check how exemplar storage is sized in your Prometheus configuration file (prometheus.yml). Upstream Prometheus does not document an exemplar.retention setting; exemplar storage is sized by count, through max_exemplars under storage.exemplars (default 100000). Then check the available disk space on the partition where Prometheus stores its data (often /var/lib/prometheus or similar). Use df -h to see disk usage. If max_exemplars is set to a very large value and the host is short on memory or disk, this is a likely culprit.
- Fix: Reduce the max_exemplars value under storage.exemplars in your prometheus.yml file. For instance, if you’re seeing this error and only need the most recent exemplars, lower it to max_exemplars: 50000. After saving the configuration, reload or restart Prometheus: sudo systemctl restart prometheus.
- Why it works: Prometheus keeps exemplars in a fixed-size circular buffer in memory that is shared by all series. When the buffer is full, the oldest exemplars are overwritten, so there is no time-based retention that expires. The Prometheus documentation estimates roughly 100 bytes of memory for an exemplar that carries only a trace ID. Exemplars are also appended to the WAL so they survive a restart. A smaller limit caps memory use; WAL growth depends on how fast exemplars arrive, which the next cause addresses.
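
To get a feel for how much memory the exemplar buffer needs, the arithmetic can be sketched in shell. This assumes the rough figure from the Prometheus feature-flag documentation of about 100 bytes per exemplar that carries only a trace ID; exemplars with more labels cost more. The function name exemplar_mem_mib is hypothetical.

```shell
# Rough memory estimate for the in-memory exemplar buffer.
# Assumption: ~100 bytes per exemplar (trace ID only); pass a second
# argument if your exemplars carry more labels.
exemplar_mem_mib() {
  count=$1
  bytes_per_exemplar=${2:-100}
  # Integer division, so the result is rounded down to whole MiB.
  echo $(( count * bytes_per_exemplar / 1048576 ))
}

exemplar_mem_mib 100000       # a buffer of 100000 exemplars
exemplar_mem_mib 1000000 150  # a larger buffer with heavier exemplars
```

Treat the result as a lower bound for sizing, not a measurement.
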
Excessive Exemplar Generation Rate:
- Diagnosis: Find out how quickly exemplars are arriving by looking at Prometheus’s own metrics, specifically prometheus_tsdb_exemplar_exemplars_appended_total, prometheus_tsdb_exemplar_exemplars_in_storage, and prometheus_tsdb_exemplar_max_exemplars. If rate(prometheus_tsdb_exemplar_exemplars_appended_total[5m]) is high and exemplars_in_storage sits at max_exemplars, the buffer is saturated and turning over quickly. Then identify which scrape jobs target applications that attach exemplars to high-cardinality metrics. Prometheus’s scrape configuration has no enable_exemplar_trace_context option, so this means checking the instrumentation of the scraped applications.
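- Fix: Stop collecting exemplars from jobs that don’t benefit from exemplar tracing. Prometheus’s scrape configuration has no exemplar_config block, and unknown keys make the configuration fail to load. One option, on Prometheus versions that support scrape_protocols (introduced around 2.49), is to restrict a job to the classic text format, which cannot carry exemplars. Note that this also rules out features that need the protobuf format, such as native histograms. For example, to avoid exemplars for a specific job:
```
scrape_configs:
  - job_name: 'my_app'
    # Negotiate only the classic text format, which carries no exemplars
    scrape_protocols: ['PrometheusText0.0.4']
    static_configs:
      - targets: ['localhost:9090']
```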
  Or, if a specific metric is not needed at all, drop it at scrape time. This removes the whole series, samples and exemplars alike; metric relabeling has no action that strips exemplars while keeping the samples, and there is no __exemplar_enabled__ label:
```
scrape_configs:
  - job_name: 'my_app'
    static_configs:
      - targets: ['localhost:9090']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'http_requests_total' # Drop this metric entirely
        action: drop
```
  Restart Prometheus after applying changes.
- Why it works: Exemplars are most useful on a small number of metrics that are tied to specific traces, such as request latency histograms. The exemplar buffer is shared by all series, so a metric with thousands or millions of unique label combinations (e.g., http_requests_total{path="/api/v1/users/<user_id>/..."}) can flood it, evicting useful exemplars and adding to WAL writes. Not collecting exemplars for these metrics dramatically reduces the write load.
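
The exemplar-related self-metrics can be pulled straight from the /metrics endpoint. This is a minimal sketch, assuming Prometheus listens on localhost:9090 and exposes its exemplar storage metrics under the prometheus_tsdb_exemplar_ prefix; the helper name exemplar_metrics is hypothetical.

```shell
# Keep only the exemplar storage self-metrics from a /metrics dump.
# Reads the exposition text on stdin; HELP/TYPE comment lines are dropped
# because they start with '#'.
exemplar_metrics() {
  grep '^prometheus_tsdb_exemplar_' || true
}

# Usage (assumes Prometheus is reachable on localhost:9090):
#   curl -s http://localhost:9090/metrics | exemplar_metrics
```

If your version exposes both, comparing the exemplars_in_storage value with max_exemplars shows how full the buffer is.
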
Insufficient Disk Space for Data Directory:
- Diagnosis: Even if exemplar storage is sized reasonably, the disk partition where Prometheus stores its TSDB data might be too small. The upstream default data directory is data/ relative to the working directory; packaged installs commonly use /var/lib/prometheus. Use df -h to inspect the available space on the relevant partition. If it’s consistently above 90-95% full, Prometheus will struggle to write any new data, including exemplars.
- Fix: Increase the size of the disk partition or move Prometheus’s data directory to a larger partition. This is often an infrastructure-level task. In cloud environments, you might resize an EBS volume or attach a new, larger disk. On-premises, you’d add physical storage. After resizing or moving, point Prometheus at the new path if necessary with the --storage.tsdb.path command-line flag (this is a startup flag, not a prometheus.yml setting) and restart Prometheus.
- Why it works: Prometheus writes all its time-series data, including WAL (Write-Ahead Log) files and compacted blocks, to its data directory. If this directory’s containing filesystem runs out of space, no new data can be written, leading to errors like this. Providing more disk space allows Prometheus to operate normally.
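
A quick way to watch the fill level of the filesystem behind the data directory is to wrap df in a small helper. This sketch assumes a POSIX df; /var/lib/prometheus is only the example path used in this guide, and fs_use_percent is a hypothetical helper name.

```shell
# Print the used-capacity percentage (without the % sign) of the filesystem
# that contains the given path, using the portable `df -P` output format.
fs_use_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage, with the data directory path assumed from this guide:
#   fs_use_percent /var/lib/prometheus
#   du -sh /var/lib/prometheus/wal   # size of the WAL alone
```
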
WAL Corruption or Incomplete Compaction:
- Diagnosis: Sometimes Prometheus has trouble writing to its Write-Ahead Log (WAL) or completing background compaction, which can indirectly leave storage unavailable for new writes. Check the Prometheus logs for errors related to tsdb, WAL, or compaction. Look for messages that mention WAL corruption or failed compaction; the exact wording varies by version.
- Fix: In rare cases, the WAL might need to be reset. This is a destructive operation: it loses every sample and exemplar that was only in the WAL and not yet written to a block, which can be up to the last few hours of data. Stop Prometheus, navigate to the data directory (e.g., /var/lib/prometheus), and delete the wal subdirectory: rm -rf /var/lib/prometheus/wal. Then restart Prometheus. It will start with a fresh, empty WAL; blocks already on disk are unaffected.
- Why it works: The WAL is crucial for durability. If it becomes corrupted, Prometheus might refuse to start or write new data. Removing it lets Prometheus start from the blocks already persisted on disk and resume normal operation, at the cost of the recent data that existed only in the WAL.
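
As a less destructive variant of the deletion above, the WAL can be moved aside so it remains available for inspection. This is a minimal sketch, assuming Prometheus is already stopped; the set_aside_wal helper and the backup location are hypothetical, and the paths are the example paths from this guide.

```shell
# Move the WAL directory out of the data directory instead of deleting it.
# Stop Prometheus first. If the backup parent is on the same filesystem,
# mv is a rename and needs no extra disk space.
set_aside_wal() {
  data_dir=$1
  backup_parent=$2
  if [ ! -d "$data_dir/wal" ]; then
    echo "no wal directory in $data_dir" >&2
    return 1
  fi
  backup="$backup_parent/wal-backup-$(date +%Y%m%d%H%M%S)"
  mv "$data_dir/wal" "$backup" && echo "$backup"
}

# Usage from a root shell (paths are assumptions):
#   systemctl stop prometheus
#   set_aside_wal /var/lib/prometheus /var/lib
#   systemctl start prometheus
```

Once Prometheus is healthy again and the old WAL is no longer needed, the backup directory can be deleted to reclaim the space.
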
Incorrect Permissions on Data Directory:
- Diagnosis: The user running the Prometheus process might not have write permissions to its data directory (e.g., /var/lib/prometheus). Check the ownership and permissions of the directory: ls -ld /var/lib/prometheus. If the user prometheus (or whatever user it runs as) doesn’t have w permissions, it won’t be able to write new data.
- Fix: Ensure the Prometheus user owns the data directory and has write permissions. If Prometheus runs as user prometheus and group prometheus:
```
sudo chown -R prometheus:prometheus /var/lib/prometheus
sudo chmod -R u+w /var/lib/prometheus
```
  Then restart Prometheus: sudo systemctl restart prometheus.
- Why it works: Operating systems enforce file permissions. If the Prometheus process doesn’t have write access to the directory where it stores data, every write fails, which can surface as a storage error even though the disk has free space.
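
To confirm write access directly rather than inferring it from ls output, a probe file can be created as the service user. This is a minimal sketch; the can_write_dir helper is hypothetical, and the prometheus user and /var/lib/prometheus path are the examples used in this guide.

```shell
# Report whether the current user can create a file in the given directory.
# Run it as the account Prometheus runs under to test that account's access.
can_write_dir() {
  probe="$1/.write-probe.$$"
  # The subshell keeps a failed redirection from aborting the calling shell.
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    echo "writable"
  else
    echo "not writable"
    return 1
  fi
}

# Usage: save this as check-write.sh with a final line
#   can_write_dir /var/lib/prometheus
# and run it as the service user:
#   sudo -u prometheus sh check-write.sh
```
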
External Storage Issues (e.g., NFS, Ceph):
- Diagnosis: If Prometheus is configured to store its data on a network file system (NFS) or a distributed storage system (like Ceph), the issue might lie with the external storage’s capacity, connectivity, or quotas. Check the available space and status of the underlying storage system. Look for errors in dmesg or system logs related to the mount point.
- Fix: Address the issue on the external storage system. This could involve increasing quotas, freeing up space on the storage server, or resolving network connectivity problems to the storage. Once the external storage is healthy and has available space, Prometheus should be able to resume writing.
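- Why it works: Prometheus relies on the underlying filesystem to persist data. If that filesystem is unavailable, full, or experiencing errors, Prometheus cannot write its data, even if its own configuration is correct. Note that the Prometheus documentation states that NFS and other non-POSIX-compliant filesystems are not supported for local storage, so moving the data directory to a local disk is the more durable fix.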

Once this is resolved, other underlying issues may still surface. The most likely follow-up is an out-of-memory condition if the exemplar volume is still too high for available RAM, or general scrape failures if the root cause was system-wide resource exhaustion.