The TombstoneReader fails to open when Prometheus cannot read the tombstones file of a TSDB block, typically because the underlying storage is corrupted or inaccessible. Tombstones record which time series data has been marked as deleted, so a block whose tombstones are unreadable cannot be loaded.

Here are the most common reasons this happens and how to fix them:

1. Disk Full or I/O Errors

  • Diagnosis: Check disk space on the Prometheus data directory.
    df -h /path/to/prometheus/data
    
    Look for I/O errors in dmesg or system logs.
    dmesg | grep -iE 'error|fail|corrupt'
    
  • Cause: Prometheus cannot write its WAL (Write-Ahead Log) or block metadata if the disk is full, leading to corruption. Or, the disk itself is failing, causing read/write errors.
  • Fix:
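    • If disk is full: Free up space by lowering retention (--storage.tsdb.retention.time or --storage.tsdb.retention.size), expanding the disk, or removing blocks you no longer need. Always remove a whole block directory, never individual files inside it: deleting only chunk files leaves the block's index pointing at data that no longer exists, which corrupts the block.
      # Stop Prometheus first, then list the block directories with their modification times
      ls -ld /path/to/prometheus/data/01*/
      # Each block's meta.json records the time range it covers (minTime and maxTime, in milliseconds)
      # Remove an entire block you no longer need (replace the placeholder with the real directory name)
      rm -rf /path/to/prometheus/data/01XXXXXXXXXXXXXX
      # If another component keeps a local object-storage cache on this disk, clean that up as well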
      
      Then restart Prometheus.
    • If I/O errors: Address the underlying disk issue (replace drive, check RAID status, etc.). Once the disk is healthy, you might need to repair the TSDB.
  • Why it works: Prometheus needs reliable disk access to maintain its data integrity. Clearing space or fixing disk errors ensures it can read and write necessary files.
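
The disk-space check above can be wrapped in a small script for cron or a pre-start hook. This is a minimal sketch, not part of Prometheus: the function names and the 90% default threshold are made up for illustration, and it assumes a POSIX shell with `df -P` and `awk` available.

```shell
#!/bin/sh
# Hypothetical helper: print the used-space percentage (0-100) of the
# filesystem that holds the given path, using POSIX `df -P` output.
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: return non-zero when usage reaches the threshold.
# The 90% default is an arbitrary illustration, not a Prometheus setting.
check_data_dir() {
    dir=$1
    limit=${2:-90}
    pct=$(disk_usage_pct "$dir")
    if [ -z "$pct" ]; then
        echo "ERROR: could not read disk usage for $dir" >&2
        return 2
    fi
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $dir is ${pct}% full (threshold ${limit}%)"
        return 1
    fi
    echo "OK: $dir is ${pct}% full"
}

# Example (the path is a placeholder):
# check_data_dir /path/to/prometheus/data 85
```

Catching a nearly full disk before Prometheus fails to write is much cheaper than repairing the WAL or blocks afterwards.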

2. Corrupted WAL (Write-Ahead Log)

  • Diagnosis: Examine the wal directory within your Prometheus data directory. Look for unusually large files or files that seem incomplete.
    ls -lh /path/to/prometheus/data/wal/
    
    Prometheus logs will often indicate WAL corruption specifically.
  • Cause: A crash or improper shutdown can leave the WAL in an inconsistent state. The WAL is Prometheus’s primary mechanism for recovering recently scraped samples after a restart, so damage to it can prevent a clean startup.
  • Fix: If WAL corruption is suspected and cannot be automatically recovered by Prometheus on restart, you may need to truncate or remove the problematic WAL segment. This will result in data loss for metrics scraped since the last successful block commit.
    1. Stop Prometheus.
    2. Inspect the wal directory and note which segment the Prometheus logs report as problematic (often the most recent one):
      ls -lh /path/to/prometheus/data/wal/
      
    3. Move the whole WAL aside rather than deleting it, so that you keep a backup. Be extremely cautious here.
      # Move the current WAL directory aside as a backup
      mv /path/to/prometheus/data/wal /path/to/prometheus/data/wal_backup_$(date +%Y%m%d_%H%M%S)
      # Create a new, empty WAL directory (it must be owned by the user Prometheus runs as)
      mkdir /path/to/prometheus/data/wal
      
    4. Restart Prometheus. It will start with a fresh WAL, and the existing blocks will be read.
  • Why it works: By providing a clean WAL, you force Prometheus to rebuild its in-memory state from the last successfully persisted block, bypassing the corrupted recovery log.
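
The move-aside step can be guarded so that it never overwrites an earlier backup or runs against the wrong directory. The following is a sketch under the same assumptions as the commands above; `reset_wal` is a hypothetical name, and the function does not check whether Prometheus is still running, so stop it first.

```shell
#!/bin/sh
# Hypothetical helper: move <data_dir>/wal aside and create an empty
# replacement. Run it only while Prometheus is stopped.
reset_wal() {
    data_dir=$1
    if [ ! -d "$data_dir/wal" ]; then
        echo "ERROR: no wal directory in $data_dir" >&2
        return 1
    fi
    backup="$data_dir/wal_backup_$(date +%Y%m%d_%H%M%S)"
    if [ -e "$backup" ]; then
        echo "ERROR: $backup already exists" >&2
        return 1
    fi
    mv "$data_dir/wal" "$backup" || return 1
    mkdir "$data_dir/wal" || return 1
    # Print the backup location so it can be inspected or restored later.
    echo "$backup"
}

# Example (the path is a placeholder):
# reset_wal /path/to/prometheus/data
```

Run it as the user that owns the data directory, or fix ownership of the new wal directory afterwards, so that Prometheus can write to it.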

3. Corrupted TSDB Blocks

  • Diagnosis: Prometheus logs will often explicitly mention errors reading specific block directories within /path/to/prometheus/data/01*/.
    ls -lh /path/to/prometheus/data/
    
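    Block directories are named with 26-character ULIDs, which currently begin with 01. Each one contains a chunks directory plus index, meta.json, and tombstones files.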
  • Cause: Similar to WAL corruption, unexpected shutdowns, disk errors, or bugs can corrupt the immutable TSDB blocks that Prometheus uses for long-term storage.
  • Fix: Prometheus does not ship an fsck-style repair command for blocks. What promtool (distributed alongside the prometheus binary) offers is a way to inspect blocks, which helps confirm which block is unreadable so that you can move it out of the way.
    1. Stop Prometheus.
    2. List the blocks, then try to read the suspect one:
      promtool tsdb list /path/to/prometheus/data
      promtool tsdb analyze /path/to/prometheus/data 01XXXXXXXXXXXXXX
      
      Replace 01XXXXXXXXXXXXXX with the specific block directory that is failing. If promtool cannot open the block, treat it as corrupted. Move it to any location outside the data directory so that you keep a copy (the backup path here is a placeholder):
      mv /path/to/prometheus/data/01XXXXXXXXXXXXXX /path/to/backup/
      
      If you are sure you will never need the block, you can delete it instead. Either way, all samples in that block's time range become unavailable:
      rm -rf /path/to/prometheus/data/01XXXXXXXXXXXXXX
      
      And then restart Prometheus.
  • Why it works: Prometheus opens every block in its data directory at startup, and a single unreadable block can block the whole process. Removing the corrupted block allows Prometheus to continue operating with its remaining data.
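
Before reaching for promtool, a quick structural scan can narrow down which block is damaged. The sketch below is an illustration rather than an official check: `check_blocks` is a hypothetical name, and it only looks for missing or zero-byte files, so a block that passes can still be corrupted internally. It treats a missing tombstones file as acceptable and flags only an empty one, which is an assumption about what the reader tolerates.

```shell
#!/bin/sh
# Hypothetical helper: scan a Prometheus data directory and report block
# directories with missing or empty files. Returns 1 if anything is flagged.
check_blocks() {
    data_dir=$1
    rc=0
    for dir in "$data_dir"/*/; do
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        # Block directories are 26-character ULIDs (digits and capitals);
        # this skips wal, chunks_head, and backup directories.
        [ "${#name}" -eq 26 ] || continue
        case $name in *[!0-9A-Z]*) continue ;; esac
        for f in meta.json index; do
            if [ ! -s "${dir}${f}" ]; then
                echo "$name: $f is missing or empty"
                rc=1
            fi
        done
        if [ ! -d "${dir}chunks" ]; then
            echo "$name: chunks directory is missing"
            rc=1
        fi
        # Assumption: a zero-byte tombstones file is too short to be valid.
        if [ -e "${dir}tombstones" ] && [ ! -s "${dir}tombstones" ]; then
            echo "$name: tombstones is empty"
            rc=1
        fi
    done
    return $rc
}

# Example (the path is a placeholder):
# check_blocks /path/to/prometheus/data
```

Any block it names is a good first candidate for the promtool inspection and move-aside steps described above.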

4. Insufficient Memory or System Resources

  • Diagnosis: Monitor system memory usage (free -m, top, htop) and Prometheus process memory. Check ulimit settings for the Prometheus user.
    ulimit -a
    
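  • Cause: Opening a block means memory-mapping its index and chunk files and reading its tombstones. If the system is under extreme memory pressure, the operating system might aggressively evict Prometheus’s mapped pages, or the process might be OOM-killed partway through startup. A low open-files limit (nofile) can also stop Prometheus from opening every block file. Locked-memory limits (memlock) are a less likely culprit, because memory-mapped files do not count against memlock unless they are explicitly locked.
  • Fix:
    • Increase system RAM or reduce the load on the server.
    • Check for memory limits imposed on the Prometheus process, such as a MemoryMax= setting in the systemd service file or a container memory limit, and reduce the number of active series if memory stays tight.
    • Increase ulimit -n (open files) and, if you have evidence that it matters, ulimit -l (locked memory) for the Prometheus user in /etc/security/limits.conf or equivalent. For example:
      prometheus soft memlock unlimited
      prometheus hard memlock unlimited
      prometheus soft nofile 65536
      prometheus hard nofile 65536
      
      Note that limits.conf applies only to PAM login sessions. If Prometheus runs as a systemd service, set LimitNOFILE= and LimitMEMLOCK= in the unit file instead. Then restart Prometheus and ensure the limits are applied.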
  • Why it works: Sufficient memory allows Prometheus to efficiently access and process its index and metadata without being starved by the OS or hitting resource limits.
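
To confirm that new limits actually reached the running process, read them from /proc rather than trusting `ulimit -a` in your own shell. This is a Linux-only sketch; the function names are hypothetical, and the commented usage assumes `pgrep` is installed and that exactly one process is named prometheus.

```shell
#!/bin/sh
# Hypothetical helper: print the soft "Max open files" limit that a
# running process actually has, read from /proc/<pid>/limits (Linux only).
open_files_limit() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

# Hypothetical helper: count the file descriptors the process has open.
# Reading another user's /proc/<pid>/fd usually requires root.
open_fd_count() {
    ls "/proc/$1/fd" | wc -l | tr -d ' '
}

# Example (assumes a single process named "prometheus"):
# pid=$(pgrep -x prometheus)
# echo "limit: $(open_files_limit "$pid"), in use: $(open_fd_count "$pid")"
```

If the count in use sits close to the limit, the open-files setting is the one to raise.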

5. Incorrect File Permissions or Ownership

  • Diagnosis: Verify the ownership and permissions of the Prometheus data directory and its contents.
    ls -ld /path/to/prometheus/data
    ls -lR /path/to/prometheus/data | head
    
    Ensure the user running Prometheus (e.g., prometheus) has read and write access.
  • Cause: If the data directory or its files have been modified by another user or process, or if the Prometheus service was started under a different user, it might lack the necessary permissions to access the files.
  • Fix: Correct the ownership and permissions.
    sudo chown -R prometheus:prometheus /path/to/prometheus/data
    sudo chmod -R u+rwX,g+rX,o-rwx /path/to/prometheus/data
    
    Replace prometheus:prometheus with the actual user and group Prometheus runs as. Then restart Prometheus.
  • Why it works: The Prometheus process must be able to read and write to its data directory to manage its WAL, blocks, and metadata.
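
Rather than scanning `ls -lR` output by eye, `find` can list only the entries with the wrong owner. A minimal sketch, with `find_foreign_files` as a hypothetical name and the path and user in the example as placeholders:

```shell
#!/bin/sh
# Hypothetical helper: list everything under a directory that is NOT owned
# by the given user (a user name or a numeric UID).
find_foreign_files() {
    find "$1" ! -user "$2" -print
}

# Example (path and user name are placeholders):
# find_foreign_files /path/to/prometheus/data prometheus
```

Empty output means every file and directory already belongs to that user, so ownership is probably not the cause and the chown step can be skipped.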

6. Corrupted Index Files (index or meta.json)

  • Diagnosis: Errors in Prometheus logs might specifically point to issues reading the index or meta.json file inside a particular block directory.
  • Cause: These files contain critical metadata about the block and its series. Corruption here can prevent Prometheus from understanding its own data.
  • Fix: This is a more severe form of block corruption. Prometheus does not ship a tool that rebuilds a damaged index or meta.json, so the practical options are to restore the affected block from a backup, if one is available, or to remove the bad block as described in point 3. If every block is affected and there is no backup, you might have to start again with an empty data directory.
  • Why it works: Similar to other corruption issues, fixing or removing the problematic metadata allows Prometheus to re-initialize its state.

After resolving the primary TombstoneReader error, your next challenge will likely be out of memory errors if the underlying cause was resource starvation, or potentially too many open files if file descriptor limits were the issue.
