Fix Prometheus TSDB Out-of-Order Samples Error (2026)

Prometheus’s Time Series Database (TSDB) rejects a sample when its timestamp is older than the newest sample already stored for the same series. Rejected samples are dropped and counted as out-of-order errors.

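The most common culprit is clock drift between the Prometheus server and the metric-generating agents (exporters or applications). For an ordinary scrape, Prometheus stamps each sample with its own clock, so agent drift matters mainly when the agent attaches explicit timestamps to its metrics or pushes them via remote write. In that case, an agent whose clock is significantly behind, or that steps backwards, sends samples with timestamps in the past.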
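To make the rejection rule concrete, here is a minimal Python sketch under simplifying assumptions: it models a single series, ignores the optional out-of-order time window that newer Prometheus versions can be configured with, and treats equal timestamps as accepted. The names SeriesHead and simulate are hypothetical and not part of any Prometheus API.

```python
class SeriesHead:
    """Toy model of one series: a sample is accepted only if it is not older
    than the newest sample already stored."""

    def __init__(self):
        self.last_ts_ms = None  # timestamp of the newest accepted sample
        self.rejected = 0       # samples refused as out of order

    def append(self, ts_ms, value):
        # A timestamp behind the newest stored one is out of order.
        if self.last_ts_ms is not None and ts_ms < self.last_ts_ms:
            self.rejected += 1
            return False
        self.last_ts_ms = ts_ms
        return True


def simulate(timestamps_ms):
    """Feed timestamps to a fresh series and return (accepted, rejected)."""
    head = SeriesHead()
    accepted = sum(1 for ts in timestamps_ms if head.append(ts, 1.0))
    return accepted, head.rejected


# The third sample is stamped 20 s behind the second, as it would be if the
# sender's clock had stepped backwards between the two.
print(simulate([60_000, 75_000, 55_000, 90_000]))  # (3, 1)
```

Only the sample stamped behind its predecessor is dropped; once timestamps move forward again, ingestion resumes normally.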

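To diagnose this, check the prometheus_tsdb_out_of_order_samples_total counter (and, for scraped targets, prometheus_target_scrapes_sample_out_of_order_total). If either is increasing, samples are being rejected as out of order.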
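One way to watch that counter without opening the UI is to ask the Prometheus HTTP API for its recent rate. The sketch below assumes the server is reachable at http://localhost:9090 and uses the counter name prometheus_tsdb_out_of_order_samples_total; both are assumptions to adjust for your setup, and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of the Prometheus server; change for your environment.
PROMETHEUS_URL = "http://localhost:9090"
# Per-second rate of rejected samples over the last 5 minutes.
QUERY = "rate(prometheus_tsdb_out_of_order_samples_total[5m])"


def total_rate(response):
    """Sum the values of an instant-vector response from /api/v1/query."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    # Each result entry carries value = [unix_timestamp, "<number as string>"].
    return sum(float(item["value"][1]) for item in response["data"]["result"])


def fetch_rate(base_url=PROMETHEUS_URL, query=QUERY):
    """Run the query against a live server and return the summed rate."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return total_rate(json.load(resp))
```

Calling fetch_rate() against a running server returns the combined rejection rate; anything persistently above zero means samples are still being dropped.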

Cause 1: NTP Sync Issues on Prometheus Server

Diagnosis: On the Prometheus server, run ntpq -p. Look for asterisks (*) next to your NTP servers, indicating they are synchronized. If there are no asterisks or the offset is large, NTP isn’t working correctly.
Fix: Ensure the ntpd or chronyd service is running and configured with reliable NTP servers. For example, in /etc/chrony/chrony.conf (/etc/chrony.conf on RHEL-family systems), you might have:
```
pool pool.ntp.org iburst
```
Then restart the service: sudo systemctl restart chronyd (the unit is named chrony on Debian and Ubuntu).
Why it works: This ensures the Prometheus server’s clock is accurately synchronized to a reliable time source, reducing the likelihood of receiving "old" data.

Cause 2: NTP Sync Issues on Metric Agents

Diagnosis: Repeat ntpq -p (or chronyc sources for chrony) on the machines running your exporters or applications sending metrics.
Fix: Similar to the Prometheus server, ensure NTP is running and configured correctly on all agents. If you have many agents, consider using a local NTP server for them.
Why it works: Synchronizing the agent’s clock prevents it from generating metrics with timestamps that are too far in the past relative to the Prometheus server.
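With many agents, reading chronyc output by eye does not scale. The following sketch parses the "System time" line printed by chronyc tracking and turns it into a signed offset. The 0.1 s threshold and the function names are assumptions chosen for illustration, not recommended values.

```python
import re

# Matches the "System time" line of `chronyc tracking`, for example:
# "System time     : 0.000006523 seconds slow of NTP time"
_SYSTEM_TIME = re.compile(
    r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
    re.MULTILINE,
)


def clock_offset_seconds(tracking_output):
    """Return the signed offset: positive if the local clock is ahead of NTP."""
    match = _SYSTEM_TIME.search(tracking_output)
    if match is None:
        raise ValueError("no 'System time' line found")
    magnitude = float(match.group(1))
    return magnitude if match.group(2) == "fast" else -magnitude


def is_in_sync(tracking_output, max_offset_seconds=0.1):
    """True if the absolute offset is within the (assumed) tolerance."""
    return abs(clock_offset_seconds(tracking_output)) <= max_offset_seconds
```

On a host running chrony, the input can come from subprocess.run(["chronyc", "tracking"], capture_output=True, text=True).stdout.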

Cause 3: Incorrect scrape_interval and evaluation_interval

Diagnosis: Review your prometheus.yml configuration. scrape_interval sets how often targets are scraped, and evaluation_interval (a global setting) sets how often recording and alerting rules are evaluated. If both are tight (e.g., 15s or less) and you have network latency or slow agents, samples can arrive later than the timestamp Prometheus expects for them.
Fix: Increase scrape_interval and/or evaluation_interval. For example, change from scrape_interval: 15s to scrape_interval: 30s, and set evaluation_interval: 60s under global:
```
global:
  evaluation_interval: 60s
scrape_configs:
  - job_name: 'my_app'
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:9100']
```
Why it works: A longer scrape interval leaves more slack for slow targets to respond before the next scrape starts. evaluation_interval only governs rule evaluation, so raising it may help when the rejected samples are rule results rather than scraped data.

Cause 4: Agent Application Time Skew

Diagnosis: Some applications, especially those running in containers or on systems with custom clock sources, might not be properly synchronized. Check the system time on the application host itself.
Fix: Ensure the operating system on the agent host is using NTP. If the application itself is responsible for timestamping (e.g., custom metrics libraries), verify its internal clock source is reliable and synchronized. Some libraries allow you to specify a clock source.
Why it works: Guarantees the timestamps generated by the application are as accurate as possible.

Cause 5: Network Latency and Packet Reordering

Diagnosis: High network latency or packet loss between the agent and Prometheus can delay scrapes. By default Prometheus rejects out-of-order samples outright; it tolerates them only within storage.tsdb.out_of_order_time_window, if that is configured (available since v2.39), and extreme delays can push timestamps beyond even that window. Use tools like mtr or ping to check network health between the agent and Prometheus.
Fix: Improve network connectivity. This might involve optimizing routing, upgrading network hardware, or ensuring sufficient bandwidth. If the network is the bottleneck, increasing scrape_timeout in prometheus.yml can also help, as long as it does not exceed the job’s scrape_interval:
```
scrape_configs:
  - job_name: 'my_app'
    scrape_timeout: 15s # Default is 10s
    static_configs:
      - targets: ['localhost:9100']
```
Why it works: A longer scrape timeout allows Prometheus to wait more for data over a latent network, and improved network stability ensures packets arrive in a more timely fashion.
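Prometheus refuses to load a configuration in which a job’s scrape_timeout is greater than its scrape_interval, so it is worth checking the pair before raising the timeout. The sketch below is a simplified stand-in for that check: it parses duration strings such as 15s or 1m30s with a small hand-written parser, not Prometheus’s own, and the function names are hypothetical.

```python
import re

# Milliseconds per unit, following Prometheus's duration suffixes.
_UNIT_MS = {
    "ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000,
    "d": 86_400_000, "w": 604_800_000, "y": 31_536_000_000,
}
# "ms" is listed before "m" so that "500ms" is not read as "500m" + "s".
_TOKEN = re.compile(r"(\d+)(ms|y|w|d|h|m|s)")
_WHOLE = re.compile(r"(?:\d+(?:ms|y|w|d|h|m|s))+")


def duration_ms(text):
    """Convert a duration such as '15s' or '1m30s' to milliseconds."""
    if text == "0":
        return 0
    if not _WHOLE.fullmatch(text):
        raise ValueError("not a valid duration: %r" % text)
    return sum(int(n) * _UNIT_MS[unit] for n, unit in _TOKEN.findall(text))


def timeout_fits(scrape_interval, scrape_timeout):
    """True if the timeout does not exceed the interval."""
    return duration_ms(scrape_timeout) <= duration_ms(scrape_interval)
```

For example, timeout_fits("15s", "15s") is true, while timeout_fits("15s", "30s") is false and would need a longer scrape_interval first.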

Cause 6: System Clock Changes (Manual or Hardware Issues)

Diagnosis: If a server’s clock was manually set incorrectly and then corrected, or if hardware clock issues occur, it can lead to historical timestamps. Check system logs (/var/log/syslog or journalctl) for any manual date commands or hardware clock warnings.
Fix: Re-synchronize the system clock using NTP and investigate any underlying hardware clock problems. Ensure the hardware clock (RTC) is also being updated correctly by the OS.
Why it works: Resets the system’s fundamental timekeeping to an accurate baseline.
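If logs are inconclusive, a clock step can also be caught as it happens by comparing the wall clock with the monotonic clock, which is not affected by manual or NTP adjustments. This is only an illustrative sketch: the 5 s polling interval and 0.5 s threshold are arbitrary assumptions, and the function names are made up for this example.

```python
import time


def clock_step(prev, curr):
    """Return how far the wall clock moved relative to the monotonic clock.

    prev and curr are (wall_seconds, monotonic_seconds) pairs. A result near
    zero means no step; a negative result means the wall clock went backwards.
    """
    wall_delta = curr[0] - prev[0]
    mono_delta = curr[1] - prev[1]
    return wall_delta - mono_delta


def watch(interval_s=5.0, threshold_s=0.5):
    """Poll both clocks forever and report any step larger than the threshold."""
    prev = (time.time(), time.monotonic())
    while True:
        time.sleep(interval_s)
        curr = (time.time(), time.monotonic())
        step = clock_step(prev, curr)
        if abs(step) > threshold_s:
            print(f"wall clock stepped by {step:+.3f}s")
        prev = curr
```

Running watch() on a suspect host while it misbehaves shows whether backward jumps line up with bursts of rejected samples.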

Once all of these are addressed, the next thing to watch is prometheus_tsdb_head_chunks. It is a gauge of the chunks held in the in-memory head block, so steady growth usually points to a high-cardinality problem and rising memory use rather than a full disk.