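Prometheus’s Time Series Database (TSDB) raises this error when it receives samples whose timestamps are older than what it expects for a series, and it rejects them.
The most common culprit is clock drift between the Prometheus server and the metric-generating agents (exporters or applications). If an agent attaches its own timestamps to samples and its clock is significantly behind the Prometheus server’s clock, those samples arrive with timestamps in the past. For ordinary scrapes without explicit timestamps, Prometheus stamps each sample with its own clock, so the server’s clock is the one that matters.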
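To diagnose this, check the `prometheus_tsdb_out_of_order_samples_total` counter on the Prometheus server (the scrape-side counterpart is `prometheus_target_scrapes_sample_out_of_order_total`). If it is increasing, Prometheus is currently rejecting samples with this error.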
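The "is it increasing?" check can be sketched in a few lines of Python. This is a minimal illustration, not an official client: it assumes you have saved two snapshots of the server's `/metrics` output as text, and the helper names `counter_total` and `is_increasing` are hypothetical. The parser is deliberately simplified and sums a counter across all of its label sets.

```python
def counter_total(exposition_text: str, metric_name: str) -> float:
    """Sum every sample of metric_name found in Prometheus text exposition output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The sample name ends at the first "{" (labels) or at whitespace.
        name = line.split("{", 1)[0].split()[0]
        if name != metric_name:
            continue
        # The value is the first token after the name and optional label set.
        rest = line[line.rindex("}") + 1:] if "{" in line else line[len(name):]
        total += float(rest.split()[0])
    return total


def is_increasing(before: str, after: str, metric_name: str) -> bool:
    """True if the counter grew between two /metrics snapshots."""
    return counter_total(after, metric_name) > counter_total(before, metric_name)
```

Taking the two snapshots a minute or so apart and passing the counter name from the diagnosis step is usually enough to tell whether rejections are still happening.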
Cause 1: NTP Sync Issues on Prometheus Server
- Diagnosis: On the Prometheus server, run `ntpq -p`. Look for an asterisk (`*`) next to one of your NTP servers, which marks the peer the daemon is currently synchronized to. If no peer has an asterisk, or the offset is large, NTP isn’t working correctly.
- Fix: Ensure the `ntpd` or `chronyd` service is running and configured with reliable NTP servers. For example, in `/etc/chrony/chrony.conf`, you might have:

      server pool.ntp.org iburst

  Then restart the service: `sudo systemctl restart chronyd`.
- Why it works: This ensures the Prometheus server’s clock is accurately synchronized to a reliable time source, reducing the likelihood of receiving "old" data.
Cause 2: NTP Sync Issues on Metric Agents
- Diagnosis: Repeat `ntpq -p` (or `chronyc sources` for chrony) on the machines running your exporters or applications sending metrics.
- Fix: Similar to the Prometheus server, ensure NTP is running and configured correctly on all agents. If you have many agents, consider using a local NTP server for them.
- Why it works: Synchronizing the agent’s clock prevents it from generating metrics with timestamps that are too far in the past relative to the Prometheus server.
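When the `ntpq -p` check has to be repeated across many hosts, it can help to script it. The sketch below parses captured `ntpq -p` output and reports whether a peer is selected and how large its offset is. The function names and the 50 ms threshold are assumptions chosen for illustration; pick a threshold that suits your environment.

```python
def selected_peer(ntpq_output: str):
    """Return (remote, offset_ms) for the peer marked '*', or None if no peer is selected."""
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue  # header, separator, or a peer that is not the sync source
        fields = line[1:].split()
        # Columns: remote refid st t when poll reach delay offset jitter
        if len(fields) >= 10:
            return fields[0], float(fields[8])
    return None


def clock_looks_healthy(ntpq_output: str, max_offset_ms: float = 50.0) -> bool:
    """True if a peer is selected and its offset is within max_offset_ms (assumed threshold)."""
    peer = selected_peer(ntpq_output)
    return peer is not None and abs(peer[1]) <= max_offset_ms
```

The output could be collected with something like `subprocess.run(["ntpq", "-p"], capture_output=True, text=True).stdout` on each host; hosts running chrony would need a separate parser for `chronyc` output.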
Cause 3: Incorrect scrape_interval and evaluation_interval
- Diagnosis: Review your `prometheus.yml` configuration. If `scrape_interval` is very short (e.g., `15s`) and `evaluation_interval` is even shorter, and you have network latency or slow agents, samples might arrive after the scrape collection window has effectively "closed" for that specific timestamp.
- Fix: Increase `scrape_interval` and/or `evaluation_interval`. For example, change from `scrape_interval: 15s` to `scrape_interval: 30s` and `evaluation_interval: 60s`. Note that `evaluation_interval` belongs under `global`, not under a scrape job:

      global:
        evaluation_interval: 60s
      scrape_configs:
        - job_name: 'my_app'
          scrape_interval: 30s
          static_configs:
            - targets: ['localhost:9100']

- Why it works: A longer scrape interval gives more buffer time for samples to arrive. A longer evaluation interval makes recording and alerting rules run less often; because it governs rule evaluation rather than scrape ingestion, treat it as a secondary adjustment.
Cause 4: Agent Application Time Skew
- Diagnosis: Some applications, especially those running in containers or on systems with custom clock sources, might not be properly synchronized. Check the system time on the application host itself.
- Fix: Ensure the operating system on the agent host is using NTP. If the application itself is responsible for timestamping (e.g., custom metrics libraries), verify its internal clock source is reliable and synchronized. Some libraries allow you to specify a clock source.
- Why it works: Guarantees the timestamps generated by the application are as accurate as possible.
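If an application attaches explicit timestamps to its samples, one way to spot skew is to compare those timestamps with the current time. The sketch below scans text exposition output for samples that carry a trailing millisecond timestamp and flags the ones older than a threshold. It is a simplified parser written for illustration: the function name `stale_samples` and the 60-second default are assumptions, not part of any Prometheus library.

```python
def stale_samples(exposition_text: str, now_ms: int, max_age_ms: int = 60_000):
    """Return (sample_identifier, age_ms) for samples whose explicit timestamp is too old."""
    stale = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The identifier is the metric name plus its optional {label set}.
        if "}" in line:
            cut = line.rindex("}") + 1
        else:
            cut = len(line.split()[0])
        ident, rest = line[:cut], line[cut:].split()
        # rest is [value] or [value, timestamp_ms]; only timestamped samples are checked.
        if len(rest) == 2:
            age_ms = now_ms - int(rest[1])
            if age_ms > max_age_ms:
                stale.append((ident, age_ms))
    return stale
```

Passing `int(time.time() * 1000)` from a host with a trusted clock as `now_ms` gives a rough view of how far behind the application's timestamps are.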
Cause 5: Network Latency and Packet Reordering
- Diagnosis: High network latency or packet loss between the agent and Prometheus can cause scrapes to be delayed. Prometheus rejects out-of-order samples by default (newer releases can accept them within a configured `out_of_order_time_window`), so extreme delays can push timestamps too far back. Use tools like `mtr` or `ping` to check network health between the agent and Prometheus.
- Fix: Improve network connectivity. This might involve optimizing routing, upgrading network hardware, or ensuring sufficient bandwidth. For Prometheus, increasing `scrape_timeout` in `prometheus.yml` can also help if the network is the bottleneck; it must not exceed the job’s `scrape_interval`:

      scrape_configs:
        - job_name: 'my_app'
          scrape_timeout: 15s  # Default is 10s
          static_configs:
            - targets: ['localhost:9100']

- Why it works: A longer scrape timeout allows Prometheus to wait longer for data over a high-latency network, and improved network stability ensures packets arrive in a more timely fashion.
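Prometheus refuses a configuration in which a job's `scrape_timeout` is longer than its `scrape_interval`, so it is worth checking the pair before raising the timeout. The following is a rough sketch of that check, assuming durations are written in the usual Prometheus style such as `15s` or `1m30s`. The helper names `duration_ms` and `timeout_fits` are hypothetical.

```python
import re

# Unit sizes in milliseconds, following Prometheus duration units.
_UNITS_MS = {
    "ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000,
    "d": 86_400_000, "w": 604_800_000, "y": 31_536_000_000,
}
_PART = re.compile(r"(\d+)(ms|[smhdwy])")


def duration_ms(text: str) -> int:
    """Convert a duration such as '15s' or '1m30s' to milliseconds."""
    pos, total = 0, 0
    for match in _PART.finditer(text):
        if match.start() != pos:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNITS_MS[match.group(2)]
        pos = match.end()
    if not text or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def timeout_fits(scrape_interval: str, scrape_timeout: str) -> bool:
    """True if the timeout does not exceed the interval."""
    return duration_ms(scrape_timeout) <= duration_ms(scrape_interval)
```

For example, `timeout_fits("30s", "15s")` is true, while `timeout_fits("10s", "15s")` is false and signals that the interval would need to be raised along with the timeout.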
Cause 6: System Clock Changes (Manual or Hardware Issues)
- Diagnosis: If a server’s clock was manually set incorrectly and then corrected, or if hardware clock issues occur, it can lead to historical timestamps. Check system logs (`/var/log/syslog` or `journalctl`) for any manual `date` commands or hardware clock warnings.
- Fix: Re-synchronize the system clock using NTP and investigate any underlying hardware clock problems. Ensure the hardware clock (RTC) is also being updated correctly by the OS.
- Why it works: Resets the system’s fundamental timekeeping to an accurate baseline.
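Clock steps can also be caught as they happen rather than reconstructed from logs. One possible approach, sketched below, compares the wall clock with the monotonic clock: the monotonic clock is not stepped when someone runs `date` or NTP steps the time, so a difference between the two deltas suggests the wall clock jumped. The function names are hypothetical, and small differences are expected from normal NTP slewing.

```python
import time


def read_clocks():
    """Return one (wall_seconds, monotonic_seconds) reading."""
    return time.time(), time.monotonic()


def clock_step(prev, curr) -> float:
    """Estimate how far the wall clock jumped between two readings, in seconds.

    Positive means the wall clock moved forward more than elapsed time,
    negative means it was set backwards.
    """
    wall_delta = curr[0] - prev[0]
    mono_delta = curr[1] - prev[1]
    return wall_delta - mono_delta
```

Calling `read_clocks()` periodically and logging any `clock_step` whose magnitude exceeds a second or so would record when the jump occurred, which can then be matched against the out-of-order errors.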
Once all of these are addressed, the next problem you may run into is a steadily growing `prometheus_tsdb_head_chunks` gauge, which usually points to a high-cardinality problem and the memory and disk pressure that comes with it.