Prometheus has no hard limit on the number of virtual machines it can scrape. The practical limit is the rate at which it can ingest and process the metrics those VMs expose.

Let’s see this in action. Imagine you have a fleet of VMs, each running a small Node Exporter instance.

# On a single VM
curl http://localhost:9100/metrics

This simple curl command fetches metrics. Now imagine Prometheus pulling from thousands of these endpoints on every scrape interval. Prometheus isn’t a general-purpose database; it’s a time-series system that parses each scrape, appends the samples to an in-memory head block backed by a write-ahead log, and periodically compacts that data into blocks on disk. The bottleneck isn’t the number of targets, but the throughput of metric ingestion and the CPU and memory required to run the scrape loop and evaluate rules.
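To make the throughput framing concrete, here is a back-of-the-envelope sketch of the sample rate a server has to ingest. The fleet size and the per-target series count are illustrative assumptions, not measurements; real Node Exporter output varies with the enabled collectors and the hardware.

```python
def samples_per_second(targets: int, series_per_target: int, scrape_interval_s: float) -> float:
    """Approximate steady-state ingest rate: every series yields one sample per scrape."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape_interval_s must be positive")
    return targets * series_per_target / scrape_interval_s


if __name__ == "__main__":
    # Hypothetical fleet: the same VMs scraped at three different intervals.
    for interval in (15, 30, 60):
        rate = samples_per_second(targets=2000, series_per_target=1000, scrape_interval_s=interval)
        print(f"{interval:>2}s interval -> {rate:,.0f} samples/s")
```

The same fleet costs four times as much ingest at a 15s interval as at 60s, which is why the number of VMs alone says little about load.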

The core problem Prometheus solves here is the consolidation of operational visibility from distributed, dynamic infrastructure. Instead of logging into each VM, you point Prometheus at them (or rather, at their Node Exporter endpoints), and it pulls the data.

Internally, Prometheus operates on a scrape configuration. For VM scraping, this usually looks something like this in prometheus.yml:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets:
        - '192.168.1.10:9100'
        - '192.168.1.11:9100'
        - '192.168.1.12:9100'
    # ... more VMs

Or, more commonly in dynamic environments, you’ll use service discovery (like Consul, EC2, Kubernetes) to automatically populate the targets list.

The "limit" you hit isn’t an explicit max_vms setting. It manifests as:

  • High CPU Usage on Prometheus Server: When Prometheus is overwhelmed, its CPU usage will spike as it tries to keep up with the scrape loop, parse incoming data, evaluate alerting/recording rules, and serve queries. You’ll see the prometheus process consuming 100% of one or more CPU cores.
  • Increased Scrape Duration and Failures: Individual scrapes take longer, and scrapes that were previously fast start to time out. In the Prometheus UI under "Status" -> "Targets," you’ll see targets flapping between "UP" and "DOWN" or reporting very long scrape durations.
  • High Memory Usage on Prometheus Server: As Prometheus ingests more data, its in-memory data structures grow. If it can’t write data to disk fast enough, or if the cardinality of your metrics is extremely high, it can lead to excessive memory consumption, potentially causing the Prometheus process to be OOM-killed by the operating system.
  • Network Saturation (Less Common): Prometheus always pulls from its targets, and each scrape is a small HTTP request. If you scrape an enormous number of VMs very frequently, the aggregate traffic between Prometheus and its targets could become a factor, though this is rarely the primary bottleneck.
  • Disk I/O Bottlenecks: Prometheus writes ingested data to its TSDB (Time Series Database) on disk. If the disk subsystem is slow, or if Prometheus is trying to write data faster than the disk can handle, this can become a bottleneck, impacting ingestion rates and overall performance.

Common Causes and Fixes:

  1. Too Many Targets for Available Resources:

    • Diagnosis: Monitor your Prometheus server’s CPU, memory, and network usage. Use Prometheus’s own metrics (process_cpu_seconds_total, process_resident_memory_bytes) or external monitoring tools. Check /targets in the Prometheus UI for long scrape durations or frequent UP/DOWN transitions.
    • Fix:
      • Scale Up Prometheus: Increase the CPU and RAM allocated to the Prometheus server. For example, if running in Kubernetes, increase the resource requests/limits for the Prometheus pod. If on a VM, provision a larger instance.
      • Scale Out Prometheus: Implement sharding or use a federated Prometheus setup. This involves running multiple Prometheus instances, each scraping a subset of targets, and potentially a central aggregator.
    • Why it works: More resources allow Prometheus to process more data concurrently and faster. Sharding distributes the load across multiple instances.
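The idea behind sharding can be sketched as a deterministic hash of each target address, taken modulo the number of Prometheus instances. Prometheus provides a hashmod relabel action for this purpose; the hash below is a stand-in chosen for illustration and is not claimed to reproduce the assignments Prometheus itself would make. The addresses and shard count are hypothetical.

```python
import hashlib


def shard_for(address: str, num_shards: int) -> int:
    """Map a target address to a shard index in [0, num_shards)."""
    digest = hashlib.md5(address.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % num_shards


def partition(targets: list[str], num_shards: int) -> dict[int, list[str]]:
    """Group targets by shard so each instance scrapes only its own subset."""
    shards: dict[int, list[str]] = {i: [] for i in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards


if __name__ == "__main__":
    # Hypothetical fleet of ten Node Exporter endpoints split across three shards.
    fleet = [f"192.168.1.{i}:9100" for i in range(10, 20)]
    for shard, members in partition(fleet, num_shards=3).items():
        print(shard, members)
```

Because the assignment depends only on the address, every instance can compute it independently and each target ends up on exactly one shard.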
  2. Excessive Metric Cardinality:

    • Diagnosis: Track prometheus_tsdb_head_series, which reports the number of active series in the head block, alongside go_memstats_heap_alloc_bytes for heap growth. The "Status" -> "TSDB Status" page in the Prometheus UI shows which metric names and labels contribute the most series. High cardinality means a very large number of unique time series (e.g., http_requests_total{method="GET", path="/users/12345", status="200"} vs. http_requests_total{method="GET", path="/users/67890", status="200"}). Look for metrics with labels that have highly variable values (like user IDs, request IDs, or IP addresses).
    • Fix:
      • Reduce Cardinality at the Source: Modify your applications or Node Exporter configurations to reduce the number of labels or the variability of label values. For Node Exporter, you might disable collectors that generate high-cardinality metrics (e.g., textfile collector with dynamically generated files).
      • Use metric_relabel_configs: In your prometheus.yml, you can drop specific metrics or labels after the scrape but before they are ingested.
        scrape_configs:
          - job_name: 'node_exporter'
            metric_relabel_configs:
              # Drop whole metrics by name (__name__ holds the metric name)
              - source_labels: [__name__]
                regex: 'node_scrape_collector_.*'
                action: drop
              # OR drop one high-cardinality label from every series
              - regex: 'request_id' # Hypothetical label name
                action: labeldrop
        
    • Why it works: Fewer unique time series means less data to store, process, and index, dramatically reducing memory and CPU load.
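A rough way to see why one unbounded label matters: the worst-case series count for a metric is the product of the number of distinct values of each label. The sketch below uses made-up label counts for http_requests_total; it gives an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


if __name__ == "__main__":
    # Hypothetical label sets for http_requests_total.
    templated = {"method": 4, "path": 20, "status": 6}      # path is a route template
    per_user = {"method": 4, "path": 50_000, "status": 6}   # path embeds a user ID
    print("templated paths:", series_count(templated))
    print("per-user paths: ", series_count(per_user))
```

Replacing the raw path with a route template such as /users/:id keeps the path label bounded, which is the kind of change "Reduce Cardinality at the Source" refers to.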
  3. Scrape Interval Too Short:

    • Diagnosis: Check your scrape_interval configuration in prometheus.yml. If it’s too short (e.g., 5s) for a large number of targets, Prometheus might not finish scraping before the next interval begins.
    • Fix: Increase the scrape_interval. For example, change scrape_interval: 15s to scrape_interval: 30s or even 60s.
      scrape_configs:
        - job_name: 'node_exporter'
          scrape_interval: 30s # Per-job override; the global default is 1m
          static_configs:
            - targets: ['192.168.1.10:9100', '192.168.1.11:9100']
      
    • Why it works: Prometheus staggers scrapes across the interval, so a longer interval means fewer samples ingested per second and less pressure on the scrape loop. The trade-off is coarser resolution in your graphs and alerts.
  4. Short Scrape Timeout:

    • Diagnosis: Prometheus has a default scrape_timeout of 10 seconds. If individual targets are slow to respond (e.g., due to high load on the VM itself), Prometheus will time them out, leading to scrape failures.
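    • Fix: Increase the scrape_timeout for the job. It must not exceed the job’s scrape_interval, or Prometheus will reject the configuration.
      scrape_configs:
        - job_name: 'node_exporter'
          scrape_interval: 30s # Must be >= scrape_timeout
          scrape_timeout: 20s # Default is 10s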
          static_configs:
            - targets: ['192.168.1.10:9100', '192.168.1.11:9100']
      
    • Why it works: Gives slow targets more time to respond before Prometheus gives up on them.
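When raising a timeout, it helps to check it against the interval first, because Prometheus refuses a configuration whose scrape_timeout is greater than its scrape_interval. The helper below is a simplified sketch: it only understands single-unit durations such as 20s or 1m, whereas Prometheus accepts a richer duration syntax. The function names are hypothetical.

```python
import re

# Seconds per supported unit (a deliberately small subset).
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
_DURATION = re.compile(r"^(\d+)(ms|s|m|h)$")


def to_seconds(duration: str) -> float:
    """Parse a single-unit duration such as '20s' or '1m'."""
    match = _DURATION.match(duration)
    if not match:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def timeout_fits(scrape_interval: str, scrape_timeout: str) -> bool:
    """True when the timeout does not exceed the interval."""
    return to_seconds(scrape_timeout) <= to_seconds(scrape_interval)


if __name__ == "__main__":
    print(timeout_fits("30s", "20s"))  # timeout shorter than interval
    print(timeout_fits("15s", "20s"))  # timeout longer than interval
```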
  5. Inefficient Rule Evaluation:

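    • Diagnosis: Complex or numerous recording and alerting rules can consume significant CPU. Prometheus evaluates each rule group on its own interval, which defaults to the global evaluation_interval. The prometheus_rule_group_last_duration_seconds metric shows how long each group takes to evaluate.
    • Fix: Optimize your rules. Break down complex rules into simpler ones, aggregate with by or without clauses so each rule produces only the series you need, and ensure your evaluation_interval is reasonable (often the same as scrape_interval or longer).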
      rule_files:
        - "rules/*.yml"
      
      # In a rules file (e.g., rules/node.yml)
      groups:
        - name: node_rules
          interval: 30s # Match scrape_interval or be longer
          rules:
            - record: instance:node_cpu_seconds:rate5m
              expr: sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
            # ... other rules
      
    • Why it works: Optimized rules require less CPU to compute, freeing up resources for scraping.
  6. Under-provisioned Storage (Disk I/O):

    • Diagnosis: Monitor disk I/O on the Prometheus server. High disk utilization, latency, or slow write speeds can indicate a storage bottleneck. Prometheus’s own metrics (prometheus_tsdb_wal_fsync_duration_seconds, prometheus_tsdb_compaction_duration_seconds) can help.
    • Fix: Use faster storage (SSDs), provision more I/O capacity for your storage (e.g., provisioned IOPS for cloud block storage), or adjust the --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration flags (though this is advanced tuning and rarely needed).
    • Why it works: Faster disk I/O lets Prometheus sync its write-ahead log and compact blocks without stalling, which removes storage as an ingestion bottleneck.

If you’ve addressed all these, the next problem you’ll likely encounter is Prometheus’s own internal memory management under extremely high ingest rates, where it might struggle to efficiently manage its TSDB head block or write-ahead log (WAL).
