An API’s latency under load isn’t just about how fast it responds; it’s a measure of how well its underlying infrastructure scales to meet concurrent demand.
Let’s see this in action. Imagine a simple GET request to /users/{id}.
```bash
# Simulate 100 concurrent requests
for i in {1..100}; do
  curl -s -o /dev/null -w "%{time_total}\n" http://api.example.com/users/123 &
done
wait  # block until every background curl has finished
```
If the average `time_total` is high, say over 500ms (curl prints the value in seconds, so anything above 0.5), we have a problem.
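The same measurement can be sketched in Python so the timings are summarized rather than just printed. This is a minimal sketch, not a load-testing tool: `run_load` and `summarize` are hypothetical helper names, and `fetch` is any callable that performs one request (for example, a function wrapping `urllib.request.urlopen`). The demo at the bottom uses a 10ms sleep as a stand-in so it runs without a network.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(fetch, total=100, concurrency=100):
    """Call fetch() `total` times across `concurrency` threads; return latencies in seconds."""

    def timed_call(_):
        start = time.perf_counter()
        fetch()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total)))


def summarize(latencies):
    """Return mean, p50, p95, and max in milliseconds."""
    ordered = sorted(latencies)
    # With n=100, quantiles() returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(ordered, n=100)
    return {
        "mean_ms": statistics.mean(ordered) * 1000,
        "p50_ms": cuts[49] * 1000,
        "p95_ms": cuts[94] * 1000,
        "max_ms": ordered[-1] * 1000,
    }


if __name__ == "__main__":
    # Stand-in for a real HTTP call: sleeps 10ms instead of hitting the network.
    stats = summarize(run_load(lambda: time.sleep(0.01), total=100, concurrency=20))
    print(stats)
```

Reporting p95 and max next to the mean helps because a handful of very slow requests can hide behind an acceptable average.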
The core issue is often a bottleneck in one or more of these areas: database queries, application code execution, network I/O, or external service dependencies. Under load, these bottlenecks become amplified. A single slow database query, if executed by many concurrent requests, can halt the entire system.
Database Bottlenecks
- Diagnosis: Use your database’s slow query log or performance monitoring tools (e.g., `pg_stat_statements` for PostgreSQL, `SHOW PROFILE` for MySQL). Look for queries exceeding 100ms.
- Cause: Inefficient queries, missing indexes, or insufficient database connection pooling.
- Fix (Indexing): If a query like `SELECT * FROM orders WHERE user_id = X` is slow, create an index: `CREATE INDEX idx_orders_user_id ON orders (user_id);`. This allows the database to find rows much faster without scanning the entire table.
- Fix (Connection Pooling): Ensure your application’s connection pool is adequately sized. For a service handling 500 requests per second, a pool of 50-100 connections might be appropriate, depending on query execution times. Check your web framework or application server’s configuration (e.g., `maximumPoolSize` in HikariCP for Java). This prevents the overhead of establishing a new database connection for every request.
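The 50-100 range for 500 requests per second is consistent with Little's Law: the average number of connections in use equals the request rate multiplied by the time each request holds a connection. A rough sketch of that arithmetic follows; `pool_size_estimate` and its `headroom` factor are illustrative assumptions, not a tuning rule from any particular pool library.

```python
import math


def pool_size_estimate(requests_per_second, hold_time_ms, headroom=1.0):
    """Estimate connections in use via Little's Law: L = arrival rate x time in system.

    hold_time_ms is how long one request keeps a connection checked out.
    headroom is a hypothetical safety multiplier for bursts (1.0 = none).
    """
    in_use = requests_per_second * hold_time_ms / 1000
    return math.ceil(in_use * headroom)


# 500 requests/second, each holding a connection for 100ms or 200ms.
print(pool_size_estimate(500, 100))  # 50
print(pool_size_estimate(500, 200))  # 100
```

Under this estimate, requests that hold a connection for only 20ms would keep about 10 connections busy, which is why measured query times matter more than request rate alone.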
Application Code Inefficiencies
- Diagnosis: Profile your application code using tools like `pprof` for Go, `cProfile` for Python, or APM tools (Datadog, New Relic). Identify functions consuming the most CPU time or holding locks for extended periods.
- Cause: Expensive computations, blocking I/O operations within request handlers, or excessive object allocation.
- Fix (Asynchronous Operations): If your API makes multiple sequential external HTTP calls, refactor them to run concurrently. In Node.js, this might involve `Promise.all()`. In Python, use `asyncio` or `concurrent.futures`. This allows the CPU to work on other tasks while waiting for I/O to complete.
- Fix (Caching): Cache frequently accessed, rarely changing data. For example, if user profile data is requested often but doesn’t change rapidly, cache it in Redis or Memcached. Set an appropriate TTL, like 300 seconds. This bypasses expensive database lookups or computations entirely for cache hits.
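To illustrate the asynchronous fix, here is a small sketch comparing sequential and concurrent calls with `asyncio.gather`. The three dependencies and their 50ms delays are made up for the example, and `asyncio.sleep` stands in for real HTTP calls so the block runs without a network.

```python
import asyncio
import time


async def fetch_dependency(name, delay_s):
    """Stand-in for an external HTTP call; asyncio.sleep simulates network wait."""
    await asyncio.sleep(delay_s)
    return f"{name}: ok"


async def handle_request_sequential():
    # Each await finishes before the next call starts: total is the sum of the delays.
    profile = await fetch_dependency("profile", 0.05)
    orders = await fetch_dependency("orders", 0.05)
    recommendations = await fetch_dependency("recommendations", 0.05)
    return [profile, orders, recommendations]


async def handle_request_concurrent():
    # gather starts all three calls at once: total is roughly the slowest delay.
    return await asyncio.gather(
        fetch_dependency("profile", 0.05),
        fetch_dependency("orders", 0.05),
        fetch_dependency("recommendations", 0.05),
    )


def timed(coro_fn):
    """Run an async handler to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, sequential_s = timed(handle_request_sequential)
    _, concurrent_s = timed(handle_request_concurrent)
    print(f"sequential: {sequential_s * 1000:.0f}ms, concurrent: {concurrent_s * 1000:.0f}ms")
```

The sequential handler takes about the sum of the three delays, while the concurrent one takes about as long as the slowest single call. This only helps when the calls are independent of each other.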
Network Latency and I/O
- Diagnosis: Use tools like `tcpdump` or Wireshark to inspect network traffic between your API service and databases/external services. Measure round-trip times. Check your load balancer’s metrics for connection duration and request queuing.
- Cause: Network congestion, inefficient serialization/deserialization, or a single point of failure in the network path.
- Fix (HTTP Keep-Alive): Ensure HTTP Keep-Alive is enabled on your web server and client libraries. This allows multiple requests to be sent over a single, persistent TCP connection, avoiding a new TCP handshake and TLS negotiation for each request. Default idle timeouts vary by server (5 seconds in Apache and Node.js, 75 seconds in nginx), so check the value your stack actually uses.
- Fix (Payload Size): Reduce the size of request and response payloads. Avoid sending unnecessary fields. For large responses, consider pagination with a default `limit` of 50 and `offset` of 0. This reduces the amount of data that needs to be transmitted over the network.
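A minimal sketch of limit/offset pagination with those defaults might look like the following. The `paginate` helper, the `max_limit` cap, and the shape of the returned dictionary are assumptions for illustration; a real endpoint would usually push `limit` and `offset` into the database query rather than slice a list in memory.

```python
def paginate(items, limit=50, offset=0, max_limit=200):
    """Return one page of items plus the metadata a client needs to fetch the next page."""
    # Clamp client-supplied values so one request cannot ask for the whole dataset.
    limit = max(1, min(limit, max_limit))
    offset = max(0, offset)
    page = items[offset:offset + limit]
    # next_offset is None on the last page so clients know when to stop.
    next_offset = offset + limit if offset + limit < len(items) else None
    return {
        "items": page,
        "limit": limit,
        "offset": offset,
        "next_offset": next_offset,
        "total": len(items),
    }


users = [{"id": i} for i in range(120)]
first = paginate(users)
print(len(first["items"]), first["next_offset"])  # 50 50
last = paginate(users, offset=100)
print(len(last["items"]), last["next_offset"])    # 20 None
```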
External Service Dependencies
- Diagnosis: Monitor the latency and error rates of all external services your API depends on. Tools like Datadog or Prometheus with `blackbox_exporter` can help.
- Cause: Slow responses from third-party APIs, rate limiting, or network issues between your service and the dependency.
- Fix (Timeouts and Retries): Implement aggressive timeouts for external calls (e.g., 200ms for a critical external API). Use a backoff strategy for retries, but limit the number of retries (e.g., max 3 retries). This prevents your API from being held hostage by a slow or unresponsive external service.
- Fix (Circuit Breaker): Implement a circuit breaker pattern. If an external service consistently fails or times out, the circuit breaker "opens," and subsequent calls to that service are immediately rejected without attempting the network call. This allows the external service time to recover and prevents cascading failures.
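The circuit breaker described above can be sketched in a few lines. This is a simplified, single-threaded illustration rather than a production implementation: the class name, the thresholds, and the injectable `clock` parameter are all assumptions made for the example, and a real service would more likely use an existing resilience library and add locking.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout_s:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency unavailable, call skipped")
            # Timeout elapsed: half-open, so let this trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip (or re-trip after a failed trial call) and restart the timer.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A handler would wrap each outbound call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and translate `CircuitOpenError` into a fallback response or a fast error.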
Concurrency Limits
- Diagnosis: Check your application server’s or language runtime’s concurrency limits. For example, Go’s `GOMAXPROCS` or thread pool sizes in Java application servers.
- Cause: The system is configured to handle fewer concurrent requests than are being sent to it, leading to request queuing and increased latency.
- Fix (Increase Workers/Threads): If requests are queuing while CPU cores sit idle (for example, a single worker process saturating one core of a multi-core machine), increase the number of worker processes or threads. For CPU-bound work on a multi-core server, setting the number of workers equal to the number of CPU cores is a common starting point. For example, in Gunicorn (Python), use `--workers 4` for a 4-core machine. This allows the application to process more requests in parallel.
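Instead of hard-coding the worker count on the command line, the same setting can be derived from the machine in a Gunicorn configuration file, which is plain Python. The sketch below assumes a file named `gunicorn.conf.py`; the bind address is a placeholder.

```python
# gunicorn.conf.py: Gunicorn reads its settings from module-level names in this file.
import multiprocessing

cores = multiprocessing.cpu_count()

# One worker per core, the starting point described above for CPU-bound handlers.
# Gunicorn's own documentation suggests (2 x cores) + 1 as a general starting point.
workers = cores

# Address to listen on; a placeholder for this sketch.
bind = "127.0.0.1:8000"
```

It would be started with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a hypothetical module and WSGI callable. Treat either formula as a first guess and adjust after measuring under load.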
The most common pitfall is treating latency as a single metric. An average hides the slowest requests, and small per-request costs compound once requests start to queue: a 10ms increase in database query time can translate to a 100ms increase in API response time if the database is the primary bottleneck and requests are piling up behind it.
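One way to see how a small slowdown gets amplified is a single-server queueing model. The sketch below assumes an M/M/1 queue (random arrivals, one server, exponentially distributed service times), where mean time in the system is $W = S / (1 - \lambda S)$ for service time $S$ and arrival rate $\lambda$. This is a deliberate simplification for illustration, not a measurement from a real system, and `mm1_response_time_ms` is a hypothetical helper.

```python
def mm1_response_time_ms(arrival_rate_per_s, service_time_ms):
    """Mean time in system (queueing + service) for an M/M/1 queue, in milliseconds."""
    utilization = arrival_rate_per_s * service_time_ms / 1000
    if utilization >= 1:
        raise ValueError("arrival rate exceeds capacity; the queue grows without bound")
    return service_time_ms / (1 - utilization)


# 40 queries per second arriving at a database that serves one query at a time.
before = mm1_response_time_ms(40, 10)  # 10ms queries -> 40% utilization
after = mm1_response_time_ms(40, 20)   # 20ms queries -> 80% utilization
print(f"{before:.1f}ms -> {after:.1f}ms")
```

In this model, queries that take 10ms longer push mean response time from roughly 17ms to roughly 100ms, because most of the added time is spent waiting in the queue rather than executing.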
The next challenge you’ll face is understanding how caching strategies for your API responses can significantly impact both latency and the load on your backend services.