A common misconception is that simply increasing hardware resources directly translates to higher request throughput.

Consider a hypothetical web service that handles API requests through a few independent microservices behind a load balancer. A typical exchange looks like this:

GET /users/123 HTTP/1.1
Host: api.example.com
Authorization: Bearer <token>

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 512
Date: Tue, 15 Oct 2024 10:00:00 GMT

{
  "id": 123,
  "name": "Alice",
  "email": "alice@example.com",
  "orders": [
    {"id": 101, "item": "Book", "price": 25.00},
    {"id": 102, "item": "Pen", "price": 2.50}
  ]
}

This request might involve calling a User Service to fetch user details, then an Order Service to fetch recent orders, and finally aggregating them before returning a response. Each service has its own database, potentially its own caching layer, and its own processing logic. The load balancer distributes incoming requests across multiple instances of these services.

To maximize requests per second, we need to understand the bottleneck. This is the component that, when its capacity is reached, limits the overall throughput. It could be CPU on a specific service, I/O on a database, network bandwidth, or even contention for shared resources like locks.

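Consider a scenario where the User Service appears to be the bottleneck because its responses are slow. If its CPU is genuinely maxed out, adding instances or CPU helps. If the instances are instead mostly waiting on the same under-provisioned database, adding more of them will not help, and it can make things worse by piling extra connections and queries onto that database.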
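
A minimal sketch of this effect, under assumed numbers: the model below treats the tier's capacity as the smaller of the combined instance capacity and the capacity of a shared database. The function name `max_throughput` and the figures (200 requests per second per instance, 500 for the database) are hypothetical and chosen only for illustration.

```python
def max_throughput(instances: int, rps_per_instance: float, db_capacity_rps: float) -> float:
    """Sustained requests/second when every request also hits one shared database."""
    service_capacity = instances * rps_per_instance
    # The slower of the two stages caps the whole request path.
    return min(service_capacity, db_capacity_rps)


# Hypothetical capacities: 200 req/s per instance, 500 req/s for the database.
for n in (1, 2, 3, 4, 8):
    print(f"{n} instance(s): {max_throughput(n, 200.0, 500.0):.0f} req/s")
```

In this model throughput climbs until the third instance and then stays flat at the database's limit, however many instances are added.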

The key is to identify the true bottleneck and address it. This often involves a multi-pronged approach:

  1. Profiling and Monitoring: Use tools like Prometheus, Grafana, Datadog, or New Relic to gather metrics on CPU, memory, I/O, network, and application-level metrics like request latency and error rates for each service. Identify which service consistently shows high utilization or latency under load.

    • Diagnosis: kubectl top pods -n your-namespace --containers to see container resource usage. kubectl describe pod <pod-name> -n your-namespace for resource requests/limits.
    • Fix: If the User Service container sits at 95%+ of its CPU limit, raise the limit in the Kubernetes deployment spec from resources: limits: cpu: "500m" to cpu: "1000m", provided the node has spare CPU to give.
    • Why it works: The container is throttled less often and gets more CPU time per request, so it can serve requests faster.
  2. Database Optimization: If database queries are slow, they can become the bottleneck. This could be due to inefficient queries, missing indexes, or insufficient database resources.

    • Diagnosis: Confirm that the slow query log is enabled and check its threshold (SHOW GLOBAL VARIABLES LIKE 'slow_query_log%'; and SHOW GLOBAL VARIABLES LIKE 'long_query_time%'; in MySQL), then review the queries it records. Use EXPLAIN on problematic queries.
    • Fix: Add an index to a frequently queried column: CREATE INDEX idx_user_email ON users (email);.
    • Why it works: Indexes allow the database to find rows much faster without scanning entire tables.
  3. Caching: Implementing caching at various levels (application, distributed cache like Redis or Memcached, CDN) can significantly reduce load on backend services and databases.

    • Diagnosis: Monitor cache hit/miss ratios. If the cache hit ratio is low, requests are not being served from the cache effectively.
    • Fix: Configure Redis to store user profiles for 5 minutes. SETEX takes the TTL in seconds before the value: SETEX user:123 300 <user_profile_json>.
    • Why it works: Frequently accessed data is served directly from fast in-memory cache, bypassing slower database or service calls.
  4. Asynchronous Processing: For non-critical or long-running tasks (like sending emails or generating reports), offload them to background workers using message queues (e.g., RabbitMQ, Kafka, SQS).

    • Diagnosis: An endpoint that triggers such a task responds slowly because the work runs inside the request, tying up a request thread and delaying the requests queued behind it.
    • Fix: The API publishes a message to a queue: producer.send('email_queue', { userId: 123, email: 'alice@example.com', subject: 'Welcome!' }). A separate worker consumes and processes it.
    • Why it works: The primary request thread is freed up immediately, allowing it to handle more incoming requests, while the work is processed independently.
  5. Connection Pooling: Many applications open and close database connections for each request, which is very inefficient. Using connection pooling maintains a set of open connections.

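    • Diagnosis: High CPU usage on the application server related to thread management or frequent socket open/close operations, or a rapidly climbing connection count on the database (in MySQL, SHOW GLOBAL STATUS LIKE 'Connections';).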
    • Fix: Configure your database driver or ORM to use a connection pool with a reasonable size (e.g., 10-50 connections depending on expected concurrency): HikariConfig config = new HikariConfig(); config.setJdbcUrl("jdbc:mysql://localhost:3306/mydb"); config.setUsername("user"); config.setPassword("password"); config.setMaximumPoolSize(20); HikariDataSource dataSource = new HikariDataSource(config);.
    • Why it works: Reuses existing database connections, eliminating the overhead of establishing new connections for every request.
  6. Code Optimization: Inefficient algorithms or excessive object creation within the application code itself can lead to performance bottlenecks.

    • Diagnosis: Application-level profiling using tools like JProfiler, YourKit, or pprof (for Go). Look for hot spots in CPU usage or excessive memory allocation.
    • Fix: Refactor a nested loop that iterates millions of times into a hash-based set lookup: Change for i in list1: for j in list2: if i == j: to lookup = set(list2); for i in list1: if i in lookup:.
    • Why it works: Replaces an O(n*m) operation with one that is O(n+m) on average, drastically reducing execution time for large datasets.
  7. Network Latency: In distributed systems, network calls between services add latency. Reducing the number of network hops or optimizing network configurations can help.

    • Diagnosis: High latency reported for inter-service communication, even when individual services are performing well.
    • Fix: Colocate services that communicate frequently within the same availability zone or even the same Kubernetes node.
    • Why it works: Minimizes the physical distance data travels, reducing network transit time.
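
The caching step (item 3) can be sketched as a cache-aside read with a time-to-live. This is an illustrative in-process stand-in, not a Redis client: `TTLCache`, `get_or_load`, and `load_user_from_db` are hypothetical names, and the 300-second TTL mirrors the five-minute example above.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache with per-entry expiry (a stand-in for a shared cache)."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1          # fresh entry: serve it from memory
            return entry[1]
        self.misses += 1            # missing or expired: go to the slow source
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value


db_calls = 0


def load_user_from_db() -> dict:
    """Hypothetical slow lookup; counts how often the database is hit."""
    global db_calls
    db_calls += 1
    return {"id": 123, "name": "Alice"}


cache = TTLCache(ttl_seconds=300)
for _ in range(5):
    profile = cache.get_or_load("user:123", load_user_from_db)

print(f"hits={cache.hits} misses={cache.misses} db_calls={db_calls}")
```

Five reads of the same key produce one miss and four hits, so the database is queried once. The hit and miss counters are the same ratio the diagnosis bullet suggests monitoring.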

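The asynchronous-processing step (item 4) might look like the following sketch, which uses Python's standard-library `queue.Queue` and a thread as a stand-in for a real broker such as RabbitMQ, Kafka, or SQS. `handle_signup`, `email_worker`, and the message shape are assumptions made for illustration.

```python
import queue
import threading

email_queue = queue.Queue()
sent = []  # records delivered emails so the effect is observable


def email_worker() -> None:
    """Background consumer: drains the queue until it sees the None sentinel."""
    while True:
        message = email_queue.get()
        if message is None:
            email_queue.task_done()
            break
        sent.append(message["email"])  # stand-in for the slow send
        email_queue.task_done()


def handle_signup(user_id: int, email: str) -> dict:
    """Request handler: enqueue the slow work and return immediately."""
    email_queue.put({"userId": user_id, "email": email, "subject": "Welcome!"})
    return {"status": "accepted", "userId": user_id}


worker = threading.Thread(target=email_worker, daemon=True)
worker.start()

response = handle_signup(123, "alice@example.com")
print(response)

email_queue.join()     # wait for the worker, only so this demo can show the result
email_queue.put(None)  # tell the worker to stop
worker.join()
print(sent)
```

The handler's response does not depend on the email being sent. A production setup would add what an in-memory queue lacks, such as durability and retries.
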
Once you’ve addressed the primary bottleneck, expect to encounter the next one, often shifting to a different component or a more subtle issue like garbage collection pauses or thread contention.
