Kubernetes can feel like a black box when it comes to performance, but the core of its efficiency lies in how it manages resources for your applications.

Let’s see a simple example. Imagine you have a web application running in a Kubernetes cluster.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-webapp
  template:
    metadata:
      labels:
        app: my-webapp
    spec:
      containers:
      - name: web
        image: nginx:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "200m"
            memory: "256Mi"

This Deployment tells Kubernetes to run three replicas of an Nginx container. Crucially, the container specifies resources.requests and resources.limits. requests is the amount the scheduler reserves for the container when it places the pod on a node. limits is the maximum the container is allowed to consume. If a container exceeds its memory limit, it is killed (OOMKilled). If it tries to exceed its CPU limit, it is throttled.

This is the fundamental mechanism: Kubernetes uses these requests to decide where to place pods (scheduling) and how much resource to reserve for them. The limits define the boundaries of their behavior.

The problem this solves is resource contention and inefficient utilization. Without explicit requests and limits, pods could starve each other of CPU or memory, leading to unpredictable performance or outright crashes. Conversely, over-allocating resources wastes expensive cloud infrastructure.

Internally, the Kubernetes scheduler uses the requests to find a node with enough unreserved capacity. The kubelet on each node, working with the container runtime, then applies the limits through Linux control groups (cgroups), which the kernel uses to isolate and cap the resources of each container.
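To make the cgroup step concrete, here is a minimal sketch of how a CPU limit maps onto the kernel's CFS bandwidth quota. It assumes the default CFS period of 100 ms; the function name and constant are illustrative and not part of any Kubernetes API.

```python
# Sketch: translate a CPU limit in millicores into a CFS bandwidth quota.
# Assumption: the default CFS period of 100 ms is in effect on the node.

CFS_PERIOD_US = 100_000  # 100 ms, expressed in microseconds


def cpu_limit_to_cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Return the CPU time (in microseconds) a container may use per period."""
    return limit_millicores * period_us // 1000


# The 200m limit from the Deployment above allows 20 ms of CPU time
# in every 100 ms window; once that is used up, the container waits.
print(cpu_limit_to_cfs_quota_us(200))  # 20000
```

This is why hitting a CPU limit shows up as latency rather than as a crash: the container is paused until the next period begins.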

Tuning these values is an ongoing process. You start with educated guesses, monitor performance, and adjust.

  • Requests: These should reflect the typical resource needs of your application. If your app usually needs 200m of CPU to handle its workload, set requests.cpu to 200m. If it needs 512Mi of memory, set requests.memory to 512Mi.
  • Limits: These should be set higher than requests, accounting for peak loads and bursts. For a web server, you might set limits.cpu to 400m and limits.memory to 768Mi. This allows it to handle temporary spikes without being throttled or killed, but prevents it from hogging the node if something goes wrong.
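One way to turn monitoring data into a first guess is sketched below. This is only a rough heuristic, not a Kubernetes recommendation: the function name, the use of the median as "typical" usage, and the 1.5x headroom factor are all assumptions chosen for illustration.

```python
import math


def suggest_resources(samples, headroom=1.5):
    """Suggest (request, limit) from observed usage samples.

    samples:  usage readings in one unit (e.g. CPU millicores or MiB).
    headroom: hypothetical multiplier applied to the observed peak.
    """
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    typical = ordered[len(ordered) // 2]  # (upper) median as the typical need
    peak = ordered[-1]
    request = typical
    # Keep the limit at or above the request, with room above the peak.
    limit = max(request, math.ceil(peak * headroom))
    return request, limit


cpu_samples_m = [150, 180, 200, 210, 260]  # made-up readings in millicores
print(suggest_resources(cpu_samples_m))  # (200, 390)
```

Treat the output as a starting point for the monitor-and-adjust loop described above, not as a final answer.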

CPU is measured in millicores (m), where 1000m equals one full CPU core. Memory is typically given in mebibytes (Mi) or gibibytes (Gi). These are binary units: 1Mi is 1,048,576 bytes, which is slightly more than a decimal megabyte.
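The unit conventions can be checked with a small parser. This is a simplified sketch: the function names are made up, and it handles only the millicore suffix and the binary suffixes Ki, Mi, and Gi, whereas Kubernetes accepts more forms (for example the decimal suffixes M and G).

```python
def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity to cores: "100m" -> 0.1, "2" -> 2.0."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


# Binary suffixes only; an illustrative subset of what Kubernetes accepts.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity to bytes: "128Mi" -> 134217728."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already plain bytes


print(parse_cpu("100m"))      # 0.1
print(parse_memory("128Mi"))  # 134217728
```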

The magic happens when you have many pods. The scheduler sums the requests of all pods already placed on a node and compares that total against the node's allocatable capacity, which is the node's capacity minus what is set aside for the operating system and Kubernetes components. This prevents the node's requests from being overcommitted. If a node is full based on these aggregated requests, the scheduler won’t place more pods there, even if some CPU or memory is sitting idle at that moment.
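The request arithmetic can be sketched in a few lines. This models only the resource check; the real scheduler applies many other filters, and the Resources type and fits function here are hypothetical names, not scheduler code.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_m: int   # CPU in millicores
    mem_mi: int  # memory in mebibytes


def fits(allocatable: Resources, scheduled: list, new_pod: Resources) -> bool:
    """True if the new pod's requests fit next to those already scheduled."""
    reserved_cpu = sum(r.cpu_m for r in scheduled)
    reserved_mem = sum(r.mem_mi for r in scheduled)
    return (
        reserved_cpu + new_pod.cpu_m <= allocatable.cpu_m
        and reserved_mem + new_pod.mem_mi <= allocatable.mem_mi
    )


node = Resources(cpu_m=1000, mem_mi=1024)  # made-up allocatable values
web = Resources(cpu_m=100, mem_mi=128)     # the requests from my-webapp

print(fits(node, [web] * 7, web))  # True: the eighth pod fills memory exactly
print(fits(node, [web] * 8, web))  # False: memory requests would exceed 1024Mi
```

Note that the second check fails on memory even though 200m of CPU is still unreserved: every requested resource has to fit.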

The most surprising aspect for many is that exceeding CPU limits doesn’t kill your pod; it just makes it slow. The scheduler, meanwhile, decides whether a node can accept a new pod from the requests already reserved there, not from current utilization. This means a node can be mostly idle and still unable to take new pods, because the requests of the pods it already runs, plus those of the new pod, would exceed its allocatable capacity.
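The gap between reserved and actually used capacity is easy to see with made-up numbers. The following sketch uses a hypothetical node_summary helper and invented usage figures purely for illustration.

```python
def node_summary(allocatable_m: int, pods: list) -> dict:
    """Summarize a node's CPU.

    pods: list of (requested_millicores, actually_used_millicores) tuples.
    """
    requested = sum(req for req, _ in pods)
    used = sum(use for _, use in pods)
    return {
        "reserved_pct": 100 * requested // allocatable_m,
        "used_pct": 100 * used // allocatable_m,
        "free_for_scheduling_m": allocatable_m - requested,
    }


# Eight pods each request 500m but only use about 50m in practice.
pods = [(500, 50)] * 8
summary = node_summary(4000, pods)
print(summary)
# {'reserved_pct': 100, 'used_pct': 10, 'free_for_scheduling_m': 0}
```

The node is 10% busy yet has nothing left to offer the scheduler, which is usually a sign that requests are set well above typical usage.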

The next rabbit hole you’ll fall down is understanding how autoscaling works based on these resource metrics.
