Performance Production Checklist: 50-Point Guide (2026)

This checklist isn’t about making your system fast; it’s about making sure it’s not unintentionally slow.

Let’s look at a typical production deployment in action. Imagine we’re setting up a new web service that needs to serve dynamic content.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
  labels:
    app: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app-container
        image: my-registry/my-web-app:v1.2.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20

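This Deployment defines three replicas of our my-web-app service. Each container requests 200 millicores of CPU and 256 MiB of memory, and is limited to 500 millicores and 512 MiB. It also configures a readiness probe on /health/ready (first check after 5 seconds, then every 10 seconds) and a liveness probe on /health/live (first check after 15 seconds, then every 20 seconds).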

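The core problem this setup solves is running and scaling stateless applications. Because the Deployment declares a desired replica count, a pod that is deleted or lost with its node is replaced automatically, and a container that crashes is restarted in place by the kubelet. Resource requests tell the scheduler how much CPU and memory to reserve for each pod, while limits cap what a container can actually use, so a single application cannot consume all available node resources. The readiness probe ensures traffic is only sent to instances that are ready to serve, and the liveness probe restarts containers that have stopped responding.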

Internally, the kube-scheduler decides which node a new pod runs on by filtering out nodes that cannot fit the pod’s resource requests and scoring the rest. The kubelet on each node manages the lifecycle of the pods assigned to it: starting, stopping, and monitoring their containers. The Deployment controller, which runs inside kube-controller-manager, watches the Deployment object and manages a ReplicaSet for it; the ReplicaSet controller in turn ensures the actual state (running pods) matches the desired state (3 replicas).

Here’s the mental model:

Desired State: You declare what you want (e.g., 3 replicas of my-web-app with specific resources).
Control Loop: Kubernetes controllers (like the deployment controller) continuously compare the desired state with the actual state.
Reconciliation: If there’s a drift (e.g., only 2 replicas are running), the controller takes action to bring the actual state back to the desired state (starts a new pod).
Node Management: The kubelet on each node is the agent that executes the desired pod state on that specific machine.
Health Checks: Probes (liveness, readiness) are the system’s eyes and ears, telling the kubelet when an application is misbehaving or unavailable.
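
To make the desired-versus-actual comparison concrete, here is a sketch of what the Deployment object might look like while the control loop is closing a gap. The field names are real fields of a Deployment’s spec and status, but the values are illustrative assumptions, not output captured from a real cluster. It represents a moment where a replacement pod has been created but is not yet passing its readiness probe.

```yaml
# Illustrative values only: my-web-app observed mid-reconciliation.
spec:
  replicas: 3              # desired state, as declared in the manifest
status:
  observedGeneration: 1    # the controller has processed the current spec
  replicas: 3              # pods that exist, including the new replacement
  updatedReplicas: 3       # pods running the current pod template
  readyReplicas: 2         # pods currently passing their readiness probe
  availableReplicas: 2     # pods ready long enough to count as available
  unavailableReplicas: 1   # the gap the control loop is still closing
```

Once the replacement pod becomes ready, readyReplicas and availableReplicas return to 3 and unavailableReplicas drops out of the status. You can inspect these fields with kubectl get deployment my-web-app -o yaml.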

The most surprising thing is how much of "performance tuning" is actually about resource allocation and contention management rather than application code optimization. For instance, setting CPU limits too low on a bursty application can lead to constant throttling, making it appear slow even if the application code is perfectly efficient. The container runtime on the node (containerd or CRI-O), under the direction of the kubelet, configures cgroups for each container, and the Linux kernel enforces them. A CPU limit becomes a quota of CPU time per scheduling period (100 ms by default); once a container has used its quota within a period, its threads are paused until the next period begins, which shows up as latency spikes.
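
A rough worked example shows how this turns into latency. It assumes the 500m limit from the manifest above, the default 100 ms quota period, and a hypothetical request that needs 120 ms of CPU time on a single thread, starting at the beginning of a period with a full quota and no other work in the container. Real workloads will differ.

```latex
\begin{align*}
\text{quota per period} &= 0.5\ \text{CPU} \times 100\ \text{ms} = 50\ \text{ms} \\
\text{periods needed}   &= \lceil 120\ \text{ms} / 50\ \text{ms} \rceil = 3 \\
\text{wall-clock time}  &= 2 \times 100\ \text{ms} + 20\ \text{ms} = 220\ \text{ms}
\end{align*}
```

Under these assumptions the request takes 220 ms instead of 120 ms, even though the node may have idle CPU. Multi-threaded applications burn through the quota faster: four busy threads would exhaust the same 50 ms quota in 12.5 ms of wall-clock time and then sit paused for the remaining 87.5 ms of the period.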

When you scale this out, you start thinking about horizontal pod autoscaling based on CPU or memory utilization, and how that interacts with node autoscaling to ensure you have enough underlying capacity.
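
As a starting point, a HorizontalPodAutoscaler for the Deployment above might look like the sketch below. The maxReplicas value and the 70% target are assumptions chosen for illustration, not recommendations; tune them against your own traffic. It also assumes a metrics source such as metrics-server is running in the cluster.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-web-app
  minReplicas: 3            # matches the replica count in the Deployment
  maxReplicas: 10           # assumed ceiling; size it to your node capacity
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the CPU request, not the limit
```

Utilization here is measured against the container’s CPU request, so with a 200m request a 70% target means the autoscaler adds replicas once average usage across pods rises above roughly 140m. This is one more reason to set requests deliberately: they drive both scheduling and scaling decisions.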