A shard failure in a distributed database doesn’t just take down a piece of data; it can cascade and halt the entire application if not handled with extreme care.

Let’s say you have a sharded PostgreSQL cluster managing user data, with shards spread across three nodes. Each shard is responsible for a range of user IDs. Suddenly, the node hosting shard_2 (responsible for user IDs 2000-3999) becomes unresponsive.

Sharding Failure Scenarios: Partial Outage and Recovery

When shard_2 goes down, here’s what can happen and how to fix it.

1. The Application Grinds to a Halt (and you didn’t expect it)

  • What broke: The application’s connection pool, designed to talk to all shards, starts failing requests. It’s not that the entire database is down, but any request touching shard_2 (e.g., fetching user profile user_id=3500) will hang indefinitely, eventually timing out and returning an error to the user. The interesting part is that other shards (shard_1, shard_3) are perfectly fine, but the application can’t serve any user, even those on healthy shards.

  • Common Causes & Fixes:

    • Cause: Node Failure/Network Partition. The physical machine hosting shard_2 crashed, or a network switch failed, making the node unreachable.

      • Diagnosis: On your application servers, check logs for connection timeouts to the specific database host. On the database side, attempt to ssh to the affected node. If you can’t reach it, it’s a network or hardware issue.
      • Fix: If it’s a hardware failure, replace the faulty hardware. If it’s a network issue, fix the network path. Once the node is back online and reachable, the application’s connection pool should automatically re-establish connections.
      • Why it works: The application’s connection pool is typically configured with health checks. When the node becomes available, the pool detects it and starts routing traffic again.
    • Cause: Database Process Crash. The PostgreSQL process on the shard_2 node crashed due to a bug, out-of-memory error, or data corruption.

      • Diagnosis: ssh into the affected node. Check PostgreSQL logs (/var/log/postgresql/postgresql-X.Y-main.log or similar) for crash messages, OOM killer entries, or FATAL errors indicating corruption. Use systemctl status postgresql to see if the service is running.
      • Fix: Restart the PostgreSQL service: sudo systemctl restart postgresql. If the crash is due to data corruption, you’ll need to restore from a backup or use point-in-time recovery.
      • Why it works: Restarting the PostgreSQL service brings the database process back online, allowing it to accept connections and serve queries for shard_2.
    • Cause: Resource Exhaustion (CPU/Memory/Disk). The node hosting shard_2 is overloaded, causing PostgreSQL to become unresponsive or crash.

      • Diagnosis: ssh into the affected node. Use top, htop, free -m, and df -h to check CPU, memory, and disk usage. Look for PostgreSQL processes consuming excessive resources. Check PostgreSQL logs for errors related to out of memory or disk full.
      • Fix:
        • CPU/Memory: Identify and optimize or terminate runaway queries. If the load is consistently high, consider upgrading the node’s resources or rebalancing shards.
        • Disk: Free up disk space by deleting old logs, vacuuming tables, or archiving data. If disk is consistently full, add more storage or move data.
      • Why it works: By resolving the resource contention, PostgreSQL can operate normally, respond to requests, and maintain its connections.
    • Cause: Lock Contention/Deadlock. A long-running transaction or a deadlock on shard_2 is blocking all other queries, making the shard appear unresponsive.

      • Diagnosis: Connect to the affected shard’s PostgreSQL instance (if possible, or through a healthy replica if configured). Run SELECT * FROM pg_stat_activity WHERE state = 'active'; to find long-running queries. Check SELECT * FROM pg_locks; for lock information and potential deadlocks.
      • Fix:
        • Long-running query: If it’s a legitimate but slow query, optimize it. If it’s stuck, you may need to pg_cancel_backend(pid) or pg_terminate_backend(pid) the offending process.
        • Deadlock: PostgreSQL usually detects and resolves deadlocks automatically by terminating one of the involved transactions. If it’s not resolving, you might need to manually identify and terminate the blocking sessions.
      • Why it works: Releasing the locks or terminating the blocking process allows other queries to proceed, unblocking the shard.
    • Cause: Configuration Errors. Incorrect PostgreSQL configuration on the shard_2 node (e.g., max_connections set too low, incorrect network binding).

      • Diagnosis: Review postgresql.conf on the affected node. Check listen_addresses to ensure it’s bound to the correct network interface and max_connections to ensure it’s sufficient for your workload.
      • Fix: Adjust postgresql.conf (e.g., listen_addresses = '*' or a specific IP, max_connections = 500) and reload the PostgreSQL configuration: sudo systemctl reload postgresql.
      • Why it works: Correcting the configuration allows PostgreSQL to listen on the network and accept the expected number of connections.
    • Cause: Underlying Storage Issues. The disk array or storage system hosting shard_2’s data files is experiencing errors or performance degradation.

      • Diagnosis: Check system logs (/var/log/syslog, dmesg) on the affected node for disk-related errors (e.g., I/O errors, disk timeouts). If using a SAN or NAS, check the storage system’s own health monitoring.
      • Fix: Depending on the storage technology, this might involve replacing a failing drive, reconfiguring the RAID array, or addressing network issues to the storage.
      • Why it works: Restoring the integrity and performance of the storage subsystem allows PostgreSQL to read and write data reliably.

2. Recovery and Resiliency

  • The Goal: Once shard_2 is back online and serving requests, the application should recover. However, a robust system doesn’t just recover; it aims for high availability.
  • Replication: If you have read replicas for shard_2, the application might be able to failover to a replica if the primary becomes unavailable. This requires application-level logic or a proxy (like PgBouncer or a custom load balancer) to detect primary failure and switch to a replica.
  • Automatic Failover: For true resilience, consider setting up automatic failover mechanisms. Tools like Patroni, Pacemaker, or cloud-provider managed services can detect primary failure and promote a replica to become the new primary for shard_2, minimizing downtime.
  • Application-Level Shard Mapping: Ensure your application’s shard mapping logic is resilient. If the shard map itself is stored in a database that becomes unavailable, your application won’t know which shard to query even if the shards are technically running. Consider a distributed cache (like Redis or Memcached) or a separate, highly available configuration store for shard mappings.

The next error you’ll hit after fixing all these is likely a Too many clients error if your max_connections setting is still too low for your traffic, or a more subtle performance degradation if the underlying cause was resource exhaustion and it hasn’t been fully addressed.

Want structured learning?

Take the full Sharding course →