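A DynamoDB partition doesn’t grow past fixed limits (about 10 GB of data, 3,000 read capacity units, and 1,000 write capacity units); you manage how data and traffic spread across partitions by carefully choosing your partition key.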

Let’s say you’ve got a table tracking user activity, and you’re seeing errors like ProvisionedThroughputExceededException or even just slow reads and writes. This usually means one or more of your partitions are getting hammered, while others are sitting idle.

Here’s how you can diagnose and fix a hot partition problem:

1. Identify the Hot Partition

The first step is to figure out which partition is the problem. DynamoDB doesn’t expose partition IDs directly, but you can infer it from the PartitionKey value.

Diagnosis:

Use CloudWatch metrics for your table. Look at ReadThrottleEvents and WriteThrottleEvents. If these are consistently high, you have a throttling issue. These metrics are table-wide, so pinpointing the hot key takes a closer look at your data and traffic.

Check:

Run a scan with a filter expression on your PartitionKey and analyze the results. For example, if your PartitionKey is userId:

aws dynamodb scan \
    --table-name your-table-name \
    --filter-expression "begins_with(userId, :val)" \
    --expression-attribute-values '{":val": {"S": "user123"}}' \
    --select COUNT

Run this for various prefixes of your PartitionKey. Keep in mind that a Scan reads the entire table before the filter is applied, so it consumes read capacity itself, and that it counts stored items rather than requests. A prefix that returns a disproportionately high count, or that you suspect receives the most traffic based on your application logic, is a likely hot-partition candidate. For a direct view of the most accessed and most throttled keys, enable CloudWatch Contributor Insights for DynamoDB on the table.
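
If your application logs record the partition key of each request, a quick tally can show the skew in traffic. This is a minimal sketch, assuming the keys have already been extracted into a list; the function name and the sample data are hypothetical.

```python
from collections import Counter


def top_partition_keys(accessed_keys, n=3):
    """Return the n most frequently accessed keys with their share of traffic."""
    counts = Counter(accessed_keys)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(key, hits, hits / total) for key, hits in counts.most_common(n)]


# Hypothetical sample: partition key values pulled from request logs.
sample = ["user123"] * 90 + ["user456"] * 6 + ["user789"] * 4

for key, hits, share in top_partition_keys(sample):
    print(f"{key}: {hits} requests ({share:.0%} of traffic)")
```

A key that takes most of the traffic in a tally like this is a good candidate for the fixes below.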

Fix:

Unfortunately, you can’t directly rebalance partitions in DynamoDB. The fix involves changing your PartitionKey design or implementing a strategy to distribute the load.

2. Common Causes and Solutions for Hot Partitions

Cause A: Sequential or Predictable Partition Keys

If your PartitionKey is something like a timestamp (e.g., 2023-10-27T10:00:00Z) or an auto-incrementing ID, all writes for a given time or the latest ID will land on the same partition.

Diagnosis:

Examine your application’s write patterns. If you’re inserting records sequentially by time or ID, you’re likely creating a hot spot.

Fix:

Add a Random Prefix/Suffix: Prepend or append a random string or number to your PartitionKey. For example, if your key is order_id, you could make it random_prefix#order_id.

  • Command Example (Conceptual - this is an application change): When writing data, generate a random number (e.g., 0-9) and use it as the first character of your PartitionKey. PartitionKey = "5#order_12345"
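  • Why it works: This distributes writes across up to 10 partition key values (or however many buckets you create) instead of a single one. The trade-off is on reads: with a purely random prefix you have to query every bucket to find a specific order. If you need point lookups, derive the prefix deterministically (for example, from a hash of order_id) so it can be recomputed at read time.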
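
The two ways of choosing the prefix can be sketched as follows. The bucket count of 10 mirrors the 0-9 example above; the function names and the use of SHA-256 are illustrative assumptions, not something DynamoDB prescribes.

```python
import hashlib
import random

NUM_SHARDS = 10  # assumed bucket count, matching the 0-9 example


def random_sharded_key(order_id, num_shards=NUM_SHARDS):
    """Random prefix: spreads writes, but readers cannot recompute the shard."""
    return f"{random.randrange(num_shards)}#{order_id}"


def calculated_sharded_key(order_id, num_shards=NUM_SHARDS):
    """Hash-derived prefix: the same order_id always maps to the same shard."""
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_shards
    return f"{shard}#{order_id}"


def all_shard_keys(order_id, num_shards=NUM_SHARDS):
    """Every key a reader has to try when the prefix was chosen at random."""
    return [f"{shard}#{order_id}" for shard in range(num_shards)]


print(calculated_sharded_key("order_12345"))
print(all_shard_keys("order_12345"))
```

The calculated variant trades a little write randomness for single-key reads, which usually matters more for point lookups such as fetching one order.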

Cause B: Uneven Access Patterns with a High-Cardinality Partition Key

Even with a good PartitionKey, certain values might be accessed far more frequently than others. For example, if one user (userId=admin) performs 90% of all operations.

Diagnosis:

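Table-level CloudWatch metrics such as ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits show the total load but not which key is driving it. To see the skew, enable CloudWatch Contributor Insights for DynamoDB, which reports the most accessed and most throttled keys, or tally key values from your application logs. If one key is responsible for a huge percentage of traffic, that’s your culprit.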

Fix:

Key Widening: Spread the hot key over several partition key values by adding a shard component to it. This is often called a "sub-partition key" or a "virtual partition key." The shard has to be part of the partition key value itself; a separate sort key attribute would leave all of the user’s items under the same partition key.

  • Command Example (Conceptual - application logic): If userId is hot, append a shard suffix when writing, chosen at random or derived from another attribute: userId = "user123#shard_7". A query for one shard would then look like:
    aws dynamodb query \
        --table-name your-table-name \
        --key-condition-expression "userId = :uid" \
        --expression-attribute-values '{":uid": {"S": "user123#shard_7"}}'
    To read everything for user123, run this query once per shard and merge the results.
  • Why it works: This breaks the hot userId into multiple partition key values, which DynamoDB can place on different partitions, spreading the load for that specific user. You’ll need to adjust your application logic to manage the sharded key.
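
Reading a sharded user back means fanning out over every shard and merging the results. The sketch below assumes the shard is folded into the partition key value as userId#shard_N, that items carry a timestamp attribute, and that query_shard is a hypothetical callable wrapping your real Query call; an in-memory dictionary stands in for the table.

```python
def shard_keys(user_id, num_shards):
    """Build every sharded partition key value for one logical user."""
    return [f"{user_id}#shard_{i}" for i in range(num_shards)]


def read_all_shards(user_id, num_shards, query_shard):
    """Query each shard in turn and merge the items, ordered by timestamp.

    query_shard is a hypothetical callable: partition key -> list of items.
    """
    items = []
    for key in shard_keys(user_id, num_shards):
        items.extend(query_shard(key))
    return sorted(items, key=lambda item: item["timestamp"])


# In-memory stand-in for the table, for illustration only.
fake_table = {
    "user123#shard_0": [{"timestamp": 3, "action": "logout"}],
    "user123#shard_7": [
        {"timestamp": 1, "action": "login"},
        {"timestamp": 2, "action": "view"},
    ],
}

events = read_all_shards("user123", 10, lambda key: fake_table.get(key, []))
print([event["action"] for event in events])
```

The read cost grows with the shard count, so it is worth keeping the number of shards as small as the write load allows.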

Cause C: Insufficient Provisioned Throughput

This is the most straightforward cause, but it’s easy to overlook if you’re focused on key design. Your table simply doesn’t have enough read or write capacity units.

Diagnosis:

Monitor CloudWatch metrics for ReadThrottleEvents and WriteThrottleEvents. If they are consistently high, and your PartitionKey design seems reasonable, you might just need more throughput.

Fix:

Increase Provisioned Throughput: Manually increase the provisioned read and write capacity units for your table.

  • Command Example:
    aws dynamodb update-table \
        --table-name your-table-name \
        --provisioned-throughput ReadCapacityUnits=1000,WriteCapacityUnits=1000
    
    (Adjust 1000 to a value appropriate for your expected load, starting higher and scaling down if needed.)
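  • Why it works: This allocates more resources to your table, allowing it to handle more requests per second. It only applies to tables in provisioned capacity mode, and it doesn’t lift the per-partition limits, so a single hot key can still be throttled.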
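
To pick a starting value rather than guessing, you can estimate capacity from request rate and item size. One read capacity unit covers one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half), and one write capacity unit covers one write per second of an item up to 1 KB. The sketch below applies those rules; the function names and the traffic figures are hypothetical.

```python
import math


def required_rcu(reads_per_sec, item_size_bytes, strongly_consistent=True):
    """Read capacity units for single-item reads at a steady rate."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    if not strongly_consistent:
        units_per_read = units_per_read / 2
    return math.ceil(reads_per_sec * units_per_read)


def required_wcu(writes_per_sec, item_size_bytes):
    """Write capacity units for single-item writes at a steady rate."""
    return writes_per_sec * math.ceil(item_size_bytes / 1024)


# Hypothetical workload: 500 reads/s of 6,000-byte items, 200 writes/s of 2,500-byte items.
print(required_rcu(500, 6000))
print(required_rcu(500, 6000, strongly_consistent=False))
print(required_wcu(200, 2500))
```

These are steady-state figures; leave headroom for bursts before plugging them into update-table.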

Cause D: Inefficient Query Patterns

Queries that scan large portions of your table or repeatedly access the same hot items can indirectly cause hot partitions, especially if those items are concentrated.

Diagnosis:

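Ask DynamoDB to report the cost of each request by passing --return-consumed-capacity TOTAL (ReturnConsumedCapacity in the SDKs) and log the result per operation. Look for Scan operations that consume a large number of read capacity units, or PutItem and UpdateItem calls that repeatedly target the same keys. CloudWatch Contributor Insights for DynamoDB can confirm which keys are involved.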

Fix:

Optimize Queries: Rearchitect your queries to use Query operations on your PartitionKey and SortKey instead of Scan. Consider Global Secondary Indexes (GSIs) or Local Secondary Indexes (LSIs) to support your access patterns.

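  • Command Example (Conceptual - application logic): Instead of scanning for all items with status = "PENDING", create a GSI with status as the PartitionKey and query it. Because status is a DynamoDB reserved word, refer to it through an expression attribute name:
    aws dynamodb query \
        --table-name your-table-name \
        --index-name YourStatusIndex \
        --key-condition-expression "#s = :s" \
        --expression-attribute-names '{"#s": "status"}' \
        --expression-attribute-values '{":s": {"S": "PENDING"}}'
  • Why it works: Query operations read only the items under the requested partition key, whereas Scan operations read every item in the table, which can overload partitions. Note that a GSI keyed on a low-cardinality attribute like status can develop a hot partition of its own, so the random prefix technique from Cause A may be needed on the index key as well.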
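
A rough calculation shows how large the difference can be. Read capacity is charged on the data read, not the data returned, at one unit per 4 KB for strongly consistent reads and half that for eventually consistent ones. The table and result sizes below are hypothetical, and the sketch ignores per-request rounding.

```python
import math


def read_units(bytes_read, strongly_consistent=False):
    """Approximate read capacity units consumed for a given volume of data read."""
    units = math.ceil(bytes_read / 4096)
    return units if strongly_consistent else units / 2


table_bytes = 2 * 1024**3    # hypothetical: the whole table is 2 GiB
pending_bytes = 5 * 1024**2  # hypothetical: PENDING items total 5 MiB

scan_cost = read_units(table_bytes)     # a Scan reads everything, then filters
query_cost = read_units(pending_bytes)  # a Query reads only the matching key

print(f"Scan:  {scan_cost:,.0f} RCU")
print(f"Query: {query_cost:,.0f} RCU")
```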

Cause E: High-Cardinality Sort Keys with Unbalanced Access

While the PartitionKey is the primary driver of partition distribution, a very high-cardinality SortKey with unbalanced access within a partition can also lead to hot spots. If one specific SortKey value is accessed far more than others for a given PartitionKey, that specific item’s partition can become a bottleneck.

Diagnosis:

Enable CloudWatch Contributor Insights for DynamoDB, which reports the most accessed and most throttled items by partition key and by partition key plus sort key, or infer the pattern from application logs. If a particular item (defined by PartitionKey + SortKey) is consistently the target of most requests, it might be a hot spot.

Fix:

Item Splitting or Denormalization: If a single item is too large or too frequently updated, consider splitting it into multiple items. This could involve creating new SortKey values to represent different aspects of the original item, or denormalizing data into separate tables.

  • Command Example (Conceptual - application logic): If an order item contains a large lineItems array that’s frequently updated, split lineItems into a separate "OrderLineItem" table, keyed by orderId and lineItemId.
  • Why it works: By breaking down a monolithic item into smaller, more manageable pieces, you reduce the contention on any single item and distribute the load across potentially multiple partitions if the new keys are designed well.
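
The split described above can be sketched in plain Python. The attribute names follow the example (orderId, lineItems, lineItemId); the zero-padded numbering scheme for lineItemId and the sample order are assumptions made for illustration.

```python
def split_order(order):
    """Split one order into a header item plus one item per line item."""
    header = {key: value for key, value in order.items() if key != "lineItems"}
    line_items = [
        # Zero-padded IDs keep line items in order when used as a sort key.
        {"orderId": order["orderId"], "lineItemId": f"{index:04d}", **line}
        for index, line in enumerate(order.get("lineItems", []), start=1)
    ]
    return header, line_items


# Hypothetical order with an embedded lineItems array.
order = {
    "orderId": "order_12345",
    "customer": "user123",
    "lineItems": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
}

header, line_items = split_order(order)
print(header)
for item in line_items:
    print(item)
```

Each line item can now be updated on its own without rewriting the whole order.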

Cause F: Using DynamoDB Streams for High-Volume Workloads

If you’re using DynamoDB Streams for event processing and your stream records are being generated at a very high rate, the stream processor itself can become a bottleneck, indirectly appearing as a hot spot if the stream is tied to a specific partition key.

Diagnosis:

Monitor the IteratorAge metric for your DynamoDB Stream consumer. A consistently growing IteratorAge indicates your consumer is falling behind.
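
"Consistently growing" can be checked mechanically once you have exported a series of IteratorAge readings. This is a minimal sketch, assuming the samples are in milliseconds and ordered oldest to newest; the function name and the five-sample window are arbitrary choices.

```python
def is_falling_behind(iterator_age_ms, min_samples=5):
    """True if each of the last min_samples readings exceeds the one before it."""
    recent = iterator_age_ms[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough data to call it a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical readings taken one minute apart.
print(is_falling_behind([100, 400, 900, 1600, 2500]))
print(is_falling_behind([100, 400, 300, 500, 200]))
```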

Fix:

Scale Stream Consumers or Batching: Increase the number of parallel consumers reading from the stream, or implement more efficient batch processing logic.

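  • Command Example (Conceptual - application logic): If using AWS Lambda, raise the ParallelizationFactor (1 to 10 concurrent batches per shard) and tune the BatchSize on the function’s event source mapping. Raising the function’s reserved concurrency only helps if it is currently capped below what the stream needs. If you stream changes to Kinesis Data Streams instead (which DynamoDB supports), increase the number of shards.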
  • Why it works: This allows more requests to be processed in parallel, catching up with the stream generation rate and preventing backlogs that can impact perceived performance.

The Next Error You’ll See

After fixing your hot partition issues and ensuring your throughput is adequate, you’ll likely run into more complex data modeling questions, such as eventual consistency and more sophisticated indexing strategies. These will lead you to explore DynamoDB’s Global Secondary Indexes (GSIs) in more detail.
