S3 pagination in boto3 is a deliberate design choice: it keeps any single request from overwhelming either your application or S3 itself.
Here’s how it looks in practice, fetching objects from a bucket named my-huge-bucket:
import boto3
s3 = boto3.client('s3')
bucket_name = 'my-huge-bucket'
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name)
for page in pages:
    if 'Contents' in page:
        for obj in page['Contents']:
            print(obj['Key'])
This code doesn’t magically fetch every object at once. get_paginator returns a paginator, and paginate returns an iterator that makes sequential list_objects_v2 calls to S3 as you loop over it. Each call retrieves one "page" of results, up to 1,000 objects by default. To fetch the next page, boto3 reads the NextContinuationToken from the previous response and sends it as the ContinuationToken parameter of the next list_objects_v2 call. This continues until S3 indicates there are no more objects.
The core problem this solves is the sheer scale of object listing in large S3 buckets. Imagine a bucket with millions or billions of objects. If a single list_objects_v2 call tried to return all of them in one response, it would likely fail due to:
- S3 API Limits: S3 has limits on response payload size and request duration. A massive list could exceed these.
- Client Resource Exhaustion: Your application’s memory and CPU could be completely consumed trying to process an enormous dataset at once.
- Network Instability: Large, long-running requests are more susceptible to network interruptions.
The list_objects_v2 API, which boto3 uses under the hood, is designed for this. When more objects match than the MaxKeys parameter allows (1,000 is both the default and the maximum), S3 doesn’t return everything. Instead, it returns the first MaxKeys objects, sets IsTruncated to true, and includes a NextContinuationToken in the response. Your next request must pass that token as ContinuationToken to tell S3 where to resume listing. Boto3’s paginator abstracts this token management away.
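To make the token handshake concrete, here is a minimal sketch of the loop the paginator saves you from writing. The function name iter_keys_manually is hypothetical, and client is assumed to be a boto3 S3 client (for example boto3.client('s3')). The request and response fields it touches (MaxKeys, ContinuationToken, IsTruncated, NextContinuationToken) are the ones list_objects_v2 uses.

```python
def iter_keys_manually(client, bucket, max_keys=1000):
    """Yield every key in `bucket`, passing continuation tokens by hand.

    `client` is assumed to be a boto3 S3 client; the function name is
    illustrative and not part of boto3.
    """
    kwargs = {"Bucket": bucket, "MaxKeys": max_keys}
    while True:
        response = client.list_objects_v2(**kwargs)
        # 'Contents' is absent when the page holds no objects.
        for obj in response.get("Contents", []):
            yield obj["Key"]
        # IsTruncated is false on the last page, so there is nothing to resume.
        if not response.get("IsTruncated"):
            break
        # Feed the token from this response into the next request.
        kwargs["ContinuationToken"] = response["NextContinuationToken"]
```

In everyday code the paginator is still the better choice. The manual loop is mainly useful for seeing what happens between pages, or for saving a token so that a listing can be resumed later.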
You can influence the page size using the PaginationConfig parameter within paginate. For example, to request a maximum of 500 objects per page:
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name, PaginationConfig={'PageSize': 500})
PageSize is the maximum number of items to return in each individual API call to S3. For list_objects_v2 it is sent as MaxKeys, so S3 still caps it at 1,000. MaxItems is a separate setting: the total number of items you want to retrieve across all pages. Adding 'MaxItems': 500 alongside 'PageSize': 500 would stop the listing after a single page of 500 objects. If MaxItems is not specified, the paginator fetches all available items.
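The two settings are easy to confuse, so here is a small sketch that uses both at once to fetch only the first n keys of a bucket. The helper name first_n_keys is hypothetical, and client is assumed to be a boto3 S3 client.

```python
def first_n_keys(client, bucket, n, page_size=1000):
    """Return at most `n` keys from `bucket`.

    MaxItems caps the total across all pages; PageSize caps each API call.
    `client` is assumed to be a boto3 S3 client; the helper is illustrative.
    """
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=bucket,
        PaginationConfig={"MaxItems": n, "PageSize": page_size},
    )
    keys = []
    for page in pages:
        # 'Contents' is absent when a page holds no objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

With n=2500 and page_size=1000, for instance, the paginator should need three API calls and stop partway through the third page.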
A common misconception is that list_objects_v2 is slow because it’s a "list" operation. In reality, the performance bottleneck for very large buckets isn’t the operation itself, but the network latency and the sheer number of API calls required to iterate through all the pages. Each page fetch is a separate network round trip.
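A rough back-of-the-envelope calculation shows how the round trips add up. The helper below is a hypothetical sketch, and the 0.1 seconds per call is an assumed figure for illustration, not a measured S3 latency.

```python
import math

def estimate_listing(total_objects, page_size=1000, seconds_per_call=0.1):
    """Estimate API calls and wall-clock time for a sequential full listing.

    seconds_per_call is an assumed average round-trip time, not a benchmark.
    """
    calls = math.ceil(total_objects / page_size)
    return calls, calls * seconds_per_call

calls, seconds = estimate_listing(10_000_000)
print(f"{calls} calls, about {seconds / 60:.0f} minutes")
```

Under those assumptions, ten million objects take 10,000 sequential calls and roughly a quarter of an hour, even though each individual call is fast.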
The list_objects_v2 API also supports Prefix and Delimiter parameters, which can significantly optimize your listings by allowing you to traverse your bucket’s "directory" structure more efficiently. For example, Prefix='my/folder/' will only list objects within that specific "folder."
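As a sketch of what a "directory" listing can look like: when a Delimiter is supplied, S3 returns the keys directly under the prefix in Contents and rolls deeper keys up into CommonPrefixes. The helper name list_folder is hypothetical, and client is assumed to be a boto3 S3 client.

```python
def list_folder(client, bucket, prefix=""):
    """Return (subfolders, keys) found directly under `prefix`.

    `client` is assumed to be a boto3 S3 client; the helper is illustrative.
    """
    paginator = client.get_paginator("list_objects_v2")
    subfolders, keys = [], []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # Keys nested below the next "/" are grouped into CommonPrefixes.
        subfolders.extend(cp["Prefix"] for cp in page.get("CommonPrefixes", []))
        # Keys directly under the prefix arrive in Contents as usual.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return subfolders, keys
```

Calling list_folder(s3, 'my-huge-bucket', 'my/folder/') would then return the immediate "subfolders" and objects of that prefix without walking everything beneath it.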
If a listing of a very large bucket comes back incomplete, or you hit timeouts while trying to collect everything in one go, the usual cause is skipping pagination. A single list_objects_v2 call never returns more than 1,000 keys, and it raises no error when more exist; it only sets IsTruncated in the response. The paginator is the direct solution, ensuring each request to S3 is manageable and no page is missed.
The next challenge you’ll likely face after successfully paginating through all objects is understanding the difference between object keys and S3’s internal partitioning, especially when it comes to performance tuning for extremely high-throughput workloads.