S3 Batch Operations lets you run a single operation across millions or even billions of S3 objects as one managed, asynchronous job, so you don’t have to write and operate the per-object looping, retry, and tracking code yourself.

Let’s say you need to update the storage class for 10 million objects from Standard to Intelligent-Tiering, or maybe you need to add a tag to all objects in a bucket that haven’t been accessed in a year. Doing this with individual s3api calls or SDK scripts would be a nightmare of rate limiting, error handling, and long-running processes. S3 Batch Operations abstracts all of that away.

Here’s how it works in practice:

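First, you define the operation you want to perform. Supported operations include copying objects (S3PutObjectCopy), replacing object tags (S3PutObjectTagging), deleting object tags (S3DeleteObjectTagging), replacing ACLs (S3PutObjectAcl), restoring archived objects (S3InitiateRestoreObject), and invoking a Lambda function (LambdaInvoke). For a copy, you specify a destination bucket and optionally a key prefix. For tagging, you provide the tag set. For a Lambda invocation, you point to a Lambda function ARN.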

Next, you define the scope of the operation. This is where you tell S3 which objects to operate on. You have two main options:

  1. Manifest File: You provide an S3 object (a CSV file) that lists the objects to operate on, one per line, as bucket and key, with an optional version ID as a third column. The file has no header row, and object keys must be URL-encoded. This is incredibly flexible. An example manifest.csv:

    my-source-bucket,object1.txt
    my-source-bucket,folder/object2.jpg
    my-source-bucket,another/path/to/document.pdf

    You upload this CSV to an S3 bucket, and then reference its S3 URI when creating the batch job.

  2. S3 Inventory Report: If you already use S3 Inventory to generate reports of your bucket contents, you can use a report as your input by pointing the job at the report’s manifest.json. The job then runs against every object listed in that report, so to act only on objects that match certain criteria (e.g., last modified before a certain date, or above a certain size) you filter the inventory first and write the matching objects out as a CSV manifest. The inventory report needs to be in CSV format.
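Writing a manifest by hand gets error-prone once keys contain spaces, commas, or non-ASCII characters. The following is a minimal sketch of generating one in Python. The helper name `build_manifest` and the third sample key are made up for illustration, and keeping "/" unencoded is an assumption about what the manifest parser accepts, so check a small test job before relying on it.

```python
import csv
import io
from urllib.parse import quote


def build_manifest(bucket, keys):
    """Return manifest CSV text: one 'bucket,key' row per object, no header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for key in keys:
        # Assumption: percent-encode everything except "/" so that spaces,
        # commas, and non-ASCII characters cannot break the CSV row.
        writer.writerow([bucket, quote(key, safe="/")])
    return buf.getvalue()


manifest_text = build_manifest(
    "my-source-bucket",
    ["object1.txt", "folder/object2.jpg", "reports/q1 summary, final.pdf"],
)
print(manifest_text, end="")
```

The resulting text can then be uploaded as the manifest object with whatever S3 client you already use.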

Once you have your operation and your object list (via manifest or inventory), you create an S3 Batch Operations job. You’ll need to specify:

  • The manifest or inventory report location.
  • The operation to perform (e.g., S3PutObjectCopy, S3PutObjectTagging).
  • Any parameters for the operation (e.g., destination bucket for a copy, tags for a tagging job).
  • An IAM role that grants S3 Batch Operations permission to perform the specified operation on your behalf (e.g., s3:GetObject, s3:PutObject, s3:PutObjectTagging).
  • (Optional) An S3 bucket where completion reports (successes, failures) will be written.

Let’s look at a concrete example: applying the tag environment=production to the objects in my-data-bucket that are listed in a manifest.

First, create your manifest tags_manifest.csv:

my-data-bucket,data/file1.csv
my-data-bucket,images/photo.jpg
my-data-bucket,archive/old_report.zip

Upload tags_manifest.csv to s3://my-config-bucket/manifests/tags_manifest.csv.

Then, create the batch job using the AWS CLI:

aws s3control create-job \
    --account-id 111122223333 \
    --role-arn arn:aws:iam::111122223333:role/S3BatchOpsRole \
    --operation '{
        "S3PutObjectTagging": {
            "TagSet": [
                {"Key": "environment", "Value": "production"}
            ]
        }
    }' \
    --report '{
        "Bucket": "arn:aws:s3:::my-completion-reports-bucket",
        "Format": "Report_CSV_20180820",
        "Enabled": true,
        "Prefix": "tagging-job-report",
        "ReportScope": "AllTasks"
    }' \
    --manifest '{
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"]
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::my-config-bucket/manifests/tags_manifest.csv",
            "ETag": "YOUR_MANIFEST_ETAG"
        }
    }' \
    --description "Add production tag to all objects" \
    --priority 1 \
    --no-confirmation-required

The --account-id is your AWS account ID. The --role-arn is the IAM role you’ve set up for batch operations. The operation is S3PutObjectTagging with your desired TagSet. The report section specifies where to send the output: its Bucket is given as a bucket ARN, and ReportScope selects whether all tasks (AllTasks) or only failed tasks (FailedTasksOnly) are reported. The manifest section points to your CSV file, and its ETag must be the current ETag of that manifest object (you can read it with aws s3api head-object). The --no-confirmation-required flag lets the job start without a separate confirmation step.
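If you would rather create the job from Python, the request has the same shape. The sketch below only builds the keyword arguments that boto3's s3control `create_job` call expects, so it runs without AWS credentials. The function name `build_tagging_job_request` is hypothetical, and the account ID, ARNs, and ETag are the same placeholders used in the CLI example.

```python
import uuid


def build_tagging_job_request(account_id, role_arn, manifest_arn,
                              manifest_etag, report_bucket_arn, tags):
    """Build keyword arguments for an s3control create_job tagging request."""
    return {
        "AccountId": account_id,
        "ConfirmationRequired": False,
        "RoleArn": role_arn,
        "Priority": 1,
        "Description": "Add production tag to all objects",
        # Token that makes the CreateJob request itself safe to retry.
        "ClientRequestToken": str(uuid.uuid4()),
        "Operation": {
            "S3PutObjectTagging": {
                "TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]
            }
        },
        "Manifest": {
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
        },
        "Report": {
            "Bucket": report_bucket_arn,
            "Format": "Report_CSV_20180820",
            "Enabled": True,
            "Prefix": "tagging-job-report",
            "ReportScope": "AllTasks",
        },
    }


request = build_tagging_job_request(
    account_id="111122223333",
    role_arn="arn:aws:iam::111122223333:role/S3BatchOpsRole",
    manifest_arn="arn:aws:s3:::my-config-bucket/manifests/tags_manifest.csv",
    manifest_etag="YOUR_MANIFEST_ETAG",
    report_bucket_arn="arn:aws:s3:::my-completion-reports-bucket",
    tags={"environment": "production"},
)
# With boto3 installed and credentials configured, the call would be:
#   boto3.client("s3control").create_job(**request)
```

Keeping the request construction separate from the API call makes it easy to review or unit-test the job definition before anything touches real objects.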

Once the job is created, S3 Batch Operations takes over. It will read your manifest, queue up operations for each object, execute them, and handle retries for transient errors. You can monitor the job’s progress through the S3 console or via the aws s3control describe-job command. The completion report will detail which objects succeeded and which failed, along with the reason for failure.
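Monitoring can also be scripted. Here is a rough sketch of a polling loop around `describe_job`. It assumes `client` behaves like `boto3.client("s3control")`, and it treats Complete, Failed, and Cancelled as the terminal statuses. The function name and the default timings are illustrative choices, not recommendations.

```python
import time

# Job statuses after which no further progress is expected.
TERMINAL_STATUSES = {"Complete", "Failed", "Cancelled"}


def wait_for_job(client, account_id, job_id, poll_seconds=30, max_polls=120):
    """Poll describe_job until the job reaches a terminal status.

    Returns a (status, progress_summary) tuple, or raises TimeoutError
    if the job is still running after max_polls attempts.
    """
    for _ in range(max_polls):
        job = client.describe_job(AccountId=account_id, JobId=job_id)["Job"]
        status = job["Status"]
        if status in TERMINAL_STATUSES:
            return status, job.get("ProgressSummary", {})
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

The progress summary returned by the service includes task counts such as NumberOfTasksSucceeded and NumberOfTasksFailed, which give a quick signal on whether the completion report is worth opening.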

One detail worth understanding is that S3 Batch Operations is an orchestrator: for each object in the manifest it calls the corresponding S3 API (for example PutObjectTagging or CopyObject) on your behalf, acting as a massively scalable, managed task scheduler for S3. The same model explains the Lambda support. S3 Batch Operations doesn’t run your code itself; it invokes your function through the Lambda Invoke API for the objects in the manifest, which is why the job’s IAM role needs the lambda:InvokeFunction permission on that function.

The real power here is in the asynchronous, managed nature of the job. If a task fails because of a transient error such as a network blip or throttling, S3 Batch Operations retries it, and anything that still fails is recorded in the completion report. Don’t assume the job deduplicates work for you, though: if you run the same job twice, every operation is performed twice. For tagging that leaves the objects in the same end state, but note that S3PutObjectTagging replaces an object’s entire tag set rather than adding to it. For a copy, an existing object at the destination key is overwritten (or a new version is created if the destination bucket is versioned), so check the destination yourself if that matters.

When you use the LambdaInvoke operation, the Lambda function is invoked for each object specified in your manifest. Your function receives an event payload containing details about the S3 object and must return a result code for each task it was given. This is where you can perform custom logic, like resizing images, converting file formats, or triggering other downstream processes. The function’s execution is managed by Lambda, but S3 Batch Operations ensures that it is invoked for every object in your batch list.
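To make the request and response shape concrete, here is a minimal handler sketch. It assumes the 1.0 invocation schema, in which each task carries `taskId`, `s3Key` (URL-encoded), and `s3BucketArn`, and the response echoes `invocationId` with one result per task. The `process_object` function is a hypothetical stand-in for your own logic.

```python
from urllib.parse import unquote_plus


def process_object(bucket, key):
    """Hypothetical per-object work; replace with real logic."""
    if key.endswith("/"):
        raise ValueError("folder placeholder, nothing to process")
    return f"processed s3://{bucket}/{key}"


def lambda_handler(event, context):
    results = []
    for task in event["tasks"]:
        # The bucket arrives as an ARN such as arn:aws:s3:::my-data-bucket.
        bucket = task["s3BucketArn"].split(":::")[-1]
        # Keys arrive URL-encoded, so decode before using them.
        key = unquote_plus(task["s3Key"])
        try:
            message = process_object(bucket, key)
            code = "Succeeded"
        except Exception as exc:
            # Report the failure instead of raising, so the completion
            # report records a reason for this object.
            message = str(exc)
            code = "PermanentFailure"
        results.append(
            {"taskId": task["taskId"], "resultCode": code, "resultString": message}
        )
    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }
```

Returning "TemporaryFailure" instead of "PermanentFailure" asks S3 Batch Operations to retry the task, which is the better choice when the error is likely to clear on its own.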

This service is designed to handle petabytes of data and millions of objects efficiently. AWS manages the rate at which the job’s tasks run, so you don’t have to build your own throttling and back-off logic or size your own infrastructure for the workload.

The next thing you’ll likely encounter when using S3 Batch Operations for complex tasks is understanding how to craft robust Lambda functions that can handle the event payload and provide useful feedback for the batch job’s completion report.
