S3 is designed to store your data durably, but it’s still worth double-checking that the data you think you uploaded is actually what’s sitting there. This is where checksums come in.

Let’s say you’re uploading a 50GB file to S3. You initiate the upload, and it completes without any errors reported by the AWS CLI or SDK. Great! But what if, somewhere between your machine and S3, a few bits flipped? If you’re lucky, it might be in a part of the file that doesn’t matter. If you’re unlucky, it could corrupt a database file, a program executable, or any critical data. S3 has built-in mechanisms to protect against this, but you can also add your own layer of validation.

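The most common way to verify an S3 upload is to compare a checksum computed locally before the upload with one S3 computes from the bytes it receives. Historically this has been MD5: the client sends a Content-MD5 header with the request, and S3 rejects the request if the data it received hashes to something else. For an object uploaded in a single PUT (and not encrypted with SSE-KMS or SSE-C), the ETag is also the object’s MD5 digest. Newer releases of the AWS CLI and SDKs have moved to CRC-based checksums by default, so check what your version does.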

Here’s how it works in practice with the AWS CLI:

aws s3 cp my-large-file.zip s3://my-bucket/my-large-file.zip

When the file is larger than the CLI’s multipart threshold (8 MB by default; this is a client-side setting, not something S3 decides), the AWS CLI splits it into parts and uploads each part separately. It sends a checksum with each part (MD5 in older CLI versions), and S3 rejects any part whose bytes don’t match, so a corrupted part causes the CLI to report an error. Note that the ETag of the finished object is not the MD5 of the whole file. S3 concatenates the binary MD5 digests of the parts, takes the MD5 of that, and appends a hyphen and the number of parts. You therefore can’t compare a multipart ETag directly against the output of md5sum for the file.
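The multipart ETag calculation can be reproduced locally. Here is a minimal sketch, assuming every part is 8 MiB (the CLI’s default chunk size; adjust part_size if yours differs) and that the object is not encrypted with SSE-KMS or SSE-C. The function name multipart_etag is made up for illustration:

```python
import hashlib


def multipart_etag(path, part_size=8 * 1024 * 1024):
    """Estimate the ETag S3 reports for a multipart upload of `path`.

    Assumption: the upload used fixed-size parts of `part_size` bytes
    (the last part may be shorter). A different part size gives a
    different ETag for the same file.
    """
    part_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            # Keep the raw 16-byte MD5 digest of each part, not the hex string.
            part_digests.append(hashlib.md5(chunk).digest())
    # MD5 over the concatenated part digests, then "-<number of parts>".
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

If the value returned for your local file matches the ETag shown by aws s3api head-object (ignoring the surrounding quotes), the parts arrived intact. A mismatch may only mean the upload used a different part size, so treat it as a prompt to investigate rather than proof of corruption.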

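However, MD5 has known collision vulnerabilities, meaning an attacker can deliberately construct two different files with the same MD5 hash. That matters little for accidental corruption, but it means MD5 can’t prove a file hasn’t been tampered with. S3 also supports other algorithms, including CRC32 and SHA-256. CRC32 is fast and good at catching accidental bit flips, but it is not a cryptographic hash. SHA-256 is slower but cryptographically strong.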

S3 supports CRC32 checksums for object uploads. You can generate a CRC32 checksum locally and then provide it during the upload process.

First, generate the CRC32 checksum for your file. On Linux or macOS, you can use the crc32 utility, if it’s installed:

crc32 my-large-file.zip

This will output a checksum, for example: 1a2b3c4d.
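S3’s checksum parameters take the base64 encoding of the four raw checksum bytes rather than the hex string. A rough sketch of computing both forms in Python, reading the file in chunks so a large file never has to fit in memory (the helper names are illustrative, not part of any AWS tool):

```python
import base64
import zlib


def file_crc32(path, chunk_size=1024 * 1024):
    """Compute the CRC32 of a file incrementally, one chunk at a time."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def crc32_hex(crc):
    """Eight lowercase hex digits, the form the crc32 utility prints."""
    return f"{crc:08x}"


def crc32_base64(crc):
    """Base64 of the four big-endian checksum bytes, the form S3 expects."""
    return base64.b64encode(crc.to_bytes(4, "big")).decode("ascii")
```

For the example value above, crc32_base64(0x1a2b3c4d) gives Gis8TQ==.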

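When uploading with the AWS CLI, you can pass this checksum with the --checksum-crc32 option of aws s3api put-object. S3 expects the base64 encoding of the four checksum bytes rather than the hex string, so 1a2b3c4d becomes Gis8TQ==:

aws s3api put-object --bucket my-bucket --key my-large-file.zip --body my-large-file.zip --checksum-crc32 Gis8TQ==

A single put-object request is limited to 5 GB. For larger files, recent AWS CLI v2 releases accept --checksum-algorithm CRC32 on aws s3 cp, which calculates and sends a checksum for each part of a multipart upload.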

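If the provided CRC32 checksum doesn’t match the one S3 calculates from the bytes it received, S3 rejects the upload. S3 also stores the checksum with the object, so you can read it back later without downloading the data. Keep in mind that CRC32 is not a stronger hash than MD5: at only 32 bits, it collides far more easily. What you gain is a fast check whose expected value you supply yourself, so the comparison is explicit and end-to-end rather than something the tooling does silently.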
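The same check can be done from code. The sketch below assumes a boto3-style S3 client, whose put_object call accepts a ChecksumCRC32 argument; the function name put_with_crc32 is hypothetical, and holding the payload in memory as bytes only suits small objects:

```python
import base64
import zlib


def put_with_crc32(s3_client, bucket, key, data):
    """Upload `data` (bytes) and ask S3 to verify it against a CRC32.

    Assumption: `s3_client` behaves like boto3.client("s3"). If the bytes
    S3 receives do not match the checksum, the call is expected to raise
    an error instead of storing the object.
    """
    crc = zlib.crc32(data) & 0xFFFFFFFF
    # S3 wants base64 of the four big-endian checksum bytes, not hex.
    checksum_b64 = base64.b64encode(crc.to_bytes(4, "big")).decode("ascii")
    return s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        ChecksumCRC32=checksum_b64,
    )
```

Passing the client in as a parameter keeps the function easy to test with a stand-in object. In real use you would create it with boto3.client("s3").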

For even stronger integrity, you can use SHA-256. S3 can validate a SHA-256 checksum at upload time just as it does CRC32 (aws s3api put-object accepts --checksum-sha256 with a base64-encoded digest). A simpler alternative that works with any tool is to generate the SHA-256 hash of your file locally and store it as S3 object metadata.

Generate the SHA-256 checksum:

shasum -a 256 my-large-file.zip

This will output something like: abcdef1234567890... my-large-file.zip.

Then, upload the file and set the SHA-256 hash as metadata:

aws s3 cp my-large-file.zip s3://my-bucket/my-large-file.zip --metadata sha256-hash=abcdef1234567890...

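S3 treats this metadata as an opaque string and does not check it against the object. To verify later, download the object, read the stored hash from its metadata (aws s3api head-object returns it), and recalculate the SHA-256 hash of the downloaded copy to compare. This is a manual process, but it ensures you’re using a cryptographically secure hash for validation.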
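The comparison step might look like the following sketch. It assumes you have already downloaded the object to a local path and fetched its user metadata as a dictionary (for example, the Metadata field of a boto3 head_object response). The sha256-hash key matches the upload command above; the function names are illustrative:

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large downloads don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_stored_hash(path, metadata, key="sha256-hash"):
    """Return True if the file's SHA-256 equals the hash stored in metadata.

    Assumption: `metadata` maps user metadata names to values, and the
    hash was stored as a hex string under `key`.
    """
    stored = metadata.get(key)
    if stored is None:
        return False  # nothing to compare against
    return file_sha256(path) == stored.strip().lower()
```

A False result means either the bytes differ from what was hashed before upload or the metadata entry is missing, so it is worth distinguishing the two cases when logging.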

The critical insight here is that while S3 does verify the checksums your client sends, the defaults aren’t sufficient for every use case: a multipart ETag can’t be compared against a plain hash of the file, and neither MD5 nor CRC32 protects against deliberate tampering. By explicitly providing a checksum for S3 to validate, or by recording a SHA-256 hash alongside the object, you add an extra layer of assurance that the data you intended to store is precisely what’s residing in S3, mitigating the risk of silent data corruption in transit or storage.

The next challenge you’ll likely encounter is managing these checksums for a large number of files or integrating this validation into automated workflows.
