Packer is silently eating your build failures, masking the real issues.
The problem is that retrying a failed Packer build, or a failed provisioner, doesn’t distinguish transient errors from permanent ones. A retry fires on any error, including fundamental configuration mistakes or network issues that will never resolve on their own. This can waste hours of build time and obscure the root cause of the failure. You end up chasing phantom issues because Packer keeps retrying a build that’s doomed from the start.
Let’s look at a typical scenario: a shell provisioner failing because a command doesn’t exist on the target image, or a file provisioner failing because the source file is missing. With retries enabled, Packer will just keep spinning.
Here’s how to effectively manage on_error and diagnose failures:
Common Causes and Fixes
- Command Not Found in Shell Provisioner:
  - Diagnosis: Examine the Packer build output carefully. You’ll see the shell command that failed and the error message from the target machine (e.g., `sudo: apt-get: command not found` or `bash: yum: command not found`).
  - Fix:
    - If the command is genuinely missing: Install it using the appropriate package manager for the base image before your failing provisioner. For example, if you’re on a minimal Debian/Ubuntu and need `wget`:

          {
            "type": "shell",
            "inline": [
              "sudo apt-get update",
              "sudo apt-get install -y wget"
            ]
          }

      This works because you’re explicitly installing the missing dependency, allowing the subsequent provisioner to find and execute it.
    - If the command is misspelled or incorrect: Correct the spelling in your `inline` or `script` block. For example, if the package name was mistyped as `aws-cli` instead of `awscli`:

          {
            "type": "shell",
            "inline": [
              "sudo apt-get update",
              "sudo apt-get install -y awscli"
            ]
          }

      This resolves the issue by ensuring the provisioner is calling the correct, existing command.
  - Why it works: The `shell` provisioner executes commands within the context of the guest OS. If a command isn’t in the PATH or isn’t installed, the OS itself will report an error, which Packer then forwards. By ensuring the command exists, you satisfy the OS’s requirement.
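One way to surface a missing command early is to check for it at the top of the provisioning script, before any real work starts. The sketch below is a minimal illustration, not part of Packer; `require_cmd` and `check_tools` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: fail with a clear message if a command is not on PATH.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing required command: $1" >&2
    return 1
  fi
}

# Hypothetical helper: check every listed command and report all that are
# missing, instead of stopping at the first one.
check_tools() {
  missing=0
  for cmd in "$@"; do
    require_cmd "$cmd" || missing=1
  done
  return "$missing"
}
```

Calling something like `check_tools wget curl || exit 1` as the first line of a provisioning script turns a confusing mid-build failure into a single, readable error.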
- Source File Not Found for `file` Provisioner:
  - Diagnosis: Packer will report an error like `Error uploading file: The system cannot find the file specified.` or `file not found` on the host. The error message points directly to the missing source file on your build machine.
  - Fix: Verify that the `source` path in your `file` provisioner is correct. A relative path is resolved against the working directory where you run Packer, not against the template’s location; an absolute path must exist as written. In this example, `configs/app.conf` must exist on your build machine:

        {
          "type": "file",
          "source": "configs/app.conf",
          "destination": "/etc/app.conf"
        }

    This works because Packer needs to read the file from your local filesystem before it can upload it to the target instance. If the file isn’t there, the upload fails immediately.
- Incorrect Permissions on Source File for `file` Provisioner:
  - Diagnosis: You might see a generic `Error uploading file` or a permission denied error when Packer tries to read the `source` file on your build machine.
  - Fix: Ensure the user running `packer build` has read permissions on the `source` file.

        chmod +r configs/app.conf

    This grants read access, allowing Packer to open and read the file for uploading.
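Both file-provisioner failures above (a missing source and an unreadable source) can be caught on the build machine before Packer starts. The following is a rough sketch under that assumption; `check_sources` is a hypothetical helper, and the paths you pass it would be whatever your template lists as `source`.

```shell
#!/bin/sh
# Hypothetical helper: verify that each path exists and is readable by the
# current user. Reports every problem and returns non-zero if any was found.
check_sources() {
  status=0
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "missing source: $path" >&2
      status=1
    elif [ ! -r "$path" ]; then
      echo "unreadable source: $path" >&2
      status=1
    fi
  done
  return "$status"
}
```

A typical use would be `check_sources configs/app.conf && packer build my-aws-template.json`, run from the same working directory as the build so that relative paths resolve the same way.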
- Network Issues During Instance Boot/Provisioning (e.g., SSH Timeout):
  - Diagnosis: Packer output will show `Error: Timed out waiting for SSH to become available.` or similar messages indicating it couldn’t connect to the instance.
  - Fix:
    - Security Group/Firewall: Verify that the instance’s security group (AWS, Azure, GCP) or any network firewalls allow inbound SSH traffic (TCP port 22) from the IP address Packer is using.
      - AWS Example: In your AWS console, navigate to EC2 -> Security Groups, find the group associated with your instance, and add an inbound rule for SSH (port 22) allowing access from your build machine’s IP or a trusted range.
      - This works because the instance’s network layer is blocking the SSH connection attempt. Opening the port allows the connection.
    - Instance Reachability: Ensure the instance has a public IP address (if needed) or that your build environment can reach the instance’s private IP (e.g., via VPN or within the same VPC).
      - AWS Example: Check the instance’s subnet settings and ensure it’s in a public subnet if you expect to connect directly over the internet, or that routing is correctly configured for private IP access.
      - This ensures that network packets can actually reach the instance’s SSH server.
  - Why it works: SSH requires a network path to be open. If firewalls or routing misconfigurations block traffic on port 22, Packer cannot establish the SSH connection needed to upload files or run commands.
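To separate "port 22 is blocked" from "SSH itself is misbehaving", it can help to probe the port directly from the build machine. This sketch assumes `bash` (for its `/dev/tcp` redirection) and the coreutils `timeout` command are available; `port_open` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a TCP connection to HOST:PORT can be
# opened within 5 seconds. Uses bash's /dev/tcp pseudo-device, so it needs
# bash on the build machine even when called from another shell.
port_open() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

For example, `port_open 203.0.113.10 22 || echo "port 22 unreachable"`, where the address is a placeholder for your instance’s IP. A failure here points at security groups, firewalls, or routing rather than at the SSH daemon or your keys.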
- Invalid Cloud Provider Credentials or Permissions:
  - Diagnosis: Errors will be specific to your cloud provider, often mentioning `Access Denied`, `InvalidClientTokenId`, `Authentication Failed`, or `AuthorizationError`.
  - Fix:
    - AWS: Ensure your `~/.aws/credentials` file is correctly populated with a valid `aws_access_key_id` and `aws_secret_access_key`, or that your environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) are set. Also, verify the IAM user/role has permissions to create EC2 instances, security groups, etc.

          export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY"
          export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_KEY"
          # Ensure IAM role/user has "AmazonEC2FullAccess" or equivalent permissions

      This works by providing Packer with the necessary cryptographic proof of identity and authorization to interact with the cloud API.
    - Azure: Ensure your `~/.azure/credentials` or environment variables (`ARM_CLIENT_ID`, `ARM_CLIENT_SECRET`, `ARM_TENANT_ID`, `ARM_SUBSCRIPTION_ID`) are correct, and the service principal has the appropriate RBAC roles (e.g., "Virtual Machine Contributor").

          export ARM_CLIENT_ID="YOUR_CLIENT_ID"
          export ARM_CLIENT_SECRET="YOUR_CLIENT_SECRET"
          export ARM_TENANT_ID="YOUR_TENANT_ID"
          export ARM_SUBSCRIPTION_ID="YOUR_SUBSCRIPTION_ID"

      This validates your identity and permissions to perform actions within your Azure subscription.
  - Why it works: Cloud providers use credentials and permissions to control who can do what. Incorrect credentials mean the API calls fail authentication; insufficient permissions mean the API calls fail authorization.
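A cheap guard against the "credentials were never exported" variant of this failure is to check the environment before launching the build. The sketch below is illustrative only; `require_env` is a hypothetical helper, and it checks only that the variables are non-empty, not that the credentials are valid.

```shell
#!/bin/sh
# Hypothetical helper: report every named environment variable that is unset
# or empty. Variable names are passed as arguments, so the same function
# works for the AWS_* and ARM_* sets shown above.
require_env() {
  status=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "environment variable not set: $name" >&2
      status=1
    fi
  done
  return "$status"
}
```

For instance, `require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY || exit 1` at the top of a build wrapper. If the AWS CLI is installed, following that with `aws sts get-caller-identity` goes one step further and confirms the credentials actually authenticate.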
- Syntax Errors in Packer Template:
  - Diagnosis: Packer will fail before attempting to build, with an error message like `Error: Invalid character encountered at ...` or `Error: Missing required field "type"`.
  - Fix: Run `packer validate <your-template.json>` to catch these errors early. Correct the JSON syntax, missing fields, or incorrect key-value pairs.

        packer validate my-aws-template.json

    This works because Packer performs a static analysis of your template file to ensure it conforms to the expected structure and syntax before it even starts provisioning.
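Validation is easy to skip when it is a separate manual step, so one option is to wrap both commands so the build never starts on an invalid template. This is a sketch, not an official workflow; `validate_then_build` and the `PACKER_BIN` override are hypothetical names introduced here so the function can be exercised without Packer installed.

```shell
#!/bin/sh
# Hypothetical wrapper: run `packer validate` first and only start the build
# if validation succeeds. PACKER_BIN is an assumed override for the binary
# name; it defaults to `packer`.
validate_then_build() {
  packer_bin=${PACKER_BIN:-packer}
  if ! "$packer_bin" validate "$1"; then
    echo "validation failed, not building: $1" >&2
    return 1
  fi
  "$packer_bin" build "$1"
}
```

Usage would look like `validate_then_build my-aws-template.json`. The design choice is simply ordering: a template error costs seconds at the validate step instead of minutes into a build.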
The -on-error Flag: Use with Caution
Note that `-on-error` is a command-line flag for `packer build` rather than a template setting, and it has no retry value. The options you will use most are:
- `cleanup` (the default): Stops the build and cleans up any resources created by the failed build (e.g., terminates the instance).
- `abort`: Stops the build immediately and leaves those resources in place, so you can log in to the instance and inspect it. This is usually what you want for debugging.
- `ask`: Pauses on failure and prompts you to choose what to do.

If you must retry, do it per provisioner with `max_retries`, which defaults to 0 (no retries), and keep the count small so a persistent failure can’t loop for long.

    {
      "type": "shell",
      "inline": ["sudo apt-get update"],
      "max_retries": 3
    }

This reruns the failing provisioner up to 3 times before the build fails. This is still risky if the underlying issue isn’t transient.
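When only one step is genuinely transient (a flaky package mirror, say), an alternative is to bound retries around that single command inside the provisioning script, so everything else still fails on the first error. The sketch below is one way to do that in plain shell; `retry` is a hypothetical helper, not a Packer feature.

```shell
#!/bin/sh
# Hypothetical helper: retry MAX DELAY CMD [ARGS...]
# Runs CMD up to MAX times, sleeping DELAY seconds after each failure.
# Returns 0 as soon as CMD succeeds, or 1 once the attempts are used up.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

For example, `retry 3 10 sudo apt-get update` retries only the mirror refresh. A mistyped package name on the next line would still fail immediately, which keeps the real error visible.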
The next error you’ll hit after fixing fundamental issues is often a subtle misconfiguration in a provisioner that does exist but behaves unexpectedly, or a dependency on a resource that wasn’t created correctly in a previous step.