The most surprising thing about monitoring SMTP is how often the absence of errors is the real problem.
Let’s watch a simplified SMTP transaction unfold. Imagine two mail servers, sender.com and receiver.com.
# On sender.com's mail server (e.g., Postfix)
printf 'Subject: Test Email\n\nThis is a test.\n' | sendmail -f sender@sender.com receiver@receiver.com
# On receiver.com's mail server (e.g., Postfix)
tail -f /var/log/mail.log
On sender.com, the sendmail command initiates the process. The local Postfix instance (or whatever MTA is running) will try to connect to receiver.com’s MX (Mail Exchanger) record. If it connects, it’ll handshake, send the MAIL FROM:, RCPT TO:, and DATA commands. If receiver.com accepts the email, it’ll respond with a 250 OK. The email is now handed off.
Here’s where monitoring becomes crucial. What if receiver.com is temporarily unavailable? sender.com won’t immediately bounce the email. Instead, it’ll queue it.
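The outcomes described above map onto the first digit of the SMTP reply code: 2xx means the message was accepted, 4xx is a temporary failure that leaves the message in the queue, and 5xx is a permanent failure that produces a bounce. A minimal sketch of that mapping follows; the function name `classify_smtp_reply` and its labels are illustrative, not part of Postfix.

```shell
# Classify an SMTP reply (e.g. "250 2.0.0 Ok") by its first digit.
# classify_smtp_reply is a hypothetical helper name used for illustration.
classify_smtp_reply() {
  case $1 in
    2[0-9][0-9]*) echo "accepted" ;;      # handed off to the receiver
    3[0-9][0-9]*) echo "intermediate" ;;  # e.g. 354 after DATA
    4[0-9][0-9]*) echo "deferred" ;;      # temporary: stays in the queue
    5[0-9][0-9]*) echo "bounced" ;;       # permanent: returned to sender
    *)            echo "unknown" ;;
  esac
}
```

For example, `classify_smtp_reply "451 4.7.1 Service unavailable"` reports a deferral, which is the case that shows up as queue growth rather than as an explicit error.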
Queue Depth: The Waiting Line
The primary indicator of a problem is the queue depth. This is the number of emails waiting to be delivered. On Postfix, you can see this with:
mailq
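This command lists the emails in the queue. For a quick count, count only the lines that begin with a queue ID, one per message (this assumes Postfix’s default hexadecimal queue IDs):
mailq | grep -c '^[0-9A-F]'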
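To turn the count into an alert, one option is to wrap it in a function that compares the depth against a limit. The sketch below reads `mailq` output on stdin and counts only lines that start with a queue ID, assuming Postfix's default hexadecimal IDs. The function name `check_queue_depth`, the message wording, and the exit code 2 (borrowed from the common Nagios-style convention) are assumptions, not Postfix features.

```shell
# Read `mailq` output on stdin and compare the message count to a threshold.
# check_queue_depth is a hypothetical helper; exit code 2 is an assumed
# "critical" convention, not something Postfix defines.
check_queue_depth() {
  local threshold=$1
  local depth
  # Each queued message contributes exactly one line starting with its ID.
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  depth=$(grep -c '^[0-9A-F]' || true)
  if [ "$depth" -gt "$threshold" ]; then
    echo "CRITICAL: queue depth $depth exceeds threshold $threshold"
    return 2
  fi
  echo "OK: queue depth $depth"
}

# Usage (on a host with Postfix installed):
#   mailq | check_queue_depth 500
```

Reading from stdin keeps the function testable against saved `mailq` output and lets the same check run from cron or a monitoring agent.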
If this number starts climbing past a reasonable threshold (e.g., 500 for a busy server, 50 for a smaller one), it means emails aren’t getting out. The most common cause is a downstream issue:
- Recipient Server Unreachable/Rate Limited: The receiver.com server might be down, refusing connections, or throttling sender.com’s IP address.
  - Diagnosis: Check mailq for a growing queue and look for specific error messages in /var/log/mail.log (e.g., connect to receiver.com[X.X.X.X]: Connection refused, 4xx temporary failure, 451 4.7.1 Service unavailable). You can also try a manual telnet receiver.com 25 to see if you can connect.
  - Fix: If it’s a temporary network issue or a transient rate limit, waiting is often the best course. If it’s a permanent block, you’ll need to contact the recipient’s mail administrator. For rate limits, implement a staggered sending strategy or request an increase.
  - Why it works: This addresses the direct cause of the queue buildup by either resolving the connectivity issue or managing the flow to comply with the recipient’s policies.
- DNS Resolution Problems: sender.com can’t find receiver.com’s MX records.
  - Diagnosis: mailq will show errors like "Host or domain name not found. Name service error for name=receiver.com type=MX: Host not found, try again". Try dig receiver.com MX or nslookup -type=MX receiver.com.
  - Fix: Check your server’s /etc/resolv.conf for correct DNS server entries. Ensure your DNS servers are reachable. Restarting the DNS caching service (e.g., systemctl restart nscd) or the MTA might be necessary if the configuration changed.
  - Why it works: Correct DNS resolution is fundamental to finding the IP address of the mail server to connect to.
- Local Resource Exhaustion: The sending server itself is overloaded.
  - Diagnosis: High CPU, memory, or disk I/O. Check top, htop, iostat, free -m. Also, check the MTA’s own logs for errors related to resource limits (e.g., "out of memory," "disk full").
  - Fix: Optimize MTA configuration (e.g., tune default_process_limit or default_destination_concurrency_limit in Postfix), add more resources (RAM, CPU), or ensure sufficient disk space.
  - Why it works: The MTA needs sufficient system resources to process and send emails efficiently; lack of these resources directly impedes its ability to dequeue messages.
- Firewall/Network Blocking: An intermediate firewall or sender.com’s own firewall is blocking outbound connections on port 25.
  - Diagnosis: mailq errors might show connect to receiver.com[X.X.X.X]: Connection timed out or Connection refused. Use traceroute receiver.com to identify network hops. Check iptables -L or firewall logs on sender.com.
  - Fix: Adjust firewall rules to allow outbound SMTP traffic on port 25.
  - Why it works: Ensures the network path is open, allowing the MTA to establish the necessary TCP connection to the recipient server.
- Recipient Server Policy Issues: The recipient server has updated its spam filters or blocklists, and sender.com’s IP is now flagged.
  - Diagnosis: mailq shows repeated 4xx errors from the recipient server, often with specific policy reasons (e.g., 451 4.7.1 ... blocked by spam filter). A 5xx reply such as 550 5.7.1 is a permanent rejection and shows up as a bounce rather than as queue growth. Check the IP reputation of sender.com on services like MXToolbox.
  - Fix: Contact the recipient’s mail administrator to understand the block reason and request delisting. Improve your own outbound email practices (SPF, DKIM, DMARC, volume control).
  - Why it works: Addresses the policy violation at the recipient end, which is the direct cause of the delivery refusal.
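The diagnoses above all come down to reading the reason attached to deferred messages, so a rough tally by cause can show which case dominates. The sketch below reads mail log lines on stdin and groups `status=deferred` entries by matching the error strings quoted in the list. The function name `summarize_deferrals` and the category labels are assumptions, and the patterns cover only those example messages.

```shell
# Tally status=deferred log lines by probable cause.
# summarize_deferrals and its category names are hypothetical; the
# patterns only cover the example error strings discussed above.
summarize_deferrals() {
  awk '
    /status=deferred/ {
      if      ($0 ~ /Connection refused/)   cause = "refused"
      else if ($0 ~ /Connection timed out/) cause = "timeout"
      else if ($0 ~ /Name service error/)   cause = "dns"
      else if ($0 ~ /said: 4[0-9][0-9]/)    cause = "remote-4xx"
      else                                  cause = "other"
      count[cause]++
    }
    END { for (c in count) printf "%s %d\n", c, count[c] }
  ' | sort
}

# Usage (log path as used in this article):
#   summarize_deferrals < /var/log/mail.log
```

A spike in one category points at the matching item in the list: "refused" or "timeout" suggests reachability or firewall trouble, "dns" suggests resolver problems, and "remote-4xx" suggests throttling or policy on the recipient side.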
Delivery Failures: The Bounced Emails
Beyond the queue, you need alerts for actual delivery failures (bounces). These are emails that Postfix tried to deliver but couldn’t, and are now being returned to the original sender.
grep "status=bounced" /var/log/mail.log
This will show lines like:
Oct 26 10:00:00 mail postfix/smtp[12345]: A1B2C3D4E5: to=<user@example.com>, relay=mx.example.com[Y.Y.Y.Y]:25, delay=3600, delays=3590/0/0/10, dsn=5.1.1, status=bounced (host mx.example.com[Y.Y.Y.Y] said: 550 5.1.1 <user@example.com>: Recipient address rejected: User unknown in virtual mailbox table (in reply to RCPT TO command))
Here, status=bounced is key. Common causes include:
- Invalid Recipient Address: The email address simply doesn’t exist on the recipient server.
  - Diagnosis: The bounce message itself will explicitly state "User unknown," "Recipient address rejected," or similar.
  - Fix: The sender needs to correct the email address. As the administrator of the sending server, you can’t fix this yourself unless the bad address originated in your own system (e.g., an auto-complete error).
  - Why it works: Prevents wasted delivery attempts to non-existent mailboxes.
- Recipient Server Policy Violation (Permanent): The recipient server permanently rejects the email due to policy (e.g., sender IP reputation, content filtering, domain blacklisting).
  - Diagnosis: Bounce messages often contain specific 5xx error codes and explanations like "blocked by policy," "spam content detected," or "IP address listed."
  - Fix: Similar to temporary policy issues, but often requires more sustained effort. Improve sender reputation, sanitize outbound content, and work with blacklisting services if applicable.
  - Why it works: Resolves the underlying reason for the permanent rejection by correcting the sender’s compliance with recipient policies.
- Mailbox Full: The recipient’s mailbox has exceeded its storage quota.
  - Diagnosis: Bounce messages will typically say "mailbox full," "quota exceeded," or 552 5.2.2 ... message size exceeds fixed maximum recipient quota.
  - Fix: The recipient needs to clear space in their mailbox.
  - Why it works: Frees up space on the recipient server, allowing new messages to be accepted.
- Configuration Errors on Receiving Server: Misconfigured virtual domain tables, user mappings, or other recipient-side MTA settings.
  - Diagnosis: This is harder to diagnose from the sender’s side, but often presents as a consistent "Recipient address rejected" without a clear reason like "unknown user" or "mailbox full." It might be a specific 5xx error code that points to an internal server issue.
  - Fix: The administrator of the receiving server needs to investigate their MTA configuration.
  - Why it works: Corrects internal routing or user lookup mechanisms on the recipient server.
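Since each bounce line carries a dsn= field, grouping bounces by that code is a quick way to see which of the causes above is most common. The sketch below reads log lines on stdin; the function names `bounces_by_dsn` and `dsn_label` are illustrative, and the label mapping covers only the standard enhanced status codes 5.1.1 (bad mailbox address), 5.2.2 (mailbox full), and the 5.7.x security/policy class.

```shell
# Count status=bounced log lines per DSN code (e.g. "5.1.1 12").
# bounces_by_dsn and dsn_label are hypothetical helper names.
bounces_by_dsn() {
  awk '
    /status=bounced/ {
      if (match($0, /dsn=[0-9]+\.[0-9]+\.[0-9]+/)) {
        # Skip the 4-character "dsn=" prefix to keep only the code.
        code = substr($0, RSTART + 4, RLENGTH - 4)
        count[code]++
      }
    }
    END { for (c in count) printf "%s %d\n", c, count[c] }
  ' | sort
}

# Map a DSN code to the bounce categories discussed above.
dsn_label() {
  case $1 in
    5.1.1) echo "invalid recipient" ;;
    5.2.2) echo "mailbox full" ;;
    5.7.*) echo "policy rejection" ;;
    *)     echo "other" ;;
  esac
}

# Usage:
#   bounces_by_dsn < /var/log/mail.log
```

Codes that fall under "other" are the ones worth reading in full, since they often correspond to the harder-to-diagnose receiving-side configuration errors.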
Monitoring queue depth and bounce rates gives you visibility into the health of your outbound email flow. You’re not just looking for explicit "failure" messages, but for the accumulation of waiting emails, which is a strong signal that something is beginning to break.
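A bounce rate can be approximated from the same log by comparing status=bounced lines against status=sent lines. The sketch below is a rough estimate under stated assumptions: Postfix logs one status line per recipient, and status=sent also covers local and relayed deliveries, so the figure is a per-recipient ratio for whatever log window is piped in. The function name `bounce_rate` is illustrative.

```shell
# Print the percentage of final delivery outcomes that were bounces.
# bounce_rate is a hypothetical helper; it treats every status=sent or
# status=bounced log line as one outcome and ignores deferred attempts.
bounce_rate() {
  awk '
    /status=sent/    { sent++ }
    /status=bounced/ { bounced++ }
    END {
      total = sent + bounced
      if (total == 0) { print "0.0"; exit }  # no outcomes in this window
      printf "%.1f\n", 100 * bounced / total
    }
  '
}

# Usage:
#   bounce_rate < /var/log/mail.log
```

Tracking this value over time matters more than any single reading: a sudden rise usually means a list-quality or reputation problem rather than a local fault.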
The next error you’ll hit after fixing queue depth issues is often related to mail security, such as SPF/DKIM/DMARC failures causing legitimate mail to be rejected by recipients.