Infra Logging & Monitoring: Detect & Respond

The surprising truth about infrastructure logging and monitoring for security event detection is that the most critical events are often the ones you aren’t explicitly looking for.

Let’s watch this in action. Imagine we’re monitoring a web server, webserver-01. We’ve got standard access logs and system logs.

Here’s a snippet of access.log from /var/log/nginx/access.log:

192.168.1.10 - - [01/Oct/2023:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
192.168.1.11 - - [01/Oct/2023:10:00:05 +0000] "GET /images/logo.png HTTP/1.1" 200 5678 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
192.168.1.10 - - [01/Oct/2023:10:00:10 +0000] "GET /about.html HTTP/1.1" 200 987 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

And a snippet from /var/log/syslog:

Oct  1 10:00:00 webserver-01 systemd[1]: Started nginx.service.
Oct  1 10:00:01 webserver-01 kernel: [    0.123456] Linux version 5.15.0-86-generic (buildd@lcy02-amd64-033) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #92-Ubuntu SMP Thu Sep 14 21:19:15 UTC 2023
Oct  1 10:00:02 webserver-01 sshd[1234]: Server listening on 0.0.0.0 port 22.

These are standard. We might set up alerts for 4xx or 5xx errors in access.log, or for critical messages in syslog. But what if something more insidious is happening?
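
As a rough illustration of the 4xx/5xx alerting idea, here is a minimal Python sketch that parses lines in the format shown above and counts error classes. The regular expression assumes Nginx’s default "combined" layout, and the function names are hypothetical, not part of any tool. Lines that fail to parse are counted rather than dropped, since malformed requests can themselves be a signal.

```python
import re
from collections import Counter

# Assumes the Nginx "combined" layout used in the sample lines above.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["status"] = int(fields["status"])
    return fields


def count_error_classes(lines):
    """Count 4xx and 5xx responses, plus lines that could not be parsed."""
    counts = Counter()
    for line in lines:
        entry = parse_access_line(line)
        if entry is None:
            counts["unparsed"] += 1
        elif 400 <= entry["status"] < 500:
            counts["4xx"] += 1
        elif entry["status"] >= 500:
            counts["5xx"] += 1
    return counts
```

An alert would then fire when one of these counts crosses whatever limit you choose for a given time window.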

The core problem that logging and monitoring address is the "needle in a haystack" challenge. You have mountains of data, and you need to find the few events that indicate a compromise, misconfiguration, or operational failure before they cause significant damage. Traditional monitoring often focuses on known-bad patterns or fixed thresholds, so it misses novel or stealthy attacks.

Internally, logging systems capture events generated by applications and the operating system. Monitoring tools then ingest, parse, and analyze these logs. This can involve simple keyword searching, statistical anomaly detection, or sophisticated machine learning models. The key is to transform raw log data into actionable security intelligence.
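
To make "statistical anomaly detection" concrete, here is one very simple form of it, sketched in Python: compare the latest event count (say, requests per minute) against the mean and standard deviation of recent history. The function name, the threshold of 3 standard deviations, and the choice of a z-score are all assumptions for illustration; real systems often account for seasonality and use more robust statistics.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, threshold=3.0):
    """Return True if `current` deviates from `history` by more than
    `threshold` standard deviations (a plain z-score test)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is unusual.
        return current != mu
    return abs(current - mu) / sigma > threshold
```

For example, with a baseline of roughly 10 events per minute, a minute with 200 events would be flagged while a minute with 11 would not.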

Here are the levers you control:

Log Source Configuration: What gets logged and at what detail level. For Nginx, this is nginx.conf’s log_format directive. For syslog, it’s rsyslog.conf or syslog-ng.conf.
Log Forwarding: How logs get from the source to your central analysis system (e.g., Filebeat, Fluentd, rsyslog forwarding).
Analysis Rules/Queries: The specific patterns or statistical deviations you’re looking for. In tools like Splunk or Elasticsearch/Kibana, these are search queries over the logs; in Prometheus, they are alerting rules over metrics, which can include counters derived from log events.
Alerting Thresholds: For anomaly detection, defining what constitutes an "unusual" event (e.g., "more than 5 failed logins in 1 minute").
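
The "more than 5 failed logins in 1 minute" threshold from the last item can be sketched as a sliding window per source IP. This is a minimal Python illustration, not how any particular product implements it: the class name is hypothetical, and it assumes events are fed in chronological order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class FailedLoginWindow:
    """Track failed logins per source and flag sources over the threshold."""

    def __init__(self, max_failures=5, window=timedelta(minutes=1)):
        self.max_failures = max_failures
        self.window = window
        self._events = defaultdict(deque)  # source IP -> recent timestamps

    def record(self, source_ip, when):
        """Record one failure at time `when`.

        Returns True if this source now has more than `max_failures`
        failures inside the window. Assumes `when` never goes backwards.
        """
        q = self._events[source_ip]
        q.append(when)
        # Drop failures that have aged out of the window.
        while q and when - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_failures
```

Feeding it six failures from the same address within 50 seconds returns True on the sixth call; failures spaced 30 seconds apart never trigger it.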

Consider this log entry, which might appear innocuous at first glance:

Oct  1 10:05:30 webserver-01 sudo:    www-data : TTY=unknown ; PWD=/var/www/html ; USER=root ; COMMAND=/usr/bin/chown root:root /var/www/html/sensitive_config.php

This shows the www-data user (typically the web server’s unprivileged user) using sudo to run chown as root, changing the ownership of a sensitive file to root. In sudo’s log format, the name before the colon is the invoking user and USER= is the account the command runs as. If this is not a scheduled, expected operation, it’s a massive red flag: www-data should never be able to run commands as root. This suggests a privilege escalation or a web shell that has gained elevated permissions.

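To detect this, your monitoring system needs to ingest /var/log/auth.log (Debian/Ubuntu) or /var/log/secure (RHEL-family) and have a rule that looks for sudo events where the invoking user is www-data and the target user (USER=) is root, especially if the COMMAND touches sensitive system files or directories. Note that USER= in a sudo log line names the account the command runs as, not the caller, so the rule must match www-data as the invoker rather than as USER=. A query might look something like:

source="auth.log" "sudo" "www-data :" "USER=root"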
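
The same rule can be sketched outside any particular query language. The Python below is a hedged illustration: it assumes sudo’s usual log layout, in which the name before the colon is the invoking user and USER= is the account the command runs as. The SERVICE_ACCOUNTS set and the function name are assumptions; adjust them to your environment.

```python
import re

# Invoker appears before " : "; USER= is the target account.
SUDO_RE = re.compile(
    r'sudo(?:\[\d+\])?:\s+(?P<invoker>\S+) : '
    r'.*?USER=(?P<target>\S+) ;.*? COMMAND=(?P<command>.+)$'
)

# Assumption: accounts that should never invoke sudo on this host.
SERVICE_ACCOUNTS = {"www-data", "nginx", "apache"}


def suspicious_sudo(line):
    """Return the parsed fields if a service account ran a command as root
    via sudo, otherwise None."""
    m = SUDO_RE.search(line)
    if m is None:
        return None
    if m["invoker"] in SERVICE_ACCOUNTS and m["target"] == "root":
        return m.groupdict()
    return None
```

Run against the sudo entry shown earlier, this returns the invoker www-data, the target root, and the full chown command; an ordinary administrator’s sudo line returns None.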

The fix isn’t to ignore it, but to investigate why it happened. If it’s legitimate, you might need to adjust sudoers rules or application deployment processes. If it’s malicious, it’s an immediate incident.

The most common mistake is assuming that standard log formats are sufficient for security. Many systems, by default, log what happened but not the context that makes it a security event. For instance, a successful SSH login from a known-malicious IP is a critical event, but the log line records only the source IP. Unless something checks that IP against a threat intelligence feed, the login looks like any other and you’ll miss it. You need to enrich logs with external data.
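
Enrichment can be as simple as attaching a lookup result to each parsed event. The Python sketch below assumes OpenSSH’s "Accepted ... for ... from ... port ..." message and represents the threat intelligence feed as a plain list of networks; the function name, the threat_match field, and the feed contents are hypothetical.

```python
import ipaddress
import re

# Matches sshd messages for successful logins.
ACCEPTED_RE = re.compile(
    r'sshd\[\d+\]: Accepted (?P<method>\S+) for (?P<user>\S+) '
    r'from (?P<ip>\S+) port (?P<port>\d+)'
)


def enrich_ssh_login(line, bad_networks):
    """Parse a successful SSH login and add a `threat_match` flag.

    `bad_networks` stands in for a threat intelligence feed: an iterable
    of ipaddress network objects. Returns None for non-matching lines.
    """
    m = ACCEPTED_RE.search(line)
    if m is None:
        return None
    event = m.groupdict()
    addr = ipaddress.ip_address(event["ip"])
    event["threat_match"] = any(addr in net for net in bad_networks)
    return event
```

With the flag attached at ingestion time, the alerting rule becomes a simple filter on threat_match rather than a separate lookup at query time.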

The next step is often realizing that your log volume is too high, and you need to implement intelligent sampling or aggregation to keep costs down while retaining critical security signals.