Enterprise-Grade Monitoring & SIEM for a Homelab - From Zero to 76 Custom Alert Rules
TL;DR
I built a production-grade monitoring and SIEM platform for my entire homelab infrastructure running on a single-node K3s cluster. The system combines Prometheus for metrics, Grafana for visualization, Loki for log aggregation, Wazuh for security event management, and a proactive security stack (CrowdSec, Falco, honeypot, Trivy) - all deployed via Ansible and Helm with full Infrastructure as Code.
Key Metrics:
- 60 Prometheus scrape jobs monitoring 114 endpoints
- 35 custom Grafana dashboards + 10 community imports
- 69 custom Prometheus alert rules + 7 Loki LogQL alerts
- 20 active Wazuh security agents across 3 OS families
- 8 agent groups with specialized detection rules
- ~100 custom Wazuh rules (IDs 100100-100639)
- 3 Wazuh Grafana dashboards (SIEM, Compliance, Vulnerabilities)
- 14-day log retention in Loki
- Multi-tier alerting: Email + ntfy mobile push
- Proactive security: CrowdSec threat intelligence, Falco runtime monitoring, OpenCanary honeypot, Trivy vulnerability scanning
Why Build Enterprise Monitoring for a Homelab?
Many homelabs run blind. Services crash, disks fill up, certificates expire, and you only notice when something stops working. I wanted the opposite: know about problems before they become outages.
Three goals drove the implementation:
Visibility: Every host, every service, every metric - in one place. From GPU temperature on the AI VM to battery charge on the UPS, from LoRaWAN sensor signal strength to Proxmox guest status.
Security: Running 20+ hosts with internet-facing services demands real intrusion detection, not just hoping firewalls are enough. Wazuh provides file integrity monitoring, vulnerability scanning, and active response across the entire fleet - backed by CrowdSec’s community threat intelligence and Falco’s runtime syscall monitoring.
Automation: No manual log checking, no SSH-ing into boxes to check disk space. Alerts come to my phone. Dashboards show the full picture. Problems get detected and - in some cases - resolved automatically.
Architecture
The monitoring stack runs entirely within the monitoring namespace on a single-node K3s cluster, while Wazuh operates as an All-in-One LXC container on Proxmox. Both systems feed into the same Grafana instance.
Infrastructure:
| Component | Location | Role |
|---|---|---|
| Prometheus | K3s Pod | Metrics collection, 15-day retention, 15Gi storage |
| Grafana | K3s Pod | Visualization, 35+ custom dashboards |
| Loki | K3s Pod | Log aggregation, single-binary, 14-day retention |
| Alertmanager | K3s Pod | Alert routing, email + ntfy |
| Promtail | K3s DaemonSet + external agents | Log shipping from pods, PVE hosts, Wazuh |
| Wazuh Manager | LXC on Proxmox | SIEM: Manager + Indexer + Dashboard |
| CrowdSec | K3s Pod | Community IP blocklist + Traefik bouncer |
| Falco | K3s DaemonSet | eBPF runtime security monitoring |
| Honeypot | K3s Pod (MetalLB IP) | OpenCanary decoy services |
| 8 Exporters | K3s Pods | PVE, UniFi, Blackbox, SNMP, NUT, PBS, AdGuard, Speedtest |
| Gatus | K3s Pod | Status page with uptime tracking |
Deployment Method:
| Stack | Tool | Source |
|---|---|---|
| Prometheus + Grafana + Alertmanager | Helm (kube-prometheus-stack) | kubernetes/monitoring/install.sh |
| Loki + Promtail | Helm | kubernetes/monitoring/install.sh |
| All exporters | Kustomize | kubernetes/monitoring/<exporter>/ |
| CrowdSec + Traefik Bouncer | Helm | kubernetes/crowdsec/install.sh |
| Falco | Helm | kubernetes/falco/install.sh |
| Trivy Operator | Helm | kubernetes/trivy-operator/install.sh |
| Wazuh Manager | Ansible | ansible/playbooks/configure-wazuh-manager.yml |
| Wazuh Agents | Ansible | ansible/playbooks/setup-wazuh-agents.yml |
| External monitoring agents | Shell scripts | scripts/setup-monitoring-hosts.sh |
Everything lives in a single Git repository - true Infrastructure as Code with Ansible playbooks for Wazuh and Helm/Kustomize for Kubernetes workloads.
The Monitoring Stack
Prometheus: 60 Scrape Jobs, 114 Endpoints
Prometheus sits at the center, scraping metrics from every layer of the infrastructure. The configuration in values.yaml defines 60 jobs organized by category:
Infrastructure Hosts (7 targets):
| Target | Host | Port | Interval |
|---|---|---|---|
| pve-nodes | 3 Proxmox hypervisors | 9100 | 30s |
| ai-vm | AI/GPU VM | 9100 | 30s |
| relay | Mail relay LXC | 9100 | 30s |
| opnsense | Firewall | 9100 | 30s |
| smartctl | Storage host | 9633 | 300s |
Docker Host Monitoring (4 targets):
| Target | What It Monitors | Port | Interval |
|---|---|---|---|
| docker-hosts | node_exporter on 4 hosts | 9100 | 30s |
| www-node | Web server host metrics | 9100 | 30s |
| www-cadvisor | Docker container metrics | 8080 | 30s |
| www-server-probe | HTTP probes for portfolio + chat API | - | 30s |
API Exporters (8 targets):
| Target | What It Monitors | Port | Interval |
|---|---|---|---|
| pve-exporter | Proxmox API (all VMs/CTs) | 9221 | 60s |
| pbs-exporter | Proxmox Backup Server | 10019 | 60s |
| unifi-poller | UniFi Controller (APs, clients) | 9130 | 30s |
| snmp-switch | D-Link switch | 9116 | 60s |
| snmp-truenas | TrueNAS SNMP | 9116 | 60s |
| nut-exporter | Eaton UPS (battery, load) | 9199 | 30s |
| adguard-exporter | AdGuard DNS analytics | 9618 | 30s |
| speedtest | Internet bandwidth (every 4h) | 9798 | 300s |
Service Health Probes (Blackbox Exporter):
| Probe Type | Targets | Module |
|---|---|---|
| ICMP Ping | 23 hosts (all infrastructure) | icmp_ping |
| HTTP 2xx | 19 services | http_2xx |
| HTTP Any | OPNsense (Let’s Encrypt) | http_any |
| DNS | OPNsense Unbound | dns_test |
| SMTP | Internal mail relay | smtp_relay |
Application-Specific (30+ targets):
GitLab alone exposes 5 scrape targets (exporter, webservice, gitaly, postgresql, redis). Additional targets include Traefik ingress metrics, cert-manager certificate lifecycle, Cloudflare Tunnel stats, ChirpStack LoRaWAN, MQTT sensor exporter, NVIDIA DCGM GPU metrics (15s interval!), Home Assistant, Wazuh SIEM exporter, Gatus status page, chatbot API with bearer token authentication, ArgoCD (4 ServiceMonitors), Trivy Operator metrics, Falcosidekick events, and CrowdSec agent metrics.
The full target list reads like a network inventory - because it essentially is one.
Grafana: 35 Custom Dashboards
Every dashboard is deployed as a Kubernetes ConfigMap with the grafana_dashboard: "1" label, automatically discovered by Grafana’s sidecar. No manual import, no clicking through UIs - git push deploys dashboards.
Custom Dashboards (ConfigMap-based):
| Dashboard | Key Panels | Data Source |
|---|---|---|
| Homelab Overview | Service status grid, host health, quick links | Prometheus |
| Wazuh SIEM | Agent fleet, alert categories, top rules | Prometheus (Wazuh exporter) |
| Wazuh Compliance & Threats | SCA scores, MITRE ATT&CK tactics, auth events | Prometheus |
| Wazuh Vulnerability Deep Dive | CVE counts by severity, per-host breakdown, trends | Prometheus |
| AI Platform Overview | GPU temp/utilization/VRAM, inference latency | Prometheus-AI |
| Portfolio www.pichler.dev | HTTP probe phases, SSL expiry, Docker metrics | Prometheus |
| OPNsense Firewall | Interface throughput, packet stats, rules | Prometheus |
| PBS Backup | Backup/verify age, datastore usage, job status | Prometheus |
| SMART & ZFS Health | Disk temperatures, pool status, error counts | Prometheus |
| NUT UPS | Battery charge, runtime, load, input voltage | Prometheus |
| Traefik Ingress | Request rate, latency percentiles, error codes | Prometheus |
| Loki & Promtail | Ingestion rate, query latency, dropped logs | Prometheus + Loki |
| LoRaWAN Sensors | Battery %, RSSI, SNR, last seen | Prometheus |
| Power Cost & Energy | UPS consumption, cost per month | Prometheus |
| SLO & Uptime Tracking | Service availability percentages | Prometheus |
| Network Map & Status | L2 topology, link utilization | Prometheus |
| Container Security | Trivy CVE counts, image scan results | Prometheus |
| Email Security & DMARC | DMARC pass/fail, SPF alignment | Loki |
| Incident Timeline | Alert correlation, Loki errors, deploys | Prometheus + Loki |
| Capacity Planning | Storage growth, CPU trends, predictions | Prometheus |
| Security Posture Score | Combined security metrics across all tools | Prometheus |
| SRE Metrics MTTR/MTBF | Mean time to recovery/between failures | Prometheus |
| Internet Speedtest | Bandwidth and latency trends | Prometheus |
Plus dashboards for ArgoCD, GitLab, ChirpStack, TrueNAS, AdGuard Home, Cloudflare Tunnel, Home Assistant, DNS Query Analytics, Systemd Services, and more.
10 community dashboards imported by gnetId:
| Dashboard | gnetId | Purpose |
|---|---|---|
| Node Exporter Full | 1860 (rev 42) | Comprehensive host metrics |
| Proxmox VE Cluster | 10347 | VM/CT overview |
| UniFi Client/UAP/USW/Sites | 11315/11314/11312/11311 | Network analytics |
| Blackbox Exporter | 7587 | Probe results |
| MinIO | 13502 | Object storage |
| SNMP Stats | 11169 | Switch metrics |
| cert-manager | 20842 | Certificate lifecycle |
Alertmanager: 69 Custom Rules + 7 Loki LogQL
Alerts aren’t useful if they wake you up for non-issues. The alerting system uses inhibition rules to suppress noise - if a host is down, don’t also alert about its services being unreachable.
Alert Groups (20 categories, 69 Prometheus rules):
| Group | Rules | Examples |
|---|---|---|
| host-alerts | 11 | HostDown, HighCPU, HighMemory, DiskSpaceCritical/Warning, DiskWillFillIn3/7Days, SmartDiskErrors/HighTemp |
| service-alerts | 5 | ServiceDown, ExternalServiceDown, SlowResponse, SmtpRelayDown, DnsDown |
| ups-alerts | 3 | UpsOnBattery (immediate!), UpsLowBattery, UpsBatteryReplace |
| kubernetes-alerts | 4 | K3sNodeNotReady, PodCrashLooping, K3sStorageWarning/Critical |
| wazuh-alerts | 4 | WazuhManagerDown, AgentDisconnected, HighCriticalCVEs, AlertSpike |
| gpu-alerts | 4 | GPUHighTemperature, GPUCriticalTemperature, GPUMemoryHigh, GPUPowerHigh |
| backup-alerts | 4 | PbsBackupStale, PbsVerifyStale, PbsExporterDown, K3sBackupStale |
| certificate-alerts | 2 | CertExpiringSoon (30d), CertExpiryCritical (7d) |
| chatbot-alerts | 3 | ChatbotAPIDown, HighErrorRate, HighResponseTime |
| www-alerts | 3 | WWWServerDown, WWWServerHighCPU, WWWServerDiskFull |
| lorawan-alerts | 2 | LoRaSensorOffline (>2h), LoRaSensorBatteryLow (<15%) |
| argocd-alerts | 2 | ArgoCdAppOutOfSync, ArgoCdAppDegraded |
| internet-alerts | 2 | InternetSpeedLow, InternetLatencyHigh |
| security-scanning | 2 | TrivyCriticalCVEsHigh, DockerHostCriticalCVEs |
| self-monitoring | 7 | PrometheusTargetDown, AlertmanagerNotificationsFailing, LokiDroppingLogs, LokiIngestionStalled, PrometheusStorageFilling, PrometheusTSDBCompactionFailed, PrometheusSlowScrape |
| security-stack | 2 | CrowdSecDown, FalcoDown |
| misc | 3+ | PveGuestDown, MinioOffline, HighSwapUsage, PowerCostHigh, GatusDown, HomeAssistantDown |
Predictive Alerts:
Two standout rules use Prometheus predict_linear() to forecast disk exhaustion before it happens:
- DiskWillFillIn7Days - Warning, early heads-up
- DiskWillFillIn3Days - Critical, take action now
These have already proven their value - catching a runaway backup directory that would have filled the disk within days.
Inhibition Logic (5 rules):
HostDown -> suppresses all warnings for that host
UpsOnBattery -> suppresses non-critical alerts
DiskCritical -> suppresses DiskWarning (same mountpoint)
CertCritical -> suppresses CertWarning (same instance)
DiskFill3Days -> suppresses DiskFill7Days
7 Loki LogQL Alert Rules:
Beyond Prometheus metrics, Loki watches log streams for patterns that metrics can’t catch:
| Rule | What It Detects | Source |
|---|---|---|
| LogSSHBruteForce | Repeated SSH failures in auth.log | PVE Promtail |
| LogOOMKiller | Kernel OOM killer invocations | System journal |
| LogDiskIOError | Disk I/O errors in kernel logs | System journal |
| LogWazuhCriticalAlert | Wazuh alerts at level 13+ | Wazuh Promtail |
| LogPodOOMKilled | Kubernetes OOMKilled events | K3s pod logs |
| LogSystemdServiceFailed | Systemd service failures | System journal |
| LogProxmoxStorageError | Proxmox storage-related errors | PVE logs |
These complement the Prometheus rules - metrics tell you what is wrong, logs tell you why.
Notification Routing:
| Severity | Channel | Timing |
|---|---|---|
| Critical | Email + ntfy push | Immediate, repeat 7d |
| Warning | ntfy push only | No repeat |
| UpsOnBattery | Email + ntfy | group_wait: 0s, repeat: 5min |
| Info | Silence | Dashboard only |
The ntfy-bridge is a custom Python service that translates Alertmanager webhooks into ntfy push notifications with priority mapping. Critical alerts get priority 5 (urgent), warnings priority 3 (default), and resolved notifications priority 2 (low). Separate topics for homelab-critical and homelab-warnings keep the notification channels clean.
Loki: Centralized Log Aggregation
Loki runs in single-binary mode - all components in one pod. For a homelab, this is the sweet spot between simplicity and capability.
Log Sources (3 tiers):
Tier 1 - K3s Pods (Promtail DaemonSet):
All pod logs from /var/log/pods are automatically collected with Kubernetes label enrichment. Zero configuration per service.
Tier 2 - PVE Hosts (External Promtail agents):
Installed via scripts/install-node-exporter.sh --with-promtail on pve1, pve2, pve3. Ships syslog, auth.log, pveproxy, pvedaemon, kernel, and systemd journal to Loki’s NodePort (31000).
Tier 3 - Wazuh Manager + Relay (Dedicated Promtail): The most interesting sources. Promtail on the Wazuh LXC ships:
| Log Source | Format | Content |
|---|---|---|
| wazuh-alerts | JSON | Security alerts with rule IDs, severity levels |
| wazuh-manager | Text | Manager operational logs (ossec.log) |
| active-responses | Text | IP blocks, firewall drops |
| wazuh-api | Text | Dashboard/API access logs |
| syslog + auth | Text | System and authentication events |
| systemd journal | Structured | Service lifecycle events |
Additionally, Promtail on the mail relay ships mail.log for the Email Security & DMARC dashboard.
This creates a powerful correlation capability: Prometheus shows you what is happening (metrics), Loki shows you why (logs), and Wazuh shows you who is responsible (security events).
Configuration:
# Loki retention
limits_config:
retention_period: 336h # 14 days
max_query_series: 50000
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
max_streams_per_user: 10000
External Host Monitoring
Not everything runs in Kubernetes. The PVE hypervisors, OPNsense firewall, and Wazuh LXC need monitoring agents installed directly.
Installation:
# One script to rule them all
./scripts/setup-monitoring-hosts.sh
# Installs on pve1/2/3:
# - node_exporter (port 9100) - CPU, RAM, disk, network
# - promtail (log shipping to Loki NodePort)
# Installs on pve3 additionally:
# - smartctl_exporter (port 9633) - SMART disk health
OPNsense uses its native os-node_exporter plugin - installed via System -> Firmware -> Plugins. No SSH needed.
TrueNAS and MinIO expose Prometheus endpoints natively - just enable in their respective UIs.
Docker hosts (5 servers) run Trivy Docker Scanner via systemd timer every 6 hours, exposing vulnerability metrics through node_exporter’s textfile collector.
Wazuh SIEM: Security for Every Host
Why Wazuh?
Open-source SIEM that combines log analysis, intrusion detection, file integrity monitoring, vulnerability detection, and active response in one platform. For a homelab with internet-facing services, this isn’t optional - it’s essential.
Deployment: All-in-One LXC
| Property | Value |
|---|---|
| Platform | LXC on Proxmox |
| Resources | 4 vCPU, 6GB RAM, 50GB Disk |
| Version | Wazuh 4.14.3 |
| Components | Manager + OpenSearch Indexer + Dashboard |
Deployed and configured entirely via Ansible:
# Deploy agents to all Linux hosts
ansible-playbook -i inventory.yml playbooks/setup-wazuh-agents.yml
# Configure Manager (groups, rules, active response, email)
ansible-playbook -i inventory.yml playbooks/configure-wazuh-manager.yml
Agent Fleet: 20 Active Agents, 8 Groups
Every host in the homelab runs a Wazuh agent. Agents are organized into groups with tailored configurations:
Agent Groups:
| Group | Agents | Specialization |
|---|---|---|
| proxmox | 3 hypervisors | Hypervisor config monitoring, /etc/pve FIM |
| kubernetes | K3s node | K3s audit logs, K8s events, pod status, containerd |
| storage | 4 storage hosts | Backup configs, ZFS settings, storage credentials |
| network | 3 network services | Network service configs, Docker listener |
| services | 10 agents | Docker lifecycle, Nginx logs, service configs |
| ai-workload | GPU VM | Ignores large model files (.gguf, .safetensors) |
| windows | 1 Windows host | Windows Event Logs, Sysmon (pending deployment) |
| siem | Wazuh (self) | Self-monitoring: Manager, Dashboard, OpenSearch configs |
Each group has its own agent.conf in ansible/files/wazuh/shared/<group>/, defining:
- FIM (File Integrity Monitoring): Which paths to watch in realtime vs. scheduled scans
- Syscollector: Hardware/software inventory intervals
- Localfile: Which logs to collect and parse
- Docker listener: Container lifecycle events (enabled on 6 hosts)
- Vulnerability detection: OS and package scanning
Example - Kubernetes Group Configuration:
The kubernetes agent does the heavy lifting for cluster security:
<!-- K3s audit log (JSON) -->
<localfile>
<log_format>json</log_format>
<location>/var/log/k3s-audit.log</location>
</localfile>
<!-- K8s Warning events (streamed) -->
<localfile>
<log_format>json</log_format>
<location>/var/log/k8s-events.log</location>
</localfile>
<!-- Pod status check (every 2 min) -->
<localfile>
<log_format>full_command</log_format>
<command>kubectl get pods --all-namespaces -o json | jq ...</command>
<frequency>120</frequency>
</localfile>
<!-- Container status check (every 2 min) -->
<localfile>
<log_format>full_command</log_format>
<command>crictl ps -a -o json | jq ...</command>
<frequency>120</frequency>
</localfile>
Custom Rules: ~100 Rules Across 7 Categories
Every custom rule has a specific ID range and severity level. Severity determines action: level 10+ gets logged prominently, level 13+ triggers email alerts.
Proxmox Rules (100100-100199):
| Rule ID | Event | Level |
|---|---|---|
| 100100-101 | VM/CT start/stop | 5 |
| 100102 | VM migration failure | 10 |
| 100103 | Cluster membership change | 8 |
| 100104-105 | Backup job failed/success | 10/3 |
| 100106 | Storage config change | 7 |
| 100107 | Ceph degradation | 12 |
K3s Audit Rules (100200-100206):
| Rule ID | Event | Level |
|---|---|---|
| 100200 | Secret access/modification | 10 |
| 100201 | Pod deletion | 8 |
| 100202 | RBAC denial (403) | 12 |
| 100203 | Namespace operations | 7 |
| 100204 | Workload changes (deploy/ds/sts) | 5 |
| 100205 | RBAC config changes | 10 |
| 100206 | kubectl exec into pods | 8 |
Kubernetes Container Monitoring (100210-100241):
| Rule ID | Event | Level |
|---|---|---|
| 100211 | OOMKilled | 12 |
| 100212 | CrashLoopBackOff | 10 |
| 100213 | Image pull errors | 8 |
| 100215 | Pod evictions | 10 |
| 100219 | Node not ready | 12 |
| 100240 | Multiple OOMKills (3+ in 10min) | 13 (email!) |
| 100241 | Multiple CrashLoopBackOff (3+ in 5min) | 12 |
Docker Container Monitoring (100250-100256):
| Rule ID | Event | Level |
|---|---|---|
| 100250 | Container start | 5 |
| 100252 | Container died unexpectedly | 10 |
| 100253 | Container OOMKilled | 8 |
| 100255 | Command executed in container | 10 |
| 100256 | Multiple container deaths (3+ in 10min) | 12 |
Homelab Security Rules (100400-100410):
| Rule ID | Event | Level |
|---|---|---|
| 100400/450 | SSH from non-local (suppressed for trusted subnets) | 8/0 |
| 100404 | SSH brute force detection | 10 -> Active Response |
| 100405 | New user account creation | 8 |
| 100408 | Disk space exhausted | 12 |
| 100409 | OOM killer triggered | 12 |
| 100410 | Certificate expiration warning | 8 |
Web Attack Rules (100500-100505):
| Rule ID | Event | Level |
|---|---|---|
| 100500 | SQL injection attempt | 10 |
| 100501 | XSS attempt | 10 |
| 100502 | Path traversal attempt | 10 |
| 100503 | Command injection attempt | 12 |
| 100504 | Repeated web attacks from same IP | 13 (email!) |
| 100505 | Scanner/bot detection | 8 |
Honeypot & Threat Intelligence (100600-100639):
| Rule ID | Event | Level |
|---|---|---|
| 100600 | Honeypot service interaction | 10 |
| 100620 | CrowdSec ban decision | 8 |
Custom Decoders
Seven custom decoders parse non-standard log formats:
proxmox-task -> pvedaemon/pveproxy/pvestatd/pveceph logs
pbs-task -> Proxmox Backup Server proxy/manager logs
k3s-audit -> K3s API server audit events (JSON)
k8s-event -> Kubernetes Warning events from k8s-event-logger
k8s-pod-status -> kubectl pod status JSON output
containerd-status -> crictl container status JSON output
opnsense-filterlog -> OPNsense packet filter log parsing
Active Response: Automated Threat Mitigation
Wazuh doesn’t just detect - it responds. The active response system implements escalating IP blocking:
SSH Brute Force Response:
Trigger: Repeated failed SSH login attempts
Action: firewall-drop (iptables block)
Escalation: Increasing block durations from minutes to hours
Port Scan: Automatic temporary block
An IP whitelist protects infrastructure hosts (hypervisors, K3s master, Wazuh manager) from accidental self-lockout - a lesson learned the hard way in many SIEM deployments.
K3s Audit Logging
Kubernetes API audit logging captures every request to the K3s API server. The audit policy defines four levels:
| Level | Events |
|---|---|
| None | Health checks, list/watch, system service accounts |
| Metadata | Secret ops, RBAC changes, namespace ops, workload changes |
| RequestResponse | Pod exec/attach/portforward |
| RequestResponse | RBAC modifications |
Implementation:
# WARNING: ~30s API downtime during deployment
ansible-playbook -i inventory.yml playbooks/setup-k3s-audit-logging.yml
The playbook configures K3s with an audit policy file and log rotation (7 days, 100MB max, 3 backups). Wazuh’s kubernetes agent reads /var/log/k3s-audit.log via the custom k3s-audit decoder.
This catches critical events: someone accessing Secrets, RBAC permission denials (potential privilege escalation attempts), unauthorized kubectl exec into pods, and workload modifications.
K8s Container Monitoring
A two-part system provides container-level visibility:
Part 1 - k8s-event-logger (systemd service):
# Streams K8s Warning events in real-time
kubectl get events --all-namespaces \
--field-selector type!=Normal \
--watch-only \
-o json >> /var/log/k8s-events.log
This captures OOMKilled, CrashLoopBackOff, ImagePullBackOff, scheduling failures, and evictions as they happen.
Part 2 - Periodic Status Checks (every 2 minutes):
# Pods with restartCount > 3, waiting containers, or OOMKilled
kubectl get pods --all-namespaces -o json | jq '...'
# Non-running containers via containerd
crictl ps -a -o json | jq '...'
The combination of real-time event streaming and periodic health checks ensures nothing slips through.
Docker Container Monitoring
Six hosts run Docker alongside the Wazuh agent. The Docker listener wodle captures container lifecycle events:
| Host Type | Containers Monitored |
|---|---|
| Web server | Portfolio, chatbot, tunnel, analytics |
| Network | UniFi Controller |
| Security | Password manager |
| Media | Media server |
| Communication | Matrix/Synapse |
| Infrastructure | UPS notification service |
Events tracked: start, stop, die (unexpected), oom, pull, exec_start. An unexpected container death (rule 100252, level 10) gets immediate attention; three deaths in 10 minutes (rule 100256, level 12) indicates a systemic problem.
OPNsense: FreeBSD Agent
OPNsense requires special handling - it’s FreeBSD, not Linux. Deployment uses the native plugin:
System -> Firmware -> Plugins -> os-wazuh-agent -> Install
Services -> Wazuh Agent -> Settings -> Manager: <wazuh-ip> -> Enable
The custom opnsense-filterlog decoder parses OPNsense’s unique packet filter log format, extracting rule numbers, interfaces, source/destination IPs, and actions.
Email Alerting
Email alerting is configured with a high severity threshold - only critical events (multiple OOMKills in rapid succession, RBAC denials, Ceph degradation, or brute force escalations) trigger email notifications. This prevents warning spam. A daily summary report covers lower-severity events for non-urgent review.
Vulnerability Detection
Wazuh scans every agent for known CVEs using package inventory data and vulnerability feeds (updated hourly). The Grafana Vulnerability Deep Dive dashboard visualizes:
- Total CVE count by severity (Critical/High/Medium/Low)
- Top 15 CVEs by affected host count
- Per-host vulnerability breakdown
- CVE trends over time
- Agent keepalive staleness (detecting disconnected agents)
This drives our vulnerability remediation workflow:
# Patch all hosts (rolling update, one at a time)
ansible-playbook -i inventory.yml playbooks/patch-vulnerabilities.yml
Wazuh <-> Grafana Integration
The bridge between Wazuh and Grafana is a custom Prometheus exporter running on the Wazuh LXC. It queries the Wazuh Manager API and exposes 33 metric families:
Key Metrics Exported:
| Metric | Description |
|---|---|
| wazuh_agents_active | Number of connected agents |
| wazuh_agents_disconnected | Agents that lost connection |
| wazuh_alerts_24h | Alert volume (last 24 hours) |
| wazuh_alerts_by_level | Alerts grouped by severity |
| wazuh_vulnerabilities_by_severity | CVE counts (critical/high/medium/low) |
| wazuh_sca_score | Security Configuration Assessment score per agent |
| wazuh_mitre_tactic_count | MITRE ATT&CK tactic distribution |
| wazuh_fim_entries | File integrity monitoring file count |
| wazuh_active_response_24h | Automated blocks in last 24 hours |
| wazuh_agent_keepalive_age_seconds | Agent staleness indicator |
Three Dedicated Dashboards:
1. Wazuh SIEM Dashboard (uid: wazuh-siem)
- SIEM Overview: Manager status, agent counts, 24h alert volume
- Security Alerts: By agent (pie), by category (pie), by severity
- Events Breakdown: SCA, Rootcheck, FIM, AppArmor, K8s Audit stats
- Vulnerability Assessment: Severity distribution, per-host CVE bars
- Agent Fleet: Status grid (online/offline), OS distribution
- Trends: Alert rate, agent fleet, FIM entries over time
2. Wazuh Compliance & Threats (uid: wazuh-compliance)
- SCA Compliance: Score per agent (bar gauge), pass/fail breakdown
- MITRE ATT&CK: Tactic distribution, top 15 techniques, trends
- Authentication: Success/failed counters, active response count
3. Wazuh Vulnerability Deep Dive (uid: wazuh-vulns)
- Total CVEs with critical/high/medium/low breakdown
- Top 15 CVEs by count
- Per-host analysis with severity breakdown
- Agent health via keepalive age
Proactive Security Stack
Beyond detection, the security stack implements proactive defense - blocking known threats before they reach services, monitoring runtime behavior, and using deception to catch intruders.
CrowdSec: Community Threat Intelligence
CrowdSec adds a collaborative dimension to security. It receives crowd-sourced IP blocklists from thousands of deployments worldwide and applies them via a Traefik bouncer.
How it works:
- CrowdSec Agent monitors K3s logs for attack patterns (brute force, scanning, exploits)
- Local decisions block offending IPs immediately
- Community blocklist proactively blocks IPs flagged by other CrowdSec deployments
- Traefik Bouncer enforces bans at the ingress level - blocked IPs never reach services
This means known attackers are blocked before they even see a login prompt.
Falco: Runtime Security Monitoring
Falco provides kernel-level security monitoring using eBPF. It watches every syscall on the K3s node and alerts on suspicious behavior:
- Shell spawned inside a container
- Sensitive file access (/etc/shadow, /etc/passwd)
- Unexpected network connections from pods
- Crypto mining process patterns
- Reverse shell detection
- Container escape attempts
Deployed as a DaemonSet with eBPF driver - no kernel module needed. Falcosidekick forwards events to Prometheus for Grafana alerting.
Honeypot: Zero-False-Positive Intruder Detection
An OpenCanary honeypot runs on a dedicated MetalLB IP, presenting fake services:
| Service | Port | Purpose |
|---|---|---|
| FTP | 21 | Catch credential scanning |
| HTTP | 80 | Catch web scanning |
| MySQL | 3306 | Catch database scanning |
| Telnet | 23 | Catch legacy protocol scanning |
Any interaction with these services is by definition malicious - legitimate traffic has no reason to touch this IP. Honeypot events trigger Wazuh rule 100600 (level 10), providing zero-false-positive intruder detection.
Trivy: Container Vulnerability Scanning
Two layers of container scanning:
1. Trivy Operator (K3s): Helm-deployed operator that continuously scans all container images running in K3s. Results feed into the Container Security dashboard and trigger alerts when critical CVEs exceed thresholds.
2. Trivy Docker Scanner (5 Docker hosts):
Systemd timer runs trivy image every 6 hours on all Docker hosts, exporting results as Prometheus metrics via node_exporter’s textfile collector. Alert rule DockerHostCriticalCVEs fires when critical vulnerabilities are found.
# Deploy Trivy scanner to all Docker hosts
ansible-playbook -i inventory.yml playbooks/setup-trivy-docker-scanner.yml
Automation & Operational Intelligence
Weekly Reports
Two automated reports provide regular situational awareness:
| Report | Schedule | Content |
|---|---|---|
| Infrastructure Report | Sunday 10:00 | Host status, resource trends, service uptime, notable events |
| Update & CVE Report | Monday 09:00 | Available updates, new CVEs, Helm chart versions, alert summary |
Both support --dry-run for local preview and are sent via email.
AI-Powered Log Analysis
A Python script uses Claude to analyze Loki logs for anomalies that pattern-based rules would miss:
python3 scripts/ai-log-analysis.py
It pulls recent logs from Loki, identifies unusual patterns, correlates events across services, and generates a human-readable analysis. Useful for post-incident review and discovering unknown-unknowns.
Automated Backup Verification
Monthly automated restore testing via verify-backup.sh:
- Select a random VM backup from Proxmox Backup Server
- Restore it to a temporary VM on Proxmox
- Boot the VM, verify it starts successfully
- Destroy the temporary VM
- Report results via ntfy notification
This validates that backups are not just running but actually restorable - the only metric that matters.
Status Page (Gatus)
Gatus provides an external status page at status.pichler.dev (via Cloudflare Tunnel):
- 22 endpoint checks across K3s services, infrastructure, and external services
- Uptime badges via API for embedding in dashboards
- Alerting integration for endpoint failures
Automated Operations
| Automation | Schedule | Purpose |
|---|---|---|
| GitOps Drift Detection | Every 30min | Kustomize diff against live cluster state |
| Synthetic API Monitoring | Every 5min | End-to-end health checks for critical APIs |
| CT Monitor | Every 6h | Certificate Transparency log monitoring for *.pichler.dev |
| Nuclei Pentesting | Weekly | Automated vulnerability scanning of 11 services |
| Auto-Remediation | Event-driven | Webhook server for PodCrashLooping, DiskSpace, BackupStale |
| Grafana Annotations | On deploy | Mark deployments in Grafana dashboards for correlation |
Lessons Learned
1. Inhibition rules are non-negotiable. Without them, a single host going down generates 5+ alerts (host down + services down + probes failing). With inhibition, you get one alert.
2. Separate critical from noise early. Wazuh’s email threshold at level 13 and Alertmanager’s critical/warning split prevent alert fatigue. If everything is urgent, nothing is.
3. Single-binary Loki is perfect for homelabs. The microservices deployment mode is overkill. Single-binary with 14-day retention on local storage handles everything a homelab needs.
4. Custom decoders make or break Wazuh. Out-of-the-box Wazuh doesn’t understand Proxmox, K3s audit, or OPNsense logs. Seven custom decoders were needed to make the data useful.
5. Active response needs a whitelist. Without one, Wazuh’s SSH brute force blocking will eventually block your management IPs. The whitelist for PVE hosts and K3s master prevents self-lockout.
6. GitOps dashboards > manual dashboards. ConfigMap-based Grafana dashboards survive pod restarts, are version-controlled, and deploy automatically. Never create dashboards through the UI in production.
7. The Prometheus exporter bridge is worth building. Wazuh’s native dashboard is great for investigation, but Grafana provides the unified view. A custom exporter bridging the two gives you the best of both worlds.
8. Monitor the monitoring. Dedicated alerts for LokiDown, PromtailDown, WazuhManagerDown, WazuhAgentDisconnected, and PbsExporterDown ensure the observability stack itself stays healthy.
9. Predictive alerts save weekends. predict_linear() caught a disk filling up 3 days before it would have caused an outage. Reactive monitoring catches problems; predictive monitoring prevents them.
10. Defense in depth beats any single tool. CrowdSec blocks known attackers at the edge, Falco catches suspicious runtime behavior, the honeypot detects intruders with zero false positives, and Wazuh ties it all together with correlation and response. No single tool covers everything.
Built With Claude Code
The entire monitoring and SIEM stack - all Ansible playbooks, Helm values, custom rules, decoders, dashboards, exporters, alert configurations, and automation scripts - was built using Claude Code (Claude Opus).
My role: Architect defining requirements, reviewing outputs, and making security decisions.
Claude’s execution:
- 1,596-line Helm values.yaml with 60 scrape jobs and 69 alert rules
- ~100 Wazuh custom rules with proper severity levels and escalation
- 7 custom decoders for non-standard log formats
- 35 Grafana dashboard ConfigMaps with PromQL queries
- 8 exporter Kustomize deployments with proper resource limits
- Ansible playbooks for automated agent deployment across Linux, FreeBSD, and Windows
- ntfy-bridge Python service for mobile push notifications
- CrowdSec + Falco + Honeypot + Trivy security stack
- Weekly report generators (infrastructure + CVE)
- AI-powered log analysis pipeline
- Backup verification automation
- Shell scripts for external monitoring agent installation
This is infrastructure-as-code at scale - the kind of work that would take weeks manually, delivered in days with AI-assisted development.
What’s Next
Shipped:
- 60 Prometheus scrape jobs covering all infrastructure (114 endpoints)
- 20 active Wazuh agents with specialized group configs
- 69 custom Prometheus + 7 Loki LogQL alert rules with inhibition
- 35 custom Grafana dashboards (67 total including community + built-ins)
- Multi-tier alerting (email + ntfy)
- Active response (SSH brute force, port scan blocking)
- CrowdSec community threat intelligence with Traefik bouncer
- Falco eBPF runtime security monitoring
- OpenCanary honeypot for zero-false-positive intruder detection
- Trivy container scanning (K3s Operator + Docker host scanner)
- Restore testing automation (monthly random VM restore from PBS, boot verification, auto-cleanup)
- UPS auto-shutdown orchestration (NUT client on all PVE hosts, graceful shutdown on LOWBATT)
- Incident Timeline dashboard (alert correlation, Loki error logs, ArgoCD deploy tracking)
- Weekly infrastructure + CVE reports (automated email)
- AI-powered log analysis (Claude-based anomaly detection)
- GitOps drift detection (CronJob every 30min)
- Automated pentesting (Nuclei, weekly)
- Certificate Transparency monitoring (CronJob every 6h)
- Auto-remediation webhook (PodCrashLooping, DiskSpace, BackupStale)
- Gatus status page (external access via Cloudflare Tunnel)
- Offsite backup (PBS -> Hetzner S3 Storage Box, 3-2-1 rule)
Roadmap:
- Server closet sensor: LoRaWAN sensor measuring temperature, humidity (and depending on model CO2/VOC), displayed via ChirpStack MQTT Exporter -> Prometheus -> Grafana
The Stack (Complete Reference)
| Layer | Technology |
|---|---|
| Metrics | Prometheus (kube-prometheus-stack Helm, 15d retention, 60 jobs) |
| Visualization | Grafana (35 custom + 10 community dashboards) |
| Logs | Loki single-binary (14d retention, 10Gi storage) |
| Log Shipping | Promtail (K3s DaemonSet + 5 external agents) |
| SIEM | Wazuh 4.14.3 All-in-One (LXC, 20 agents, 8 groups) |
| Alerting | Alertmanager -> Email (critical) + ntfy (all), 76 custom rules |
| Log Alerts | Loki LogQL (7 rules: SSH, OOM, disk, pod, systemd, Proxmox) |
| Host Metrics | node_exporter on 7+ hosts + OPNsense plugin |
| Proxmox | pve-exporter (API scrape, cluster mode) |
| Backup Monitoring | pbs-exporter (Proxmox Backup Server) |
| Network | UniFi Poller + SNMP Exporter (switch, TrueNAS) |
| UPS | NUT Exporter (Eaton Ellipse PRO 850) |
| DNS | AdGuard Exporter |
| Probing | Blackbox Exporter (23 ICMP, 19 HTTP, DNS, SMTP) |
| Disk Health | smartctl_exporter on pve3 |
| Bandwidth | Speedtest Exporter (4h intervals) |
| GPU | DCGM Exporter (NVIDIA RTX 3060, 15s scrape) |
| LoRaWAN | ChirpStack MQTT Exporter |
| Threat Intelligence | CrowdSec (community IP blocklist + Traefik bouncer) |
| Runtime Security | Falco (eBPF syscall monitoring) |
| Deception | OpenCanary Honeypot (FTP/HTTP/MySQL/Telnet decoy) |
| Container Scanning | Trivy Operator (K3s) + Docker Scanner (5 hosts) |
| Security Rules | ~100 custom Wazuh rules (IDs 100100-100639) |
| Active Response | SSH brute force + port scan blocking with escalation |
| Vulnerability | Wazuh CVE scanning + Grafana deep-dive dashboard |
| Status Page | Gatus (22 endpoint checks, status.pichler.dev) |
| Reporting | Weekly infra + CVE reports, Wazuh security report |
| AI Analysis | Claude-powered Loki anomaly detection |
| Deployment | Ansible (Wazuh) + Helm/Kustomize (K8s) |
| Ingress | Traefik (internal) + Cloudflare Tunnel (external) |
| TLS | cert-manager with Cloudflare DNS-01 wildcard |
| Infrastructure | Proxmox -> Ubuntu 24.04 -> K3s single-node |








