Observability stack
This summary page explains what the observability stack is, how the series fits together, and how to get started.
Observability stack series
- Observability stack - You are here
- Architecture
- Manifests
- Flux integration
- Operations
What this is
A GitOps-managed observability stack that provides proactive alerting for issues Kubernetes events do not surface. It combines metrics, logs, and external monitoring into a single deployment with Slack notifications.
The problem I solved
I discovered my website was down by chance. Cloudflare showed a 504 gateway timeout, but the tunnel was healthy. The real issue was buried in the MariaDB pod logs: InnoDB IO errors that never appeared in Kubernetes events, so my existing cluster monitoring never alerted me. The website was down because its database was not fully operational.
Without log-based alerting I missed it entirely. So today I rebuilt my entire observability stack from the ground up.
This stack solves that by:
- Collecting logs from all pods and alerting on error patterns
- Monitoring external HTTP endpoints for availability
- Providing dashboards for metrics and log queries
- Sending all alerts to Slack
Immediate value
Within minutes of deploying the stack, Slack lit up with alerts for an unrelated namespace: blaster-dev. The PostgreSQL database had been silently crash-looping for 17 days with 45+ restarts - invisible because I rarely check the dev environment.
[FIRING] KubeStatefulSetReplicasMismatch blaster-dev
[FIRING] KubePodCrashLooping blaster-dev (postgres CrashLoopBackOff)
The root cause: an aggressive liveness probe timeout (the 1-second default) combined with a probe command that resolved the Service name instead of localhost - a chicken-and-egg deadlock, because the Service only routes to Ready pods, so the probe could never succeed and the pod could never become Ready.
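A fix for that kind of deadlock is to probe the container directly and relax the timeout. A sketch (the command, user, and exact values are illustrative, not the actual manifest from blaster-dev):

```yaml
livenessProbe:
  exec:
    # Probe the local socket, not the Service name, so readiness
    # of the Service is never a precondition for the probe itself
    command: ["pg_isready", "-h", "127.0.0.1", "-U", "postgres"]
  timeoutSeconds: 5     # the 1s default is too aggressive for a busy database
  periodSeconds: 10
  failureThreshold: 3
```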
This is exactly what the stack is designed to catch: issues that exist but never surface until something breaks in production.
Components
| Component | Version | Purpose |
|---|---|---|
| kube-prometheus-stack | 80.6.0 | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics |
| Loki | 6.46.0 | Log aggregation |
| Grafana Alloy | 1.5.1 | Log collection (DaemonSet, replaces Promtail) |
| Uptime Kuma | 1.23.16 | External HTTP monitoring |
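To illustrate how Alloy takes over Promtail's role, a minimal Alloy configuration for shipping pod logs to Loki might look like this (a sketch; the Loki gateway URL and component names are assumptions, not the exact config from this stack):

```alloy
// Discover all pods in the cluster
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail pod logs via the Kubernetes API and forward them to Loki
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Push to the in-cluster Loki gateway (URL is an assumption)
loki.write "default" {
  endpoint {
    url = "http://loki-gateway.monitoring.svc/loki/api/v1/push"
  }
}
```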
Alert coverage
| Layer | Tool | Detects |
|---|---|---|
| External | Uptime Kuma | Site down, SSL expiry, slow response |
| Kubernetes | Prometheus + kube-state-metrics | Pod crashes, restarts, OOM, node issues |
| Application | Loki + Alloy | MariaDB IO errors, PHP fatal errors, custom patterns |
| Infrastructure | node-exporter | Disk full, CPU/memory pressure, NFS issues |
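The application layer is the one that would have caught the original MariaDB incident. A hedged sketch of a Loki ruler alerting rule for that pattern (the namespace, label selector, and regex are assumptions):

```yaml
groups:
  - name: log-alerts
    rules:
      - alert: MariaDBInnoDBErrors
        # Fire if any log line matching the InnoDB error pattern
        # appeared in the last 5 minutes (namespace is an assumption)
        expr: |
          sum(count_over_time({namespace="wordpress"} |~ "InnoDB.*[Ee]rror" [5m])) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "InnoDB errors detected in MariaDB logs"
```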
Quick start
After deploying via Flux:
# Check Flux Kustomizations
flux get kustomizations -n flux-system | grep monitoring
# Check HelmReleases
flux get helmreleases -n monitoring
# Check all pods
kubectl get pods -n monitoring
# Get Grafana admin password
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
-o jsonpath="{.data.admin-password}" | base64 -d && echo
Architecture overview
Repository structure
The stack spans two repositories following the standard GitOps pattern.
App repo: observability/monitoring
Contains the Kubernetes manifests for the monitoring components.
k8s/prod/
├── kustomization.yaml
├── 00-secret-slack.enc.yaml # SOPS-encrypted Slack webhook
├── 10-helmrelease-kube-prom.yaml # Prometheus stack
├── 20-helmrelease-loki.yaml # Loki
├── 30-helmrelease-alloy.yaml # Log collector
├── 40-uptime-kuma/ # External monitoring
├── 50-ingress-grafana.yaml # https://grafana.example.local
└── 51-ingress-uptime-kuma.yaml # https://uptime.example.local
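As an example of what one of these files contains, 10-helmrelease-kube-prom.yaml would be a Flux HelmRelease roughly like the following (a sketch; the HelmRepository name and interval are assumptions, the chart version matches the components table):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 1h
  chart:
    spec:
      chart: kube-prometheus-stack
      version: "80.6.0"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community   # name is an assumption
        namespace: monitoring
  # values would configure Alertmanager's Slack receiver,
  # persistence, ingress, etc.
```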
Flux config: your-flux-org/flux-config
Contains the Flux objects that deploy the app repo with proper dependency ordering.
clusters/my-cluster/monitoring/
├── kustomization.yaml
├── source.yaml # GitRepository
├── 00-kustomization-ns.yaml # Namespace + HelmRepositories
├── 10-kustomization-app.yaml # App (dependsOn: monitoring-ns)
└── ns/
├── namespace.yaml
├── 10-helm-repo-prometheus.yaml
└── 11-helm-repo-grafana.yaml
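The dependency ordering in 10-kustomization-app.yaml is what prevents the app from syncing before the namespace and Helm repositories exist. A sketch (the object names, path, and SOPS secret name are assumptions consistent with the tree above):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: monitoring-app
  namespace: flux-system
spec:
  dependsOn:
    - name: monitoring-ns   # wait for namespace + HelmRepositories
  interval: 10m
  path: ./k8s/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: monitoring
  decryption:
    provider: sops          # decrypts 00-secret-slack.enc.yaml
    secretRef:
      name: sops-age        # secret name is an assumption
```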
Applications using this stack
Other applications in the cluster expose Prometheus metrics via ServiceMonitor and define alerts via PrometheusRule:
| Application | Metrics exposed | Alerts |
|---|---|---|
| Email relay | MX validation accept/reject, Mailpit message counts | High rejection rate, relay down |
| Cal.com | - | Uses email-relay for notifications |
Each application series documents its own ServiceMonitor and PrometheusRule configuration.
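For reference, a ServiceMonitor/PrometheusRule pair for an application like the email relay could be sketched as follows (label selectors, port name, and the exact alert expression are assumptions; each application series documents its real config):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: email-relay
  namespace: email-relay
  labels:
    release: kube-prometheus-stack   # assumed label Prometheus selects on
spec:
  selector:
    matchLabels:
      app: email-relay
  endpoints:
    - port: metrics     # named port on the app's Service
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: email-relay
  namespace: email-relay
spec:
  groups:
    - name: email-relay
      rules:
        - alert: EmailRelayDown
          expr: up{job="email-relay"} == 0
          for: 5m
          labels:
            severity: critical
```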
What you will learn
- Architecture: how the components fit together and the data flow from pods to Slack
- Manifests: the app repo structure and key HelmRelease configurations
- Flux integration: dependency control to avoid race conditions, GitRepository and Kustomization setup
- Operations: DNS setup, accessing dashboards, testing alerts, and troubleshooting