Observability stack
This summary page explains what the observability stack is, how the series fits together, and how to get started.
Observability stack series
- Observability stack - You are here
- Architecture
- Manifests
- Flux integration
- Operations
What this is
A GitOps-managed observability stack that provides proactive alerting for issues Kubernetes events do not surface. It combines metrics, logs, and external monitoring into a single deployment with Slack notifications.
The problem I solved
I discovered my website was down by chance. Cloudflare showed a 504 gateway timeout, but the tunnel was healthy. The real issue was buried in the MariaDB pod logs: InnoDB IO errors that never appeared in Kubernetes events, so my existing cluster monitoring never alerted me. The website was down because its database was not fully operational.
Without log-based alerting I missed it entirely. So today I rebuilt my entire observability stack from the ground up.
This stack solves that by:
- Collecting logs from all pods and alerting on error patterns
- Monitoring external HTTP endpoints for availability
- Providing dashboards for metrics and log queries
- Sending all alerts to Slack
Immediate value
Within minutes of deploying the stack, Slack lit up with alerts for an unrelated namespace: blaster-dev. The PostgreSQL database had been silently crash-looping for 17 days with 45+ restarts - invisible because I rarely check the dev environment.
[FIRING] KubeStatefulSetReplicasMismatch blaster-dev
[FIRING] KubePodCrashLooping blaster-dev (postgres CrashLoopBackOff)
The root cause: an aggressive liveness probe timeout (the 1-second default) combined with a probe command that resolved the Service name instead of localhost - a chicken-and-egg deadlock, because the Service only routes to Ready pods, so the probe could never succeed and the pod could never become Ready.
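A fix for that kind of deadlock is to probe the container directly and relax the timeout. A sketch (the command, user, and exact values are illustrative, not the actual manifest from blaster-dev):

```yaml
livenessProbe:
  exec:
    # Probe the local socket, not the Service name, so readiness
    # of the Service is never a precondition for the probe itself
    command: ["pg_isready", "-h", "127.0.0.1", "-U", "postgres"]
  timeoutSeconds: 5     # the 1s default is too aggressive for a busy database
  periodSeconds: 10
  failureThreshold: 3
```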
This is exactly what the stack is designed to catch: issues that exist but never surface until something breaks in production.
Components
| Component | Version | Purpose |
|---|---|---|
| kube-prometheus-stack | 80.6.0 | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics |
| Loki | 6.46.0 | Log aggregation |
| Grafana Alloy | 1.5.1 | Log collection (DaemonSet, replaces Promtail) |
| Uptime Kuma | 1.23.16 | External HTTP monitoring |
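To illustrate how Alloy takes over Promtail's role, a minimal Alloy configuration for shipping pod logs to Loki might look like this (a sketch; the Loki gateway URL and component names are assumptions, not the exact config from this stack):

```alloy
// Discover all pods in the cluster
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail pod logs via the Kubernetes API and forward them to Loki
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Push to the in-cluster Loki gateway (URL is an assumption)
loki.write "default" {
  endpoint {
    url = "http://loki-gateway.monitoring.svc/loki/api/v1/push"
  }
}
```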
Alert coverage
| Layer | Tool | Detects |
|---|---|---|
| External | Uptime Kuma | Site down, SSL expiry, slow response |
| Kubernetes | Prometheus + kube-state-metrics | Pod crashes, restarts, OOM, node issues |
| Application | Loki + Alloy | MariaDB IO errors, PHP fatal errors, custom patterns |
| Infrastructure | node-exporter | Disk full, CPU/memory pressure, NFS issues |
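The application layer is the one that would have caught the original MariaDB incident. A hedged sketch of a Loki ruler alerting rule for that pattern (the namespace, label selector, and regex are assumptions):

```yaml
groups:
  - name: log-alerts
    rules:
      - alert: MariaDBInnoDBErrors
        # Fire if any log line matching the InnoDB error pattern
        # appeared in the last 5 minutes (namespace is an assumption)
        expr: |
          sum(count_over_time({namespace="wordpress"} |~ "InnoDB.*[Ee]rror" [5m])) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "InnoDB errors detected in MariaDB logs"
```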
Quick start
After deploying via Flux:
# Check Flux Kustomizations
flux get kustomizations -n flux-system | grep monitoring
# Check HelmReleases
flux get helmreleases -n monitoring
# Check all pods
kubectl get pods -n monitoring
# Get Grafana admin password
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
-o jsonpath="{.data.admin-password}" | base64 -d && echo
Architecture overview
Repository structure
The stack spans two repositories following the standard GitOps pattern.
App repo: observability/monitoring
Contains the Kubernetes manifests for the monitoring components.
k8s/prod/
├── kustomization.yaml
├── 00-secret-slack.enc.yaml # SOPS-encrypted Slack webhook
├── 10-helmrelease-kube-prom.yaml # Prometheus stack
├── 20-helmrelease-loki.yaml # Loki
├── 30-helmrelease-alloy.yaml # Log collector
├── 40-uptime-kuma/ # External monitoring
├── 50-ingress-grafana.yaml # https://grafana.example.local
└── 51-ingress-uptime-kuma.yaml # https://uptime.example.local
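As an example of what one of these files contains, 10-helmrelease-kube-prom.yaml would be a Flux HelmRelease roughly like the following (a sketch; the HelmRepository name and interval are assumptions, the chart version matches the components table):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 1h
  chart:
    spec:
      chart: kube-prometheus-stack
      version: "80.6.0"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community   # name is an assumption
        namespace: monitoring
  # values would configure Alertmanager's Slack receiver,
  # persistence, ingress, etc.
```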
Flux config: your-flux-org/flux-config
Contains the Flux objects that deploy the app repo with proper dependency ordering.
clusters/my-cluster/monitoring/
├── kustomization.yaml
├── source.yaml # GitRepository
├── 00-kustomization-ns.yaml # Namespace + HelmRepositories
├── 10-kustomization-app.yaml # App (dependsOn: monitoring-ns)
└── ns/
├── namespace.yaml
├── 10-helm-repo-prometheus.yaml
└── 11-helm-repo-grafana.yaml
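The dependency ordering in 10-kustomization-app.yaml is what prevents the app from syncing before the namespace and Helm repositories exist. A sketch (the object names, path, and SOPS secret name are assumptions consistent with the tree above):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: monitoring-app
  namespace: flux-system
spec:
  dependsOn:
    - name: monitoring-ns   # wait for namespace + HelmRepositories
  interval: 10m
  path: ./k8s/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: monitoring
  decryption:
    provider: sops          # decrypts 00-secret-slack.enc.yaml
    secretRef:
      name: sops-age        # secret name is an assumption
```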
Applications using this stack
Other applications in the cluster expose Prometheus metrics via ServiceMonitor and define alerts via PrometheusRule:
| Application | Metrics exposed | Alerts |
|---|---|---|
| Email relay | MX validation accept/reject, Mailpit message counts | High rejection rate, relay down |
| Cal.com | - | Uses email-relay for notifications |
Each application series documents its own ServiceMonitor and PrometheusRule configuration.
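For reference, a ServiceMonitor/PrometheusRule pair for an application like the email relay could be sketched as follows (label selectors, port name, and the exact alert expression are assumptions; each application series documents its real config):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: email-relay
  namespace: email-relay
  labels:
    release: kube-prometheus-stack   # assumed label Prometheus selects on
spec:
  selector:
    matchLabels:
      app: email-relay
  endpoints:
    - port: metrics     # named port on the app's Service
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: email-relay
  namespace: email-relay
spec:
  groups:
    - name: email-relay
      rules:
        - alert: EmailRelayDown
          expr: up{job="email-relay"} == 0
          for: 5m
          labels:
            severity: critical
```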
What you will learn
- Architecture: how the components fit together and the data flow from pods to Slack
- Manifests: the app repo structure and key HelmRelease configurations
- Flux integration: dependency control to avoid race conditions, GitRepository and Kustomization setup
- Operations: DNS setup, accessing dashboards, testing alerts, and troubleshooting