Observability stack

This summary page explains what the observability stack is, how the series fits together, and how to get started.

Observability stack series

  1. Observability stack - You are here
  2. Architecture
  3. Manifests
  4. Flux integration
  5. Operations

What this is

A GitOps-managed observability stack that provides proactive alerting for issues Kubernetes events do not surface. It combines metrics, logs, and external monitoring into a single deployment with Slack notifications.

The problem I solved

I discovered my website was down by chance. Cloudflare showed a 504 gateway timeout, but the tunnel was healthy. The real issue was buried in the MariaDB pod logs: InnoDB IO errors that never appeared in Kubernetes events, so my existing cluster monitoring never alerted on them. The website was down because its database was not fully operational.

Without log-based alerting I missed it entirely. So today I rebuilt my entire observability stack from the ground up.

This stack solves that by:

  • Collecting logs from all pods and alerting on error patterns
  • Monitoring external HTTP endpoints for availability
  • Providing dashboards for metrics and log queries
  • Sending all alerts to Slack
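
As a sketch of what log-based alerting looks like here, a Loki ruler rule along these lines would have caught the MariaDB incident. The label selector, match strings, and threshold are illustrative assumptions, not the stack's actual configuration:

```yaml
# Hypothetical Loki ruler rule group; the {app="mariadb"} selector,
# matched strings, and threshold are illustrative, not this stack's
# actual config.
groups:
  - name: mariadb-log-alerts
    rules:
      - alert: MariaDBInnoDBIOErrors
        # Count MariaDB log lines mentioning InnoDB errors over 5 minutes
        expr: |
          sum(count_over_time({app="mariadb"} |= "InnoDB" |= "error"[5m])) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "InnoDB IO errors detected in MariaDB logs"
```

Loki evaluates the rule continuously and hands firing alerts to Alertmanager, which routes them to Slack like any metric-based alert.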

Immediate value

Within minutes of deploying the stack, Slack lit up with alerts for an unrelated namespace: blaster-dev. The PostgreSQL database had been silently crash-looping for 17 days, accumulating 45+ restarts, invisible because I rarely check the dev environment.

[FIRING] KubeStatefulSetReplicasMismatch blaster-dev
[FIRING] KubePodCrashLooping blaster-dev (postgres CrashLoopBackOff)

The root cause: an aggressive liveness probe timeout (the 1-second default) combined with a probe command that resolved the Service name instead of localhost, creating a chicken-and-egg deadlock where the pod could never become Ready.
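
The fix can be sketched roughly as follows; the probe command and timing values are illustrative assumptions, not the actual blaster-dev manifests. Probing 127.0.0.1 breaks the deadlock because the check no longer depends on the Service having endpoints, which it only gets once the pod is Ready:

```yaml
# Hypothetical corrected probe for the postgres StatefulSet.
# Probing 127.0.0.1 avoids resolving the Service name, which only
# points at this pod once it is already Ready.
livenessProbe:
  exec:
    command: ["pg_isready", "-h", "127.0.0.1", "-U", "postgres"]
  timeoutSeconds: 5      # the 1s default was too aggressive
  periodSeconds: 10
  failureThreshold: 3
```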

This is exactly what the stack is designed to catch: issues that exist but never surface until something breaks in production.

Components

| Component | Version | Purpose |
| --- | --- | --- |
| kube-prometheus-stack | 80.6.0 | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics |
| Loki | 6.46.0 | Log aggregation |
| Grafana Alloy | 1.5.1 | Log collection (DaemonSet, replaces Promtail) |
| Uptime Kuma | 1.23.16 | External HTTP monitoring |

Alert coverage

| Layer | Tool | Detects |
| --- | --- | --- |
| External | Uptime Kuma | Site down, SSL expiry, slow response |
| Kubernetes | Prometheus + kube-state-metrics | Pod crashes, restarts, OOM, node issues |
| Application | Loki + Alloy | MariaDB IO errors, PHP fatal errors, custom patterns |
| Infrastructure | node-exporter | Disk full, CPU/memory pressure, NFS issues |

Quick start

After deploying via Flux:

# Check Flux Kustomizations
flux get kustomizations -n flux-system | grep monitoring

# Check HelmReleases
flux get helmreleases -n monitoring

# Check all pods
kubectl get pods -n monitoring

# Get Grafana admin password
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d && echo

Architecture overview

Repository structure

The stack spans two repositories following the standard GitOps pattern.

App repo: observability/monitoring

Contains the Kubernetes manifests for the monitoring components.

k8s/prod/
├── kustomization.yaml
├── 00-secret-slack.enc.yaml # SOPS-encrypted Slack webhook
├── 10-helmrelease-kube-prom.yaml # Prometheus stack
├── 20-helmrelease-loki.yaml # Loki
├── 30-helmrelease-alloy.yaml # Log collector
├── 40-uptime-kuma/ # External monitoring
├── 50-ingress-grafana.yaml # https://grafana.example.local
└── 51-ingress-uptime-kuma.yaml # https://uptime.example.local
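
To illustrate the shape of these manifests, a trimmed version of 10-helmrelease-kube-prom.yaml might look like the following. The interval and the HelmRepository name are assumptions for the sketch; only the chart name and version come from the components table above:

```yaml
# Hypothetical excerpt of 10-helmrelease-kube-prom.yaml; the
# interval and HelmRepository name are illustrative. The chart
# version matches the components table.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      version: "80.6.0"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: monitoring
```

The Slack webhook itself lives in the SOPS-encrypted Secret (00-secret-slack.enc.yaml) rather than in the HelmRelease values, so the repo stays safe to keep public.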

Flux config: your-flux-org/flux-config

Contains the Flux objects that deploy the app repo with proper dependency ordering.

clusters/my-cluster/monitoring/
├── kustomization.yaml
├── source.yaml # GitRepository
├── 00-kustomization-ns.yaml # Namespace + HelmRepositories
├── 10-kustomization-app.yaml # App (dependsOn: monitoring-ns)
└── ns/
    ├── namespace.yaml
    ├── 10-helm-repo-prometheus.yaml
    └── 11-helm-repo-grafana.yaml
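
The dependency ordering is carried by the dependsOn field. A sketch of 10-kustomization-app.yaml might look like this; the path, intervals, and decryption secret name are assumptions, but the dependsOn relationship matches the layout above:

```yaml
# Hypothetical excerpt of 10-kustomization-app.yaml; path, interval,
# and the SOPS secret name are illustrative. dependsOn ensures the
# namespace and HelmRepositories exist before the HelmReleases
# reconcile.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: monitoring-app
  namespace: flux-system
spec:
  dependsOn:
    - name: monitoring-ns
  interval: 10m
  path: ./k8s/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: monitoring
  decryption:
    provider: sops
    secretRef:
      name: sops-age
```

Without dependsOn, Flux could try to apply the HelmReleases before their HelmRepositories exist, producing the race conditions the Flux integration article covers.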

Applications using this stack

Other applications in the cluster expose Prometheus metrics via ServiceMonitor and define alerts via PrometheusRule:

| Application | Metrics exposed | Alerts |
| --- | --- | --- |
| Email relay | MX validation accept/reject, Mailpit message counts | High rejection rate, relay down |
| Cal.com | - | Uses email-relay for notifications |

Each application series documents its own ServiceMonitor and PrometheusRule configuration.
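
As a rough sketch of that per-application pattern, the email relay's ServiceMonitor and PrometheusRule could look like this. The namespace, label selector, port name, and metric names are all illustrative assumptions; the real definitions live in that application's own series:

```yaml
# Hypothetical ServiceMonitor for the email relay; namespace,
# labels, and port name are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: email-relay
  namespace: email-relay
spec:
  selector:
    matchLabels:
      app: email-relay
  endpoints:
    - port: metrics
      interval: 30s
---
# Hypothetical PrometheusRule for the "high rejection rate" alert;
# the metric names are invented for this sketch.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: email-relay-alerts
  namespace: email-relay
spec:
  groups:
    - name: email-relay
      rules:
        - alert: EmailRelayHighRejectionRate
          expr: |
            rate(relay_reject_total[10m]) / rate(relay_accept_total[10m]) > 0.5
          for: 10m
          labels:
            severity: warning
```

Prometheus discovers the ServiceMonitor automatically, so new applications plug into the stack without touching the monitoring repo.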

What you will learn

  • Architecture: how the components fit together and the data flow from pods to Slack
  • Manifests: the app repo structure and key HelmRelease configurations
  • Flux integration: dependency control to avoid race conditions, GitRepository and Kustomization setup
  • Operations: DNS setup, accessing dashboards, testing alerts, and troubleshooting