# Architecture
This page covers the architecture of the observability stack, including component roles, data flow, and resource requirements.
**Observability stack series**
- Observability stack
- Architecture (you are here)
- Manifests
- Flux integration
- Operations
## Overview
The stack provides four layers of observability: metrics (Prometheus), logs (Grafana Alloy and Loki), visualisation (Grafana), and external uptime monitoring (Uptime Kuma). Alertmanager sits alongside these layers, routing alerts from the metrics and logs pipelines to Slack.
## Components
### Prometheus
Prometheus is the core metrics collection and time-series database. It scrapes metrics from exporters and evaluates alerting rules.
| Aspect | Configuration |
|---|---|
| Retention | 15 days |
| Storage | 50Gi PVC |
| Scrape targets | kube-state-metrics, node-exporter, application metrics |
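As a sketch, these settings might be expressed as kube-prometheus-stack Helm values along the following lines (the chart and the storage class default are assumptions about how Prometheus is deployed here):

```yaml
# Sketch of kube-prometheus-stack values implementing the table above.
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
```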
### Alertmanager
Alertmanager handles alert routing, deduplication, and notification delivery. It reads the Slack webhook from a Kubernetes secret.
| Aspect | Configuration |
|---|---|
| Grouping | By alertname and namespace |
| Group wait | 30 seconds |
| Repeat interval | 4 hours |
| Notification | Slack webhook |
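The routing behaviour above might be expressed in an Alertmanager configuration like this sketch (the channel name and the secret mount path are illustrative assumptions):

```yaml
# Sketch of an Alertmanager configuration matching the table.
# The Slack channel and the webhook file path are assumptions.
route:
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  repeat_interval: 4h
  receiver: slack
receivers:
  - name: slack
    slack_configs:
      - channel: "#alerts"
        api_url_file: /etc/alertmanager/secrets/slack-webhook/url
```

Reading the webhook via `api_url_file` keeps the URL out of the rendered config, which matches the "reads the Slack webhook from a Kubernetes secret" approach described above.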
### Grafana
Grafana provides dashboards for visualising metrics and logs. It comes pre-configured with Prometheus and Loki as data sources.
| Aspect | Configuration |
|---|---|
| Data sources | Prometheus, Loki |
| Storage | 5Gi PVC |
| Access | https://grafana.example.local |
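A datasource provisioning file wiring up these two sources could look like the following sketch (the Prometheus service URL is an assumption about in-cluster DNS names):

```yaml
# Sketch of Grafana datasource provisioning for the two sources.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-operated:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
```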
### Loki
Loki is the log aggregation system. It stores logs in a single-binary deployment mode suitable for small to medium clusters.
| Aspect | Configuration |
|---|---|
| Mode | SingleBinary |
| Storage | 20Gi PVC (filesystem) |
| Schema | v13 with TSDB store |
| Replication | 1 (single node) |
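These settings map onto grafana/loki Helm chart values roughly as follows (assuming that chart is in use; the schema start date is illustrative):

```yaml
# Sketch of grafana/loki chart values matching the table.
deploymentMode: SingleBinary
singleBinary:
  replicas: 1
  persistence:
    size: 20Gi
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2024-01-01"   # illustrative schema start date
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h
```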
### Grafana Alloy
Grafana Alloy is the log collection agent that replaces the deprecated Promtail. It runs as a DaemonSet to collect logs from all nodes.
| Aspect | Configuration |
|---|---|
| Deployment | DaemonSet (one per node) |
| Targets | All running pods |
| Labels | namespace, pod, container, node, app |
| Output | Loki at http://loki:3100 |
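A minimal Alloy pipeline implementing this could look like the sketch below (component labels and the exact relabel rules are illustrative):

```alloy
// Sketch: discover all pods, attach the listed labels, push to Loki.
discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    target_label  = "node"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app"]
    target_label  = "app"
  }
}

loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods.output
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```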
### Uptime Kuma
Uptime Kuma provides external HTTP monitoring with its own web UI for configuration. It probes endpoints from inside the cluster and sends notifications directly to Slack.
| Aspect | Configuration |
|---|---|
| Storage | 1Gi PVC |
| Access | https://uptime.example.local |
| Monitors | Configured via UI |
### Supporting components
| Component | Purpose |
|---|---|
| kube-state-metrics | Exports Kubernetes object metrics (pods, deployments, etc.) |
| node-exporter | Exports host-level metrics (CPU, memory, disk, NFS) |
| Prometheus Operator | Manages Prometheus, Alertmanager, and PrometheusRule CRDs |
## Data flow
### Metrics path
- Exporters expose metrics on `/metrics` endpoints
- Prometheus scrapes metrics at configured intervals
- Prometheus evaluates alerting rules against time-series data
- Firing alerts are sent to Alertmanager
- Alertmanager groups, deduplicates, and sends to Slack
Applications integrate by creating a ServiceMonitor (for scraping) and PrometheusRule (for alerts) in their namespace. See Email relay for an example.
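A minimal sketch of the two objects an application creates (the names, selector, port, and alert expression are illustrative):

```yaml
# Sketch of the per-application integration objects.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp
  namespace: myapp
spec:
  groups:
    - name: myapp
      rules:
        - alert: MyAppDown
          expr: up{job="myapp"} == 0
          for: 5m
          labels:
            severity: critical
```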
### Logs path
- Pods write to stdout/stderr
- Alloy collects logs from all pods and adds labels
- Alloy pushes logs to Loki
- Loki stores logs and serves queries
- Loki's ruler evaluates log-based alerting rules against stored logs
- Alerts flow through Alertmanager to Slack
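One common way to implement the log-based alerting step is a Loki ruler rule; a minimal sketch (the stream selector and threshold are illustrative):

```yaml
# Sketch of a Loki ruler rule for log-based alerting.
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorLogRate
        expr: sum(rate({namespace="myapp"} |= "error" [5m])) > 1
        for: 10m
        labels:
          severity: warning
```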
### External monitoring path
- Uptime Kuma probes configured endpoints
- On failure or threshold breach, sends directly to Slack
- Independent of Prometheus/Alertmanager pipeline
## Resource requirements
| Component | Memory | Storage | Replicas |
|---|---|---|---|
| Prometheus | ~2GB | 50Gi | 1 |
| Alertmanager | ~256MB | 5Gi | 1 |
| Grafana | ~256MB | 5Gi | 1 |
| Loki | ~1GB | 20Gi | 1 |
| Loki caches | ~512MB | - | 2 |
| Alloy | ~128MB | - | 1 per node |
| Uptime Kuma | ~256MB | 1Gi | 1 |
| kube-state-metrics | ~128MB | - | 1 |
| node-exporter | ~64MB | - | 1 per node |
Total estimates:
- Memory: ~5GB base + ~200MB per node
- Storage: ~81Gi
## HelmRepository placement
HelmRepositories are created in the monitoring namespace alongside the HelmReleases, following the same pattern as Gatekeeper in infra-trust.
This avoids cross-namespace references and keeps all monitoring resources together.
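As a sketch, a HelmRepository in this layout might look like the following (the repository name and URL are examples, not this stack's actual sources):

```yaml
# Sketch of a HelmRepository co-located with the HelmReleases.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: monitoring
spec:
  interval: 1h
  url: https://prometheus-community.github.io/helm-charts
```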