Operations
This page covers DNS setup, accessing dashboards, testing the alert pipeline, and troubleshooting common issues.
Observability stack series
- Observability stack
- Architecture
- Manifests
- Flux integration
- Operations - You are here
DNS setup
Add these DNS entries on your router pointing to your cluster ingress IP:
| Hostname | Points to |
|---|---|
| grafana.example.local | Cluster ingress IP |
| uptime.example.local | Cluster ingress IP |
TLS is handled by the cluster wildcard certificate for *.example.local.
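A quick sanity check that the entries are in place (hostnames from the table above; compare the output against your actual ingress IP):
# Both hostnames should return the ingress IP
dig +short grafana.example.local
dig +short uptime.example.local
# The ingress should present the wildcard certificate
openssl s_client -connect grafana.example.local:443 \
  -servername grafana.example.local </dev/null 2>/dev/null \
  | openssl x509 -noout -subject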
Accessing Grafana
Get the admin password
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
-o jsonpath="{.data.admin-password}" | base64 -d && echo
Login
- Open https://grafana.example.local
- Username: admin
- Password: from the command above
Pre-configured data sources
Grafana comes with two data sources:
| Data source | Type | URL |
|---|---|---|
| Prometheus | prometheus | http://kube-prometheus-stack-prometheus:9090 |
| Loki | loki | http://loki:3100 |
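If either data source shows as unreachable in Grafana, you can probe the backing services directly with a temporary port forward (same service names as in the table above):
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 2 && curl -s http://localhost:9090/-/healthy
kill %1
kubectl port-forward -n monitoring svc/loki 3100:3100 &
sleep 2 && curl -s http://localhost:3100/ready
kill %1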
Explore logs
- Go to Explore in Grafana
- Select Loki as the data source
- Query logs by namespace:
{namespace="wp-coach"}
Query for specific error patterns:
{namespace="wp-coach", container="mariadb"} |~ "InnoDB.*IO.*[Ee]rror"
Alert tuning
The default kube-prometheus-stack alerts include some that are noisy or not applicable. These have been disabled or silenced:
| Alert | Action | Reason |
|---|---|---|
| Watchdog | Routed to null receiver | Dead man's switch that always fires to prove alerting works |
| TargetDown kube-proxy | Disabled kube-proxy scraping | kube-proxy does not expose a metrics endpoint |
| KubeMemoryOvercommit | Disabled | Limits exceed capacity but actual usage is low |
| KubeCPUOvercommit | Disabled | Limits exceed capacity but actual usage is low |
To re-enable any of these, update the HelmRelease values in 10-helmrelease-kube-prom.yaml.
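For orientation, the tuning roughly takes this shape in the chart values (a sketch, not a verbatim copy of 10-helmrelease-kube-prom.yaml; remove or flip the relevant entry to re-enable an alert):
values:
  # Watchdog: keep it firing, but route it to the null receiver
  alertmanager:
    config:
      route:
        routes:
          - receiver: "null"
            matchers:
              - alertname = "Watchdog"
      receivers:
        - name: "null"
  # TargetDown for kube-proxy: stop scraping kube-proxy entirely
  kubeProxy:
    enabled: false
  # Overcommit alerts: disable the individual default rules by name
  defaultRules:
    disabled:
      KubeMemoryOvercommit: true
      KubeCPUOvercommit: true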
Accessing Uptime Kuma
- Open https://uptime.example.local
- On first visit, create an admin account
- Add monitors for your endpoints
Configure a monitor for your site
- Click Add New Monitor
- Monitor Type: HTTP(s)
- Friendly Name: mysite.example.com
- URL: https://mysite.example.com
- Heartbeat Interval: 60 seconds
- Click Save
Add Slack notification
- Go to Settings → Notifications
- Click Setup Notification
- Notification Type: Slack
- Friendly Name: Slack Alerts
- Webhook URL: your Slack webhook URL
- Click Test to verify
- Click Save
Assign notification to monitor
- Edit the mysite.example.com monitor
- Under Notifications, enable the Slack notification
- Click Save
Testing the alert pipeline
The alert pipeline has two independent paths to Slack: Prometheus alerts routed through Alertmanager, and Uptime Kuma's built-in notifications. Test both.
Test Alertmanager to Slack
Testing the alert pipeline after deployment confirms your Slack webhook is correctly configured before a real incident occurs.
Send a test alert directly to Alertmanager:
# Port forward to Alertmanager
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
# Send test alert
curl -X POST http://localhost:9093/api/v2/alerts \
-H "Content-Type: application/json" \
-d '[{
"labels": {
"alertname": "TestAlert",
"severity": "warning",
"namespace": "monitoring"
},
"annotations": {
"summary": "Test alert from monitoring setup",
"description": "This is a test alert to verify Slack integration"
}
}]'
# Stop port forward
kill %1
You should receive a Slack notification within 30 seconds (the group wait time).
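If nothing arrives in Slack, first confirm the test alert at least reached Alertmanager (re-establish the port forward if you already stopped it):
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
sleep 2
# TestAlert should appear in the list of active alerts
curl -s http://localhost:9093/api/v2/alerts | grep -o '"alertname":"[^"]*"'
kill %1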
Test Uptime Kuma to Slack
- In Uptime Kuma, edit a monitor
- Click Test next to the Slack notification
- Check your Slack channel for the test message
Verification commands
Check Flux status
# Kustomizations
flux get kustomizations -n flux-system | grep monitoring
# HelmReleases
flux get helmreleases -n monitoring
# HelmRepositories
kubectl get helmrepositories -n monitoring
Check pods
# All monitoring pods
kubectl get pods -n monitoring
# Expected pods:
# - alertmanager-kube-prometheus-stack-alertmanager-0
# - kube-prometheus-stack-grafana-*
# - kube-prometheus-stack-kube-state-metrics-*
# - kube-prometheus-stack-operator-*
# - kube-prometheus-stack-prometheus-node-exporter-* (one per node)
# - prometheus-kube-prometheus-stack-prometheus-0
# - loki-0
# - loki-canary-*
# - loki-chunks-cache-0
# - loki-results-cache-0
# - alloy-* (one per node)
# - uptime-kuma-*
Check ingresses
kubectl get ingress -n monitoring
# Expected:
# grafana-ingress grafana.example.local
# uptime-kuma-ingress uptime.example.local
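To confirm the ingresses actually answer (not just that the objects exist), hit them from outside the cluster; this assumes the DNS entries from the setup section are in place:
# Grafana exposes a small health endpoint; Uptime Kuma should answer with its UI
curl -ks https://grafana.example.local/api/health
curl -ksI https://uptime.example.local | head -n 1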
Check PVCs
kubectl get pvc -n monitoring
# Expected PVCs for:
# - prometheus
# - alertmanager
# - grafana
# - loki
# - uptime-kuma
Troubleshooting
Grafana login fails
Check the Grafana pod logs:
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
Reset the admin password by deleting the secret (Flux will recreate it):
kubectl delete secret -n monitoring kube-prometheus-stack-grafana
flux reconcile helmrelease kube-prometheus-stack -n monitoring
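With persistent storage, Grafana only applies the password from the secret when it first initializes its database, so a recreated secret may not take effect. In that case a reset inside the pod should work; this is a sketch (newer images ship grafana cli, older ones grafana-cli, and you may need to pass --homepath explicitly):
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -c grafana -- \
  grafana cli admin reset-admin-password <new-password>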
Alertmanager not sending to Slack
Check if the secret is mounted correctly:
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
-c alertmanager -- cat /etc/alertmanager/secrets/alertmanager-slack/webhook-url
Check Alertmanager logs:
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager
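The status endpoint returns the rendered configuration, so you can also confirm the Slack receiver made it into the running config:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
sleep 2
# A non-zero count means a slack_configs block is present in the active config
curl -s http://localhost:9093/api/v2/status | grep -c slack
kill %1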
Loki not receiving logs
Check Alloy logs:
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy
Verify Alloy can reach Loki:
kubectl exec -n monitoring $(kubectl get pod -n monitoring \
  -l app.kubernetes.io/name=alloy -o jsonpath='{.items[0].metadata.name}') -- \
  wget -qO- http://loki:3100/ready
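To isolate whether the problem is Alloy or Loki itself, push a test line straight into Loki and query it back. A sketch, assuming Loki runs with multi-tenancy disabled (otherwise add an X-Scope-OrgID header):
kubectl port-forward -n monitoring svc/loki 3100:3100 &
sleep 2
NOW_NS=$(date +%s%N)
# Push one log line with a throwaway label set
curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -d "{\"streams\":[{\"stream\":{\"job\":\"smoke-test\"},\"values\":[[\"$NOW_NS\",\"loki smoke test\"]]}]}"
# Query it back; the line should be echoed if ingest and query both work
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="smoke-test"}' | grep -o "loki smoke test"
kill %1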
HelmRelease stuck
Get detailed status:
kubectl describe helmrelease -n monitoring kube-prometheus-stack
Check Helm history:
helm history kube-prometheus-stack -n monitoring
Force reconciliation:
flux reconcile helmrelease kube-prometheus-stack -n monitoring --force
Pods stuck in Pending
Check for resource constraints:
kubectl describe pod -n monitoring <pod-name>
Common causes:
- Insufficient CPU or memory on nodes
- PVC not bound (check storage class)
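A few commands that usually pin down which of the two it is:
# Scheduling failures show up as events with an explicit reason
kubectl get events -n monitoring --sort-by=.lastTimestamp | tail -20
# Anything not Bound here points at a storage class problem
kubectl get pvc -n monitoring | grep -v Bound
# Compare requests against what each node still has available
kubectl describe nodes | grep -A 8 "Allocated resources"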
Force full reconciliation
# Reconcile the source
flux reconcile source git observability-monitoring -n flux-system
# Reconcile the app Kustomization
flux reconcile kustomization observability-monitoring -n flux-system --with-source
Log queries for common issues
MariaDB IO errors
{namespace="wp-coach", container="mariadb"} |~ "InnoDB.*IO.*[Ee]rror"
WordPress PHP fatal errors
{namespace="wp-coach", container="wordpress"} |~ "Fatal error|PHP Fatal"
Pod crash loops
{namespace=~".+"} |~ "OOMKilled|CrashLoopBackOff"
Authentication failures
{namespace=~".+"} |~ "[Aa]ccess denied|[Aa]uthentication failed"
Adding log-based alerts
Create a PrometheusRule to alert on log patterns:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: log-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: log-errors
      rules:
        - alert: MariaDBIOError
          expr: |
            sum(count_over_time({namespace="wp-coach", container="mariadb"}
              |~ "InnoDB.*IO.*[Ee]rror" [5m])) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "MariaDB IO errors detected"
            description: "InnoDB IO errors found in wp-coach namespace"
Apply the rule:
kubectl apply -f log-alerts.yaml
The Prometheus Operator will pick up the rule automatically.
Backup considerations
What to back up
| Component | Data location | Backup method |
|---|---|---|
| Prometheus | PVC | Velero snapshot |
| Alertmanager | PVC | Velero snapshot |
| Grafana | PVC (dashboards, users) | Velero snapshot |
| Loki | PVC | Velero snapshot |
| Uptime Kuma | PVC (monitors, settings) | Velero snapshot |
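Assuming Velero is already installed in the cluster, a whole-namespace backup covers all five PVCs at once (the backup names below are illustrative):
# One-off backup of the monitoring namespace, including volume snapshots
velero backup create monitoring-$(date +%Y%m%d) --include-namespaces monitoring
# Or a recurring daily backup at 03:00
velero schedule create monitoring-daily --schedule "0 3 * * *" --include-namespaces monitoring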
Grafana dashboard export
Export important dashboards as JSON:
- Open the dashboard
- Click Share → Export
- Enable Export for sharing externally
- Save the JSON file
Store exported dashboards in the app repo under dashboards/.
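If you prefer scripting the export over clicking through the UI, the Grafana HTTP API works too. A sketch: the token is a service account token you create yourself, and the API wraps each dashboard in a small envelope rather than the exact UI export format:
GRAFANA_URL="https://grafana.example.local"
TOKEN="<service-account-token>"   # create under Administration -> Service accounts
mkdir -p dashboards
# List all dashboard UIDs, then fetch each one as JSON
for uid in $(curl -ksH "Authorization: Bearer $TOKEN" "$GRAFANA_URL/api/search?type=dash-db" \
    | grep -o '"uid":"[^"]*"' | cut -d'"' -f4); do
  curl -ksH "Authorization: Bearer $TOKEN" "$GRAFANA_URL/api/dashboards/uid/$uid" \
    > "dashboards/$uid.json"
done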