Operations
This page covers DNS setup, accessing dashboards, testing the alert pipeline, and troubleshooting common issues.
Observability stack series
- Observability stack
- Architecture
- Manifests
- Flux integration
- Operations - You are here
DNS setup
Add these DNS entries on your router pointing to your cluster ingress IP:
| Hostname | Points to |
|---|---|
| grafana.example.local | Cluster ingress IP |
| uptime.example.local | Cluster ingress IP |
TLS is handled by the cluster wildcard certificate for *.example.local.
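A quick sanity check that the entries are in place (hostnames from the table above; compare the output against your actual ingress IP):
# Both hostnames should return the ingress IP
dig +short grafana.example.local
dig +short uptime.example.local
# The ingress should present the wildcard certificate
openssl s_client -connect grafana.example.local:443 \
  -servername grafana.example.local </dev/null 2>/dev/null \
  | openssl x509 -noout -subject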
Accessing Grafana
Get the admin password
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
-o jsonpath="{.data.admin-password}" | base64 -d && echo
Login
- Open https://grafana.example.local
- Username: admin
- Password: from the command above
Pre-configured data sources
Grafana comes with two data sources:
| Data source | Type | URL |
|---|---|---|
| Prometheus | prometheus | http://kube-prometheus-stack-prometheus:9090 |
| Loki | loki | http://loki:3100 |
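If either data source shows as unreachable in Grafana, you can probe the backing services directly with a temporary port forward (same service names as in the table above):
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 2 && curl -s http://localhost:9090/-/healthy
kill %1
kubectl port-forward -n monitoring svc/loki 3100:3100 &
sleep 2 && curl -s http://localhost:3100/ready
kill %1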
Explore logs
- Go to Explore in Grafana
- Select Loki as the data source
- Query logs by namespace:
{namespace="wp-coach"}
Query for specific error patterns:
{namespace="wp-coach", container="mariadb"} |~ "InnoDB.*IO.*[Ee]rror"
Alert tuning
The default kube-prometheus-stack alerts include some that are noisy or not applicable. These have been disabled or silenced:
| Alert | Action | Reason |
|---|---|---|
| Watchdog | Routed to null receiver | Dead man's switch that always fires to prove alerting works |
| TargetDown kube-proxy | Disabled kube-proxy scraping | kube-proxy does not expose a metrics endpoint |
| KubeMemoryOvercommit | Disabled | Limits exceed capacity but actual usage is low |
| KubeCPUOvercommit | Disabled | Limits exceed capacity but actual usage is low |
To re-enable any of these, update the HelmRelease values in 10-helmrelease-kube-prom.yaml.
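For orientation, the tuning roughly takes this shape in the chart values (a sketch, not a verbatim copy of 10-helmrelease-kube-prom.yaml; remove or flip the relevant entry to re-enable an alert):
values:
  # Watchdog: keep it firing, but route it to the null receiver
  alertmanager:
    config:
      route:
        routes:
          - receiver: "null"
            matchers:
              - alertname = "Watchdog"
      receivers:
        - name: "null"
  # TargetDown for kube-proxy: stop scraping kube-proxy entirely
  kubeProxy:
    enabled: false
  # Overcommit alerts: disable the individual default rules by name
  defaultRules:
    disabled:
      KubeMemoryOvercommit: true
      KubeCPUOvercommit: true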
Accessing Uptime Kuma
- Open https://uptime.example.local
- On first visit, create an admin account
- Add monitors for your endpoints
Configure a monitor for your site
- Click Add New Monitor
- Monitor Type: HTTP(s)
- Friendly Name: mysite.example.com
- URL: https://mysite.example.com
- Heartbeat Interval: 60 seconds
- Click Save
Add Slack notification
- Go to Settings → Notifications
- Click Setup Notification
- Notification Type: Slack
- Friendly Name: Slack Alerts
- Webhook URL: your Slack webhook URL
- Click Test to verify
- Click Save
Assign notification to monitor
- Edit the mysite.example.com monitor
- Under Notifications, enable the Slack notification
- Click Save
Testing the alert pipeline
The alert pipeline has two independent paths to Slack: Prometheus alerts routed through Alertmanager, and Uptime Kuma's built-in notifications. Test both.
Test Alertmanager to Slack
Testing the alert pipeline after deployment confirms your Slack webhook is correctly configured before a real incident occurs.
Send a test alert directly to Alertmanager:
# Port forward to Alertmanager
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
# Send test alert
curl -X POST http://localhost:9093/api/v2/alerts \
-H "Content-Type: application/json" \
-d '[{
"labels": {
"alertname": "TestAlert",
"severity": "warning",
"namespace": "monitoring"
},
"annotations": {
"summary": "Test alert from monitoring setup",
"description": "This is a test alert to verify Slack integration"
}
}]'
# Stop port forward
kill %1
You should receive a Slack notification within 30 seconds (the group wait time).
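If nothing arrives in Slack, first confirm the test alert at least reached Alertmanager (re-establish the port forward if you already stopped it):
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
sleep 2
# TestAlert should appear in the list of active alerts
curl -s http://localhost:9093/api/v2/alerts | grep -o '"alertname":"[^"]*"'
kill %1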
Test Uptime Kuma to Slack
- In Uptime Kuma, edit a monitor
- Click Test next to the Slack notification
- Check your Slack channel for the test message
Verification commands
Check Flux status
# Kustomizations
flux get kustomizations -n flux-system | grep monitoring
# HelmReleases
flux get helmreleases -n monitoring
# HelmRepositories
kubectl get helmrepositories -n monitoring
Check pods
# All monitoring pods
kubectl get pods -n monitoring
# Expected pods:
# - alertmanager-kube-prometheus-stack-alertmanager-0
# - kube-prometheus-stack-grafana-*
# - kube-prometheus-stack-kube-state-metrics-*
# - kube-prometheus-stack-operator-*
# - kube-prometheus-stack-prometheus-node-exporter-* (one per node)
# - prometheus-kube-prometheus-stack-prometheus-0
# - loki-0
# - loki-canary-*
# - loki-chunks-cache-0
# - loki-results-cache-0
# - alloy-* (one per node)
# - uptime-kuma-*
Check ingresses
kubectl get ingress -n monitoring
# Expected:
# grafana-ingress grafana.example.local
# uptime-kuma-ingress uptime.example.local
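To confirm the ingresses actually answer (not just that the objects exist), hit them from outside the cluster; this assumes the DNS entries from the setup section are in place:
# Grafana exposes a small health endpoint; Uptime Kuma should answer with its UI
curl -ks https://grafana.example.local/api/health
curl -ksI https://uptime.example.local | head -n 1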
Check PVCs
kubectl get pvc -n monitoring
# Expected PVCs for:
# - prometheus
# - alertmanager
# - grafana
# - loki
# - uptime-kuma
Troubleshooting
Grafana login fails
Check the Grafana pod logs:
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
Reset the admin password by deleting the secret (Flux will recreate it):
kubectl delete secret -n monitoring kube-prometheus-stack-grafana
flux reconcile helmrelease kube-prometheus-stack -n monitoring
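With persistent storage, Grafana only applies the password from the secret when it first initializes its database, so a recreated secret may not take effect. In that case a reset inside the pod should work; this is a sketch (newer images ship grafana cli, older ones grafana-cli, and you may need to pass --homepath explicitly):
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -c grafana -- \
  grafana cli admin reset-admin-password <new-password>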
Alertmanager not sending to Slack
Check if the secret is mounted correctly:
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
-c alertmanager -- cat /etc/alertmanager/secrets/alertmanager-slack/webhook-url
Check Alertmanager logs:
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager
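The status endpoint returns the rendered configuration, so you can also confirm the Slack receiver made it into the running config:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
sleep 2
# A non-zero count means a slack_configs block is present in the active config
curl -s http://localhost:9093/api/v2/status | grep -c slack
kill %1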
Loki not receiving logs
Check Alloy logs:
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy
Verify Alloy can reach Loki:
kubectl exec -n monitoring $(kubectl get pod -n monitoring \
  -l app.kubernetes.io/name=alloy -o jsonpath='{.items[0].metadata.name}') -- \
  wget -qO- http://loki:3100/ready
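To isolate whether the problem is Alloy or Loki itself, push a test line straight into Loki and query it back. A sketch, assuming Loki runs with multi-tenancy disabled (otherwise add an X-Scope-OrgID header):
kubectl port-forward -n monitoring svc/loki 3100:3100 &
sleep 2
NOW_NS=$(date +%s%N)
# Push one log line with a throwaway label set
curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -d "{\"streams\":[{\"stream\":{\"job\":\"smoke-test\"},\"values\":[[\"$NOW_NS\",\"loki smoke test\"]]}]}"
# Query it back; the line should be echoed if ingest and query both work
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="smoke-test"}' | grep -o "loki smoke test"
kill %1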
HelmRelease stuck
Get detailed status:
kubectl describe helmrelease -n monitoring kube-prometheus-stack
Check Helm history:
helm history kube-prometheus-stack -n monitoring
Force reconciliation:
flux reconcile helmrelease kube-prometheus-stack -n monitoring --force
Pods stuck in Pending
Check for resource constraints:
kubectl describe pod -n monitoring <pod-name>
Common causes:
- Insufficient CPU or memory on nodes
- PVC not bound (check storage class)
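A few commands that usually pin down which of the two it is:
# Scheduling failures show up as events with an explicit reason
kubectl get events -n monitoring --sort-by=.lastTimestamp | tail -20
# Anything not Bound here points at a storage class problem
kubectl get pvc -n monitoring | grep -v Bound
# Compare requests against what each node still has available
kubectl describe nodes | grep -A 8 "Allocated resources"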
Force full reconciliation
# Reconcile the source
flux reconcile source git observability-monitoring -n flux-system
# Reconcile the app Kustomization
flux reconcile kustomization observability-monitoring -n flux-system --with-source
Log queries for common issues
MariaDB IO errors
{namespace="wp-coach", container="mariadb"} |~ "InnoDB.*IO.*[Ee]rror"
WordPress PHP fatal errors
{namespace="wp-coach", container="wordpress"} |~ "Fatal error|PHP Fatal"
Pod crash loops
{namespace=~".+"} |~ "OOMKilled|CrashLoopBackOff"
Authentication failures
{namespace=~".+"} |~ "[Aa]ccess denied|[Aa]uthentication failed"
Adding log-based alerts
Create a PrometheusRule to alert on log patterns:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: log-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: log-errors
      rules:
        - alert: MariaDBIOError
          expr: |
            sum(count_over_time({namespace="wp-coach", container="mariadb"}
              |~ "InnoDB.*IO.*[Ee]rror" [5m])) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "MariaDB IO errors detected"
            description: "InnoDB IO errors found in wp-coach namespace"
Apply the rule:
kubectl apply -f log-alerts.yaml
The Prometheus Operator will pick up the rule automatically.
Backup considerations
What to back up
| Component | Data location | Backup method |
|---|---|---|
| Prometheus | PVC | Velero snapshot |
| Alertmanager | PVC | Velero snapshot |
| Grafana | PVC (dashboards, users) | Velero snapshot |
| Loki | PVC | Velero snapshot |
| Uptime Kuma | PVC (monitors, settings) | Velero snapshot |
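Assuming Velero is already installed in the cluster, a whole-namespace backup covers all five PVCs at once (the backup names below are illustrative):
# One-off backup of the monitoring namespace, including volume snapshots
velero backup create monitoring-$(date +%Y%m%d) --include-namespaces monitoring
# Or a recurring daily backup at 03:00
velero schedule create monitoring-daily --schedule "0 3 * * *" --include-namespaces monitoring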
Grafana dashboard export
Export important dashboards as JSON:
- Open the dashboard
- Click Share → Export
- Enable Export for sharing externally
- Save the JSON file
Store exported dashboards in the app repo under dashboards/.
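If you prefer scripting the export over clicking through the UI, the Grafana HTTP API works too. A sketch: the token is a service account token you create yourself, and the API wraps each dashboard in a small envelope rather than the exact UI export format:
GRAFANA_URL="https://grafana.example.local"
TOKEN="<service-account-token>"   # create under Administration -> Service accounts
mkdir -p dashboards
# List all dashboard UIDs, then fetch each one as JSON
for uid in $(curl -ksH "Authorization: Bearer $TOKEN" "$GRAFANA_URL/api/search?type=dash-db" \
    | grep -o '"uid":"[^"]*"' | cut -d'"' -f4); do
  curl -ksH "Authorization: Bearer $TOKEN" "$GRAFANA_URL/api/dashboards/uid/$uid" \
    > "dashboards/$uid.json"
done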