
Operations

Info: This page covers DNS setup, accessing dashboards, testing the alert pipeline, and troubleshooting common issues.

Observability stack series

  1. Observability stack
  2. Architecture
  3. Manifests
  4. Flux integration
  5. Operations - You are here

DNS setup

Add these DNS entries on your router pointing to your cluster ingress IP:

| Hostname | Points to |
| --- | --- |
| grafana.example.local | Cluster ingress IP |
| uptime.example.local | Cluster ingress IP |

TLS is handled by the cluster wildcard certificate for *.example.local.
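If your router cannot serve local DNS records, the same mapping can live in /etc/hosts on a single workstation instead. A minimal sketch (192.168.1.240 is a placeholder; substitute your actual cluster ingress IP):

```shell
# Generate /etc/hosts entries for the monitoring hostnames.
# 192.168.1.240 is a placeholder -- substitute your cluster ingress IP.
INGRESS_IP="192.168.1.240"
HOSTS_ENTRIES=$(printf '%s grafana.example.local\n%s uptime.example.local' \
  "$INGRESS_IP" "$INGRESS_IP")
echo "$HOSTS_ENTRIES"
# Append them with: echo "$HOSTS_ENTRIES" | sudo tee -a /etc/hosts
```

This only affects the one machine; router-level DNS is still the better option for anything accessed from multiple devices.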

Accessing Grafana

Get the admin password

kubectl get secret -n monitoring kube-prometheus-stack-grafana \
-o jsonpath="{.data.admin-password}" | base64 -d && echo
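The secret stores the password base64-encoded, which is why the command pipes through base64 -d. As a local illustration of that last step, the chart's default password (prom-operator, assuming you have not overridden adminPassword) round-trips like this:

```shell
# The secret's admin-password field is base64-encoded; decoding recovers
# the plaintext. "cHJvbS1vcGVyYXRvcg==" encodes "prom-operator", the
# kube-prometheus-stack default (assumption: adminPassword not overridden).
echo -n 'cHJvbS1vcGVyYXRvcg==' | base64 -d && echo
# prints: prom-operator
```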

Login

  1. Open https://grafana.example.local
  2. Username: admin
  3. Password: from the command above

Pre-configured data sources

Grafana comes with two data sources:

| Data source | Type | URL |
| --- | --- | --- |
| Prometheus | prometheus | http://kube-prometheus-stack-prometheus:9090 |
| Loki | loki | http://loki:3100 |

Explore logs

  1. Go to Explore in Grafana
  2. Select Loki as the data source
  3. Query logs by namespace:
{namespace="wp-coach"}

Query for specific error patterns:

{namespace="wp-coach", container="mariadb"} |~ "InnoDB.*IO.*[Ee]rror"
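The |~ filter takes an RE2 regular expression. For simple patterns like this one, RE2 and POSIX extended regex behave the same, so you can sanity-check a pattern locally with grep -E before running it in Loki. A sketch (the sample log line below is fabricated for illustration):

```shell
# Fabricated sample line approximating a MariaDB InnoDB error message.
SAMPLE='2024-01-01 12:00:00 0 [ERROR] InnoDB: Operating system IO error'
# Same pattern as the LogQL |~ filter; grep -E prints the line on a match
# and exits non-zero when the pattern does not match.
echo "$SAMPLE" | grep -E 'InnoDB.*IO.*[Ee]rror'
```

This is a quick way to catch regex typos without waiting on a full Loki query round-trip.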

Alert tuning

The default kube-prometheus-stack alerts include some that are noisy or not applicable. These have been disabled or silenced:

| Alert | Action | Reason |
| --- | --- | --- |
| Watchdog | Routed to null receiver | Dead man's switch that always fires to prove alerting works |
| TargetDown (kube-proxy) | Disabled kube-proxy scraping | kube-proxy does not expose a metrics endpoint |
| KubeMemoryOvercommit | Disabled | Limits exceed capacity but actual usage is low |
| KubeCPUOvercommit | Disabled | Limits exceed capacity but actual usage is low |

To re-enable any of these, update the HelmRelease values in 10-helmrelease-kube-prom.yaml.
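For orientation, the relevant values look roughly like this. This is a sketch, not the full file: defaultRules.disabled, kubeProxy.enabled, and the null receiver are standard kube-prometheus-stack chart options, but verify the exact names against your chart version.

```yaml
# Excerpt of HelmRelease values (sketch) -- disables noisy default rules,
# stops scraping kube-proxy, and routes Watchdog to the "null" receiver.
spec:
  values:
    defaultRules:
      disabled:
        KubeMemoryOvercommit: true
        KubeCPUOvercommit: true
    kubeProxy:
      enabled: false          # kube-proxy exposes no metrics endpoint
    alertmanager:
      config:
        route:
          routes:
            - receiver: "null"
              matchers:
                - alertname = "Watchdog"
        receivers:
          - name: "null"
```

Setting a disabled entry back to false (or removing it) re-enables the corresponding rule on the next reconciliation.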

Accessing Uptime Kuma

  1. Open https://uptime.example.local
  2. On first visit, create an admin account
  3. Add monitors for your endpoints

Configure a monitor for your site

  1. Click Add New Monitor
  2. Monitor Type: HTTP(s)
  3. Friendly Name: mysite.example.com
  4. URL: https://mysite.example.com
  5. Heartbeat Interval: 60 seconds
  6. Click Save

Add Slack notification

  1. Go to Settings → Notifications
  2. Click Setup Notification
  3. Notification Type: Slack
  4. Friendly Name: Slack Alerts
  5. Webhook URL: Your Slack webhook URL
  6. Click Test to verify
  7. Click Save

Assign notification to monitor

  1. Edit the mysite.example.com monitor
  2. Under Notifications, enable the Slack notification
  3. Click Save

Testing the alert pipeline

The alert pipeline has two independent paths to Slack: Alertmanager (for Prometheus alerts) and Uptime Kuma (for endpoint checks). Test each path separately.

Test Alertmanager to Slack

Tip: Testing the alert pipeline after deployment confirms your Slack webhook is correctly configured before a real incident occurs.

Send a test alert directly to Alertmanager:

# Port forward to Alertmanager
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &

# Send test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {
          "alertname": "TestAlert",
          "severity": "warning",
          "namespace": "monitoring"
        },
        "annotations": {
          "summary": "Test alert from monitoring setup",
          "description": "This is a test alert to verify Slack integration"
        }
      }]'

# Stop port forward
kill %1

You should receive a Slack notification within 30 seconds (the group wait time).

Test Uptime Kuma to Slack

  1. In Uptime Kuma, edit a monitor
  2. Click Test next to the Slack notification
  3. Check your Slack channel for the test message

Verification commands

Check Flux status

# Kustomizations
flux get kustomizations -n flux-system | grep monitoring

# HelmReleases
flux get helmreleases -n monitoring

# HelmRepositories
kubectl get helmrepositories -n monitoring

Check pods

# All monitoring pods
kubectl get pods -n monitoring

# Expected pods:
# - alertmanager-kube-prometheus-stack-alertmanager-0
# - kube-prometheus-stack-grafana-*
# - kube-prometheus-stack-kube-state-metrics-*
# - kube-prometheus-stack-operator-*
# - kube-prometheus-stack-prometheus-node-exporter-* (one per node)
# - prometheus-kube-prometheus-stack-prometheus-0
# - loki-0
# - loki-canary-*
# - loki-chunks-cache-0
# - loki-results-cache-0
# - alloy-* (one per node)
# - uptime-kuma-*

Check ingresses

kubectl get ingress -n monitoring

# Expected:
# grafana-ingress grafana.example.local
# uptime-kuma-ingress uptime.example.local

Check PVCs

kubectl get pvc -n monitoring

# Expected PVCs for:
# - prometheus
# - alertmanager
# - grafana
# - loki
# - uptime-kuma

Troubleshooting

Grafana login fails

Check the Grafana pod logs:

kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

Reset the admin password by deleting the secret (Flux will recreate it):

kubectl delete secret -n monitoring kube-prometheus-stack-grafana
flux reconcile helmrelease kube-prometheus-stack -n monitoring

Alertmanager not sending to Slack

Check if the secret is mounted correctly:

kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
-c alertmanager -- cat /etc/alertmanager/secrets/alertmanager-slack/webhook-url

Check Alertmanager logs:

kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager

Loki not receiving logs

Check Alloy logs:

kubectl logs -n monitoring -l app.kubernetes.io/name=alloy

Verify Alloy can reach Loki:

kubectl exec -n monitoring daemonset/alloy -- \
  wget -qO- http://loki:3100/ready

(kubectl exec does not accept a label selector; targeting the DaemonSet runs the command in one of its pods.)

HelmRelease stuck

Get detailed status:

kubectl describe helmrelease -n monitoring kube-prometheus-stack

Check Helm history:

helm history kube-prometheus-stack -n monitoring

Force reconciliation:

flux reconcile helmrelease kube-prometheus-stack -n monitoring --force

Pods stuck in Pending

Check for resource constraints:

kubectl describe pod -n monitoring <pod-name>

Common causes:

  • Insufficient CPU or memory on nodes
  • PVC not bound (check storage class)

Force full reconciliation

# Reconcile the source
flux reconcile source git observability-monitoring -n flux-system

# Reconcile the app Kustomization
flux reconcile kustomization observability-monitoring -n flux-system --with-source

Log queries for common issues

MariaDB IO errors

{namespace="wp-coach", container="mariadb"} |~ "InnoDB.*IO.*[Ee]rror"

WordPress PHP fatal errors

{namespace="wp-coach", container="wordpress"} |~ "Fatal error|PHP Fatal"

Pod crash loops

{namespace=~".+"} |~ "OOMKilled|CrashLoopBackOff"

Authentication failures

{namespace=~".+"} |~ "[Aa]ccess denied|[Aa]uthentication failed"

Adding log-based alerts

Create a PrometheusRule to alert on log patterns:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: log-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: log-errors
      rules:
        - alert: MariaDBIOError
          expr: |
            sum(count_over_time({namespace="wp-coach", container="mariadb"}
              |~ "InnoDB.*IO.*[Ee]rror" [5m])) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "MariaDB IO errors detected"
            description: "InnoDB IO errors found in wp-coach namespace"

Apply the rule:

kubectl apply -f log-alerts.yaml

The Prometheus Operator picks up PrometheusRule objects carrying the release: kube-prometheus-stack label automatically. Note, however, that the expr above is LogQL, not PromQL: Prometheus cannot evaluate it, so a log-based rule like this must be loaded by the Loki ruler instead. Confirm your Loki deployment is configured to pick up such rules before relying on this alert.
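If you want an alert that plain Prometheus can evaluate (the expr above is LogQL, which only the Loki ruler understands), a metrics-based rule is an alternative. A sketch using the kube_pod_container_status_restarts_total counter from kube-state-metrics, which this stack already runs; the alert name and threshold are illustrative:

```yaml
# Illustrative PromQL alternative (sketch): fires on MariaDB container
# restarts instead of matching log lines. Drops in under the same
# PrometheusRule's rules: list.
- alert: MariaDBRestarting
  expr: increase(kube_pod_container_status_restarts_total{namespace="wp-coach", container="mariadb"}[15m]) > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "MariaDB container restarting"
    description: "The mariadb container in wp-coach restarted in the last 15 minutes"
```

This catches crash-inducing IO errors indirectly (via restarts) rather than matching the log text itself.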

Backup considerations

What to back up

| Component | Data location | Backup method |
| --- | --- | --- |
| Prometheus | PVC | Velero snapshot |
| Alertmanager | PVC | Velero snapshot |
| Grafana | PVC (dashboards, users) | Velero snapshot |
| Loki | PVC | Velero snapshot |
| Uptime Kuma | PVC (monitors, settings) | Velero snapshot |

Grafana dashboard export

Export important dashboards as JSON:

  1. Open the dashboard
  2. Click Share → Export
  3. Enable Export for sharing externally
  4. Save the JSON file

Store exported dashboards in the app repo under dashboards/.
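Before committing an export, a quick sanity check that the file is the externally shareable form can save a broken import later. A sketch (the minimal dashboard.json here is fabricated; real exports are much larger):

```shell
# Fabricated minimal export for illustration; a real "Export for sharing
# externally" file contains __inputs entries for each data source.
cat > dashboard.json <<'EOF'
{
  "__inputs": [
    { "name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus" }
  ],
  "title": "WP Coach Overview",
  "panels": []
}
EOF

# Shareable exports reference data sources via __inputs placeholders
# instead of hard-coded UIDs; check both markers are present.
grep -q '"__inputs"' dashboard.json && grep -q '"title"' dashboard.json \
  && echo "dashboard.json looks like a shareable export"
```

If __inputs is missing, the export was saved without "Export for sharing externally" and will carry environment-specific data source UIDs.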