IdP internal backup and restore

info

This runbook describes how to protect and restore the identity-internal ZITADEL instance by treating the namespace as long-lived and backing up the PostgreSQL database to NAS-backed storage.

It assumes ZITADEL is deployed via FluxCD using the official Helm chart, with PostgreSQL running as zitadel-db in the identity-internal namespace, and that wildcard TLS trust is handled cluster-wide via the separate trust runbook (trust-manager + Gatekeeper + SSL_CERT_FILE).

1. Identity provider series

  1. IdP dual overview
  2. IdP dual architecture
  3. IdP internal deployment
  4. IdP internal console
  5. IdP internal SMTP
  6. IdP internal LDAP
  7. IdP internal OIDC
  8. IdP internal OAUTH2 proxy
  9. IdP internal backup and restore - you are here

2. Objectives

  • Keep the identity-internal namespace as a long-lived environment boundary.
  • Avoid deleting the namespace during normal changes.
  • Perform regular PostgreSQL backups to NAS-backed storage.
  • Document a tested backup and restore procedure for ZITADEL.
  • Make it easy to practise disaster recovery without guesswork.
  • Rely on the cluster-wide wildcard TLS trust mechanism instead of per-namespace TLS wiring.

3. High-level approach

  • Namespace lifecycle: treat identity-internal as durable; delete workloads, not the namespace.
  • State: ZITADEL is stateless; all important state lives in PostgreSQL and the ZITADEL masterkey.
  • Backups: use PostgreSQL logical backups (pg_dump) written to NAS via an NFS-backed PVC.
  • Restore: restore the database into PostgreSQL and reuse the original masterkey secret.
  • Automation: run backups via a Kubernetes CronJob in the identity-internal namespace.
  • TLS trust: wildcard TLS trust is provided centrally and is restored as part of the cluster GitOps configuration, not this namespace.
warning

If you lose the ZITADEL masterkey, restored database data will be unreadable to ZITADEL. Backups without the original masterkey are not sufficient.

4. Namespace lifecycle policy

4.1 Principle

Do not delete the identity-internal namespace during normal operations or upgrades. The namespace is the boundary for:

  • ZITADEL pods and jobs
  • PostgreSQL StatefulSet and PVCs
  • Secrets such as zitadel-secret (contains masterkey) and zitadel-db-secret

Instead, delete or restart workloads inside the namespace.

4.2 Typical large-change flow

Use this pattern when making significant ZITADEL changes (chart upgrades, config rewrites, and so on):

# 1. Optionally suspend Flux while you operate
flux suspend kustomization identity-internal

# 2. Stop ZITADEL and its init jobs (database stays and PVC is not touched)
kubectl delete deployment zitadel -n identity-internal --ignore-not-found
kubectl delete job zitadel-init zitadel-setup -n identity-internal --ignore-not-found

# 3. Keep PostgreSQL (and its PVC) running, or restart it if needed
kubectl rollout restart statefulset zitadel-db -n identity-internal

# 4. Re-enable Flux and reconcile
flux resume kustomization identity-internal
flux reconcile kustomization identity-internal --with-source
note

Deleting the identity-internal namespace still deletes its PVCs; the underlying data survives only if your StorageClass reclaim policy retains the PVs. This runbook assumes a policy of never deleting that namespace in normal workflows.
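To see what would actually happen to the data if the namespace were ever removed, check the reclaim policy behind your storage (a quick check; the nfs-client StorageClass name matches the backup PVC defined later in this runbook and may differ in your cluster):

kubectl get storageclass nfs-client -o jsonpath='{.reclaimPolicy}{"\n"}'
kubectl get pv -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name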

5. What to back up

  1. Database: PostgreSQL database zitadel in the zitadel-db StatefulSet.
  2. Secrets:
    • zitadel-secret (contains masterkey)
    • zitadel-db-secret (database password)
  3. GitOps configuration:
    • identity-internal repo (Helm values, manifests)
    • flux-config repo (Flux sources, Kustomizations)
    • Trust manifests (Bundle + Gatekeeper Assigns) that implement wildcard TLS trust
info

ZITADEL console configuration (SMTP config, LDAP IdP settings, login policies, projects, applications) lives in the ZITADEL database. If you back up and restore the database, these settings come with it.
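Before relying on this list for DR, confirm the secrets exist in the namespace under the expected names (a quick sanity check; the names come from the deployment runbook and may differ in your setup):

kubectl -n identity-internal get secret zitadel-secret zitadel-db-secret
kubectl -n identity-internal describe secret zitadel-secret   # shows key names and sizes, never the values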

6. Redis backups (when relevant)

This is only relevant if you run Redis in identity-internal.

6.1 If Redis is only for oauth2-proxy session storage

No backup is required.

  • Redis holds short-lived web sessions.
  • Losing Redis forces users to log in again.
  • The only "must keep" item is the Redis password Secret, and that already lives in Git as SOPS-encrypted YAML if you follow the dashboard runbook.
note

Even if your Redis is configured with AOF persistence and a PVC, treat it as disposable state for oauth2-proxy. Do not restore it as part of DR unless you have a strong reason.

6.2 If you use Redis for anything else

If Redis becomes a dependency for anything other than session storage, decide whether it is:

  • cache-only (rebuildable, no backup), or
  • system of record (should not be Redis, but if it is, you need a real backup plan).

This runbook does not assume Redis is a system of record.

7. Backup frequency and retention

Suggested starting point:

  • Daily full database backup via pg_dump to the NAS
  • Retention: at least 7 days of daily dumps on the NAS
  • Optional: a weekly snapshot retained for 4 weeks
tip

Your real requirement is: "how far back do I need to be able to go after a bad change and still recover cleanly?" Set retention to cover that window plus a buffer.

8. Backup implementation

8.1 Backup storage PVC

Create a dedicated backup PVC in identity-internal that writes to NAS (through nfs-client or equivalent).

warning

If you use NFS, prefer ReadWriteMany if available. If your dynamic provisioner only supports ReadWriteOnce, make sure the backup CronJob is effectively the only consumer of the PVC (or that all consumers land on the same node), otherwise the job pod may fail to attach the volume.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: zitadel-backup
  namespace: identity-internal
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 50Gi
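Once Flux has applied the manifest, confirm the claim actually binds to NAS-backed storage before adding the CronJob (a quick check, assuming the names above):

kubectl -n identity-internal get pvc zitadel-backup
kubectl -n identity-internal get pvc zitadel-backup -o jsonpath='{.status.phase} {.spec.storageClassName} {.status.capacity.storage}{"\n"}'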

8.2 Backup CronJob

Create a CronJob to run pg_dump and write into the backup PVC.

note

This assumes:

  • DB host is zitadel-db (Kubernetes Service name)
  • DB is named zitadel
  • Secret zitadel-db-secret contains POSTGRES_PASSWORD

Adjust these to match your deployment.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zitadel-db-backup
  namespace: identity-internal
spec:
  schedule: "0 2 * * *" # daily at 02:00
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          securityContext:
            seccompProfile:
              type: RuntimeDefault
          containers:
            - name: pg-backup
              image: postgres:16.4-alpine
              imagePullPolicy: IfNotPresent
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: zitadel-db-secret
                      key: POSTGRES_PASSWORD
              command:
                - /bin/sh
                - -c
                - |
                  set -euo pipefail

                  BACKUP_DIR=/backups/zitadel
                  mkdir -p "$BACKUP_DIR"

                  TS="$(date -u +%Y%m%d-%H%M%S)"
                  FILE="$BACKUP_DIR/zitadel-$TS.dump"

                  echo "Starting backup to $FILE"

                  pg_dump -h zitadel-db -U postgres -d zitadel -Fc -f "$FILE"

                  echo "Backup complete: $FILE"

                  # Optional: delete backups older than 7 days
                  find "$BACKUP_DIR" -type f -name 'zitadel-*.dump' -mtime +7 -print -delete
              volumeMounts:
                - name: backup-volume
                  mountPath: /backups
              resources:
                requests:
                  cpu: 25m
                  memory: 64Mi
                limits:
                  cpu: 200m
                  memory: 256Mi
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                runAsUser: 70 # postgres UID in the alpine image; required because the image itself defaults to root
                capabilities:
                  drop: ["ALL"]
          volumes:
            - name: backup-volume
              persistentVolumeClaim:
                claimName: zitadel-backup

8.3 Manual on-demand backup

kubectl -n identity-internal create job \
  --from=cronjob/zitadel-db-backup \
  zitadel-db-backup-manual-$(date -u +%Y%m%d-%H%M%S)

kubectl -n identity-internal get jobs
kubectl -n identity-internal logs job/zitadel-db-backup-manual-...
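If you do not want to copy the generated job name by hand, a small helper along these lines works (it assumes the most recently created job in the namespace is the one you just triggered):

JOB="$(kubectl -n identity-internal get jobs --sort-by=.metadata.creationTimestamp -o name | tail -n 1)"
kubectl -n identity-internal logs "$JOB"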

9. Secrets backup

9.1 SOPS-encrypted secrets in Git

These are already versioned in Git and are part of your DR story:

  • k8s/prod/20-secrets-db.enc.yaml
  • k8s/prod/21-secrets-zitadel.enc.yaml

Protect:

  • Your Git remotes (backups of the repos)
  • The SOPS age private key used by Flux (sops-age Secret)
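For an out-of-cluster copy of the age key, one option (a sketch; the sops-age Secret name and the age.agekey key follow the common Flux convention and may differ in your flux-config) is to export it once and file it with the masterkey in your password manager:

# Prints the private key to the terminal; handle it like the masterkey
kubectl -n flux-system get secret sops-age -o jsonpath='{.data.age\.agekey}' | base64 -d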

9.2 Offline copy of masterkey

For defence in depth, store the ZITADEL masterkey in a password manager.

cd ~/Projects/identity-internal/k8s/prod
sops -d 21-secrets-zitadel.enc.yaml | sed -n '1,60p'
warning

Treat the masterkey like a root credential. Anyone with it can decrypt ZITADEL data.

10. Restore procedures

danger

Do not restore into a running ZITADEL instance that is actively serving requests. Scale ZITADEL down first to prevent writes during restore.

10.1 Common pre-checks

kubectl -n identity-internal get pvc zitadel-backup
kubectl -n identity-internal get cronjob zitadel-db-backup
kubectl -n identity-internal get pods

10.2 In-place restore on the same cluster

  1. Stop ZITADEL writes:
kubectl -n identity-internal scale deployment zitadel --replicas=0
kubectl -n identity-internal get pods
  2. Start a temporary restore pod (mounts the backup PVC):
kubectl -n identity-internal run zitadel-db-restore \
  --image=postgres:16.4-alpine \
  --restart=Never \
  --overrides='{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "zitadel-db-restore", "namespace": "identity-internal"},
    "spec": {
      "containers": [{
        "name": "zitadel-db-restore",
        "image": "postgres:16.4-alpine",
        "command": ["sh", "-c", "sleep 3600"],
        "env": [{
          "name": "PGPASSWORD",
          "valueFrom": {
            "secretKeyRef": {"name": "zitadel-db-secret", "key": "POSTGRES_PASSWORD"}
          }
        }],
        "volumeMounts": [{"name": "backup-volume", "mountPath": "/backups"}]
      }],
      "volumes": [{"name": "backup-volume", "persistentVolumeClaim": {"claimName": "zitadel-backup"}}],
      "restartPolicy": "Never"
    }
  }'
  3. Drop and recreate the DB, then restore (inside the pod):
kubectl -n identity-internal exec -it zitadel-db-restore -- sh
ls -1 /backups/zitadel

psql -h zitadel-db -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'zitadel';"
psql -h zitadel-db -U postgres -c "DROP DATABASE IF EXISTS zitadel;"
psql -h zitadel-db -U postgres -c "CREATE DATABASE zitadel;"

pg_restore -h zitadel-db -U postgres -d zitadel -v /backups/zitadel/zitadel-YYYYMMDD-HHMMSS.dump
  4. Clean up and restart ZITADEL:
kubectl -n identity-internal delete pod zitadel-db-restore
kubectl -n identity-internal scale deployment zitadel --replicas=2
kubectl -n identity-internal rollout status deployment zitadel

10.3 Restore into a fresh cluster

High-level flow:

  1. Recreate Flux and flux-config (including wildcard TLS trust).
  2. Restore sops-age Secret in flux-system so Flux can decrypt secrets.
  3. Reconcile identity-internal to deploy PostgreSQL and secrets.
  4. Scale ZITADEL to zero.
  5. Restore the DB from NAS (pg_restore).
  6. Scale ZITADEL up and verify.
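A condensed command sketch for steps 2 to 6 (the age.agekey filename and the replica count are assumptions; adapt them to your flux-config and Helm values):

# 2. Restore the SOPS age key so Flux can decrypt secrets
kubectl -n flux-system create secret generic sops-age \
  --from-file=age.agekey=./age.agekey

# 3. Deploy PostgreSQL and secrets
flux reconcile kustomization identity-internal --with-source

# 4. Keep ZITADEL down while the database is restored
kubectl -n identity-internal scale deployment zitadel --replicas=0

# 5. Restore the dump using the temporary restore pod from section 10.2

# 6. Bring ZITADEL back and verify
kubectl -n identity-internal scale deployment zitadel --replicas=2
kubectl -n identity-internal rollout status deployment zitadel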

11. Testing the backup and restore

Suggested test cycle:

  1. Run the CronJob and confirm a .dump file is created on the backup PVC.
  2. Restore into a scratch DB (optional) or do a full restore rehearsal in a non-production cluster.
  3. Record results and any fixes needed in this runbook.
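For the scratch-database variant of step 2, a minimal sketch from inside the restore pod described in section 10.2 (zitadel_restore_test is just an example name):

psql -h zitadel-db -U postgres -c "CREATE DATABASE zitadel_restore_test;"
pg_restore -h zitadel-db -U postgres -d zitadel_restore_test -v /backups/zitadel/zitadel-YYYYMMDD-HHMMSS.dump
psql -h zitadel-db -U postgres -d zitadel_restore_test -c "\dn"   # sanity check: schemas restored?
psql -h zitadel-db -U postgres -c "DROP DATABASE zitadel_restore_test;"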

12. Verification checklist

12.1 Backup configuration

  • zitadel-backup PVC exists and is bound to NAS storage.
  • zitadel-db-backup CronJob exists and runs successfully.
  • Recent .dump files exist under /backups/zitadel.
  • SOPS age private key is stored securely outside the cluster.
  • ZITADEL masterkey is stored securely outside the cluster.
  • Wildcard TLS trust is applied and working (SSL_CERT_FILE in workloads).
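The first three items can be checked quickly from the CLI (inspecting the dump files themselves needs the restore-pod pattern from section 10.2):

kubectl -n identity-internal get pvc zitadel-backup
kubectl -n identity-internal get cronjob zitadel-db-backup
kubectl -n identity-internal get jobs --sort-by=.metadata.creationTimestamp | tail -n 5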

12.2 Restore

  • ZITADEL was scaled to zero before restore.
  • pg_restore completed without errors.
  • ZITADEL restarted and is healthy.
  • Login works at https://auth.reids.net.au.
  • SMTP and LDAP flows behave as expected.

13. Rollback

If backups are failing or storage is full:

  1. Increase backup PVC size or retention window.
  2. Temporarily disable the CronJob while you remediate:
kubectl -n identity-internal patch cronjob zitadel-db-backup -p '{"spec":{"suspend":true}}'

Re-enable it once fixed:

kubectl -n identity-internal patch cronjob zitadel-db-backup -p '{"spec":{"suspend":false}}'
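If you take the PVC-resize route and your StorageClass allows expansion, a short sketch (100Gi is just an example; with Flux managing the PVC, make the same change in the Git manifest so it is not reverted):

kubectl get storageclass nfs-client -o jsonpath='{.allowVolumeExpansion}{"\n"}'
kubectl -n identity-internal patch pvc zitadel-backup -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'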