IdP internal backup and restore
This runbook describes how to protect and restore the identity-internal ZITADEL instance by treating the namespace as long-lived and backing up the PostgreSQL database to NAS-backed storage.
It assumes ZITADEL is deployed via FluxCD using the official Helm chart, with PostgreSQL running as zitadel-db in the identity-internal namespace, and that wildcard TLS trust is handled cluster-wide via the separate trust runbook (trust-manager + Gatekeeper + SSL_CERT_FILE).
1. Identity provider series
- IdP dual overview
- IdP dual architecture
- IdP internal deployment
- IdP internal console
- IdP internal SMTP
- IdP internal LDAP
- IdP internal OIDC
- IdP internal OAUTH2 proxy
- IdP internal backup and restore - you are here
2. Objectives
- Keep the identity-internal namespace as a long-lived environment boundary.
- Avoid deleting the namespace during normal changes.
- Perform regular PostgreSQL backups to NAS-backed storage.
- Document a tested backup and restore procedure for ZITADEL.
- Make it easy to practise disaster recovery without guesswork.
- Rely on the cluster-wide wildcard TLS trust mechanism instead of per-namespace TLS wiring.
3. High-level approach
- Namespace lifecycle: treat identity-internal as durable; delete workloads, not the namespace.
- State: ZITADEL is stateless; all important state lives in PostgreSQL and the ZITADEL masterkey.
- Backups: use PostgreSQL logical backups (pg_dump) written to NAS via an NFS-backed PVC.
- Restore: restore the database into PostgreSQL and reuse the original masterkey secret.
- Automation: run backups via a Kubernetes CronJob in the identity-internal namespace.
- TLS trust: wildcard TLS trust is provided centrally and is restored as part of the cluster GitOps configuration, not this namespace.
If you lose the ZITADEL masterkey, restored database data will be unreadable to ZITADEL.
Backups without the original masterkey are not sufficient.
4. Namespace lifecycle policy
4.1 Principle
Do not delete the identity-internal namespace during normal operations or upgrades. The namespace is the boundary for:
- ZITADEL pods and jobs
- PostgreSQL StatefulSet and PVCs
- Secrets such as zitadel-secret (contains the masterkey) and zitadel-db-secret
Instead, delete or restart workloads inside the namespace.
4.2 Typical large-change flow
Use this pattern when making significant ZITADEL changes (chart upgrades, config rewrites, and so on):
# 1. Optionally suspend Flux while you operate
flux suspend kustomization identity-internal
# 2. Stop ZITADEL and its init jobs (database stays and PVC is not touched)
kubectl delete deployment zitadel -n identity-internal --ignore-not-found
kubectl delete job zitadel-init zitadel-setup -n identity-internal --ignore-not-found
# 3. Keep PostgreSQL (and its PVC) running, or restart it if needed
kubectl rollout restart statefulset zitadel-db -n identity-internal
# 4. Re-enable Flux and reconcile
flux resume kustomization identity-internal
flux reconcile kustomization identity-internal --with-source
Deleting the identity-internal namespace still deletes its PVCs; whether the underlying data survives depends on the PersistentVolume reclaim policy (Retain vs Delete) behind your StorageClass.
This runbook assumes a policy of not deleting that namespace in normal workflows.
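If you want to confirm how your storage would behave before relying on this, the reclaim policy is visible on both the StorageClass and the bound PVs (assuming the nfs-client StorageClass used later in this runbook):
kubectl get storageclass nfs-client -o jsonpath='{.reclaimPolicy}{"\n"}'
kubectl get pv -o custom-columns=NAME:.metadata.name,NAMESPACE:.spec.claimRef.namespace,CLAIM:.spec.claimRef.name,RECLAIM:.spec.persistentVolumeReclaimPolicy | grep identity-internal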
5. What to back up
- Database: the zitadel PostgreSQL database in the zitadel-db StatefulSet.
- Secrets:
  - zitadel-secret (contains the masterkey)
  - zitadel-db-secret (database password)
- GitOps configuration:
  - identity-internal repo (Helm values, manifests)
  - flux-config repo (Flux sources, Kustomizations)
  - Trust manifests (Bundle + Gatekeeper Assigns) that implement wildcard TLS trust
ZITADEL console configuration (SMTP config, LDAP IdP settings, login policies, projects, applications) lives in the ZITADEL database. If you back up and restore the database, these settings come with it.
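Before relying on these Secrets in a restore, it is worth confirming they exist and carry the expected keys (Secret names as assumed throughout this runbook):
kubectl -n identity-internal get secret zitadel-secret zitadel-db-secret
# Shows key names and sizes without printing the values
kubectl -n identity-internal describe secret zitadel-secret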
6. Redis backups (when relevant)
This is only relevant if you run Redis in identity-internal.
6.1 If Redis is only for oauth2-proxy session storage
No backup is required.
- Redis holds short-lived web sessions.
- Losing Redis forces users to log in again.
- The only "must keep" item is the Redis password Secret, and that already lives in Git as SOPS-encrypted YAML if you follow the dashboard runbook.
Even if your Redis is configured with AOF and a PVC, treat it as disposable state for oauth2-proxy.
Do not restore it as part of DR unless you have a strong reason.
6.2 If you use Redis for anything else
If Redis becomes a dependency for anything other than session storage, decide whether it is:
- cache-only (rebuildable, no backup), or
- system of record (should not be Redis, but if it is, you need a real backup plan).
This runbook does not assume Redis is a system of record.
7. Backup frequency and retention
Suggested starting point:
- Daily full database backup via pg_dump to the NAS
- Retention: at least 7 days of daily dumps on the NAS
- Optional: a weekly snapshot retained for 4 weeks
Your real requirement is: "how far back do I need to be able to go after a bad change and still recover cleanly?" Set retention to cover that window plus a buffer.
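If you adopt the optional weekly snapshot, one lightweight approach is to copy the newest daily dump into a separate directory on the same PVC and prune that directory on its own schedule. A minimal shell sketch, assuming the paths from the daily CronJob in section 8.2 and a separate /backups/zitadel-weekly directory so the daily cleanup does not touch it:
# Promote the newest daily dump to a weekly snapshot (run weekly, e.g. from a second CronJob)
BACKUP_DIR=/backups/zitadel
WEEKLY_DIR=/backups/zitadel-weekly
mkdir -p "$WEEKLY_DIR"
LATEST="$(ls -1t "$BACKUP_DIR"/zitadel-*.dump | head -n 1)"
cp "$LATEST" "$WEEKLY_DIR/"
# Keep roughly four weeks of weekly snapshots
find "$WEEKLY_DIR" -type f -name 'zitadel-*.dump' -mtime +28 -print -delete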
8. Backup implementation
8.1 Backup storage PVC
Create a dedicated backup PVC in identity-internal that writes to NAS (through nfs-client or equivalent).
If you use NFS, prefer ReadWriteMany if available.
If your dynamic provisioner only supports ReadWriteOnce, ensure the CronJob and its PVC scheduling will still work.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: zitadel-backup
namespace: identity-internal
spec:
accessModes:
- ReadWriteMany
storageClassName: nfs-client
resources:
requests:
storage: 50Gi
8.2 Backup CronJob
Create a CronJob to run pg_dump and write into the backup PVC.
This assumes:
- DB host is zitadel-db (the Kubernetes Service name)
- DB is named zitadel
- Secret zitadel-db-secret contains POSTGRES_PASSWORD
Adjust these to match your deployment.
apiVersion: batch/v1
kind: CronJob
metadata:
name: zitadel-db-backup
namespace: identity-internal
spec:
schedule: "0 2 * * *" # daily at 02:00
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 7
failedJobsHistoryLimit: 7
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: pg-backup
image: postgres:16.4-alpine
imagePullPolicy: IfNotPresent
env:
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: zitadel-db-secret
key: POSTGRES_PASSWORD
command:
- /bin/sh
- -c
- |
set -euo pipefail
BACKUP_DIR=/backups/zitadel
mkdir -p "$BACKUP_DIR"
TS="$(date -u +%Y%m%d-%H%M%S)"
FILE="$BACKUP_DIR/zitadel-$TS.dump"
echo "Starting backup to $FILE"
pg_dump -h zitadel-db -U postgres -d zitadel -Fc -f "$FILE"
echo "Backup complete: $FILE"
# Optional: delete backups older than 7 days
find "$BACKUP_DIR" -type f -name 'zitadel-*.dump' -mtime +7 -print -delete
volumeMounts:
- name: backup-volume
mountPath: /backups
resources:
requests:
cpu: 25m
memory: 64Mi
limits:
cpu: 200m
memory: 256Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
capabilities:
drop: ["ALL"]
volumes:
- name: backup-volume
persistentVolumeClaim:
claimName: zitadel-backup
8.3 Manual on-demand backup
kubectl -n identity-internal create job --from=cronjob/zitadel-db-backup zitadel-db-backup-manual-$(date -u +%Y%m%d-%H%M%S)
kubectl -n identity-internal get jobs
kubectl -n identity-internal logs job/zitadel-db-backup-manual-...
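To confirm the dump actually landed on the NAS-backed PVC, a short-lived pod that mounts zitadel-backup can list it. A hedged sketch in the same style as the restore pod in section 10.2 (the pod name and busybox image are arbitrary):
kubectl -n identity-internal run backup-ls --image=busybox:1.36 --restart=Never --overrides='{
  "apiVersion": "v1",
  "spec": {
    "containers": [{
      "name": "backup-ls",
      "image": "busybox:1.36",
      "command": ["ls", "-lh", "/backups/zitadel"],
      "volumeMounts": [{"name": "backup-volume", "mountPath": "/backups"}]
    }],
    "volumes": [{"name": "backup-volume", "persistentVolumeClaim": {"claimName": "zitadel-backup"}}],
    "restartPolicy": "Never"
  }
}'
# Give the pod a moment to complete, then read the listing and clean up
kubectl -n identity-internal logs backup-ls
kubectl -n identity-internal delete pod backup-ls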
9. Secrets backup
9.1 SOPS-encrypted secrets in Git
These are already versioned in Git and are part of your DR story:
- k8s/prod/20-secrets-db.enc.yaml
- k8s/prod/21-secrets-zitadel.enc.yaml
Protect:
- Your Git remotes (backups of the repos)
- The SOPS age private key used by Flux (the sops-age Secret)
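For an offline copy of the age key itself, you can export it from the cluster and store it alongside your other root credentials. A hedged sketch, assuming the Secret is named sops-age in flux-system with the key age.agekey (the convention from the Flux SOPS guide):
# Export the age private key Flux uses for SOPS decryption
kubectl -n flux-system get secret sops-age -o jsonpath='{.data.age\.agekey}' | base64 -d > age.agekey
# Store age.agekey in a password manager or offline vault, then remove the local copy
shred -u age.agekey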
9.2 Offline copy of masterkey
For defence in depth, store the ZITADEL masterkey in a password manager.
cd ~/Projects/identity-internal/k8s/prod
sops -d 21-secrets-zitadel.enc.yaml | sed -n '1,60p'
Treat the masterkey like a root credential.
Anyone with it can decrypt ZITADEL data.
10. Restore procedures
Do not restore into a running ZITADEL instance that is actively serving requests. Scale ZITADEL down first to prevent writes during restore.
10.1 Common pre-checks
kubectl -n identity-internal get pvc zitadel-backup
kubectl -n identity-internal get cronjob zitadel-db-backup
kubectl -n identity-internal get pods
10.2 In-place restore on the same cluster
- Stop ZITADEL writes:
kubectl -n identity-internal scale deployment zitadel --replicas=0
kubectl -n identity-internal get pods
- Start a temporary restore pod (mounts the backup PVC):
kubectl -n identity-internal run zitadel-db-restore --image=postgres:16.4-alpine --restart=Never --overrides='{
"apiVersion": "v1",
"kind": "Pod",
"metadata": {"name": "zitadel-db-restore", "namespace": "identity-internal"},
"spec": {
"containers": [{
"name": "restore",
"image": "postgres:16.4-alpine",
"command": ["sh", "-c", "sleep 3600"],
"env": [{
"name": "PGPASSWORD",
"valueFrom": {
"secretKeyRef": {"name": "zitadel-db-secret", "key": "POSTGRES_PASSWORD"}
}
}],
"volumeMounts": [{"name": "backup-volume", "mountPath": "/backups"}]
}],
"volumes": [{"name": "backup-volume", "persistentVolumeClaim": {"claimName": "zitadel-backup"}}],
"restartPolicy": "Never"
}
}' --command -- sh -c "sleep 3600"
- Drop and recreate DB, then restore (inside the pod):
kubectl -n identity-internal exec -it zitadel-db-restore -- sh
ls -1 /backups/zitadel
psql -h zitadel-db -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'zitadel';"
psql -h zitadel-db -U postgres -c "DROP DATABASE IF EXISTS zitadel;"
psql -h zitadel-db -U postgres -c "CREATE DATABASE zitadel;"
pg_restore -h zitadel-db -U postgres -d zitadel -v /backups/zitadel/zitadel-YYYYMMDD-HHMMSS.dump
- Clean up and restart ZITADEL:
kubectl -n identity-internal delete pod zitadel-db-restore
kubectl -n identity-internal scale deployment zitadel --replicas=2
kubectl -n identity-internal rollout status deployment zitadel
10.3 Restore into a fresh cluster
High-level flow:
- Recreate Flux and flux-config (including wildcard TLS trust).
- Restore the sops-age Secret in flux-system so Flux can decrypt secrets.
- Reconcile identity-internal to deploy PostgreSQL and secrets.
- Scale ZITADEL to zero.
- Restore the DB from NAS (pg_restore).
- Scale ZITADEL up and verify.
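A minimal command sketch of that flow once Flux is bootstrapped, assuming the same names used elsewhere in this runbook (sops-age Secret with key age.agekey, Flux Kustomization identity-internal):
# Recreate the SOPS decryption Secret so Flux can decrypt the ZITADEL secrets
kubectl -n flux-system create secret generic sops-age --from-file=age.agekey=age.agekey
# Let Flux deploy PostgreSQL, the secrets and ZITADEL
flux reconcile kustomization identity-internal --with-source
# Stop ZITADEL before restoring
kubectl -n identity-internal scale deployment zitadel --replicas=0
# ... restore the database from the NAS-backed PVC as in section 10.2 ...
kubectl -n identity-internal scale deployment zitadel --replicas=2
kubectl -n identity-internal rollout status deployment zitadel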
11. Testing the backup and restore
Suggested test cycle:
- Run the CronJob and confirm a .dump file is created on the backup PVC.
- Restore into a scratch DB (optional) or do a full restore rehearsal in a non-production cluster.
- Record results and any fixes needed in this runbook.
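A scratch-database check verifies that a dump is actually restorable without touching the live zitadel database. A hedged sketch, run from the restore pod described in section 10.2 (zitadel_scratch is an arbitrary name):
# Restore the chosen dump into a throwaway database
psql -h zitadel-db -U postgres -c "CREATE DATABASE zitadel_scratch;"
pg_restore -h zitadel-db -U postgres -d zitadel_scratch -v /backups/zitadel/zitadel-YYYYMMDD-HHMMSS.dump
# Spot-check that schemas and tables came back
psql -h zitadel-db -U postgres -d zitadel_scratch -c "\dn"
psql -h zitadel-db -U postgres -d zitadel_scratch -c "SELECT count(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog','information_schema');"
# Clean up
psql -h zitadel-db -U postgres -c "DROP DATABASE zitadel_scratch;"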
12. Verification checklist
12.1 Backup configuration
- zitadel-backup PVC exists and is bound to NAS storage.
- zitadel-db-backup CronJob exists and runs successfully.
- Recent .dump files exist under /backups/zitadel.
- SOPS age private key is stored securely outside the cluster.
- ZITADEL masterkey is stored securely outside the cluster.
- Wildcard TLS trust is applied and working (SSL_CERT_FILE in workloads).
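Most of these items can be checked quickly from the CLI:
kubectl -n identity-internal get pvc zitadel-backup
kubectl -n identity-internal get cronjob zitadel-db-backup -o wide
# Recent backup jobs and whether they completed
kubectl -n identity-internal get jobs --sort-by=.metadata.creationTimestamp | tail -n 5
# Confirm the injected trust variable is present on running pods
kubectl -n identity-internal get pods -o yaml | grep -m1 -A1 'name: SSL_CERT_FILE'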
12.2 Restore
- ZITADEL was scaled to zero before restore.
- pg_restore completed without errors.
- ZITADEL restarted and is healthy.
- Login works at https://auth.reids.net.au.
- SMTP and LDAP flows behave as expected.
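After a restore, a quick end-to-end check is that ZITADEL is serving its standard OIDC discovery document again at the public hostname:
kubectl -n identity-internal rollout status deployment zitadel
# OIDC discovery should respond once ZITADEL is healthy behind the ingress
curl -fsS https://auth.reids.net.au/.well-known/openid-configuration | head -c 300; echo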
13. Rollback
If backups are failing or storage is full:
- Increase backup PVC size or retention window.
- Temporarily disable the CronJob while you remediate:
kubectl -n identity-internal patch cronjob zitadel-db-backup -p '{"spec":{"suspend":true}}'
Re-enable it once fixed:
kubectl -n identity-internal patch cronjob zitadel-db-backup -p '{"spec":{"suspend":false}}'
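If the problem is PVC capacity rather than the job itself, many provisioners allow in-place expansion when the StorageClass sets allowVolumeExpansion: true. A hedged sketch (verify your provisioner actually supports expansion before relying on it):
# Check whether the StorageClass permits expansion
kubectl get storageclass nfs-client -o jsonpath='{.allowVolumeExpansion}{"\n"}'
# Request a larger size on the existing PVC
kubectl -n identity-internal patch pvc zitadel-backup -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
kubectl -n identity-internal get pvc zitadel-backup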