IdP internal backup and restore

info

This runbook describes how to protect and restore the identity-internal ZITADEL instance by treating the namespace as long-lived and backing up the PostgreSQL database to NAS-backed storage.

It assumes ZITADEL is deployed via FluxCD using the official Helm chart, with PostgreSQL running as zitadel-db in the identity-internal namespace, and that wildcard TLS trust is handled cluster-wide via the separate trust runbook (trust-manager + Gatekeeper + SSL_CERT_FILE).

1. Identity provider series

  1. IdP dual overview
  2. IdP dual architecture
  3. IdP internal deployment
  4. IdP internal console
  5. IdP internal SMTP
  6. IdP internal LDAP
  7. IdP internal OIDC
  8. IdP internal OAUTH2 proxy
  9. IdP internal backup and restore - you are here

2. Objectives

  • Keep the identity-internal namespace as a long-lived environment boundary.
  • Avoid deleting the namespace during normal changes.
  • Perform regular PostgreSQL backups to NAS-backed storage.
  • Document a tested backup and restore procedure for ZITADEL.
  • Make it easy to practise disaster recovery without guesswork.
  • Rely on the cluster-wide wildcard TLS trust mechanism instead of per-namespace TLS wiring.

3. High-level approach

  • Namespace lifecycle: treat identity-internal as durable; delete workloads, not the namespace.
  • State: ZITADEL is stateless; all important state lives in PostgreSQL and the ZITADEL masterkey.
  • Backups: use PostgreSQL logical backups (pg_dump) written to NAS via an NFS-backed PVC.
  • Restore: restore the database into PostgreSQL and reuse the original masterkey secret.
  • Automation: run backups via a Kubernetes CronJob in the identity-internal namespace.
  • TLS trust: wildcard TLS trust is provided centrally and is restored as part of the cluster GitOps configuration, not this namespace.
warning

If you lose the ZITADEL masterkey, restored database data will be unreadable to ZITADEL. Backups without the original masterkey are not sufficient.

4. Namespace lifecycle policy

4.1 Principle

Do not delete the identity-internal namespace during normal operations or upgrades. The namespace is the boundary for:

  • ZITADEL pods and jobs
  • PostgreSQL StatefulSet and PVCs
  • Secrets such as zitadel-secret (contains masterkey) and zitadel-db-secret

Instead, delete or restart workloads inside the namespace.

4.2 Typical large-change flow

Use this pattern when making significant ZITADEL changes (chart upgrades, config rewrites, and so on):

# 1. Optionally suspend Flux while you operate
flux suspend kustomization identity-internal

# 2. Stop ZITADEL and its init jobs (database stays and PVC is not touched)
kubectl delete deployment zitadel -n identity-internal --ignore-not-found
kubectl delete job zitadel-init zitadel-setup -n identity-internal --ignore-not-found

# 3. Keep PostgreSQL (and its PVC) running, or restart it if needed
kubectl rollout restart statefulset zitadel-db -n identity-internal

# 4. Re-enable Flux and reconcile
flux resume kustomization identity-internal
flux reconcile kustomization identity-internal --with-source
note

Deleting the identity-internal namespace still deletes its PVCs; the underlying data survives only if your StorageClass reclaim policy retains the PVs. This runbook assumes a policy of never deleting that namespace in normal workflows.
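To see what would actually happen to the data if the namespace were ever removed, check the reclaim policy behind your storage (a quick check; the nfs-client StorageClass name matches the backup PVC defined later in this runbook and may differ in your cluster):

kubectl get storageclass nfs-client -o jsonpath='{.reclaimPolicy}{"\n"}'
kubectl get pv -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name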

5. What to back up

  1. Database: PostgreSQL database zitadel in the zitadel-db StatefulSet.
  2. Secrets:
    • zitadel-secret (contains masterkey)
    • zitadel-db-secret (database password)
  3. GitOps configuration:
    • identity-internal repo (Helm values, manifests)
    • flux-config repo (Flux sources, Kustomizations)
    • Trust manifests (Bundle + Gatekeeper Assigns) that implement wildcard TLS trust
info

ZITADEL console configuration (SMTP config, LDAP IdP settings, login policies, projects, applications) lives in the ZITADEL database. If you back up and restore the database, these settings come with it.
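Before relying on this list for DR, confirm the secrets exist in the namespace under the expected names (a quick sanity check; the names come from the deployment runbook and may differ in your setup):

kubectl -n identity-internal get secret zitadel-secret zitadel-db-secret
kubectl -n identity-internal describe secret zitadel-secret   # shows key names and sizes, never the values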

6. Redis backups (when relevant)

This is only relevant if you run Redis in identity-internal.

6.1 If Redis is only for oauth2-proxy session storage

No backup is required.

  • Redis holds short-lived web sessions.
  • Losing Redis forces users to log in again.
  • The only "must keep" item is the Redis password Secret, and that already lives in Git as SOPS-encrypted YAML if you follow the dashboard runbook.
note

Even if your Redis is configured with AOF persistence and a PVC, treat it as disposable state for oauth2-proxy. Do not restore it as part of DR unless you have a strong reason.

6.2 If you use Redis for anything else

If Redis becomes a dependency for anything other than session storage, decide whether it is:

  • cache-only (rebuildable, no backup), or
  • system of record (should not be Redis, but if it is, you need a real backup plan).

This runbook does not assume Redis is a system of record.

7. Backup frequency and retention

Suggested starting point:

  • Daily full database backup via pg_dump to the NAS
  • Retention: at least 7 days of daily dumps on the NAS
  • Optional: a weekly snapshot retained for 4 weeks
tip

Your real requirement is: "how far back do I need to be able to go after a bad change and still recover cleanly?" Set retention to cover that window plus a buffer.

8. Backup implementation

8.1 Backup storage PVC

Create a dedicated backup PVC in identity-internal that writes to NAS (through nfs-client or equivalent).

warning

If you use NFS, prefer ReadWriteMany if available. If your dynamic provisioner only supports ReadWriteOnce, make sure the backup CronJob is effectively the only consumer of the PVC (or that all consumers land on the same node), otherwise the job pod may fail to attach the volume.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: zitadel-backup
  namespace: identity-internal
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 50Gi
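Once Flux has applied the manifest, confirm the claim actually binds to NAS-backed storage before adding the CronJob (a quick check, assuming the names above):

kubectl -n identity-internal get pvc zitadel-backup
kubectl -n identity-internal get pvc zitadel-backup -o jsonpath='{.status.phase} {.spec.storageClassName} {.status.capacity.storage}{"\n"}'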

8.2 Backup CronJob

Create a CronJob to run pg_dump and write into the backup PVC.

note

This assumes:

  • DB host is zitadel-db (Kubernetes Service name)
  • DB is named zitadel
  • Secret zitadel-db-secret contains POSTGRES_PASSWORD

Adjust these to match your deployment.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zitadel-db-backup
  namespace: identity-internal
spec:
  schedule: "0 2 * * *" # daily at 02:00
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          securityContext:
            seccompProfile:
              type: RuntimeDefault
          containers:
            - name: pg-backup
              image: postgres:16.4-alpine
              imagePullPolicy: IfNotPresent
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: zitadel-db-secret
                      key: POSTGRES_PASSWORD
              command:
                - /bin/sh
                - -c
                - |
                  set -euo pipefail

                  BACKUP_DIR=/backups/zitadel
                  mkdir -p "$BACKUP_DIR"

                  TS="$(date -u +%Y%m%d-%H%M%S)"
                  FILE="$BACKUP_DIR/zitadel-$TS.dump"

                  echo "Starting backup to $FILE"

                  pg_dump -h zitadel-db -U postgres -d zitadel -Fc -f "$FILE"

                  echo "Backup complete: $FILE"

                  # Optional: delete backups older than 7 days
                  find "$BACKUP_DIR" -type f -name 'zitadel-*.dump' -mtime +7 -print -delete
              volumeMounts:
                - name: backup-volume
                  mountPath: /backups
              resources:
                requests:
                  cpu: 25m
                  memory: 64Mi
                limits:
                  cpu: 200m
                  memory: 256Mi
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                runAsUser: 70 # postgres UID in the alpine image; required because the image itself defaults to root
                capabilities:
                  drop: ["ALL"]
          volumes:
            - name: backup-volume
              persistentVolumeClaim:
                claimName: zitadel-backup

8.3 Manual on-demand backup

kubectl -n identity-internal create job \
  --from=cronjob/zitadel-db-backup \
  zitadel-db-backup-manual-$(date -u +%Y%m%d-%H%M%S)

kubectl -n identity-internal get jobs
kubectl -n identity-internal logs job/zitadel-db-backup-manual-...
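If you do not want to copy the generated job name by hand, a small helper along these lines works (it assumes the most recently created job in the namespace is the one you just triggered):

JOB="$(kubectl -n identity-internal get jobs --sort-by=.metadata.creationTimestamp -o name | tail -n 1)"
kubectl -n identity-internal logs "$JOB"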

9. Secrets backup

9.1 SOPS-encrypted secrets in Git

These are already versioned in Git and are part of your DR story:

  • k8s/prod/20-secrets-db.enc.yaml
  • k8s/prod/21-secrets-zitadel.enc.yaml

Protect:

  • Your Git remotes (backups of the repos)
  • The SOPS age private key used by Flux (sops-age Secret)
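For an out-of-cluster copy of the age key, one option (a sketch; the sops-age Secret name and the age.agekey key follow the common Flux convention and may differ in your flux-config) is to export it once and file it with the masterkey in your password manager:

# Prints the private key to the terminal; handle it like the masterkey
kubectl -n flux-system get secret sops-age -o jsonpath='{.data.age\.agekey}' | base64 -d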

9.2 Offline copy of masterkey

For defence in depth, store the ZITADEL masterkey in a password manager.

cd ~/Projects/identity-internal/k8s/prod
sops -d 21-secrets-zitadel.enc.yaml | sed -n '1,60p'
warning

Treat the masterkey like a root credential. Anyone with it can decrypt ZITADEL data.

10. Restore procedures

danger

Do not restore into a running ZITADEL instance that is actively serving requests. Scale ZITADEL down first to prevent writes during restore.

10.1 Common pre-checks

kubectl -n identity-internal get pvc zitadel-backup
kubectl -n identity-internal get cronjob zitadel-db-backup
kubectl -n identity-internal get pods

10.2 In-place restore on the same cluster

  1. Stop ZITADEL writes:
kubectl -n identity-internal scale deployment zitadel --replicas=0
kubectl -n identity-internal get pods
  2. Start a temporary restore pod (mounts the backup PVC):
kubectl -n identity-internal run zitadel-db-restore \
  --image=postgres:16.4-alpine \
  --restart=Never \
  --overrides='{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "zitadel-db-restore", "namespace": "identity-internal"},
    "spec": {
      "containers": [{
        "name": "zitadel-db-restore",
        "image": "postgres:16.4-alpine",
        "command": ["sh", "-c", "sleep 3600"],
        "env": [{
          "name": "PGPASSWORD",
          "valueFrom": {
            "secretKeyRef": {"name": "zitadel-db-secret", "key": "POSTGRES_PASSWORD"}
          }
        }],
        "volumeMounts": [{"name": "backup-volume", "mountPath": "/backups"}]
      }],
      "volumes": [{"name": "backup-volume", "persistentVolumeClaim": {"claimName": "zitadel-backup"}}],
      "restartPolicy": "Never"
    }
  }'
  3. Drop and recreate the DB, then restore (inside the pod):
kubectl -n identity-internal exec -it zitadel-db-restore -- sh
ls -1 /backups/zitadel

psql -h zitadel-db -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'zitadel';"
psql -h zitadel-db -U postgres -c "DROP DATABASE IF EXISTS zitadel;"
psql -h zitadel-db -U postgres -c "CREATE DATABASE zitadel;"

pg_restore -h zitadel-db -U postgres -d zitadel -v /backups/zitadel/zitadel-YYYYMMDD-HHMMSS.dump
  4. Clean up and restart ZITADEL:
kubectl -n identity-internal delete pod zitadel-db-restore
kubectl -n identity-internal scale deployment zitadel --replicas=2
kubectl -n identity-internal rollout status deployment zitadel

10.3 Restore into a fresh cluster

High-level flow:

  1. Recreate Flux and flux-config (including wildcard TLS trust).
  2. Restore sops-age Secret in flux-system so Flux can decrypt secrets.
  3. Reconcile identity-internal to deploy PostgreSQL and secrets.
  4. Scale ZITADEL to zero.
  5. Restore the DB from NAS (pg_restore).
  6. Scale ZITADEL up and verify.
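A condensed command sketch for steps 2 to 6 (the age.agekey filename and the replica count are assumptions; adapt them to your flux-config and Helm values):

# 2. Restore the SOPS age key so Flux can decrypt secrets
kubectl -n flux-system create secret generic sops-age \
  --from-file=age.agekey=./age.agekey

# 3. Deploy PostgreSQL and secrets
flux reconcile kustomization identity-internal --with-source

# 4. Keep ZITADEL down while the database is restored
kubectl -n identity-internal scale deployment zitadel --replicas=0

# 5. Restore the dump using the temporary restore pod from section 10.2

# 6. Bring ZITADEL back and verify
kubectl -n identity-internal scale deployment zitadel --replicas=2
kubectl -n identity-internal rollout status deployment zitadel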

11. Testing the backup and restore

Suggested test cycle:

  1. Run the CronJob and confirm a .dump file is created on the backup PVC.
  2. Restore into a scratch DB (optional) or do a full restore rehearsal in a non-production cluster.
  3. Record results and any fixes needed in this runbook.
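For the scratch-database variant of step 2, a minimal sketch from inside the restore pod described in section 10.2 (zitadel_restore_test is just an example name):

psql -h zitadel-db -U postgres -c "CREATE DATABASE zitadel_restore_test;"
pg_restore -h zitadel-db -U postgres -d zitadel_restore_test -v /backups/zitadel/zitadel-YYYYMMDD-HHMMSS.dump
psql -h zitadel-db -U postgres -d zitadel_restore_test -c "\dn"   # sanity check: schemas restored?
psql -h zitadel-db -U postgres -c "DROP DATABASE zitadel_restore_test;"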

12. Verification checklist

12.1 Backup configuration

  • zitadel-backup PVC exists and is bound to NAS storage.
  • zitadel-db-backup CronJob exists and runs successfully.
  • Recent .dump files exist under /backups/zitadel.
  • SOPS age private key is stored securely outside the cluster.
  • ZITADEL masterkey is stored securely outside the cluster.
  • Wildcard TLS trust is applied and working (SSL_CERT_FILE in workloads).
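The first three items can be checked quickly from the CLI (inspecting the dump files themselves needs the restore-pod pattern from section 10.2):

kubectl -n identity-internal get pvc zitadel-backup
kubectl -n identity-internal get cronjob zitadel-db-backup
kubectl -n identity-internal get jobs --sort-by=.metadata.creationTimestamp | tail -n 5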

12.2 Restore

  • ZITADEL was scaled to zero before restore.
  • pg_restore completed without errors.
  • ZITADEL restarted and is healthy.
  • Login works at https://auth.reids.net.au.
  • SMTP and LDAP flows behave as expected.

13. Rollback

If backups are failing or storage is full:

  1. Increase backup PVC size or retention window.
  2. Temporarily disable the CronJob while you remediate:
kubectl -n identity-internal patch cronjob zitadel-db-backup -p '{"spec":{"suspend":true}}'

Re-enable it once fixed:

kubectl -n identity-internal patch cronjob zitadel-db-backup -p '{"spec":{"suspend":false}}'
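If you take the PVC-resize route and your StorageClass allows expansion, a short sketch (100Gi is just an example; with Flux managing the PVC, make the same change in the Git manifest so it is not reverted):

kubectl get storageclass nfs-client -o jsonpath='{.allowVolumeExpansion}{"\n"}'
kubectl -n identity-internal patch pvc zitadel-backup -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'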