# Backup Restoration Guide

Procedures for restoring PDS and knot from S3 backups after data loss.

## Architecture

All persistent volumes are backed by JuiceFS (an S3-backed FUSE filesystem). Data survives node rescheduling and pod restarts; the old Hetzner Volumes data-loss-on-reschedule bug is eliminated.

**What's backed up:**

- **PDS**: SQLite databases only (`account.sqlite`, `did_cache.sqlite`, `sequencer.sqlite`). Blob storage is natively on S3: not on the PVC, not in backups.
- **Knot**: SQLite database (`knotserver.db`) plus git repositories (`repositories/`).

**What's not backed up:**

- **Zot registry**: Container images are rebuildable artifacts. No backups.
- **PDS blobs**: Stored natively in S3 by the PDS process. Already durable; not part of backup/restore.

**Schedule:** PDS daily at 02:00 UTC, knot daily at 02:30 UTC.

## Diagnosis

Check whether services have lost their data:

```sh
# PDS — should return your account, not RepoNotFound
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:wydyrngmxbcsqdvhmd7whmye"

# Knot — should return branches, not RepoNotFound
curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"
```

Check volume contents directly:

```sh
kubectl exec -n pds deployment/pds -- ls -la /pds/
kubectl exec -n knot deployment/knot -c knot -- ls -la /home/git/data/
```

## Inspect S3 Backups

Get S3 credentials:

```sh
kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.access-key}' | base64 -d
kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.secret-key}' | base64 -d
```

List available snapshots:

```sh
kubectl run s3-check --rm -it --restart=Never --image=rclone/rclone:1.69 -- \
  ls :s3:sans-self-net/ \
  --s3-provider Other \
  --s3-access-key-id "${S3_ACCESS_KEY}" \
  --s3-secret-access-key "${S3_SECRET_KEY}" \
  --s3-endpoint nbg1.your-objectstorage.com \
  --s3-region nbg1 --s3-no-check-bucket
```

DB snapshots are timestamped (e.g.
`account-20260225-020012.sqlite`). Pick the newest one from *before* the data loss.

## Restore PDS

### 1. Scale down

```sh
kubectl scale deployment -n pds pds --replicas=0
kubectl wait --for=delete pod -n pds -l app=pds --timeout=60s
```

### 2. Run restore job

Replace `TIMESTAMP` with the chosen snapshot timestamp (e.g. `20260225-020012`). Only SQLite databases are restored; blob storage lives natively on S3 and doesn't need restoration.

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: pds-restore
  namespace: pds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: restore
          image: rclone/rclone:1.69
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              S3="--s3-provider Other --s3-access-key-id ${S3_ACCESS_KEY} --s3-secret-access-key ${S3_SECRET_KEY} --s3-endpoint nbg1.your-objectstorage.com --s3-region nbg1 --s3-no-check-bucket"
              rm -rf /data/*
              # Replace TIMESTAMP with the chosen snapshot (e.g. 20260225-020012)
              rclone copyto ":s3:sans-self-net/pds/db/account-TIMESTAMP.sqlite" /data/account.sqlite ${S3}
              rclone copyto ":s3:sans-self-net/pds/db/did_cache-TIMESTAMP.sqlite" /data/did_cache.sqlite ${S3}
              rclone copyto ":s3:sans-self-net/pds/db/sequencer-TIMESTAMP.sqlite" /data/sequencer.sqlite ${S3}
              ls -la /data/
              echo "PDS restore complete"
          env:
            - name: S3_ACCESS_KEY
              valueFrom:
                secretKeyRef: { name: pds-s3-credentials, key: access-key }
            - name: S3_SECRET_KEY
              valueFrom:
                secretKeyRef: { name: pds-s3-credentials, key: secret-key }
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pds-data
```

### 3. Fix sequencer cursor

The relay (bsky.network) tracks the last sequence number it consumed. After a restore, the sequencer's autoincrement is behind the relay's cursor, so new events are invisible to the network.
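The mechanics of the fix rely on SQLite's `sqlite_sequence` table, which stores the last AUTOINCREMENT value per table: overwriting that counter makes the next event sequence past the relay's cursor. A local sketch against a throwaway in-memory database (Python stdlib `sqlite3`; the `repo_seq` table name matches the PDS sequencer, the `evt` column is illustrative):

```python
import sqlite3

# Throwaway in-memory DB standing in for sequencer.sqlite
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE repo_seq (seq INTEGER PRIMARY KEY AUTOINCREMENT, evt TEXT)")
db.executemany("INSERT INTO repo_seq (evt) VALUES (?)", [("a",), ("b",), ("c",)])

# After a restore, the counter sits wherever the snapshot left it
(last,) = db.execute(
    "SELECT seq FROM sqlite_sequence WHERE name = 'repo_seq'"
).fetchone()
print(last)  # 3

# Bump past the relay's cursor (pretend the relay is at 900)
relay_cursor = 900
db.execute(
    "UPDATE sqlite_sequence SET seq = ? WHERE name = 'repo_seq'",
    (relay_cursor + 100,),
)

# The next event now sequences past the cursor, so the relay will consume it
db.execute("INSERT INTO repo_seq (evt) VALUES ('d')")
(next_seq,) = db.execute("SELECT max(seq) FROM repo_seq").fetchone()
print(next_seq)  # 1001
```

This is why the fix job below updates `sqlite_sequence` rather than touching `repo_seq` rows directly: the counter, not the row data, decides the next sequence number.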
Check the relay's cursor from PDS logs after scaling back up:

```sh
kubectl logs -n pds deployment/pds --tail=100 | grep subscribeRepos
# Look for: "cursor":NNN
```

Then bump the autoincrement past that cursor. Scale down again first:

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: pds-seq-fix
  namespace: pds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext: { fsGroup: 1000, runAsUser: 0 }
      containers:
        - name: fix
          image: keinos/sqlite3:3.47.2
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              # Set to at least relay_cursor + 100
              sqlite3 /data/sequencer.sqlite "UPDATE sqlite_sequence SET seq = 1000 WHERE name = 'repo_seq';"
              sqlite3 /data/sequencer.sqlite "SELECT seq FROM sqlite_sequence WHERE name='repo_seq';"
              chown 1000:1000 /data/sequencer.sqlite
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pds-data
```

### 4. Scale up and request crawl

```sh
kubectl scale deployment -n pds pds --replicas=1
kubectl wait --for=condition=ready pod -n pds -l app=pds --timeout=120s

# Tell the relay to re-subscribe
curl -X POST "https://bsky.network/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d '{"hostname": "sans-self.org"}'
```

### 5. Verify

```sh
# All accounts resolve
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:wydyrngmxbcsqdvhmd7whmye" | jq .handle
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:sg4udwrlnokqtpteaswzcps5" | jq .handle
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:uog7vhnxiskidenntic67g3z" | jq .handle

# Test that new posts propagate — create a post via the app and check it appears on bsky.app
```

## Restore Knot

### 1. Scale down

```sh
kubectl scale deployment -n knot knot --replicas=0
kubectl wait --for=delete pod -n knot -l app=knot --timeout=60s
```

### 2. Run restore job

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: knot-restore
  namespace: knot
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: restore
          image: rclone/rclone:1.69
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              S3="--s3-provider Other --s3-access-key-id ${S3_ACCESS_KEY} --s3-secret-access-key ${S3_SECRET_KEY} --s3-endpoint nbg1.your-objectstorage.com --s3-region nbg1 --s3-no-check-bucket"
              rm -rf /data/*
              # Replace TIMESTAMP (e.g. 20260224-023011)
              mkdir -p /data/data
              rclone copyto ":s3:sans-self-net/knot/db/knotserver-TIMESTAMP.db" /data/data/knotserver.db ${S3}
              rclone copy ":s3:sans-self-net/knot/repositories" /data/repositories ${S3}
              ls -la /data/
              echo "knot restore complete"
          env:
            - name: S3_ACCESS_KEY
              valueFrom:
                secretKeyRef: { name: knot-s3-credentials, key: access-key }
            - name: S3_SECRET_KEY
              valueFrom:
                secretKeyRef: { name: knot-s3-credentials, key: secret-key }
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: knot-data
```

### 3. Fix post-receive hooks

Restored git repositories may have non-executable `post-receive` hooks (FUSE `default_permissions` prevents root-in-container from running chmod on files owned by `git`). Fix as the `git` user:

```sh
kubectl exec -n knot deploy/knot -- su -s /bin/sh git -c \
  'find /home/git/repositories -name post-receive -exec chmod +x {} \;'
```

Without this, pushes land in the bare repo but knot never processes them: no feed updates, no diff indexing, no notifications.

### 4. Fix repo ACLs (if needed)

The knot DB stores per-repo ACL entries. If a repo was created after the backup, its ACL will be missing and pushes will fail with `access denied: user not allowed` even though SSH auth succeeds.
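The check knot performs reduces to a row lookup: push is permitted only when a `repo:push` row exists for that DID and repo. A local sketch of the rows the fix job writes, using Python's stdlib `sqlite3` against a throwaway DB; the seven-column `acl` layout mirrors the positional `INSERT` statements in the job below, but the column names (`ptype`, `v0`…`v5`) are hypothetical, since knot's actual schema isn't shown here:

```python
import sqlite3

DID = "did:plc:wydyrngmxbcsqdvhmd7whmye"
REPO = f"{DID}/REPO_NAME"

db = sqlite3.connect(":memory:")
# Hypothetical column names; the fix job only uses positional INSERTs
db.execute(
    "CREATE TABLE acl (ptype TEXT, v0 TEXT, v1 TEXT, v2 TEXT, v3 TEXT, v4 TEXT, v5 TEXT)"
)

# Same six rows the knot-acl-fix job inserts for a restored repo
perms = ["repo:settings", "repo:push", "repo:owner", "repo:invite", "repo:delete"]
rows = [("p", DID, "thisserver", REPO, p, "", "") for p in perms]
rows.append(("p", "server:owner", "thisserver", REPO, "repo:delete", "", ""))
db.executemany("INSERT INTO acl VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# A push succeeds only if a repo:push row exists for this DID + repo;
# a repo created after the backup has no such row, hence "access denied"
allowed = db.execute(
    "SELECT count(*) FROM acl WHERE v0 = ? AND v2 = ? AND v3 = 'repo:push'",
    (DID, REPO),
).fetchone()[0] > 0
print(allowed)  # True
```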
Copy the DB out, inspect, and patch:

```sh
# Copy DB out of the running pod (after scale-up)
kubectl cp knot/$(kubectl get pod -n knot -l app=knot -o jsonpath='{.items[0].metadata.name}'):/home/git/data/knotserver.db /tmp/knotserver.db -c knot

# Check existing ACLs
sqlite3 /tmp/knotserver.db "SELECT * FROM acl;"
```

To add ACL entries for a missing repo, scale down and run:

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: knot-acl-fix
  namespace: knot
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext: { fsGroup: 1000, runAsUser: 0 }
      containers:
        - name: fix
          image: keinos/sqlite3:3.47.2
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              DID="did:plc:wydyrngmxbcsqdvhmd7whmye"
              REPO="${DID}/REPO_NAME"
              sqlite3 /data/data/knotserver.db "
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:settings','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:push','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:owner','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:invite','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:delete','','');
                INSERT INTO acl VALUES ('p','server:owner','thisserver','${REPO}','repo:delete','','');
              "
              chown 1000:1000 /data/data/knotserver.db
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: knot-data
```

### 5. Scale up and verify

```sh
kubectl scale deployment -n knot knot --replicas=1
kubectl wait --for=condition=ready pod -n knot -l app=knot --timeout=120s

# Verify repos resolve
curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"

# Test push
git push --dry-run origin main
```

## Post-Restore Cleanup

Delete completed restore jobs:

```sh
kubectl delete job -n pds pds-restore pds-seq-fix 2>/dev/null
kubectl delete job -n knot knot-restore knot-acl-fix 2>/dev/null
```

Remove stale SSH host keys (knot regenerates host keys on every pod restart):

```sh
ssh-keygen -R knot.sans-self.org
```

## Known Gotchas

- **PDS blobs are not in backups.** They live natively on S3 via the PDS process. If the S3 bucket itself is lost, the blobs are gone. The backup only covers SQLite databases.
- **Choose the right DB snapshot.** Check all available timestamps in S3. The most recent snapshot before the data loss is usually best, but if accounts were created between backups, a later snapshot might have more complete account records.
- **Sequencer cursor mismatch kills federation.** Posts succeed locally but don't reach Bluesky. Always bump the sequencer autoincrement past the relay's cursor after a restore.
- **Knot ACLs are per-repo.** The server owner can push to repos that have ACL entries. Repos created after the backup will have git data on disk but no ACL; you must add entries manually.
- **Knot post-receive hooks may lose execute permissions.** After restoring from S3, hooks may not be executable due to FUSE `default_permissions`. Run chmod as the `git` user, not root.
- **SSH host keys change on pod restart.** Every knot scale-down/up regenerates sshd host keys. Run `ssh-keygen -R` to clear stale entries.