Backup Restoration Guide#
Procedures for restoring PDS and knot from S3 backups after data loss.
Architecture#
All persistent volumes are backed by JuiceFS (S3-backed FUSE filesystem). Data survives node rescheduling and pod restarts — the old Hetzner Volumes data-loss-on-reschedule bug is eliminated.
What's backed up:
- PDS: SQLite databases only (`account.sqlite`, `did_cache.sqlite`, `sequencer.sqlite`). Blob storage is natively on S3 — not on the PVC, not in backups.
- Knot: SQLite database (`knotserver.db`) + git repositories (`repositories/`).
What's not backed up:
- Zot registry: Container images are rebuildable artifacts. No backups.
- PDS blobs: Stored natively in S3 by the PDS process. Already durable — not part of backup/restore.
Schedule: PDS at 02:00 UTC, knot at 02:30 UTC. Daily.
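For orientation, the schedule maps onto a CronJob spec like the sketch below. Only the schedule, bucket path, and snapshot naming come from this guide; the CronJob name, image choice, and script are assumptions, and the S3 flags and volume mount are omitted for brevity.

```yaml
# Sketch only — names and script are hypothetical, not the deployed manifest.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pds-backup          # hypothetical name
  namespace: pds
spec:
  schedule: "0 2 * * *"     # 02:00 UTC daily; knot's job would use "30 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: backup
            image: rclone/rclone:1.69
            command: ["sh", "-c"]
            args:
            - |
              # Copy each SQLite DB to a timestamped snapshot, e.g.
              # s3:sans-self-net/pds/db/account-20260225-020012.sqlite
              # (S3 credential flags and the /data mount omitted here)
              STAMP=$(date -u +%Y%m%d-%H%M%S)
              for db in account did_cache sequencer; do
                rclone copyto "/data/${db}.sqlite" \
                  ":s3:sans-self-net/pds/db/${db}-${STAMP}.sqlite"
              done
```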
Diagnosis#
Check if services have lost their data:
# PDS — should return your account, not RepoNotFound
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:wydyrngmxbcsqdvhmd7whmye"
# Knot — should return branches, not RepoNotFound
curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"
Check volume contents directly:
kubectl exec -n pds deployment/pds -- ls -la /pds/
kubectl exec -n knot deployment/knot -c knot -- ls -la /home/git/data/
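Both endpoints answer with JSON: a healthy account returns a document containing a `handle` field, while a wiped volume typically surfaces as an error body naming `RepoNotFound`. A small helper (hypothetical, for scripting these checks) classifies a saved response body:

```shell
# classify_repo_response BODY
# Prints "ok" for a healthy describeRepo body, "lost" when the error
# names RepoNotFound, and "unknown" for anything else (e.g. curl errors).
classify_repo_response() {
  case "$1" in
    *RepoNotFound*) echo "lost" ;;
    *'"handle"'*)   echo "ok" ;;
    *)              echo "unknown" ;;
  esac
}
```

Use it as `classify_repo_response "$(curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=...")"`; anything but `ok` means keep digging.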
Inspect S3 Backups#
Get S3 credentials:
kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.access-key}' | base64 -d
kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.secret-key}' | base64 -d
List available snapshots:
kubectl run s3-check --rm -it --restart=Never --image=rclone/rclone:1.69 -- \
  ls :s3:sans-self-net/ \
  --s3-provider Other \
  --s3-access-key-id "${S3_ACCESS_KEY}" \
  --s3-secret-access-key "${S3_SECRET_KEY}" \
  --s3-endpoint nbg1.your-objectstorage.com \
  --s3-region nbg1 --s3-no-check-bucket
DB snapshots are timestamped (e.g. `account-20260225-020012.sqlite`). Pick the newest one from before the data loss.
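Because the timestamps sort lexically, the newest usable snapshot can be picked mechanically from the listing. A sketch (the helper name is made up; the input is `rclone ls`-style "SIZE NAME" lines):

```shell
# newest_before CUTOFF
# Reads "SIZE NAME" lines (rclone ls output) on stdin and prints the
# timestamp of the newest account snapshot at or before CUTOFF
# (format YYYYMMDD-HHMMSS). Lexical sort matches chronological order.
newest_before() {
  awk '{print $2}' \
    | grep '^pds/db/account-' \
    | sed 's|^pds/db/account-||; s|\.sqlite$||' \
    | sort \
    | awk -v c="$1" '$0 <= c' \
    | tail -n 1
}
```

For example, piping the bucket listing through `newest_before 20260225-120000` prints the last snapshot taken before noon UTC on the 25th.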
Restore PDS#
1. Scale down#
kubectl scale deployment -n pds pds --replicas=0
kubectl wait --for=delete pod -n pds -l app=pds --timeout=60s
2. Run restore job#
Replace TIMESTAMP with the chosen snapshot timestamp (e.g. 20260225-020012).
Only SQLite databases are restored — blob storage lives natively on S3 and doesn't need restoration.
kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: pds-restore
  namespace: pds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
      - name: restore
        image: rclone/rclone:1.69
        command: ["sh", "-c"]
        args:
        - |
          set -eux
          S3="--s3-provider Other --s3-access-key-id ${S3_ACCESS_KEY} --s3-secret-access-key ${S3_SECRET_KEY} --s3-endpoint nbg1.your-objectstorage.com --s3-region nbg1 --s3-no-check-bucket"
          rm -rf /data/*
          # Replace TIMESTAMP with the chosen snapshot (e.g. 20260225-020012)
          rclone copyto ":s3:sans-self-net/pds/db/account-TIMESTAMP.sqlite" /data/account.sqlite ${S3}
          rclone copyto ":s3:sans-self-net/pds/db/did_cache-TIMESTAMP.sqlite" /data/did_cache.sqlite ${S3}
          rclone copyto ":s3:sans-self-net/pds/db/sequencer-TIMESTAMP.sqlite" /data/sequencer.sqlite ${S3}
          ls -la /data/
          echo "PDS restore complete"
        env:
        - name: S3_ACCESS_KEY
          valueFrom:
            secretKeyRef: { name: pds-s3-credentials, key: access-key }
        - name: S3_SECRET_KEY
          valueFrom:
            secretKeyRef: { name: pds-s3-credentials, key: secret-key }
        volumeMounts:
        - { name: data, mountPath: /data }
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: pds-data
YAML
3. Fix sequencer cursor#
The relay (bsky.network) tracks the last sequence number it consumed. After a restore, the sequencer's autoincrement is behind the relay's cursor, so new events are invisible to the network.
Check the relay's cursor from PDS logs after scaling back up:
kubectl logs -n pds deployment/pds --tail=100 | grep subscribeRepos
# Look for: "cursor":NNN
Then bump the autoincrement past that cursor. Scale the deployment back down first (as in step 1), then run:
kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: pds-seq-fix
  namespace: pds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext: { fsGroup: 1000, runAsUser: 0 }
      containers:
      - name: fix
        image: keinos/sqlite3:3.47.2
        command: ["sh", "-c"]
        args:
        - |
          set -eux
          # Set seq to at least relay_cursor + 100 (1000 here is an example)
          sqlite3 /data/sequencer.sqlite "UPDATE sqlite_sequence SET seq = 1000 WHERE name = 'repo_seq';"
          sqlite3 /data/sequencer.sqlite "SELECT seq FROM sqlite_sequence WHERE name='repo_seq';"
          chown 1000:1000 /data/sequencer.sqlite
        volumeMounts:
        - { name: data, mountPath: /data }
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: pds-data
YAML
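The UPDATE in the job hardcodes `seq = 1000` as an example. Given the `"cursor":NNN` value from the subscribeRepos log line, the target can be computed with a small helper (hypothetical, for scripting):

```shell
# seq_target LOGLINE
# Extracts the relay cursor from a subscribeRepos log line (the
# "cursor":NNN field) and prints cursor + 100, a safe value to write
# into sqlite_sequence for repo_seq.
seq_target() {
  cursor=$(printf '%s\n' "$1" | sed -n 's/.*"cursor":\([0-9][0-9]*\).*/\1/p')
  echo $((cursor + 100))
}
```

Substitute the printed value for `1000` in the UPDATE statement.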
4. Scale up and request crawl#
kubectl scale deployment -n pds pds --replicas=1
kubectl wait --for=condition=ready pod -n pds -l app=pds --timeout=120s
# Tell the relay to re-subscribe
curl -X POST "https://bsky.network/xrpc/com.atproto.sync.requestCrawl" \
-H "Content-Type: application/json" \
-d '{"hostname": "sans-self.org"}'
5. Verify#
# All accounts resolve
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:wydyrngmxbcsqdvhmd7whmye" | jq .handle
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:sg4udwrlnokqtpteaswzcps5" | jq .handle
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:uog7vhnxiskidenntic67g3z" | jq .handle
# Test that new posts propagate — create a post via the app and check it appears on bsky.app
Restore Knot#
1. Scale down#
kubectl scale deployment -n knot knot --replicas=0
kubectl wait --for=delete pod -n knot -l app=knot --timeout=60s
2. Run restore job#
kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: knot-restore
  namespace: knot
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
      - name: restore
        image: rclone/rclone:1.69
        command: ["sh", "-c"]
        args:
        - |
          set -eux
          S3="--s3-provider Other --s3-access-key-id ${S3_ACCESS_KEY} --s3-secret-access-key ${S3_SECRET_KEY} --s3-endpoint nbg1.your-objectstorage.com --s3-region nbg1 --s3-no-check-bucket"
          rm -rf /data/*
          # Replace TIMESTAMP (e.g. 20260224-023011)
          mkdir -p /data/data
          rclone copyto ":s3:sans-self-net/knot/db/knotserver-TIMESTAMP.db" /data/data/knotserver.db ${S3}
          rclone copy ":s3:sans-self-net/knot/repositories" /data/repositories ${S3}
          ls -la /data/
          echo "knot restore complete"
        env:
        - name: S3_ACCESS_KEY
          valueFrom:
            secretKeyRef: { name: knot-s3-credentials, key: access-key }
        - name: S3_SECRET_KEY
          valueFrom:
            secretKeyRef: { name: knot-s3-credentials, key: secret-key }
        volumeMounts:
        - { name: data, mountPath: /data }
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: knot-data
YAML
3. Fix post-receive hooks#
Restored git repositories may have non-executable post-receive hooks (FUSE default_permissions prevents root-in-container from chmod on files owned by git). Fix as the git user:
kubectl exec -n knot deploy/knot -c knot -- su -s /bin/sh git -c \
  'find /home/git/repositories -name post-receive -exec chmod +x {} \;'
Without this, pushes land in the bare repo but knot never processes them — no feed updates, no diff indexing, no notifications.
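To confirm the fix took, list any hook still missing the execute bit; no output means all hooks are fine. The wrapper name is hypothetical — inside the pod, run the inner `find` via the same `su ... git` invocation shown above:

```shell
# list_nonexec_hooks DIR
# Prints every post-receive hook under DIR that lacks the owner execute
# bit. Empty output means every hook is executable.
list_nonexec_hooks() {
  find "$1" -name post-receive ! -perm -u+x
}
```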
4. Fix repo ACLs (if needed)#
The knot DB stores per-repo ACL entries. If a repo was created after the backup, its ACL will be missing and pushes will fail with `access denied: user not allowed` even though SSH auth succeeds.
Copy the DB out, inspect, and patch:
# Copy DB out of the running pod (after scale-up)
kubectl cp knot/$(kubectl get pod -n knot -l app=knot -o jsonpath='{.items[0].metadata.name}'):/home/git/data/knotserver.db /tmp/knotserver.db -c knot
# Check existing ACLs
sqlite3 /tmp/knotserver.db "SELECT * FROM acl;"
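To get a mechanical list of repos that exist on disk but have no ACL rows, compare the two listings. Assuming the casbin-style schema implied by the INSERT statements in the fix job, where the repo is the fourth column (`v2` in a stock casbin table — an assumption), feed one file of `did/repo` names from disk and one from `SELECT DISTINCT v2 FROM acl;`:

```shell
# repos_missing_acl DISK_LIST ACL_LIST
# Prints repo names (one per line) present in DISK_LIST but absent from
# ACL_LIST. Inputs are de-duplicated and sorted before comparison, as
# comm requires sorted input.
repos_missing_acl() {
  _d=$(mktemp); _a=$(mktemp)
  sort -u "$1" > "$_d"
  sort -u "$2" > "$_a"
  comm -23 "$_d" "$_a"     # lines unique to the disk listing
  rm -f "$_d" "$_a"
}
```

Every name it prints needs the ACL entries added by the job below.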
To add ACL entries for a missing repo, scale down and run:
kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: knot-acl-fix
  namespace: knot
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext: { fsGroup: 1000, runAsUser: 0 }
      containers:
      - name: fix
        image: keinos/sqlite3:3.47.2
        command: ["sh", "-c"]
        args:
        - |
          set -eux
          DID="did:plc:wydyrngmxbcsqdvhmd7whmye"
          REPO="${DID}/REPO_NAME"
          sqlite3 /data/data/knotserver.db "
          INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:settings','','');
          INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:push','','');
          INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:owner','','');
          INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:invite','','');
          INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:delete','','');
          INSERT INTO acl VALUES ('p','server:owner','thisserver','${REPO}','repo:delete','','');
          "
          chown 1000:1000 /data/data/knotserver.db
        volumeMounts:
        - { name: data, mountPath: /data }
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: knot-data
YAML
5. Scale up and verify#
kubectl scale deployment -n knot knot --replicas=1
kubectl wait --for=condition=ready pod -n knot -l app=knot --timeout=120s
# Verify repos resolve
curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"
# Test push
git push --dry-run origin main
Post-Restore Cleanup#
Delete completed restore jobs:
kubectl delete job -n pds pds-restore pds-seq-fix 2>/dev/null
kubectl delete job -n knot knot-restore knot-acl-fix 2>/dev/null
Remove stale SSH host keys (knot regenerates host keys on every pod restart):
ssh-keygen -R knot.sans-self.org
Known Gotchas#
- PDS blobs are not in backups. They live natively on S3 via the PDS process. If the S3 bucket itself is lost, blobs are gone. The backup only covers SQLite databases.
- Choose the right DB snapshot. Check all available timestamps in S3. The most recent snapshot before data loss is usually best, but if accounts were created between backups, a later snapshot might have more complete account records.
- Sequencer cursor mismatch kills federation. Posts succeed locally but don't reach Bluesky. Always bump the sequencer autoincrement past the relay's cursor after restore.
- Knot ACLs are per-repo. The server owner can push to repos that have ACL entries. Repos created after the backup will have git data on disk but no ACL — you must add entries manually.
- Knot post-receive hooks may lose execute permissions. After restoring from S3, hooks may not be executable due to FUSE `default_permissions`. Must `chmod` as the `git` user, not root.
- SSH host keys change on pod restart. Every knot scale-down/up regenerates sshd host keys. Run `ssh-keygen -R knot.sans-self.org` to clear stale entries.