# Backup Restoration Guide

Procedures for restoring PDS and knot from S3 backups after data loss.

## Architecture

All persistent volumes are backed by JuiceFS (an S3-backed FUSE filesystem). Data survives node rescheduling and pod restarts; the old Hetzner Volumes data-loss-on-reschedule bug is eliminated.

**What's backed up:**

- **PDS**: SQLite databases only (`account.sqlite`, `did_cache.sqlite`, `sequencer.sqlite`). Blob storage is natively on S3: not on the PVC, not in backups.
- **Knot**: SQLite database (`knotserver.db`) plus git repositories (`repositories/`).

**What's not backed up:**

- **Zot registry**: Container images are rebuildable artifacts. No backups.
- **PDS blobs**: Stored natively in S3 by the PDS process. Already durable; not part of backup/restore.

**Schedule:** PDS daily at 02:00 UTC, knot daily at 02:30 UTC.

## Diagnosis

Check whether services have lost their data:

```sh
# PDS — should return your account, not RepoNotFound
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:wydyrngmxbcsqdvhmd7whmye"

# Knot — should return branches, not RepoNotFound
curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"
```

Check volume contents directly:

```sh
kubectl exec -n pds deployment/pds -- ls -la /pds/
kubectl exec -n knot deployment/knot -c knot -- ls -la /home/git/data/
```

## Inspect S3 Backups

Get S3 credentials:

```sh
kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.access-key}' | base64 -d
kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.secret-key}' | base64 -d
```

List available snapshots:

```sh
kubectl run s3-check --rm -it --restart=Never --image=rclone/rclone:1.69 -- \
  ls :s3:sans-self-net/ \
  --s3-provider Other \
  --s3-access-key-id "${S3_ACCESS_KEY}" \
  --s3-secret-access-key "${S3_SECRET_KEY}" \
  --s3-endpoint nbg1.your-objectstorage.com \
  --s3-region nbg1 --s3-no-check-bucket
```

DB snapshots are timestamped (e.g.
`account-20260225-020012.sqlite`). Pick the newest one from *before* the data loss.

## Restore PDS

### 1. Scale down

```sh
kubectl scale deployment -n pds pds --replicas=0
kubectl wait --for=delete pod -n pds -l app=pds --timeout=60s
```

### 2. Run restore job

Replace `TIMESTAMP` with the chosen snapshot timestamp (e.g. `20260225-020012`). Only SQLite databases are restored; blob storage lives natively on S3 and doesn't need restoration.

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: pds-restore
  namespace: pds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: restore
          image: rclone/rclone:1.69
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              S3="--s3-provider Other --s3-access-key-id ${S3_ACCESS_KEY} --s3-secret-access-key ${S3_SECRET_KEY} --s3-endpoint nbg1.your-objectstorage.com --s3-region nbg1 --s3-no-check-bucket"
              rm -rf /data/*
              # Replace TIMESTAMP with the chosen snapshot (e.g. 20260225-020012)
              rclone copyto ":s3:sans-self-net/pds/db/account-TIMESTAMP.sqlite" /data/account.sqlite ${S3}
              rclone copyto ":s3:sans-self-net/pds/db/did_cache-TIMESTAMP.sqlite" /data/did_cache.sqlite ${S3}
              rclone copyto ":s3:sans-self-net/pds/db/sequencer-TIMESTAMP.sqlite" /data/sequencer.sqlite ${S3}
              ls -la /data/
              echo "PDS restore complete"
          env:
            - name: S3_ACCESS_KEY
              valueFrom:
                secretKeyRef: { name: pds-s3-credentials, key: access-key }
            - name: S3_SECRET_KEY
              valueFrom:
                secretKeyRef: { name: pds-s3-credentials, key: secret-key }
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pds-data
```

### 3. Fix sequencer cursor

The relay (bsky.network) tracks the last sequence number it consumed. After a restore, the sequencer's autoincrement is behind the relay's cursor, so new events are invisible to the network.
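The mechanics of the fix rely on SQLite's `sqlite_sequence` table, which stores the last AUTOINCREMENT value per table: overwriting that counter makes the next event sequence past the relay's cursor. A local sketch against a throwaway in-memory database (Python stdlib `sqlite3`; the `repo_seq` table name matches the PDS sequencer, the `evt` column is illustrative):

```python
import sqlite3

# Throwaway in-memory DB standing in for sequencer.sqlite
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE repo_seq (seq INTEGER PRIMARY KEY AUTOINCREMENT, evt TEXT)")
db.executemany("INSERT INTO repo_seq (evt) VALUES (?)", [("a",), ("b",), ("c",)])

# After a restore, the counter sits wherever the snapshot left it
(last,) = db.execute(
    "SELECT seq FROM sqlite_sequence WHERE name = 'repo_seq'"
).fetchone()
print(last)  # 3

# Bump past the relay's cursor (pretend the relay is at 900)
relay_cursor = 900
db.execute(
    "UPDATE sqlite_sequence SET seq = ? WHERE name = 'repo_seq'",
    (relay_cursor + 100,),
)

# The next event now sequences past the cursor, so the relay will consume it
db.execute("INSERT INTO repo_seq (evt) VALUES ('d')")
(next_seq,) = db.execute("SELECT max(seq) FROM repo_seq").fetchone()
print(next_seq)  # 1001
```

This is why the fix job below updates `sqlite_sequence` rather than touching `repo_seq` rows directly: the counter, not the row data, decides the next sequence number.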
Check the relay's cursor from PDS logs after scaling back up:

```sh
kubectl logs -n pds deployment/pds --tail=100 | grep subscribeRepos
# Look for: "cursor":NNN
```

Then bump the autoincrement past that cursor. Scale down again first:

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: pds-seq-fix
  namespace: pds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext: { fsGroup: 1000, runAsUser: 0 }
      containers:
        - name: fix
          image: keinos/sqlite3:3.47.2
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              # Set to at least relay_cursor + 100
              sqlite3 /data/sequencer.sqlite "UPDATE sqlite_sequence SET seq = 1000 WHERE name = 'repo_seq';"
              sqlite3 /data/sequencer.sqlite "SELECT seq FROM sqlite_sequence WHERE name='repo_seq';"
              chown 1000:1000 /data/sequencer.sqlite
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pds-data
```

### 4. Scale up and request crawl

```sh
kubectl scale deployment -n pds pds --replicas=1
kubectl wait --for=condition=ready pod -n pds -l app=pds --timeout=120s

# Tell the relay to re-subscribe
curl -X POST "https://bsky.network/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d '{"hostname": "sans-self.org"}'
```

### 5. Verify

```sh
# All accounts resolve
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:wydyrngmxbcsqdvhmd7whmye" | jq .handle
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:sg4udwrlnokqtpteaswzcps5" | jq .handle
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:uog7vhnxiskidenntic67g3z" | jq .handle

# Test that new posts propagate — create a post via the app and check it appears on bsky.app
```

## Restore Knot

### 1. Scale down

```sh
kubectl scale deployment -n knot knot --replicas=0
kubectl wait --for=delete pod -n knot -l app=knot --timeout=60s
```

### 2. Run restore job

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: knot-restore
  namespace: knot
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: restore
          image: rclone/rclone:1.69
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              S3="--s3-provider Other --s3-access-key-id ${S3_ACCESS_KEY} --s3-secret-access-key ${S3_SECRET_KEY} --s3-endpoint nbg1.your-objectstorage.com --s3-region nbg1 --s3-no-check-bucket"
              rm -rf /data/*
              # Replace TIMESTAMP (e.g. 20260224-023011)
              mkdir -p /data/data
              rclone copyto ":s3:sans-self-net/knot/db/knotserver-TIMESTAMP.db" /data/data/knotserver.db ${S3}
              rclone copy ":s3:sans-self-net/knot/repositories" /data/repositories ${S3}
              ls -la /data/
              echo "knot restore complete"
          env:
            - name: S3_ACCESS_KEY
              valueFrom:
                secretKeyRef: { name: knot-s3-credentials, key: access-key }
            - name: S3_SECRET_KEY
              valueFrom:
                secretKeyRef: { name: knot-s3-credentials, key: secret-key }
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: knot-data
```

### 3. Fix post-receive hooks

Restored git repositories may have non-executable `post-receive` hooks (FUSE `default_permissions` prevents root-in-container from running chmod on files owned by `git`). Fix as the `git` user:

```sh
kubectl exec -n knot deploy/knot -- su -s /bin/sh git -c \
  'find /home/git/repositories -name post-receive -exec chmod +x {} \;'
```

Without this, pushes land in the bare repo but knot never processes them: no feed updates, no diff indexing, no notifications.

### 4. Fix repo ACLs (if needed)

The knot DB stores per-repo ACL entries. If a repo was created after the backup, its ACL will be missing and pushes will fail with `access denied: user not allowed` even though SSH auth succeeds.
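The check knot performs reduces to a row lookup: push is permitted only when a `repo:push` row exists for that DID and repo. A local sketch of the rows the fix job writes, using Python's stdlib `sqlite3` against a throwaway DB; the seven-column `acl` layout mirrors the positional `INSERT` statements in the job below, but the column names (`ptype`, `v0`…`v5`) are hypothetical, since knot's actual schema isn't shown here:

```python
import sqlite3

DID = "did:plc:wydyrngmxbcsqdvhmd7whmye"
REPO = f"{DID}/REPO_NAME"

db = sqlite3.connect(":memory:")
# Hypothetical column names; the fix job only uses positional INSERTs
db.execute(
    "CREATE TABLE acl (ptype TEXT, v0 TEXT, v1 TEXT, v2 TEXT, v3 TEXT, v4 TEXT, v5 TEXT)"
)

# Same six rows the knot-acl-fix job inserts for a restored repo
perms = ["repo:settings", "repo:push", "repo:owner", "repo:invite", "repo:delete"]
rows = [("p", DID, "thisserver", REPO, p, "", "") for p in perms]
rows.append(("p", "server:owner", "thisserver", REPO, "repo:delete", "", ""))
db.executemany("INSERT INTO acl VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# A push succeeds only if a repo:push row exists for this DID + repo;
# a repo created after the backup has no such row, hence "access denied"
allowed = db.execute(
    "SELECT count(*) FROM acl WHERE v0 = ? AND v2 = ? AND v3 = 'repo:push'",
    (DID, REPO),
).fetchone()[0] > 0
print(allowed)  # True
```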
Copy the DB out, inspect, and patch:

```sh
# Copy DB out of the running pod (after scale-up)
kubectl cp knot/$(kubectl get pod -n knot -l app=knot -o jsonpath='{.items[0].metadata.name}'):/home/git/data/knotserver.db /tmp/knotserver.db -c knot

# Check existing ACLs
sqlite3 /tmp/knotserver.db "SELECT * FROM acl;"
```

To add ACL entries for a missing repo, scale down and run:

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: knot-acl-fix
  namespace: knot
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext: { fsGroup: 1000, runAsUser: 0 }
      containers:
        - name: fix
          image: keinos/sqlite3:3.47.2
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              DID="did:plc:wydyrngmxbcsqdvhmd7whmye"
              REPO="${DID}/REPO_NAME"
              sqlite3 /data/data/knotserver.db "
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:settings','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:push','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:owner','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:invite','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:delete','','');
                INSERT INTO acl VALUES ('p','server:owner','thisserver','${REPO}','repo:delete','','');
              "
              chown 1000:1000 /data/data/knotserver.db
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: knot-data
```

### 5. Scale up and verify

```sh
kubectl scale deployment -n knot knot --replicas=1
kubectl wait --for=condition=ready pod -n knot -l app=knot --timeout=120s

# Verify repos resolve
curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"

# Test push
git push --dry-run origin main
```

## Post-Restore Cleanup

Delete completed restore jobs:

```sh
kubectl delete job -n pds pds-restore pds-seq-fix 2>/dev/null
kubectl delete job -n knot knot-restore knot-acl-fix 2>/dev/null
```

Remove stale SSH host keys (knot regenerates host keys on every pod restart):

```sh
ssh-keygen -R knot.sans-self.org
```

## Known Gotchas

- **PDS blobs are not in backups.** They live natively on S3 via the PDS process. If the S3 bucket itself is lost, the blobs are gone. The backup only covers SQLite databases.
- **Choose the right DB snapshot.** Check all available timestamps in S3. The most recent snapshot before the data loss is usually best, but if accounts were created between backups, a later snapshot might have more complete account records.
- **Sequencer cursor mismatch kills federation.** Posts succeed locally but don't reach Bluesky. Always bump the sequencer autoincrement past the relay's cursor after a restore.
- **Knot ACLs are per-repo.** The server owner can push to repos that have ACL entries. Repos created after the backup will have git data on disk but no ACL; you must add entries manually.
- **Knot post-receive hooks may lose execute permissions.** After restoring from S3, hooks may not be executable due to FUSE `default_permissions`. Run chmod as the `git` user, not root.
- **SSH host keys change on pod restart.** Every knot scale-down/up regenerates sshd host keys. Run `ssh-keygen -R` to clear stale entries.