+50
PDSharp.Docs/car.md
···
1
+
# CAR Format Implementation Notes
2
+
3
+
The **Content Addressable aRchives (CAR)** format is used to store content-addressable objects (IPLD blocks) as a sequence of bytes.
4
+
It is the standard format for repository export (`sync.getRepo`) and block transfer (`sync.getBlocks`) in the AT Protocol.
5
+
6
+
## 1. Format Overview (v1)
7
+
8
+
A CAR file consists of a **Header** followed by a sequence of **Data** sections.
9
+
10
+
```text
11
+
|--------- Header --------| |--------------- Data Section 1 ---------------| |--------------- Data Section 2 ---------------| ...
12
+
[ varint | DAG-CBOR block ] [ varint | CID bytes | Block Data bytes ] [ varint | CID bytes | Block Data bytes ] ...
13
+
```
14
+
15
+
### LEB128 Varints
16
+
17
+
All length prefixes in CAR are encoded as **unsigned LEB128 (UVarint)** integers.
18
+
19
+
- Used to prefix the Header block.
20
+
- Used to prefix each Data section.
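
The length prefixes described above can be sketched in a few lines. This is a minimal illustration in Python rather than the project's F#, and the function names `encode_uvarint` and `decode_uvarint` are hypothetical, not taken from PDSharp.

```python
def encode_uvarint(n: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if n < 0:
        raise ValueError("uvarint cannot encode negative values")
    out = bytearray()
    while True:
        byte = n & 0x7F              # take the low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_uvarint(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Return (value, next_offset) for the uvarint starting at offset."""
    value = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated uvarint")
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7
```

Values below 128 fit in one byte, so `encode_uvarint(127)` is `b"\x7f"` while `encode_uvarint(128)` is `b"\x80\x01"`.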
21
+
22
+
## 2. Header
23
+
24
+
The header is a single DAG-CBOR encoded block describing the archive.
25
+
26
+
**Encoding:**
27
+
28
+
1. Construct the CBOR map: `{ "version": 1, "roots": [<cid>, ...] }`.
29
+
2. Encode as DAG-CBOR bytes.
30
+
3. Prefix with the length of those bytes (as UVarint).
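
The three encoding steps can be sketched as follows. This is an illustrative Python sketch, not PDSharp's F# code: `cbor_head`, `uvarint`, and `encode_car_header` are hypothetical helpers, the CBOR is hand-rolled only for this one map shape, and it assumes each root is passed as raw CID bytes. The keys are written in DAG-CBOR order (`roots` before `version`, because shorter keys sort first), and each root is written as a Tag 42 link with the `0x00` prefix.

```python
def cbor_head(major: int, n: int) -> bytes:
    """CBOR initial byte(s): major type plus the shortest encoding of n."""
    if n < 24:
        return bytes([(major << 5) | n])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([(major << 5) | info]) + n.to_bytes(size, "big")
    raise ValueError("value too large for a CBOR head")


def uvarint(n: int) -> bytes:
    """Unsigned LEB128 encoding of n."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_car_header(roots: list[bytes]) -> bytes:
    """Build { "roots": [...], "version": 1 } and prefix it with its length."""
    def text(s: str) -> bytes:
        raw = s.encode("utf-8")
        return cbor_head(3, len(raw)) + raw            # major type 3: text string

    def link(cid: bytes) -> bytes:
        payload = b"\x00" + cid                        # identity multibase prefix
        return cbor_head(6, 42) + cbor_head(2, len(payload)) + payload

    body = cbor_head(5, 2)                             # map with two entries
    body += text("roots") + cbor_head(4, len(roots))   # array of links
    body += b"".join(link(cid) for cid in roots)
    body += text("version") + cbor_head(0, 1)          # unsigned integer 1
    return uvarint(len(body)) + body
```

With a single 36-byte CIDv1 root (DAG-CBOR, SHA2-256), the CBOR body is 58 bytes long, so the header starts with the varint byte `0x3a`.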
31
+
32
+
## 3. Data Sections
33
+
34
+
Following the header, the file contains a concatenated sequence of data sections. Each section represents one IPLD block.
35
+
36
+
```text
37
+
[ Section Length (UVarint) ] [ CID (raw bytes) ] [ Binary Data ]
38
+
```
39
+
40
+
- **Section Length**: The total length of the *CID bytes* + *Binary Data*.
41
+
- **CID**: The raw binary representation of the block's CID (usually CIDv1 + DAG-CBOR + SHA2-256).
42
+
- **Binary Data**: The actual content of the block.
43
+
44
+
The Section Length *includes* the length of the CID.
45
+
46
+
This is slightly different from some other framing formats where length might only cover the payload.
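
Because the length covers both the CID and the data, a reader has to parse the CID to find where the block data begins. The sketch below illustrates this in Python under stated assumptions: it only handles CIDv1, and the names `write_section` and `read_section` are hypothetical rather than PDSharp's API.

```python
def write_uvarint(n: int) -> bytes:
    """Unsigned LEB128 encoding of n."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_position) for the uvarint at pos."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated uvarint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def write_section(cid: bytes, data: bytes) -> bytes:
    """Frame one block: the length prefix counts the CID and the data."""
    return write_uvarint(len(cid) + len(data)) + cid + data


def read_section(buf: bytes, offset: int = 0) -> tuple[bytes, bytes, int]:
    """Return (cid, data, next_offset). Assumes a CIDv1."""
    length, start = read_uvarint(buf, offset)
    end = start + length
    if end > len(buf):
        raise ValueError("truncated section")
    # CIDv1 layout: version, codec, hash code, digest length (uvarints), digest.
    version, pos = read_uvarint(buf, start)
    if version != 1:
        raise ValueError("only CIDv1 is handled in this sketch")
    _codec, pos = read_uvarint(buf, pos)
    _hash_code, pos = read_uvarint(buf, pos)
    digest_len, pos = read_uvarint(buf, pos)
    cid_end = pos + digest_len
    if cid_end > end:
        raise ValueError("CID runs past the end of the section")
    return buf[start:cid_end], buf[cid_end:end], end
```

For a 36-byte CID and 5 bytes of data the prefix is 41, and the whole section occupies 42 bytes.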
47
+
48
+
## 4. References
49
+
50
+
- [IPLD CARv1 Specification](https://ipld.io/specs/transport/car/carv1/)
+46
PDSharp.Docs/cbor.md
···
1
+
# DAG-CBOR Implementation Notes
2
+
3
+
DAG-CBOR is the canonical data serialization format for the AT Protocol.
4
+
It is a strict subset of CBOR (RFC 8949) with specific rules for determinism and linking.
5
+
6
+
## 1. Canonicalization Rules
7
+
8
+
To ensure consistent Content IDs (CIDs) for the same data, specific canonicalization rules must be followed during encoding.
9
+
10
+
### Map Key Sorting
11
+
12
+
Maps must be sorted by keys. The sorting order is **NOT** standard lexicographical order.
13
+
14
+
1. **Length**: Shorter keys come first.
15
+
2. **Bytes**: Keys of the same length are sorted lexicographically by their UTF-8 byte representation.
16
+
17
+
**Example:**
18
+
19
+
- `"a"` (len 1) comes before `"aa"` (len 2).
20
+
- `"b"` (len 1) comes before `"aa"` (len 2).
21
+
- `"a"` comes before `"b"`.
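
The two rules amount to sorting on a `(length, bytes)` pair. A minimal Python sketch, where `dag_cbor_key_order` is a hypothetical helper name and the length is assumed to be measured in UTF-8 bytes:

```python
def dag_cbor_key_order(key: str) -> tuple[int, bytes]:
    """Sort key for map keys: byte length first, then bytewise comparison."""
    raw = key.encode("utf-8")
    return (len(raw), raw)


def sort_map_keys(keys: list[str]) -> list[str]:
    """Return the keys in the order they must appear in an encoded map."""
    return sorted(keys, key=dag_cbor_key_order)
```

For example, `sort_map_keys(["aa", "b", "a"])` gives `["a", "b", "aa"]`, whereas a plain string sort would put `"aa"` before `"b"`.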
22
+
23
+
### Integer Encoding
24
+
25
+
Integers must be encoded using the smallest possible representation.
26
+
27
+
`System.Formats.Cbor` (in Strict mode) generally handles this, but care must be taken to treat `int`, `int64`, and `uint64` consistently.
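
To make "smallest possible representation" concrete, here is a hand-rolled sketch of CBOR integer encoding in Python. It is for illustration only and is not how `System.Formats.Cbor` is implemented; `encode_int` is a hypothetical name.

```python
def encode_int(n: int) -> bytes:
    """Encode an integer with the shortest CBOR head (major type 0 or 1)."""
    if not -(2**64) <= n < 2**64:
        raise ValueError("outside the range CBOR integers cover")
    # Negative integers use major type 1 and store -1 - n.
    major, value = (0, n) if n >= 0 else (1, -1 - n)
    prefix = major << 5
    if value < 24:
        return bytes([prefix | value])        # value fits in the initial byte
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if value < 1 << (8 * size):           # pick the first width that fits
            return bytes([prefix | info]) + value.to_bytes(size, "big")
    raise AssertionError("unreachable")
```

The same number therefore always produces the same bytes regardless of the source type: `encode_int(24)` is `b"\x18\x18"` whether the value started out as an `int` or an `int64`.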
28
+
29
+
## 2. Content Addressing (CIDs)
30
+
31
+
Links to other nodes (CIDs) are encoded using **CBOR Tag 42**.
32
+
33
+
### Format
34
+
35
+
1. **Tag**: `42` (Major type 6, value 42).
36
+
2. **Payload**: A byte string containing:
37
+
- The `0x00` byte (Multibase identity prefix, required by IPLD specs for binary CID inclusion).
38
+
- The raw bytes of the CID.
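
A link can be written and read back as sketched below. This Python sketch assumes the CID is short enough for a one-byte length (under 255 bytes, which covers the usual 36-byte CIDv1), and `encode_cid_link` and `decode_cid_link` are hypothetical names.

```python
TAG_42 = b"\xd8\x2a"  # major type 6, one-byte argument 42


def encode_cid_link(cid: bytes) -> bytes:
    """Wrap raw CID bytes as a Tag 42 byte string with the 0x00 prefix."""
    payload = b"\x00" + cid
    if len(payload) < 24:
        head = bytes([0x40 | len(payload)])   # byte string, length in head
    elif len(payload) < 256:
        head = bytes([0x58, len(payload)])    # byte string, one-byte length
    else:
        raise ValueError("CID too long for this sketch")
    return TAG_42 + head + payload


def decode_cid_link(buf: bytes) -> bytes:
    """Return the raw CID bytes from an encoded Tag 42 link."""
    if buf[:2] != TAG_42:
        raise ValueError("not a Tag 42 link")
    first = buf[2]
    if 0x40 <= first < 0x58:
        length, start = first & 0x1F, 3
    elif first == 0x58:
        length, start = buf[3], 4
    else:
        raise ValueError("unsupported byte string head")
    payload = buf[start:start + length]
    if len(payload) != length or payload[:1] != b"\x00":
        raise ValueError("missing identity multibase prefix")
    return payload[1:]
```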
39
+
40
+
## 3. Known Gotchas
41
+
42
+
- **Float vs Int**:
43
+
The AT Protocol data model does not allow floats, so numeric values must be encoded as integers.
44
+
F# types must be matched carefully to avoid encoding `2.0` instead of `2`.
45
+
- **String Encoding**:
46
+
Must be UTF-8. Indefinite length strings are prohibited in DAG-CBOR.
+69
PDSharp.Docs/mst.md
···
1
+
# Merkle Search Tree (MST) Implementation Notes
2
+
3
+
The Merkle Search Tree (MST) is a probabilistic, balanced search tree used by the AT Protocol to store repository records.
4
+
5
+
## Overview
6
+
7
+
MSTs combine properties of B-trees and Merkle trees to ensure:
8
+
9
+
1. **Determinism**: The tree structure is determined by the keys (and their hashes), not insertion order.
10
+
2. **Verifiability**: Every node is content-addressed (CID), allowing the entire state to be verified via a single root hash.
11
+
3. **Efficiency**: Efficient key-value lookups and delta-based sync (subtrees that haven't changed share the same CIDs).
12
+
13
+
## Core Concepts
14
+
15
+
### Layering (Probabilistic Balance)
16
+
17
+
MSTs do not use rotation for balancing. Instead, they assign every key a "layer" based on its hash.
18
+
19
+
- **Formula**:
20
+
`Layer(key) = floor(countLeadingZeroBits(SHA256(key)) / 2)`.
21
+
- **Fanout**:
22
+
The divisor `2` implies a fanout of roughly 4 (2 bits per layer increment).
23
+
- Keys with higher layers appear higher in the tree, splitting the range of keys below them.
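
The layer formula can be sketched in Python as follows. The helper names `leading_zero_bits` and `mst_layer` are hypothetical, and the sketch assumes the key is hashed as its UTF-8 bytes.

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count zero bits before the first set bit in the digest."""
    count = 0
    for byte in digest:
        if byte == 0:
            count += 8                       # whole byte is zero
            continue
        count += 8 - byte.bit_length()       # zeros above the highest set bit
        break
    return count


def mst_layer(key: str) -> int:
    """Layer of a key: leading zero bits of its SHA-256 digest, halved."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return leading_zero_bits(digest) // 2
```

Each additional layer requires two more leading zero bits, so roughly one key in four reaches layer 1 or above, one in sixteen reaches layer 2 or above, and so on.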
24
+
25
+
### Data Structure (`MstNode`)
26
+
27
+
An MST node consists of:
28
+
29
+
- **Left Child (`l`)**: Used to traverse to keys lexicographically smaller than the first entry in this node.
30
+
- **Entries (`e`)**: A sorted list of entries. Each entry contains:
31
+
- **Prefix Length (`p`)**: Length of the shared prefix with the *previous* key in the node (or the split key).
32
+
- **Key Suffix (`k`)**: The remaining bytes of the key.
33
+
- **Value (`v`)**: The CID of the record data.
34
+
- **Tree (`t`)**: (Optional) CID of the subtree containing keys between this entry and the next.
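
The relationship between `p` and `k` can be illustrated with a small Python sketch. It models only those two fields (the `v` and `t` links are left out), and `compress_entries` and `expand_entries` are hypothetical names.

```python
def compress_entries(sorted_keys: list[bytes]) -> list[tuple[int, bytes]]:
    """Turn sorted full keys into (p, k) pairs relative to the previous key."""
    out = []
    prev = b""
    for key in sorted_keys:
        p = 0
        limit = min(len(prev), len(key))
        while p < limit and prev[p] == key[p]:
            p += 1                      # extend the shared prefix
        out.append((p, key[p:]))        # k holds only the bytes after the prefix
        prev = key
    return out


def expand_entries(entries: list[tuple[int, bytes]]) -> list[bytes]:
    """Rebuild full keys from (p, k) pairs."""
    keys = []
    prev = b""
    for p, suffix in entries:
        key = prev[:p] + suffix
        keys.append(key)
        prev = key
    return keys
```

The first entry in a node always has `p = 0` in this model because there is no previous key to share a prefix with.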
35
+
36
+
**Serialization**: The node is serialized as a DAG-CBOR map: `{ "l": <cid or null>, "e": [e1, e2, ...] }`.
37
+
38
+
## Algorithms
39
+
40
+
### Insertion (`Put`)
41
+
42
+
Insertion relies on the "Layer" property:
43
+
44
+
1. Calculate `Layer(newKey)`.
45
+
2. Traverse the tree from the root.
46
+
3. **Split Condition**: If `Layer(newKey)` is **greater** than the layer of the current node, the new key belongs *above* this node.
47
+
- The current node is **split** into two children (Left and Right) based on the `newKey`.
48
+
- The `newKey` becomes a new node pointing to these two children.
49
+
4. **Recurse**: If `Layer(newKey)` is **less** than the current node's layer, find the correct child subtree (based on key comparison) and recurse.
50
+
5. **Same Layer**: If `Layer(newKey)` equals the current node's layer:
51
+
- Insert the key at its sorted position in the entries list.
52
+
- If an existing child subtree covers the position where the key is inserted, split that subtree around `newKey` so that smaller keys stay to its left and larger keys move to its right.
53
+
54
+
### Deletion
55
+
56
+
1. Locate the key.
57
+
2. Remove the entry.
58
+
3. **Merge**:
59
+
Since the key acted as a separator between two subtrees (the subtree to its left, i.e. the previous entry's `t` or the node's `l`, and its own `t`), removing it requires merging these two adjacent subtrees into a single valid subtree so that the tree keeps the shape its remaining keys dictate.
60
+
61
+
### Determinism & Prefix Compression
62
+
63
+
- **Canonical Order**: Keys must always be sorted.
64
+
- **Prefix Compression**:
65
+
Crucial for space saving.
66
+
The prefix length `p` is calculated relative to the *immediately preceding key* in the node.
67
+
- **Issues**:
68
+
Insertion order *should not* matter (commutativity).
69
+
However, implementations must be careful with `Split` and `Merge` operations to ensure exactly the same node boundaries are created regardless of history.
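
One way to test the history-independence requirement is to compare the result of incremental `Put`/`Delete` operations against a tree built directly from the final key set. The Python sketch below builds such a reference shape. It is a simplified model and not wire-compatible: nodes are plain lists alternating subtree and key, empty intermediate layers are skipped, and `build` and `layer_of` are hypothetical names (`layer_of` stands in for the hash-based layer function).

```python
from typing import Callable, Optional


def build(keys: list[str], layer_of: Callable[[str], int]) -> Optional[list]:
    """Build the tree shape for a key set: [subtree, key, subtree, key, ..., subtree]."""
    ordered = sorted(keys)
    if not ordered:
        return None
    top = max(layer_of(k) for k in ordered)   # highest layer owns this node
    node: list = []
    pending: list[str] = []                   # lower-layer keys since the last separator
    for key in ordered:
        if layer_of(key) == top:
            node.append(build(pending, layer_of))  # subtree left of this key
            node.append(key)
            pending = []
        else:
            pending.append(key)
    node.append(build(pending, layer_of))     # subtree right of the last key
    return node
```

Because the function sorts its input and consults only the keys' layers, any ordering of the same keys yields the same nested structure. An incremental implementation can be checked against it after every operation.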
+72
-35
README.md
···
1
+
<!-- markdownlint-disable MD033 -->
1
2
# PDSharp
2
3
3
-
> A Personal Data Server (PDS) for the AT Protocol, written in F# with Giraffe.
4
+
A Personal Data Server (PDS) for the AT Protocol, written in F# with Giraffe.
4
5
5
6
## Goal
6
7
···
8
9
9
10
## Requirements
10
11
11
-
- .NET 9.0 SDK
12
-
- [Just](https://github.com/casey/just) (optional, for potential future task running)
12
+
.NET 9.0 SDK
13
13
14
14
## Getting Started
15
15
···
34
34
35
35
The server will start at `http://localhost:5000`.
36
36
37
+
## Configuration
38
+
39
+
The application uses `appsettings.json` and supports Environment Variable overrides.
40
+
41
+
| Key | Env Var | Default | Description |
42
+
| ----------- | ------------------- | ----------------------- | ------------------------- |
43
+
| `DidHost` | `PDSHARP_DidHost` | `did:web:localhost` | The DID of the PDS itself |
44
+
| `PublicUrl` | `PDSHARP_PublicUrl` | `http://localhost:5000` | Publicly reachable URL |
45
+
46
+
Example `appsettings.json`:
47
+
48
+
```json
49
+
{
50
+
"PublicUrl": "http://localhost:5000",
51
+
"DidHost": "did:web:localhost"
52
+
}
53
+
```
54
+
37
55
## API Testing
38
56
39
-
### Server Info
57
+
<details>
58
+
<summary>Server Info</summary>
40
59
41
60
```bash
42
61
curl http://localhost:5000/xrpc/com.atproto.server.describeServer
43
62
```
44
63
64
+
</details>
65
+
45
66
### Record Operations
46
67
47
-
**Create a record:**
68
+
<details>
69
+
<summary>Create a record</summary>
48
70
49
71
```bash
50
72
curl -X POST http://localhost:5000/xrpc/com.atproto.repo.createRecord \
···
52
74
-d '{"repo":"did:web:test","collection":"app.bsky.feed.post","record":{"text":"Hello, ATProto!"}}'
53
75
```
54
76
55
-
**Get a record** (use the rkey from createRecord response):
77
+
</details>
78
+
79
+
<details>
80
+
<summary>Get a record</summary>
56
81
57
82
```bash
58
83
curl "http://localhost:5000/xrpc/com.atproto.repo.getRecord?repo=did:web:test&collection=app.bsky.feed.post&rkey=<RKEY>"
59
84
```
60
85
61
-
**Put a record** (upsert with explicit rkey):
86
+
</details>
87
+
88
+
<details>
89
+
<summary>Put a record</summary>
62
90
63
91
```bash
64
92
curl -X POST http://localhost:5000/xrpc/com.atproto.repo.putRecord \
···
66
94
-d '{"repo":"did:web:test","collection":"app.bsky.feed.post","rkey":"my-post","record":{"text":"Updated!"}}'
67
95
```
68
96
97
+
</details>
98
+
69
99
### Sync & CAR Export
70
100
71
-
**Get entire repository as CAR:**
101
+
<details>
102
+
<summary>Get entire repository as CAR</summary>
72
103
73
104
```bash
74
105
curl "http://localhost:5000/xrpc/com.atproto.sync.getRepo?did=did:web:test" -o repo.car
75
106
```
76
107
77
-
**Get specific blocks** (comma-separated CIDs):
108
+
</details>
109
+
110
+
<details>
111
+
<summary>Get specific blocks</summary>
78
112
79
113
```bash
80
114
curl "http://localhost:5000/xrpc/com.atproto.sync.getBlocks?did=did:web:test&cids=<CID1>,<CID2>" -o blocks.car
81
115
```
82
116
83
-
**Get a blob by CID:**
117
+
</details>
118
+
119
+
<details>
120
+
<summary>Get a blob by CID</summary>
84
121
85
122
```bash
86
123
curl "http://localhost:5000/xrpc/com.atproto.sync.getBlob?did=did:web:test&cid=<BLOB_CID>"
87
124
```
88
125
126
+
</details>
127
+
89
128
### Firehose (WebSocket)
90
129
91
130
Subscribe to real-time commit events using [websocat](https://github.com/vi/websocat):
92
131
93
-
```bash
94
-
# Install websocat (macOS)
95
-
brew install websocat
132
+
<details>
133
+
<summary>Open a WebSocket connection</summary>
96
134
97
-
# Connect to firehose
135
+
```bash
98
136
websocat ws://localhost:5000/xrpc/com.atproto.sync.subscribeRepos
99
137
```
100
138
139
+
</details>
140
+
141
+
<br />
101
142
Then create/update records in another terminal to see CBOR-encoded commit events stream in real-time.
102
143
103
-
**With cursor for resumption:**
144
+
<br />
145
+
146
+
<details>
147
+
<summary>Open a WebSocket connection with cursor for resumption</summary>
104
148
105
149
```bash
106
150
websocat "ws://localhost:5000/xrpc/com.atproto.sync.subscribeRepos?cursor=5"
107
151
```
108
152
109
-
## Configuration
110
-
111
-
The application uses `appsettings.json` and supports Environment Variable overrides.
112
-
113
-
| Key | Env Var | Default | Description |
114
-
| ----------- | ------------------- | ----------------------- | ------------------------- |
115
-
| `DidHost` | `PDSHARP_DidHost` | `did:web:localhost` | The DID of the PDS itself |
116
-
| `PublicUrl` | `PDSHARP_PublicUrl` | `http://localhost:5000` | Publicly reachable URL |
117
-
118
-
Example `appsettings.json`:
119
-
120
-
```json
121
-
{
122
-
"PublicUrl": "http://localhost:5000",
123
-
"DidHost": "did:web:localhost"
124
-
}
125
-
```
153
+
</details>
126
154
127
155
## Architecture
128
156
129
-
### App (Giraffe)
157
+
<details>
158
+
<summary>App (Giraffe)</summary>
130
159
131
160
- `XrpcRouter`: `/xrpc/<NSID>` routing
132
161
- `Auth`: Session management (JWTs)
133
162
- `RepoApi`: Write/Read records (`putRecord`, `getRecord`)
134
163
- `ServerApi`: Server meta (`describeServer`)
135
164
136
-
### Core (Pure F#)
165
+
</details>
166
+
167
+
<details>
168
+
<summary>Core (Pure F#)</summary>
137
169
138
170
- `DidResolver`: Identity resolution
139
171
- `RepoEngine`: MST, DAG-CBOR, CIDs, Blocks
140
172
- `Models`: Data types for XRPC/Database
141
173
142
-
### Infra
174
+
</details>
175
+
176
+
<details>
177
+
<summary>Infra</summary>
143
178
144
179
- SQLite/Postgres for persistence
145
180
- S3/Disk for blob storage
181
+
182
+
</details>
+27
-35
roadmap.txt
···
59
59
- [x] Conformance testing: diff CIDs/CARs/signatures vs reference PDS
60
60
DoD: Same inputs → same outputs for repo/sync surfaces
61
61
--------------------------------------------------------------------------------
62
-
Milestone J: Persistence + Backups
62
+
Milestone J: Storage Backend Configuration
63
63
--------------------------------------------------------------------------------
64
-
Deliverables:
65
-
- BackupOps module in Core (scheduler unit / cron / scripts, plus Litestream config)
66
-
Backups (SQLite)
67
-
[ ] Set PDS_SQLITE_DISABLE_WAL_AUTO_CHECKPOINT=true (Litestream-friendly)
68
-
[ ] Run a scheduled backup/replication job that:
69
-
- finds recently updated DBs
70
-
- backs up /pds/actors/* and PDS-wide DBs
71
-
- runs on SIGTERM during deploys (avoid missing last writes)
72
-
Backups (Blobs)
73
-
[ ] Configurable Options (app settings):
74
-
(A) Disk blobs: include /pds/blocks in backups
75
-
(B) S3-compatible blobstore: rely on object-store durability
76
-
Guardrails
77
-
[ ] Uptime check: https://<pds>/xrpc/_health
78
-
[ ] Alert if “latest backup” is older than N minutes.
79
-
[ ] Alert on disk pressure for /pds.
64
+
- [ ] Configure SQLite WAL mode (PDS_SQLITE_DISABLE_WAL_AUTO_CHECKPOINT=true)
65
+
- [ ] Implement S3-compatible blobstore adapter (optional via config)
66
+
- [ ] Configure disk-based vs S3-based blob storage selection
67
+
DoD: PDS runs with S3 blobs (if configured) and SQLite passes Litestream checks
68
+
--------------------------------------------------------------------------------
69
+
Milestone K: Backup Automation + Guardrails
70
+
--------------------------------------------------------------------------------
71
+
- [ ] Implement BackupOps module (scheduler/cron logic)
72
+
- [ ] Automated backup jobs:
73
+
- [ ] Databases (Litestream or raw copy) + /pds/actors backup
74
+
- [ ] Local disk blobs (if applicable)
75
+
- [ ] Guardrails & Monitoring:
76
+
- [ ] Uptime check endpoint: /xrpc/_health with JSON status
77
+
- [ ] Alerts: "Latest backup" too old, Disk pressure > 90%
78
+
- [ ] Log retention policies
80
79
DoD:
81
-
- You can restore onto a fresh host and pass the P3 verification checklist.
82
-
- Backups run automatically and are observable (“last successful backup”).
83
-
- Backup set is explicitly documented (DBs + blobs decision).
80
+
- Backups run automatically and report status
81
+
- Health checks indicate system state
82
+
- Restore drill: Restore backups onto a fresh host passes verification
83
+
- Backup set is explicitly documented
84
84
================================================================================
85
85
PHASE 2: DEPLOYMENT (Self-Host)
86
86
================================================================================
87
-
Milestone J: Topology + Domain Planning
87
+
Milestone L: Topology + Domain Planning
88
88
--------------------------------------------------------------------------------
89
89
- Choose PDS hostname (pds.example.com) vs handle domain (example.com)
90
90
- Obtain domain, DNS access, VPS with static IP, reverse proxy
91
91
DoD: Clear plan for PDS location, handle, and DID resolution
92
92
--------------------------------------------------------------------------------
93
-
Milestone K: DNS + TLS + Reverse Proxy
93
+
Milestone M: DNS + TLS + Reverse Proxy
94
94
--------------------------------------------------------------------------------
95
95
- DNS A/AAAA records for PDS hostname
96
96
- TLS certs (ACME) via Caddy
97
97
DoD: https://<pds-hostname> responds with valid cert
98
98
--------------------------------------------------------------------------------
99
-
Milestone L: Deploy PDSharp
99
+
Milestone N: Deploy PDSharp
100
100
--------------------------------------------------------------------------------
101
101
- Deploy built PDS with persistence (SQLite/Postgres + blob storage)
102
102
- Verify /xrpc/com.atproto.server.describeServer
103
103
DoD: describeServer returns capabilities payload
104
104
--------------------------------------------------------------------------------
105
-
Milestone M: Account Creation
105
+
Milestone O: Account Creation
106
106
--------------------------------------------------------------------------------
107
107
- Create account using admin tooling
108
108
- Verify authentication: createSession
109
109
DoD: Obtain session and perform authenticated write
110
110
--------------------------------------------------------------------------------
111
-
Milestone N: Smoke Test Repo + Blobs
111
+
Milestone P: Smoke Test Repo + Blobs
112
112
--------------------------------------------------------------------------------
113
113
- Write record via putRecord
114
114
- Upload blob, verify retrieval via sync.getBlob
115
115
DoD: Posts appear in clients, media loads reliably
116
116
--------------------------------------------------------------------------------
117
-
Milestone O: Account Migration
117
+
Milestone Q: Account Migration
118
118
--------------------------------------------------------------------------------
119
119
- Export/import from bsky.social
120
120
- Update DID service endpoint
121
121
- Verify handle/DID resolution
122
122
DoD: Handle unchanged, DID points to your PDS
123
123
--------------------------------------------------------------------------------
124
-
Milestone P: Reliability
125
-
--------------------------------------------------------------------------------
126
-
- Backups: repo storage + database + blobs
127
-
- Restore drill on fresh instance
128
-
- Monitoring: uptime checks for describeServer + getBlob
129
-
DoD: Restore from backup passes smoke tests
130
-
--------------------------------------------------------------------------------
131
-
Milestone Q: Updates + Security
124
+
Milestone R: Updates + Security
132
125
--------------------------------------------------------------------------------
133
126
- Update cadence with rollback plan
134
127
- Rate limits and access controls at proxy
135
-
- Log retention and disk growth alerts
136
128
DoD: Update smoothly, maintain stable federation
137
129
================================================================================
138
130
QUICK CHECKLIST