+422
PRD.md
# Product Requirements Document (PRD)

This document outlines the requirements for the Skywatch Capture application. It serves as a reference for developers, designers, and stakeholders to ensure that the product meets the needs of its users.

`labels.uri` is the URI against which the label is applied. It takes one of two forms: a reference to a post, given as an at-uri (`at://did:plc:7i7s4avtaolnrgc3ubcoqrq3/app.bsky.feed.post/3lf5u32pxwk2f`), or a reference to a user, given as a DID (`did:plc:piwuaowuiykzaare644i5fre`).

`labels.val` is the label value being emitted.

`labels.neg` is a boolean indicating whether this label is a negation label, overwriting a previous label.
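
The two `labels.uri` forms described above can be told apart with a small classifier. The following is a minimal sketch, not the application's actual code: `classifyLabelSubject` and the `LabelSubject` type are hypothetical names, and anything that is neither a post at-uri nor a bare DID is reported as `other` rather than guessed at.

```typescript
// Hypothetical helper: the names and the return shape are illustrative only.
type LabelSubject =
  | { kind: "post"; did: string; rkey: string }
  | { kind: "profile"; did: string }
  | { kind: "other" };

function classifyLabelSubject(uri: string): LabelSubject {
  // Post form: at://<did>/app.bsky.feed.post/<record key>
  const post = /^at:\/\/(did:[a-z]+:[^/]+)\/app\.bsky\.feed\.post\/([^/]+)$/.exec(uri);
  if (post) {
    const [, did = "", rkey = ""] = post;
    return { kind: "post", did, rkey };
  }
  // Account form: a bare DID such as did:plc:...
  if (/^did:[a-z]+:[^/\s]+$/.test(uri)) {
    return { kind: "profile", did: uri };
  }
  // Any other subject (for example a different record collection).
  return { kind: "other" };
}
```

A caller could switch on `kind` to decide whether to run post hydration, profile hydration, or neither.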

## Core Use Case

The primary purpose of this application is to subscribe to a Bluesky labeler's firehose, capture all emitted label events, hydrate the associated data (posts and user profiles), and store this comprehensive dataset in a local database. This data is intended for future use in training machine learning classifiers for content moderation.

## Functional Requirements

- **Firehose Subscription:** Connect to and process a DAG-CBOR encoded firehose from a specified Bluesky labeler service.
- **Data Hydration:** For each label received, fetch the full context of the labeled content.
  - **Post Hydration:** If the label URI is an `at-uri` (post), fetch the full `app.bsky.feed.post` record and store the following fields: `did`, `text`, `facets`, `embeds`, `langs`, `tags`, `createdAt`, and reply status.
  - **Profile Hydration:** If the label URI is a `did` (user), fetch the full `app.bsky.actor.profile` record and store the `displayName` and `description`. Additionally, resolve and store the user's `handle`.
- **Image & Blob Handling:**
  - An option (`HYDRATE_BLOBS`) must be provided to control whether to download image/video blobs. This is a safety feature for users labeling sensitive content.
  - In all cases, both a **SHA-256 (cryptographic) hash** and a **perceptual hash (pHash)** of any referenced image blobs must be captured to ensure compatibility with various moderation toolkits.
  - If `HYDRATE_BLOBS` is true, the application must support storing the downloaded blobs either on the local filesystem or in an AWS S3 bucket, configurable via environment variables.
- **Data Storage:**
  - All captured and hydrated data should be stored in a DuckDB database file.
  - The database schema should be structured to link labels to their hydrated content.
- **Filtering:** The user must be able to optionally provide a comma-separated list of labels to capture (`CAPTURE_LABELS`). If provided, any label not in this list will be ignored.
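
The `CAPTURE_LABELS` filtering rule can be sketched as two small functions. This is an illustration under stated assumptions: `parseCaptureLabels` and `shouldCapture` are hypothetical names, whitespace around entries is trimmed, and an empty or unset value is taken to mean "capture everything".

```typescript
// Hypothetical helpers illustrating the CAPTURE_LABELS rule.
// Returns null when no filter is configured (capture every label).
function parseCaptureLabels(raw: string | undefined): Set<string> | null {
  const values = (raw ?? "")
    .split(",")
    .map((value) => value.trim())
    .filter((value) => value.length > 0);
  return values.length > 0 ? new Set(values) : null;
}

// A label is captured when no filter is set or its value is in the list.
function shouldCapture(val: string, allowed: Set<string> | null): boolean {
  return allowed === null || allowed.has(val);
}
```

With `CAPTURE_LABELS="spam,hate-speech"`, a label whose `val` is `spam` would be kept and any other value would be ignored.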

## Technical Requirements

- **Language/Runtime:** Use TypeScript with Bun.
- **Containerization:** The application must be containerized using Docker. The DuckDB database file must be stored on a volume outside the container to ensure data persistence. A `docker-compose.yml` file should be provided to manage services.
- **Key Libraries:**
  - `@atcute/cbor` and `@atcute/car` for parsing the firehose.
  - `@atproto/api` for all Bluesky API interactions.
  - `pino` and `pino-pretty` for logging.
  - `dotenv` for environment variable management.
- **Portability:** The application should be designed to be portable and easily configurable for use by other moderation services or researchers.
- **Rate Limits:** Hydration requests must stay within Bluesky API rate limits, for example by throttling requests and backing off when the API responds with HTTP 429.
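
One way to respect rate limits during hydration is to retry rate-limited calls with exponential backoff. The sketch below is only one possible approach: `backoffDelayMs`, `withRetry`, the 500 ms base delay, the 30 s cap, and the five-attempt limit are all assumptions, and the caller supplies the `isRateLimited` check because the PRD does not specify how errors are surfaced.

```typescript
// Hypothetical retry helpers; all names and default values are assumptions.

// Delay doubles with each attempt (0-based) and is capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs task, retrying only when isRateLimited(err) says the failure was a
// rate limit, and giving up after maxAttempts tries in total.
async function withRetry<T>(
  task: () => Promise<T>,
  isRateLimited: (err: unknown) => boolean,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (!isRateLimited(err) || attempt + 1 >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The `sleep` parameter is injectable so that the retry loop can be exercised in tests without real delays.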

## Configuration

The application will be configured via a `.env` file with the following variables:

```env
# Bluesky Credentials
BSKY_HANDLE=your-bluesky-handle.bsky.social
BSKY_PASSWORD=your-app-password

# Bluesky PDS and Labeler URL
PDS=bsky.social
WSS_URL=wss://your-labeler-service.com/xrpc/com.atproto.label.subscribeLabels

# Blob & Image Handling
HYDRATE_BLOBS=false # Set to true to download images/videos
BLOB_STORAGE_TYPE=local # 'local' or 's3'
BLOB_STORAGE_PATH=./data/blobs # Path for local storage

# S3 Configuration (only required if BLOB_STORAGE_TYPE is 's3')
S3_BUCKET=your-s3-bucket-name
S3_REGION=us-east-1
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key

# Database
DB_PATH=./data/skywatch.duckdb

# Filtering (Optional)
# Comma-separated list of labels to capture, e.g., "spam,hate-speech"
CAPTURE_LABELS=

# Logging
LOG_LEVEL=info
```
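
A typed loader for these variables might look like the following sketch. It is illustrative only: `CaptureConfig`, `requireVar`, and `loadConfig` are hypothetical names, the fallback defaults simply mirror the sample values above, and the input is assumed to be an already-parsed environment (inline `#` comments stripped by whatever loads the `.env` file).

```typescript
// Hypothetical config loader; names, shape, and defaults are assumptions.
type Env = Record<string, string | undefined>;

interface CaptureConfig {
  handle: string;
  password: string;
  pds: string;
  wssUrl: string;
  hydrateBlobs: boolean;
  blobStorage:
    | { type: "local"; path: string }
    | { type: "s3"; bucket: string; region: string };
  dbPath: string;
  captureLabels: string[];
  logLevel: string;
}

function requireVar(env: Env, name: string): string {
  const value = env[name]?.trim();
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

function loadConfig(env: Env): CaptureConfig {
  const storageType = env["BLOB_STORAGE_TYPE"]?.trim() || "local";
  let blobStorage: CaptureConfig["blobStorage"];
  if (storageType === "s3") {
    // S3 settings are only required when S3 storage is selected.
    blobStorage = {
      type: "s3",
      bucket: requireVar(env, "S3_BUCKET"),
      region: requireVar(env, "S3_REGION"),
    };
  } else if (storageType === "local") {
    blobStorage = { type: "local", path: env["BLOB_STORAGE_PATH"]?.trim() || "./data/blobs" };
  } else {
    throw new Error(`Unsupported BLOB_STORAGE_TYPE: ${storageType}`);
  }
  return {
    handle: requireVar(env, "BSKY_HANDLE"),
    password: requireVar(env, "BSKY_PASSWORD"),
    pds: env["PDS"]?.trim() || "bsky.social",
    wssUrl: requireVar(env, "WSS_URL"),
    hydrateBlobs: env["HYDRATE_BLOBS"]?.trim().toLowerCase() === "true",
    blobStorage,
    dbPath: env["DB_PATH"]?.trim() || "./data/skywatch.duckdb",
    captureLabels: (env["CAPTURE_LABELS"] ?? "")
      .split(",")
      .map((label) => label.trim())
      .filter((label) => label.length > 0),
    logLevel: env["LOG_LEVEL"]?.trim() || "info",
  };
}
```

In a Bun or Node process this would typically be called as `loadConfig(process.env)` once at startup, so that a missing credential fails fast rather than midway through a capture run.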

## Data Schema

The database will contain the following tables:

#### `labels`
Stores the raw label event data.
- `id` (INTEGER, Primary Key, Auto-incrementing)
- `uri` (TEXT) - The `at-uri` or `did` of the labeled content.
- `cid` (TEXT) - The CID of the specific record version.
- `val` (TEXT) - The label value (e.g., "spam").
- `neg` (BOOLEAN) - If the label is a negation.
- `cts` (DATETIME) - Timestamp of label creation.
- `exp` (DATETIME, nullable) - Expiration timestamp of the label.
- `src` (TEXT) - The DID of the labeler.

#### `posts`
Stores hydrated data for labeled posts. Linked to `labels.uri`.
- `uri` (TEXT, Primary Key)
- `did` (TEXT) - Author of the post.
- `text` (TEXT)
- `facets` (JSON)
- `embeds` (JSON)
- `langs` (JSON)
- `tags` (JSON)
- `createdAt` (DATETIME)
- `is_reply` (BOOLEAN)

#### `profiles`
Stores hydrated data for labeled user accounts. Linked to `labels.uri`.
- `did` (TEXT, Primary Key)
- `handle` (TEXT)
- `displayName` (TEXT)
- `description` (TEXT)

#### `blobs`
Stores information about image blobs found in posts.
- `post_uri` (TEXT) - Foreign key to `posts.uri`.
- `blob_cid` (TEXT) - CID of the blob.
- `sha256` (TEXT) - Cryptographic hash for exact file matching.
- `phash` (TEXT) - Perceptual hash for finding visually similar images.
- `storage_path` (TEXT, nullable) - Local or S3 path if downloaded.
- `mimetype` (TEXT)
- PRIMARY KEY (`post_uri`, `blob_cid`)
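
The tables above could be expressed as DuckDB DDL along the following lines. This is a sketch under stated assumptions rather than the project's schema file: `SCHEMA_STATEMENTS` is a hypothetical name, the `NOT NULL` constraints are inferred from the required fields of the label lexicon, auto-incrementing is modeled with a sequence (DuckDB's usual substitute for an auto-increment column), and the `JSON` type assumes DuckDB's JSON extension is available.

```typescript
// Hypothetical DDL for the tables described above, kept as plain strings so
// they can be passed to whichever DuckDB client the application ends up using.
const SCHEMA_STATEMENTS: string[] = [
  `CREATE SEQUENCE IF NOT EXISTS labels_id_seq`,
  `CREATE TABLE IF NOT EXISTS labels (
     id  INTEGER PRIMARY KEY DEFAULT nextval('labels_id_seq'),
     uri TEXT NOT NULL,
     cid TEXT,
     val TEXT NOT NULL,
     neg BOOLEAN DEFAULT FALSE,
     cts DATETIME NOT NULL,
     exp DATETIME,
     src TEXT NOT NULL
   )`,
  `CREATE TABLE IF NOT EXISTS posts (
     uri       TEXT PRIMARY KEY,
     did       TEXT,
     "text"    TEXT,
     facets    JSON,
     embeds    JSON,
     langs     JSON,
     tags      JSON,
     createdAt DATETIME,
     is_reply  BOOLEAN
   )`,
  `CREATE TABLE IF NOT EXISTS profiles (
     did         TEXT PRIMARY KEY,
     handle      TEXT,
     displayName TEXT,
     description TEXT
   )`,
  `CREATE TABLE IF NOT EXISTS blobs (
     post_uri     TEXT,
     blob_cid     TEXT,
     sha256       TEXT,
     phash        TEXT,
     storage_path TEXT,
     mimetype     TEXT,
     PRIMARY KEY (post_uri, blob_cid),
     FOREIGN KEY (post_uri) REFERENCES posts (uri)
   )`,
];
```

The statements are ordered so that `posts` exists before `blobs` references it. `labels.uri` is deliberately left without a foreign key because it may point at either a post or a profile.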


## Lexicons
The following Bluesky lexicons are necessary for this tool:

### `com.atproto.label.subscribeLabels`
Skywatch emits a DAG-CBOR encoded firehose of moderation decisions at `wss://ozone.skywatch.blue/xrpc/com.atproto.label.subscribeLabels`.
Each label carried in an event follows the `label` definition from `com.atproto.label.defs` (excerpt):

```json
"label": {
  "type": "object",
  "description": "Metadata tag on an atproto resource (eg, repo or record).",
  "required": ["src", "uri", "val", "cts"],
  "properties": {
    "ver": {
      "type": "integer",
      "description": "The AT Protocol version of the label object."
    },
    "src": {
      "type": "string",
      "format": "did",
      "description": "DID of the actor who created this label."
    },
    "uri": {
      "type": "string",
      "format": "uri",
      "description": "AT URI of the record, repository (account), or other resource that this label applies to."
    },
    "cid": {
      "type": "string",
      "format": "cid",
      "description": "Optionally, CID specifying the specific version of 'uri' resource this label applies to."
    },
    "val": {
      "type": "string",
      "maxLength": 128,
      "description": "The short string name of the value or type of this label."
    },
    "neg": {
      "type": "boolean",
      "description": "If true, this is a negation label, overwriting a previous label."
    },
    "cts": {
      "type": "string",
      "format": "datetime",
      "description": "Timestamp when this label was created."
    },
    "exp": {
      "type": "string",
      "format": "datetime",
      "description": "Timestamp at which this label expires (no longer applies)."
    },
    "sig": {
      "type": "bytes",
      "description": "Signature of dag-cbor encoded label."
    }
  }
},
```
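
The label definition above translates into a TypeScript type plus a runtime guard for decoded firehose payloads. The sketch below is illustrative: `Label` and `isLabel` are hypothetical names, only the required fields and the optional fields' basic types are checked, and `maxLength` is measured in UTF-8 bytes, which is how the Lexicon specification defines string length limits.

```typescript
// Hypothetical type and guard mirroring the label lexicon shown above.
interface Label {
  ver?: number;
  src: string;
  uri: string;
  cid?: string;
  val: string;
  neg?: boolean;
  cts: string;
  exp?: string;
  sig?: Uint8Array;
}

function isLabel(value: unknown): value is Label {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  // "required": ["src", "uri", "val", "cts"], all of type string.
  for (const key of ["src", "uri", "val", "cts"]) {
    if (typeof v[key] !== "string") return false;
  }
  // "val" has maxLength 128, counted in UTF-8 bytes.
  if (new TextEncoder().encode(v["val"] as string).length > 128) return false;
  // Optional fields must have the declared type when present.
  if (v["cid"] !== undefined && typeof v["cid"] !== "string") return false;
  if (v["exp"] !== undefined && typeof v["exp"] !== "string") return false;
  if (v["neg"] !== undefined && typeof v["neg"] !== "boolean") return false;
  if (v["ver"] !== undefined && typeof v["ver"] !== "number") return false;
  return true;
}
```

Rejecting malformed labels before they reach the database keeps the `labels` table consistent with the lexicon. Signature verification (`sig`) is out of scope for this sketch.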

### `app.bsky.feed.post`
Posts are structured as follows:

```json
{
  "lexicon": 1,
  "id": "app.bsky.feed.post",
  "defs": {
    "main": {
      "type": "record",
      "description": "Record containing a Bluesky post.",
      "key": "tid",
      "record": {
        "type": "object",
        "required": ["text", "createdAt"],
        "properties": {
          "text": {
            "type": "string",
            "maxLength": 3000,
            "maxGraphemes": 300,
            "description": "The primary post content. May be an empty string, if there are embeds."
          },
          "entities": {
            "type": "array",
            "description": "DEPRECATED: replaced by app.bsky.richtext.facet.",
            "items": { "type": "ref", "ref": "#entity" }
          },
          "facets": {
            "type": "array",
            "description": "Annotations of text (mentions, URLs, hashtags, etc)",
            "items": { "type": "ref", "ref": "app.bsky.richtext.facet" }
          },
          "reply": { "type": "ref", "ref": "#replyRef" },
          "embed": {
            "type": "union",
            "refs": [
              "app.bsky.embed.images",
              "app.bsky.embed.video",
              "app.bsky.embed.external",
              "app.bsky.embed.record",
              "app.bsky.embed.recordWithMedia"
            ]
          },
          "langs": {
            "type": "array",
            "description": "Indicates human language of post primary text content.",
            "maxLength": 3,
            "items": { "type": "string", "format": "language" }
          },
          "labels": {
            "type": "union",
            "description": "Self-label values for this post. Effectively content warnings.",
            "refs": ["com.atproto.label.defs#selfLabels"]
          },
          "tags": {
            "type": "array",
            "description": "Additional hashtags, in addition to any included in post text and facets.",
            "maxLength": 8,
            "items": { "type": "string", "maxLength": 640, "maxGraphemes": 64 }
          },
          "createdAt": {
            "type": "string",
            "format": "datetime",
            "description": "Client-declared timestamp when this post was originally created."
          }
        }
      }
    },
    "replyRef": {
      "type": "object",
      "required": ["root", "parent"],
      "properties": {
        "root": { "type": "ref", "ref": "com.atproto.repo.strongRef" },
        "parent": { "type": "ref", "ref": "com.atproto.repo.strongRef" }
      }
    },
    "entity": {
      "type": "object",
      "description": "Deprecated: use facets instead.",
      "required": ["index", "type", "value"],
      "properties": {
        "index": { "type": "ref", "ref": "#textSlice" },
        "type": {
          "type": "string",
          "description": "Expected values are 'mention' and 'link'."
        },
        "value": { "type": "string" }
      }
    },
    "textSlice": {
      "type": "object",
      "description": "Deprecated. Use app.bsky.richtext instead -- A text segment. Start is inclusive, end is exclusive. Indices are for utf16-encoded strings.",
      "required": ["start", "end"],
      "properties": {
        "start": { "type": "integer", "minimum": 0 },
        "end": { "type": "integer", "minimum": 0 }
      }
    }
  }
}
```
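
Mapping a fetched post record onto a `posts` row might look like the sketch below. The names `PostRecord`, `PostRow`, and `toPostRow` are hypothetical, and two choices are assumptions: the author `did` is taken from the authority segment of the at-uri (the record itself carries no author field), and the record's single `embed` value is stored in the `embeds` column as JSON text.

```typescript
// Hypothetical mapping from an app.bsky.feed.post record to a posts row.
interface PostRecord {
  text: string;
  createdAt: string;
  facets?: unknown[];
  embed?: unknown;
  langs?: string[];
  tags?: string[];
  reply?: { root: unknown; parent: unknown };
}

interface PostRow {
  uri: string;
  did: string;
  text: string;
  facets: string | null;
  embeds: string | null;
  langs: string | null;
  tags: string | null;
  createdAt: string;
  is_reply: boolean;
}

// Absent optional fields become NULL; present ones are serialized as JSON.
const toJson = (value: unknown): string | null =>
  value === undefined ? null : JSON.stringify(value);

function toPostRow(uri: string, record: PostRecord): PostRow {
  // at://<did>/app.bsky.feed.post/<rkey> -> <did>
  const did = uri.replace(/^at:\/\//, "").split("/")[0] ?? "";
  return {
    uri,
    did,
    text: record.text,
    facets: toJson(record.facets),
    embeds: toJson(record.embed),
    langs: toJson(record.langs),
    tags: toJson(record.tags),
    createdAt: record.createdAt,
    // A post is a reply exactly when it carries a reply reference.
    is_reply: record.reply !== undefined,
  };
}
```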

For posts, the `app.bsky.embed.images` lexicon is of particular interest. The blob reference it contains can be used to retrieve the image from the PDS, after which the image can be saved to local storage or hashed.

```json
{
  "lexicon": 1,
  "id": "app.bsky.embed.images",
  "description": "A set of images embedded in a Bluesky record (eg, a post).",
  "defs": {
    "main": {
      "type": "object",
      "required": ["images"],
      "properties": {
        "images": {
          "type": "array",
          "items": { "type": "ref", "ref": "#image" },
          "maxLength": 4
        }
      }
    },
    "image": {
      "type": "object",
      "required": ["image", "alt"],
      "properties": {
        "image": {
          "type": "blob",
          "accept": ["image/*"],
          "maxSize": 1000000
        },
        "alt": {
          "type": "string",
          "description": "Alt text description of the image, for accessibility."
        },
        "aspectRatio": {
          "type": "ref",
          "ref": "app.bsky.embed.defs#aspectRatio"
        }
      }
    },
    "view": {
      "type": "object",
      "required": ["images"],
      "properties": {
        "images": {
          "type": "array",
          "items": { "type": "ref", "ref": "#viewImage" },
          "maxLength": 4
        }
      }
    },
    "viewImage": {
      "type": "object",
      "required": ["thumb", "fullsize", "alt"],
      "properties": {
        "thumb": {
          "type": "string",
          "format": "uri",
          "description": "Fully-qualified URL where a thumbnail of the image can be fetched. For example, CDN location provided by the App View."
        },
        "fullsize": {
          "type": "string",
          "format": "uri",
          "description": "Fully-qualified URL where a large version of the image can be fetched. May or may not be the exact original blob. For example, CDN location provided by the App View."
        },
        "alt": {
          "type": "string",
          "description": "Alt text description of the image, for accessibility."
        },
        "aspectRatio": {
          "type": "ref",
          "ref": "app.bsky.embed.defs#aspectRatio"
        }
      }
    }
  }
}
```
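
Pulling image blob references out of a post's `embed` and building a download URL could be sketched as follows. The helper names `extractImageBlobs` and `blobUrl` are hypothetical. The sketch assumes the embed is in its plain JSON form, where a blob appears as `{ "$type": "blob", "ref": { "$link": "<cid>" }, "mimeType": ..., "size": ... }` (a client library may instead decode `ref` into a CID object), and it assumes the caller already knows the host of the author's PDS, which is where `com.atproto.sync.getBlob` must be called.

```typescript
// Hypothetical helpers; they assume the plain JSON encoding of blob refs.
interface ImageBlob {
  cid: string;
  mimeType: string;
}

function extractImageBlobs(embed: unknown): ImageBlob[] {
  if (typeof embed !== "object" || embed === null) return [];
  const e = embed as Record<string, unknown>;
  // recordWithMedia wraps the media embed (which may be an images embed).
  if (e["$type"] === "app.bsky.embed.recordWithMedia") {
    return extractImageBlobs(e["media"]);
  }
  const images = e["images"];
  if (e["$type"] !== "app.bsky.embed.images" || !Array.isArray(images)) return [];
  const blobs: ImageBlob[] = [];
  for (const item of images) {
    const blob = (item as { image?: { ref?: { $link?: unknown }; mimeType?: unknown } } | null)?.image;
    const cid = blob?.ref?.$link;
    if (typeof cid === "string") {
      blobs.push({ cid, mimeType: typeof blob?.mimeType === "string" ? blob.mimeType : "" });
    }
  }
  return blobs;
}

// Builds the com.atproto.sync.getBlob URL for a blob on the given PDS host.
function blobUrl(pdsHost: string, did: string, cid: string): string {
  const params = new URLSearchParams({ did, cid });
  return `https://${pdsHost}/xrpc/com.atproto.sync.getBlob?${params.toString()}`;
}
```

Each extracted `cid` and `mimeType` pair corresponds to one candidate row in the `blobs` table. The bytes fetched from the URL are what would be hashed and, if `HYDRATE_BLOBS` is true, stored.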

### `app.bsky.actor.profile`

```json
{
  "lexicon": 1,
  "id": "app.bsky.actor.profile",
  "defs": {
    "main": {
      "type": "record",
      "description": "A declaration of a Bluesky account profile.",
      "key": "literal:self",
      "record": {
        "type": "object",
        "properties": {
          "displayName": {
            "type": "string",
            "maxGraphemes": 64,
            "maxLength": 640
          },
          "description": {
            "type": "string",
            "description": "Free-form profile description text.",
            "maxGraphemes": 256,
            "maxLength": 2560
          },
          "pronouns": {
            "type": "string",
            "description": "Free-form pronouns text.",
            "maxGraphemes": 20,
            "maxLength": 200
          },
          "website": { "type": "string", "format": "uri" },
          "avatar": {
            "type": "blob",
            "description": "Small image to be displayed next to posts from account. AKA, 'profile picture'",
            "accept": ["image/png", "image/jpeg"],
            "maxSize": 1000000
          },
          "banner": {
            "type": "blob",
            "description": "Larger horizontal image to display behind profile view.",
            "accept": ["image/png", "image/jpeg"],
            "maxSize": 1000000
          },
          "labels": {
            "type": "union",
            "description": "Self-label values, specific to the Bluesky application, on the overall account.",
            "refs": ["com.atproto.label.defs#selfLabels"]
          },
          "joinedViaStarterPack": {
            "type": "ref",
            "ref": "com.atproto.repo.strongRef"
          },
          "pinnedPost": {
            "type": "ref",
            "ref": "com.atproto.repo.strongRef"
          },
          "createdAt": { "type": "string", "format": "datetime" }
        }
      }
    }
  }
}
```
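
Because every property of the profile record is optional, the mapping to a `profiles` row has to tolerate missing fields and even a missing record. The following is a minimal sketch with hypothetical names (`ProfileRecord`, `ProfileRow`, `toProfileRow`). The handle is passed in separately because it is resolved from the DID rather than read from the profile record.

```typescript
// Hypothetical mapping from an app.bsky.actor.profile record to a profiles row.
interface ProfileRecord {
  displayName?: string;
  description?: string;
}

interface ProfileRow {
  did: string;
  handle: string | null;
  displayName: string | null;
  description: string | null;
}

function toProfileRow(
  did: string,
  handle: string | undefined,
  record: ProfileRecord | undefined,
): ProfileRow {
  return {
    did,
    // Handle resolution can fail, so the column is nullable in this sketch.
    handle: handle ?? null,
    displayName: record?.displayName ?? null,
    description: record?.description ?? null,
  };
}
```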