Product Requirements Document (PRD)#
This document outlines the requirements for the Skywatch Capture application. It serves as a reference for developers, designers, and stakeholders to ensure that the product meets the needs of its users.
labels.uri is the URI against which the label is applied. It can take two forms, a reference to a post in the form of an at-uri: at://did:plc:7i7s4avtaolnrgc3ubcoqrq3/app.bsky.feed.post/3lf5u32pxwk2f or a reference to a user in the form of a did: did:plc:piwuaowuiykzaare644i5fre.
labels.val is the label value being emitted.
labels.neg is a boolean indicating whether this label is a negation label, overwriting a previous label.
Core Use Case#
The primary purpose of this application is to subscribe to a Bluesky labeler's firehose, capture all emitted label events, hydrate the associated data (posts and user profiles), and store this comprehensive dataset in a local database. This data is intended for future use in training machine learning classifiers for content moderation.
Functional Requirements#
- Firehose Subscription: Connect to and process a DAG-CBOR encoded firehose from a specified Bluesky labeler service.
- Data Hydration: For each label received, fetch the full context of the labeled content.
- Post Hydration: If the label URI is an
at-uri(post), fetch the fullapp.bsky.feed.postrecord and store the following fields:did,text,facets,embeds,langs,tags,createdAt, and reply status. - Profile Hydration: If the label URI is a
did(user), fetch the fullapp.bsky.actor.profilerecord and store thedisplayNameanddescription. Additionally, resolve and store the user'shandle.
- Post Hydration: If the label URI is an
- Image & Blob Handling:
- An option (
HYDRATE_BLOBS) must be provided to control whether to download image/video blobs. This is a safety feature for users labeling sensitive content. - In all cases, both a SHA-256 (cryptographic) hash and a perceptual hash (pHash) of any referenced image blobs must be captured to ensure compatibility with various moderation toolkits.
- If
HYDRATE_BLOBSis true, the application must support storing the downloaded blobs either on the local filesystem or in an AWS S3 bucket, configurable via environment variables.
- An option (
- Data Storage:
- All captured and hydrated data should be stored in a DuckDB database file.
- The database schema should be structured to link labels to their hydrated content.
- Filtering: The user must be able to optionally provide a comma-separated list of labels to capture (
CAPTURE_LABELS). If provided, any label not in this list will be ignored.
Technical Requirements#
- Language/Runtime: Use TypeScript with Bun.
- Containerization: The application must be containerized using Docker. The DuckDB database file must be stored on a volume outside the container to ensure data persistence. A
docker-compose.ymlfile should be provided to manage services. - Key Libraries:
@atcute/cborand@atcute/carfor parsing the firehose.@atproto/apifor all Bluesky API interactions.pinoandpino-prettyfor logging.dotenvfor environment variable management.
- Portability: The application should be designed to be portable and easily configurable for use by other moderation services or researchers.
- Rate Limits: Be mindful of Bluesky API rate limits during hydration.
Configuration#
The application will be configured via a .env file with the following variables:
# Bluesky Credentials
BSKY_HANDLE=your-bluesky-handle.bsky.social
BSKY_PASSWORD=your-app-password
# Bluesky PDS and Labeler URL
PDS=bsky.social
WSS_URL=wss://your-labeler-service.com/xrpc/com.atproto.label.subscribeLabels
# Blob & Image Handling
HYDRATE_BLOBS=false # Set to true to download images/videos
BLOB_STORAGE_TYPE=local # 'local' or 's3'
BLOB_STORAGE_PATH=./data/blobs # Path for local storage
# S3 Configuration (only required if BLOB_STORAGE_TYPE is 's3')
S3_BUCKET=your-s3-bucket-name
S3_REGION=us-east-1
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
# Database
DB_PATH=./data/skywatch.duckdb
# Filtering (Optional)
# Comma-separated list of labels to capture, e.g., "spam,hate-speech"
CAPTURE_LABELS=
# Logging
LOG_LEVEL=info
Data Schema#
The database will contain the following tables:
labels#
Stores the raw label event data.
id(INTEGER, Primary Key, Auto-incrementing)uri(TEXT) - Theat-uriordidof the labeled content.cid(TEXT) - The CID of the specific record version.val(TEXT) - The label value (e.g., "spam").neg(BOOLEAN) - If the label is a negation.cts(DATETIME) - Timestamp of label creation.exp(DATETIME, nullable) - Expiration timestamp of the label.src(TEXT) - The DID of the labeler.
posts#
Stores hydrated data for labeled posts. Linked to labels.uri.
uri(TEXT, Primary Key)did(TEXT) - Author of the post.text(TEXT)facets(JSON)embeds(JSON)langs(JSON)tags(JSON)createdAt(DATETIME)is_reply(BOOLEAN)
profiles#
Stores hydrated data for labeled user accounts. Linked to labels.uri.
did(TEXT, Primary Key)handle(TEXT)displayName(TEXT)description(TEXT)
blobs#
Stores information about image blobs found in posts.
post_uri(TEXT) - Foreign key toposts.uri.blob_cid(TEXT) - CID of the blob.sha256(TEXT) - Cryptographic hash for exact file matching.phash(TEXT) - Perceptual hash for finding visually similar images.storage_path(TEXT, nullable) - Local or S3 path if downloaded.mimetype(TEXT)- PRIMARY KEY (
post_uri,blob_cid)
Lexicons#
The following bluesky lexicons are necessary for this tool:
com.atproto.label.subscribeLabels#
Skywatch emits a DAG-CBOR encoded firehose of moderation decisions at `wss://ozone.skywatch.blue/xrpc/com.atproto.label.subscribeLabels A label event looks like the following:
"label": {
"type": "object",
"description": "Metadata tag on an atproto resource (eg, repo or record).",
"required": ["src", "uri", "val", "cts"],
"properties": {
"ver": {
"type": "integer",
"description": "The AT Protocol version of the label object."
},
"src": {
"type": "string",
"format": "did",
"description": "DID of the actor who created this label."
},
"uri": {
"type": "string",
"format": "uri",
"description": "AT URI of the record, repository (account), or other resource that this label applies to."
},
"cid": {
"type": "string",
"format": "cid",
"description": "Optionally, CID specifying the specific version of 'uri' resource this label applies to."
},
"val": {
"type": "string",
"maxLength": 128,
"description": "The short string name of the value or type of this label."
},
"neg": {
"type": "boolean",
"description": "If true, this is a negation label, overwriting a previous label."
},
"cts": {
"type": "string",
"format": "datetime",
"description": "Timestamp when this label was created."
},
"exp": {
"type": "string",
"format": "datetime",
"description": "Timestamp at which this label expires (no longer applies)."
},
"sig": {
"type": "bytes",
"description": "Signature of dag-cbor encoded label."
}
}
},
app.bsky.feed.post#
Post are structured as the following:
{
"lexicon": 1,
"id": "app.bsky.feed.post",
"defs": {
"main": {
"type": "record",
"description": "Record containing a Bluesky post.",
"key": "tid",
"record": {
"type": "object",
"required": ["text", "createdAt"],
"properties": {
"text": {
"type": "string",
"maxLength": 3000,
"maxGraphemes": 300,
"description": "The primary post content. May be an empty string, if there are embeds."
},
"entities": {
"type": "array",
"description": "DEPRECATED: replaced by app.bsky.richtext.facet.",
"items": { "type": "ref", "ref": "#entity" }
},
"facets": {
"type": "array",
"description": "Annotations of text (mentions, URLs, hashtags, etc)",
"items": { "type": "ref", "ref": "app.bsky.richtext.facet" }
},
"reply": { "type": "ref", "ref": "#replyRef" },
"embed": {
"type": "union",
"refs": [
"app.bsky.embed.images",
"app.bsky.embed.video",
"app.bsky.embed.external",
"app.bsky.embed.record",
"app.bsky.embed.recordWithMedia"
]
},
"langs": {
"type": "array",
"description": "Indicates human language of post primary text content.",
"maxLength": 3,
"items": { "type": "string", "format": "language" }
},
"labels": {
"type": "union",
"description": "Self-label values for this post. Effectively content warnings.",
"refs": ["com.atproto.label.defs#selfLabels"]
},
"tags": {
"type": "array",
"description": "Additional hashtags, in addition to any included in post text and facets.",
"maxLength": 8,
"items": { "type": "string", "maxLength": 640, "maxGraphemes": 64 }
},
"createdAt": {
"type": "string",
"format": "datetime",
"description": "Client-declared timestamp when this post was originally created."
}
}
}
},
"replyRef": {
"type": "object",
"required": ["root", "parent"],
"properties": {
"root": { "type": "ref", "ref": "com.atproto.repo.strongRef" },
"parent": { "type": "ref", "ref": "com.atproto.repo.strongRef" }
}
},
"entity": {
"type": "object",
"description": "Deprecated: use facets instead.",
"required": ["index", "type", "value"],
"properties": {
"index": { "type": "ref", "ref": "#textSlice" },
"type": {
"type": "string",
"description": "Expected values are 'mention' and 'link'."
},
"value": { "type": "string" }
}
},
"textSlice": {
"type": "object",
"description": "Deprecated. Use app.bsky.richtext instead -- A text segment. Start is inclusive, end is exclusive. Indices are for utf16-encoded strings.",
"required": ["start", "end"],
"properties": {
"start": { "type": "integer", "minimum": 0 },
"end": { "type": "integer", "minimum": 0 }
}
}
}
}
With posts we are interested in the app.bsky.embeds.images lexicon in particular. The blob reference can be used to retriexe the image from the PDS and then saved to local storage or hashed.
{
"lexicon": 1,
"id": "app.bsky.embed.images",
"description": "A set of images embedded in a Bluesky record (eg, a post).",
"defs": {
"main": {
"type": "object",
"required": ["images"],
"properties": {
"images": {
"type": "array",
"items": { "type": "ref", "ref": "#image" },
"maxLength": 4
}
}
},
"image": {
"type": "object",
"required": ["image", "alt"],
"properties": {
"image": {
"type": "blob",
"accept": ["image/*"],
"maxSize": 1000000
},
"alt": {
"type": "string",
"description": "Alt text description of the image, for accessibility."
},
"aspectRatio": {
"type": "ref",
"ref": "app.bsky.embed.defs#aspectRatio"
}
}
},
"view": {
"type": "object",
"required": ["images"],
"properties": {
"images": {
"type": "array",
"items": { "type": "ref", "ref": "#viewImage" },
"maxLength": 4
}
}
},
"viewImage": {
"type": "object",
"required": ["thumb", "fullsize", "alt"],
"properties": {
"thumb": {
"type": "string",
"format": "uri",
"description": "Fully-qualified URL where a thumbnail of the image can be fetched. For example, CDN location provided by the App View."
},
"fullsize": {
"type": "string",
"format": "uri",
"description": "Fully-qualified URL where a large version of the image can be fetched. May or may not be the exact original blob. For example, CDN location provided by the App View."
},
"alt": {
"type": "string",
"description": "Alt text description of the image, for accessibility."
},
"aspectRatio": {
"type": "ref",
"ref": "app.bsky.embed.defs#aspectRatio"
}
}
}
}
}
app.bsky.actor.profile#
{
"lexicon": 1,
"id": "app.bsky.actor.profile",
"defs": {
"main": {
"type": "record",
"description": "A declaration of a Bluesky account profile.",
"key": "literal:self",
"record": {
"type": "object",
"properties": {
"displayName": {
"type": "string",
"maxGraphemes": 64,
"maxLength": 640
},
"description": {
"type": "string",
"description": "Free-form profile description text.",
"maxGraphemes": 256,
"maxLength": 2560
},
"pronouns": {
"type": "string",
"description": "Free-form pronouns text.",
"maxGraphemes": 20,
"maxLength": 200
},
"website": { "type": "string", "format": "uri" },
"avatar": {
"type": "blob",
"description": "Small image to be displayed next to posts from account. AKA, 'profile picture'",
"accept": ["image/png", "image/jpeg"],
"maxSize": 1000000
},
"banner": {
"type": "blob",
"description": "Larger horizontal image to display behind profile view.",
"accept": ["image/png", "image/jpeg"],
"maxSize": 1000000
},
"labels": {
"type": "union",
"description": "Self-label values, specific to the Bluesky application, on the overall account.",
"refs": ["com.atproto.label.defs#selfLabels"]
},
"joinedViaStarterPack": {
"type": "ref",
"ref": "com.atproto.repo.strongRef"
},
"pinnedPost": {
"type": "ref",
"ref": "com.atproto.repo.strongRef"
},
"createdAt": { "type": "string", "format": "datetime" }
}
}
}
}
}