docs: audit and update documentation (#547)

* docs: audit and update documentation

- promote connection-pool-exhaustion runbook from sandbox to docs/runbooks/
- update CLAUDE.md files with current module descriptions
- _internal: add background tasks, moderation, jobs
- api: add albums, playlists, exports, moderation, stats
- docs: add runbooks, testing, moderation, lexicons
- refresh docs/README.md with streamlined navigation
- update STATUS.md:
- add Dec 9 docket/concurrent exports section
- update immediate priorities (moderation cleanup, cost dashboard)
- de-emphasize transcoder (moved to backlog)
- add docket to technical architecture

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: update README with current state

- add docket, logfire to tech stack
- add moderation/transcoder services section
- update local dev to include dev-services (redis)
- add useful commands section with actual justfile targets
- update features (playlists, scrobbling, timed comments, support links)
- add data ownership section
- update project structure (_internal services)
- consolidate links + documentation into single section
- add mirrors section (github, tangled)
- remove stale atproto fork reference

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>

authored by zzstoatzz.io Claude and committed by GitHub 1e9aa38a 8fdef170

Changed files
+304 -350
+67 -39
README.md
···
  - **framework**: [FastAPI](https://fastapi.tiangolo.com)
  - **database**: [Neon PostgreSQL](https://neon.com)
  - **storage**: [Cloudflare R2](https://developers.cloudflare.com/r2/)
+ - **background tasks**: [docket](https://github.com/zzstoatzz/docket) (Redis-backed)
  - **hosting**: [Fly.io](https://fly.io)
- - **auth**: [atproto OAuth 2.1](https://atproto.com/specs/oauth) ([fork with OAuth implementation](https://github.com/zzstoatzz/atproto))
+ - **observability**: [Pydantic Logfire](https://logfire.pydantic.dev)
+ - **auth**: [atproto OAuth 2.1](https://atproto.com/specs/oauth)

  ### frontend
- - **framework**: [SvelteKit](https://kit.svelte.dev)
+ - **framework**: [SvelteKit](https://kit.svelte.dev) with Svelte 5 runes
  - **runtime**: [Bun](https://bun.sh)
  - **hosting**: [Cloudflare Pages](https://pages.cloudflare.com)
  - **styling**: vanilla CSS (lowercase aesthetic)

+ ### services
+ - **moderation**: Rust ATProto labeler for copyright/sensitive content
+ - **transcoder**: Rust audio conversion service (ffmpeg)
+
  </details>

  <details>
···

  ### prerequisites

- - [uv](https://docs.astral.sh/uv/) for Python package management
- - [bun](https://bun.sh/) for frontend development
- - [just](https://github.com/casey/just) for task running (recommended)
+ - [uv](https://docs.astral.sh/uv/) for Python
+ - [bun](https://bun.sh/) for frontend
+ - [just](https://github.com/casey/just) for task running
+ - [docker](https://www.docker.com/) for dev services (redis)

  ### quick start

- using [just](https://github.com/casey/just):
+ ```bash
+ # install dependencies
+ uv sync
+ cd frontend && bun install && cd ..

- ```bash
- # install dependencies (uv handles backend venv automatically)
- uv sync # For root-level deps, if any, and initializes uv
- just frontend install
+ # start dev services (redis for background tasks)
+ just dev-services

  # run backend (hot reloads at http://localhost:8001)
  just backend run

  # run frontend (hot reloads at http://localhost:5173)
  just frontend dev
+ ```
+
+ ### useful commands

- # run transcoder (hot reloads at http://localhost:8082)
- just transcoder run
+ ```bash
+ # run tests
+ just backend test
+
+ # run linting
+ just backend lint
+ just frontend check
+
+ # database migrations
+ just backend migrate "migration message"
+ just backend migrate-up
+
+ # stop dev services
+ just dev-services-down
  ```

  </details>
···
  <summary>features</summary>

  ### listening
- - audio playback with persistent queue across tabs/devices
- - like tracks with counts visible to all listeners
+ - audio playback with persistent queue across tabs
+ - like tracks, add to playlists
  - browse artist profiles and discographies
- - share tracks and albums with nice link previews
+ - share tracks, albums, and playlists with link previews
+ - unified search with Cmd/Ctrl+K
+ - teal.fm scrobbling

  ### creating
- - well-scoped OAuth authentication via ATProto (bluesky accounts)
- - upload tracks with title, artwork, and featured artists
- - organize tracks into albums with cover art
- - edit metadata and replace artwork anytime
- - track play counts and like analytics
- - publish ATProto track and like records to your PDS
+ - OAuth authentication via ATProto (bluesky accounts)
+ - upload tracks with title, artwork, tags, and featured artists
+ - organize tracks into albums and playlists
+ - drag-and-drop reordering
+ - timed comments with clickable timestamps
+ - artist support links (ko-fi, patreon, etc.)
+
+ ### data ownership
+ - tracks, likes, playlists synced to your PDS as ATProto records
+ - portable identity - your data travels with you
+ - public by default - any client can read your music records

  </details>
-

  <details>
  <summary>project structure</summary>

  ```
  plyr.fm/
- ├── backend/ # FastAPI app & Python tooling
+ ├── backend/ # FastAPI app
  │   ├── src/backend/ # application code
  │   │   ├── api/ # public endpoints
- │   │   ├── _internal/ # internal services
+ │   │   ├── _internal/ # services (auth, atproto, background tasks)
  │   │   ├── models/ # database schemas
- │   │   └── storage/ # storage adapters
+ │   │   └── storage/ # R2 adapter
  │   ├── tests/ # pytest suite
- │   └── alembic/ # database migrations
+ │   └── alembic/ # migrations
  ├── frontend/ # SvelteKit app
  │   ├── src/lib/ # components & state
  │   └── src/routes/ # pages
+ ├── moderation/ # Rust labeler service
  ├── transcoder/ # Rust audio service
  ├── docs/ # documentation
  └── justfile # task runner
···
  <summary>costs</summary>

  ~$35-40/month:
- - fly.io backend (production): ~$5/month (shared-cpu-1x, 256MB RAM)
- - fly.io backend (staging): ~$5/month (shared-cpu-1x, 256MB RAM)
- - fly.io transcoder: ~$0-5/month (auto-scales to zero when idle)
- - neon postgres: $5/month (starter plan)
- - audd audio fingerprinting: ~$10/month (enterprise API for copyright detection)
- - cloudflare pages: free (frontend hosting)
- - cloudflare r2: ~$0.16/month (6 buckets across dev/staging/prod)
+ - fly.io backend (prod + staging): ~$10/month
+ - fly.io transcoder: ~$0-5/month (auto-scales to zero)
+ - neon postgres: $5/month
+ - audd audio fingerprinting: ~$10/month
+ - cloudflare (pages + r2): ~$0.16/month

  </details>

  ## links

  - **production**: https://plyr.fm
+ - **staging**: https://stg.plyr.fm
  - **API docs**: https://api.plyr.fm/docs
  - **python SDK / MCP server**: [plyrfm](https://github.com/zzstoatzz/plyr-python-client) ([PyPI](https://pypi.org/project/plyrfm/))
- - **repository**: https://github.com/zzstoatzz/plyr.fm
+ - **documentation**: [docs/README.md](docs/README.md)
+ - **status**: [STATUS.md](STATUS.md)

- ## documentation
-
- - [deployment guide](docs/deployment/environments.md)
- - [configuration](docs/backend/configuration.md)
- - [full documentation](docs/README.md)
+ ### mirrors
+ - **github**: https://github.com/zzstoatzz/plyr.fm
+ - **tangled**: https://tangled.sh/@zzstoatzz.io/plyr.fm
+43 -19
STATUS.md
···

  ### December 2025

+ #### docket background tasks & concurrent exports (PRs #534-546, Dec 9)
+
+ **docket integration** (PRs #534, #536, #539):
+ - migrated background tasks from inline asyncio to docket (Redis-backed task queue)
+ - copyright scanning, media export, ATProto sync, and teal scrobbling now run via docket
+ - graceful fallback to asyncio for local development without Redis
+ - parallel test execution with xdist template databases (#540)
+
+ **concurrent export downloads** (PR #545):
+ - exports now download tracks in parallel (up to 4 concurrent) instead of sequentially
+ - significantly faster for users with many tracks or large files
+ - zip creation remains sequential (zipfile constraint)
+
+ **ATProto refactor** (PR #534):
+ - reorganized ATProto record code into `_internal/atproto/records/` by lexicon namespace
+ - extracted `client.py` for low-level PDS operations
+ - cleaner separation between plyr.fm and teal.fm lexicons
+
+ **documentation & observability**:
+ - AudD API cost tracking dashboard (#546)
+ - promoted runbooks from sandbox to `docs/runbooks/`
+ - updated CLAUDE.md files across the codebase
+
+ ---
+
  #### artist support links & inline playlist editing (PRs #520-532, Dec 8)

  **artist support link** (PR #532):
···

  ## immediate priorities

- ### high priority features
- 1. **audio transcoding pipeline integration** (issue #153)
-    - ✅ standalone transcoder service deployed at https://plyr-transcoder.fly.dev/
-    - ⏳ next: integrate into plyr.fm upload pipeline
-
  ### known issues
  - playback auto-start on refresh (#225)
- - no AIFF/AIF transcoding support (#153)
  - iOS PWA audio may hang on first play after backgrounding

- ### new features
- - issue #146: content-addressable storage (hash-based deduplication)
- - issue #155: add track metadata (genres, tags, descriptions)
+ ### immediate focus
+ - **moderation cleanup**: consolidate copyright detection, reduce AudD API costs, streamline labeler integration (issues #541-544)
+ - **public cost dashboard**: interactive page showing platform running costs (fly.io, neon, audd, r2) - transparency + prompts for community support
+
+ ### feature ideas
  - issue #334: add 'share to bluesky' option for tracks
  - issue #373: lyrics field and Genius-style annotations
- - issue #393: moderation - represent confirmed takedown state in labeler
+
+ ### backlog
+ - audio transcoding pipeline integration (#153) - transcoder service deployed, integration deferred

  ## technical state

···
  - framework: FastAPI with uvicorn
  - database: Neon PostgreSQL (serverless)
  - storage: Cloudflare R2 (S3-compatible)
+ - background tasks: docket (Redis-backed)
  - hosting: Fly.io (2x shared-cpu VMs)
  - observability: Pydantic Logfire
  - auth: ATProto OAuth 2.1
···
  - ✅ unified search with Cmd/Ctrl+K
  - ✅ teal.fm scrobbling
  - ✅ copyright moderation with ATProto labeler
+ - ✅ docket background tasks (copyright scan, export, atproto sync, scrobble)
+ - ✅ media export with concurrent downloads

  **albums**
  - ✅ album CRUD with cover art
···

  ## documentation

- - [deployment overview](docs/deployment/overview.md)
- - [configuration guide](docs/configuration.md)
- - [queue design](docs/queue-design.md)
- - [logfire querying](docs/logfire-querying.md)
- - [moderation & labeler](docs/moderation/atproto-labeler.md)
- - [unified search](docs/frontend/search.md)
- - [keyboard shortcuts](docs/frontend/keyboard-shortcuts.md)
- - [lexicons overview](docs/lexicons/overview.md)
+ - [docs/README.md](docs/README.md) - documentation index
+ - [runbooks](docs/runbooks/) - production incident procedures
+ - [background tasks](docs/backend/background-tasks.md) - docket task system
+ - [logfire querying](docs/tools/logfire.md) - observability queries
+ - [moderation & labeler](docs/moderation/atproto-labeler.md) - copyright, sensitive content
+ - [lexicons overview](docs/lexicons/overview.md) - ATProto record schemas

  ---

- this is a living document. last updated 2025-12-08.
+ this is a living document. last updated 2025-12-09.
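the concurrent export notes in the STATUS.md diff (downloads capped at 4 in parallel, zip creation kept sequential) can be sketched roughly as below. this is a minimal illustration and not the code from PR #545: `fetch_track`, `export_tracks`, and `MAX_CONCURRENT_DOWNLOADS` are hypothetical names, and the download is simulated with a short sleep instead of a real storage call.

```python
import asyncio
import io
import zipfile

# assumption: mirrors the "up to 4 concurrent" figure from the changelog entry
MAX_CONCURRENT_DOWNLOADS = 4


async def fetch_track(name: str) -> bytes:
    """hypothetical stand-in for a storage download; returns fake bytes."""
    await asyncio.sleep(0.01)
    return f"audio:{name}".encode()


async def export_tracks(names: list[str]) -> bytes:
    """download tracks with bounded concurrency, then zip them one by one."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_DOWNLOADS)

    async def bounded_fetch(name: str) -> tuple[str, bytes]:
        # at most MAX_CONCURRENT_DOWNLOADS coroutines get past this point at once
        async with semaphore:
            return name, await fetch_track(name)

    # downloads overlap; gather returns results in input order
    downloaded = await asyncio.gather(*(bounded_fetch(n) for n in names))

    # the archive is written sequentially, one entry at a time
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for name, data in downloaded:
            archive.writestr(name, data)
    return buffer.getvalue()


if __name__ == "__main__":
    payload = asyncio.run(export_tracks(["a.mp3", "b.mp3", "c.mp3"]))
    print(len(payload), "bytes in archive")
```

the sketch holds every download in memory before zipping, which keeps it short; a real export would more likely spool to disk or stream.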
+7 -1
backend/src/backend/_internal/CLAUDE.md
···
    - `records/fm_plyr/`: plyr.fm lexicons (track, like, comment, list, profile)
    - `records/fm_teal/`: teal.fm lexicons (play, status)
    - `sync.py`: high-level sync orchestration (profile, albums, liked list)
+ - **background**: docket-based background task system
+   - `background.py`: docket client initialization and configuration
+   - `background_tasks.py`: task functions (copyright scan, export, atproto sync, teal scrobble)
  - **queue**: fisher-yates shuffle with retry, postgres LISTEN/NOTIFY for cache invalidation
  - **uploads**: streaming chunked uploads to R2/filesystem, duplicate detection via file_id
+ - **moderation**: copyright scanning via AudD, sensitive image flagging
+ - **jobs**: job tracking for long-running operations (exports)

  gotchas:
  - ATProto records organized under `_internal/atproto/records/` by lexicon namespace
  - file_id is sha256 hash truncated to 16 chars
- - queue cache is TTL-based (5min), hydration includes duplicate track_ids
+ - queue cache is TTL-based (5min), hydration includes duplicate track_ids
+ - background tasks use docket (Redis-backed) with asyncio fallback for local dev
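two of the notes in this file are small enough to sketch: the truncated sha256 `file_id` and the fisher-yates shuffle used by the queue. both functions below are illustrative only. `compute_file_id` and `fisher_yates_shuffle` are hypothetical names, and the sketch assumes the hash is taken over the file bytes and truncated from its hex digest, which the note does not spell out. the "with retry" part of the queue shuffle is not described in the diff and is left out.

```python
import hashlib
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def compute_file_id(content: bytes) -> str:
    """sha256 over the content, keeping the first 16 hex characters (assumed)."""
    return hashlib.sha256(content).hexdigest()[:16]


def fisher_yates_shuffle(items: Sequence[T], rng: Optional[random.Random] = None) -> list[T]:
    """return a shuffled copy; every permutation is equally likely."""
    rng = rng or random.Random()
    shuffled = list(items)
    # walk from the last index down, swapping each slot with a random
    # slot at or before it (the classic in-place fisher-yates step)
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


if __name__ == "__main__":
    print(compute_file_id(b"example audio bytes"))
    print(fisher_yates_shuffle(["t1", "t2", "t3", "t4"]))
```

identical bytes always map to the same `file_id`, which is what makes it usable for the duplicate detection mentioned under **uploads**.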
+8 -3
backend/src/backend/api/CLAUDE.md
···
  - session management in `_internal/auth.py`

  resources:
- - **tracks**: upload, edit, delete, like/unlike, play count tracking
- - **artists**: profiles synced from ATProto identities
+ - **tracks**: upload, edit, delete, like/unlike, play count tracking, timed comments
+ - **albums**: CRUD with cover art, track ordering, ATProto list records
+ - **playlists**: CRUD with drag-and-drop reordering, ATProto list records
+ - **artists**: profiles synced from ATProto identities, support links
  - **audio**: streaming via 307 redirects to R2 CDN
  - **queue**: server-authoritative with optimistic client updates
- - **preferences**: user settings (accent color, auto-play)
+ - **preferences**: user settings (accent color, auto-play, teal scrobbling, sensitive artwork)
+ - **exports**: media export with SSE progress tracking, concurrent downloads
+ - **moderation**: sensitive image management, copyright label checking
+ - **stats**: platform statistics (track count, play count, total duration)
+7 -3
docs/CLAUDE.md
···
  organized knowledge base - check here before researching.

  structure:
- - **frontend/** - svelte 5 state patterns, ui components
- - **backend/** - config system, features, transcoder service
+ - **frontend/** - svelte 5 state patterns, ui components, navigation
+ - **backend/** - config system, background tasks, database, transcoder
  - **deployment/** - environments, migrations, fly.io
- - **tools/** - logfire queries, neon mcp, pdsx cli
+ - **tools/** - logfire queries, neon mcp, pdsx cli, plyr.fm mcp
  - **local-development/** - setup guide for new contributors
+ - **moderation/** - copyright detection, sensitive content, ATProto labeler
+ - **lexicons/** - ATProto record schemas (track, like, comment, list, profile)
+ - **runbooks/** - operational procedures for production incidents
+ - **testing/** - pytest patterns, parallel execution, template databases

  when you solve a problem or make a design choice, document it here with as much detail as needed
+50 -285
docs/README.md
···
  # plyr.fm documentation

- this directory contains all documentation for the plyr.fm project.
+ organized knowledge base for plyr.fm development.

- ## documentation index
+ ## quick navigation

- ### authentication & security
- - **[authentication.md](./authentication.md)** - secure cookie-based authentication, HttpOnly cookies, XSS protection, environment architecture, migration from localStorage
-
- ### frontend
- - **[state-management.md](./frontend/state-management.md)** - global state management with Svelte 5 runes (toast notifications, tracks cache, upload manager, queue management, liked tracks, preferences, localStorage persistence)
- - **[toast-notifications.md](./frontend/toast-notifications.md)** - user feedback system for async operations with smooth transitions and auto-dismiss
- - **[queue.md](./frontend/queue.md)** - music queue management with server sync
- - **[keyboard-shortcuts.md](./frontend/keyboard-shortcuts.md)** - global keyboard shortcuts with context-aware filtering (Q for queue toggle, patterns for adding new shortcuts)
+ ### operations
+ - **[runbooks/](./runbooks/)** - production incident procedures
+   - [connection-pool-exhaustion](./runbooks/connection-pool-exhaustion.md) - 500s, stuck connections

  ### backend
- - **[configuration.md](./backend/configuration.md)** - backend configuration and environment setup
- - **[liked-tracks.md](./backend/liked-tracks.md)** - ATProto-backed track likes with error handling and consistency guarantees
- - **[streaming-uploads.md](./backend/streaming-uploads.md)** - SSE-based progress tracking for file uploads with fire-and-forget pattern
- - **[transcoder.md](./backend/transcoder.md)** - rust-based HTTP service for audio format conversion (ffmpeg integration, authentication, fly.io deployment)
+ - **[background-tasks.md](./backend/background-tasks.md)** - docket-based task system (copyright scan, export, scrobble)
+ - **[configuration.md](./backend/configuration.md)** - environment setup and settings
+ - **[database/](./backend/database/)** - connection pooling, neon-specific patterns
+ - **[streaming-uploads.md](./backend/streaming-uploads.md)** - SSE progress tracking
+ - **[transcoder.md](./backend/transcoder.md)** - rust audio conversion service
+
+ ### frontend
+ - **[state-management.md](./frontend/state-management.md)** - svelte 5 runes patterns
+ - **[keyboard-shortcuts.md](./frontend/keyboard-shortcuts.md)** - global shortcuts
+ - **[navigation.md](./frontend/navigation.md)** - SvelteKit routing patterns
+ - **[search.md](./frontend/search.md)** - unified search with Cmd+K

  ### deployment
- - **[environments.md](./deployment/environments.md)** - staging vs production environments, automated deployment via GitHub Actions, CORS, secrets management
- - **[database-migrations.md](./deployment/database-migrations.md)** - automated migration workflow via fly.io release commands, alembic usage, safety procedures
+ - **[environments.md](./deployment/environments.md)** - staging vs production
+ - **[database-migrations.md](./deployment/database-migrations.md)** - alembic workflow

  ### tools
- - **[logfire.md](./tools/logfire.md)** - SQL query patterns for Logfire DataFusion database, finding exceptions, analyzing performance bottlenecks
- - **[neon.md](./tools/neon.md)** - Neon Postgres database management and best practices
- - **[pdsx.md](./tools/pdsx.md)** - ATProto PDS explorer and debugging tools
+ - **[logfire.md](./tools/logfire.md)** - SQL query patterns for observability
+ - **[neon.md](./tools/neon.md)** - postgres database management
+ - **[pdsx.md](./tools/pdsx.md)** - ATProto PDS explorer
+
+ ### atproto
+ - **[lexicons/](./lexicons/)** - record schemas (track, like, comment, list, profile)
+ - **[authentication.md](./authentication.md)** - OAuth 2.1 flow
+
+ ### moderation
+ - **[moderation/](./moderation/)** - copyright detection, sensitive content, labeler
+
+ ### testing
+ - **[testing/](./testing/)** - pytest patterns, parallel execution

  ### local development
- - **[setup.md](./local-development/setup.md)** - complete local development setup guide
+ - **[local-development/setup.md](./local-development/setup.md)** - getting started

- ## ATProto integration
+ ## architecture overview

  plyr.fm uses a hybrid storage model:
- - audio files stored in cloudflare R2 (scalable, CDN-backed)
- - metadata stored as ATProto records on user's PDS (decentralized, user-owned)
- - local database indexes for fast queries
+ - **audio files**: cloudflare R2 (CDN-backed, zero egress)
+ - **metadata**: ATProto records on user's PDS (decentralized, user-owned)
+ - **indexes**: neon postgres for fast queries

  key namespaces:
- - `fm.plyr.track` - track metadata (title, artist, album, features, image, audio file reference)
- - `fm.plyr.like` - user likes on tracks (subject references track URI)
+ - `fm.plyr.track` - track metadata
+ - `fm.plyr.like` - user likes
+ - `fm.plyr.comment` - timed comments
+ - `fm.plyr.list` - playlists and albums
+ - `fm.plyr.actor.profile` - artist profiles

  ## quick start

- ### current state
-
- plyr.fm is fully functional with:
- - ✅ OAuth 2.1 authentication (ATProto)
- - ✅ secure cookie-based sessions (HttpOnly, XSS protection)
- - ✅ R2 storage for audio files (cloudflare CDN)
- - ✅ track upload with streaming (prevents OOM)
- - ✅ ATProto record creation (fm.plyr.track namespace)
- - ✅ music player with queue management
- - ✅ liked tracks (fm.plyr.like namespace)
- - ✅ artist pages and track discovery
- - ✅ share buttons across track, album, and artist detail pages for quick copy-to-clipboard links
- - ✅ modular audio player with dedicated subcomponents for metadata, transport, progress, and volume controls
- - ✅ image uploads for track artwork
- - ✅ audio transcoding service (rust + ffmpeg)
- - ✅ server-sent events for upload progress
- - ✅ toast notifications
- - ✅ user preferences (accent color, auto-play)
- - ✅ keyboard shortcuts (Q for queue toggle)
-
- ### local development
-
- see **[local-development/setup.md](./local-development/setup.md)** for complete setup instructions.
-
- quick start:
  ```bash
  # backend
- uv run uvicorn backend.main:app --reload --host 0.0.0.0 --port 8001
+ just backend run

  # frontend
- cd frontend && bun run dev
+ just frontend dev

- # transcoder (optional)
- cd transcoder && just run
+ # run tests
+ just backend test
  ```

- ### deployment
-
- see **[deployment/environments.md](./deployment/environments.md)** for details on:
- - staging vs production environments
- - automated deployment via GitHub Actions
- - environment variables and secrets
-
- see **[deployment/database-migrations.md](./deployment/database-migrations.md)** for:
- - migration workflow and safety procedures
- - alembic usage and testing
-
- ## architecture decisions
-
- ### why R2 instead of PDS blobs?
-
- PDS blobs are designed for smaller files like images. audio files are:
- - larger (5-50MB per track)
- - require streaming
- - benefit from CDN distribution
-
- R2 provides:
- - scalable storage
- - free egress to cloudflare CDN
- - simple HTTP URLs
- - cost-effective (~$0.015/GB/month)
-
- ### why fm.plyr namespace?
-
- plyr.fm uses `fm.plyr.*` as the ATProto namespace:
- - `fm.plyr.track` for track metadata
- - `fm.plyr.like` for user likes
-
- this is a domain-specific lexicon that allows:
- - clear ownership and governance
- - faster iteration without formal approval
- - alignment with the plyr.fm brand
-
- ### why hybrid storage?
-
- storing metadata on ATProto provides:
- - user data sovereignty (users own their catalog)
- - decentralization (no single point of failure)
- - portability (users can move to another client)
-
- storing audio on R2 provides:
- - performance (fast streaming via CDN)
- - scalability (handles growth)
- - cost efficiency (cheaper than PDS blobs)
-
- ### why separate transcoder service?
-
- the transcoder runs as a separate rust service because:
- - ffmpeg operations are CPU-intensive and can block event loop
- - rust provides better performance for media processing
- - isolation prevents transcoding from affecting API latency
- - can scale independently from main backend
-
- ## testing
-
- plyr.fm uses pytest for backend testing:
-
- ```bash
- # run all tests
- just test
-
- # run specific test file
- just test tests/api/test_track_likes.py
-
- # run with verbose output
- just test -v
- ```
-
- test categories:
- - API endpoints (`tests/api/`)
- - storage backends (`tests/storage/`)
- - ATProto integration (`tests/test_atproto.py`)
- - audio format validation (`tests/test_audio_formats.py`)
-
- see [`tests/CLAUDE.md`](../tests/CLAUDE.md) for testing guidelines.
-
- ## troubleshooting
-
- ### R2 upload fails
-
- ```
- error: failed to upload to R2
- ```
-
- **check**:
- - R2 credentials in `.env`
- - bucket exists and is accessible
- - account ID is correct
-
- ### ATProto record creation fails
-
- ```
- error: failed to create atproto record
- ```
-
- **check**:
- - OAuth session is valid (not expired)
- - user has write permissions
- - PDS is accessible
- - record format is valid
-
- ### audio won't play
-
- ```
- 404: audio file not found
- ```
-
- **check**:
- - `STORAGE_BACKEND` matches actual storage
- - R2 bucket has public read access
- - file_id matches database record
-
- ## monitoring
-
- ### key metrics to track
-
- 1. **upload success rate**
-    - total uploads attempted
-    - successful R2 uploads
-    - successful record creations
-
- 2. **storage costs**
-    - total R2 storage (GB)
-    - monthly operations count
-    - estimated cost
-
- 3. **playback metrics**
-    - tracks played
-    - average stream duration
-    - errors/failures
-
- ### logging
-
- add structured logging for debugging:
-
- ```python
- import structlog
-
- logger = structlog.get_logger()
-
- logger.info(
-     "track_uploaded",
-     track_id=track.id,
-     r2_url=r2_url,
-     atproto_uri=atproto_uri,
- )
- ```
-
- ## security considerations
-
- ### audio file access
-
- **current**: R2 URLs are public (anyone with URL can access)
-
- **acceptable for MVP** because:
- - music is meant to be shared
- - no sensitive content
- - URL guessing is impractical (content-based hashes)
-
- **future enhancement**: signed URLs with expiration
-
- ### record ownership
-
- **enforced by ATProto**: only user with valid OAuth session can create records in their repo
-
- **enforced by backend**: tracks are associated with `artist_did` and only owner can delete
-
- ### rate limiting
-
- **recommended**: limit uploads to prevent abuse
- - 10 uploads per hour per user
- - 100MB total per hour per user
-
- ## cost estimates
-
- current monthly costs (~$15-20/month):
- - fly.io backend: $5-10/month (shared-cpu-1x, 256MB RAM)
- - fly.io transcoder: $5-10/month (shared-cpu-1x, 256MB RAM)
- - neon postgres: free tier (0.5GB storage, 3GB data transfer)
- - cloudflare R2: ~$0.16/month (6 buckets: audio-dev, audio-stg, audio-prod, images-dev, images-stg, images-prod)
- - cloudflare pages: free (frontend hosting)
-
- R2 storage scaling (audio + images):
- - 1,000 tracks: ~$0.16/month
- - 10,000 tracks: ~$1.58/month
- - 100,000 tracks: ~$15.81/month
-
- ## references
-
- ### ATProto documentation
-
- - [repository spec](https://atproto.com/specs/repository)
- - [lexicon spec](https://atproto.com/specs/lexicon)
- - [data model](https://atproto.com/specs/data-model)
- - [OAuth 2.1](https://atproto.com/specs/oauth)
-
- ### cloudflare documentation
-
- - [R2 overview](https://developers.cloudflare.com/r2/)
- - [R2 pricing](https://developers.cloudflare.com/r2/pricing/)
- - [S3 compatibility](https://developers.cloudflare.com/r2/api/s3/)
-
- ### plyr.fm project files
-
- - project instructions: `CLAUDE.md`
- - main readme: `README.md`
- - justfile: `justfile` (task runner)
- - backend: `src/backend/`
- - frontend: `frontend/`
- - transcoder: `transcoder/`
+ see [local-development/setup.md](./local-development/setup.md) for complete setup.

  ## contributing

- when working on plyr.fm:
-
- 1. **test empirically first** - run code and prove it works
- 2. **reference existing docs** - check docs directory before researching
- 3. **keep it simple** - MVP over perfection
- 4. **use lowercase** - respect plyr.fm's aesthetic
- 5. **no sprawl** - avoid creating multiple versions of files
- 6. **document decisions** - update docs as you work
-
- ## questions?
-
- if anything is unclear:
- - check the relevant phase document
- - review example projects in sandbox
- - consult ATProto official docs
- - look at your atproto fork implementation
+ 1. check docs before researching externally
+ 2. document decisions as you make them
+ 3. keep it simple - MVP over perfection
+ 4. use lowercase aesthetic
+29
docs/runbooks/README.md
···
+ # runbooks
+
+ operational procedures for production incidents.
+
+ ## available runbooks
+
+ - [connection-pool-exhaustion](connection-pool-exhaustion.md) - 500s everywhere, queue listener down, stuck connections
+
+ ## when to use
+
+ runbooks are for known failure modes with established remediation steps. if you encounter a new type of incident:
+
+ 1. stabilize first (restart machines if needed)
+ 2. investigate using [logfire](../tools/logfire.md)
+ 3. document the incident and create a new runbook
+
+ ## general troubleshooting
+
+ ```bash
+ # check machine status
+ fly status -a relay-api
+
+ # view recent logs
+ fly logs -a relay-api
+
+ # restart machines
+ fly machines list -a relay-api
+ fly machines restart <machine-id> -a relay-api
+ ```
+93
docs/runbooks/connection-pool-exhaustion.md
···
+ # connection pool exhaustion
+
+ ## symptoms
+
+ - 500 errors across multiple endpoints
+ - 30-second request timeouts
+ - logfire shows: `QueuePool limit of size 10 overflow 5 reached, connection timed out`
+ - queue listener logs: `queue listener connection lost, attempting reconnect`
+ - database connection errors mentioning multiple Neon IP addresses timing out
+
+ ## observed behavior (2025-12-08 incident)
+
+ evidence from logfire spans:
+
+ | time (UTC) | event | duration |
+ |------------|-------|----------|
+ | 06:32:40 | queue service connected | - |
+ | 06:32:50-06:33:29 | SQLAlchemy connects succeeding | 3-6ms |
+ | 06:33:36 | queue heartbeat times out | 5s timeout |
+ | 06:33:36-06:36:04 | ~2.5 min gap with no spans | - |
+ | 06:36:04 | GET /albums starts | hangs 24 min |
+ | 06:36:06 | GET /moderation starts | **succeeds in 14ms** |
+ | 06:36:06 | GET /auth/me starts | hangs 18 min |
+ | 06:36:31 | multiple requests | **succeed in 3-15ms** |
+
+ key observation: **some connections succeed in 3ms while others hang for 20+ minutes simultaneously**. the stuck connections show psycopg retrying across 12 different Neon IP addresses.
+
+ ## what we know
+
+ 1. the queue listener heartbeat (`SELECT 1`) times out after 5 seconds
+ 2. psycopg retries connection attempts across multiple IPs when one fails
+ 3. each IP retry has its own timeout, so total time = timeout × number of IPs
+ 4. some connections succeed immediately while others get stuck
+ 5. restarting the fly machines clears the stuck connections
+
+ ## what we don't know
+
+ - why some connections succeed while others fail simultaneously
+ - whether this is a Neon proxy issue, DNS issue, or application issue
+ - why psycopg doesn't give up after a reasonable total timeout
+
+ ## remediation
+
+ restart the fly machines to clear stuck connections:
+
+ ```bash
+ # list machines
+ fly machines list -a relay-api
+
+ # restart both machines
+ fly machines restart <machine-id-1> <machine-id-2> -a relay-api
+ ```
+
+ ## verification
+
+ check logfire for healthy spans after restart:
+
+ ```sql
+ SELECT
+   span_name,
+   message,
+   start_timestamp,
+   duration * 1000 as duration_ms,
+   otel_status_code
+ FROM records
+ WHERE deployment_environment = 'production'
+   AND start_timestamp > NOW() - INTERVAL '5 minutes'
+ ORDER BY start_timestamp DESC
+ LIMIT 30
+ ```
+
+ you should see:
+ - `queue service connected to database and listening`
+ - database queries completing in <50ms
+ - no ERROR status codes
+
+ ## incident history
+
+ - **2025-11-17**: first occurrence, queue listener hung indefinitely (fixed by adding timeout)
+ - **2025-12-02**: cold start variant, 10 errors (fixed by increasing pool size)
+ - **2025-12-08**: 37 errors in one hour, some connections stuck 20+ min while others worked
+
+ ## future investigation
+
+ - consider adding a total connection timeout that caps retries across all IPs
+ - investigate whether disabling IPv6 reduces retry time
+ - add monitoring/alerting for queue listener disconnects
+ - consider circuit breaker pattern to fail fast when connections are failing
+
+ ## related docs
+
+ - [connection pooling config](../backend/database/connection-pooling.md)
+ - [logfire querying guide](../tools/logfire.md)
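the runbook's "total time = timeout × number of IPs" point, and its suggestion to cap retries with a total connection timeout, can be illustrated with a small sketch. nothing here is psycopg or Neon API: `worst_case_seconds`, `connect_with_deadline`, and the injected `connect` callable are hypothetical, and the 30-second per-address timeout in the usage lines is an example value, not a measured one.

```python
import time
from typing import Any, Callable, Sequence


def worst_case_seconds(per_ip_timeout: float, ip_count: int) -> float:
    """upper bound when every address is tried in turn and each one times out."""
    return per_ip_timeout * ip_count


def connect_with_deadline(
    addresses: Sequence[str],
    connect: Callable[[str, float], Any],  # hypothetical: (address, timeout) -> connection
    per_ip_timeout: float,
    total_deadline: float,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """try addresses in order, but stop once the overall time budget is spent."""
    started = clock()
    last_error = None
    for address in addresses:
        remaining = total_deadline - (clock() - started)
        if remaining <= 0:
            break  # budget exhausted: fail fast instead of walking every IP
        try:
            # never give a single attempt more time than the budget has left
            return connect(address, min(per_ip_timeout, remaining))
        except OSError as exc:
            last_error = exc
    raise TimeoutError(f"no address connected within {total_deadline}s") from last_error


if __name__ == "__main__":
    # 12 addresses (the count seen in the incident) at an assumed 30s each
    print(worst_case_seconds(30.0, 12), "seconds without a total cap")
```

with a cap in place the worst case is bounded by `total_deadline` no matter how many addresses the hostname resolves to.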