docs/deployment/database-migrations.md at main · zzstoatzz.io/plyr.fm

zzstoatzz.io / plyr.fm
music on atproto plyr.fm
plyr.fm / docs / deployment / database-migrations.md
at main 21 kB view raw view rendered
  1# database migrations
  2
  3## current state (automated ✓)
  4
  5plyr.fm uses **automated database migrations** via fly.io's `release_command`.
  6
  7### how it works
  8
  9```
 101. developer creates migration locally
 11   ↓
 122. commit and push to main
 13   ↓
 143. github actions triggers deployment
 15   ↓
 164. fly.io builds docker image
 17   ↓
 185. fly.io runs release_command BEFORE deploying
 19   - temporary machine spins up
 20   - runs: uv run alembic upgrade head
 21   - if succeeds → proceed to deployment
 22   - if fails → abort, keep old version running
 23   ↓
 246. new app version deploys with updated schema
 25```
 26
 27### configuration
 28
 29**fly.toml:**
 30```toml
 31[deploy]
 32  release_command = "uv run alembic upgrade head"
 33```
 34
 35**benefits:**
 36- zero manual intervention required
 37- migrations run before new code serves traffic (no inconsistent state)
 38- automatic rollback if migration fails
 39- clear deployment logs showing migration output
 40
 41### how dev/prod database separation works
 42
 43plyr.fm uses **environment-based database configuration** to ensure migrations always target the correct database.
 44
 45**the key mechanism: `DATABASE_URL` environment variable**
 46
 47```
 48┌─────────────────────────────────────────────────────────────┐
 49│ alembic/env.py (migration runtime)                          │
 50│                                                              │
 51│ 1. imports backend.config.settings                          │
 52│ 2. reads settings.database.url                              │
 53│ 3. sets alembic connection string                           │
 54│                                                              │
 55│ config.set_main_option("sqlalchemy.url", settings.database.url)
 56└─────────────────────────────────────────────────────────────┘
 57                              │
 58                              ▼
 59┌─────────────────────────────────────────────────────────────┐
 60│ src/backend/config.py (pydantic-settings)                   │
 61│                                                              │
 62│ class Settings(BaseSettings):                               │
 63│     database_url: str = Field(                              │
 64│         default="postgresql+asyncpg://localhost/plyr"       │
 65│     )                                                        │
 66│                                                              │
 67│ reads from DATABASE_URL environment variable                │
 68└─────────────────────────────────────────────────────────────┘
 69                              │
 70                              ▼
 71              ┌───────────────┴───────────────┐
 72              │                               │
 73              ▼                               ▼
 74┌──────────────────────────┐    ┌──────────────────────────┐
 75│ local development        │    │ production (fly.io)      │
 76│                          │    │                          │
 77│ .env file:               │    │ fly secrets:             │
 78│ DATABASE_URL=            │    │ DATABASE_URL=            │
 79│   postgresql+asyncpg://  │    │   postgresql://          │
 80│   [neon dev connection]  │    │   [neon prod connection] │
 81│                          │    │                          │
 82│ when you run:            │    │ when fly.io runs:        │
 83│ uv run alembic upgrade   │    │ release_command:         │
 84│                          │    │ uv run alembic upgrade   │
 85│ → migrates DEV db        │    │ → migrates PROD db       │
 86└──────────────────────────┘    └──────────────────────────┘
 87```
 88
 89**why this is safe:**
 90
 911. **no shared configuration**: local and production environments have completely separate `DATABASE_URL` values
 922. **environment-specific secrets**: production database URL is stored in fly.io secrets, never in code
 933. **explicit context**: you cannot accidentally run a production migration locally because your local `DATABASE_URL` points to the Neon dev database
 944. **explicit context**: fly.io cannot run migrations against your dev database because it only knows about the production `DATABASE_URL`
 95
 96**concrete example:**
 97
 98```bash
 99# local development
100$ cat .env
101DATABASE_URL=postgresql+asyncpg://neon_user:***@ep-muddy-flower-98795112.us-east-2.aws.neon.tech/plyr-dev
102
103$ uv run alembic upgrade head
104# connects to neon dev database (plyr-dev)
105# migrates your development database
106
107# production (inside fly.io release machine)
108$ echo $DATABASE_URL
109postgresql://neon_user:***@ep-cold-butterfly-11920742.us-east-1.aws.neon.tech/plyr-prd
110
111$ uv run alembic upgrade head
112# connects to neon production database (plyr-prd)
113# migrates your production database
114```
115
116**migration flow for each environment:**
117
118local development:
119```
1201. developer edits model in src/backend/models/
1212. runs: uv run alembic revision --autogenerate -m "description"
1223. alembic reads DATABASE_URL from .env (neon dev)
1234. generates migration by comparing:
124   - current model state (code)
125   - current database state (neon dev database)
1265. runs: uv run alembic upgrade head
1276. migration applies to dev database
128```
129
130production deployment:
131```
1321. developer commits migration file to git
1332. pushes to main branch
1343. github actions triggers deployment
1354. fly.io builds docker image (includes migration files)
1365. fly.io starts temporary release machine
1376. release machine has DATABASE_URL from fly secrets (production neon)
1387. release machine runs: uv run alembic upgrade head
1398. alembic reads DATABASE_URL (production neon)
1409. migration applies to production database
14110. if successful, deployment proceeds
14211. if failed, deployment aborts (old version keeps running)
143```
144
145**test database (separate again):**
146
147tests use a third database entirely:
148
149```python
150# tests/conftest.py
151def test_database_url(worker_id: str) -> str:
152    return "postgresql+asyncpg://plyr_test:plyr_test@localhost:5433/plyr_test"
153```
154
155this ensures:
156- tests never touch dev or prod databases
157- tests can run in parallel (separate databases per worker)
158- test data is isolated and can be cleared between tests
159
160**the complete picture:**
161
162```
163┌──────────────────────────────────────────────────────────────┐
164│ four separate databases (three neon instances + local test): │
165│                                                               │
166│ 1. dev (neon: plyr-dev / muddy-flower-98795112)              │
167│    - for local development                                   │
168│    - set via .env: DATABASE_URL=postgresql+asyncpg://...     │
169│    - migrations run manually: uv run alembic upgrade head    │
170│                                                               │
171│ 2. staging (neon: plyr-stg / frosty-math-37367092)           │
172│    - for staging environment                                 │
173│    - set via fly secrets on relay-api-staging                │
174│    - migrations run automatically via release_command        │
175│                                                               │
176│ 3. prod (neon: plyr-prd / cold-butterfly-11920742)           │
177│    - for production traffic                                  │
178│    - set via fly secrets: DATABASE_URL=postgresql://...      │
179│    - migrations run automatically via release_command        │
180│                                                               │
181│ 4. test (localhost:5433/plyr_test)                           │
182│    - for automated tests only                                │
183│    - set via conftest.py fixture                             │
184│    - schema created from models directly (no migrations)     │
185│                                                               │
186│ these databases never interact or share configuration        │
187└──────────────────────────────────────────────────────────────┘
188```
189
190### recent pain points (2025-11-02)
191
192when deploying timezone support migration `31e69ba0c570`:
193
1941. **dockerfile didn't include migration files** - had to create PR #14 to add `COPY alembic.ini` and `COPY alembic ./alembic`
1952. **alembic version tracking out of sync** - production database had `user_preferences` table but alembic thought version was older, causing "relation already exists" errors
1963. **manual stamp needed** - had to run `flyctl ssh console -a relay-api -C "uv run alembic stamp 9e8c7aa5b945"` to fix version tracking
1974. **manual migration execution** - had to run `flyctl ssh console -a relay-api -C "uv run alembic upgrade head"` after deployment
1985. **blocked deployment** - couldn't deploy until all manual steps completed
199
200this took ~30 minutes of manual intervention for what should be automatic.
201
202## how reference project N does it (the right way)
203
204reference project N has a sophisticated automated migration system:
205
206### architecture overview
207
208```
209merge to main
210    ↓
211detect changed migration files (paths-filter)
212    ↓
213build docker image with migrations included
214    ↓
215run migrations via google cloud run jobs
216    ├─ separate job per database (auth, background, events)
217    ├─ jobs execute before app deployment
218    ├─ jobs block until migration completes
219    └─ deployment fails if migration fails
220    ↓
221update deployment manifests
222    ↓
223kubernetes/flux picks up new image
224    ↓
225app starts serving traffic with new schema
226```
227
228### key components
229
230**1. migration detection**
231```yaml
232# .github/paths-filter-pg-migrations.yml
233service_a:
234  - 'src/project/service_a/migrations/**'
235service_b:
236  - 'src/project/service_b/migrations/**'
237  - 'src/project/utilities/database.py'
238```
239
240uses `dorny/paths-filter@v3` to detect which migrations changed in a PR.
241
242**2. cloud run migration jobs**
243
244separate jobs for running migrations:
245- `service-a-db-migration-dev`
246- `service-a-db-migration-stg`
247- `service-b-events-migration-dev`
248- etc.
249
250jobs are updated with new image tag and executed with `--wait` flag.
251
252**3. python migration utilities**
253
254```python
255# from reference project N's database.py
256def alembic_upgrade(database: str, revision: str = "head"):
257    """Run alembic upgrades on database"""
258    import alembic.command
259    alembic.command.upgrade(alembic_config(database), revision)
260```
261
262clean python API for running migrations programmatically.
263
264**4. cli commands**
265
266```bash
267# reference project N provides CLI for manual operations
268project-cli database upgrade
269project-cli database downgrade -r <revision>
270project-cli database reset
271```
272
273**5. safety mechanisms**
274
275- migrations run in dedicated jobs (not in app containers)
276- `ALEMBIC_LOCK` prevents concurrent execution
277- global roles updated after migrations complete
278- separate environments (dev, stg, prod) test migrations before production
279- deployment fails if migration fails
280
281### why this works
282
283**isolation**: migrations run in separate containers, not in the app
284- prevents race conditions between multiple app instances
285- allows long-running migrations without blocking app startup
286- easier to debug when migrations fail
287
288**ordering**: migrations complete before app deployment
289- ensures schema is ready when app starts
290- no "app running on old schema" bugs
291- rollback is automatic (deployment doesn't proceed if migration fails)
292
293**observability**: cloud run jobs provide logs and status
294- easy to see what migration ran and when
295- clear failure messages
296- can re-run jobs manually if needed
297
298**automation**: zero manual intervention
299- no SSH required
300- no human error
301- consistent across environments
302
303## implementation history
304
305### what we tried
306
307**initial approach (broken)**: bash wrapper script
308```toml
309[deploy]
310  release_command = './scripts/migrate.sh'
311```
312
313**problem**: timed out during VM startup, likely due to script overhead or environment setup issues.
314
315**solution**: direct command execution
316```toml
317[deploy]
318  release_command = "uv run alembic upgrade head"
319```
320
321**result**: works perfectly. migrations complete in ~3 seconds, no timeouts.
322
323**key lesson**: fly.io's `release_command` works reliably when you give it a direct command instead of wrapping it in a shell script. the wrapper script was the problem, not VM resources or timeouts.
324
325### current implementation (working ✓)
326
327**fly.toml:**
328```toml
329[deploy]
330  release_command = "uv run alembic upgrade head"
331```
332
333**github actions (.github/workflows/deploy-backend.yml):**
334```yaml
335- name: detect changes
336  uses: dorny/paths-filter@v3
337  id: changes
338  with:
339    filters: .github/path-filters.yml
340
341- name: deploy to fly.io
342  run: |
343    if [ "${{ steps.changes.outputs.migrations }}" == "true" ]; then
344      echo "🔄 migrations detected - will run via release_command before deployment"
345    fi
346    flyctl deploy --remote-only
347```
348
349**path filters (.github/path-filters.yml):**
350```yaml
351migrations:
352  - "alembic/versions/**"
353  - "alembic/env.py"
354  - "alembic.ini"
355```
356
357this setup:
3581. detects when migrations change
3592. logs that migrations will run (for visibility)
3603. deploys normally
3614. fly.io automatically runs `release_command` before deployment
3625. migrations succeed → deployment proceeds
3636. migrations fail → deployment aborts, old version keeps running
364
365### alternative approaches considered
366
367**option: post-deployment github actions job**
368
369run migrations via SSH after fly deployment completes:
370
371```yaml
372- name: run database migrations
373  run: flyctl ssh console -a plyr-api -C "uv run alembic upgrade head"
374```
375
376**why we didn't use this**:
377- migrations run AFTER app starts (brief window where app has wrong schema)
378- requires implementing deployment polling logic
379- SSH from CI is a security consideration
380- fly.io `release_command` is simpler and runs migrations BEFORE deployment
381
382**option: neon branch-based migrations**
383
384use neon's branch features for zero-downtime migrations (test on branch, then promote):
385
386**why we didn't use this**:
387- adds complexity for marginal benefit at current scale
388- our migrations are simple and fast (~3 seconds)
389- can revisit when we have complex, long-running migrations
390
391## current capabilities
392
393**migration isolation via release_command** (already implemented):
394- fly.io's `release_command` runs migrations in a separate temporary machine before deployment
395- migrations complete before app serves traffic (no inconsistent state)
396- deployment automatically aborts if migration fails
397- this provides similar benefits to kubernetes init containers
398
399**multi-environment pipeline** (already implemented):
400- dev → staging → production progression via three neon databases
401- migrations tested locally against neon dev first
402- staging deployment validates migrations before production
403- automated via GitHub Actions
404
405## future considerations
406
407as plyr.fm scales, we may want to explore:
408
409**neon branch-based migrations** (for complex changes):
410- test migrations on database branch first
411- promote branch to production (instant swap)
412- zero downtime, instant rollback
413- useful for high-risk schema changes
414
415**automated smoke tests**:
416- run basic API health checks after migration completes
417- verify critical queries still work
418- alert if performance degrades significantly
419
420## migration best practices
421
422### before creating migration
423
4241. **check current state**
425   ```bash
426   # local
427   uv run alembic current
428
429   # production
430   flyctl ssh console -a relay-api -C "uv run alembic current"
431   ```
432
4332. **ensure schemas are in sync**
434   ```bash
435   # generate test migration
436   uv run alembic revision --autogenerate -m "test"
437
438   # if file is empty, schemas match
439   # if file has changes, schemas are out of sync (fix this first)
440
441   # remove test file
442   rm alembic/versions/*_test.py
443   ```
444
445### creating migration
446
4471. **make model changes**
448   - edit files in `src/backend/models/`
449   - keep changes focused (one logical change per migration)
450
4512. **generate migration**
452   ```bash
453   uv run alembic revision --autogenerate -m "descriptive name"
454   ```
455
4563. **review generated migration**
457   - alembic autogenerate is not perfect
458   - verify the upgrade() function does what you expect
459   - check for missing operations (autogenerate doesn't detect everything)
460   - ensure downgrade() is correct
461
4624. **test locally**
463   ```bash
464   # upgrade
465   uv run alembic upgrade head
466
467   # verify schema
468   uv run python -c "from backend.models import Track; print(Track.__table__.columns)"
469
470   # test downgrade
471   uv run alembic downgrade -1
472   uv run alembic upgrade head
473   ```
474
4755. **test with actual app**
476   ```bash
477   # start backend with new schema
478   uv run uvicorn backend.main:app --reload
479
480   # verify endpoints work
481   curl http://localhost:8001/health
482   ```
483
484### deploying migration
485
4861. **commit migration file only**
487   ```bash
488   git add alembic/versions/<hash>_<description>.py
489   git commit -m "migration: <description>"
490   ```
491
4922. **create PR and review**
493   - include migration details in PR description
494   - note any backward compatibility concerns
495   - tag for review if complex
496
4973. **merge and deploy**
498   - merge to main
499   - watch deployment
500   - run migration (manual for now, automated soon)
501
5024. **verify deployment**
503   ```bash
504   # check revision
505   flyctl ssh console -a relay-api -C "uv run alembic current"
506
507   # check app health
508   curl https://relay-api.fly.dev/health
509
510   # check logs
511   flyctl logs --app relay-api
512   ```
513
514### handling failed migrations
515
516**if migration fails during upgrade**:
517
5181. **check what failed**
519   ```bash
520   flyctl logs --app relay-api | grep alembic
521   ```
522
5232. **check database state**
524   ```bash
525   flyctl ssh console -a relay-api -C "uv run alembic current"
526   ```
527
5283. **fix the issue**
529   - if migration was partially applied, may need manual SQL to fix
530   - if migration didn't apply, fix and re-run
531   - if data is corrupted, may need to restore from backup
532
5334. **options**:
534
535   a. **downgrade and fix**:
536   ```bash
537   flyctl ssh console -a relay-api -C "uv run alembic downgrade -1"
538   # fix migration file locally
539   # commit and redeploy
540   ```
541
542   b. **create fix migration**:
543   ```bash
544   # create new migration that fixes the issue
545   uv run alembic revision -m "fix previous migration"
546   # implement fix in upgrade()
547   # commit and deploy
548   ```
549
550   c. **manual SQL fix**:
551   ```bash
552   flyctl ssh console -a relay-api
553   # connect to database
554   # run manual SQL to fix state
555   # stamp to correct revision
556   # exit
557   ```
558
559### zero-downtime migration patterns
560
561for production systems, some schema changes require special handling:
562
563**adding nullable columns** (safe):
564```python
565def upgrade():
566    op.add_column('tracks', sa.Column('new_field', sa.String(), nullable=True))
567```
568
569**adding non-nullable columns** (requires two-step):
570```python
571# migration 1: add nullable
572def upgrade():
573    op.add_column('tracks', sa.Column('new_field', sa.String(), nullable=True))
574
575# migration 2: populate and make non-null
576def upgrade():
577    op.execute("UPDATE tracks SET new_field = 'default' WHERE new_field IS NULL")
578    op.alter_column('tracks', 'new_field', nullable=False)
579```
580
581**dropping columns** (requires two-step):
582```python
583# migration 1: stop using column in app code
584# deploy app
585
586# migration 2: drop column
587def upgrade():
588    op.drop_column('tracks', 'old_field')
589```
590
591**renaming columns** (requires three-step):
592```python
593# migration 1: add new column
594def upgrade():
595    op.add_column('tracks', sa.Column('new_name', sa.String()))
596    op.execute("UPDATE tracks SET new_name = old_name")
597
598# migration 2: switch app to use new column
599# deploy app
600
601# migration 3: drop old column
602def upgrade():
603    op.drop_column('tracks', 'old_name')
604```
605
606## references
607
608### internal
609- plyr.fm dockerfile: `Dockerfile`
610- plyr.fm fly config: `fly.toml`
611- alembic config: `alembic.ini`
612- alembic env: `alembic/env.py`
613- models: `src/backend/models/`
614
615### external
616- alembic documentation: https://alembic.sqlalchemy.org/
617- fly.io release commands: https://fly.io/docs/reference/configuration/#release_command
618- fly.io ssh: https://fly.io/docs/flyctl/ssh/
619- neon branching: https://neon.tech/docs/guides/branching
620
621### github actions
622- paths-filter action: https://github.com/dorny/paths-filter
623- flyctl actions: https://github.com/superfly/flyctl-actions
624
625---
626
627**last updated**: 2025-12-05
628**status**: fully automated via fly.io release_command ✓
629**owner**: @zzstoatzz