Files
stack/scratchpads/fix-db-bootstrap-migrations.md
jason.woltje ac5650d9f9 fix(db): bootstrap migrations on local-tier gateway startup
Fresh `mosaic gateway install` (npm) left the gateway DB schema empty —
sign-in 500'd with `relation "users" does not exist`, and every entry
point (auth, bootstrap setup) failed because they all query the users
table first. Five stacked bugs on the local (PGlite) tier:

1. `packages/db/package.json` `files: ["dist"]` excluded the `drizzle/`
   SQL migrations from the published tarball.
2. `runMigrations()` only supports postgres-js — unusable for embedded
   PGlite.
3. `apps/gateway/src/database/database.module.ts` never invoked
   migrations at startup.
4. `createPgliteDb` didn't load pgvector, so migration 0001's
   `CREATE EXTENSION vector` failed.
5. Drizzle's PG migrator wraps every migration in one outer
   transaction, which trips Postgres' `check_safe_enum_use` on
   migration 0009 (`ALTER TYPE ADD VALUE 'pending'` → `SET DEFAULT
   'pending'` in the same tx).

Changes:
- Ship `drizzle/` in the published tarball.
- `createPgliteDb` loads `@electric-sql/pglite/vector`.
- New `runPgliteMigrations(handle)` walks the Drizzle journal and
  runs each statement-breakpoint chunk through PGlite's `client.exec()`
  (autocommit per statement). Records into `drizzle.__drizzle_migrations`
  for interop with the postgres-js path. Per-statement try/catch
  surfaces which statement of which migration failed.
- `DatabaseModule` runs migrations in `OnModuleInit` before
  `app.listen()`. Local tier: explicit `runPgliteMigrations` then
  `storageAdapter.migrate()`. Postgres tier: just `storageAdapter.migrate()`,
  which already calls `runMigrations(url)` internally — no double-call.
- Removed `packages/storage/src/test-utils/pglite-with-vector.ts`. The
  "intentionally not exported" rationale is moot now that migration
  0001 forces pgvector load anyway. The integration test uses
  `createPgliteDb` + `runPgliteMigrations` from `@mosaicstack/db`.

Tests: BetterAuth tables exist after migrate; idempotent (re-runs 0009);
partial-failure surfaces statement-level context and leaves no ledger row.

QA on a fresh PGlite install:
- `Applying PGlite schema migrations...` then `Initializing storage
  adapter (pglite)...` in startup log.
- `GET /api/bootstrap/status` → `{"needsSetup":true}` HTTP 200 (was 500).
- `POST /api/bootstrap/setup` reaches Zod validator (was 500).

Scope: this PR fixes the local (PGlite) tier. Postgres-tier first
install still has the outer-transaction problem and a journal ordering
bug (0009's `when` < 0008's). Documented inline as TODO and in the
scratchpad — needs a separate change with real-Postgres validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:06:50 -05:00

126 lines
6.2 KiB
Markdown

# fix(db): bootstrap migrations on local-tier gateway startup
## Problem
Fresh `mosaic gateway install` (npm-installed) leaves the gateway DB schema empty:
```
relation "users" does not exist
```
Sign-in 500s, `auth users create` says "Not signed in", `admin/bootstrap setup`
also fails — every entry point queries `users` before doing anything else.
## Scope
This PR fixes the **local (PGlite) tier** end-to-end. The postgres-tier path
has additional pre-existing bugs (see "Known issues, out of scope" below) and
needs a separate change with real Postgres validation.
## Root causes addressed (5 stacked bugs on the local-tier path)
1. **`packages/db/package.json` `files: ["dist"]`** — the `drizzle/` SQL
migrations folder is excluded from the published tarball. Even if a
migrate runner existed, it would have nothing to apply.
2. **`packages/db/src/migrate.ts`** only supports `postgres-js`. Local-tier
gateways use embedded PGlite, which can't be reached over a postgres wire
protocol — so `runMigrations()` is unusable for the local tier.
3. **`apps/gateway/src/database/database.module.ts`** never invokes
migrations at startup. The module creates the DB handle and storage
adapter, but no consumer calls `.migrate()` on either. `mosaic storage
migrate` CLI even claims "pglite runs schema setup automatically on first
connection via `adapter.migrate()`" — but `adapter.migrate()` is only
called by tests, never at runtime.
4. **`createPgliteDb` does not load the pgvector extension.** Migration 0001
declares `CREATE EXTENSION IF NOT EXISTS vector;` for the
`insights.embedding` column. Bare PGlite has no pgvector — the migration
fails on extension control file lookup.
5. **Drizzle's PG migrator wraps every migration in one outer transaction.**
Migration 0009 does `ALTER TYPE grant_status ADD VALUE 'pending'` and then
`ALTER TABLE federation_grants ALTER COLUMN status SET DEFAULT 'pending'`.
Postgres' `check_safe_enum_use` rejects the second statement because the
new enum value isn't committed yet. Splitting the migration into two
files doesn't help — drizzle batches all migrations into one outer tx.
## Fix
- `packages/db/package.json` — ship `drizzle/` in `files`.
- `packages/db/src/client-pglite.ts` — load `@electric-sql/pglite/vector`.
- `packages/db/src/migrate.ts` — add `runPgliteMigrations(handle)`. Walks the
Drizzle journal and runs each statement-breakpoint chunk through PGlite's
`client.exec()` (Simple Query protocol → autocommit per statement). Writes
to the standard `drizzle.__drizzle_migrations` ledger so the result is
interoperable with `runMigrations()` on a postgres-backed deployment.
Per-statement try/catch surfaces which statement of which migration failed
and the ledger row is only written on full success.
- `packages/db/src/index.ts` — re-export.
- `apps/gateway/src/database/database.module.ts` — implement `OnModuleInit`:
- Local tier → `runPgliteMigrations(handle)`, then `storageAdapter.migrate()`
(the local storage adapter has its own kv tables in a separate PGlite dir).
- Postgres tier → `storageAdapter.migrate()` only, since
`PostgresAdapter.migrate()` already calls `runMigrations(url)` against
the same DATABASE_URL — we deliberately don't double-call.
NestJS awaits `onModuleInit` before `app.listen()`, so DB-dependent modules
see a populated schema before any HTTP traffic is accepted.
- `packages/storage/src/test-utils/pglite-with-vector.ts`**deleted**.
The "intentionally not exported" rationale is moot now that migration 0001
forces pgvector load anyway. `migrate-tier.integration.test.ts` switched
to `createPgliteDb` + `runPgliteMigrations` from `@mosaicstack/db`.
## Tests
`packages/db/src/migrate.test.ts`:
- Verifies `runPgliteMigrations` creates the BetterAuth tables (the original
failure mode).
- Idempotence (transitively re-runs migration 0009).
- Partial-failure: pre-creates a conflicting `users` table, asserts the
thrown error includes statement context (`hash=… statement #N failed`)
and that no ledger row was written.
## QA evidence
End-to-end on a fresh PGlite install:
- `[DatabaseModule] Applying PGlite schema migrations...` then
`Initializing storage adapter (pglite)...` in startup log.
- `GET /api/bootstrap/status``{"needsSetup":true}` HTTP 200 (was 500
with `relation "users" does not exist`).
- `POST /api/bootstrap/setup` with empty body → HTTP 400 with Zod
validation error (was 500), confirming the request reached the
validator past the table-existence check.
## Known issues, out of scope (file separately)
- **Postgres-tier first install is still broken.** `runMigrations()` uses
Drizzle's `migratePostgres`, which has the same outer-transaction problem
as PGlite's migrator. A fresh standalone-tier install would also fail at
migration 0009. Inline TODO in `migrate.ts:31-35` flags this. Fixing it
needs either (a) a shared per-statement loop reused for both drivers, or
(b) splitting migration 0009.
- **`drizzle/meta/_journal.json` has 0009 ordered before 0008** (`when`
values `1745280000000` < `1776822435828`). `migratePostgres` skips by
`created_at < folderMillis`, so on a postgres deployment that already
applied 0008, 0009 would be skipped forever. Our hash-based skip in the
PGlite path sidesteps this.
- **No advisory lock around the migration loop.** Two gateway processes
pointed at the same DATABASE_URL would race. PGlite is single-process by
file lock so the local tier is fine; postgres-tier deployments should add
`pg_advisory_lock(<deterministic-id>)` around the loop in a follow-up.
- **`mosaic storage migrate` CLI message is misleading** — it claims
"automatic on first connection via adapter.migrate()" but the adapter
doesn't self-migrate. With this PR the gateway invokes it explicitly, but
the CLI message could still be tightened.
- **Crash mid-migration leaves a partial-state PGlite DB without a ledger
row.** Detected loudly on next boot (the replay errors on "already
exists"), but recovery is manual (drop the partially-applied objects or
insert the migration hash into `drizzle.__drizzle_migrations`). A robust
fix would add a "started_at" column to a sidecar table to detect
half-applied state and refuse to start with actionable guidance.