Files
stack/scratchpads/fix-db-bootstrap-migrations.md
jason.woltje ac5650d9f9 fix(db): bootstrap migrations on local-tier gateway startup
Fresh `mosaic gateway install` (npm) left the gateway DB schema empty —
sign-in 500'd with `relation "users" does not exist`, and every entry
point (auth, bootstrap setup) failed because they all query the users
table first. Five stacked bugs on the local (PGlite) tier:

1. `packages/db/package.json` `files: ["dist"]` excluded the `drizzle/`
   SQL migrations from the published tarball.
2. `runMigrations()` only supports postgres-js — unusable for embedded
   PGlite.
3. `apps/gateway/src/database/database.module.ts` never invoked
   migrations at startup.
4. `createPgliteDb` didn't load pgvector, so migration 0001's
   `CREATE EXTENSION vector` failed.
5. Drizzle's PG migrator wraps every migration in one outer
   transaction, which trips Postgres' `check_safe_enum_use` on
   migration 0009 (`ALTER TYPE ADD VALUE 'pending'` → `SET DEFAULT
   'pending'` in the same tx).

Changes:
- Ship `drizzle/` in the published tarball.
- `createPgliteDb` loads `@electric-sql/pglite/vector`.
- New `runPgliteMigrations(handle)` walks the Drizzle journal and
  runs each statement-breakpoint chunk through PGlite's `client.exec()`
  (autocommit per statement). Records into `drizzle.__drizzle_migrations`
  for interop with the postgres-js path. Per-statement try/catch
  surfaces which statement of which migration failed.
- `DatabaseModule` runs migrations in `OnModuleInit` before
  `app.listen()`. Local tier: explicit `runPgliteMigrations` then
  `storageAdapter.migrate()`. Postgres tier: just `storageAdapter.migrate()`,
  which already calls `runMigrations(url)` internally — no double-call.
- Removed `packages/storage/src/test-utils/pglite-with-vector.ts`. The
  "intentionally not exported" rationale is moot now that migration
  0001 forces pgvector load anyway. The integration test uses
  `createPgliteDb` + `runPgliteMigrations` from `@mosaicstack/db`.

Tests: BetterAuth tables exist after migrate; idempotent (re-runs 0009);
partial-failure surfaces statement-level context and leaves no ledger row.

QA on a fresh PGlite install:
- `Applying PGlite schema migrations...` then `Initializing storage
  adapter (pglite)...` in startup log.
- `GET /api/bootstrap/status` → `{"needsSetup":true}` HTTP 200 (was 500).
- `POST /api/bootstrap/setup` reaches Zod validator (was 500).

Scope: this PR fixes the local (PGlite) tier. Postgres-tier first
install still has the outer-transaction problem and a journal ordering
bug (0009's `when` < 0008's). Documented inline as TODO and in the
scratchpad — needs a separate change with real-Postgres validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:06:50 -05:00

6.2 KiB

fix(db): bootstrap migrations on local-tier gateway startup

Problem

Fresh mosaic gateway install (npm-installed) leaves the gateway DB schema empty:

relation "users" does not exist

Sign-in 500s, auth users create says "Not signed in", admin/bootstrap setup also fails — every entry point queries users before doing anything else.

Scope

This PR fixes the local (PGlite) tier end-to-end. The postgres-tier path has additional pre-existing bugs (see "Known issues, out of scope" below) and needs a separate change with real Postgres validation.

Root causes addressed (5 stacked bugs on the local-tier path)

  1. packages/db/package.json files: ["dist"] — the drizzle/ SQL migrations folder is excluded from the published tarball. Even if a migrate runner existed, it would have nothing to apply.

  2. packages/db/src/migrate.ts only supports postgres-js. Local-tier gateways use embedded PGlite, which can't be reached over a postgres wire protocol — so runMigrations() is unusable for the local tier.

  3. apps/gateway/src/database/database.module.ts never invokes migrations at startup. The module creates the DB handle and storage adapter, but no consumer calls .migrate() on either. mosaic storage migrate CLI even claims "pglite runs schema setup automatically on first connection via adapter.migrate()" — but adapter.migrate() is only called by tests, never at runtime.

  4. createPgliteDb does not load the pgvector extension. Migration 0001 declares CREATE EXTENSION IF NOT EXISTS vector; for the insights.embedding column. Bare PGlite has no pgvector — the migration fails on extension control file lookup.

  5. Drizzle's PG migrator wraps every migration in one outer transaction. Migration 0009 does ALTER TYPE grant_status ADD VALUE 'pending' and then ALTER TABLE federation_grants ALTER COLUMN status SET DEFAULT 'pending'. Postgres' check_safe_enum_use rejects the second statement because the new enum value isn't committed yet. Splitting the migration into two files doesn't help — drizzle batches all migrations into one outer tx.

Fix

  • packages/db/package.json — ship drizzle/ in files.

  • packages/db/src/client-pglite.ts — load @electric-sql/pglite/vector.

  • packages/db/src/migrate.ts — add runPgliteMigrations(handle). Walks the Drizzle journal and runs each statement-breakpoint chunk through PGlite's client.exec() (Simple Query protocol → autocommit per statement). Writes to the standard drizzle.__drizzle_migrations ledger so the result is interoperable with runMigrations() on a postgres-backed deployment. Per-statement try/catch surfaces which statement of which migration failed and the ledger row is only written on full success.

  • packages/db/src/index.ts — re-export.

  • apps/gateway/src/database/database.module.ts — implement OnModuleInit:

    • Local tier → runPgliteMigrations(handle), then storageAdapter.migrate() (the local storage adapter has its own kv tables in a separate PGlite dir).
    • Postgres tier → storageAdapter.migrate() only, since PostgresAdapter.migrate() already calls runMigrations(url) against the same DATABASE_URL — we deliberately don't double-call.

    NestJS awaits onModuleInit before app.listen(), so DB-dependent modules see a populated schema before any HTTP traffic is accepted.

  • packages/storage/src/test-utils/pglite-with-vector.tsdeleted. The "intentionally not exported" rationale is moot now that migration 0001 forces pgvector load anyway. migrate-tier.integration.test.ts switched to createPgliteDb + runPgliteMigrations from @mosaicstack/db.

Tests

packages/db/src/migrate.test.ts:

  • Verifies runPgliteMigrations creates the BetterAuth tables (the original failure mode).
  • Idempotence (transitively re-runs migration 0009).
  • Partial-failure: pre-creates a conflicting users table, asserts the thrown error includes statement context (hash=… statement #N failed) and that no ledger row was written.

QA evidence

End-to-end on a fresh PGlite install:

  • [DatabaseModule] Applying PGlite schema migrations... then Initializing storage adapter (pglite)... in startup log.
  • GET /api/bootstrap/status{"needsSetup":true} HTTP 200 (was 500 with relation "users" does not exist).
  • POST /api/bootstrap/setup with empty body → HTTP 400 with Zod validation error (was 500), confirming the request reached the validator past the table-existence check.

Known issues, out of scope (file separately)

  • Postgres-tier first install is still broken. runMigrations() uses Drizzle's migratePostgres, which has the same outer-transaction problem as PGlite's migrator. A fresh standalone-tier install would also fail at migration 0009. Inline TODO in migrate.ts:31-35 flags this. Fixing it needs either (a) a shared per-statement loop reused for both drivers, or (b) splitting migration 0009.
  • drizzle/meta/_journal.json has 0009 ordered before 0008 (when values 1745280000000 < 1776822435828). migratePostgres skips by created_at < folderMillis, so on a postgres deployment that already applied 0008, 0009 would be skipped forever. Our hash-based skip in the PGlite path sidesteps this.
  • No advisory lock around the migration loop. Two gateway processes pointed at the same DATABASE_URL would race. PGlite is single-process by file lock so the local tier is fine; postgres-tier deployments should add pg_advisory_lock(<deterministic-id>) around the loop in a follow-up.
  • mosaic storage migrate CLI message is misleading — it claims "automatic on first connection via adapter.migrate()" but the adapter doesn't self-migrate. With this PR the gateway invokes it explicitly, but the CLI message could still be tightened.
  • Crash mid-migration leaves a partial-state PGlite DB without a ledger row. Detected loudly on next boot (the replay errors on "already exists"), but recovery is manual (drop the partially-applied objects or insert the migration hash into drizzle.__drizzle_migrations). A robust fix would add a "started_at" column to a sidecar table to detect half-applied state and refuse to start with actionable guidance.