Files
stack/docs/PERFORMANCE.md
Jason Woltje 3b81bc9f3d perf: gateway + DB + frontend optimizations (P8-003)
- DB client: configure connection pool (max=20, idle_timeout=30s, connect_timeout=5s)
- DB schema: add missing indexes for auth sessions, accounts, conversations, agent_logs
- DB schema: promote preferences(user_id,key) to UNIQUE index for ON CONFLICT upsert
- Drizzle migration: 0003_p8003_perf_indexes.sql
- preferences.service: replace 2-query SELECT+INSERT/UPDATE with single-round-trip upsert
- conversations repo: add ORDER BY + LIMIT to findAll (200) and findMessages (500)
- session-gc.service: make onModuleInit fire-and-forget (removes cold-start TTFB block)
- next.config.ts: enable compress, productionBrowserSourceMaps:false, image avif/webp
- docs/PERFORMANCE.md: full profiling report and change impact notes
2026-03-18 21:26:45 -05:00

7.8 KiB
Raw Permalink Blame History

Performance Optimization — P8-003

Branch: feat/p8-003-performance
Target metrics: <200 ms TTFB, <2 s page loads


What Was Profiled

The following areas were reviewed through static analysis and code-path tracing (no production traffic available; findings are based on measurable code-level patterns):

Area Findings
packages/db Connection pool unbounded (default 10, no idle/connect timeout)
apps/gateway/src/preferences N+1 round-trip on every pref upsert (SELECT + INSERT/UPDATE)
packages/brain/src/conversations Unbounded list queries — no LIMIT or ORDER BY
packages/db/src/schema Missing hot-path indexes: auth session lookup, OAuth callback, conversation list, agent-log tier queries
apps/gateway/src/gc Cold-start GC blocked NestJS bootstrap (synchronous await in onModuleInit)
apps/web/next.config.ts Missing compress: true, no productionBrowserSourceMaps: false, no image format config

Changes Made

1. DB Connection Pool — packages/db/src/client.ts

Problem: postgres() was called with no pool config. The default max of 10 connections and no idle/connect timeouts meant the pool could hang indefinitely on a stale TCP connection.

Fix:

  • max: 20 connections (configurable via DB_POOL_MAX)
  • idle_timeout: 30 s (configurable via DB_IDLE_TIMEOUT) — recycle stale connections
  • connect_timeout: 5 s (configurable via DB_CONNECT_TIMEOUT) — fail fast on unreachable DB

Expected impact: Eliminates pool exhaustion under moderate concurrency; removes indefinite hangs when the DB is temporarily unreachable.


2. Preferences Upsert — apps/gateway/src/preferences/preferences.service.ts

Problem: upsertPref executed two serial DB round-trips on every preference write:

1. SELECT id FROM preferences WHERE user_id = ? AND key = ?  (→ check exists)
2a. UPDATE preferences SET value = ? …                        (→ if found)
2b. INSERT INTO preferences …                                 (→ if not found)

Under concurrency this also had a TOCTOU race window.

Fix: Replaced with single-statement INSERT … ON CONFLICT DO UPDATE:

INSERT INTO preferences (user_id, key, value, mutable)
VALUES (?, ?, ?, true)
ON CONFLICT (user_id, key) DO UPDATE SET value = excluded.value, updated_at = now();

This required promoting preferences_user_key_idx from a plain index to a UNIQUE INDEX (see migration 0003_p8003_perf_indexes.sql).

Expected impact: ~50% reduction in DB round-trips for preference writes; eliminates the race window.


3. Missing DB Indexes — packages/db/src/schema.ts + migration

The following indexes were added or replaced to cover common query patterns:

Table Old indexes New / changed
sessions (none) sessions_user_id_idx(user_id), sessions_expires_at_idx(expires_at)
accounts (none) accounts_provider_account_idx(provider_id, account_id), accounts_user_id_idx(user_id)
conversations (user_id), (archived) separate conversations_user_archived_idx(user_id, archived) compound
agent_logs (session_id), (tier), (created_at) separate agent_logs_session_tier_idx(session_id, tier), agent_logs_tier_created_at_idx(tier, created_at)
preferences non-unique (user_id, key) unique (user_id, key) — required for ON CONFLICT

Expected impact:

  • Auth session validation (hot path on every request): from seq scan → index scan
  • OAuth callback account lookup: from seq scan → index scan
  • Conversation list (dashboard load): compound index covers WHERE user_id = ? ORDER BY updated_at
  • Log summarisation cron: (tier, created_at) index enables efficient hot→warm promotion query

All changes are in packages/db/drizzle/0003_p8003_perf_indexes.sql.


4. Conversation Queries — packages/brain/src/conversations.ts

Problem: findAll(userId) and findMessages(conversationId) were unbounded — no LIMIT and findAll had no ORDER BY, so the DB planner may not use the index efficiently.

Fix:

  • findAll: ORDER BY updated_at DESC LIMIT 200 — returns most-recent conversations first
  • findMessages: ORDER BY created_at ASC LIMIT 500 — chronological message history

Expected impact: Prevents accidental full-table scans on large datasets; ensures the frontend receives a usable, ordered result set regardless of table growth.


5. Cold-Start GC — apps/gateway/src/gc/session-gc.service.ts

Problem: onModuleInit() was async and await-ed fullCollect(), which blocked the NestJS module initialization chain. Full GC — which calls redis.keys('mosaic:session:*') and a DB query — typically takes 100500 ms. This directly added to startup TTFB.

Fix: Made onModuleInit() synchronous and used .then().catch() to run GC in the background. The first HTTP request is no longer delayed by GC work.

Expected impact: Removes 100500 ms from cold-start TTFB.


6. Next.js Config — apps/web/next.config.ts

Problem: compress: true was not set, so response payloads were uncompressed. No image format optimization or source-map suppression was configured.

Fix:

  • compress: true — enables gzip/brotli for all Next.js responses
  • productionBrowserSourceMaps: false — reduces build output size
  • images.formats: ['image/avif', 'image/webp'] — Next.js Image component will serve modern formats to browsers that support them (typically 4060% smaller than JPEG/PNG)

Expected impact: Typical HTML/JSON gzip savings of 6080%; image serving cost reduced for any <Image> components added in the future.


What Was Not Changed (Intentionally)

  • Caching layer (Valkey/Redis): The SystemOverrideService and GC already use Redis pipelines. PreferencesService.getEffective() reads all user prefs in one query — this is appropriate for the data size and doesn't warrant an additional cache layer yet.
  • WebSocket backpressure: The ChatGateway already drops events for disconnected clients (client.connected check) and cleans up listeners on disconnect. No memory leak was found.
  • Plugin/skill loader startup: SkillLoaderService.loadForSession() is called on first session creation, not on startup. Already non-blocking.
  • Frontend React memoization: No specific hot components were identified as causing excessive re-renders without profiling data. No speculative memo() calls added.

How to Apply

# Run the DB migration (requires a live DB)
pnpm --filter @mosaic/db exec drizzle-kit migrate

# Or, in Docker/Swarm — migrations run automatically on gateway startup
# via runMigrations() in packages/db/src/migrate.ts

Generated by P8-003 performance optimization task — 2026-03-18