Merge pull request 'perf: gateway + DB + frontend optimizations (P8-003)' (#211) from feat/p8-003-performance into main
Some checks failed
ci/woodpecker/push/ci Pipeline failed

Reviewed-on: mosaic/mosaic-stack#211
This commit was merged in pull request #211.
This commit is contained in:
2026-03-21 12:28:30 +00:00
11 changed files with 2823 additions and 88 deletions

164
docs/PERFORMANCE.md Normal file
View File

@@ -0,0 +1,164 @@
# Performance Optimization — P8-003
**Branch:** `feat/p8-003-performance`
**Target metrics:** <200 ms TTFB, <2 s page loads
---
## What Was Profiled
The following areas were reviewed through static analysis and code-path tracing
(no production traffic available; findings are based on measurable code-level patterns):
| Area | Findings |
| ---------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `packages/db` | Connection pool unbounded (default 10, no idle/connect timeout) |
| `apps/gateway/src/preferences` | N+1 round-trip on every pref upsert (SELECT + INSERT/UPDATE) |
| `packages/brain/src/conversations` | Unbounded list queries — no `LIMIT` or `ORDER BY` |
| `packages/db/src/schema` | Missing hot-path indexes: auth session lookup, OAuth callback, conversation list, agent-log tier queries |
| `apps/gateway/src/gc` | Cold-start GC blocked NestJS bootstrap (synchronous `await` in `onModuleInit`) |
| `apps/web/next.config.ts` | Missing `compress: true`, no `productionBrowserSourceMaps: false`, no image format config |
---
## Changes Made
### 1. DB Connection Pool — `packages/db/src/client.ts`
**Problem:** `postgres()` was called with no pool config. The default max of 10 connections
and no idle/connect timeouts meant the pool could hang indefinitely on a stale TCP connection.
**Fix:**
- `max`: 20 connections (configurable via `DB_POOL_MAX`)
- `idle_timeout`: 30 s (configurable via `DB_IDLE_TIMEOUT`) — recycle stale connections
- `connect_timeout`: 5 s (configurable via `DB_CONNECT_TIMEOUT`) — fail fast on unreachable DB
**Expected impact:** Eliminates pool exhaustion under moderate concurrency; removes indefinite
hangs when the DB is temporarily unreachable.
---
### 2. Preferences Upsert — `apps/gateway/src/preferences/preferences.service.ts`
**Problem:** `upsertPref` executed two serial DB round-trips on every preference write:
```
1. SELECT id FROM preferences WHERE user_id = ? AND key = ? (→ check exists)
2a. UPDATE preferences SET value = ? … (→ if found)
2b. INSERT INTO preferences … (→ if not found)
```
Under concurrency this also had a TOCTOU race window.
**Fix:** Replaced with single-statement `INSERT … ON CONFLICT DO UPDATE`:
```sql
INSERT INTO preferences (user_id, key, value, mutable)
VALUES (?, ?, ?, true)
ON CONFLICT (user_id, key) DO UPDATE SET value = excluded.value, updated_at = now();
```
This required promoting `preferences_user_key_idx` from a plain index to a `UNIQUE INDEX`
(see migration `0003_p8003_perf_indexes.sql`).
**Expected impact:** ~50% reduction in DB round-trips for preference writes; eliminates
the race window.
---
### 3. Missing DB Indexes — `packages/db/src/schema.ts` + migration
The following indexes were added or replaced to cover common query patterns:
| Table | Old indexes | New / changed |
| --------------- | ------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| `sessions` | _(none)_ | `sessions_user_id_idx(user_id)`, `sessions_expires_at_idx(expires_at)` |
| `accounts` | _(none)_ | `accounts_provider_account_idx(provider_id, account_id)`, `accounts_user_id_idx(user_id)` |
| `conversations` | `(user_id)`, `(archived)` separate | `conversations_user_archived_idx(user_id, archived)` compound |
| `agent_logs` | `(session_id)`, `(tier)`, `(created_at)` separate | `agent_logs_session_tier_idx(session_id, tier)`, `agent_logs_tier_created_at_idx(tier, created_at)` |
| `preferences` | non-unique `(user_id, key)` | **unique** `(user_id, key)` — required for `ON CONFLICT` |
**Expected impact:**
- Auth session validation (hot path on every request): from seq scan → index scan
- OAuth callback account lookup: from seq scan → index scan
- Conversation list (dashboard load): compound index covers `WHERE user_id = ? ORDER BY updated_at`
- Log summarisation cron: `(tier, created_at)` index enables efficient hot→warm promotion query
All changes are in `packages/db/drizzle/0003_p8003_perf_indexes.sql`.
---
### 4. Conversation Queries — `packages/brain/src/conversations.ts`
**Problem:** `findAll(userId)` and `findMessages(conversationId)` were unbounded — no `LIMIT`
and `findAll` had no `ORDER BY`, so the DB planner may not use the index efficiently.
**Fix:**
- `findAll`: `ORDER BY updated_at DESC LIMIT 200` — returns most-recent conversations first
- `findMessages`: `ORDER BY created_at ASC LIMIT 500` — chronological message history
**Expected impact:** Prevents accidental full-table scans on large datasets; ensures the
frontend receives a usable, ordered result set regardless of table growth.
---
### 5. Cold-Start GC — `apps/gateway/src/gc/session-gc.service.ts`
**Problem:** `onModuleInit()` was `async` and `await`-ed `fullCollect()`, which blocked the
NestJS module initialization chain. Full GC — which calls `redis.keys('mosaic:session:*')` and
a DB query — typically takes 100500 ms. This directly added to startup TTFB.
**Fix:** Made `onModuleInit()` synchronous and used `.then().catch()` to run GC in the
background. The first HTTP request is no longer delayed by GC work.
**Expected impact:** Removes 100500 ms from cold-start TTFB.
---
### 6. Next.js Config — `apps/web/next.config.ts`
**Problem:** `compress: true` was not set, so response payloads were uncompressed. No image
format optimization or source-map suppression was configured.
**Fix:**
- `compress: true` — enables gzip/brotli for all Next.js responses
- `productionBrowserSourceMaps: false` — reduces build output size
- `images.formats: ['image/avif', 'image/webp']` — Next.js Image component will serve modern
formats to browsers that support them (typically 4060% smaller than JPEG/PNG)
**Expected impact:** Typical HTML/JSON gzip savings of 6080%; image serving cost reduced
for any `<Image>` components added in the future.
---
## What Was Not Changed (Intentionally)
- **Caching layer (Valkey/Redis):** The `SystemOverrideService` and GC already use Redis
pipelines. `PreferencesService.getEffective()` reads all user prefs in one query — this
is appropriate for the data size and doesn't warrant an additional cache layer yet.
- **WebSocket backpressure:** The `ChatGateway` already drops events for disconnected clients
(`client.connected` check) and cleans up listeners on disconnect. No memory leak was found.
- **Plugin/skill loader startup:** `SkillLoaderService.loadForSession()` is called on first
session creation, not on startup. Already non-blocking.
- **Frontend React memoization:** No specific hot components were identified as causing
excessive re-renders without profiling data. No speculative `memo()` calls added.
---
## How to Apply
```bash
# Run the DB migration (requires a live DB)
pnpm --filter @mosaic/db exec drizzle-kit migrate
# Or, in Docker/Swarm — migrations run automatically on gateway startup
# via runMigrations() in packages/db/src/migrate.ts
```
---
_Generated by P8-003 performance optimization task — 2026-03-18_