diff --git a/docs/design/MS22-DB-CENTRIC-ARCHITECTURE.md b/docs/design/MS22-DB-CENTRIC-ARCHITECTURE.md
index 22bc0d0..fb9f58d 100644
--- a/docs/design/MS22-DB-CENTRIC-ARCHITECTURE.md
+++ b/docs/design/MS22-DB-CENTRIC-ARCHITECTURE.md
@@ -1,164 +1,253 @@
-# MS22 Phase 1: DB-Centric Agent Fleet Architecture
+# MS22 Phase 1: DB-Centric Multi-User Agent Architecture

 ## Design Principles

-1. **Minimal env vars** — Only `DATABASE_URL` and `MOSAIC_SECRET_KEY` needed to start
-2. **DB-centric config** — All runtime config lives in Postgres, managed via WebUI
-3. **Mosaic is the gatekeeper** — Users never talk to OpenClaw directly
-4. **Onboarding-first** — Breakglass user + wizard on first boot, no manual config files
-5. **Generic product** — No hardcoded agent names, models, providers, or endpoints
+1. **2 env vars to bootstrap** — `DATABASE_URL` + `MOSAIC_SECRET_KEY`
+2. **DB-centric config** — All runtime config in Postgres, managed via WebUI
+3. **Mosaic is the gatekeeper** — Users authenticate to Mosaic, never to OpenClaw directly
+4. **Per-user agent isolation** — Each user gets their own OpenClaw container(s) with their own credentials
+5. **Onboarding-first** — Breakglass user + wizard on first boot
+6. **Generic product** — No hardcoded names, models, providers, or endpoints

-## Bootstrap Flow
+## Architecture Overview

 ```
-docker stack deploy (2 env vars)
-        │
-        ▼
-┌─────────────────────┐
-│  Postgres migration │  ← creates tables, no seed data
-└─────────────────────┘
-        │
-        ▼
-┌─────────────────────┐
-│  User opens WebUI   │  ← detects empty config
-└─────────────────────┘
-        │
-        ▼
-┌─────────────────────────────────────────────┐
-│              ONBOARDING WIZARD              │
-│                                             │
-│  Step 1: Create breakglass admin            │
-│          (username + password → bcrypt)     │
-│                                             │
-│  Step 2: Configure OIDC (optional)          │
-│          (provider URL, client ID, secret)  │
-│                                             │
-│  Step 3: Add LLM provider                   │
-│          (type, API key, endpoint, test)    │
-│                                             │
-│  Step 4: Configure agents                   │
-│          (roles, model assignments)         │
-│          Auto-generates gateway tokens      │
-│                                             │
-│  Step 5: Deploy summary + health check      │
-└─────────────────────────────────────────────┘
-        │
-        ▼
-┌─────────────────────┐
-│  Agents pick up     │  ← GET /api/internal/agent-config/:name
-│  config from DB     │
-└─────────────────────┘
+┌─────────────────────────────────────────────────────┐
+│                    MOSAIC WEBUI                     │
+│    (Auth: breakglass local + OIDC via settings)    │
+└──────────────────────┬──────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────┐
+│                     MOSAIC API                      │
+│                                                     │
+│  ┌──────────────┐  ┌────────────────┐  ┌─────────┐  │
+│  │  Onboarding  │  │   Container    │  │ Config  │  │
+│  │   Wizard     │  │ Lifecycle Mgr  │  │  Store  │  │
+│  └──────────────┘  └───────┬────────┘  └─────────┘  │
+│                            │                        │
+└────────────────────────────┼────────────────────────┘
+                             │ Docker API
+          ┌──────────────────┼──────────────────┐
+          │                  │                  │
+          ▼                  ▼                  ▼
+   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
+   │  OpenClaw   │    │  OpenClaw   │    │  OpenClaw   │
+   │  User A     │    │  User B     │    │  System     │
+   │             │    │             │    │  (admin)    │
+   │ Claude Max  │    │ Z.ai key    │    │ Shared key  │
+   │ own memory  │    │ own memory  │    │ monitoring  │
+   └─────────────┘    └─────────────┘    └─────────────┘
+    Scale to zero      Scale to zero        Always on
+     after idle         after idle
 ```

+## Container Lifecycle
+
+### User containers (on-demand)
+
+1. User logs in → Mosaic checks `UserContainer` table
+2. No running container → Mosaic calls Docker API to create one
+3. Injects user's encrypted API keys via config endpoint
+4. Routes chat requests to user's container
+5. Idle timeout (configurable, default 30min) → scale to zero
+6. State volume persists (sessions, memory, auth tokens)
+7. Next request → container restarts, picks up state from volume
+
+### System containers (always-on, optional)
+
+- Admin-provisioned for system tasks (monitoring, scheduled jobs)
+- Use admin-configured shared API keys
+- Not tied to any user
+
 ## Auth Layers

-| Flow | Method | Details |
-| ------------------------------ | -------------------------- | ----------------------------------------------------- |
-| User → Mosaic WebUI | Breakglass (local) or OIDC | Breakglass always available as fallback |
-| Mosaic API → OpenClaw | Bearer token | Auto-generated per agent, stored encrypted in DB |
-| OpenClaw → Mosaic API (config) | Bearer token | Same agent token, validated by Mosaic |
-| OpenClaw → LLM providers | API keys | Stored encrypted in DB, delivered via config endpoint |
-| Admin → Settings | RBAC | Admin role required for provider/agent/OIDC config |
+| Flow | Method |
+| ------------------------------- | ---------------------------------------------------------------------- |
+| User → Mosaic WebUI | Breakglass (local) or OIDC (configured in settings) |
+| Mosaic API → OpenClaw container | Bearer token (auto-generated per container, stored encrypted in DB) |
+| OpenClaw → LLM providers | User's own API keys (delivered via config endpoint, decrypted from DB) |
+| Admin → System settings | RBAC (admin role required) |
+| Internal config endpoint | Bearer token (container authenticates to fetch its config) |

-## Database Schema (new tables)
+## Database Schema

-### `SystemConfig`
-
-Key-value store for global settings (singleton-ish).
+### System Tables

 ```prisma
 model SystemConfig {
   id        String   @id @default(cuid())
-  key       String   @unique // "oidc.issuerUrl", "oidc.clientId", "onboarding.completed", etc.
+  key       String   @unique // "oidc.issuerUrl", "oidc.clientId", "onboarding.completed"
   value     String   // plaintext or encrypted (prefix: "enc:")
   encrypted Boolean  @default(false)
   updatedAt DateTime @updatedAt
 }
-```

-### `LlmProvider`
-
-LLM provider configurations.
-
-```prisma
-model LlmProvider {
-  id          String   @id @default(cuid())
-  name        String   @unique // "zai", "openai", "anthropic", "ollama-local", etc.
-  displayName String   // "Z.ai", "OpenAI", "Local Ollama"
-  type        String   // "zai" | "openai" | "anthropic" | "ollama" | "custom"
-  baseUrl     String?  // null for built-in providers, URL for custom/ollama
-  apiKey      String?  // encrypted
-  apiType     String   @default("openai-completions") // openai-completions | anthropic-messages | etc.
-  models      Json     @default("[]") // available model list [{id, name, contextWindow, maxTokens}]
-  isActive    Boolean  @default(true)
-  createdAt   DateTime @default(now())
-  updatedAt   DateTime @updatedAt
-  agents      AgentModelAssignment[]
-}
-```
-
-### `AgentConfig`
-
-Per-agent configuration (replaces old OpenClawAgent).
-
-```prisma
-model AgentConfig {
-  id              String   @id @default(cuid())
-  name            String   @unique // "mosaic-main", "mosaic-projects", etc.
-  displayName     String   // "Main Orchestrator", "Projects", etc.
-  role            String   // "orchestrator" | "developer" | "researcher" | "operations"
-  gatewayUrl      String   // internal Docker URL: "http://mosaic-main:18789"
-  gatewayToken    String   // encrypted — auto-generated
-  isActive        Boolean  @default(true)
-  personality     String?  // SOUL.md content for this agent
-  toolPermissions Json     @default("[]") // allowed tool list
-  createdAt       DateTime @default(now())
-  updatedAt       DateTime @updatedAt
-  modelAssignment AgentModelAssignment?
-}
-```
-
-### `AgentModelAssignment`
-
-Links agents to providers and models.
-
-```prisma
-model AgentModelAssignment {
-  id           String      @id @default(cuid())
-  agentId      String      @unique
-  agent        AgentConfig @relation(fields: [agentId], references: [id])
-  providerId   String
-  provider     LlmProvider @relation(fields: [providerId], references: [id])
-  primaryModel String      // "glm-5", "claude-sonnet-4-6", "cogito", etc.
-  fallbacks    Json        @default("[]") // [{providerId, model}]
-  updatedAt    DateTime    @updatedAt
-}
-```
-
-### `BreakglassUser`
-
-Local admin user (no OIDC dependency).
-
-```prisma
 model BreakglassUser {
   id           String   @id @default(cuid())
   username     String   @unique
-  passwordHash String // bcrypt
+  passwordHash String   // bcrypt
   isActive     Boolean  @default(true)
   createdAt    DateTime @default(now())
   updatedAt    DateTime @updatedAt
 }
 ```

+### Provider Tables (per-user)
+
+```prisma
+model LlmProvider {
+  id          String   @id @default(cuid())
+  userId      String   // owner — each user manages their own providers
+  name        String   // "my-zai", "work-openai", "local-ollama"
+  displayName String   // "Z.ai", "OpenAI (Work)", "Local Ollama"
+  type        String   // "zai" | "openai" | "anthropic" | "ollama" | "custom"
+  baseUrl     String?  // null for built-in, URL for custom/ollama
+  apiKey      String?  // encrypted
+  apiType     String   @default("openai-completions")
+  models      Json     @default("[]") // [{id, name, contextWindow, maxTokens}]
+  isActive    Boolean  @default(true)
+  createdAt   DateTime @default(now())
+  updatedAt   DateTime @updatedAt
+
+  @@unique([userId, name])
+}
+```
+
+### Container Tables
+
+```prisma
+model UserContainer {
+  id             String    @id @default(cuid())
+  userId         String    @unique
+  containerId    String?   // Docker container ID (null = not running)
+  containerName  String    // "mosaic-user-{userId}"
+  gatewayPort    Int?      // assigned port (null = not running)
+  gatewayToken   String    // encrypted — auto-generated
+  status         String    @default("stopped") // "running" | "stopped" | "starting" | "error"
+  lastActiveAt   DateTime?
+  idleTimeoutMin Int       @default(30)
+  config         Json      @default("{}") // cached openclaw.json for this user
+  createdAt      DateTime  @default(now())
+  updatedAt      DateTime  @updatedAt
+}
+
+model SystemContainer {
+  id           String   @id @default(cuid())
+  name         String   @unique // "mosaic-system-ops", "mosaic-system-monitor"
+  role         String   // "operations" | "monitor" | "scheduler"
+  containerId  String?
+  gatewayPort  Int?
+  gatewayToken String   // encrypted
+  status       String   @default("stopped")
+  providerId   String?  // references admin-level LlmProvider
+  primaryModel String   // "zai/glm-5", etc.
+  isActive     Boolean  @default(true)
+  createdAt    DateTime @default(now())
+  updatedAt    DateTime @updatedAt
+}
+```
+
+### User Agent Preferences
+
+```prisma
+model UserAgentConfig {
+  id             String   @id @default(cuid())
+  userId         String   @unique
+  primaryModel   String?  // user's preferred model
+  fallbackModels Json     @default("[]")
+  personality    String?  // custom SOUL.md content
+  providerId     String?  // default provider for this user
+  createdAt      DateTime @default(now())
+  updatedAt      DateTime @updatedAt
+}
+```
+
 ## Internal Config Endpoint

-`GET /api/internal/agent-config/:agentName`
+`GET /api/internal/agent-config/:containerType/:id`

-- Auth: Bearer token (agent's own gateway token)
-- Returns: Complete `openclaw.json` generated from DB tables
-- Includes: model config, provider credentials (decrypted), tool permissions
+- Auth: Bearer token (container's own gateway token)
+- Returns: Complete `openclaw.json` generated from DB
+- For user containers: includes user's providers, model prefs, personality
+- For system containers: includes admin provider config

-## Docker Compose (simplified)
+Response assembles openclaw.json dynamically:
+
+```json
+{
+  "gateway": { "mode": "local", "port": 18789, "bind": "lan", "auth": { "mode": "token" } ... },
+  "agents": { "defaults": { "model": { "primary": "" } } },
+  "models": { "providers": { "": { ... } } }
+}
+```
+
+## Container Lifecycle Manager
+
+NestJS service that manages Docker containers:
+
+```typescript
+class ContainerLifecycleService {
+  // Create and start a user's OpenClaw container
+  async ensureRunning(userId: string): Promise<{ url: string; token: string }>;
+
+  // Stop idle containers (called by cron/scheduler)
+  async reapIdle(): Promise<void>;
+
+  // Stop a specific user's container
+  async stop(userId: string): Promise<void>;
+
+  // Health check all running containers
+  async healthCheckAll(): Promise<void>;
+
+  // Restart container with updated config
+  async restart(userId: string): Promise<void>;
+}
+```
+
+Uses Docker Engine API (`/var/run/docker.sock` or TCP) via `dockerode` npm package.
+
+## Onboarding Wizard
+
+### First-Boot Detection
+
+- API checks: `SystemConfig.get("onboarding.completed")` → null = first boot
+- WebUI redirects to `/onboarding` if not completed
+
+### Steps
+
+**Step 1: Create Breakglass Admin**
+
+- Username + password → bcrypt → `BreakglassUser` table
+- This user always works, even if OIDC is misconfigured
+
+**Step 2: Configure Authentication (optional)**
+
+- OIDC: provider URL, client ID, client secret → encrypted in `SystemConfig`
+- Skip = breakglass-only auth (can add OIDC later in settings)
+
+**Step 3: Add Your First LLM Provider**
+
+- Pick type → enter API key/endpoint → test connection → save to `LlmProvider`
+- This becomes the admin's default provider
+
+**Step 4: System Agents (optional)**
+
+- Configure always-on system agents for monitoring/ops
+- Or skip — users can just use their own personal agents
+
+**Step 5: Complete**
+
+- Sets `SystemConfig("onboarding.completed") = true`
+- Redirects to dashboard
+
+### Post-Onboarding: User Self-Service
+
+- Each user adds their own LLM providers in profile settings
+- Each user configures their preferred model, personality
+- First chat request triggers container creation
+
+## Docker Compose (final)

 ```yaml
 services:
@@ -167,86 +256,101 @@ services:
     environment:
       DATABASE_URL: ${DATABASE_URL}
       MOSAIC_SECRET_KEY: ${MOSAIC_SECRET_KEY}
+    volumes:
+      - /var/run/docker.sock:/var/run/docker.sock # Docker API access
+    networks:
+      - internal

   mosaic-web:
     image: mosaic/web:latest
     environment:
       NEXT_PUBLIC_API_URL: http://mosaic-api:4000
+    networks:
+      - internal

-  mosaic-main:
-    image: alpine/openclaw:latest
-    command: ["/config/entrypoint.sh"]
+  postgres:
+    image: postgres:17
     environment:
-      DATABASE_URL: ${DATABASE_URL}
-      MOSAIC_API_URL: http://mosaic-api:4000
-      MOSAIC_SECRET_KEY: ${MOSAIC_SECRET_KEY}
-      AGENT_NAME: mosaic-main
+      POSTGRES_DB: mosaic
+      POSTGRES_USER: mosaic
+      POSTGRES_PASSWORD: ${DATABASE_PASSWORD}
     volumes:
-      - mosaic-main-state:/home/node/.openclaw
+      - postgres-data:/var/lib/postgresql/data
+    networks:
+      - internal

-  # Additional agents follow same pattern, only AGENT_NAME differs
+  # System agent (optional, admin-provisioned)
+  # mosaic-system:
+  #   image: alpine/openclaw:latest
+  #   ... (managed by ContainerLifecycleService)
+
+  # User containers are NOT in this file —
+  # they are dynamically created by ContainerLifecycleService
+  # via the Docker API at runtime.
+
+networks:
+  internal:
+    driver: overlay
+
+volumes:
+  postgres-data:
 ```

-### Entrypoint (simplified)
+Note: User OpenClaw containers are **not** defined in docker-compose. They are
+created dynamically by the `ContainerLifecycleService` when users start chatting.
+
+## Entrypoint (for dynamically created containers)

 ```sh
 #!/bin/sh
-# Fetch config from Mosaic API, write openclaw.json, start gateway
-CONFIG=$(curl -sf "${MOSAIC_API_URL}/api/internal/agent-config/${AGENT_NAME}" \
-  -H "Authorization: Bearer ${MOSAIC_SECRET_KEY}")
-echo "$CONFIG" > /tmp/openclaw.json
+set -e
+: "${MOSAIC_API_URL:?required}"
+: "${AGENT_TOKEN:?required}"
+: "${AGENT_ID:?required}"
+
+# Fetch config from Mosaic API
+curl -sf "${MOSAIC_API_URL}/api/internal/agent-config/${AGENT_ID}" \
+  -H "Authorization: Bearer ${AGENT_TOKEN}" \
+  -o /tmp/openclaw.json
+
 export OPENCLAW_CONFIG_PATH=/tmp/openclaw.json
 exec openclaw gateway run --bind lan --auth token
 ```

+Container env vars (injected by ContainerLifecycleService):
+
+- `MOSAIC_API_URL` — internal API URL
+- `AGENT_TOKEN` — this container's bearer token (from DB)
+- `AGENT_ID` — container ID for config lookup
+
+## Config Update Strategy
+
+When a user changes settings (model, provider, personality):
+
+1. Mosaic API updates DB
+2. API calls `ContainerLifecycleService.restart(userId)`
+3. Container restarts, fetches fresh config from API
+4. OpenClaw gateway starts with new config
+5. State volume preserves sessions/memory across restarts
+
 ## Task Breakdown

-### Phase 1a: DB Schema + Internal Config API
+| Task | Phase | Scope | Dependencies |
+| -------- | -------------- | --------------------------------------------------------------------------------------------------------------------- | ------------ |
+| MS22-P1a | Schema | Prisma models: SystemConfig, BreakglassUser, LlmProvider, UserContainer, SystemContainer, UserAgentConfig. Migration. | — |
+| MS22-P1b | Crypto | Encryption service for API keys/tokens (AES-256-GCM using MOSAIC_SECRET_KEY) | P1a |
+| MS22-P1c | Config API | Internal config endpoint: assembles openclaw.json from DB | P1a, P1b |
+| MS22-P1d | Container Mgr | ContainerLifecycleService: Docker API integration (dockerode), start/stop/health/reap | P1a |
+| MS22-P1e | Onboarding API | Onboarding endpoints: breakglass, OIDC, provider, complete | P1a, P1b |
+| MS22-P1f | Onboarding UI | Multi-step wizard in WebUI | P1e |
+| MS22-P1g | Settings API | CRUD: providers, agent config, OIDC, breakglass | P1a, P1b |
+| MS22-P1h | Settings UI | Settings pages: Providers, Agent Config, Auth | P1g |
+| MS22-P1i | Chat Proxy | Route WebUI chat → user's OpenClaw container (SSE) | P1c, P1d |
+| MS22-P1j | Docker | Entrypoint script, health checks, compose for core services | P1c |
+| MS22-P1k | Idle Reaper | Cron service to stop idle user containers | P1d |

-- Prisma schema: SystemConfig, LlmProvider, AgentConfig, AgentModelAssignment, BreakglassUser
-- Migration
-- Internal config endpoint: generates openclaw.json from DB
-- Encryption/decryption service for API keys and tokens
+## Open Questions (Resolved)

-### Phase 1b: Onboarding Wizard (API)
-
-- Detect first-boot (no breakglass user exists)
-- POST /api/onboarding/breakglass — create admin
-- POST /api/onboarding/oidc — save OIDC config
-- POST /api/onboarding/provider — add LLM provider + test connection
-- POST /api/onboarding/agents — configure agent fleet
-- POST /api/onboarding/complete — mark onboarding done
-
-### Phase 1c: Onboarding Wizard (WebUI)
-
-- Multi-step wizard component
-- Breakglass user creation form
-- OIDC config form (skip option)
-- LLM provider form with connection test
-- Agent configuration with model picker
-- Summary + deploy health check
-
-### Phase 1d: Settings Pages (WebUI)
-
-- Settings/Providers — CRUD for LLM providers
-- Settings/Agents — model assignments, personalities, status
-- Settings/Auth — OIDC config, breakglass password reset
-- All behind admin RBAC
-
-### Phase 1e: Docker Compose + Entrypoint
-
-- Simplified compose (AGENT_NAME + shared env vars)
-- Entrypoint: curl config from API, write, start
-- Health check integration
-
-### Phase 1f: Chat Proxy
-
-- Mosaic API routes WebUI chat to correct OpenClaw agent
-- SSE streaming pass-through
-- Agent selector in WebUI
-
-## Open Questions
-
-1. Should agents auto-restart when config changes in DB? (webhook/signal vs polling)
-2. Should breakglass user be created via CLI as alternative to WebUI wizard?
-3. Config cache TTL in agents? (avoid hitting API on every request)
+1. ~~Config updates → restart?~~ **Yes.** Mosaic restarts the container, fresh config on boot.
+2. ~~CLI alternative for breakglass?~~ **Yes.** Both WebUI wizard and CLI (`mosaic admin create-breakglass`).
+3. ~~Config cache TTL?~~ **Not needed.** Config is fetched once at startup; changes trigger a restart.
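
The idle-timeout rule from the Container Lifecycle section (stop a running user container once it has been inactive for `idleTimeoutMin` minutes, default 30) can be sketched as a pure selection function that a reaper such as `reapIdle()` might call. This is only an illustration under stated assumptions: the `UserContainerRow` interface and `findIdleContainers` function are hypothetical names that mirror a few `UserContainer` fields, and treating a running container with no `lastActiveAt` as not idle is an assumption, not something the design specifies.

```typescript
// Hypothetical row shape mirroring the UserContainer fields the reaper needs.
interface UserContainerRow {
  userId: string;
  status: "running" | "stopped" | "starting" | "error";
  lastActiveAt: Date | null;
  idleTimeoutMin: number;
}

// Returns the running containers whose idle time has reached their timeout.
// Assumption: a running container with no recorded activity is left alone.
function findIdleContainers(rows: UserContainerRow[], now: Date): UserContainerRow[] {
  return rows.filter((row) => {
    if (row.status !== "running") return false;
    if (row.lastActiveAt === null) return false;
    const idleMs = now.getTime() - row.lastActiveAt.getTime();
    return idleMs >= row.idleTimeoutMin * 60 * 1000;
  });
}

// Example data: "a" has been idle for 40 minutes, "b" for 15, "c" is already stopped.
const now = new Date("2025-01-01T12:00:00Z");
const rows: UserContainerRow[] = [
  { userId: "a", status: "running", lastActiveAt: new Date("2025-01-01T11:20:00Z"), idleTimeoutMin: 30 },
  { userId: "b", status: "running", lastActiveAt: new Date("2025-01-01T11:45:00Z"), idleTimeoutMin: 30 },
  { userId: "c", status: "stopped", lastActiveAt: new Date("2025-01-01T09:00:00Z"), idleTimeoutMin: 30 },
];
const idleUserIds = findIdleContainers(rows, now).map((row) => row.userId);
console.log(idleUserIds);
```

With the sample rows, only user `a` is selected. Keeping the selection separate from the Docker calls lets the timeout rule be unit-tested without a Docker daemon; the actual stop would still go through `ContainerLifecycleService.stop(userId)`.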