docs(design): MS22 DB-centric agent fleet architecture #604

Merged
jason.woltje merged 3 commits from docs/ms22-architecture into main 2026-03-01 14:35:15 +00:00
Showing only changes of commit 3974e08b6c

@@ -1,164 +1,253 @@
# MS22 Phase 1: DB-Centric Agent Fleet Architecture
# MS22 Phase 1: DB-Centric Multi-User Agent Architecture
## Design Principles
1. **Minimal env vars**: Only `DATABASE_URL` and `MOSAIC_SECRET_KEY` needed to start
2. **DB-centric config** — All runtime config lives in Postgres, managed via WebUI
3. **Mosaic is the gatekeeper** — Users never talk to OpenClaw directly
4. **Onboarding-first** — Breakglass user + wizard on first boot, no manual config files
5. **Generic product** — No hardcoded agent names, models, providers, or endpoints
1. **2 env vars to bootstrap**: `DATABASE_URL` + `MOSAIC_SECRET_KEY`
2. **DB-centric config** — All runtime config in Postgres, managed via WebUI
3. **Mosaic is the gatekeeper** — Users authenticate to Mosaic, never to OpenClaw directly
4. **Per-user agent isolation** — Each user gets their own OpenClaw container(s) with their own credentials
5. **Onboarding-first** — Breakglass user + wizard on first boot
6. **Generic product** — No hardcoded names, models, providers, or endpoints
## Bootstrap Flow
## Architecture Overview
```
docker stack deploy (2 env vars)
┌─────────────────────┐
│ Postgres migration  │ ← creates tables, no seed data
└─────────────────────┘
┌─────────────────────┐
│ User opens WebUI    │ ← detects empty config
└─────────────────────┘
┌─────────────────────────────────────────────┐
│ ONBOARDING WIZARD                           │
│ Step 1: Create breakglass admin             │
│   (username + password → bcrypt)            │
│ Step 2: Configure OIDC (optional)           │
│   (provider URL, client ID, secret)         │
│ Step 3: Add LLM provider                    │
│   (type, API key, endpoint, test)           │
│ Step 4: Configure agents                    │
│   (roles, model assignments)                │
│   Auto-generates gateway tokens             │
│                                             │
│ Step 5: Deploy summary + health check       │
└─────────────────────────────────────────────┘
┌─────────────────────┐
│ Agents pick up      │ ← GET /api/internal/agent-config/:name
│ config from DB      │
└─────────────────────┘
┌─────────────────────────────────────────────────────┐
│                    MOSAIC WEBUI                     │
│    (Auth: breakglass local + OIDC via settings)     │
└──────────────────────┬──────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│                     MOSAIC API                      │
│                                                     │
│  ┌──────────────┐  ┌────────────────┐  ┌─────────┐  │
│  │ Onboarding   │  │ Container      │  │ Config  │  │
│  │ Wizard       │  │ Lifecycle Mgr  │  │ Store   │  │
│  └──────────────┘  └───────┬────────┘  └─────────┘  │
│                            │                        │
└────────────────────────────┼────────────────────────┘
                             │ Docker API
          ┌──────────────────┼──────────────────┐
   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
   │ OpenClaw    │    │ OpenClaw    │    │ OpenClaw    │
   │ User A      │    │ User B      │    │ System      │
   │             │    │             │    │ (admin)     │
   │ Claude Max  │    │ Z.ai key    │    │ Shared key  │
   │ own memory  │    │ own memory  │    │ monitoring  │
   └─────────────┘    └─────────────┘    └─────────────┘
    Scale to zero      Scale to zero        Always on
    after idle         after idle
```
## Container Lifecycle
### User containers (on-demand)
1. User logs in → Mosaic checks `UserContainer` table
2. No running container → Mosaic calls Docker API to create one
3. Injects user's encrypted API keys via config endpoint
4. Routes chat requests to user's container
5. Idle timeout (configurable, default 30min) → scale to zero
6. State volume persists (sessions, memory, auth tokens)
7. Next request → container restarts, picks up state from volume
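
The on-demand steps above can be read as a small decision function plus an idle check. The sketch below is illustrative only: the row shape is a trimmed, hypothetical version of the `UserContainer` model, and the `Action` names and the choice to recreate a container in the `error` state are assumptions, not part of the design.

```typescript
// Hypothetical row shape, trimmed from the UserContainer model in this document.
type ContainerStatus = "running" | "stopped" | "starting" | "error";

interface UserContainerRow {
  userId: string;
  status: ContainerStatus;
  lastActiveAt: Date | null;
  idleTimeoutMin: number;
}

// Assumed action names; the design does not define them.
type Action = "route" | "start" | "wait" | "recreate";

// Decide what to do when a chat request arrives for a user.
function actionOnRequest(row: UserContainerRow | undefined): Action {
  if (!row || row.status === "stopped") return "start"; // steps 2 and 7
  if (row.status === "starting") return "wait";
  if (row.status === "error") return "recreate"; // assumption: errors are handled by recreating
  return "route"; // step 4
}

// True when a running container has been idle for at least its timeout (step 5).
function isIdle(row: UserContainerRow, now: Date): boolean {
  if (row.status !== "running" || row.lastActiveAt === null) return false;
  const idleMs = now.getTime() - row.lastActiveAt.getTime();
  return idleMs >= row.idleTimeoutMin * 60_000;
}
```

Keeping both checks as pure functions would let the chat proxy and the idle reaper share one definition of "idle" without touching Docker in unit tests.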
### System containers (always-on, optional)
- Admin-provisioned for system tasks (monitoring, scheduled jobs)
- Use admin-configured shared API keys
- Not tied to any user
## Auth Layers
| Flow | Method | Details |
| ------------------------------ | -------------------------- | ----------------------------------------------------- |
| User → Mosaic WebUI | Breakglass (local) or OIDC | Breakglass always available as fallback |
| Mosaic API → OpenClaw | Bearer token | Auto-generated per agent, stored encrypted in DB |
| OpenClaw → Mosaic API (config) | Bearer token | Same agent token, validated by Mosaic |
| OpenClaw → LLM providers | API keys | Stored encrypted in DB, delivered via config endpoint |
| Admin → Settings | RBAC | Admin role required for provider/agent/OIDC config |
| Flow | Method |
| ------------------------------- | ---------------------------------------------------------------------- |
| User → Mosaic WebUI | Breakglass (local) or OIDC (configured in settings) |
| Mosaic API → OpenClaw container | Bearer token (auto-generated per container, stored encrypted in DB) |
| OpenClaw → LLM providers | User's own API keys (delivered via config endpoint, decrypted from DB) |
| Admin → System settings | RBAC (admin role required) |
| Internal config endpoint | Bearer token (container authenticates to fetch its config) |
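
As a rough illustration of the bearer-token rows in the table, the sketch below generates a per-container token and checks a presented token in constant time. It uses only Node's built-in `crypto` module; the function names, the 32-byte token length, and the hash-then-compare approach are assumptions made for this example.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "crypto";

// Generate a random bearer token for a container (32 bytes, hex-encoded).
// The length is an assumption; the design only says "auto-generated".
function generateGatewayToken(): string {
  return randomBytes(32).toString("hex");
}

// Compare a presented token with the expected one in constant time.
// Hashing both sides first yields equal-length buffers, which timingSafeEqual requires.
function tokensMatch(presented: string, expected: string): boolean {
  const a = createHash("sha256").update(presented).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

// Extract the token from an "Authorization: Bearer <token>" header value.
function parseBearer(header: string | undefined): string | null {
  const match = /^Bearer (\S+)$/.exec(header ?? "");
  return match?.[1] ?? null;
}
```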
## Database Schema (new tables)
## Database Schema
### `SystemConfig`
Key-value store for global settings (singleton-ish).
### System Tables
```prisma
model SystemConfig {
id String @id @default(cuid())
key String @unique // "oidc.issuerUrl", "oidc.clientId", "onboarding.completed", etc.
key String @unique // "oidc.issuerUrl", "oidc.clientId", "onboarding.completed"
value String // plaintext or encrypted (prefix: "enc:")
encrypted Boolean @default(false)
updatedAt DateTime @updatedAt
}
```
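
One possible way to produce and read the `enc:`-prefixed values is sketched below with AES-256-GCM from Node's built-in `crypto` module. The storage layout (IV, then auth tag, then ciphertext, base64-encoded) and the SHA-256 key derivation from `MOSAIC_SECRET_KEY` are assumptions made for this example; the design does not specify either.

```typescript
import { createCipheriv, createDecipheriv, createHash, randomBytes } from "crypto";

const PREFIX = "enc:";

// Assumption: derive a 32-byte key by hashing the secret with SHA-256.
// A real deployment might prefer a dedicated KDF such as HKDF or scrypt.
function deriveKey(secret: string): Buffer {
  return createHash("sha256").update(secret).digest();
}

// Encrypt a value and return it in the assumed "enc:<base64(iv|tag|ciphertext)>" layout.
function encryptValue(plaintext: string, secret: string): string {
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv("aes-256-gcm", deriveKey(secret), iv);
  const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16 bytes by default
  return PREFIX + Buffer.concat([iv, tag, body]).toString("base64");
}

// Decrypt a stored value; values without the prefix are returned unchanged.
function decryptValue(stored: string, secret: string): string {
  if (!stored.startsWith(PREFIX)) return stored; // plaintext value
  const raw = Buffer.from(stored.slice(PREFIX.length), "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const body = raw.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", deriveKey(secret), iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(body), decipher.final()]).toString("utf8");
}
```

Because GCM is authenticated, decrypting with the wrong key or a tampered value throws instead of returning garbage.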
### `LlmProvider`
LLM provider configurations.
```prisma
model LlmProvider {
id String @id @default(cuid())
name String @unique // "zai", "openai", "anthropic", "ollama-local", etc.
displayName String // "Z.ai", "OpenAI", "Local Ollama"
type String // "zai" | "openai" | "anthropic" | "ollama" | "custom"
baseUrl String? // null for built-in providers, URL for custom/ollama
apiKey String? // encrypted
apiType String @default("openai-completions") // openai-completions | anthropic-messages | etc.
models Json @default("[]") // available model list [{id, name, contextWindow, maxTokens}]
isActive Boolean @default(true)
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
agents AgentModelAssignment[]
}
```
### `AgentConfig`
Per-agent configuration (replaces old OpenClawAgent).
```prisma
model AgentConfig {
id String @id @default(cuid())
name String @unique // "mosaic-main", "mosaic-projects", etc.
displayName String // "Main Orchestrator", "Projects", etc.
role String // "orchestrator" | "developer" | "researcher" | "operations"
gatewayUrl String // internal Docker URL: "http://mosaic-main:18789"
gatewayToken String // encrypted — auto-generated
isActive Boolean @default(true)
personality String? // SOUL.md content for this agent
toolPermissions Json @default("[]") // allowed tool list
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
modelAssignment AgentModelAssignment?
}
```
### `AgentModelAssignment`
Links agents to providers and models.
```prisma
model AgentModelAssignment {
id String @id @default(cuid())
agentId String @unique
agent AgentConfig @relation(fields: [agentId], references: [id])
providerId String
provider LlmProvider @relation(fields: [providerId], references: [id])
primaryModel String // "glm-5", "claude-sonnet-4-6", "cogito", etc.
fallbacks Json @default("[]") // [{providerId, model}]
updatedAt DateTime @updatedAt
}
```
### `BreakglassUser`
Local admin user (no OIDC dependency).
```prisma
model BreakglassUser {
id String @id @default(cuid())
username String @unique
passwordHash String // bcrypt
passwordHash String // bcrypt
isActive Boolean @default(true)
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
}
```
### Provider Tables (per-user)
```prisma
model LlmProvider {
id String @id @default(cuid())
userId String // owner — each user manages their own providers
name String // "my-zai", "work-openai", "local-ollama"
displayName String // "Z.ai", "OpenAI (Work)", "Local Ollama"
type String // "zai" | "openai" | "anthropic" | "ollama" | "custom"
baseUrl String? // null for built-in, URL for custom/ollama
apiKey String? // encrypted
apiType String @default("openai-completions")
models Json @default("[]") // [{id, name, contextWindow, maxTokens}]
isActive Boolean @default(true)
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
@@unique([userId, name])
}
```
### Container Tables
```prisma
model UserContainer {
id String @id @default(cuid())
userId String @unique
containerId String? // Docker container ID (null = not running)
containerName String // "mosaic-user-{userId}"
gatewayPort Int? // assigned port (null = not running)
gatewayToken String // encrypted — auto-generated
status String @default("stopped") // "running" | "stopped" | "starting" | "error"
lastActiveAt DateTime?
idleTimeoutMin Int @default(30)
config Json @default("{}") // cached openclaw.json for this user
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
}
model SystemContainer {
id String @id @default(cuid())
name String @unique // "mosaic-system-ops", "mosaic-system-monitor"
role String // "operations" | "monitor" | "scheduler"
containerId String?
gatewayPort Int?
gatewayToken String // encrypted
status String @default("stopped")
providerId String? // references admin-level LlmProvider
primaryModel String // "zai/glm-5", etc.
isActive Boolean @default(true)
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
}
```
### User Agent Preferences
```prisma
model UserAgentConfig {
id String @id @default(cuid())
userId String @unique
primaryModel String? // user's preferred model
fallbackModels Json @default("[]")
personality String? // custom SOUL.md content
providerId String? // default provider for this user
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
}
```
## Internal Config Endpoint
`GET /api/internal/agent-config/:agentName`
`GET /api/internal/agent-config/:containerType/:id`
- Auth: Bearer token (agent's own gateway token)
- Returns: Complete `openclaw.json` generated from DB tables
- Includes: model config, provider credentials (decrypted), tool permissions
- Auth: Bearer token (container's own gateway token)
- Returns: Complete `openclaw.json` generated from DB
- For user containers: includes user's providers, model prefs, personality
- For system containers: includes admin provider config
## Docker Compose (simplified)
Response assembles openclaw.json dynamically:
```json
{
"gateway": { "mode": "local", "port": 18789, "bind": "lan", "auth": { "mode": "token" } ... },
"agents": { "defaults": { "model": { "primary": "<from UserAgentConfig>" } } },
"models": { "providers": { "<from LlmProvider rows>": { ... } } }
}
```
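
A minimal sketch of how the response might be assembled from DB rows follows. The row interfaces are trimmed, hypothetical versions of `UserAgentConfig` and `LlmProvider`, and the per-provider fields (`baseUrl`, `apiKey`, `api`) are assumptions, since the JSON above elides them.

```typescript
// Hypothetical, trimmed shapes based on the LlmProvider and UserAgentConfig models.
interface ProviderRow {
  name: string;
  baseUrl: string | null;
  apiKey: string | null; // assumed to be decrypted by the caller
  apiType: string;
  isActive: boolean;
}

interface AgentPrefs {
  primaryModel: string | null;
}

// Assumed per-provider entry; the real openclaw.json field names may differ.
interface ProviderEntry {
  baseUrl: string | null;
  apiKey: string | null;
  api: string;
}

// Build the config object that the endpoint would serialize as openclaw.json.
function buildOpenClawConfig(prefs: AgentPrefs, providers: ProviderRow[]) {
  const providerMap: Record<string, ProviderEntry> = {};
  for (const p of providers) {
    if (!p.isActive) continue; // inactive providers are left out
    providerMap[p.name] = { baseUrl: p.baseUrl, apiKey: p.apiKey, api: p.apiType };
  }
  return {
    gateway: { mode: "local", port: 18789, bind: "lan", auth: { mode: "token" } },
    agents: { defaults: { model: { primary: prefs.primaryModel } } },
    models: { providers: providerMap },
  };
}
```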
## Container Lifecycle Manager
NestJS service that manages Docker containers:
```typescript
class ContainerLifecycleService {
// Create and start a user's OpenClaw container
async ensureRunning(userId: string): Promise<{ url: string; token: string }>;
// Stop idle containers (called by cron/scheduler)
async reapIdle(): Promise<number>;
// Stop a specific user's container
async stop(userId: string): Promise<void>;
// Health check all running containers
async healthCheckAll(): Promise<HealthStatus[]>;
// Restart container with updated config
async restart(userId: string): Promise<void>;
}
```
Uses Docker Engine API (`/var/run/docker.sock` or TCP) via `dockerode` npm package.
## Onboarding Wizard
### First-Boot Detection
- API checks: `SystemConfig.get("onboarding.completed")` → null = first boot
- WebUI redirects to `/onboarding` if not completed
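
The detection rule above could be expressed as a small guard. In this sketch the stored value is assumed to be the string `"true"` once onboarding finishes, and `/` stands in for the dashboard route; both are assumptions for illustration.

```typescript
// value is whatever SystemConfig holds for "onboarding.completed"; null means the key is absent.
function isFirstBoot(value: string | null): boolean {
  return value !== "true"; // assumption: completion is stored as the string "true"
}

// Return the path to redirect to, or null when the request can proceed as-is.
function resolveRedirect(path: string, onboardingValue: string | null): string | null {
  const onOnboarding = path === "/onboarding" || path.startsWith("/onboarding/");
  if (isFirstBoot(onboardingValue)) return onOnboarding ? null : "/onboarding";
  return onOnboarding ? "/" : null; // assumption: completed installs leave the wizard
}
```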
### Steps
**Step 1: Create Breakglass Admin**
- Username + password → bcrypt → `BreakglassUser` table
- This user always works, even if OIDC is misconfigured
**Step 2: Configure Authentication (optional)**
- OIDC: provider URL, client ID, client secret → encrypted in `SystemConfig`
- Skip = breakglass-only auth (can add OIDC later in settings)
**Step 3: Add Your First LLM Provider**
- Pick type → enter API key/endpoint → test connection → save to `LlmProvider`
- This becomes the admin's default provider
**Step 4: System Agents (optional)**
- Configure always-on system agents for monitoring/ops
- Or skip — users can just use their own personal agents
**Step 5: Complete**
- Sets `SystemConfig("onboarding.completed") = true`
- Redirects to dashboard
### Post-Onboarding: User Self-Service
- Each user adds their own LLM providers in profile settings
- Each user configures their preferred model, personality
- First chat request triggers container creation
## Docker Compose (final)
```yaml
services:
@@ -167,86 +256,101 @@ services:
environment:
DATABASE_URL: ${DATABASE_URL}
MOSAIC_SECRET_KEY: ${MOSAIC_SECRET_KEY}
volumes:
- /var/run/docker.sock:/var/run/docker.sock # Docker API access
networks:
- internal
mosaic-web:
image: mosaic/web:latest
environment:
NEXT_PUBLIC_API_URL: http://mosaic-api:4000
networks:
- internal
mosaic-main:
image: alpine/openclaw:latest
command: ["/config/entrypoint.sh"]
postgres:
image: postgres:17
environment:
DATABASE_URL: ${DATABASE_URL}
MOSAIC_API_URL: http://mosaic-api:4000
MOSAIC_SECRET_KEY: ${MOSAIC_SECRET_KEY}
AGENT_NAME: mosaic-main
POSTGRES_DB: mosaic
POSTGRES_USER: mosaic
POSTGRES_PASSWORD: ${DATABASE_PASSWORD}
volumes:
- mosaic-main-state:/home/node/.openclaw
- postgres-data:/var/lib/postgresql/data
networks:
- internal
# Additional agents follow same pattern, only AGENT_NAME differs
# System agent (optional, admin-provisioned)
# mosaic-system:
# image: alpine/openclaw:latest
# ... (managed by ContainerLifecycleService)
# User containers are NOT in this file —
# they are dynamically created by ContainerLifecycleService
# via the Docker API at runtime.
networks:
internal:
driver: overlay
volumes:
postgres-data:
```
### Entrypoint (simplified)
Note: User OpenClaw containers are **not** defined in docker-compose. They are
created dynamically by the `ContainerLifecycleService` when users start chatting.
## Entrypoint (for dynamically created containers)
```sh
#!/bin/sh
# Fetch config from Mosaic API, write openclaw.json, start gateway
CONFIG=$(curl -sf "${MOSAIC_API_URL}/api/internal/agent-config/${AGENT_NAME}" \
-H "Authorization: Bearer ${MOSAIC_SECRET_KEY}")
echo "$CONFIG" > /tmp/openclaw.json
set -e
: "${MOSAIC_API_URL:?required}"
: "${AGENT_TOKEN:?required}"
: "${AGENT_ID:?required}"
# Fetch config from Mosaic API
curl -sf "${MOSAIC_API_URL}/api/internal/agent-config/${AGENT_ID}" \
-H "Authorization: Bearer ${AGENT_TOKEN}" \
-o /tmp/openclaw.json
export OPENCLAW_CONFIG_PATH=/tmp/openclaw.json
exec openclaw gateway run --bind lan --auth token
```
Container env vars (injected by ContainerLifecycleService):
- `MOSAIC_API_URL` — internal API URL
- `AGENT_TOKEN` — this container's bearer token (from DB)
- `AGENT_ID` — container ID for config lookup
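
To make the injection concrete, the sketch below builds the options object a lifecycle service might pass when creating a user container. The `Image` and `Env` fields follow the Docker Engine API's container-create body, and `Env` entries use its `KEY=value` form. Using the user ID as `AGENT_ID` and the helper name `buildCreateOptions` are assumptions for this example.

```typescript
// Hypothetical row shape; the token is assumed to be decrypted already.
interface UserContainerRow {
  userId: string;
  gatewayToken: string;
}

// Build create options for a user's container.
// The image name and container naming pattern are taken from this document.
function buildCreateOptions(row: UserContainerRow, apiUrl: string) {
  return {
    name: `mosaic-user-${row.userId}`,
    Image: "alpine/openclaw:latest",
    Env: [
      `MOSAIC_API_URL=${apiUrl}`,
      `AGENT_TOKEN=${row.gatewayToken}`,
      `AGENT_ID=${row.userId}`, // assumption: the user ID is the lookup key
    ],
  };
}
```

For example, `buildCreateOptions({ userId: "u1", gatewayToken: "t" }, "http://mosaic-api:4000")` yields a container named `mosaic-user-u1` whose environment carries the three variables listed above.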
## Config Update Strategy
When a user changes settings (model, provider, personality):
1. Mosaic API updates DB
2. API calls `ContainerLifecycleService.restart(userId)`
3. Container restarts, fetches fresh config from API
4. OpenClaw gateway starts with new config
5. State volume preserves sessions/memory across restarts
## Task Breakdown
### Phase 1a: DB Schema + Internal Config API
| Task | Phase | Scope | Dependencies |
| -------- | -------------- | --------------------------------------------------------------------------------------------------------------------- | ------------ |
| MS22-P1a | Schema | Prisma models: SystemConfig, BreakglassUser, LlmProvider, UserContainer, SystemContainer, UserAgentConfig. Migration. | — |
| MS22-P1b | Crypto | Encryption service for API keys/tokens (AES-256-GCM using MOSAIC_SECRET_KEY) | P1a |
| MS22-P1c | Config API | Internal config endpoint: assembles openclaw.json from DB | P1a, P1b |
| MS22-P1d | Container Mgr | ContainerLifecycleService: Docker API integration (dockerode), start/stop/health/reap | P1a |
| MS22-P1e | Onboarding API | Onboarding endpoints: breakglass, OIDC, provider, complete | P1a, P1b |
| MS22-P1f | Onboarding UI | Multi-step wizard in WebUI | P1e |
| MS22-P1g | Settings API | CRUD: providers, agent config, OIDC, breakglass | P1a, P1b |
| MS22-P1h | Settings UI | Settings pages: Providers, Agent Config, Auth | P1g |
| MS22-P1i | Chat Proxy | Route WebUI chat → user's OpenClaw container (SSE) | P1c, P1d |
| MS22-P1j | Docker | Entrypoint script, health checks, compose for core services | P1c |
| MS22-P1k | Idle Reaper | Cron service to stop idle user containers | P1d |
- Prisma schema: SystemConfig, LlmProvider, AgentConfig, AgentModelAssignment, BreakglassUser
- Migration
- Internal config endpoint: generates openclaw.json from DB
- Encryption/decryption service for API keys and tokens
## Open Questions (Resolved)
### Phase 1b: Onboarding Wizard (API)
- Detect first-boot (no breakglass user exists)
- POST /api/onboarding/breakglass — create admin
- POST /api/onboarding/oidc — save OIDC config
- POST /api/onboarding/provider — add LLM provider + test connection
- POST /api/onboarding/agents — configure agent fleet
- POST /api/onboarding/complete — mark onboarding done
### Phase 1c: Onboarding Wizard (WebUI)
- Multi-step wizard component
- Breakglass user creation form
- OIDC config form (skip option)
- LLM provider form with connection test
- Agent configuration with model picker
- Summary + deploy health check
### Phase 1d: Settings Pages (WebUI)
- Settings/Providers — CRUD for LLM providers
- Settings/Agents — model assignments, personalities, status
- Settings/Auth — OIDC config, breakglass password reset
- All behind admin RBAC
### Phase 1e: Docker Compose + Entrypoint
- Simplified compose (AGENT_NAME + shared env vars)
- Entrypoint: curl config from API, write, start
- Health check integration
### Phase 1f: Chat Proxy
- Mosaic API routes WebUI chat to correct OpenClaw agent
- SSE streaming pass-through
- Agent selector in WebUI
## Open Questions
1. Should agents auto-restart when config changes in DB? (webhook/signal vs polling)
2. Should breakglass user be created via CLI as alternative to WebUI wizard?
3. Config cache TTL in agents? (avoid hitting API on every request)
1. ~~Config updates → restart?~~ **Yes.** Mosaic restarts the container, fresh config on boot.
2. ~~CLI alternative for breakglass?~~ **Yes.** Both WebUI wizard and CLI (`mosaic admin create-breakglass`).
3. ~~Config cache TTL?~~ **Yes.** Config fetched once at startup, changes trigger restart.