PRD: Harness Foundation — Phase 9
Metadata
- Owner: Jason Woltje
- Date: 2026-03-21
- Status: draft
- Phase: 9 (post-MVP)
- Version Target: v0.2.0
- Agent Harness: Pi SDK
- Best-Guess Mode: true
- Repo: git.mosaicstack.dev/mosaic/mosaic-stack
Problem Statement
Mosaic Stack v0.1.0 delivered a functional skeleton — gateway boots, TUI connects, single-agent chat streams, basic auth works. But the system is not usable as a daily-driver harness:
- Chat messages are fire-and-forget. The WebSocket gateway never calls ConversationsRepo. Context is lost on disconnect. Conversations can't be resumed with history. Cross-interface continuity (TUI → WebUI → Matrix) is impossible.
- Single provider (Ollama) with local models only. No access to frontier models (Claude Opus 4.6, Codex gpt-5.4, GLM-5). The routing engine exists but has never been tested with real providers.
- No task-aware agent routing. A coding task and a summarization task route to the same agent with the same model. There is no mechanism to match tasks to agents by capability, cost tier, or specialization.
- Memory is not user-scoped. Insight vector search returns all users' data. Deploying multi-user in this state would be a security violation.
- Agent configs exist in DB but are ignored. Stored system prompts, model preferences, and tool allowlists don't apply to sessions. The /model and /agent slash commands are stubbed.
- No job queue. Background processing (summarization, GC, tier management) runs on fragile cron. No retry, no monitoring, no async task dispatch foundation for future agent orchestration.
- Plugin system is hollow. Zero implementations. No defined message protocol. Blocks all remote interfaces (Matrix, Discord, Telegram) planned for Phase 10+.
What this phase solves: Transform Mosaic from a demo into a real multi-provider, task-routing AI harness that persists everything, routes intelligently, and is architecturally ready for multi-agent and remote control.
Objectives
- Persistent conversations — Every message saved, every conversation resumable, full context available across interfaces
- Multi-provider LLM access — Anthropic, OpenAI, OpenRouter, Z.ai, Ollama with proper auth flows
- Task-aware agent routing — Granular routing rules that match tasks to the right agent + model by capability, cost, and domain
- Security isolation — All data queries user-scoped, ready for multi-user deployment
- Session hardening — Agent configs apply, model/agent switching works mid-session
- Reliable background processing — BullMQ job queue replaces fragile cron
- Channel protocol design — Architecture for Matrix and remote interfaces, built into the foundation now
Scope
In Scope
- Conversation persistence — wire ChatGateway to ConversationsRepo, context loading on resume
- Multi-provider integration — Anthropic, OpenAI, OpenRouter, Z.ai, Ollama with auth flows
- Task-aware agent routing — granular routing rules with task classification and fallback chains
- Security isolation — user-scoped queries on all data paths (memory, conversations, agents)
- Agent session hardening — configs apply, model/agent switching, session resume
- Job queue — BullMQ replacing cron for background processing
- Channel protocol design — architecture document for Matrix and remote interfaces
- Embedding migration — Ollama-local embeddings replacing OpenAI dependency
Out of Scope
- Matrix homeserver deployment + appservice (Phase 10)
- Multi-agent orchestration / supervisor-worker pattern (Phase 10+)
- WebUI rebuild (future)
- Self-managing memory — compaction, merge, forget (future)
- Team workspace isolation (future)
- Remote channel plugins — WhatsApp, Discord, Telegram (Phase 10+, via Matrix)
- Fine-grained RBAC — project/agent/team roles (future)
- Agent-to-agent communication (Phase 10+)
User/Stakeholder Requirements
- As a user, I can resume a conversation after closing the TUI and the agent remembers the full context
- As a user, I can use frontier models (Claude Opus 4.6, Codex gpt-5.4) without manual provider configuration
- As a user, the system automatically selects the best model for my task (coding → powerful model, simple question → cheap model)
- As a user, I can override the automatic model selection with /model <name> at any time
- As a user, I can switch between specialized agents mid-session with /agent <name>
- As an admin, I can define routing rules that control which models handle which task types
- As an admin, I can monitor background job health and retry failed jobs
- As a user, my conversations, memories, and preferences are invisible to other users
Functional Requirements
- FR-1: ChatGateway persists every message (user, assistant, tool call, thinking) to the conversations/messages tables
- FR-2: On session resume with an existing conversationId, message history is loaded from DB and injected into the agent session context
- FR-3: When conversation history exceeds 80% of the model's context window, older messages are summarized and prepended as a context checkpoint
- FR-4: Five LLM providers are registered with the gateway: Anthropic (Claude Sonnet 4.6, Opus 4.6, Haiku 4.5), OpenAI (Codex gpt-5.4), OpenRouter (dynamic model list), Z.ai (GLM-5), Ollama (local models)
- FR-5: Each provider supports API key auth; Anthropic and OpenAI additionally support OAuth (URL-display + callback pattern)
- FR-6: Provider credentials are stored per-user in the DB (encrypted), not in environment variables
- FR-7: A routing engine classifies each user message by taskType, complexity, domain, and required capabilities, then selects the optimal provider/model via priority-ordered rules
- FR-8: Default routing rules are seeded on first run; admins can customize system-wide rules; users can set per-session overrides
- FR-9: Routing decisions are transparent — the TUI shows which model was selected and why
- FR-10: Agent configs (system prompt, default model, tool allowlist, skills) stored in DB are applied when creating agent sessions
- FR-11: /model <name> switches the active model for subsequent messages in the current session
- FR-12: /agent <name> switches to a different agent config, loading its system prompt, tools, and default model
- FR-13: All memory queries (insight vector search, preferences) filter by userId
- FR-14: BullMQ handles background jobs (summarization, GC, tier management) with retry, backoff, and monitoring
- FR-15: Embeddings are served locally via Ollama (nomic-embed-text or mxbai-embed-large) with no external API dependency
Non-Functional Requirements
- Security: All data queries include userId filter. Provider credentials encrypted at rest. No cross-user data leakage. OAuth tokens stored securely with refresh handling.
- Performance: Message persistence adds <50ms to message relay latency. Routing classification <100ms per message. Provider health checks run on configurable interval (default 60s) without blocking requests.
- Reliability: BullMQ jobs retry with exponential backoff (3 attempts default). Provider failover: if primary provider is unhealthy, fallback chain activates automatically. Conversation context survives TUI restart.
- Observability: Routing decisions logged with classification details. Job execution logged to agent_logs. Provider health status exposed via /api/providers/health. Session metrics (tokens, model switches, duration) persisted in DB.
Acceptance Criteria
- AC-1: Send messages in TUI → restart TUI → resume conversation → agent has full history and context
- AC-2: Route a coding task to Claude Opus 4.6, a simple question to Haiku, a summarization to GLM-5 — all via granular routing rules
- AC-3: Two users exist, User A's memory searches never return User B's data
- AC-4: /model claude-sonnet-4-6 in TUI switches the active model for subsequent messages
- AC-5: /agent coding-agent in TUI switches to a different agent with a different system prompt and tools
- AC-6: BullMQ jobs execute on schedule, failures retry with backoff, admin can inspect via /api/admin/jobs
- AC-7: Channel protocol document exists with Matrix integration points defined, reviewed, and approved
- AC-8: Embeddings run on Ollama local models (no external API dependency for vector operations)
- AC-9: All five providers (Anthropic, OpenAI, OpenRouter, Z.ai, Ollama) connect, list models, and complete chat requests
- AC-10: Routing transparency — TUI displays which model was selected and the routing reason for each response
Testing and Verification Expectations
- Baseline checks: pnpm typecheck, pnpm lint, pnpm format:check — all green before any push
- Unit tests: routing engine rule matching, task classifier, provider adapter registration, message persistence
- Integration tests: Two-user isolation (M2-007), provider round-trip (M3-012), routing end-to-end (M4-013), session resume with context (M1-008)
- Situational tests per milestone: Each milestone has a verify task that exercises the delivered functionality end-to-end
- Evidence format: Test output + manual verification notes in scratchpad per milestone
Constraints and Dependencies
| Type | Item | Notes |
|---|---|---|
| Dependency | @anthropic-ai/sdk | npm, required for M3-002 |
| Dependency | openai | npm, required for M3-003 |
| Dependency | bullmq | npm, Valkey-compatible, required for M6 |
| Dependency | Ollama embedding models | ollama pull nomic-embed-text, required for M3-009 |
| Dependency | Pi SDK provider adapter support | ASSUMPTION: supported — verify in M3-001 |
| External | Anthropic OAuth credentials | Requires Anthropic Console setup |
| External | OpenAI OAuth credentials | Requires OpenAI Platform setup |
| External | Z.ai API key | Requires Z.ai account |
| External | OpenRouter API key | Requires OpenRouter account |
| Constraint | Valkey 8 compatibility | BullMQ requires Redis 6+; Valkey 8 is compatible |
| Constraint | Embedding dimension migration | Switching from 1536 (OpenAI) to 768/1024 (Ollama) requires re-embedding or fresh start |
Assumptions
- ASSUMPTION: Pi SDK supports custom provider adapters for all target LLM providers. If not, adapters wrap native SDKs behind Pi's interface. Rationale: Gateway already uses Pi with Ollama via a custom adapter pattern.
- ASSUMPTION: BullMQ is Valkey-compatible. Rationale: BullMQ documents Redis 6+ compatibility; Valkey 8 is Redis-compatible.
- ASSUMPTION: Ollama can serve embedding models (nomic-embed-text, mxbai-embed-large) with acceptable quality. Rationale: Ollama supports embedding endpoints natively.
- ASSUMPTION: Anthropic and OpenAI OAuth flows can be handled via URL-display + token callback pattern (same as existing provider auth). Rationale: Both providers offer standard OAuth 2.0 flows.
- ASSUMPTION: Z.ai GLM-5 uses an API format compatible with OpenAI or has a documented SDK. Rationale: Most LLM providers converge on OpenAI-compatible APIs.
- ASSUMPTION: The existing Pi SDK session model supports mid-session model switching without destroying session state. If not, we destroy and recreate with conversation history. Rationale: Acceptable fallback — context is persisted in DB.
- ASSUMPTION: Channel protocol design can be completed without a running Matrix homeserver. Rationale: Matrix protocol is well-documented; design is architecture, not integration.
Milestones
Milestone 1: Conversation Persistence & Context
Goal: Every message persisted. Every conversation resumable with full context.
| Task | Description |
|---|---|
| M1-001 | Wire ChatGateway.handleMessage() → ConversationsRepo.addMessage() for user messages |
| M1-002 | Wire agent event relay → ConversationsRepo.addMessage() for assistant responses (text, tool calls, thinking) |
| M1-003 | Store message metadata: model used, provider, token counts, tool call details, timestamps |
| M1-004 | On session resume (existing conversationId), load message history from DB and inject into Pi session context |
| M1-005 | Context window management: if history exceeds model context, summarize older messages and prepend summary |
| M1-006 | Conversation search: full-text search on messages table via /api/conversations/search |
| M1-007 | TUI: /history command to display conversation message count and context usage |
| M1-008 | Verify: send messages → kill TUI → resume with -c <id> → agent references prior context |
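The persistence wiring in M1-001 through M1-003 can be sketched as follows. The repo method names and message shape are assumptions made for illustration; the real ConversationsRepo is backed by the Postgres conversations/messages tables rather than an in-memory array.

```typescript
// Roles mirror FR-1: user, assistant, tool call, and thinking messages.
type Role = "user" | "assistant" | "tool" | "thinking";

interface StoredMessage {
  conversationId: string;
  role: Role;
  content: string;
  metadata: {
    model?: string;       // M1-003: model used for this message
    provider?: string;    // M1-003: provider that served it
    inputTokens?: number;
    outputTokens?: number;
    createdAt: string;
  };
}

// In-memory stand-in for the DB-backed ConversationsRepo.
class ConversationsRepo {
  private messages: StoredMessage[] = [];

  addMessage(msg: StoredMessage): void {
    this.messages.push(msg);
  }

  findMessages(conversationId: string): StoredMessage[] {
    return this.messages.filter((m) => m.conversationId === conversationId);
  }
}

// What ChatGateway's event relay (M1-002) would call per assistant response.
function persistAssistantMessage(
  repo: ConversationsRepo,
  conversationId: string,
  content: string,
  model: string,
  provider: string,
): void {
  repo.addMessage({
    conversationId,
    role: "assistant",
    content,
    metadata: { model, provider, createdAt: new Date().toISOString() },
  });
}
```

On resume (M1-004), `findMessages(conversationId)` supplies the history that gets injected into the Pi session context.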
Milestone 2: Security & Isolation
Goal: All data queries user-scoped. Safe for multi-user deployment.
| Task | Description |
|---|---|
| M2-001 | Audit InsightsRepo: add userId filter to searchByEmbedding() vector search |
| M2-002 | Audit InsightsRepo: add userId filter to findByUser(), decayOldInsights() |
| M2-003 | Audit PreferencesRepo: verify all queries filter by userId |
| M2-004 | Audit agent memory tools: verify memory_search, memory_save_*, memory_get_* all scope to session user |
| M2-005 | Audit ConversationsRepo: verify ownership check on findById, update, delete, addMessage, findMessages |
| M2-006 | Audit AgentsRepo: verify findAccessible() returns only user's agents + system agents |
| M2-007 | Add integration test: create two users, populate data for each, verify cross-user isolation on every query path |
| M2-008 | Audit Valkey keys: verify session keys include userId or are not enumerable across users |
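The invariant the M2 audits enforce can be illustrated with a minimal sketch. The Insight shape and function names are assumptions; in the real InsightsRepo the scoping is a `WHERE user_id = $1` clause applied alongside the pgvector distance ranking, not an in-memory filter. The key design point: userId is a required parameter, so a query path that forgets it fails to compile.

```typescript
interface Insight {
  userId: string;
  text: string;
  embedding: number[];
}

// Cosine similarity, standing in for the pgvector distance operator.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function searchByEmbedding(
  insights: Insight[],
  userId: string, // required — the M2-001 fix in miniature
  query: number[],
  limit: number,
): Insight[] {
  return insights
    .filter((i) => i.userId === userId) // isolation happens before ranking
    .sort((a, b) => cosine(b.embedding, query) - cosine(a.embedding, query))
    .slice(0, limit);
}
```

M2-007's integration test is the same assertion at full scale: populate two users, run every query path as each, and verify zero cross-user rows.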
Milestone 3: Provider Integration
Goal: Five providers operational with proper auth, health checking, and capability metadata.
| Task | Description |
|---|---|
| M3-001 | Refactor ProviderService into provider adapter pattern: IProviderAdapter interface with register(), listModels(), healthCheck(), createClient() |
| M3-002 | Anthropic adapter: @anthropic-ai/sdk, register Claude Sonnet 4.6 + Opus 4.6, OAuth flow (URL display + callback), API key fallback |
| M3-003 | OpenAI adapter: openai SDK, register Codex gpt-5.4, OAuth flow, API key fallback |
| M3-004 | OpenRouter adapter: OpenAI-compatible client, API key auth, dynamic model list from /api/v1/models |
| M3-005 | Z.ai GLM adapter: register GLM-5, API key auth, research and implement API format |
| M3-006 | Ollama adapter: refactor existing Ollama integration into adapter pattern, add embedding model support |
| M3-007 | Provider health check: periodic probe (configurable interval), status per provider, expose via /api/providers/health |
| M3-008 | Model capability matrix: define per-model metadata (tier, context window, tool support, vision, streaming, embedding capable) |
| M3-009 | Refactor EmbeddingService: replace OpenAI-hardcoded client with provider-agnostic interface, Ollama as default (nomic-embed-text or mxbai-embed-large) |
| M3-010 | OAuth token storage: persist provider tokens per user in DB (encrypted), refresh flow |
| M3-011 | Provider config UI support: /api/providers CRUD for user-scoped provider credentials |
| M3-012 | Verify: each provider connects, lists models, completes a chat request, handles errors gracefully |
Milestone 4: Agent Routing Engine
Goal: Granular, rule-based routing that matches tasks to the right agent and model by capability, cost, and domain specialization.
| Task | Description |
|---|---|
| M4-001 | Define routing rule schema: RoutingRule { name, priority, conditions[], action } stored in DB |
| M4-002 | Condition types: taskType (coding, research, summarization, conversation, analysis, creative), complexity (simple, moderate, complex), domain (frontend, backend, devops, docs, general), costTier (cheap, standard, premium), requiredCapabilities (tools, vision, long-context, reasoning) |
| M4-003 | Action types: routeTo { provider, model, agentConfigId?, systemPromptOverride?, toolAllowlist? } |
| M4-004 | Default routing rules (seed data): coding → Opus 4.6, simple Q&A → Sonnet 4.6, summarization → GLM-5, research → Codex gpt-5.4, local/offline → Ollama llama3.2 |
| M4-005 | Task classification: lightweight classifier that infers taskType + complexity from user message (can be rule-based regex/keyword initially, LLM-assisted later) |
| M4-006 | Routing decision pipeline: classify task → match rules by priority → select best available provider/model → fallback chain if primary unavailable |
| M4-007 | Routing override: user can force a specific model via /model <name> regardless of routing rules |
| M4-008 | Routing transparency: include routing decision in session:info event (why this model was selected) |
| M4-009 | Routing rules CRUD: /api/routing/rules — list, create, update, delete, reorder priority |
| M4-010 | Per-user routing overrides: users can customize default rules for their sessions |
| M4-011 | Agent specialization: agents can declare capabilities in their config (domains, preferred models, tool sets) |
| M4-012 | Routing integration: wire routing engine into ChatGateway — every new message triggers routing decision before agent dispatch |
| M4-013 | Verify: send a coding question → routed to Opus; send "summarize this" → routed to GLM-5; send "what time is it" → routed to cheap tier |
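A rule-based classifier in the spirit of M4-005 might start as simply as the sketch below. The keyword lists and the length-based complexity heuristic are illustrative assumptions, not the shipped rules; the point is that a cheap first pass is enough to drive routing before any LLM-assisted classification exists.

```typescript
type TaskType = "coding" | "research" | "summarization" | "conversation";
type Complexity = "simple" | "moderate" | "complex";

interface Classification {
  taskType: TaskType;
  complexity: Complexity;
}

function classify(message: string): Classification {
  const text = message.toLowerCase();

  // Keyword/regex matching per M4-005; first match wins.
  let taskType: TaskType = "conversation";
  if (/\b(summari[sz]e|tl;?dr|recap)\b/.test(text)) taskType = "summarization";
  else if (/\b(refactor|debug|implement|function|compile|stack trace)\b/.test(text))
    taskType = "coding";
  else if (/\b(research|compare|investigate|survey)\b/.test(text))
    taskType = "research";

  // Crude heuristic: longer, multi-step prompts score as more complex.
  const complexity: Complexity =
    text.length > 600 ? "complex" : text.length > 150 ? "moderate" : "simple";

  return { taskType, complexity };
}
```

The classifier's output feeds RoutingEngine rule matching; swapping this function for an LLM-assisted version later leaves the rest of the pipeline untouched.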
Milestone 5: Agent Session Hardening
Goal: Agent configs apply to sessions. Model and agent switching work mid-session.
| Task | Description |
|---|---|
| M5-001 | Wire ChatGateway: on session create, load agent config from DB (system prompt, model, provider, tool allowlist, skills) |
| M5-002 | /model <name> command: end-to-end wiring — TUI → socket command:execute → gateway switches provider/model → new messages use new model |
| M5-003 | /agent <name> command: switch to different agent config mid-session — loads new system prompt, tools, and default model |
| M5-004 | Session ↔ conversation binding: persist sessionId on conversation record, allow session resume via conversation ID |
| M5-005 | Session info broadcast: on model/agent switch, emit session:info with updated provider, model, agent name |
| M5-006 | Agent creation from TUI: /agent new command creates agent config via gateway API |
| M5-007 | Session metrics: track per-session token usage, model switches, duration — persist in DB |
| M5-008 | Verify: start TUI → /model claude-opus-4-6 → verify response uses Opus → /agent research-bot → verify system prompt changes |
Milestone 6: Job Queue Foundation
Goal: Reliable background processing via BullMQ. Foundation for future agent task orchestration.
| Task | Description |
|---|---|
| M6-001 | Add BullMQ dependency, configure with Valkey connection |
| M6-002 | Create queue service: typed job definitions, worker registration, error handling with exponential backoff |
| M6-003 | Migrate summarization cron → BullMQ repeatable job |
| M6-004 | Migrate GC (session cleanup) → BullMQ repeatable job |
| M6-005 | Migrate tier management (log archival) → BullMQ repeatable job |
| M6-006 | Admin jobs API: GET /api/admin/jobs — list active/completed/failed jobs, retry failed, pause/resume queues |
| M6-007 | Job event logging: emit job start/complete/fail events to agent_logs for observability |
| M6-008 | Verify: jobs execute on schedule, deliberate failure retries with backoff, admin endpoint shows job history |
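The retry policy from M6-002 can be sketched as a default options object mirroring BullMQ's `JobsOptions` shape (`attempts` plus exponential `backoff`). The delay formula below is what BullMQ's exponential strategy computes internally, reproduced here for clarity; the exact option values (3 attempts, 1s base) are this PRD's defaults, not BullMQ's.

```typescript
// Passed to queue.add() / as queue defaultJobOptions in the real service.
const defaultJobOptions = {
  attempts: 3,
  backoff: { type: "exponential" as const, delay: 1000 },
  removeOnComplete: 100, // keep recent history for /api/admin/jobs
};

// Exponential backoff: baseDelay * 2^(attemptsMade - 1) → 1s, 2s, 4s...
function backoffDelayMs(attemptsMade: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, attemptsMade - 1);
}
```

M6-008's deliberate-failure test should observe exactly these delays between retries before the job lands in the failed set.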
Milestone 7: Channel Protocol Design
Goal: Architecture document defining how remote interfaces (Matrix, Discord, Telegram) will integrate. No code — design only. Built into foundation now so Phase 10+ doesn't require gateway rewrites.
| Task | Description |
|---|---|
| M7-001 | Define IChannelAdapter interface: lifecycle (connect, disconnect, health), message flow (receiveMessage → gateway, sendMessage ← gateway), identity mapping (channel user ↔ Mosaic user) |
| M7-002 | Define channel message protocol: canonical message format that all adapters translate to/from (content, metadata, attachments, thread context) |
| M7-003 | Design Matrix integration: appservice registration, room ↔ conversation mapping, space ↔ team mapping, agent ghost users, power levels for human observation |
| M7-004 | Design conversation multiplexing: same conversation accessible from TUI + WebUI + Matrix simultaneously, real-time sync via gateway events |
| M7-005 | Design remote auth bridging: how a Matrix/Discord message authenticates to Mosaic (token linking, OAuth bridge, invite-based provisioning) |
| M7-006 | Design agent-to-agent communication via Matrix rooms: room per agent pair, human can join to observe, message format for structured agent dialogue |
| M7-007 | Design multi-user isolation in Matrix: space-per-team, room visibility rules, encryption considerations, admin visibility |
| M7-008 | Publish architecture doc: docs/architecture/channel-protocol.md — reviewed and approved before Phase 10 |
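A possible shape for the canonical message format from M7-002 is sketched below. All field names are proposals for the design doc, not settled protocol; the Matrix event fields used in the translation example are from the Matrix client-server event schema, but the mapping itself (room → conversation, sender → channelUserId) is this design's assumption pending M7-003.

```typescript
// Canonical format every IChannelAdapter translates to/from.
interface ChannelMessage {
  channel: "tui" | "webui" | "matrix" | "discord" | "telegram";
  channelUserId: string;   // native identity, mapped to a Mosaic user (M7-001)
  conversationId?: string; // resolved from room/thread mapping when known
  content: string;
  attachments: { mimeType: string; url: string }[];
  threadContext?: { parentMessageId: string };
  receivedAt: string;
}

// Example translation a hypothetical Matrix adapter might perform.
function fromMatrixEvent(ev: {
  sender: string;
  room_id: string;
  content: { body: string };
  origin_server_ts: number;
}): ChannelMessage {
  return {
    channel: "matrix",
    channelUserId: ev.sender,
    conversationId: ev.room_id, // room ↔ conversation mapping (M7-003)
    content: ev.content.body,
    attachments: [],
    receivedAt: new Date(ev.origin_server_ts).toISOString(),
  };
}
```

Because every adapter converges on this one shape, the gateway and conversation multiplexing (M7-004) never see channel-specific payloads.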
Technical Approach
Pi SDK Provider Adapter Pattern
The agent layer stays on Pi SDK. Provider diversity is solved at the adapter layer below Pi:
Provider SDKs (@anthropic-ai/sdk, openai, etc.)
→ IProviderAdapter implementations
→ ProviderRegistry (Pi SDK compatible)
→ Agent Session (Pi SDK) — tool loops, streaming, context
→ AgentService — lifecycle, routing, events
→ ChatGateway — WebSocket to all interfaces
Adding a provider means implementing IProviderAdapter. Everything above stays unchanged.
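A sketch of what that contract might look like, under the assumption that M3-001 lands roughly as described. The method names come from the milestone table; the parameter and return shapes are guesses to be firmed up during implementation.

```typescript
interface ModelInfo {
  id: string;
  tier: "cheap" | "standard" | "premium" | "local";
  contextWindow: number;
  supportsTools: boolean;
  supportsVision: boolean;
  embeddingDims?: number; // set only for embedding-capable models (M3-008)
}

interface IProviderAdapter {
  readonly name: string;
  register(credentials: { apiKey?: string; oauthToken?: string }): Promise<void>;
  listModels(): Promise<ModelInfo[]>;
  healthCheck(): Promise<{ healthy: boolean; latencyMs?: number }>;
  createClient(model: string): unknown; // Pi SDK-compatible client handle
}

// The registry is the only thing the layers above ever touch.
class ProviderRegistry {
  private adapters = new Map<string, IProviderAdapter>();

  add(adapter: IProviderAdapter): void {
    this.adapters.set(adapter.name, adapter);
  }

  get(name: string): IProviderAdapter | undefined {
    return this.adapters.get(name);
  }
}
```

An Anthropic or Z.ai adapter is then a file implementing `IProviderAdapter` plus one `registry.add()` call; routing, sessions, and the gateway are unaffected.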
Routing Decision Flow
User sends message
→ Task classifier (regex/keyword, optionally LLM-assisted)
→ { taskType, complexity, domain, requiredCapabilities }
→ RoutingEngine.resolve(classification, userOverrides, availableProviders)
→ Match rules by priority
→ Check provider health
→ Apply fallback chain
→ Return { provider, model, agentConfigId }
→ AgentService.createOrResumeSession(routingResult)
→ Session uses selected provider/model
→ Emit session:info with routing decision explanation
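The resolve step in the middle of that flow can be sketched as below. Rule and classification shapes follow M4-001/M4-002; the matching semantics (lower priority number wins, unhealthy providers skipped so the next match becomes the fallback) are assumptions this sketch makes concrete.

```typescript
interface Classification {
  taskType: string;
  complexity: string;
}

interface RoutingRule {
  priority: number;                     // lower = matched first
  conditions: Partial<Classification>;  // empty object = matches everything
  action: { provider: string; model: string };
}

function resolve(
  rules: RoutingRule[],
  cls: Classification,
  healthyProviders: Set<string>,
): { provider: string; model: string } | undefined {
  const matches = (r: RoutingRule) =>
    Object.entries(r.conditions).every(
      ([key, value]) => cls[key as keyof Classification] === value,
    );
  return rules
    .slice()
    .sort((a, b) => a.priority - b.priority)
    .filter(matches)
    // Fallback chain: walk matches in priority order until one is healthy.
    .find((r) => healthyProviders.has(r.action.provider))?.action;
}
```

Because health filtering happens after rule matching, taking a provider offline automatically demotes traffic to the next matching rule rather than failing the request.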
Embedding Strategy
Replace OpenAI-hardcoded embedding service with provider-agnostic interface:
- Default: Ollama serving nomic-embed-text (768-dim) or mxbai-embed-large (1024-dim)
- Fallback: any OpenAI-compatible embedding API
- Migration: Update pgvector column dimension if switching from 1536 (OpenAI) to 768/1024 (Ollama models)
- No external API dependency for vector operations in default configuration
Context Window Management
When conversation history exceeds model context:
- Calculate token count of full history
- If exceeds 80% of model context window, trigger summarization
- Summarize oldest N messages into a condensed context block
- Prepend summary + keep recent messages within context budget
- Store summary as a "context checkpoint" message in DB
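The 80% checkpoint rule above can be sketched as a split function. The chars/4 estimate stands in for a real tokenizer, and the function shape is an assumption; it only decides which messages to summarize versus keep, with the actual summarization and checkpoint write happening downstream.

```typescript
interface Msg {
  role: string;
  content: string;
}

// Rough token estimate; replace with the model's tokenizer in practice.
const estimateTokens = (m: Msg): number => Math.ceil(m.content.length / 4);

function splitForCheckpoint(
  history: Msg[],
  contextWindow: number,
): { toSummarize: Msg[]; keep: Msg[] } {
  const budget = Math.floor(contextWindow * 0.8); // the 80% trigger
  const total = history.reduce((n, m) => n + estimateTokens(m), 0);
  if (total <= budget) return { toSummarize: [], keep: history };

  // Walk from the newest message backwards, keeping as much recent
  // context as fits; everything older is summarized into a checkpoint.
  let used = 0;
  let cut = history.length;
  for (let i = history.length - 1; i >= 0; i--) {
    used += estimateTokens(history[i]);
    if (used > budget) break;
    cut = i;
  }
  return { toSummarize: history.slice(0, cut), keep: history.slice(cut) };
}
```

The `toSummarize` slice becomes the condensed context block, stored as a "context checkpoint" message per the last step above.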
Model Reference
| Provider | Model | Tier | Context | Tools | Vision | Embedding |
|---|---|---|---|---|---|---|
| Anthropic | Claude Opus 4.6 | premium | 200K | yes | yes | no |
| Anthropic | Claude Sonnet 4.6 | standard | 200K | yes | yes | no |
| Anthropic | Claude Haiku 4.5 | cheap | 200K | yes | yes | no |
| OpenAI | Codex gpt-5.4 | premium | 128K+ | yes | yes | no |
| Z.ai | GLM-5 | standard | TBD | TBD | TBD | no |
| OpenRouter | varies | varies | varies | varies | varies | no |
| Ollama | llama3.2 | local/free | 128K | yes | no | no |
| Ollama | nomic-embed-text | — | — | — | — | yes (768-dim) |
| Ollama | mxbai-embed-large | — | — | — | — | yes (1024-dim) |
Default Routing Rules (Seed Data)
| Priority | Condition | Route To |
|---|---|---|
| 1 | taskType=coding AND complexity=complex | Opus 4.6 |
| 2 | taskType=coding AND complexity=moderate | Sonnet 4.6 |
| 3 | taskType=coding AND complexity=simple | Codex gpt-5.4 |
| 4 | taskType=research | Codex gpt-5.4 |
| 5 | taskType=summarization | GLM-5 |
| 6 | taskType=analysis AND requiredCapabilities includes reasoning | Opus 4.6 |
| 7 | taskType=conversation | Sonnet 4.6 |
| 8 | taskType=creative | Sonnet 4.6 |
| 9 | costTier=cheap OR domain=general | Haiku 4.5 |
| 10 | fallback (no rule matched) | Sonnet 4.6 |
| 99 | provider=ollama forced OR offline mode | llama3.2 |
Rules are user-customizable. Admins set system defaults; users override for their sessions.
Risks and Open Questions
| Risk | Impact | Mitigation |
|---|---|---|
| Pi SDK doesn't support custom provider adapters cleanly | High — blocks M3 | Verify in M3-001; fallback: wrap native SDKs and bypass Pi's registry, feeding responses into Pi's session format |
| BullMQ + Valkey incompatibility | Medium — blocks M6 | Test in M6-001 before migrating jobs; fallback: use bullmq with ioredis directly |
| Embedding dimension migration (1536 → 768/1024) | Medium — data migration | Run migration script to re-embed existing insights; or start fresh if insight count is low |
| Z.ai GLM-5 API undocumented | Low — blocks one provider | Deprioritize; other 4 providers cover all use cases |
| Context window summarization quality | Medium — affects UX | Start with simple truncation; add LLM summarization iteratively |
| OAuth flow complexity in TUI (no browser redirect) | Medium | URL-display + clipboard + Valkey poll token pattern (already designed in P8-012) |
Open Questions
- What is the Z.ai GLM-5 API format? OpenAI-compatible or custom SDK? (Research in M3-005)
- Should routing classification use LLM-assisted classification from the start, or rule-based only? (ASSUMPTION: rule-based first, LLM-assisted later)
- What Ollama embedding model provides the best quality/performance tradeoff? (Test nomic-embed-text vs mxbai-embed-large in M3-009)
- Should provider credentials be stored in DB per-user, or remain environment-variable based for system-wide providers? (ASSUMPTION: hybrid — env vars for system defaults, DB for per-user overrides)
Milestone / Delivery Intent
- Target version: v0.2.0
- Milestone count: 7
- Definition of done: all 10 acceptance criteria verified with evidence, all quality gates green, PRD status updated to completed
- Delivery order: M1 (persistence) → M2 (security) → M3 (providers) → M4 (routing) → M5 (sessions) → M6 (jobs) → M7 (channel design)
- M1 and M2 are prerequisites — no provider or routing work begins until conversations persist and data is user-scoped