# Mosaic Stack — Federation Implementation Milestones

**Companion to:** `PRD.md`

**Approach:** Each milestone is a verifiable slice. A milestone is "done" only when its acceptance tests pass in CI against a real (not mocked) dependency stack.

---

## Milestone Dependency Graph

```
M1 (federated tier infra)
└── M2 (Step-CA + grant schema + CLI)
    └── M3 (mTLS handshake + list/get + scope enforcement)
        ├── M4 (search + audit + rate limit)
        │   └── M5 (cache + offline degradation + OTEL)
        ├── M6 (revocation + auto-renewal)        ◄── can start after M3
        └── M7 (multi-user hardening + e2e suite) ◄── depends on M4+M5+M6
```

M5 and M6 can run in parallel once M4 is merged.

---

## Test Strategy (applies to all milestones)

Three layers, all required before a milestone ships:

| Layer              | Scope                                         | Runtime                                                                  |
| ------------------ | --------------------------------------------- | ------------------------------------------------------------------------ |
| **Unit**           | Per-module logic, pure functions, adapters    | Vitest, no I/O                                                           |
| **Integration**    | Single gateway against real PG/Valkey/Step-CA | Vitest + Docker Compose test profile                                     |
| **Federation E2E** | Two gateways on a Docker network, real mTLS   | Playwright/custom harness (`tools/federation-harness/`) introduced in M3 |

Every milestone adds tests to these layers. A milestone cannot be claimed complete if the federation E2E harness fails (applies from M3 onward).

**Quality gates per milestone** (same as stack-wide):

- `pnpm typecheck` green
- `pnpm lint` green
- `pnpm test` green (unit + integration)
- `pnpm test:federation` green (M3+)
- Independent code review passed
- Docs updated (`docs/federation/`)
- Merged PR on `main`, CI terminal green, linked issue closed

---

## M1 — Federated Tier Infrastructure

**Goal:** A gateway can run in `federated` tier with containerized Postgres + Valkey + pgvector, with no federation logic active yet.
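As a rough sketch of what the `federated` tier addition to the config schema could look like: only the `tier` field is from this document; every other field and type name below is an illustrative assumption, not the real `mosaic.config.json` shape.

```typescript
// Hypothetical sketch — field names other than "tier" are assumptions,
// not the actual mosaic.config.json schema.
type Tier = "local" | "standalone" | "federated";

interface MosaicConfig {
  tier: Tier;
  // Assumed: federated tier must name its backing services explicitly.
  postgresUrl?: string;
  valkeyUrl?: string;
}

const TIERS: readonly Tier[] = ["local", "standalone", "federated"];

/** Parse raw config, rejecting unknown tiers and incomplete federated configs. */
function parseConfig(raw: unknown): MosaicConfig {
  const cfg = raw as Partial<MosaicConfig>;
  if (!TIERS.includes(cfg.tier as Tier)) {
    throw new Error(`unknown tier: ${String(cfg.tier)}`);
  }
  if (cfg.tier === "federated" && (!cfg.postgresUrl || !cfg.valkeyUrl)) {
    throw new Error("federated tier requires postgresUrl and valkeyUrl");
  }
  return cfg as MosaicConfig;
}
```

The point of the sketch is the validation posture: an unknown tier or an incomplete federated config is rejected at parse time, before any service is touched.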
**Scope:**

- Add `"tier": "federated"` to `mosaic.config.json` schema and validators
- Docker Compose `federated` profile (`docker-compose.federated.yml`) adds: Postgres+pgvector (5433), Valkey (6380), dedicated volumes
- Tier detector in gateway bootstrap: reads config, asserts required services reachable, refuses to start otherwise
- `pgvector` extension installed + verified on startup
- Migration logic: safe upgrade path from `local`/`standalone` → `federated` (data export/import script, one-way)
- `mosaic doctor` reports tier + service health
- Gateway continues to serve as a normal standalone instance (no federation yet)

**Deliverables:**

- `mosaic.config.json` schema v2 (tier enum includes `federated`)
- `apps/gateway/src/bootstrap/tier-detector.ts`
- `docker-compose.federated.yml`
- `scripts/migrate-to-federated.ts`
- Updated `mosaic doctor` output
- Updated `packages/storage/src/adapters/postgres.ts` with pgvector support

**Acceptance tests:**

| # | Test | Layer |
| - | ---- | ----- |
| 1 | Gateway boots in `federated` tier with all services present | Integration |
| 2 | Gateway refuses to boot in `federated` tier when Postgres unreachable (fail-fast, clear) | Integration |
| 3 | `pgvector` extension available in target DB (`SELECT * FROM pg_extension WHERE extname='vector'`) | Integration |
| 4 | Migration script moves a populated `local` (PGlite) instance to `federated` (Postgres) with no data loss | Integration |
| 5 | `mosaic doctor` reports correct tier and all services green | Unit |
| 6 | Existing standalone behavior regression: agent session works end-to-end, no federation references | E2E (single-gateway) |

**Estimated budget:** ~20K tokens (infra + config + migration script)

**Risk notes:** Pgvector install on existing PG installs is occasionally finicky; test the migration path on a realistic DB snapshot.
---

## M2 — Step-CA + Grant Schema + Admin CLI

**Goal:** An admin can create a federation grant and its counterparty can enroll. No runtime traffic flows yet.

**Scope:**

- Embed Step-CA as a Docker Compose sidecar with a persistent CA volume
- Gateway exposes a short-lived enrollment endpoint (single-use token from the grant)
- DB schema: `federation_grants`, `federation_peers`, `federation_audit_log` (table only, not yet written to)
- Sealed storage for `client_key_pem` using the existing credential sealing key
- Admin CLI:
  - `mosaic federation grant create --user <user-id> --peer <peer-host> --scope <scope-file>`
  - `mosaic federation grant list`
  - `mosaic federation grant show <grant-id>`
  - `mosaic federation peer add <peer-host>`
  - `mosaic federation peer list`
- Step-CA signs the cert with SAN OIDs for `grantId` + `subjectUserId`
- Grant status transitions: `pending` → `active` on successful enrollment

**Deliverables:**

- `packages/db` migration: three federation tables + enum types
- `apps/gateway/src/federation/ca.service.ts` (Step-CA client)
- `apps/gateway/src/federation/grants.service.ts`
- `apps/gateway/src/federation/enrollment.controller.ts`
- `packages/mosaic/src/commands/federation/` (grant + peer subcommands)
- `docker-compose.federated.yml` adds Step-CA service
- Scope JSON schema + validator

**Acceptance tests:**

| # | Test | Layer |
| - | ---- | ----- |
| 1 | `grant create` writes a `pending` row with a scoped bundle | Integration |
| 2 | Enrollment endpoint signs a CSR and returns a cert with expected SAN OIDs | Integration |
| 3 | Enrollment token is single-use; second attempt returns 410 | Integration |
| 4 | Cert `subjectUserId` OID matches the grant's `subject_user_id` | Unit |
| 5 | `client_key_pem` is at-rest encrypted; raw DB read shows ciphertext, not PEM | Integration |
| 6 | `peer add <peer-host>` on Server A yields an `active` peer record with a valid cert + key | E2E (two gateways, no traffic) |
| 7 | Scope JSON with unknown resource type rejected at `grant create` | Unit |
| 8 | `grant list` and `peer list` render active / pending / revoked accurately | Unit |

**Estimated budget:** ~30K tokens (schema + CA integration + CLI + sealing)

**Risk notes:** Step-CA's API surface is well-documented, but the sealing integration with existing provider-credential encryption is a cross-module concern — walk that seam deliberately.

---

## M3 — mTLS Handshake + `list` + `get` with Scope Enforcement

**Goal:** Two federated gateways exchange real data over mTLS with scope intersecting native RBAC.

**Scope:**

- `FederationClient` (outbound): picks cert from `federation_peers`, does mTLS call
- `FederationServer` (inbound): NestJS guard validates client cert, extracts `grantId` + `subjectUserId`, loads grant
- Scope enforcement pipeline:
  1. Resource allowlist / excluded-list check
  2. Native RBAC evaluation as the `subjectUserId`
  3. Scope filter intersection (`include_teams`, `include_personal`)
  4. `max_rows_per_query` cap
- Verbs: `list`, `get`, `capabilities`
- Gateway query layer accepts `source: "local" | "federated:<peer>" | "all"`; fan-out for `"all"`
- **Federation E2E harness** (`tools/federation-harness/`): `docker-compose.two-gateways.yml`, seed script, assertion helpers — this is its own deliverable

**Deliverables:**

- `apps/gateway/src/federation/client/federation-client.service.ts`
- `apps/gateway/src/federation/server/federation-auth.guard.ts`
- `apps/gateway/src/federation/server/scope.service.ts`
- `apps/gateway/src/federation/server/verbs/{list,get,capabilities}.controller.ts`
- `apps/gateway/src/federation/client/query-source.service.ts` (fan-out/merge)
- `tools/federation-harness/` (compose + seed + test helpers)
- `packages/types` — federation request/response DTOs in `federation.dto.ts`

**Acceptance tests:**

| #  | Test | Layer |
| -- | ---- | ----- |
| 1  | A→B `list tasks` returns subjectUser's tasks intersected with scope | E2E |
| 2  | A→B `list tasks` with `include_teams: [T1]` excludes T2 tasks the user owns | E2E |
| 3  | A→B `get credential <id>` returns 403 when `credentials` is in `excluded_resources` | E2E |
| 4  | Client presenting cert for grant X cannot query subjectUser of grant Y (cross-user isolation) | E2E |
| 5  | Cert signed by untrusted CA rejected at TLS layer (no NestJS handler reached) | E2E |
| 6  | Malformed SAN OIDs → 401; cert valid but grant revoked in DB → 403 | Integration |
| 7  | `max_rows_per_query` caps the response; larger result sets are paginated | Integration |
| 8  | `source: "all"` fan-out merges local + federated results, each tagged with `_source` | Integration |
| 9  | Federation responses never persist: verify DB row count unchanged after `list` round-trip | E2E |
| 10 | Scope cannot grant more than native RBAC: user without access to team T still gets [] even if scope allows T | E2E |

**Estimated budget:** ~40K tokens (largest milestone — core federation logic + harness)

**Risk notes:** This is the critical trust boundary. Code review should focus on scope enforcement bypass and cert-SAN-spoofing paths. Every 403/401 path needs a test.

---

## M4 — `search` Verb + Audit Log + Rate Limit

**Goal:** Keyword search over allowed resources with full audit and per-grant rate limiting.
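M4's audit log stores a `query_hash` (SHA-256 of normalized query params) rather than the query body. One plausible normalization — sort keys so logically identical queries hash identically — sketched with Node's built-in `crypto`; the function name and shape are assumptions, not the real audit code:

```typescript
import { createHash } from "node:crypto";

// Sketch: normalize query params by sorting keys, then hash, so that
// { q, resource } and { resource, q } audit to the same query_hash.
// Illustrative only — not the actual audit.service.ts implementation.
function queryHash(params: Record<string, unknown>): string {
  const normalized = Object.keys(params)
    .sort()
    .map((k) => `${k}=${JSON.stringify(params[k])}`)
    .join("&");
  return createHash("sha256").update(normalized).digest("hex");
}
```

Hashing after key-sorting keeps the audit trail correlatable ("same query repeated") without ever persisting the query body itself.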
**Scope:**

- `search` verb across `resources` allowlist (intersection of scope + native RBAC)
- Keyword search (reuse existing `packages/memory/src/adapters/keyword.ts`); pgvector search stays out of the v1 search verb
- Every federated request (all verbs) writes to `federation_audit_log`: `grant_id`, `verb`, `resource`, `query_hash`, `outcome`, `bytes_out`, `latency_ms`
- No request body captured; `query_hash` is SHA-256 of normalized query params
- Token-bucket rate limit per grant (default 60/min, override per grant)
- 429 response with `Retry-After` header and structured body
- 90-day hot retention for audit log; cold-tier rollover deferred to M7

**Deliverables:**

- `apps/gateway/src/federation/server/verbs/search.controller.ts`
- `apps/gateway/src/federation/server/audit.service.ts` (async write, no blocking)
- `apps/gateway/src/federation/server/rate-limit.guard.ts`
- Tests in harness

**Acceptance tests:**

| # | Test | Layer |
| - | ---- | ----- |
| 1 | `search` returns ranked hits only from allowed resources | E2E |
| 2 | `search` excluding `credentials` does not return a match even when keyword matches a credential name | E2E |
| 3 | Every successful request appears in `federation_audit_log` within 1s | Integration |
| 4 | Denied request (403) is also audited with `outcome='denied'` | Integration |
| 5 | Audit row stores query hash but NOT query body | Unit |
| 6 | 61st request in 60s window returns 429 with `Retry-After` | E2E |
| 7 | Per-grant override (e.g., 600/min) takes effect without restart | Integration |
| 8 | Audit writes are async: request latency unchanged when audit write slow (simulated) | Integration |

**Estimated budget:** ~20K tokens

**Risk notes:** Ensure audit writes can't block or error-out the request path; use a bounded queue and drop-with-counter pattern rather than in-line writes.
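The bounded-queue, drop-with-counter pattern from the risk note might look like this minimal sketch (class and field names are assumptions for illustration):

```typescript
// Sketch of the risk note's pattern: audit writes are enqueued on the
// request path in O(1) and never awaited; overflow increments a counter
// (surfaced via metrics) instead of blocking or failing the request.
interface AuditEntry {
  grantId: string;
  verb: string;
  outcome: string;
}

class BoundedAuditQueue {
  private queue: AuditEntry[] = [];
  public dropped = 0;

  constructor(private readonly capacity: number) {}

  /** Called on the request path: constant-time, never throws, never blocks. */
  enqueue(entry: AuditEntry): void {
    if (this.queue.length >= this.capacity) {
      this.dropped++; // drop-with-counter rather than in-line write
      return;
    }
    this.queue.push(entry);
  }

  /** Called by a background flusher to drain a batch toward the DB. */
  drain(max: number): AuditEntry[] {
    return this.queue.splice(0, max);
  }
}
```

A slow DB then shows up as a rising `dropped` counter and a growing queue, both observable, while request latency stays flat (acceptance test 8).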
---

## M5 — Cache + Offline Degradation + Observability

**Goal:** Sessions feel fast and stay useful when the peer is slow or down.

**Scope:**

- In-memory response cache keyed by `(grant_id, verb, resource, query_hash)`, TTL 30s default
- Cache NOT used for `search`; only `list` and `get`
- Cache flushed on cert rotation and grant revocation
- Circuit breaker per peer: after N failures, fast-fail for cooldown window
- `_source` tagging extended with `_cached: true` when served from cache
- Agent-visible "federation offline for `<peer>`" signal emitted once per session per peer
- OTEL spans: `federation.request` with attrs `grant_id`, `peer`, `verb`, `resource`, `outcome`, `latency_ms`, `cached`
- W3C `traceparent` propagated across the mTLS boundary (both directions)
- `mosaic federation status` CLI subcommand

**Deliverables:**

- `apps/gateway/src/federation/client/response-cache.service.ts`
- `apps/gateway/src/federation/client/circuit-breaker.service.ts`
- `apps/gateway/src/federation/observability/` (span helpers)
- `packages/mosaic/src/commands/federation/status.ts`

**Acceptance tests:**

| # | Test | Layer |
| - | ---- | ----- |
| 1 | Two identical `list` calls within 30s: second served from cache, flagged `_cached` | Integration |
| 2 | `search` is never cached: two identical searches both hit the peer | Integration |
| 3 | After grant revocation, peer's cache is flushed immediately | Integration |
| 4 | After N consecutive failures, circuit opens; subsequent requests fail-fast without network call | E2E |
| 5 | Circuit closes after cooldown and next success | E2E |
| 6 | With peer offline, session completes using local data, one "federation offline" signal surfaced | E2E |
| 7 | OTEL traces show spans on both gateways correlated by `traceparent` | E2E |
| 8 | `mosaic federation status` prints peer state, cert expiry, last success/failure, circuit state | Unit |

**Estimated budget:** ~20K tokens

**Risk notes:** Caching correctness under revocation must be provable — write tests that intentionally race revocation against cached hits.

---

## M6 — Revocation, Auto-Renewal, CRL

**Goal:** Grant lifecycle works end-to-end: admin revoke, revoke-on-delete, automatic cert renewal, CRL distribution.

**Scope:**

- `mosaic federation grant revoke <grant-id>` → status `revoked`, CRL updated, audit entry
- DB hook: deleting a user cascades `revoke-on-delete` on all grants where that user is subject
- Step-CA CRL endpoint exposed; serving gateway enforces CRL check on every handshake (cached CRL, refresh interval 60s)
- Client-side cert renewal job: at T-7 days, submit renewal CSR; rotate cert atomically; flush cache
- On renewal failure, peer marked `degraded` and admin-visible alert emitted
- Server A detects revocation on next request (TLS handshake fails with specific error) → peer marked `revoked`, user notified

**Deliverables:**

- `apps/gateway/src/federation/server/crl.service.ts` + endpoint
- `apps/gateway/src/federation/server/revocation.service.ts`
- DB cascade trigger or ORM hook for user deletion → grant revocation
- `apps/gateway/src/federation/client/renewal.job.ts` (scheduled)
- `packages/mosaic/src/commands/federation/grant.ts` gains `revoke` subcommand

**Acceptance tests:**

| # | Test | Layer |
| - | ---- | ----- |
| 1 | Admin `grant revoke` → A's next request fails with TLS-level error | E2E |
| 2 | Deleting subject user on B auto-revokes all grants where that user was the subject | Integration |
| 3 | CRL endpoint serves correct list; revoked cert present | Integration |
| 4 | Server rejects cert listed in CRL even if cert itself is still time-valid | E2E |
| 5 | Cert at T-7 days triggers renewal job; new cert issued and installed without dropped requests | E2E |
| 6 | Renewal failure marks peer `degraded` and surfaces alert | Integration |
| 7 | A marks peer `revoked` after a revocation-caused handshake failure (not on transient network errors) | E2E |

**Estimated budget:** ~20K tokens

**Risk notes:** The atomic cert swap during renewal is the sharpest edge here — any in-flight request mid-swap must either complete on the old cert or retry on the new one, never fail mid-call.

---

## M7 — Multi-User RBAC Hardening + Team-Scoped Grants + Acceptance Suite

**Goal:** The full multi-tenant scenario from the §4 user stories works end-to-end, with no cross-user leakage under any circumstance.

**Scope:**

- Three-user scenario on Server B (E1, E2, E3), each with their own Server A
- Team-scoped grants exercised: each employee's team data visible on their own A, but E1's personal data never visible on E2's A
- User-facing UI surfaces on both gateways for: peer list, grant list, audit log viewer, scope editor
- Negative-path test matrix (every denial path from PRD §8)
- All PRD §15 acceptance criteria mapped to automated tests in the harness
- Security review: cert-spoofing, scope-bypass, audit-bypass paths explicitly tested
- Cold-storage rollover for audit log >90 days
- Docs: operator runbook, onboarding guide, troubleshooting guide

**Deliverables:**

- Full federation acceptance suite in `tools/federation-harness/acceptance/`
- `apps/web` surfaces for peer/grant/audit management
- `docs/federation/RUNBOOK.md`, `docs/federation/ONBOARDING.md`, `docs/federation/TROUBLESHOOTING.md`
- Audit cold-tier job (daily cron, moves rows >90d to separate table or object storage)

**Acceptance tests:** Every PRD §15 criterion must be automated and green.
Additionally:

| # | Test | Layer |
| - | ---- | ----- |
| 1 | 3-employee scenario: each A sees only its user's data from B | E2E |
| 2 | Grant with team scope returns team data; same grant denied access to another employee's personal data | E2E |
| 3 | Concurrent sessions from E1's and E2's Server A to B interleave without any leakage | E2E |
| 4 | Audit log across 3-user test shows per-grant trails with no mis-attributed rows | E2E |
| 5 | Scope editor UI round-trip: edit → save → next request uses new scope | E2E |
| 6 | Attempt to use a revoked grant's cert against a different grant's endpoint: rejected | E2E |
| 7 | 90-day-old audit rows moved to cold tier; queryable via explicit historical query | Integration |
| 8 | Runbook steps validated: an operator following the runbook can onboard, rotate, and revoke | Manual checklist |

**Estimated budget:** ~25K tokens

**Risk notes:** This is the security-critical milestone. Budget review time here is non-negotiable — plan for two independent code reviews (internal + security-focused) before merge.

---

## Total Budget & Timeline Sketch

| Milestone | Tokens (est.) | Can parallelize?       |
| --------- | ------------- | ---------------------- |
| M1        | 20K           | No (foundation)        |
| M2        | 30K           | No (needs M1)          |
| M3        | 40K           | No (needs M2)          |
| M4        | 20K           | No (needs M3)          |
| M5        | 20K           | Yes (with M6 after M4) |
| M6        | 20K           | Yes (with M5 after M3) |
| M7        | 25K           | No (needs all)         |
| **Total** | **~175K**     |                        |

Parallelization of M5 and M6 after M4 saves one milestone's worth of serial time.
---

## Exit Criteria (federation feature complete)

All of the following must be green on `main`:

- Every PRD §15 acceptance criterion automated and passing
- Every milestone's acceptance table green
- Security review sign-off on M7
- Runbook walk-through completed by an operator (not the author)
- `mosaic doctor` recognizes the federated tier and reports peer health accurately
- Two-gateway production deployment (woltje.com ↔ uscllc.com) operational for ≥7 days without incident

---

## Next Step After This Doc Is Approved

1. File tracking issues on `git.mosaicstack.dev/mosaicstack/stack` — one per milestone, labeled `epic:federation`
2. Populate `docs/TASKS.md` with M1's task breakdown (per-task agent assignment, budget, dependencies)
3. Begin M1 implementation