stack/docs/federation/MILESTONES.md
Jarvis 47aac682f5
docs(federation): PRD, milestones, mission manifest, and M1 task breakdown
Plans the Federation v1 mission: cross-instance data federation between
Mosaic Stack gateways with asymmetric trust (home gateway sees blended
A+B at session time; work gateway sees only its own tenants), mTLS via
X.509 / Step-CA for auth, multi-tenant RBAC with no cross-user leakage,
and no data persistence across the boundary.

- docs/federation/PRD.md — 16-section product requirements (v1 locked)
- docs/federation/MILESTONES.md — 7-milestone decomposition with
  per-milestone acceptance test tables across unit/integration/E2E layers
- docs/federation/MISSION-MANIFEST.md — mission scope, success criteria,
  milestone table linked to issues #460-#466
- docs/federation/TASKS.md — FED-M1 decomposed into 12 tasks; M2-M7
  deferred to per-milestone planning to avoid speculative decomposition

Refs: #460 #461 #462 #463 #464 #465 #466

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 17:04:39 -05:00


Mosaic Stack — Federation Implementation Milestones

Companion to: PRD.md

Approach: Each milestone is a verifiable slice. A milestone is "done" only when its acceptance tests pass in CI against a real (not mocked) dependency stack.


Milestone Dependency Graph

M1 (federated tier infra)
  └── M2 (Step-CA + grant schema + CLI)
        └── M3 (mTLS handshake + list/get + scope enforcement)
              ├── M4 (search + audit + rate limit)
              │     └── M5 (cache + offline degradation + OTEL)
              ├── M6 (revocation + auto-renewal)  ◄── can start after M3
              └── M7 (multi-user hardening + e2e suite)  ◄── depends on M4+M5+M6

M6 can start once M3 is merged and M5 once M4 is merged; from that point the two run in parallel.


Test Strategy (applies to all milestones)

Three layers, all required before a milestone ships:

| Layer | Scope | Runtime |
|---|---|---|
| Unit | Per-module logic, pure functions, adapters | Vitest, no I/O |
| Integration | Single gateway against real PG/Valkey/Step-CA | Vitest + Docker Compose test profile |
| Federation E2E | Two gateways on a Docker network, real mTLS | Playwright/custom harness (tools/federation-harness/), introduced in M3 |

Every milestone adds tests to these layers. A milestone cannot be claimed complete if the federation E2E harness fails (applies from M3 onward).

Quality gates per milestone (same as stack-wide):

  • pnpm typecheck green
  • pnpm lint green
  • pnpm test green (unit + integration)
  • pnpm test:federation green (M3+)
  • Independent code review passed
  • Docs updated (docs/federation/)
  • Merged PR on main, CI terminal green, linked issue closed

M1 — Federated Tier Infrastructure

Goal: A gateway can run in federated tier with containerized Postgres + Valkey + pgvector, with no federation logic active yet.

Scope:

  • Add "tier": "federated" to mosaic.config.json schema and validators
  • Docker Compose federated profile (docker-compose.federated.yml) adds: Postgres+pgvector (5433), Valkey (6380), dedicated volumes
  • Tier detector in gateway bootstrap: reads config, asserts required services reachable, refuses to start otherwise
  • pgvector extension installed + verified on startup
  • Migration logic: safe upgrade path from local/standalone → federated (data export/import script, one-way)
  • mosaic doctor reports tier + service health
  • Gateway continues to serve as a normal standalone instance (no federation yet)
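The fail-fast behavior of the tier detector can be sketched as a pure function; the config shape, service names, and function signature below are illustrative assumptions, not the real tier-detector.ts API:

```typescript
// Hypothetical sketch of the federated-tier fail-fast check.
// Service names and the probe shape are assumptions, not the real schema.
type Tier = "local" | "standalone" | "federated";

interface ServiceProbe {
  name: string;
  reachable: boolean; // in the gateway this would be a real TCP/health probe
}

function assertTierBootable(tier: Tier, probes: ServiceProbe[]): void {
  if (tier !== "federated") return; // non-federated tiers add no extra deps
  const missing = probes.filter((p) => !p.reachable).map((p) => p.name);
  if (missing.length > 0) {
    // Refuse to start with a clear message, as acceptance test #2 requires
    throw new Error(`federated tier requires: ${missing.join(", ")} (unreachable)`);
  }
}
```

The key property is that the gateway never limps along half-configured: either all required services answer their probes, or boot aborts with the missing services named.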

Deliverables:

  • mosaic.config.json schema v2 (tier enum includes federated)
  • apps/gateway/src/bootstrap/tier-detector.ts
  • docker-compose.federated.yml
  • scripts/migrate-to-federated.ts
  • Updated mosaic doctor output
  • Updated packages/storage/src/adapters/postgres.ts with pgvector support

Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | Gateway boots in federated tier with all services present | Integration |
| 2 | Gateway refuses to boot in federated tier when Postgres is unreachable (fail-fast, clear error) | Integration |
| 3 | pgvector extension available in target DB (`SELECT * FROM pg_extension WHERE extname='vector'`) | Integration |
| 4 | Migration script moves a populated local (PGlite) instance to federated (Postgres) with no data loss | Integration |
| 5 | mosaic doctor reports correct tier and all services green | Unit |
| 6 | Existing standalone behavior regression: agent session works end-to-end, no federation references | E2E (single-gateway) |

Estimated budget: ~20K tokens (infra + config + migration script)

Risk notes: pgvector installation on existing Postgres instances is occasionally finicky; test the migration path on a realistic DB snapshot.


M2 — Step-CA + Grant Schema + Admin CLI

Goal: An admin can create a federation grant and its counterparty can enroll. No runtime traffic flows yet.

Scope:

  • Embed Step-CA as a Docker Compose sidecar with a persistent CA volume
  • Gateway exposes a short-lived enrollment endpoint (single-use token from the grant)
  • DB schema: federation_grants, federation_peers, federation_audit_log (table only, not yet written to)
  • Sealed storage for client_key_pem using the existing credential sealing key
  • Admin CLI:
    • mosaic federation grant create --user <id> --peer <host> --scope <file>
    • mosaic federation grant list
    • mosaic federation grant show <id>
    • mosaic federation peer add <enrollment-url>
    • mosaic federation peer list
  • Step-CA signs the cert with SAN OIDs for grantId + subjectUserId
  • Grant status transitions: pending → active on successful enrollment
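The single-use token and the pending → active transition can be sketched together; the field names and HTTP-style return codes below are assumptions standing in for the real grants.service.ts:

```typescript
// Hypothetical sketch: enrollment consumes a single-use token and moves
// the grant from pending to active. Field names are assumptions.
type GrantStatus = "pending" | "active" | "revoked";

interface Grant {
  id: string;
  status: GrantStatus;
  enrollmentTokenUsed: boolean;
}

// Returns an HTTP-style status: 200 on first use, 410 Gone afterwards,
// matching acceptance test #3 (enrollment token is single-use).
function consumeEnrollmentToken(grant: Grant): number {
  if (grant.status !== "pending" || grant.enrollmentTokenUsed) return 410;
  grant.enrollmentTokenUsed = true;
  grant.status = "active"; // pending → active on successful enrollment
  return 200;
}
```

In the real service the check-and-set would need to be atomic at the DB level (e.g. a conditional UPDATE) so two concurrent enrollment attempts cannot both succeed.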

Deliverables:

  • packages/db migration: three federation tables + enum types
  • apps/gateway/src/federation/ca.service.ts (Step-CA client)
  • apps/gateway/src/federation/grants.service.ts
  • apps/gateway/src/federation/enrollment.controller.ts
  • packages/mosaic/src/commands/federation/ (grant + peer subcommands)
  • docker-compose.federated.yml adds Step-CA service
  • Scope JSON schema + validator

Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | `grant create` writes a pending row with a scoped bundle | Integration |
| 2 | Enrollment endpoint signs a CSR and returns a cert with expected SAN OIDs | Integration |
| 3 | Enrollment token is single-use; second attempt returns 410 | Integration |
| 4 | Cert subjectUserId OID matches the grant's subject_user_id | Unit |
| 5 | client_key_pem is encrypted at rest; raw DB read shows ciphertext, not PEM | Integration |
| 6 | `peer add <url>` on Server A yields an active peer record with a valid cert + key | E2E (two gateways, no traffic) |
| 7 | Scope JSON with an unknown resource type is rejected at grant create | Unit |
| 8 | `grant list` and `peer list` render active / pending / revoked accurately | Unit |

Estimated budget: ~30K tokens (schema + CA integration + CLI + sealing)

Risk notes: Step-CA's API surface is well-documented, but the sealing integration with the existing provider-credential encryption is a cross-module concern; walk that seam deliberately.


M3 — mTLS Handshake + list + get with Scope Enforcement

Goal: Two federated gateways exchange real data over mTLS with scope intersecting native RBAC.

Scope:

  • FederationClient (outbound): picks cert from federation_peers, does mTLS call
  • FederationServer (inbound): NestJS guard validates client cert, extracts grantId + subjectUserId, loads grant
  • Scope enforcement pipeline:
    1. Resource allowlist / excluded-list check
    2. Native RBAC evaluation as the subjectUserId
    3. Scope filter intersection (include_teams, include_personal)
    4. max_rows_per_query cap
  • Verbs: list, get, capabilities
  • Gateway query layer accepts source: "local" | "federated:<host>" | "all"; fan-out for "all"
  • Federation E2E harness (tools/federation-harness/): docker-compose.two-gateways.yml, seed script, assertion helpers — this is its own deliverable
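The four-step enforcement pipeline above can be sketched as one pure function. Everything here is illustrative: the types, field names, and the assumption that step 2 (native RBAC as subjectUserId) has already filtered the input rows upstream.

```typescript
// Hypothetical sketch of steps 1, 3, and 4 of the scope pipeline;
// step 2 (native RBAC) is assumed to have produced rbacVisibleRows.
interface Scope {
  excludedResources: string[];
  includeTeams: string[];
  includePersonal: boolean;
  maxRowsPerQuery: number;
}

interface Row {
  resource: string;
  teamId: string | null; // null means a personal (non-team) item
  ownerIsSubject: boolean;
}

function enforceScope(resource: string, rbacVisibleRows: Row[], scope: Scope): Row[] {
  // 1. Resource allowlist / excluded-list check
  if (scope.excludedResources.includes(resource)) {
    throw new Error("403: resource excluded by grant scope");
  }
  // 3. Scope filter intersection (include_teams, include_personal)
  const scoped = rbacVisibleRows.filter((r) =>
    r.teamId !== null
      ? scope.includeTeams.includes(r.teamId)
      : scope.includePersonal && r.ownerIsSubject,
  );
  // 4. max_rows_per_query cap (further results arrive via pagination)
  return scoped.slice(0, scope.maxRowsPerQuery);
}
```

Because the scope filter only ever intersects with rows RBAC already allowed, a scope can narrow but never widen what the subject user could see natively, which is exactly the invariant acceptance test #10 checks.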

Deliverables:

  • apps/gateway/src/federation/client/federation-client.service.ts
  • apps/gateway/src/federation/server/federation-auth.guard.ts
  • apps/gateway/src/federation/server/scope.service.ts
  • apps/gateway/src/federation/server/verbs/{list,get,capabilities}.controller.ts
  • apps/gateway/src/federation/client/query-source.service.ts (fan-out/merge)
  • tools/federation-harness/ (compose + seed + test helpers)
  • packages/types — federation request/response DTOs in federation.dto.ts

Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | A→B `list tasks` returns subjectUser's tasks intersected with scope | E2E |
| 2 | A→B `list tasks` with include_teams: [T1] excludes T2 tasks the user owns | E2E |
| 3 | A→B `get credential <id>` returns 403 when credentials is in excluded_resources | E2E |
| 4 | Client presenting cert for grant X cannot query subjectUser of grant Y (cross-user isolation) | E2E |
| 5 | Cert signed by an untrusted CA is rejected at the TLS layer (no NestJS handler reached) | E2E |
| 6 | Malformed SAN OIDs → 401; cert valid but grant revoked in DB → 403 | Integration |
| 7 | max_rows_per_query caps the response; requests for more are paginated | Integration |
| 8 | source: "all" fan-out merges local + federated results, each tagged with _source | Integration |
| 9 | Federation responses never persist: DB row count unchanged after a list round-trip | E2E |
| 10 | Scope cannot grant more than native RBAC: a user without access to team T still gets [] even if scope allows T | E2E |

Estimated budget: ~40K tokens (largest milestone: core federation logic + harness)

Risk notes: This is the critical trust boundary. Code review should focus on scope-enforcement bypass and cert-SAN-spoofing paths. Every 403/401 path needs a test.


M4 — search Verb + Audit Log + Rate Limit

Goal: Keyword search over allowed resources with full audit and per-grant rate limiting.

Scope:

  • search verb across resources allowlist (intersection of scope + native RBAC)
  • Keyword search (reuse existing packages/memory/src/adapters/keyword.ts); pgvector search stays out of v1 search verb
  • Every federated request (all verbs) writes to federation_audit_log: grant_id, verb, resource, query_hash, outcome, bytes_out, latency_ms
  • No request body captured; query_hash is SHA-256 of normalized query params
  • Token-bucket rate limit per grant (default 60/min, override per grant)
  • 429 response with Retry-After header and structured body
  • 90-day hot retention for audit log; cold-tier rollover deferred to M7
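The per-grant limiter described above is a standard token bucket; the class below is a minimal sketch under assumed names (the real rate-limit.guard.ts would wire this into the NestJS request lifecycle and look up the per-grant override):

```typescript
// Token-bucket sketch for the per-grant limit (default 60/min).
// The class shape and Retry-After computation are assumptions.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private capacity: number,      // burst size, e.g. 60
    private refillPerMs: number,   // e.g. 60 / 60_000 for 60 req/min
    nowMs: number,
  ) {
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  /** Returns null when the request is allowed, or a Retry-After value
   *  in seconds to put on the 429 response. */
  take(nowMs: number): number | null {
    const elapsed = nowMs - this.lastRefillMs;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return null;
    }
    // Seconds until one full token has accumulated
    return Math.ceil((1 - this.tokens) / this.refillPerMs / 1000);
  }
}
```

A per-grant override (acceptance test #7) then reduces to constructing the bucket with a different capacity/refill pair read from the grant row, not to any restart.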

Deliverables:

  • apps/gateway/src/federation/server/verbs/search.controller.ts
  • apps/gateway/src/federation/server/audit.service.ts (async write, no blocking)
  • apps/gateway/src/federation/server/rate-limit.guard.ts
  • Tests in harness

Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | search returns ranked hits only from allowed resources | E2E |
| 2 | search excluding credentials does not return a match even when the keyword matches a credential name | E2E |
| 3 | Every successful request appears in federation_audit_log within 1s | Integration |
| 4 | Denied request (403) is also audited with outcome='denied' | Integration |
| 5 | Audit row stores the query hash but NOT the query body | Unit |
| 6 | 61st request in a 60s window returns 429 with Retry-After | E2E |
| 7 | Per-grant override (e.g., 600/min) takes effect without restart | Integration |
| 8 | Audit writes are async: request latency unchanged when the audit write is slow (simulated) | Integration |

Estimated budget: ~20K tokens

Risk notes: Ensure audit writes can't block or error out the request path; use a bounded queue and a drop-with-counter pattern rather than in-line writes.
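The bounded-queue, drop-with-counter pattern the risk note recommends can be sketched in a few lines; the class name and the idea of exporting `dropped` as a metric are assumptions:

```typescript
// Sketch of a bounded, non-blocking audit queue. On overflow the entry
// is dropped and counted rather than blocking the request path.
class AuditQueue<T> {
  private buf: T[] = [];
  public dropped = 0; // would be exported as a counter metric

  constructor(private maxDepth: number) {}

  /** Called on the request path; never blocks, never throws. */
  enqueue(entry: T): boolean {
    if (this.buf.length >= this.maxDepth) {
      this.dropped++;
      return false;
    }
    this.buf.push(entry);
    return true;
  }

  /** Drained by a background writer, off the request path. */
  drain(max: number): T[] {
    return this.buf.splice(0, max);
  }
}
```

The trade-off is explicit: under sustained audit-store slowness some rows are lost, but the loss is observable via the counter, and the request path stays at constant latency (acceptance test #8).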


M5 — Cache + Offline Degradation + Observability

Goal: Sessions feel fast and stay useful when the peer is slow or down.

Scope:

  • In-memory response cache keyed by (grant_id, verb, resource, query_hash), TTL 30s default
  • Cache NOT used for search; only list and get
  • Cache flushed on cert rotation and grant revocation
  • Circuit breaker per peer: after N failures, fast-fail for cooldown window
  • _source tagging extended with _cached: true when served from cache
  • Agent-visible "federation offline for <peer>" signal emitted once per session per peer
  • OTEL spans: federation.request with attrs grant_id, peer, verb, resource, outcome, latency_ms, cached
  • W3C traceparent propagated across the mTLS boundary (both directions)
  • mosaic federation status CLI subcommand
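The cache rules above (tuple key, 30s TTL, never for search, flushable per grant) can be sketched as follows; the class shape is an assumption, not the real response-cache.service.ts:

```typescript
// Sketch of the client response cache: keyed by
// (grant_id, verb, resource, query_hash), 30s TTL, never used for search.
class ResponseCache {
  private entries = new Map<string, { value: unknown; expiresAtMs: number }>();

  constructor(private ttlMs = 30_000) {}

  private key(grantId: string, verb: string, resource: string, queryHash: string) {
    // NUL separator avoids ambiguity between concatenated key parts
    return `${grantId}\u0000${verb}\u0000${resource}\u0000${queryHash}`;
  }

  get(grantId: string, verb: string, resource: string, queryHash: string, nowMs: number): unknown {
    if (verb === "search") return undefined; // search is never cached
    const e = this.entries.get(this.key(grantId, verb, resource, queryHash));
    return e && e.expiresAtMs > nowMs ? e.value : undefined;
  }

  put(grantId: string, verb: string, resource: string, queryHash: string, value: unknown, nowMs: number): void {
    if (verb === "search") return;
    this.entries.set(this.key(grantId, verb, resource, queryHash), {
      value,
      expiresAtMs: nowMs + this.ttlMs,
    });
  }

  /** Flush everything for a grant on revocation or cert rotation. */
  flushGrant(grantId: string): void {
    for (const k of this.entries.keys()) {
      if (k.startsWith(grantId + "\u0000")) this.entries.delete(k);
    }
  }
}
```

Keying the flush on grant_id is what makes acceptance test #3 (revocation flushes the cache immediately) a one-call operation.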

Deliverables:

  • apps/gateway/src/federation/client/response-cache.service.ts
  • apps/gateway/src/federation/client/circuit-breaker.service.ts
  • apps/gateway/src/federation/observability/ (span helpers)
  • packages/mosaic/src/commands/federation/status.ts
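The circuit-breaker behavior in scope (open after N failures, fast-fail during cooldown, close on the next success) can be sketched like this; threshold and cooldown defaults are assumptions:

```typescript
// Per-peer circuit breaker sketch. Defaults are illustrative.
class CircuitBreaker {
  private failures = 0;
  private openedAtMs: number | null = null;

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  /** True when a request may go out; false means fast-fail with no I/O. */
  allowRequest(nowMs: number): boolean {
    if (this.openedAtMs === null) return true;
    // After the cooldown, allow a half-open probe request through
    return nowMs - this.openedAtMs >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAtMs = null; // close the circuit
  }

  recordFailure(nowMs: number): void {
    this.failures++;
    if (this.failures >= this.threshold) this.openedAtMs = nowMs;
  }
}
```

Fast-failing while open is what keeps a down peer from adding its full connect timeout to every session turn (acceptance test #4).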

Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | Two identical list calls within 30s: second served from cache, flagged _cached | Integration |
| 2 | search is never cached: two identical searches both hit the peer | Integration |
| 3 | After grant revocation, the peer's cache is flushed immediately | Integration |
| 4 | After N consecutive failures, circuit opens; subsequent requests fail fast without a network call | E2E |
| 5 | Circuit closes after cooldown and the next success | E2E |
| 6 | With the peer offline, session completes using local data; one "federation offline" signal surfaced | E2E |
| 7 | OTEL traces show spans on both gateways correlated by traceparent | E2E |
| 8 | mosaic federation status prints peer state, cert expiry, last success/failure, circuit state | Unit |

Estimated budget: ~20K tokens

Risk notes: Caching correctness under revocation must be provable; write tests that intentionally race revocation against cached hits.


M6 — Revocation, Auto-Renewal, CRL

Goal: Grant lifecycle works end-to-end: admin revoke, revoke-on-delete, automatic cert renewal, CRL distribution.

Scope:

  • mosaic federation grant revoke <id> → status revoked, CRL updated, audit entry
  • DB hook: deleting a user cascades revoke-on-delete on all grants where that user is subject
  • Step-CA CRL endpoint exposed; serving gateway enforces CRL check on every handshake (cached CRL, refresh interval 60s)
  • Client-side cert renewal job: at T-7 days, submit renewal CSR; rotate cert atomically; flush cache
  • On renewal failure, peer marked degraded and admin-visible alert emitted
  • Server A detects revocation on next request (TLS handshake fails with specific error) → peer marked revoked, user notified
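The atomic rotation the renewal job needs can be sketched as a single-reference swap: each outbound request captures the active bundle once, so it either completes on the old cert or retries on the new one. All names below are assumptions.

```typescript
// Sketch of an atomic cert swap. Shapes and names are illustrative.
interface CertBundle {
  serial: string;
  certPem: string;
  keyPem: string;
}

class CertStore {
  constructor(private active: CertBundle) {}

  /** Each outbound request captures one immutable bundle up front and
   *  uses it for the whole call, regardless of concurrent rotation. */
  acquire(): CertBundle {
    return this.active;
  }

  /** Renewal installs the new bundle in a single reference assignment;
   *  no request ever observes a half-swapped state. */
  rotate(next: CertBundle): CertBundle {
    const old = this.active;
    this.active = next;
    return old; // caller can flush caches keyed to the old serial
  }
}
```

Because the swap is one reference assignment and bundles are never mutated in place, the "never fail mid-call" property in the risk note falls out of the design rather than needing locking around the TLS layer.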

Deliverables:

  • apps/gateway/src/federation/server/crl.service.ts + endpoint
  • apps/gateway/src/federation/server/revocation.service.ts
  • DB cascade trigger or ORM hook for user deletion → grant revocation
  • apps/gateway/src/federation/client/renewal.job.ts (scheduled)
  • packages/mosaic/src/commands/federation/grant.ts gains revoke subcommand

Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | Admin grant revoke → A's next request fails with a TLS-level error | E2E |
| 2 | Deleting the subject user on B auto-revokes all grants where that user was the subject | Integration |
| 3 | CRL endpoint serves the correct list; revoked cert present | Integration |
| 4 | Server rejects a cert listed in the CRL even if the cert itself is still time-valid | E2E |
| 5 | Cert at T-7 days triggers the renewal job; new cert issued and installed without dropped requests | E2E |
| 6 | Renewal failure marks the peer degraded and surfaces an alert | Integration |
| 7 | A marks the peer revoked after a revocation-caused handshake failure (not on transient network errors) | E2E |

Estimated budget: ~20K tokens

Risk notes: The atomic cert swap during renewal is the sharpest edge here: any in-flight request mid-swap must either complete on the old cert or retry on the new one, never fail mid-call.


M7 — Multi-User RBAC Hardening + Team-Scoped Grants + Acceptance Suite

Goal: The full multi-tenant scenario from §4 user stories works end-to-end, with no cross-user leakage under any circumstance.

Scope:

  • Three-user scenario on Server B (E1, E2, E3) each with their own Server A
  • Team-scoped grants exercised: each employee's team-data visible on their own A, but E1's personal data never visible on E2's A
  • User-facing UI surfaces on both gateways for: peer list, grant list, audit log viewer, scope editor
  • Negative-path test matrix (every denial path from PRD §8)
  • All PRD §15 acceptance criteria mapped to automated tests in the harness
  • Security review: cert-spoofing, scope-bypass, audit-bypass paths explicitly tested
  • Cold-storage rollover for audit log >90 days
  • Docs: operator runbook, onboarding guide, troubleshooting guide

Deliverables:

  • Full federation acceptance suite in tools/federation-harness/acceptance/
  • apps/web surfaces for peer/grant/audit management
  • docs/federation/RUNBOOK.md, docs/federation/ONBOARDING.md, docs/federation/TROUBLESHOOTING.md
  • Audit cold-tier job (daily cron, moves rows >90d to separate table or object storage)

Acceptance tests: Every PRD §15 criterion must be automated and green. Additionally:

| # | Test | Layer |
|---|---|---|
| 1 | 3-employee scenario: each A sees only its user's data from B | E2E |
| 2 | Grant with team scope returns team data; same grant is denied access to another employee's personal data | E2E |
| 3 | Concurrent sessions from E1's and E2's Server A to B interleave without any leakage | E2E |
| 4 | Audit log across the 3-user test shows per-grant trails with no mis-attributed rows | E2E |
| 5 | Scope editor UI round-trip: edit → save → next request uses the new scope | E2E |
| 6 | Attempt to use a revoked grant's cert against a different grant's endpoint: rejected | E2E |
| 7 | 90-day-old audit rows moved to cold tier; queryable via explicit historical query | Integration |
| 8 | Runbook steps validated: an operator following the runbook can onboard, rotate, and revoke | Manual checklist |

Estimated budget: ~25K tokens

Risk notes: This is the security-critical milestone. Budget review time here is non-negotiable; plan for two independent code reviews (internal + security-focused) before merge.


Total Budget & Timeline Sketch

| Milestone | Tokens (est.) | Can parallelize? |
|---|---|---|
| M1 | 20K | No (foundation) |
| M2 | 30K | No (needs M1) |
| M3 | 40K | No (needs M2) |
| M4 | 20K | No (needs M3) |
| M5 | 20K | Yes (with M6, after M4) |
| M6 | 20K | Yes (with M5, after M3) |
| M7 | 25K | No (needs all) |
| Total | ~175K | |

Parallelization of M5 and M6 after M4 saves one milestone's worth of serial time.


Exit Criteria (federation feature complete)

All of the following must be green on main:

  • Every PRD §15 acceptance criterion automated and passing
  • Every milestone's acceptance table green
  • Security review sign-off on M7
  • Runbook walk-through completed by operator (not author)
  • mosaic doctor recognizes federated tier and reports peer health accurately
  • Two-gateway production deployment (woltje.com ↔ uscllc.com) operational for ≥7 days without incident

Next Step After This Doc Is Approved

  1. File tracking issues on git.mosaicstack.dev/mosaicstack/stack — one per milestone, labeled epic:federation
  2. Populate docs/TASKS.md with M1's task breakdown (per-task agent assignment, budget, dependencies)
  3. Begin M1 implementation