# Mosaic Stack — Federation Implementation Milestones

Companion to: `PRD.md`

Approach: Each milestone is a verifiable slice. A milestone is "done" only when its acceptance tests pass in CI against a real (not mocked) dependency stack.
## Milestone Dependency Graph

```
M1 (federated tier infra)
└── M2 (Step-CA + grant schema + CLI)
    └── M3 (mTLS handshake + list/get + scope enforcement)
        ├── M4 (search + audit + rate limit)
        │   └── M5 (cache + offline degradation + OTEL)
        ├── M6 (revocation + auto-renewal) ◄── can start after M3
        └── M7 (multi-user hardening + e2e suite) ◄── depends on M4+M5+M6
```

M5 and M6 can run in parallel once M4 is merged.
## Test Strategy (applies to all milestones)

Three layers, all required before a milestone ships:
| Layer | Scope | Runtime |
|---|---|---|
| Unit | Per-module logic, pure functions, adapters | Vitest, no I/O |
| Integration | Single gateway against real PG/Valkey/Step-CA | Vitest + Docker Compose test profile |
| Federation E2E | Two gateways on a Docker network, real mTLS | Playwright/custom harness (tools/federation-harness/) introduced in M3 |
Every milestone adds tests to these layers. A milestone cannot be claimed complete if the federation E2E harness fails (applies from M3 onward).
Quality gates per milestone (same as stack-wide):

- `pnpm typecheck` green
- `pnpm lint` green
- `pnpm test` green (unit + integration)
- `pnpm test:federation` green (M3+)
- Independent code review passed
- Docs updated (`docs/federation/`)
- Merged PR on `main`, CI terminal green, linked issue closed
## M1 — Federated Tier Infrastructure

Goal: A gateway can run in federated tier with containerized Postgres + Valkey + pgvector, with no federation logic active yet.
Scope:

- Add `"tier": "federated"` to `mosaic.config.json` schema and validators
- Docker Compose `federated` profile (`docker-compose.federated.yml`) adds: Postgres+pgvector (5433), Valkey (6380), dedicated volumes
- Tier detector in gateway bootstrap: reads config, asserts required services reachable, refuses to start otherwise
- `pgvector` extension installed + verified on startup
- Migration logic: safe upgrade path from `local`/`standalone` → `federated` (data export/import script, one-way)
- `mosaic doctor` reports tier + service health
- Gateway continues to serve as a normal standalone instance (no federation yet)
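The tier detector's fail-fast behavior can be sketched as below. This is a config-presence check only (actual reachability probes are elided), and every name here — `MosaicConfig`, `validateFederatedTier`, the service-URL fields — is an illustrative assumption, not the shipped `tier-detector.ts`:

```typescript
// Hypothetical config shape; the real mosaic.config.json schema may differ.
type Tier = "local" | "standalone" | "federated";

interface MosaicConfig {
  tier: Tier;
  services?: { postgresUrl?: string; valkeyUrl?: string };
}

// Returns the list of missing requirements; an empty list means boot may
// proceed (reachability probes would run next, not shown here).
function validateFederatedTier(config: MosaicConfig): string[] {
  const missing: string[] = [];
  if (config.tier !== "federated") return missing; // nothing to assert
  if (!config.services?.postgresUrl) missing.push("services.postgresUrl");
  if (!config.services?.valkeyUrl) missing.push("services.valkeyUrl");
  return missing;
}
```

On a non-empty result, bootstrap logs the missing entries and exits non-zero, which is what acceptance test 2 exercises.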
Deliverables:

- `mosaic.config.json` schema v2 (tier enum includes `federated`)
- `apps/gateway/src/bootstrap/tier-detector.ts`
- `docker-compose.federated.yml`
- `scripts/migrate-to-federated.ts`
- Updated `mosaic doctor` output
- Updated `packages/storage/src/adapters/postgres.ts` with pgvector support
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | Gateway boots in federated tier with all services present | Integration |
| 2 | Gateway refuses to boot in federated tier when Postgres unreachable (fail-fast, clear error) | Integration |
| 3 | pgvector extension available in target DB (`SELECT * FROM pg_extension WHERE extname='vector'`) | Integration |
| 4 | Migration script moves a populated local (PGlite) instance to federated (Postgres) with no data loss | Integration |
| 5 | `mosaic doctor` reports correct tier and all services green | Unit |
| 6 | Existing standalone behavior regression: agent session works end-to-end, no federation references | E2E (single-gateway) |
Estimated budget: ~20K tokens (infra + config + migration script)

Risk notes: pgvector install on existing PG installs is occasionally finicky; test the migration path on a realistic DB snapshot.
## M2 — Step-CA + Grant Schema + Admin CLI

Goal: An admin can create a federation grant and its counterparty can enroll. No runtime traffic flows yet.
Scope:

- Embed Step-CA as a Docker Compose sidecar with a persistent CA volume
- Gateway exposes a short-lived enrollment endpoint (single-use token from the grant)
- DB schema: `federation_grants`, `federation_peers`, `federation_audit_log` (table only, not yet written to)
- Sealed storage for `client_key_pem` using the existing credential sealing key
- Admin CLI:
  - `mosaic federation grant create --user <id> --peer <host> --scope <file>`
  - `mosaic federation grant list`
  - `mosaic federation grant show <id>`
  - `mosaic federation peer add <enrollment-url>`
  - `mosaic federation peer list`
- Step-CA signs the cert with SAN OIDs for `grantId` + `subjectUserId`
- Grant status transitions: `pending` → `active` on successful enrollment
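The status transitions above (plus the `revoked` terminal state that M6 adds) form a small state machine. A minimal sketch, with names that are assumptions rather than the shipped `grants.service.ts` API:

```typescript
// Illustrative grant lifecycle; the real enum types live in the packages/db
// migration and may be shaped differently.
type GrantStatus = "pending" | "active" | "revoked";

const ALLOWED: Record<GrantStatus, GrantStatus[]> = {
  pending: ["active", "revoked"], // enrollment succeeds, or admin cancels early
  active: ["revoked"],            // M6: admin revoke / revoke-on-delete
  revoked: [],                    // terminal state
};

function canTransition(from: GrantStatus, to: GrantStatus): boolean {
  return ALLOWED[from].includes(to);
}

function transition(from: GrantStatus, to: GrantStatus): GrantStatus {
  if (!canTransition(from, to)) {
    throw new Error(`illegal grant transition: ${from} -> ${to}`);
  }
  return to;
}
```

Making `revoked` strictly terminal keeps the CRL story in M6 simple: a revoked grant can never silently re-activate.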
Deliverables:

- `packages/db` migration: three federation tables + enum types
- `apps/gateway/src/federation/ca.service.ts` (Step-CA client)
- `apps/gateway/src/federation/grants.service.ts`
- `apps/gateway/src/federation/enrollment.controller.ts`
- `packages/mosaic/src/commands/federation/` (grant + peer subcommands)
- `docker-compose.federated.yml` adds Step-CA service
- Scope JSON schema + validator
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | `grant create` writes a pending row with a scoped bundle | Integration |
| 2 | Enrollment endpoint signs a CSR and returns a cert with expected SAN OIDs | Integration |
| 3 | Enrollment token is single-use; second attempt returns 410 | Integration |
| 4 | Cert `subjectUserId` OID matches the grant's `subject_user_id` | Unit |
| 5 | `client_key_pem` is at-rest encrypted; raw DB read shows ciphertext, not PEM | Integration |
| 6 | `peer add <url>` on Server A yields an active peer record with a valid cert + key | E2E (two gateways, no traffic) |
| 7 | Scope JSON with unknown resource type rejected at `grant create` | Unit |
| 8 | `grant list` and `peer list` render active / pending / revoked accurately | Unit |
Estimated budget: ~30K tokens (schema + CA integration + CLI + sealing)

Risk notes: Step-CA's API surface is well-documented, but the sealing integration with existing provider-credential encryption is a cross-module concern — walk that seam deliberately.
## M3 — mTLS Handshake + `list` + `get` with Scope Enforcement

Goal: Two federated gateways exchange real data over mTLS with scope intersecting native RBAC.
Scope:

- `FederationClient` (outbound): picks cert from `federation_peers`, does mTLS call
- `FederationServer` (inbound): NestJS guard validates client cert, extracts `grantId` + `subjectUserId`, loads grant
- Scope enforcement pipeline:
  - Resource allowlist / excluded-list check
  - Native RBAC evaluation as the `subjectUserId`
  - Scope filter intersection (`include_teams`, `include_personal`)
  - `max_rows_per_query` cap
- Verbs: `list`, `get`, `capabilities`
- Gateway query layer accepts `source: "local" | "federated:<host>" | "all"`; fan-out for `"all"`
- Federation E2E harness (`tools/federation-harness/`): `docker-compose.two-gateways.yml`, seed script, assertion helpers — this is its own deliverable
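The key invariant in the pipeline above is ordering: native RBAC is evaluated first, and the scope filter only ever narrows that set, so scope can never widen access (acceptance test 10). A minimal sketch of the intersection step — the `Task` and `Scope` shapes are illustrative, not the real DTOs in `federation.dto.ts`:

```typescript
// Illustrative shapes only; real resources carry more fields.
interface Task { id: string; teamId: string | null; ownerId: string }
interface Scope {
  include_teams: string[];
  include_personal: boolean;
  max_rows_per_query: number;
}

// `rbacVisible` is assumed to already be the RBAC-filtered result for
// subjectUserId, so this intersection can only remove rows, never add them.
function applyScope(rbacVisible: Task[], scope: Scope, subjectUserId: string): Task[] {
  return rbacVisible
    .filter((t) =>
      t.teamId !== null
        ? scope.include_teams.includes(t.teamId)          // team resources
        : scope.include_personal && t.ownerId === subjectUserId // personal
    )
    .slice(0, scope.max_rows_per_query); // row cap; real pagination elided
}
```

A user with no RBAC access to team T gets an empty `rbacVisible` for T before this function runs, so a scope that lists T still yields `[]`.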
Deliverables:

- `apps/gateway/src/federation/client/federation-client.service.ts`
- `apps/gateway/src/federation/server/federation-auth.guard.ts`
- `apps/gateway/src/federation/server/scope.service.ts`
- `apps/gateway/src/federation/server/verbs/{list,get,capabilities}.controller.ts`
- `apps/gateway/src/federation/client/query-source.service.ts` (fan-out/merge)
- `tools/federation-harness/` (compose + seed + test helpers)
- `packages/types` — federation request/response DTOs in `federation.dto.ts`
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | A→B `list` tasks returns subjectUser's tasks intersected with scope | E2E |
| 2 | A→B `list` tasks with `include_teams: [T1]` excludes T2 tasks the user owns | E2E |
| 3 | A→B `get` credential `<id>` returns 403 when `credentials` is in `excluded_resources` | E2E |
| 4 | Client presenting cert for grant X cannot query subjectUser of grant Y (cross-user isolation) | E2E |
| 5 | Cert signed by untrusted CA rejected at TLS layer (no NestJS handler reached) | E2E |
| 6 | Malformed SAN OIDs → 401; cert valid but grant revoked in DB → 403 | Integration |
| 7 | `max_rows_per_query` caps response; request for more is paginated | Integration |
| 8 | `source: "all"` fan-out merges local + federated results, each tagged with `_source` | Integration |
| 9 | Federation responses never persist: verify DB row count unchanged after `list` round-trip | E2E |
| 10 | Scope cannot grant more than native RBAC: user without access to team T still gets `[]` even if scope allows T | E2E |
Estimated budget: ~40K tokens (largest milestone — core federation logic + harness)

Risk notes: This is the critical trust boundary. Code review should focus on scope-enforcement bypass and cert-SAN-spoofing paths. Every 403/401 path needs a test.
## M4 — `search` Verb + Audit Log + Rate Limit

Goal: Keyword search over allowed resources with full audit and per-grant rate limiting.
Scope:

- `search` verb across `resources` allowlist (intersection of scope + native RBAC)
- Keyword search (reuse existing `packages/memory/src/adapters/keyword.ts`); pgvector search stays out of v1 search verb
- Every federated request (all verbs) writes to `federation_audit_log`: `grant_id`, `verb`, `resource`, `query_hash`, `outcome`, `bytes_out`, `latency_ms`
- No request body captured; `query_hash` is SHA-256 of normalized query params
- Token-bucket rate limit per grant (default 60/min, override per grant)
- 429 response with `Retry-After` header and structured body
- 90-day hot retention for audit log; cold-tier rollover deferred to M7
Deliverables:

- `apps/gateway/src/federation/server/verbs/search.controller.ts`
- `apps/gateway/src/federation/server/audit.service.ts` (async write, no blocking)
- `apps/gateway/src/federation/server/rate-limit.guard.ts`
- Tests in harness
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | `search` returns ranked hits only from allowed resources | E2E |
| 2 | `search` excluding `credentials` does not return a match even when keyword matches a credential name | E2E |
| 3 | Every successful request appears in `federation_audit_log` within 1s | Integration |
| 4 | Denied request (403) is also audited with `outcome='denied'` | Integration |
| 5 | Audit row stores query hash but NOT query body | Unit |
| 6 | 61st request in 60s window returns 429 with `Retry-After` | E2E |
| 7 | Per-grant override (e.g., 600/min) takes effect without restart | Integration |
| 8 | Audit writes are async: request latency unchanged when audit write slow (simulated) | Integration |
Estimated budget: ~20K tokens

Risk notes: Ensure audit writes can't block or error-out the request path; use a bounded queue and drop-with-counter pattern rather than in-line writes.
## M5 — Cache + Offline Degradation + Observability

Goal: Sessions feel fast and stay useful when the peer is slow or down.
Scope:

- In-memory response cache keyed by `(grant_id, verb, resource, query_hash)`, TTL 30s default
- Cache NOT used for `search`; only `list` and `get`
- Cache flushed on cert rotation and grant revocation
- Circuit breaker per peer: after N failures, fast-fail for cooldown window
- `_source` tagging extended with `_cached: true` when served from cache
- Agent-visible "federation offline for `<peer>`" signal emitted once per session per peer
- OTEL spans: `federation.request` with attrs `grant_id`, `peer`, `verb`, `resource`, `outcome`, `latency_ms`, `cached`
- W3C `traceparent` propagated across the mTLS boundary (both directions)
- `mosaic federation status` CLI subcommand
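The per-peer circuit breaker behavior above (N failures open the circuit, requests fail fast until the cooldown elapses) can be sketched like this. The class shape and thresholds are assumptions, not the real `circuit-breaker.service.ts`:

```typescript
// One instance per peer host.
class PeerCircuit {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold: number, private cooldownMs: number) {}

  // Call before dialing the peer: false means fail fast, no network call.
  allows(now: number): boolean {
    return this.failures < this.threshold || now - this.openedAt >= this.cooldownMs;
  }

  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now; // (re)open
  }

  recordSuccess(): void {
    this.failures = 0; // first success after cooldown closes the circuit
  }
}
```

Note the half-open probe falls out naturally: once the cooldown elapses, `allows` returns true for one real attempt, and that attempt's success or failure either closes or re-opens the circuit.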
Deliverables:

- `apps/gateway/src/federation/client/response-cache.service.ts`
- `apps/gateway/src/federation/client/circuit-breaker.service.ts`
- `apps/gateway/src/federation/observability/` (span helpers)
- `packages/mosaic/src/commands/federation/status.ts`
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | Two identical `list` calls within 30s: second served from cache, flagged `_cached` | Integration |
| 2 | `search` is never cached: two identical searches both hit the peer | Integration |
| 3 | After grant revocation, peer's cache is flushed immediately | Integration |
| 4 | After N consecutive failures, circuit opens; subsequent requests fail-fast without network call | E2E |
| 5 | Circuit closes after cooldown and next success | E2E |
| 6 | With peer offline, session completes using local data, one "federation offline" signal surfaced | E2E |
| 7 | OTEL traces show spans on both gateways correlated by `traceparent` | E2E |
| 8 | `mosaic federation status` prints peer state, cert expiry, last success/failure, circuit state | Unit |
Estimated budget: ~20K tokens

Risk notes: Caching correctness under revocation must be provable — write tests that intentionally race revocation against cached hits.
## M6 — Revocation, Auto-Renewal, CRL

Goal: Grant lifecycle works end-to-end: admin revoke, revoke-on-delete, automatic cert renewal, CRL distribution.
Scope:

- `mosaic federation grant revoke <id>` → status `revoked`, CRL updated, audit entry
- DB hook: deleting a user cascades revoke-on-delete on all grants where that user is subject
- Step-CA CRL endpoint exposed; serving gateway enforces CRL check on every handshake (cached CRL, refresh interval 60s)
- Client-side cert renewal job: at T-7 days, submit renewal CSR; rotate cert atomically; flush cache
- On renewal failure, peer marked `degraded` and admin-visible alert emitted
- Server A detects revocation on next request (TLS handshake fails with specific error) → peer marked `revoked`, user notified
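The T-7-day trigger in the renewal job above reduces to a single time comparison on each scheduled tick. A sketch under assumed names (the real logic lives in `renewal.job.ts`):

```typescript
// Renewal window: start renewing once the cert enters its final 7 days.
const RENEWAL_WINDOW_MS = 7 * 24 * 60 * 60 * 1000;

// True when the renewal CSR should be submitted on this job tick; also true
// for an already-expired cert, so a gateway that was down past expiry still
// attempts renewal on restart.
function needsRenewal(certNotAfterMs: number, nowMs: number): boolean {
  return certNotAfterMs - nowMs <= RENEWAL_WINDOW_MS;
}
```

The atomic swap itself is best done by building a complete new TLS context from the renewed cert and replacing the reference in one assignment, so in-flight requests keep the old context and new requests pick up the new one.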
Deliverables:

- `apps/gateway/src/federation/server/crl.service.ts` + endpoint
- `apps/gateway/src/federation/server/revocation.service.ts`
- DB cascade trigger or ORM hook for user deletion → grant revocation
- `apps/gateway/src/federation/client/renewal.job.ts` (scheduled)
- `packages/mosaic/src/commands/federation/grant.ts` gains `revoke` subcommand
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | Admin `grant revoke` → A's next request fails with TLS-level error | E2E |
| 2 | Deleting subject user on B auto-revokes all grants where that user was the subject | Integration |
| 3 | CRL endpoint serves correct list; revoked cert present | Integration |
| 4 | Server rejects cert listed in CRL even if cert itself is still time-valid | E2E |
| 5 | Cert at T-7 days triggers renewal job; new cert issued and installed without dropped requests | E2E |
| 6 | Renewal failure marks peer `degraded` and surfaces alert | Integration |
| 7 | A marks peer `revoked` after a revocation-caused handshake failure (not on transient network errors) | E2E |
Estimated budget: ~20K tokens

Risk notes: The atomic cert swap during renewal is the sharpest edge here — any in-flight request mid-swap must either complete on old or retry on new, never fail mid-call.
## M7 — Multi-User RBAC Hardening + Team-Scoped Grants + Acceptance Suite

Goal: The full multi-tenant scenario from §4 user stories works end-to-end, with no cross-user leakage under any circumstance.
Scope:
- Three-user scenario on Server B (E1, E2, E3) each with their own Server A
- Team-scoped grants exercised: each employee's team-data visible on their own A, but E1's personal data never visible on E2's A
- User-facing UI surfaces on both gateways for: peer list, grant list, audit log viewer, scope editor
- Negative-path test matrix (every denial path from PRD §8)
- All PRD §15 acceptance criteria mapped to automated tests in the harness
- Security review: cert-spoofing, scope-bypass, audit-bypass paths explicitly tested
- Cold-storage rollover for audit log >90 days
- Docs: operator runbook, onboarding guide, troubleshooting guide
Deliverables:

- Full federation acceptance suite in `tools/federation-harness/acceptance/`
- `apps/web` surfaces for peer/grant/audit management
- `docs/federation/RUNBOOK.md`, `docs/federation/ONBOARDING.md`, `docs/federation/TROUBLESHOOTING.md`
- Audit cold-tier job (daily cron, moves rows >90d to separate table or object storage)
Acceptance tests: Every PRD §15 criterion must be automated and green. Additionally:
| # | Test | Layer |
|---|---|---|
| 1 | 3-employee scenario: each A sees only its user's data from B | E2E |
| 2 | Grant with team scope returns team data; same grant denied access to another employee's personal data | E2E |
| 3 | Concurrent sessions from E1's and E2's Server A to B interleave without any leakage | E2E |
| 4 | Audit log across 3-user test shows per-grant trails with no mis-attributed rows | E2E |
| 5 | Scope editor UI round-trip: edit → save → next request uses new scope | E2E |
| 6 | Attempt to use a revoked grant's cert against a different grant's endpoint: rejected | E2E |
| 7 | 90-day-old audit rows moved to cold tier; queryable via explicit historical query | Integration |
| 8 | Runbook steps validated: an operator following the runbook can onboard, rotate, and revoke | Manual checklist |
Estimated budget: ~25K tokens

Risk notes: This is the security-critical milestone. Budget review time here is non-negotiable — plan for two independent code reviews (internal + security-focused) before merge.
## Total Budget & Timeline Sketch
| Milestone | Tokens (est.) | Can parallelize? |
|---|---|---|
| M1 | 20K | No (foundation) |
| M2 | 30K | No (needs M1) |
| M3 | 40K | No (needs M2) |
| M4 | 20K | No (needs M3) |
| M5 | 20K | Yes (with M6 after M4) |
| M6 | 20K | Yes (with M5 after M3) |
| M7 | 25K | No (needs all) |
| Total | ~175K | |
Parallelization of M5 and M6 after M4 saves one milestone's worth of serial time.
## Exit Criteria (federation feature complete)

All of the following must be green on `main`:
- Every PRD §15 acceptance criterion automated and passing
- Every milestone's acceptance table green
- Security review sign-off on M7
- Runbook walk-through completed by operator (not author)
- `mosaic doctor` recognizes federated tier and reports peer health accurately
- Two-gateway production deployment (woltje.com ↔ uscllc.com) operational for ≥7 days without incident
## Next Step After This Doc Is Approved
- File tracking issues on `git.mosaicstack.dev/mosaicstack/stack` — one per milestone, labeled `epic:federation`
- Populate `docs/TASKS.md` with M1's task breakdown (per-task agent assignment, budget, dependencies)
- Begin M1 implementation