# Mosaic Stack — Federation Implementation Milestones

**Companion to:** `PRD.md`

**Approach:** Each milestone is a verifiable slice. A milestone is "done" only when its acceptance tests pass in CI against a real (not mocked) dependency stack.

---

## Milestone Dependency Graph

```
M1 (federated tier infra)
└── M2 (Step-CA + grant schema + CLI)
    └── M3 (mTLS handshake + list/get + scope enforcement)
        ├── M4 (search + audit + rate limit)
        │   └── M5 (cache + offline degradation + OTEL)
        ├── M6 (revocation + auto-renewal) ◄── can start after M3
        └── M7 (multi-user hardening + e2e suite) ◄── depends on M4+M5+M6
```

M5 and M6 can run in parallel once M4 is merged.

---
## Test Strategy (applies to all milestones)

Three layers, all required before a milestone ships:

| Layer              | Scope                                         | Runtime                                                                  |
| ------------------ | --------------------------------------------- | ------------------------------------------------------------------------ |
| **Unit**           | Per-module logic, pure functions, adapters    | Vitest, no I/O                                                           |
| **Integration**    | Single gateway against real PG/Valkey/Step-CA | Vitest + Docker Compose test profile                                     |
| **Federation E2E** | Two gateways on a Docker network, real mTLS   | Playwright/custom harness (`tools/federation-harness/`) introduced in M3 |

Every milestone adds tests to these layers. A milestone cannot be claimed complete if the federation E2E harness fails (applies from M3 onward).

**Quality gates per milestone** (same as stack-wide):

- `pnpm typecheck` green
- `pnpm lint` green
- `pnpm test` green (unit + integration)
- `pnpm test:federation` green (M3+)
- Independent code review passed
- Docs updated (`docs/federation/`)
- PR merged on `main`, CI fully green, linked issue closed

---
## M1 — Federated Tier Infrastructure

**Goal:** A gateway can run in `federated` tier with containerized Postgres + Valkey + pgvector, with no federation logic active yet.

**Scope:**

- Add `"tier": "federated"` to the `mosaic.config.json` schema and validators
- Docker Compose `federated` profile (`docker-compose.federated.yml`) adds: Postgres + pgvector (5433), Valkey (6380), dedicated volumes
- Tier detector in gateway bootstrap: reads config, asserts required services are reachable, refuses to start otherwise
- `pgvector` extension installed and verified on startup
- Migration logic: safe upgrade path from `local`/`standalone` → `federated` (data export/import script, one-way)
- `mosaic doctor` reports tier + service health
- Gateway continues to serve as a normal standalone instance (no federation yet)
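The fail-fast behavior the tier detector needs can be sketched as below. This is a hedged sketch: `TierConfig`, `Probe`, and `assertTierReady` are illustrative names, not the actual `tier-detector.ts` API, and real health probes would be async (kept synchronous here for clarity).

```typescript
type Tier = "local" | "standalone" | "federated";

interface TierConfig {
  tier: Tier;
  postgresUrl?: string;
  valkeyUrl?: string;
}

// A probe returns true if the service answered a health check.
type Probe = (url: string) => boolean;

function assertTierReady(config: TierConfig, probePostgres: Probe, probeValkey: Probe): void {
  if (config.tier !== "federated") return; // lower tiers need no extra services

  const missing: string[] = [];
  if (!config.postgresUrl || !probePostgres(config.postgresUrl)) missing.push("postgres");
  if (!config.valkeyUrl || !probeValkey(config.valkeyUrl)) missing.push("valkey");

  if (missing.length > 0) {
    // Refuse to start rather than limp along in a half-configured tier.
    throw new Error(`federated tier requires reachable services: ${missing.join(", ")}`);
  }
}
```

The point of the shape is acceptance test 2 below: an unreachable service is a startup error with a clear message, never a silent downgrade.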
**Deliverables:**

- `mosaic.config.json` schema v2 (tier enum includes `federated`)
- `apps/gateway/src/bootstrap/tier-detector.ts`
- `docker-compose.federated.yml`
- `scripts/migrate-to-federated.ts`
- Updated `mosaic doctor` output
- Updated `packages/storage/src/adapters/postgres.ts` with pgvector support

**Acceptance tests:**

| # | Test | Layer |
| - | ---- | ----- |
| 1 | Gateway boots in `federated` tier with all services present | Integration |
| 2 | Gateway refuses to boot in `federated` tier when Postgres is unreachable (fail-fast, clear error) | Integration |
| 3 | `pgvector` extension available in target DB (`SELECT * FROM pg_extension WHERE extname='vector'`) | Integration |
| 4 | Migration script moves a populated `local` (PGlite) instance to `federated` (Postgres) with no data loss | Integration |
| 5 | `mosaic doctor` reports correct tier and all services green | Unit |
| 6 | Existing standalone behavior regression: agent session works end-to-end, no federation references | E2E (single-gateway) |

**Estimated budget:** ~20K tokens (infra + config + migration script)

**Risk notes:** Installing pgvector on an existing Postgres instance is occasionally finicky; test the migration path on a realistic DB snapshot.

---
## M2 — Step-CA + Grant Schema + Admin CLI

**Goal:** An admin can create a federation grant and its counterparty can enroll. No runtime traffic flows yet.

**Scope:**

- Embed Step-CA as a Docker Compose sidecar with a persistent CA volume
- Gateway exposes a short-lived enrollment endpoint (single-use token from the grant)
- DB schema: `federation_grants`, `federation_peers`, `federation_audit_log` (table only, not yet written to)
- Sealed storage for `client_key_pem` using the existing credential sealing key
- Admin CLI:
  - `mosaic federation grant create --user <id> --peer <host> --scope <file>`
  - `mosaic federation grant list`
  - `mosaic federation grant show <id>`
  - `mosaic federation peer add <enrollment-url>`
  - `mosaic federation peer list`
- Step-CA signs the cert with SAN OIDs for `grantId` + `subjectUserId`
- Grant status transitions: `pending` → `active` on successful enrollment
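For illustration, a scope bundle passed via `--scope <file>` might look like the fragment below. The field names are assumptions assembled from the scope concepts used elsewhere in this plan (`resources`, `excluded_resources`, `include_teams`, `include_personal`, `max_rows_per_query`), not the confirmed schema that M2's validator will define.

```json
{
  "resources": ["tasks", "notes"],
  "excluded_resources": ["credentials"],
  "include_teams": ["T1"],
  "include_personal": true,
  "max_rows_per_query": 200
}
```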
**Deliverables:**

- `packages/db` migration: three federation tables + enum types
- `apps/gateway/src/federation/ca.service.ts` (Step-CA client)
- `apps/gateway/src/federation/grants.service.ts`
- `apps/gateway/src/federation/enrollment.controller.ts`
- `packages/mosaic/src/commands/federation/` (grant + peer subcommands)
- `docker-compose.federated.yml` adds the Step-CA service
- Scope JSON schema + validator

**Acceptance tests:**

| # | Test | Layer |
| - | ---- | ----- |
| 1 | `grant create` writes a `pending` row with a scoped bundle | Integration |
| 2 | Enrollment endpoint signs a CSR and returns a cert with the expected SAN OIDs | Integration |
| 3 | Enrollment token is single-use; a second attempt returns 410 | Integration |
| 4 | Cert `subjectUserId` OID matches the grant's `subject_user_id` | Unit |
| 5 | `client_key_pem` is encrypted at rest; a raw DB read shows ciphertext, not PEM | Integration |
| 6 | `peer add <url>` on Server A yields an `active` peer record with a valid cert + key | E2E (two gateways, no traffic) |
| 7 | Scope JSON with an unknown resource type is rejected at `grant create` | Unit |
| 8 | `grant list` and `peer list` render active / pending / revoked states accurately | Unit |

**Estimated budget:** ~30K tokens (schema + CA integration + CLI + sealing)

**Risk notes:** Step-CA's API surface is well documented, but the sealing integration with the existing provider-credential encryption is a cross-module concern; walk that seam deliberately.

---
## M3 — mTLS Handshake + `list` + `get` with Scope Enforcement

**Goal:** Two federated gateways exchange real data over mTLS, with scope intersecting native RBAC.

**Scope:**

- `FederationClient` (outbound): picks the cert from `federation_peers`, makes the mTLS call
- `FederationServer` (inbound): NestJS guard validates the client cert, extracts `grantId` + `subjectUserId`, loads the grant
- Scope enforcement pipeline:
  1. Resource allowlist / excluded-list check
  2. Native RBAC evaluation as the `subjectUserId`
  3. Scope filter intersection (`include_teams`, `include_personal`)
  4. `max_rows_per_query` cap
- Verbs: `list`, `get`, `capabilities`
- Gateway query layer accepts `source: "local" | "federated:<host>" | "all"`; fan-out for `"all"`
- **Federation E2E harness** (`tools/federation-harness/`): `docker-compose.two-gateways.yml`, seed script, assertion helpers — this is its own deliverable
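The four-step pipeline above can be sketched as a pure function. The types and names here are assumptions, not the real `scope.service.ts`, but the sketch makes the key invariant explicit: the scope filter runs *after* native RBAC and can only narrow its result, never widen it (acceptance test 10).

```typescript
// Illustrative grant shape; field names are assumptions.
interface Grant {
  resources: string[];
  excludedResources: string[];
  includeTeams: string[];
  includePersonal: boolean;
  maxRowsPerQuery: number;
}

interface Row { id: string; teamId: string | null; ownerIsSubject: boolean }

// rbacList stands in for step 2: the native RBAC query already executed
// *as* the subject user, so it contains only rows that user may see.
function enforceScope(grant: Grant, resource: string, rbacList: Row[]): Row[] {
  // Step 1: resource allowlist / excluded-list check.
  if (!grant.resources.includes(resource) || grant.excludedResources.includes(resource)) {
    throw new Error("403: resource not in scope");
  }
  // Step 3: scope filter intersection over the RBAC result.
  const filtered = rbacList.filter((row) =>
    row.teamId !== null
      ? grant.includeTeams.includes(row.teamId)
      : grant.includePersonal && row.ownerIsSubject,
  );
  // Step 4: row cap (the real verb would paginate beyond this).
  return filtered.slice(0, grant.maxRowsPerQuery);
}
```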
**Deliverables:**

- `apps/gateway/src/federation/client/federation-client.service.ts`
- `apps/gateway/src/federation/server/federation-auth.guard.ts`
- `apps/gateway/src/federation/server/scope.service.ts`
- `apps/gateway/src/federation/server/verbs/{list,get,capabilities}.controller.ts`
- `apps/gateway/src/federation/client/query-source.service.ts` (fan-out/merge)
- `tools/federation-harness/` (compose + seed + test helpers)
- `packages/types` — federation request/response DTOs in `federation.dto.ts`

**Acceptance tests:**

| #  | Test | Layer |
| -- | ---- | ----- |
| 1  | A→B `list tasks` returns the subject user's tasks intersected with scope | E2E |
| 2  | A→B `list tasks` with `include_teams: [T1]` excludes T2 tasks the user owns | E2E |
| 3  | A→B `get credential <id>` returns 403 when `credentials` is in `excluded_resources` | E2E |
| 4  | Client presenting the cert for grant X cannot query the subject user of grant Y (cross-user isolation) | E2E |
| 5  | Cert signed by an untrusted CA is rejected at the TLS layer (no NestJS handler reached) | E2E |
| 6  | Malformed SAN OIDs → 401; cert valid but grant revoked in DB → 403 | Integration |
| 7  | `max_rows_per_query` caps the response; requests for more are paginated | Integration |
| 8  | `source: "all"` fan-out merges local + federated results, each tagged with `_source` | Integration |
| 9  | Federation responses never persist: DB row count unchanged after a `list` round-trip | E2E |
| 10 | Scope cannot grant more than native RBAC: a user without access to team T still gets `[]` even if scope allows T | E2E |

**Estimated budget:** ~40K tokens (largest milestone — core federation logic + harness)

**Risk notes:** This is the critical trust boundary. Code review should focus on scope-enforcement bypass and cert-SAN-spoofing paths. Every 403/401 path needs a test.

---
## M4 — `search` Verb + Audit Log + Rate Limit

**Goal:** Keyword search over allowed resources with a full audit trail and per-grant rate limiting.

**Scope:**

- `search` verb across the `resources` allowlist (intersection of scope + native RBAC)
- Keyword search (reuse existing `packages/memory/src/adapters/keyword.ts`); pgvector search stays out of the v1 search verb
- Every federated request (all verbs) writes to `federation_audit_log`: `grant_id`, `verb`, `resource`, `query_hash`, `outcome`, `bytes_out`, `latency_ms`
- No request body is captured; `query_hash` is the SHA-256 of normalized query params
- Token-bucket rate limit per grant (default 60/min, overridable per grant)
- 429 response with a `Retry-After` header and structured body
- 90-day hot retention for the audit log; cold-tier rollover deferred to M7
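A token bucket matching the default above (60 requests per minute per grant) can be sketched as follows. The class shape is an assumption, not the real `rate-limit.guard.ts`; time is injected so the behavior is deterministic under test.

```typescript
// Minimal token bucket: capacity tokens, refilled continuously.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,    // e.g. 60
    private readonly refillPerMs: number, // e.g. 60 / 60_000 for 60/min
    now: number = Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // True → the request may proceed; false → respond 429 with Retry-After.
  tryConsume(now: number = Date.now()): boolean {
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Milliseconds until the next token becomes available (Retry-After source).
  retryAfterMs(): number {
    return Math.ceil(Math.max(0, (1 - this.tokens) / this.refillPerMs));
  }
}
```

A per-grant override (acceptance test 7) would simply construct the bucket with a different capacity/refill rate read from the grant row.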
**Deliverables:**

- `apps/gateway/src/federation/server/verbs/search.controller.ts`
- `apps/gateway/src/federation/server/audit.service.ts` (async write, no blocking)
- `apps/gateway/src/federation/server/rate-limit.guard.ts`
- Tests in the harness

**Acceptance tests:**

| # | Test | Layer |
| - | ---- | ----- |
| 1 | `search` returns ranked hits only from allowed resources | E2E |
| 2 | `search` excluding `credentials` does not return a match even when the keyword matches a credential name | E2E |
| 3 | Every successful request appears in `federation_audit_log` within 1s | Integration |
| 4 | A denied request (403) is also audited, with `outcome='denied'` | Integration |
| 5 | Audit row stores the query hash but NOT the query body | Unit |
| 6 | The 61st request in a 60s window returns 429 with `Retry-After` | E2E |
| 7 | Per-grant override (e.g., 600/min) takes effect without a restart | Integration |
| 8 | Audit writes are async: request latency unchanged when the audit write is slow (simulated) | Integration |

**Estimated budget:** ~20K tokens

**Risk notes:** Ensure audit writes can't block or error-out the request path; use a bounded queue and a drop-with-counter pattern rather than in-line writes.
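The drop-with-counter pattern from the risk note, together with the `query_hash` rule from the scope, can be sketched as below. `BoundedAuditQueue` and `hashQuery` are illustrative names, and the normalization (key-sorted JSON) is an assumption about what "normalized query params" will mean in practice.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  grantId: string;
  verb: string;
  resource: string;
  queryHash: string;
  outcome: string;
}

// query_hash: SHA-256 over key-sorted params, so equivalent queries hash
// identically regardless of key order. The body itself is never stored.
function hashQuery(params: Record<string, unknown>): string {
  const normalized = JSON.stringify(
    Object.keys(params).sort().map((k) => [k, params[k]]),
  );
  return createHash("sha256").update(normalized).digest("hex");
}

// Bounded queue: overflow increments a counter instead of blocking or
// throwing into the request path. A background worker drains batches.
class BoundedAuditQueue {
  private queue: AuditEntry[] = [];
  public dropped = 0;

  constructor(private readonly maxSize: number) {}

  enqueue(entry: AuditEntry): void {
    if (this.queue.length >= this.maxSize) {
      this.dropped += 1; // never block the request path
      return;
    }
    this.queue.push(entry);
  }

  drain(): AuditEntry[] {
    const batch = this.queue;
    this.queue = [];
    return batch;
  }
}
```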
---
## M5 — Cache + Offline Degradation + Observability

**Goal:** Sessions feel fast and stay useful when the peer is slow or down.

**Scope:**

- In-memory response cache keyed by `(grant_id, verb, resource, query_hash)`, TTL 30s by default
- Cache NOT used for `search`; only `list` and `get`
- Cache flushed on cert rotation and grant revocation
- Circuit breaker per peer: after N failures, fast-fail for a cooldown window
- `_source` tagging extended with `_cached: true` when served from cache
- Agent-visible "federation offline for `<peer>`" signal emitted once per session per peer
- OTEL spans: `federation.request` with attrs `grant_id`, `peer`, `verb`, `resource`, `outcome`, `latency_ms`, `cached`
- W3C `traceparent` propagated across the mTLS boundary (both directions)
- `mosaic federation status` CLI subcommand
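The cache rules above can be sketched in a few lines; `ResponseCache` is an illustrative shape, not the real `response-cache.service.ts`. It encodes the three invariants stated in the scope: the composite key, the refusal to cache `search`, and a per-grant flush for revocation/rotation.

```typescript
interface CachedValue<T> { value: T; expiresAt: number }

class ResponseCache<T> {
  private entries = new Map<string, CachedValue<T>>();

  constructor(private readonly ttlMs = 30_000) {}

  private key(grantId: string, verb: string, resource: string, queryHash: string): string {
    return `${grantId}:${verb}:${resource}:${queryHash}`;
  }

  get(grantId: string, verb: string, resource: string, queryHash: string, now = Date.now()): T | undefined {
    if (verb === "search") return undefined; // search is never cached
    const hit = this.entries.get(this.key(grantId, verb, resource, queryHash));
    if (!hit || hit.expiresAt <= now) return undefined;
    return hit.value; // caller adds `_cached: true` to the `_source` tagging
  }

  set(grantId: string, verb: string, resource: string, queryHash: string, value: T, now = Date.now()): void {
    if (verb === "search") return;
    this.entries.set(this.key(grantId, verb, resource, queryHash), { value, expiresAt: now + this.ttlMs });
  }

  // Called on grant revocation and on cert rotation.
  flushGrant(grantId: string): void {
    for (const k of this.entries.keys()) {
      if (k.startsWith(`${grantId}:`)) this.entries.delete(k);
    }
  }
}
```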
**Deliverables:**

- `apps/gateway/src/federation/client/response-cache.service.ts`
- `apps/gateway/src/federation/client/circuit-breaker.service.ts`
- `apps/gateway/src/federation/observability/` (span helpers)
- `packages/mosaic/src/commands/federation/status.ts`

**Acceptance tests:**

| # | Test | Layer |
| - | ---- | ----- |
| 1 | Two identical `list` calls within 30s: the second is served from cache and flagged `_cached` | Integration |
| 2 | `search` is never cached: two identical searches both hit the peer | Integration |
| 3 | After grant revocation, the peer's cache is flushed immediately | Integration |
| 4 | After N consecutive failures, the circuit opens; subsequent requests fail fast without a network call | E2E |
| 5 | Circuit closes after the cooldown and the next success | E2E |
| 6 | With the peer offline, a session completes using local data, with one "federation offline" signal surfaced | E2E |
| 7 | OTEL traces show spans on both gateways correlated by `traceparent` | E2E |
| 8 | `mosaic federation status` prints peer state, cert expiry, last success/failure, circuit state | Unit |

**Estimated budget:** ~20K tokens

**Risk notes:** Caching correctness under revocation must be provable — write tests that intentionally race revocation against cached hits.
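The per-peer circuit breaker from the scope above can be sketched as below. Threshold and cooldown values are assumptions (the scope leaves N and the window unspecified); after the cooldown the breaker allows a single probe, matching acceptance tests 4 and 5.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold = 5,     // N consecutive failures (assumed default)
    private readonly cooldownMs = 30_000, // assumed cooldown window
  ) {}

  // True → fail fast without making a network call.
  isOpen(now = Date.now()): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.cooldownMs) return false; // half-open: allow a probe
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // circuit closes on the next success
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}
```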
---
## M6 — Revocation, Auto-Renewal, CRL

**Goal:** The grant lifecycle works end-to-end: admin revoke, revoke-on-delete, automatic cert renewal, CRL distribution.

**Scope:**

- `mosaic federation grant revoke <id>` → status `revoked`, CRL updated, audit entry written
- DB hook: deleting a user cascades `revoke-on-delete` to all grants where that user is the subject
- Step-CA CRL endpoint exposed; the serving gateway enforces a CRL check on every handshake (cached CRL, 60s refresh interval)
- Client-side cert renewal job: at T-7 days, submit a renewal CSR; rotate the cert atomically; flush the cache
- On renewal failure, the peer is marked `degraded` and an admin-visible alert is emitted
- Server A detects revocation on the next request (TLS handshake fails with a specific error) → peer marked `revoked`, user notified
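The renewal trigger and the atomic swap from the scope (and the risk note below) can be sketched together. `PeerCredentials` is an illustrative shape, not the real `renewal.job.ts`; the idea is that the new TLS context is fully constructed before a single reference assignment swaps it in, so in-flight requests keep the context they captured.

```typescript
const RENEWAL_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // T-7 days

interface TlsContext { certPem: string; keyPem: string }

class PeerCredentials {
  constructor(private current: TlsContext, private notAfterMs: number) {}

  // Callers capture the context once per request; a later swap never mutates it.
  contextForRequest(): TlsContext {
    return this.current;
  }

  dueForRenewal(nowMs: number): boolean {
    return this.notAfterMs - nowMs <= RENEWAL_WINDOW_MS;
  }

  // "Atomic" in JS terms: one reference assignment, done only after the new
  // context is fully built and validated. Failure before this point leaves
  // the old cert in place (peer then marked `degraded` elsewhere).
  install(next: TlsContext, notAfterMs: number): void {
    this.current = next;
    this.notAfterMs = notAfterMs;
  }
}
```

A real implementation would read `notAfter` from the parsed X.509 cert and flush the response cache after `install`, per the scope above.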
**Deliverables:**

- `apps/gateway/src/federation/server/crl.service.ts` + endpoint
- `apps/gateway/src/federation/server/revocation.service.ts`
- DB cascade trigger or ORM hook for user deletion → grant revocation
- `apps/gateway/src/federation/client/renewal.job.ts` (scheduled)
- `packages/mosaic/src/commands/federation/grant.ts` gains a `revoke` subcommand

**Acceptance tests:**

| # | Test | Layer |
| - | ---- | ----- |
| 1 | Admin `grant revoke` → A's next request fails with a TLS-level error | E2E |
| 2 | Deleting the subject user on B auto-revokes all grants where that user was the subject | Integration |
| 3 | CRL endpoint serves the correct list; the revoked cert is present | Integration |
| 4 | Server rejects a cert listed in the CRL even if the cert itself is still time-valid | E2E |
| 5 | A cert at T-7 days triggers the renewal job; the new cert is issued and installed without dropped requests | E2E |
| 6 | Renewal failure marks the peer `degraded` and surfaces an alert | Integration |
| 7 | A marks the peer `revoked` after a revocation-caused handshake failure (not on transient network errors) | E2E |

**Estimated budget:** ~20K tokens

**Risk notes:** The atomic cert swap during renewal is the sharpest edge here — any in-flight request mid-swap must either complete on the old cert or retry on the new one, never fail mid-call.

---
## M7 — Multi-User RBAC Hardening + Team-Scoped Grants + Acceptance Suite

**Goal:** The full multi-tenant scenario from the PRD §4 user stories works end-to-end, with no cross-user leakage under any circumstance.

**Scope:**

- Three-user scenario on Server B (E1, E2, E3), each with their own Server A
- Team-scoped grants exercised: each employee's team data is visible on their own A, but E1's personal data is never visible on E2's A
- User-facing UI surfaces on both gateways for: peer list, grant list, audit log viewer, scope editor
- Negative-path test matrix (every denial path from PRD §8)
- All PRD §15 acceptance criteria mapped to automated tests in the harness
- Security review: cert-spoofing, scope-bypass, and audit-bypass paths explicitly tested
- Cold-storage rollover for audit log entries older than 90 days
- Docs: operator runbook, onboarding guide, troubleshooting guide

**Deliverables:**

- Full federation acceptance suite in `tools/federation-harness/acceptance/`
- `apps/web` surfaces for peer/grant/audit management
- `docs/federation/RUNBOOK.md`, `docs/federation/ONBOARDING.md`, `docs/federation/TROUBLESHOOTING.md`
- Audit cold-tier job (daily cron, moves rows older than 90 days to a separate table or object storage)

**Acceptance tests:**

Every PRD §15 criterion must be automated and green. Additionally:

| # | Test | Layer |
| - | ---- | ----- |
| 1 | 3-employee scenario: each A sees only its own user's data from B | E2E |
| 2 | A grant with team scope returns team data; the same grant is denied access to another employee's personal data | E2E |
| 3 | Concurrent sessions from E1's and E2's Server A to B interleave without any leakage | E2E |
| 4 | Audit log across the 3-user test shows per-grant trails with no mis-attributed rows | E2E |
| 5 | Scope editor UI round-trip: edit → save → next request uses the new scope | E2E |
| 6 | Attempt to use a revoked grant's cert against a different grant's endpoint is rejected | E2E |
| 7 | 90-day-old audit rows are moved to the cold tier; queryable via an explicit historical query | Integration |
| 8 | Runbook steps validated: an operator following the runbook can onboard, rotate, and revoke | Manual checklist |

**Estimated budget:** ~25K tokens

**Risk notes:** This is the security-critical milestone. Budgeting review time here is non-negotiable — plan for two independent code reviews (internal + security-focused) before merge.

---
## Total Budget & Timeline Sketch

| Milestone | Tokens (est.) | Can parallelize?        |
| --------- | ------------- | ----------------------- |
| M1        | 20K           | No (foundation)         |
| M2        | 30K           | No (needs M1)           |
| M3        | 40K           | No (needs M2)           |
| M4        | 20K           | No (needs M3)           |
| M5        | 20K           | Yes (with M6, after M4) |
| M6        | 20K           | Yes (with M5, after M3) |
| M7        | 25K           | No (needs all)          |
| **Total** | **~175K**     |                         |

Running M5 and M6 in parallel after M4 saves one milestone's worth of serial time.

---
## Exit Criteria (federation feature complete)

All of the following must be green on `main`:

- Every PRD §15 acceptance criterion automated and passing
- Every milestone's acceptance table green
- Security review sign-off on M7
- Runbook walk-through completed by an operator (not the author)
- `mosaic doctor` recognizes the federated tier and reports peer health accurately
- Two-gateway production deployment (woltje.com ↔ uscllc.com) operational for ≥7 days without incident

---
## Next Step After This Doc Is Approved

1. File tracking issues on `git.mosaicstack.dev/mosaicstack/stack` — one per milestone, labeled `epic:federation`
2. Populate `docs/TASKS.md` with M1's task breakdown (per-task agent assignment, budget, dependencies)
3. Begin M1 implementation