# Mosaic Stack — Federation Implementation Milestones

Companion to: `PRD.md`

Approach: Each milestone is a verifiable slice. A milestone is "done" only when its acceptance tests pass in CI against a real (not mocked) dependency stack.
## Milestone Dependency Graph

```
M1 (federated tier infra)
└── M2 (Step-CA + grant schema + CLI)
    └── M3 (mTLS handshake + list/get + scope enforcement)
        ├── M4 (search + audit + rate limit)
        │   └── M5 (cache + offline degradation + OTEL)
        ├── M6 (revocation + auto-renewal) ◄── can start after M3
        └── M7 (multi-user hardening + e2e suite) ◄── depends on M4+M5+M6
```

M5 and M6 can run in parallel once M4 is merged.
## Test Strategy (applies to all milestones)

Three layers, all required before a milestone ships:
| Layer | Scope | Runtime |
|---|---|---|
| Unit | Per-module logic, pure functions, adapters | Vitest, no I/O |
| Integration | Single gateway against real PG/Valkey/Step-CA | Vitest + Docker Compose test profile |
| Federation E2E | Two gateways on a Docker network, real mTLS | Playwright/custom harness (tools/federation-harness/) introduced in M3 |
Every milestone adds tests to these layers. A milestone cannot be claimed complete if the federation E2E harness fails (applies from M3 onward).
Quality gates per milestone (same as stack-wide):

- `pnpm typecheck` green
- `pnpm lint` green
- `pnpm test` green (unit + integration)
- `pnpm test:federation` green (M3+)
- Independent code review passed
- Docs updated (`docs/federation/`)
- Merged PR on `main`, CI terminal green, linked issue closed
## M1 — Federated Tier Infrastructure

Goal: A gateway can run in federated tier with containerized Postgres + Valkey + pgvector, with no federation logic active yet.
Scope:

- Add `"tier": "federated"` to `mosaic.config.json` schema and validators
- Docker Compose `federated` profile (`docker-compose.federated.yml`) adds: Postgres+pgvector (5433), Valkey (6380), dedicated volumes
- Tier detector in gateway bootstrap: reads config, asserts required services reachable, refuses to start otherwise
- `pgvector` extension installed + verified on startup
- Migration logic: safe upgrade path from `local`/`standalone` → `federated` (data export/import script, one-way)
- `mosaic doctor` reports tier + service health
- Gateway continues to serve as a normal standalone instance (no federation yet)
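The tier detector's fail-fast behavior can be sketched as below. This is a config-presence check only (actual reachability probes are elided), and every name here — `MosaicConfig`, `validateFederatedTier`, the service-URL fields — is an illustrative assumption, not the shipped `tier-detector.ts`:

```typescript
// Hypothetical config shape; the real mosaic.config.json schema may differ.
type Tier = "local" | "standalone" | "federated";

interface MosaicConfig {
  tier: Tier;
  services?: { postgresUrl?: string; valkeyUrl?: string };
}

// Returns the list of missing requirements; an empty list means boot may
// proceed (reachability probes would run next, not shown here).
function validateFederatedTier(config: MosaicConfig): string[] {
  const missing: string[] = [];
  if (config.tier !== "federated") return missing; // nothing to assert
  if (!config.services?.postgresUrl) missing.push("services.postgresUrl");
  if (!config.services?.valkeyUrl) missing.push("services.valkeyUrl");
  return missing;
}
```

On a non-empty result, bootstrap logs the missing entries and exits non-zero, which is what acceptance test 2 exercises.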
Deliverables:

- `mosaic.config.json` schema v2 (tier enum includes `federated`)
- `apps/gateway/src/bootstrap/tier-detector.ts`
- `docker-compose.federated.yml`
- `scripts/migrate-to-federated.ts`
- Updated `mosaic doctor` output
- Updated `packages/storage/src/adapters/postgres.ts` with pgvector support
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | Gateway boots in federated tier with all services present | Integration |
| 2 | Gateway refuses to boot in federated tier when Postgres unreachable (fail-fast, clear error) | Integration |
| 3 | pgvector extension available in target DB (`SELECT * FROM pg_extension WHERE extname='vector'`) | Integration |
| 4 | Migration script moves a populated local (PGlite) instance to federated (Postgres) with no data loss | Integration |
| 5 | `mosaic doctor` reports correct tier and all services green | Unit |
| 6 | Existing standalone behavior regression: agent session works end-to-end, no federation references | E2E (single-gateway) |
Estimated budget: ~20K tokens (infra + config + migration script)

Risk notes: pgvector install on existing PG installs is occasionally finicky; test the migration path on a realistic DB snapshot.
## M2 — Step-CA + Grant Schema + Admin CLI

Goal: An admin can create a federation grant and its counterparty can enroll. No runtime traffic flows yet.
Scope:

- Embed Step-CA as a Docker Compose sidecar with a persistent CA volume
- Gateway exposes a short-lived enrollment endpoint (single-use token from the grant)
- DB schema: `federation_grants`, `federation_peers`, `federation_audit_log` (table only, not yet written to)
- Sealed storage for `client_key_pem` using the existing credential sealing key
- Admin CLI:
  - `mosaic federation grant create --user <id> --peer <host> --scope <file>`
  - `mosaic federation grant list`
  - `mosaic federation grant show <id>`
  - `mosaic federation peer add <enrollment-url>`
  - `mosaic federation peer list`
- Step-CA signs the cert with SAN OIDs for `grantId` + `subjectUserId`
- Grant status transitions: `pending` → `active` on successful enrollment
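The status transitions above (plus the `revoked` terminal state that M6 adds) form a small state machine. A minimal sketch, with names that are assumptions rather than the shipped `grants.service.ts` API:

```typescript
// Illustrative grant lifecycle; the real enum types live in the packages/db
// migration and may be shaped differently.
type GrantStatus = "pending" | "active" | "revoked";

const ALLOWED: Record<GrantStatus, GrantStatus[]> = {
  pending: ["active", "revoked"], // enrollment succeeds, or admin cancels early
  active: ["revoked"],            // M6: admin revoke / revoke-on-delete
  revoked: [],                    // terminal state
};

function canTransition(from: GrantStatus, to: GrantStatus): boolean {
  return ALLOWED[from].includes(to);
}

function transition(from: GrantStatus, to: GrantStatus): GrantStatus {
  if (!canTransition(from, to)) {
    throw new Error(`illegal grant transition: ${from} -> ${to}`);
  }
  return to;
}
```

Making `revoked` strictly terminal keeps the CRL story in M6 simple: a revoked grant can never silently re-activate.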
Deliverables:

- `packages/db` migration: three federation tables + enum types
- `apps/gateway/src/federation/ca.service.ts` (Step-CA client)
- `apps/gateway/src/federation/grants.service.ts`
- `apps/gateway/src/federation/enrollment.controller.ts`
- `packages/mosaic/src/commands/federation/` (grant + peer subcommands)
- `docker-compose.federated.yml` adds Step-CA service
- Scope JSON schema + validator
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | `grant create` writes a pending row with a scoped bundle | Integration |
| 2 | Enrollment endpoint signs a CSR and returns a cert with expected SAN OIDs | Integration |
| 3 | Enrollment token is single-use; second attempt returns 410 | Integration |
| 4 | Cert `subjectUserId` OID matches the grant's `subject_user_id` | Unit |
| 5 | `client_key_pem` is at-rest encrypted; raw DB read shows ciphertext, not PEM | Integration |
| 6 | `peer add <url>` on Server A yields an active peer record with a valid cert + key | E2E (two gateways, no traffic) |
| 7 | Scope JSON with unknown resource type rejected at `grant create` | Unit |
| 8 | `grant list` and `peer list` render active / pending / revoked accurately | Unit |
Estimated budget: ~30K tokens (schema + CA integration + CLI + sealing)

Risk notes: Step-CA's API surface is well-documented, but the sealing integration with existing provider-credential encryption is a cross-module concern — walk that seam deliberately.
## M3 — mTLS Handshake + `list` + `get` with Scope Enforcement

Goal: Two federated gateways exchange real data over mTLS with scope intersecting native RBAC.
Scope:

- `FederationClient` (outbound): picks cert from `federation_peers`, does mTLS call
- `FederationServer` (inbound): NestJS guard validates client cert, extracts `grantId` + `subjectUserId`, loads grant
- Scope enforcement pipeline:
  - Resource allowlist / excluded-list check
  - Native RBAC evaluation as the `subjectUserId`
  - Scope filter intersection (`include_teams`, `include_personal`)
  - `max_rows_per_query` cap
- Verbs: `list`, `get`, `capabilities`
- Gateway query layer accepts `source: "local" | "federated:<host>" | "all"`; fan-out for `"all"`
- Federation E2E harness (`tools/federation-harness/`): `docker-compose.two-gateways.yml`, seed script, assertion helpers — this is its own deliverable
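The key invariant in the pipeline above is ordering: native RBAC is evaluated first, and the scope filter only ever narrows that set, so scope can never widen access (acceptance test 10). A minimal sketch of the intersection step — the `Task` and `Scope` shapes are illustrative, not the real DTOs in `federation.dto.ts`:

```typescript
// Illustrative shapes only; real resources carry more fields.
interface Task { id: string; teamId: string | null; ownerId: string }
interface Scope {
  include_teams: string[];
  include_personal: boolean;
  max_rows_per_query: number;
}

// `rbacVisible` is assumed to already be the RBAC-filtered result for
// subjectUserId, so this intersection can only remove rows, never add them.
function applyScope(rbacVisible: Task[], scope: Scope, subjectUserId: string): Task[] {
  return rbacVisible
    .filter((t) =>
      t.teamId !== null
        ? scope.include_teams.includes(t.teamId)          // team resources
        : scope.include_personal && t.ownerId === subjectUserId // personal
    )
    .slice(0, scope.max_rows_per_query); // row cap; real pagination elided
}
```

A user with no RBAC access to team T gets an empty `rbacVisible` for T before this function runs, so a scope that lists T still yields `[]`.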
Deliverables:

- `apps/gateway/src/federation/client/federation-client.service.ts`
- `apps/gateway/src/federation/server/federation-auth.guard.ts`
- `apps/gateway/src/federation/server/scope.service.ts`
- `apps/gateway/src/federation/server/verbs/{list,get,capabilities}.controller.ts`
- `apps/gateway/src/federation/client/query-source.service.ts` (fan-out/merge)
- `tools/federation-harness/` (compose + seed + test helpers)
- `packages/types` — federation request/response DTOs in `federation.dto.ts`
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | A→B `list` tasks returns subjectUser's tasks intersected with scope | E2E |
| 2 | A→B `list` tasks with `include_teams: [T1]` excludes T2 tasks the user owns | E2E |
| 3 | A→B `get` credential `<id>` returns 403 when `credentials` is in `excluded_resources` | E2E |
| 4 | Client presenting cert for grant X cannot query subjectUser of grant Y (cross-user isolation) | E2E |
| 5 | Cert signed by untrusted CA rejected at TLS layer (no NestJS handler reached) | E2E |
| 6 | Malformed SAN OIDs → 401; cert valid but grant revoked in DB → 403 | Integration |
| 7 | `max_rows_per_query` caps response; request for more is paginated | Integration |
| 8 | `source: "all"` fan-out merges local + federated results, each tagged with `_source` | Integration |
| 9 | Federation responses never persist: verify DB row count unchanged after `list` round-trip | E2E |
| 10 | Scope cannot grant more than native RBAC: user without access to team T still gets `[]` even if scope allows T | E2E |
Estimated budget: ~40K tokens (largest milestone — core federation logic + harness)

Risk notes: This is the critical trust boundary. Code review should focus on scope-enforcement bypass and cert-SAN-spoofing paths. Every 403/401 path needs a test.
## M4 — `search` Verb + Audit Log + Rate Limit

Goal: Keyword search over allowed resources with full audit and per-grant rate limiting.
Scope:

- `search` verb across `resources` allowlist (intersection of scope + native RBAC)
- Keyword search (reuse existing `packages/memory/src/adapters/keyword.ts`); pgvector search stays out of v1 search verb
- Every federated request (all verbs) writes to `federation_audit_log`: `grant_id`, `verb`, `resource`, `query_hash`, `outcome`, `bytes_out`, `latency_ms`
- No request body captured; `query_hash` is SHA-256 of normalized query params
- Token-bucket rate limit per grant (default 60/min, override per grant)
- 429 response with `Retry-After` header and structured body
- 90-day hot retention for audit log; cold-tier rollover deferred to M7
Deliverables:

- `apps/gateway/src/federation/server/verbs/search.controller.ts`
- `apps/gateway/src/federation/server/audit.service.ts` (async write, no blocking)
- `apps/gateway/src/federation/server/rate-limit.guard.ts`
- Tests in harness
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | `search` returns ranked hits only from allowed resources | E2E |
| 2 | `search` excluding `credentials` does not return a match even when keyword matches a credential name | E2E |
| 3 | Every successful request appears in `federation_audit_log` within 1s | Integration |
| 4 | Denied request (403) is also audited with `outcome='denied'` | Integration |
| 5 | Audit row stores query hash but NOT query body | Unit |
| 6 | 61st request in 60s window returns 429 with `Retry-After` | E2E |
| 7 | Per-grant override (e.g., 600/min) takes effect without restart | Integration |
| 8 | Audit writes are async: request latency unchanged when audit write slow (simulated) | Integration |
Estimated budget: ~20K tokens

Risk notes: Ensure audit writes can't block or error-out the request path; use a bounded queue and drop-with-counter pattern rather than in-line writes.
## M5 — Cache + Offline Degradation + Observability

Goal: Sessions feel fast and stay useful when the peer is slow or down.
Scope:

- In-memory response cache keyed by `(grant_id, verb, resource, query_hash)`, TTL 30s default
- Cache NOT used for `search`; only `list` and `get`
- Cache flushed on cert rotation and grant revocation
- Circuit breaker per peer: after N failures, fast-fail for cooldown window
- `_source` tagging extended with `_cached: true` when served from cache
- Agent-visible "federation offline for `<peer>`" signal emitted once per session per peer
- OTEL spans: `federation.request` with attrs `grant_id`, `peer`, `verb`, `resource`, `outcome`, `latency_ms`, `cached`
- W3C `traceparent` propagated across the mTLS boundary (both directions)
- `mosaic federation status` CLI subcommand
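The per-peer circuit breaker behavior above (N failures open the circuit, requests fail fast until the cooldown elapses) can be sketched like this. The class shape and thresholds are assumptions, not the real `circuit-breaker.service.ts`:

```typescript
// One instance per peer host.
class PeerCircuit {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold: number, private cooldownMs: number) {}

  // Call before dialing the peer: false means fail fast, no network call.
  allows(now: number): boolean {
    return this.failures < this.threshold || now - this.openedAt >= this.cooldownMs;
  }

  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now; // (re)open
  }

  recordSuccess(): void {
    this.failures = 0; // first success after cooldown closes the circuit
  }
}
```

Note the half-open probe falls out naturally: once the cooldown elapses, `allows` returns true for one real attempt, and that attempt's success or failure either closes or re-opens the circuit.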
Deliverables:

- `apps/gateway/src/federation/client/response-cache.service.ts`
- `apps/gateway/src/federation/client/circuit-breaker.service.ts`
- `apps/gateway/src/federation/observability/` (span helpers)
- `packages/mosaic/src/commands/federation/status.ts`
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | Two identical `list` calls within 30s: second served from cache, flagged `_cached` | Integration |
| 2 | `search` is never cached: two identical searches both hit the peer | Integration |
| 3 | After grant revocation, peer's cache is flushed immediately | Integration |
| 4 | After N consecutive failures, circuit opens; subsequent requests fail-fast without network call | E2E |
| 5 | Circuit closes after cooldown and next success | E2E |
| 6 | With peer offline, session completes using local data, one "federation offline" signal surfaced | E2E |
| 7 | OTEL traces show spans on both gateways correlated by `traceparent` | E2E |
| 8 | `mosaic federation status` prints peer state, cert expiry, last success/failure, circuit state | Unit |
Estimated budget: ~20K tokens

Risk notes: Caching correctness under revocation must be provable — write tests that intentionally race revocation against cached hits.
## M6 — Revocation, Auto-Renewal, CRL

Goal: Grant lifecycle works end-to-end: admin revoke, revoke-on-delete, automatic cert renewal, CRL distribution.
Scope:

- `mosaic federation grant revoke <id>` → status `revoked`, CRL updated, audit entry
- DB hook: deleting a user cascades revoke-on-delete on all grants where that user is subject
- Step-CA CRL endpoint exposed; serving gateway enforces CRL check on every handshake (cached CRL, refresh interval 60s)
- Client-side cert renewal job: at T-7 days, submit renewal CSR; rotate cert atomically; flush cache
- On renewal failure, peer marked `degraded` and admin-visible alert emitted
- Server A detects revocation on next request (TLS handshake fails with specific error) → peer marked `revoked`, user notified
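The T-7-day trigger in the renewal job above reduces to a single time comparison on each scheduled tick. A sketch under assumed names (the real logic lives in `renewal.job.ts`):

```typescript
// Renewal window: start renewing once the cert enters its final 7 days.
const RENEWAL_WINDOW_MS = 7 * 24 * 60 * 60 * 1000;

// True when the renewal CSR should be submitted on this job tick; also true
// for an already-expired cert, so a gateway that was down past expiry still
// attempts renewal on restart.
function needsRenewal(certNotAfterMs: number, nowMs: number): boolean {
  return certNotAfterMs - nowMs <= RENEWAL_WINDOW_MS;
}
```

The atomic swap itself is best done by building a complete new TLS context from the renewed cert and replacing the reference in one assignment, so in-flight requests keep the old context and new requests pick up the new one.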
Deliverables:

- `apps/gateway/src/federation/server/crl.service.ts` + endpoint
- `apps/gateway/src/federation/server/revocation.service.ts`
- DB cascade trigger or ORM hook for user deletion → grant revocation
- `apps/gateway/src/federation/client/renewal.job.ts` (scheduled)
- `packages/mosaic/src/commands/federation/grant.ts` gains `revoke` subcommand
Acceptance tests:

| # | Test | Layer |
|---|---|---|
| 1 | Admin `grant revoke` → A's next request fails with TLS-level error | E2E |
| 2 | Deleting subject user on B auto-revokes all grants where that user was the subject | Integration |
| 3 | CRL endpoint serves correct list; revoked cert present | Integration |
| 4 | Server rejects cert listed in CRL even if cert itself is still time-valid | E2E |
| 5 | Cert at T-7 days triggers renewal job; new cert issued and installed without dropped requests | E2E |
| 6 | Renewal failure marks peer `degraded` and surfaces alert | Integration |
| 7 | A marks peer `revoked` after a revocation-caused handshake failure (not on transient network errors) | E2E |
Estimated budget: ~20K tokens

Risk notes: The atomic cert swap during renewal is the sharpest edge here — any in-flight request mid-swap must either complete on old or retry on new, never fail mid-call.
## M7 — Multi-User RBAC Hardening + Team-Scoped Grants + Acceptance Suite

Goal: The full multi-tenant scenario from §4 user stories works end-to-end, with no cross-user leakage under any circumstance.
Scope:
- Three-user scenario on Server B (E1, E2, E3) each with their own Server A
- Team-scoped grants exercised: each employee's team-data visible on their own A, but E1's personal data never visible on E2's A
- User-facing UI surfaces on both gateways for: peer list, grant list, audit log viewer, scope editor
- Negative-path test matrix (every denial path from PRD §8)
- All PRD §15 acceptance criteria mapped to automated tests in the harness
- Security review: cert-spoofing, scope-bypass, audit-bypass paths explicitly tested
- Cold-storage rollover for audit log >90 days
- Docs: operator runbook, onboarding guide, troubleshooting guide
Deliverables:

- Full federation acceptance suite in `tools/federation-harness/acceptance/`
- `apps/web` surfaces for peer/grant/audit management
- `docs/federation/RUNBOOK.md`, `docs/federation/ONBOARDING.md`, `docs/federation/TROUBLESHOOTING.md`
- Audit cold-tier job (daily cron, moves rows >90d to separate table or object storage)
Acceptance tests: Every PRD §15 criterion must be automated and green. Additionally:
| # | Test | Layer |
|---|---|---|
| 1 | 3-employee scenario: each A sees only its user's data from B | E2E |
| 2 | Grant with team scope returns team data; same grant denied access to another employee's personal data | E2E |
| 3 | Concurrent sessions from E1's and E2's Server A to B interleave without any leakage | E2E |
| 4 | Audit log across 3-user test shows per-grant trails with no mis-attributed rows | E2E |
| 5 | Scope editor UI round-trip: edit → save → next request uses new scope | E2E |
| 6 | Attempt to use a revoked grant's cert against a different grant's endpoint: rejected | E2E |
| 7 | 90-day-old audit rows moved to cold tier; queryable via explicit historical query | Integration |
| 8 | Runbook steps validated: an operator following the runbook can onboard, rotate, and revoke | Manual checklist |
Estimated budget: ~25K tokens

Risk notes: This is the security-critical milestone. Budget review time here is non-negotiable — plan for two independent code reviews (internal + security-focused) before merge.
## Total Budget & Timeline Sketch
| Milestone | Tokens (est.) | Can parallelize? |
|---|---|---|
| M1 | 20K | No (foundation) |
| M2 | 30K | No (needs M1) |
| M3 | 40K | No (needs M2) |
| M4 | 20K | No (needs M3) |
| M5 | 20K | Yes (with M6 after M4) |
| M6 | 20K | Yes (with M5 after M3) |
| M7 | 25K | No (needs all) |
| Total | ~175K | |
Parallelization of M5 and M6 after M4 saves one milestone's worth of serial time.
## Exit Criteria (federation feature complete)

All of the following must be green on `main`:
- Every PRD §15 acceptance criterion automated and passing
- Every milestone's acceptance table green
- Security review sign-off on M7
- Runbook walk-through completed by operator (not author)
- `mosaic doctor` recognizes federated tier and reports peer health accurately
- Two-gateway production deployment (woltje.com ↔ uscllc.com) operational for ≥7 days without incident
## Next Step After This Doc Is Approved
- File tracking issues on `git.mosaicstack.dev/mosaicstack/stack` — one per milestone, labeled `epic:federation`
- Populate `docs/TASKS.md` with M1's task breakdown (per-task agent assignment, budget, dependencies)
- Begin M1 implementation