Compare commits
1 Commits
feat/feder
...
docs/feder
| Author | SHA1 | Date | |
|---|---|---|---|
| | 47aac682f5 | | |
368
docs/federation/MILESTONES.md
Normal file
@@ -0,0 +1,368 @@
# Mosaic Stack — Federation Implementation Milestones

**Companion to:** `PRD.md`
**Approach:** Each milestone is a verifiable slice. A milestone is "done" only when its acceptance tests pass in CI against a real (not mocked) dependency stack.

---

## Milestone Dependency Graph

```
M1 (federated tier infra)
 └── M2 (Step-CA + grant schema + CLI)
      └── M3 (mTLS handshake + list/get + scope enforcement)
           ├── M4 (search + audit + rate limit)
           │    └── M5 (cache + offline degradation + OTEL)
           ├── M6 (revocation + auto-renewal)  ◄── can start after M3
           └── M7 (multi-user hardening + e2e suite)  ◄── depends on M4+M5+M6
```

M6 can start as soon as M3 is merged, while M5 must wait for M4. Once M4 is merged, M5 and M6 can run in parallel.

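
The ordering rules in the graph can be checked mechanically. The following is a minimal TypeScript sketch, assuming a hand-written dependency map copied from the graph; `dependsOn` and `readyMilestones` are illustrative names and are not part of the repository.

```typescript
// Dependency edges copied from the graph above. Names are illustrative only.
type Milestone = "M1" | "M2" | "M3" | "M4" | "M5" | "M6" | "M7";

const dependsOn: Record<Milestone, Milestone[]> = {
  M1: [],
  M2: ["M1"],
  M3: ["M2"],
  M4: ["M3"],
  M5: ["M4"],
  M6: ["M3"],
  M7: ["M4", "M5", "M6"],
};

// Returns the milestones that are not done yet but whose dependencies are all done.
function readyMilestones(done: Milestone[]): Milestone[] {
  const finished = new Set<Milestone>(done);
  return (Object.keys(dependsOn) as Milestone[]).filter(
    (m) => !finished.has(m) && dependsOn[m].every((d) => finished.has(d)),
  );
}

// After M3: M4 and M6 are both unblocked.
console.log(readyMilestones(["M1", "M2", "M3"]));
// After M4: M5 and M6 can run in parallel.
console.log(readyMilestones(["M1", "M2", "M3", "M4"]));
```
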

---

## Test Strategy (applies to all milestones)

Three layers, all required before a milestone ships:

| Layer | Scope | Runtime |
| --- | --- | --- |
| **Unit** | Per-module logic, pure functions, adapters | Vitest, no I/O |
| **Integration** | Single gateway against real PG/Valkey/Step-CA | Vitest + Docker Compose test profile |
| **Federation E2E** | Two gateways on a Docker network, real mTLS | Playwright/custom harness (`tools/federation-harness/`) introduced in M3 |

Every milestone adds tests to these layers. A milestone cannot be claimed complete if the federation E2E harness fails (applies from M3 onward).

**Quality gates per milestone** (same as stack-wide):

- `pnpm typecheck` green
- `pnpm lint` green
- `pnpm test` green (unit + integration)
- `pnpm test:federation` green (M3+)
- Independent code review passed
- Docs updated (`docs/federation/`)
- Merged PR on `main`, CI terminal green, linked issue closed

---

## M1 — Federated Tier Infrastructure

**Goal:** A gateway can run in `federated` tier with containerized Postgres + Valkey + pgvector, with no federation logic active yet.

**Scope:**

- Add `"tier": "federated"` to `mosaic.config.json` schema and validators
- Docker Compose `federated` profile (`docker-compose.federated.yml`) adds: Postgres+pgvector (5433), Valkey (6380), dedicated volumes
- Tier detector in gateway bootstrap: reads config, asserts required services reachable, refuses to start otherwise
- `pgvector` extension installed + verified on startup
- Migration logic: safe upgrade path from `local`/`standalone` → `federated` (data export/import script, one-way)
- `mosaic doctor` reports tier + service health
- Gateway continues to serve as a normal standalone instance (no federation yet)

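
The fail-fast behaviour of the tier detector can be sketched as follows. This is a minimal TypeScript illustration, not the contents of `tier-detector.ts`: the `ServiceProbe` shape, `assertTierReady`, and `TierStartupError` are hypothetical names, and the probes are injected so the sketch stays independent of any particular Postgres or Valkey client.

```typescript
// Hypothetical sketch of a fail-fast tier check; names are not from the repository.
type Tier = "local" | "standalone" | "federated";

interface ServiceProbe {
  name: string;
  // Resolves true when the service answers. Injected so tests can fake it.
  check: () => Promise<boolean>;
}

class TierStartupError extends Error {}

// In the federated tier every required service must be reachable before boot continues.
async function assertTierReady(tier: Tier, probes: ServiceProbe[]): Promise<void> {
  if (tier !== "federated") return; // other tiers need no extra services in this sketch
  const results = await Promise.all(
    probes.map(async (p) => ({
      name: p.name,
      ok: await p.check().catch(() => false), // a rejected probe counts as unreachable
    })),
  );
  const down = results.filter((r) => !r.ok).map((r) => r.name);
  if (down.length > 0) {
    throw new TierStartupError(`federated tier requires unreachable service(s): ${down.join(", ")}`);
  }
}

// Example: Postgres is down, so startup is refused with a message that names it.
assertTierReady("federated", [
  { name: "postgres", check: async () => false },
  { name: "valkey", check: async () => true },
]).catch((err: Error) => console.error(err.message));
```

Collecting every failed probe before throwing, rather than stopping at the first one, gives the operator a single complete error message, which is one way to satisfy the "fail-fast, clear" wording of acceptance test 2.
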
**Deliverables:**

- `mosaic.config.json` schema v2 (tier enum includes `federated`)
- `apps/gateway/src/bootstrap/tier-detector.ts`
- `docker-compose.federated.yml`
- `scripts/migrate-to-federated.ts`
- Updated `mosaic doctor` output
- Updated `packages/storage/src/adapters/postgres.ts` with pgvector support

**Acceptance tests:**

| # | Test | Layer |
| --- | --- | --- |
| 1 | Gateway boots in `federated` tier with all services present | Integration |
| 2 | Gateway refuses to boot in `federated` tier when Postgres is unreachable (fail-fast, clear error) | Integration |
| 3 | `pgvector` extension available in target DB (`SELECT * FROM pg_extension WHERE extname='vector'`) | Integration |
| 4 | Migration script moves a populated `local` (PGlite) instance to `federated` (Postgres) with no data loss | Integration |
| 5 | `mosaic doctor` reports correct tier and all services green | Unit |
| 6 | Existing standalone behavior regression: agent session works end-to-end, no federation references | E2E (single-gateway) |

**Estimated budget:** ~20K tokens (infra + config + migration script)
**Risk notes:** Installing pgvector on an existing Postgres instance is occasionally finicky; test the migration path on a realistic DB snapshot.

---

## M2 — Step-CA + Grant Schema + Admin CLI

**Goal:** An admin can create a federation grant and its counterparty can enroll. No runtime traffic flows yet.

**Scope:**

- Embed Step-CA as a Docker Compose sidecar with a persistent CA volume
- Gateway exposes a short-lived enrollment endpoint (single-use token from the grant)
- DB schema: `federation_grants`, `federation_peers`, `federation_audit_log` (table only, not yet written to)
- Sealed storage for `client_key_pem` using the existing credential sealing key
- Admin CLI:
  - `mosaic federation grant create --user <id> --peer <host> --scope <file>`
  - `mosaic federation grant list`
  - `mosaic federation grant show <id>`
  - `mosaic federation peer add <enrollment-url>`
  - `mosaic federation peer list`
- Step-CA signs the cert with SAN OIDs for `grantId` + `subjectUserId`
- Grant status transitions: `pending` → `active` on successful enrollment

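
The single-use token and the `pending` → `active` transition can be illustrated with a small in-memory model. This is a sketch under stated assumptions: the `Grant` shape and `redeemEnrollmentToken` are hypothetical, a real service would persist the change in one database transaction, and the 401 for a wrong token is an assumed status code (only the 410 on reuse comes from the acceptance tests).

```typescript
// Illustrative in-memory model of enrollment; not the actual grants.service.ts.
type GrantStatus = "pending" | "active" | "revoked";

interface Grant {
  id: string;
  status: GrantStatus;
  enrollmentToken: string | null; // null once the token has been consumed
}

// Returns an HTTP-style status code for one enrollment attempt.
function redeemEnrollmentToken(grant: Grant, presented: string): number {
  // Already enrolled, revoked, or token consumed: the token is gone for good.
  if (grant.status !== "pending" || grant.enrollmentToken === null) return 410;
  // Wrong token (assumed status code; production code should compare in constant time).
  if (grant.enrollmentToken !== presented) return 401;
  grant.enrollmentToken = null; // single use
  grant.status = "active"; // pending -> active on successful enrollment
  return 200;
}

const grant: Grant = { id: "g1", status: "pending", enrollmentToken: "tok-123" };
console.log(redeemEnrollmentToken(grant, "tok-123")); // 200, grant is now active
console.log(redeemEnrollmentToken(grant, "tok-123")); // 410, token already used
```
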
**Deliverables:**

- `packages/db` migration: three federation tables + enum types
- `apps/gateway/src/federation/ca.service.ts` (Step-CA client)
- `apps/gateway/src/federation/grants.service.ts`
- `apps/gateway/src/federation/enrollment.controller.ts`
- `packages/mosaic/src/commands/federation/` (grant + peer subcommands)
- `docker-compose.federated.yml` adds Step-CA service
- Scope JSON schema + validator

**Acceptance tests:**

| # | Test | Layer |
| --- | --- | --- |
| 1 | `grant create` writes a `pending` row with a scoped bundle | Integration |
| 2 | Enrollment endpoint signs a CSR and returns a cert with expected SAN OIDs | Integration |
| 3 | Enrollment token is single-use; second attempt returns 410 | Integration |
| 4 | Cert `subjectUserId` OID matches the grant's `subject_user_id` | Unit |
| 5 | `client_key_pem` is at-rest encrypted; raw DB read shows ciphertext, not PEM | Integration |
| 6 | `peer add <url>` on Server A yields an `active` peer record with a valid cert + key | E2E (two gateways, no traffic) |
| 7 | Scope JSON with unknown resource type rejected at `grant create` | Unit |
| 8 | `grant list` and `peer list` render active / pending / revoked accurately | Unit |

**Estimated budget:** ~30K tokens (schema + CA integration + CLI + sealing)
**Risk notes:** Step-CA's API surface is well-documented but the sealing integration with existing provider-credential encryption is a cross-module concern — walk that seam deliberately.

---

## M3 — mTLS Handshake + `list` + `get` with Scope Enforcement

**Goal:** Two federated gateways exchange real data over mTLS with scope intersecting native RBAC.

**Scope:**

- `FederationClient` (outbound): picks cert from `federation_peers`, does mTLS call
- `FederationServer` (inbound): NestJS guard validates client cert, extracts `grantId` + `subjectUserId`, loads grant
- Scope enforcement pipeline:
  1. Resource allowlist / excluded-list check
  2. Native RBAC evaluation as the `subjectUserId`
  3. Scope filter intersection (`include_teams`, `include_personal`)
  4. `max_rows_per_query` cap
- Verbs: `list`, `get`, `capabilities`
- Gateway query layer accepts `source: "local" | "federated:<host>" | "all"`; fan-out for `"all"`
- **Federation E2E harness** (`tools/federation-harness/`): docker-compose.two-gateways.yml, seed script, assertion helpers — this is its own deliverable

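
The four pipeline steps listed above can be sketched as one pure function. This is a hedged TypeScript illustration, not the contents of `scope.service.ts`: the `Scope` field names follow the identifiers used in this document, while the `Row` shape, the `rbacVisible` callback (a stand-in for native RBAC evaluated as the subject user), and `enforceScope` itself are assumptions made for the example.

```typescript
// Hypothetical shapes; only the scope field names come from this document.
interface Scope {
  resources: string[]; // allowlist
  excluded_resources: string[];
  include_teams: string[];
  include_personal: boolean;
  max_rows_per_query: number;
}

interface Row {
  id: string;
  resource: string;
  ownerUserId: string;
  teamId: string | null; // null means personal data
}

type Decision = { status: 403 } | { status: 200; rows: Row[] };

function enforceScope(
  scope: Scope,
  resource: string,
  candidates: Row[],
  rbacVisible: (row: Row) => boolean,
): Decision {
  // 1. Resource allowlist / excluded-list check.
  if (!scope.resources.includes(resource) || scope.excluded_resources.includes(resource)) {
    return { status: 403 };
  }
  const rows = candidates
    .filter((r) => r.resource === resource)
    // 2. Native RBAC evaluation as the subject user.
    .filter(rbacVisible)
    // 3. Scope filter intersection: applied after RBAC, so it can only narrow.
    .filter((r) =>
      r.teamId === null ? scope.include_personal : scope.include_teams.includes(r.teamId),
    )
    // 4. Row cap.
    .slice(0, scope.max_rows_per_query);
  return { status: 200, rows };
}
```

Because step 3 only filters rows that already survived step 2, a scope entry can never widen what RBAC allows, which is the property acceptance test 10 checks.
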
**Deliverables:**

- `apps/gateway/src/federation/client/federation-client.service.ts`
- `apps/gateway/src/federation/server/federation-auth.guard.ts`
- `apps/gateway/src/federation/server/scope.service.ts`
- `apps/gateway/src/federation/server/verbs/{list,get,capabilities}.controller.ts`
- `apps/gateway/src/federation/client/query-source.service.ts` (fan-out/merge)
- `tools/federation-harness/` (compose + seed + test helpers)
- `packages/types` — federation request/response DTOs in `federation.dto.ts`

**Acceptance tests:**

| # | Test | Layer |
| --- | --- | --- |
| 1 | A→B `list tasks` returns subjectUser's tasks intersected with scope | E2E |
| 2 | A→B `list tasks` with `include_teams: [T1]` excludes T2 tasks the user owns | E2E |
| 3 | A→B `get credential <id>` returns 403 when `credentials` is in `excluded_resources` | E2E |
| 4 | Client presenting cert for grant X cannot query subjectUser of grant Y (cross-user isolation) | E2E |
| 5 | Cert signed by untrusted CA rejected at TLS layer (no NestJS handler reached) | E2E |
| 6 | Malformed SAN OIDs → 401; cert valid but grant revoked in DB → 403 | Integration |
| 7 | `max_rows_per_query` caps the response; requests for more rows are paginated | Integration |
| 8 | `source: "all"` fan-out merges local + federated results, each tagged with `_source` | Integration |
| 9 | Federation responses never persist: verify DB row count unchanged after `list` round-trip | E2E |
| 10 | Scope cannot grant more than native RBAC: user without access to team T still gets `[]` even if scope allows T | E2E |

**Estimated budget:** ~40K tokens (largest milestone — core federation logic + harness)
**Risk notes:** This is the critical trust boundary. Code review should focus on scope enforcement bypass and cert-SAN-spoofing paths. Every 403/401 path needs a test.

---

## M4 — `search` Verb + Audit Log + Rate Limit

**Goal:** Keyword search over allowed resources with full audit and per-grant rate limiting.

**Scope:**

- `search` verb across `resources` allowlist (intersection of scope + native RBAC)
- Keyword search (reuse existing `packages/memory/src/adapters/keyword.ts`); pgvector search stays out of v1 search verb
- Every federated request (all verbs) writes to `federation_audit_log`: `grant_id`, `verb`, `resource`, `query_hash`, `outcome`, `bytes_out`, `latency_ms`
- No request body captured; `query_hash` is SHA-256 of normalized query params
- Token-bucket rate limit per grant (default 60/min, override per grant)
- 429 response with `Retry-After` header and structured body
- 90-day hot retention for audit log; cold-tier rollover deferred to M7

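
A per-grant token bucket can be sketched in a few lines. This is a minimal TypeScript illustration under stated assumptions: `TokenBucket` and `checkRateLimit` are hypothetical names, the clock is passed in as `nowMs` so the logic stays deterministic in tests, and the buckets live in process memory (a multi-instance gateway would presumably keep them in Valkey instead).

```typescript
// Hypothetical token bucket; not the contents of rate-limit.guard.ts.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly perMinute: number;

  constructor(perMinute: number, nowMs: number) {
    this.perMinute = perMinute;
    this.tokens = perMinute; // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Returns 0 when the request is allowed, otherwise a Retry-After value in seconds.
  take(nowMs: number): number {
    const refillPerMs = this.perMinute / 60_000;
    const elapsedMs = nowMs - this.lastRefillMs;
    this.tokens = Math.min(this.perMinute, this.tokens + elapsedMs * refillPerMs);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0;
    }
    return Math.ceil((1 - this.tokens) / refillPerMs / 1000);
  }
}

// One bucket per grant; a per-grant override just constructs the bucket with another rate.
const buckets = new Map<string, TokenBucket>();

function checkRateLimit(grantId: string, perMinute: number, nowMs: number): number {
  let bucket = buckets.get(grantId);
  if (!bucket) {
    bucket = new TokenBucket(perMinute, nowMs);
    buckets.set(grantId, bucket);
  }
  return bucket.take(nowMs);
}
```

A guard built on this would translate a non-zero return value into a 429 response and copy the value into the `Retry-After` header. Note that a token bucket refills continuously, so it limits bursts rather than enforcing a strict fixed window.
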
**Deliverables:**

- `apps/gateway/src/federation/server/verbs/search.controller.ts`
- `apps/gateway/src/federation/server/audit.service.ts` (async write, no blocking)
- `apps/gateway/src/federation/server/rate-limit.guard.ts`
- Tests in harness

**Acceptance tests:**

| # | Test | Layer |
| --- | --- | --- |
| 1 | `search` returns ranked hits only from allowed resources | E2E |
| 2 | `search` excluding `credentials` does not return a match even when keyword matches a credential name | E2E |
| 3 | Every successful request appears in `federation_audit_log` within 1s | Integration |
| 4 | Denied request (403) is also audited with `outcome='denied'` | Integration |
| 5 | Audit row stores query hash but NOT query body | Unit |
| 6 | 61st request in 60s window returns 429 with `Retry-After` | E2E |
| 7 | Per-grant override (e.g., 600/min) takes effect without restart | Integration |
| 8 | Audit writes are async: request latency unchanged when the audit write is slow (simulated) | Integration |

**Estimated budget:** ~20K tokens
**Risk notes:** Ensure audit writes can't block or error-out the request path; use a bounded queue and drop-with-counter pattern rather than in-line writes.

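
The bounded queue with a drop counter mentioned in the risk note could look roughly like this. It is a sketch only: `BoundedAuditQueue` and its method names are hypothetical, the entry fields mirror the columns listed in the scope, and the background writer that drains the queue into `federation_audit_log` is left out.

```typescript
// Hypothetical drop-with-counter queue; field names follow the audit columns above.
interface AuditEntry {
  grant_id: string;
  verb: string;
  resource: string;
  query_hash: string;
  outcome: string;
  bytes_out: number;
  latency_ms: number;
}

class BoundedAuditQueue {
  private readonly pending: AuditEntry[] = [];
  private readonly capacity: number;
  private droppedCount = 0;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Called on the request path: never waits and never throws.
  enqueue(entry: AuditEntry): boolean {
    if (this.pending.length >= this.capacity) {
      this.droppedCount += 1; // surfaced as a metric so drops stay visible
      return false;
    }
    this.pending.push(entry);
    return true;
  }

  // Called by a background writer, which persists the batch outside the request path.
  drain(maxBatch: number): AuditEntry[] {
    return this.pending.splice(0, maxBatch);
  }

  get dropped(): number {
    return this.droppedCount;
  }
}
```
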

---

## M5 — Cache + Offline Degradation + Observability

**Goal:** Sessions feel fast and stay useful when the peer is slow or down.

**Scope:**

- In-memory response cache keyed by `(grant_id, verb, resource, query_hash)`, TTL 30s default
- Cache NOT used for `search`; only `list` and `get`
- Cache flushed on cert rotation and grant revocation
- Circuit breaker per peer: after N failures, fast-fail for cooldown window
- `_source` tagging extended with `_cached: true` when served from cache
- Agent-visible "federation offline for `<peer>`" signal emitted once per session per peer
- OTEL spans: `federation.request` with attrs `grant_id`, `peer`, `verb`, `resource`, `outcome`, `latency_ms`, `cached`
- W3C `traceparent` propagated across the mTLS boundary (both directions)
- `mosaic federation status` CLI subcommand

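
The per-peer circuit breaker described in the scope can be sketched as a small state holder. This is an illustrative TypeScript example, not the contents of `circuit-breaker.service.ts`: the class and method names are hypothetical, and the threshold of 5 failures and the 30-second cooldown used in `breakerFor` are placeholder values, since the document leaves N and the cooldown unspecified.

```typescript
// Hypothetical circuit breaker; threshold and cooldown values are placeholders.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // False means fail fast without touching the network.
  canRequest(nowMs: number): boolean {
    if (this.openedAtMs === null) return true;
    // After the cooldown a trial request is allowed through.
    return nowMs - this.openedAtMs >= this.cooldownMs;
  }

  // A success closes the circuit and resets the failure count.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAtMs = null;
  }

  // Reaching the threshold opens (or re-opens) the circuit.
  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) this.openedAtMs = nowMs;
  }
}

// One breaker per peer host.
const breakers = new Map<string, CircuitBreaker>();

function breakerFor(peerHost: string): CircuitBreaker {
  let breaker = breakers.get(peerHost);
  if (!breaker) {
    breaker = new CircuitBreaker(5, 30_000);
    breakers.set(peerHost, breaker);
  }
  return breaker;
}
```

In this sketch a failed trial request after the cooldown re-opens the circuit immediately, because the failure count is only reset by a success.
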
**Deliverables:**

- `apps/gateway/src/federation/client/response-cache.service.ts`
- `apps/gateway/src/federation/client/circuit-breaker.service.ts`
- `apps/gateway/src/federation/observability/` (span helpers)
- `packages/mosaic/src/commands/federation/status.ts`

**Acceptance tests:**

| # | Test | Layer |
| --- | --- | --- |
| 1 | Two identical `list` calls within 30s: second served from cache, flagged `_cached` | Integration |
| 2 | `search` is never cached: two identical searches both hit the peer | Integration |
| 3 | After grant revocation, peer's cache is flushed immediately | Integration |
| 4 | After N consecutive failures, circuit opens; subsequent requests fail-fast without network call | E2E |
| 5 | Circuit closes after cooldown and next success | E2E |
| 6 | With peer offline, session completes using local data, one "federation offline" signal surfaced | E2E |
| 7 | OTEL traces show spans on both gateways correlated by `traceparent` | E2E |
| 8 | `mosaic federation status` prints peer state, cert expiry, last success/failure, circuit state | Unit |

**Estimated budget:** ~20K tokens
**Risk notes:** Caching correctness under revocation must be provable — write tests that intentionally race revocation against cached hits.

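
A cache that can be flushed per grant is the piece those race tests would exercise. The following TypeScript sketch is an assumption-laden illustration, not the contents of `response-cache.service.ts`: `ResponseCache`, `CacheKey`, and `flushGrant` are hypothetical names, and the sketch does not by itself prove race freedom; it only shows the key structure, the TTL check, and a synchronous flush.

```typescript
// Hypothetical response cache keyed by (grant_id, verb, resource, query_hash).
interface CacheKey {
  grant_id: string;
  verb: "list" | "get"; // `search` is deliberately not representable
  resource: string;
  query_hash: string;
}

interface CacheEntry<T> {
  grantId: string;
  value: T;
  expiresAtMs: number;
}

class ResponseCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 30_000) {
    this.ttlMs = ttlMs;
  }

  private keyOf(k: CacheKey): string {
    return JSON.stringify([k.grant_id, k.verb, k.resource, k.query_hash]);
  }

  set(key: CacheKey, value: T, nowMs: number): void {
    this.entries.set(this.keyOf(key), {
      grantId: key.grant_id,
      value,
      expiresAtMs: nowMs + this.ttlMs,
    });
  }

  get(key: CacheKey, nowMs: number): T | undefined {
    const id = this.keyOf(key);
    const hit = this.entries.get(id);
    if (hit === undefined) return undefined;
    if (nowMs >= hit.expiresAtMs) {
      this.entries.delete(id); // expired entries are evicted lazily
      return undefined;
    }
    return hit.value;
  }

  // Called synchronously from the revocation and cert-rotation paths.
  flushGrant(grantId: string): number {
    let removed = 0;
    for (const [id, entry] of Array.from(this.entries)) {
      if (entry.grantId === grantId) {
        this.entries.delete(id);
        removed += 1;
      }
    }
    return removed;
  }
}
```
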

---

## M6 — Revocation, Auto-Renewal, CRL

**Goal:** Grant lifecycle works end-to-end: admin revoke, revoke-on-delete, automatic cert renewal, CRL distribution.

**Scope:**

- `mosaic federation grant revoke <id>` → status `revoked`, CRL updated, audit entry
- DB hook: deleting a user cascades `revoke-on-delete` on all grants where that user is subject
- Step-CA CRL endpoint exposed; serving gateway enforces CRL check on every handshake (cached CRL, refresh interval 60s)
- Client-side cert renewal job: at T-7 days, submit renewal CSR; rotate cert atomically; flush cache
- On renewal failure, peer marked `degraded` and admin-visible alert emitted
- Server A detects revocation on next request (TLS handshake fails with specific error) → peer marked `revoked`, user notified

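
The renewal rule in the scope (renew at T-7 days, mark the peer `degraded` on failure) can be sketched as below. This is a hedged TypeScript illustration, not the contents of `renewal.job.ts`: `needsRenewal`, `renewIfDue`, and the `Peer` shape are hypothetical, and `requestCert` is an injected placeholder for whatever call submits the renewal CSR to Step-CA.

```typescript
// Hypothetical renewal logic; `requestCert` stands in for the real Step-CA call.
const RENEW_BEFORE_MS = 7 * 24 * 60 * 60 * 1000; // the T-7 days window

type PeerState = "active" | "degraded" | "revoked";

interface IssuedCert {
  certPem: string;
  notAfter: Date;
}

interface Peer extends IssuedCert {
  state: PeerState;
}

function needsRenewal(notAfter: Date, now: Date): boolean {
  return notAfter.getTime() - now.getTime() <= RENEW_BEFORE_MS;
}

async function renewIfDue(
  peer: Peer,
  now: Date,
  requestCert: () => Promise<IssuedCert>,
): Promise<Peer> {
  if (peer.state === "revoked" || !needsRenewal(peer.notAfter, now)) return peer;
  try {
    const fresh = await requestCert();
    // Build a new record and let the caller swap the reference in one step,
    // so in-flight requests see either the old cert or the new one, never a mix.
    return { state: "active", certPem: fresh.certPem, notAfter: fresh.notAfter };
  } catch {
    // Keep the old cert (it is still time-valid) but flag the peer for the admin alert.
    return { ...peer, state: "degraded" };
  }
}
```

Returning a fresh `Peer` object instead of mutating the existing one is one way to approach the atomic rotation requirement; flushing the response cache after the swap would still be the caller's job.
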
**Deliverables:**

- `apps/gateway/src/federation/server/crl.service.ts` + endpoint
- `apps/gateway/src/federation/server/revocation.service.ts`
- DB cascade trigger or ORM hook for user deletion → grant revocation
- `apps/gateway/src/federation/client/renewal.job.ts` (scheduled)
- `packages/mosaic/src/commands/federation/grant.ts` gains `revoke` subcommand

**Acceptance tests:**

| # | Test | Layer |
| --- | --- | --- |
| 1 | Admin `grant revoke` → A's next request fails with TLS-level error | E2E |
| 2 | Deleting subject user on B auto-revokes all grants where that user was the subject | Integration |
| 3 | CRL endpoint serves correct list; revoked cert present | Integration |
| 4 | Server rejects cert listed in CRL even if cert itself is still time-valid | E2E |
| 5 | Cert at T-7 days triggers renewal job; new cert issued and installed without dropped requests | E2E |
| 6 | Renewal failure marks peer `degraded` and surfaces alert | Integration |
| 7 | A marks peer `revoked` after a revocation-caused handshake failure (not on transient network errors) | E2E |

**Estimated budget:** ~20K tokens
**Risk notes:** The atomic cert swap during renewal is the sharpest edge here — any in-flight request mid-swap must either complete on old or retry on new, never fail mid-call.

---

## M7 — Multi-User RBAC Hardening + Team-Scoped Grants + Acceptance Suite

**Goal:** The full multi-tenant scenario from the PRD §4 user stories works end-to-end, with no cross-user leakage under any circumstance.

**Scope:**

- Three-user scenario on Server B (E1, E2, E3) each with their own Server A
- Team-scoped grants exercised: each employee's team-data visible on their own A, but E1's personal data never visible on E2's A
- User-facing UI surfaces on both gateways for: peer list, grant list, audit log viewer, scope editor
- Negative-path test matrix (every denial path from PRD §8)
- All PRD §15 acceptance criteria mapped to automated tests in the harness
- Security review: cert-spoofing, scope-bypass, audit-bypass paths explicitly tested
- Cold-storage rollover for audit log >90 days
- Docs: operator runbook, onboarding guide, troubleshooting guide

**Deliverables:**

- Full federation acceptance suite in `tools/federation-harness/acceptance/`
- `apps/web` surfaces for peer/grant/audit management
- `docs/federation/RUNBOOK.md`, `docs/federation/ONBOARDING.md`, `docs/federation/TROUBLESHOOTING.md`
- Audit cold-tier job (daily cron, moves rows >90d to separate table or object storage)

**Acceptance tests:**
Every PRD §15 criterion must be automated and green. Additionally:

| # | Test | Layer |
| --- | --- | --- |
| 1 | 3-employee scenario: each A sees only its user's data from B | E2E |
| 2 | Grant with team scope returns team data; same grant denied access to another employee's personal data | E2E |
| 3 | Concurrent sessions from E1's and E2's Server A to B interleave without any leakage | E2E |
| 4 | Audit log across 3-user test shows per-grant trails with no mis-attributed rows | E2E |
| 5 | Scope editor UI round-trip: edit → save → next request uses new scope | E2E |
| 6 | Attempt to use a revoked grant's cert against a different grant's endpoint: rejected | E2E |
| 7 | 90-day-old audit rows moved to cold tier; queryable via explicit historical query | Integration |
| 8 | Runbook steps validated: an operator following the runbook can onboard, rotate, and revoke | Manual checklist |

**Estimated budget:** ~25K tokens
**Risk notes:** This is the security-critical milestone. Budget review time here is non-negotiable — plan for two independent code reviews (internal + security-focused) before merge.

---

## Total Budget & Timeline Sketch

| Milestone | Tokens (est.) | Can parallelize? |
| --- | --- | --- |
| M1 | 20K | No (foundation) |
| M2 | 30K | No (needs M1) |
| M3 | 40K | No (needs M2) |
| M4 | 20K | No (needs M3) |
| M5 | 20K | Yes (with M6 after M4) |
| M6 | 20K | Yes (with M5 after M3) |
| M7 | 25K | No (needs all) |
| **Total** | **~175K** | |

Running M5 and M6 in parallel saves one milestone's worth of serial time.

---

## Exit Criteria (federation feature complete)

All of the following must be green on `main`:

- Every PRD §15 acceptance criterion automated and passing
- Every milestone's acceptance table green
- Security review sign-off on M7
- Runbook walk-through completed by operator (not author)
- `mosaic doctor` recognizes federated tier and reports peer health accurately
- Two-gateway production deployment (woltje.com ↔ uscllc.com) operational for ≥7 days without incident

---

## Next Step After This Doc Is Approved

1. File tracking issues on `git.mosaicstack.dev/mosaicstack/stack` — one per milestone, labeled `epic:federation`
2. Populate `docs/TASKS.md` with M1's task breakdown (per-task agent assignment, budget, dependencies)
3. Begin M1 implementation

85
docs/federation/MISSION-MANIFEST.md
Normal file
@@ -0,0 +1,85 @@
# Mission Manifest — Federation v1

> Persistent document tracking full mission scope, status, and session history.
> Updated by the orchestrator at each phase transition and milestone completion.

## Mission

**ID:** federation-v1-20260419
**Statement:** Jarvis operates across 3–4 workstations in two physical locations (home, USC). The user currently reaches back to a single jarvis-brain checkout from every session; a prior OpenBrain attempt caused cache, latency, and opacity pain. This mission builds asymmetric federation between Mosaic Stack gateways so that a session on a user's home gateway can query their work gateway in real time without data ever persisting across the boundary, with full multi-tenant isolation and standard-PKI (X.509 / Step-CA) trust management.
**Phase:** Planning complete — M1 implementation not started
**Current Milestone:** FED-M1
**Progress:** 0 / 7 milestones
**Status:** active
**Last Updated:** 2026-04-19 (PRD + MILESTONES + tracking issues filed)
**Parent Mission:** None — new mission

## Context

Federation is the solution to what originally drove OpenBrain. The prior attempt coupled every agent session to a remote service, introduced cache/latency/opacity pain, and created a hard dependency that punished offline use. This redesign:

1. Makes federation **gateway-to-gateway**, not agent-to-service
2. Keeps each user's home instance as source of truth for their data
3. Exposes scoped, read-only data on demand without persisting across the boundary
4. Uses X.509 mTLS via Step-CA so rotation/revocation/CRL/OCSP are standard
5. Supports multi-tenant serving sides (employees on uscllc.com each federating back to their own home gateway) with no cross-user leakage
6. Requires federated-tier instances on both sides (PG + pgvector + Valkey) — local/standalone tiers cannot federate
7. Works over public HTTPS (no VPN required); Tailscale is an optional overlay

Key design references:

- `docs/federation/PRD.md` — 16-section product requirements
- `docs/federation/MILESTONES.md` — 7-milestone decomposition with per-milestone acceptance tests
- `docs/federation/TASKS.md` — per-task breakdown (M1 populated; M2-M7 deferred to mission planning)
- `docs/research/mempalace-evaluation/` (in jarvis-brain) — why we didn't adopt MemPalace

## Success Criteria

- [ ] AC-1: Two Mosaic Stack gateways on different hosts can establish a federation grant via CLI-driven onboarding
- [ ] AC-2: Server A can query Server B for `tasks`, `notes`, `memory` respecting scope filters
- [ ] AC-3: User on B with no grant cannot be queried by A, even if A has a valid grant for another user (cross-user isolation)
- [ ] AC-4: Revoking a grant on B causes A's next request to fail with a clear error within one request cycle
- [ ] AC-5: Cert rotation happens automatically at T-7 days; in-progress session survives rotation without user action
- [ ] AC-6: Rate-limit enforcement returns 429 with `Retry-After`; client backs off
- [ ] AC-7: With B unreachable, a session on A completes using local data and surfaces "federation offline for `<peer>`" once per session
- [ ] AC-8: Every federated request appears in B's `federation_audit_log` within 1 second
- [ ] AC-9: Scope excluding `credentials` means credentials are never returned — even via `search` with matching keywords
- [ ] AC-10: `mosaic federation status` shows cert expiry, grant status, last success/failure per peer
- [ ] AC-11: Full 3-employee multi-tenant scenario passes with no cross-user leakage
- [ ] AC-12: Two-gateway production deployment (woltje.com ↔ uscllc.com) operational ≥7 days without incident
- [ ] AC-13: All 7 milestones ship as merged PRs with green CI and closed issues

## Milestones

| # | ID | Name | Status | Branch | Issue | Started | Completed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | FED-M1 | Federated tier infrastructure | not-started | — | #460 | — | — |
| 2 | FED-M2 | Step-CA + grant schema + admin CLI | not-started | — | #461 | — | — |
| 3 | FED-M3 | mTLS handshake + list/get + scope enforcement | not-started | — | #462 | — | — |
| 4 | FED-M4 | search verb + audit log + rate limit | not-started | — | #463 | — | — |
| 5 | FED-M5 | Cache + offline degradation + OTEL | not-started | — | #464 | — | — |
| 6 | FED-M6 | Revocation + auto-renewal + CRL | not-started | — | #465 | — | — |
| 7 | FED-M7 | Multi-user RBAC hardening + acceptance suite | not-started | — | #466 | — | — |

## Budget

| Milestone | Est. tokens | Parallelizable? |
| --- | --- | --- |
| FED-M1 | 20K | No (foundation) |
| FED-M2 | 30K | No (needs M1) |
| FED-M3 | 40K | No (needs M2) |
| FED-M4 | 20K | No (needs M3) |
| FED-M5 | 20K | Yes (with M6 after M4) |
| FED-M6 | 20K | Yes (with M5 after M3) |
| FED-M7 | 25K | No (needs all) |
| **Total** | **~175K** | |

## Session History

| Session | Date | Runtime | Outcome |
| --- | --- | --- | --- |
| S1 | 2026-04-19 | claude | PRD authored, MILESTONES decomposed, 7 issues filed |

## Next Step

Begin FED-M1 implementation: federated tier infrastructure. Breakdown in `docs/federation/TASKS.md`.
330
docs/federation/PRD.md
Normal file
@@ -0,0 +1,330 @@
# Mosaic Stack — Federation PRD

**Status:** Draft v1 (locked for implementation)
**Owner:** Jason
**Date:** 2026-04-19
**Scope:** Enables cross-instance data federation between Mosaic Stack gateways with asymmetric trust, multi-tenant scoping, and no cross-boundary data persistence.

---

## 1. Problem Statement

Jarvis operates across 3–4 workstations in two physical locations (home, USC). The user currently reaches back to a single jarvis-brain checkout from every session, and has tried OpenBrain to solve cross-session state — with poor results (cache invalidation, latency, opacity, hard dependency on a remote service).

The goal is a federation model where each user's **home instance** remains the source of truth for their personal data, and **work/shared instances** expose scoped data to that user's home instance on demand — without persisting anything across the boundary.

## 2. Goals

1. A user logged into their **home gateway** (Server A) can query their **work gateway** (Server B) in real time during a session.
2. Data returned from Server B is used in-session only; never written to Server A storage.
3. Server B has multiple users, each with their own Server A. No user's data leaks to another user.
4. Federation works over public HTTPS (no VPN required). Tailscale is a supported optional overlay.
5. Sync latency target: within seconds, or no later than the agent's next data need.
6. Graceful degradation: if the remote instance is unreachable, the local session continues with local data and a clear "federation offline" signal.
7. Teams exist on both sides. A federation grant can share **team-owned** data without exposing other team members' personal data.
8. Auth and revocation use standard PKI (X.509) so that certificate tooling (Step-CA, rotation, OCSP, CRL) is available out of the box.

## 3. Non-Goals (v1)
|
||||||
|
|
||||||
|
- Mesh federation (N-to-N). v1 is strictly A↔B pairs.
|
||||||
|
- Cross-instance writes. All federation is **read-only** on the remote side.
|
||||||
|
- Shared agent sessions across instances. Sessions live on one instance; federation is data-plane only.
|
||||||
|
- Cross-instance SSO. Each instance owns its own BetterAuth identity store; federation is service-to-service, not user-to-user.
|
||||||
|
- Realtime push from B→A. v1 is pull-only (A pulls from B during a session).
|
||||||
|
- Global search index. Federation is query-by-query, not index replication.

## 4. User Stories

- **US-1 (Solo user at home):** As the sole user on Server A, I want my agent session on workstation-1 to see the same data it saw on workstation-2, without running OpenBrain.
- **US-2 (Cross-location):** As a user with a home server and a work server, I want a session on my home laptop to transparently pull my USC-owned tasks/notes when I ask for them.
- **US-3 (Work admin):** As the admin of mosaic.uscllc.com, I want to grant each employee's home gateway scoped read access to only their own data plus explicitly-shared team data.
- **US-4 (Privacy boundary):** As employee A on mosaic.uscllc.com, my data must never appear in a session on employee B's home gateway — even if both are federated with uscllc.com.
- **US-5 (Revocation):** As a work admin, when I delete an employee, their home gateway loses access within one request cycle.
- **US-6 (Offline):** As a user in a hotel with flaky wifi, my local session keeps working; federation calls fail fast and are reported as "offline," not hung.

## 5. Architecture Overview

```
┌─────────────────────────────────────┐       mTLS / X.509       ┌─────────────────────────────────────┐
│  Server A — mosaic.woltje.com       │ ───────────────────────► │  Server B — mosaic.uscllc.com       │
│  (home, master for Jason)           │ ◄── JSON over HTTPS      │  (work, multi-tenant)               │
│                                     │                          │                                     │
│  ┌──────────────┐  ┌──────────────┐ │                          │  ┌──────────────┐  ┌──────────────┐ │
│  │   Gateway    │  │  Postgres    │ │                          │  │   Gateway    │  │  Postgres    │ │
│  │  (NestJS)    │──│  (local SSOT)│ │                          │  │  (NestJS)    │──│ (tenant SSOT)│ │
│  └──────┬───────┘  └──────────────┘ │                          │  └──────┬───────┘  └──────────────┘ │
│         │                           │                          │         │                           │
│         │  FederationClient         │                          │         │  FederationServer         │
│         │  (outbound, scoped query) │                          │         │  (inbound, RBAC-gated)    │
│         └───────────────────────────┼──────────────────────────┼─────────┘                           │
│                                     │                          │                                     │
│  Step-CA (issues A's client cert)   │                          │  Step-CA (issues B's server cert,   │
│                                     │                          │         trusts A's CA root on grant)│
└─────────────────────────────────────┘                          └─────────────────────────────────────┘
```

- Federation is a **transport-layer** concern between two gateways, implemented as a new internal module on each gateway.
- Both sides run the same code. Direction (client vs. server role) is per-request.
- Nothing in the agent runtime changes — agents query the gateway; the gateway decides local vs. remote.

## 6. Transport & Authentication

**Transport:** HTTPS with mutual TLS (mTLS).

**Identity:** X.509 client certificates issued by Step-CA. Each federation grant materializes as a client cert on the requesting side and a trust-anchor entry (CA root or explicit cert) on the serving side.

**Why mTLS over HMAC bearer tokens:**

- Standard rotation/revocation semantics (renew, CRL, OCSP).
- The cert subject carries identity claims (user, grant_id) that don't need a separate DB lookup to verify authenticity.
- Client certs never transit request bodies, so they can't be logged by accident.
- Transport is pinned at the TLS layer, not re-validated per-handler.

**Cert contents (SAN + subject):**

- `CN=grant-<uuid>`
- `O=<requesting-server-hostname>` (e.g., `mosaic.woltje.com`)
- Custom OIDs embedded in SAN otherName:
  - `mosaic.federation.grantId` (UUID)
  - `mosaic.federation.subjectUserId` (user on the **serving** side that this grant acts-as)
- Default lifetime: **30 days**, with auto-renewal at T-7 days if the grant is still active.
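
The lifetime and renewal window above reduce to simple date arithmetic. The following is a minimal sketch, assuming a hypothetical `shouldRenew` helper; the name, signature, and constants are illustrative and not part of the PRD.

```typescript
// Hypothetical helper: is a grant cert inside its renewal window?
const DAY_MS = 24 * 60 * 60 * 1000;
const RENEW_BEFORE_DAYS = 7; // the "T-7 days" window from this section

function shouldRenew(certExpiresAt: Date, grantActive: boolean, now: Date): boolean {
  // Renewal only applies while the grant is still active.
  if (!grantActive) return false;
  const renewFrom = certExpiresAt.getTime() - RENEW_BEFORE_DAYS * DAY_MS;
  return now.getTime() >= renewFrom;
}
```

A scheduler on the requesting side could call this against `cert_expires_at` from `federation_peers` and trigger the Step-CA renewal flow when it returns `true`.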

**Step-CA topology (v1):** Each server runs its own Step-CA instance. During onboarding, the serving side imports the requesting side's CA root. A central/shared Step-CA is out of scope for v1.

**Handshake:**

1. Client (A) opens HTTPS to B with its grant cert.
2. B validates cert chain against trusted CA roots for that grant.
3. B extracts `grantId` and `subjectUserId` from the cert.
4. B loads the grant record, checks it is `active`, not revoked, and not expired.
5. B enforces scope and rate-limit for this grant.
6. Request proceeds; response returned.
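
Steps 3 and 4 of the handshake can be sketched as a pure function on the serving side. This is an illustration under stated assumptions: the type names, the `checkGrant` function, the denial reasons, and the use of the cert's `notAfter` as the expiry check are all hypothetical, and chain validation (step 2) is assumed to have already happened at the TLS layer.

```typescript
// Sketch of handshake steps 3-4 on Server B. All names are hypothetical.
type GrantStatus = "pending" | "active" | "suspended" | "revoked";

interface CertClaims {
  grantId: string;       // from the mosaic.federation.grantId OID
  subjectUserId: string; // from the mosaic.federation.subjectUserId OID
  notAfter: Date;        // cert expiry (assumed source of the "not expired" check)
}

interface GrantRecord {
  id: string;
  subjectUserId: string;
  status: GrantStatus;
}

type HandshakeResult =
  | { outcome: "ok"; grant: GrantRecord }
  | { outcome: "denied"; reason: "unknown_grant" | "subject_mismatch" | "not_active" | "expired" };

function checkGrant(
  claims: CertClaims,
  grants: { [grantId: string]: GrantRecord },
  now: Date,
): HandshakeResult {
  // Step 4: load the grant record named by the cert.
  const known = Object.prototype.hasOwnProperty.call(grants, claims.grantId);
  if (!known) return { outcome: "denied", reason: "unknown_grant" };
  const grant = grants[claims.grantId];

  // Assumption: the cert's subject must match the subject stored on the grant.
  if (grant.subjectUserId !== claims.subjectUserId) {
    return { outcome: "denied", reason: "subject_mismatch" };
  }
  // pending, suspended, and revoked grants are all rejected.
  if (grant.status !== "active") return { outcome: "denied", reason: "not_active" };
  if (claims.notAfter.getTime() <= now.getTime()) return { outcome: "denied", reason: "expired" };

  // Scope and rate-limit enforcement (step 5) would run after this point.
  return { outcome: "ok", grant };
}
```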

## 7. Data Model

All tables live on **each instance's own Postgres**. Federation grants are bilateral — each side has a record of the grant.

### 7.1 `federation_grants` (on serving side, Server B)

| Field | Type | Notes |
| --- | --- | --- |
| `id` | uuid PK | |
| `subject_user_id` | uuid FK | Which local user this grant acts-as |
| `requesting_server` | text | Hostname of requesting gateway (e.g., woltje.com) |
| `requesting_ca_fingerprint` | text | SHA-256 of trusted CA root |
| `active_cert_fingerprint` | text | SHA-256 of currently valid client cert |
| `scope` | jsonb | See §8 |
| `rate_limit_rpm` | int | Default 60 |
| `status` | enum | `pending`, `active`, `suspended`, `revoked` |
| `created_at` | timestamptz | |
| `activated_at` | timestamptz | |
| `revoked_at` | timestamptz | |
| `last_used_at` | timestamptz | |
| `notes` | text | Admin-visible description |

### 7.2 `federation_peers` (on requesting side, Server A)

| Field | Type | Notes |
| --- | --- | --- |
| `id` | uuid PK | |
| `peer_hostname` | text | e.g., `mosaic.uscllc.com` |
| `peer_ca_fingerprint` | text | SHA-256 of peer's CA root |
| `grant_id` | uuid | The grant ID assigned by the peer |
| `local_user_id` | uuid FK | Who on Server A this federation belongs to |
| `client_cert_pem` | text (enc) | Current client cert (PEM); rotated automatically |
| `client_key_pem` | text (enc) | Private key (encrypted at rest) |
| `cert_expires_at` | timestamptz | |
| `status` | enum | `pending`, `active`, `degraded`, `revoked` |
| `last_success_at` | timestamptz | |
| `last_failure_at` | timestamptz | |
| `notes` | text | |

### 7.3 `federation_audit_log` (on serving side, Server B)

| Field | Type | Notes |
| --- | --- | --- |
| `id` | uuid PK | |
| `grant_id` | uuid FK | |
| `occurred_at` | timestamptz | indexed |
| `verb` | text | `query`, `handshake`, `rejected`, `rate_limited` |
| `resource` | text | e.g., `tasks`, `notes`, `credentials` |
| `query_hash` | text | SHA-256 of normalized query (no payload stored) |
| `outcome` | text | `ok`, `denied`, `error` |
| `bytes_out` | int | |
| `latency_ms` | int | |

**Audit policy:** Every federation request is logged on the serving side. Read-only requests only — no body capture. Retention: 90 days hot, then roll to cold storage.

## 8. RBAC & Scope

Every federation grant has a scope object that answers three questions for every inbound request:

1. **Who is acting?** — `subject_user_id` from the cert.
2. **What resources?** — an allowlist of resource types (`tasks`, `notes`, `credentials`, `memory`, `teams/:id/tasks`, …).
3. **Filter expression** — predicates applied on top of the subject's normal RBAC (see below).

### 8.1 Scope schema

```json
{
  "resources": ["tasks", "notes", "memory"],
  "filters": {
    "tasks": { "include_teams": ["team_uuid_1", "team_uuid_2"], "include_personal": true },
    "notes": { "include_personal": true, "include_teams": [] },
    "memory": { "include_personal": true }
  },
  "excluded_resources": ["credentials", "api_keys"],
  "max_rows_per_query": 500
}
```

### 8.2 Access rule (enforced on serving side)

For every inbound federated query on resource R:

1. Resolve effective identity → `subject_user_id`.
2. Check R is in `scope.resources` and NOT in `scope.excluded_resources`. Otherwise 403.
3. Evaluate the user's **normal RBAC** (what would they see if they logged into Server B directly?).
4. Intersect with the scope filter (e.g., only team X, only personal).
5. Apply `max_rows_per_query`.
6. Return; log to audit.
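
Steps 2, 4, and 5 of the access rule can be sketched as a pure function over the rows the subject's normal RBAC already allows. This is a sketch under assumptions, not the gateway's implementation: the `Row` shape (a `teamId` of `null` marking a personal item), the `applyScope` name, and the deny-by-default behavior when a resource has no filter entry are all hypothetical.

```typescript
// Sketch of access-rule steps 2, 4, 5. Types and names are hypothetical.
interface ResourceFilter {
  include_personal?: boolean;
  include_teams?: string[];
}

interface Scope {
  resources: string[];
  excluded_resources: string[];
  filters: { [resource: string]: ResourceFilter };
  max_rows_per_query: number;
}

interface Row {
  id: string;
  ownerUserId: string;
  teamId: string | null; // assumption: null means a personal (non-team) item
}

type ScopeResult = { status: 200; rows: Row[] } | { status: 403 };

// `rbacVisible` is the result of step 3: what the subject would see on Server B directly.
function applyScope(scope: Scope, resource: string, subjectUserId: string, rbacVisible: Row[]): ScopeResult {
  // Step 2: allowlist check, with the exclusion list taking precedence.
  const allowed =
    scope.resources.indexOf(resource) !== -1 &&
    scope.excluded_resources.indexOf(resource) === -1;
  if (!allowed) return { status: 403 };

  // Step 4: intersect with the scope filter. Missing filter entry = nothing passes (assumption).
  const filter: ResourceFilter = scope.filters[resource] || {};
  const teams = filter.include_teams || [];
  const rows = rbacVisible.filter((row) => {
    if (row.teamId === null) {
      return filter.include_personal === true && row.ownerUserId === subjectUserId;
    }
    return teams.indexOf(row.teamId) !== -1; // "only these teams"
  });

  // Step 5: cap the result size.
  return { status: 200, rows: rows.slice(0, scope.max_rows_per_query) };
}
```

Because the function only ever filters `rbacVisible`, it cannot return a row the subject's own RBAC would have hidden.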

### 8.3 Team boundary guarantees

- Scope filters are applied on top of the native RBAC and can only narrow it, never widen it. A grant cannot grant access the user would not have had themselves.
- `include_teams` means "only these teams," not "these teams in addition to all teams."
- `include_personal: false` hides the user's personal data entirely from federation, even if they own it — useful for work-only accounts.

### 8.4 No cross-user leakage

When Server B has multiple users (employees) all federating back to their own Server A:

- Each employee has their own grant with their own `subject_user_id`.
- The cert is bound to a specific grant; there is no mechanism by which one grant's cert can be used to impersonate another.
- Audit log is per-grant.

## 9. Query Model

Federation exposes a **narrow read API**, not arbitrary SQL.

### 9.1 Supported verbs (v1)

| Verb | Purpose | Returns |
| --- | --- | --- |
| `list` | Paginated list of a resource type | Array of resources |
| `get` | Fetch a single resource by id | One resource or 404 |
| `search` | Keyword search within allowed resources | Ranked list of hits |
| `capabilities` | What this grant is allowed to do right now | Scope object + rate-limit state |

### 9.2 Not in v1

- Write verbs.
- Aggregations / analytics.
- Streaming / subscriptions (future: see §13).

### 9.3 Agent-facing integration

Agents never call federation directly. Instead:

- The gateway query layer accepts `source: "local" | "federated:<peer_hostname>" | "all"`.
- `"all"` fans out in parallel, merges results, tags each with `_source`.
- Federation results are in-memory only; the gateway does not persist them.
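
The merge step for `source: "all"` can be sketched as follows. This covers only the merging and tagging of per-source outcomes; the parallel calls themselves are left out. The `SourceOutcome` and `Merged` types, the `mergeTagged` name, and the `offline` list used to surface failed peers are assumptions made for illustration.

```typescript
// Sketch of the merge step for source: "all". Names and shapes are hypothetical.
type SourceOutcome<T> =
  | { source: string; outcome: "ok"; items: T[] }
  | { source: string; outcome: "error"; error: string };

interface Merged<T> {
  items: Array<T & { _source: string }>;
  offline: string[]; // sources that failed, for a "federation offline" signal
}

function mergeTagged<T extends object>(outcomes: Array<SourceOutcome<T>>): Merged<T> {
  const merged: Merged<T> = { items: [], offline: [] };
  for (const result of outcomes) {
    if (result.outcome === "ok") {
      for (const item of result.items) {
        // Tag each item with where it came from; nothing is persisted.
        merged.items.push({ ...item, _source: result.source });
      }
    } else {
      // A failed peer does not fail the whole query; local results still flow.
      merged.offline.push(result.source);
    }
  }
  return merged;
}
```

Keeping the merge pure means a failed peer degrades the result set rather than the session, in line with the graceful-degradation goal.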

## 10. Caching

- **In-memory response cache** with short TTL (default 30s) for `list` and `get`. `search` is not cached.
- Cache is keyed by `(grant_id, verb, resource, query_hash)`.
- Cache is flushed on cert rotation and on grant revocation.
- No disk cache. No cross-session cache.
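
The cache rules above can be sketched as a small in-memory class. The class name, method names, and the injectable clock are hypothetical; treating `capabilities` as uncached is also an assumption, since the section only states that `list` and `get` are cached and `search` is not.

```typescript
// Sketch of the in-memory response cache. Class and method names are hypothetical.
interface CacheKey {
  grantId: string;
  verb: "list" | "get" | "search" | "capabilities";
  resource: string;
  queryHash: string;
}

class FederationResponseCache {
  private entries: { [key: string]: { value: unknown; expiresAt: number } } = {};

  constructor(
    private ttlMs: number = 30000, // default 30s TTL
    private now: () => number = () => Date.now(),
  ) {}

  private static keyOf(k: CacheKey): string {
    return [k.grantId, k.verb, k.resource, k.queryHash].join("|");
  }

  get(k: CacheKey): unknown {
    const hit = this.entries[FederationResponseCache.keyOf(k)];
    if (!hit || hit.expiresAt <= this.now()) return undefined;
    return hit.value;
  }

  set(k: CacheKey, value: unknown): void {
    // Only list/get responses are cached; other verbs are ignored.
    if (k.verb !== "list" && k.verb !== "get") return;
    this.entries[FederationResponseCache.keyOf(k)] = { value, expiresAt: this.now() + this.ttlMs };
  }

  // Intended to be called on cert rotation and on grant revocation.
  flushGrant(grantId: string): void {
    for (const key of Object.keys(this.entries)) {
      if (key.indexOf(grantId + "|") === 0) delete this.entries[key];
    }
  }
}
```

Because the key begins with the grant ID, one grant's cached responses can never be served to another grant.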

## 11. Bootstrap & Onboarding

### 11.1 Instance capability tiers

| Tier | Storage | Queue | Memory | Can federate? |
| --- | --- | --- | --- | --- |
| `local` | PGlite | in-proc | keyword | No |
| `standalone` | Postgres | Valkey | keyword | No (can be client) |
| `federated` | Postgres | Valkey | pgvector | Yes (server + client) |

Federation requires `federated` tier on **both** sides.
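
The tier table can be expressed as data with a small guard. This is a sketch only: the `TIER_SERVICES` constant and `canEstablishFederation` helper are hypothetical names, and the guard encodes the rule that both sides must run the `federated` tier.

```typescript
// Tier table as data. Constant and helper names are hypothetical.
type Tier = "local" | "standalone" | "federated";

const TIER_SERVICES: { [T in Tier]: { storage: string; queue: string; memory: string } } = {
  local: { storage: "PGlite", queue: "in-proc", memory: "keyword" },
  standalone: { storage: "Postgres", queue: "Valkey", memory: "keyword" },
  federated: { storage: "Postgres", queue: "Valkey", memory: "pgvector" },
};

// Federation requires the `federated` tier on both sides.
function canEstablishFederation(requesting: Tier, serving: Tier): boolean {
  return requesting === "federated" && serving === "federated";
}
```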

### 11.2 Onboarding flow (admin-driven)

1. Admin on Server B runs `mosaic federation grant create --user <user-id> --peer <peer-hostname> --scope-file scope.json`.
2. Server B generates a `grant_id`, prints a one-time enrollment URL containing the grant ID + B's CA root fingerprint.
3. Admin on Server A (or the user themselves, if allowed) runs `mosaic federation peer add <enrollment-url>`.
4. Server A's Step-CA generates a CSR for the new grant. A submits the CSR to B over a short-lived enrollment endpoint (single-use token in the enrollment URL).
5. B's Step-CA signs the cert (with grant ID embedded in SAN OIDs), returns it.
6. A stores the signed cert + private key (encrypted) in `federation_peers`.
7. Grant status flips from `pending` to `active` on both sides.
8. Cert auto-renews at T-7 days using the standard Step-CA renewal flow as long as the grant remains active.

### 11.3 Revocation

- **Admin-initiated:** `mosaic federation grant revoke <grant-id>` on B flips status to `revoked`, adds the cert to B's CRL, and writes an audit entry.
- **Revoke-on-delete:** Deleting a user on B automatically revokes all grants where that user is the subject.
- Server A learns of revocation on the next request (TLS handshake fails) and flips the peer to `revoked`.
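
Revoke-on-delete can be sketched as a pure function over the grant records on Server B. The `Grant` shape mirrors a subset of `federation_grants`; the `revokeGrantsForUser` name and the returned `revokedIds` list (which a caller could feed to the CRL and the audit log) are assumptions for illustration.

```typescript
// Sketch of revoke-on-delete on Server B. Names are hypothetical.
interface Grant {
  id: string;
  subjectUserId: string;
  status: "pending" | "active" | "suspended" | "revoked";
  revokedAt: Date | null;
}

function revokeGrantsForUser(
  grants: Grant[],
  deletedUserId: string,
  now: Date,
): { grants: Grant[]; revokedIds: string[] } {
  const revokedIds: string[] = [];
  const next: Grant[] = grants.map((g) => {
    // Leave other users' grants and already-revoked grants untouched.
    if (g.subjectUserId !== deletedUserId || g.status === "revoked") return g;
    revokedIds.push(g.id);
    return { ...g, status: "revoked" as const, revokedAt: now };
  });
  // revokedIds tells the caller which certs to add to the CRL and audit.
  return { grants: next, revokedIds };
}
```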

### 11.4 Rate limit

Default `60 req/min` per grant. Configurable per grant. Enforced at the serving side. A rate-limited request returns `429` with `Retry-After`.
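
A per-grant limiter could look like the sketch below. The PRD does not name an algorithm, so the fixed one-minute window used here is an assumption, as are the `GrantRateLimiter` class and `RateDecision` shape; a sliding window or token bucket would satisfy the same requirement.

```typescript
// Sketch: fixed-window limiter per grant (algorithm choice is an assumption).
interface RateDecision {
  allowed: boolean;
  retryAfterSeconds: number; // value for the Retry-After header when denied
}

const WINDOW_MS = 60000;

class GrantRateLimiter {
  private windows: { [grantId: string]: { start: number; count: number } } = {};

  constructor(private now: () => number = () => Date.now()) {}

  check(grantId: string, limitRpm: number = 60): RateDecision {
    const t = this.now();
    let w = this.windows[grantId];
    // Start a new window on first use or once the previous one has elapsed.
    if (!w || t - w.start >= WINDOW_MS) {
      w = { start: t, count: 0 };
      this.windows[grantId] = w;
    }
    if (w.count >= limitRpm) {
      // Caller responds with 429 and Retry-After set to this value.
      const retryAfterSeconds = Math.ceil((w.start + WINDOW_MS - t) / 1000);
      return { allowed: false, retryAfterSeconds };
    }
    w.count += 1;
    return { allowed: true, retryAfterSeconds: 0 };
  }
}
```

The `limitRpm` argument would come from `rate_limit_rpm` on the grant record, so per-grant overrides need no extra state.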

## 12. Operational Concerns

- **Observability:** Each federation request emits an OTEL span with `grant_id`, `peer`, `verb`, `resource`, `outcome`, `latency_ms`. Traces correlate across both servers via W3C traceparent.
- **Health check:** `mosaic federation status` on each side shows active grants, last-success times, cert expirations, and any CRL mismatches.
- **Backpressure:** If the serving side is overloaded, it returns `503` with a structured body; the client marks the peer `degraded` and falls back to local-only until the next successful handshake.
- **Secrets:** `client_key_pem` in `federation_peers` is encrypted with the gateway's key (sealed with the instance's master key — same mechanism as `provider_credentials`).
- **Credentials never cross:** The `credentials` resource type is in the default excluded list. It must be explicitly added to scope (admin action, logged) and even then is per-grant and per-user.
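
On the requesting side, the backpressure and revocation signals described in this document map onto the `status` column of `federation_peers`. A minimal sketch of that mapping follows; the event names, the `nextPeerStatus` function, and the choice to treat `revoked` as terminal are assumptions rather than documented behavior.

```typescript
// Sketch of peer status transitions on Server A. Event names are hypothetical.
type PeerStatus = "pending" | "active" | "degraded" | "revoked";
type PeerEvent = "success" | "overloaded_503" | "tls_rejected";

function nextPeerStatus(current: PeerStatus, event: PeerEvent): PeerStatus {
  // Assumption: once revoked, a peer stays revoked until re-onboarded.
  if (current === "revoked") return "revoked";
  if (event === "tls_rejected") return "revoked";    // revocation surfaces as a failed handshake
  if (event === "overloaded_503") return "degraded"; // fall back to local-only
  return "active";                                   // a successful handshake clears degraded
}
```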

## 13. Future (post-v1)

- B→A push (e.g., "notify A when a task assigned to subject changes") via Socket.IO over mTLS.
- Mesh (N-to-N) federation.
- Write verbs with conflict resolution.
- Shared Step-CA (a "root of roots") so that onboarding doesn't require exchanging CA roots.
- Federated memory search over vector indexes with homomorphic filtering.

## 14. Locked Decisions (was "Open Questions")

| # | Question | Decision |
| --- | --- | --- |
| 1 | What happens to a grant when its subject user is deleted? | **Revoke-on-delete.** All grants where the user is subject are auto-revoked and CRL'd. |
| 2 | Do we audit read-only requests? | **Yes.** All federated reads are audited on the serving side. Bodies are not captured; query hash + metadata only. |
| 3 | Default rate limit? | **60 requests per minute per grant,** override-able per grant. |
| 4 | How do we verify the requesting-server's identity beyond the grant token? | **X.509 client cert tied to the user,** issued by Step-CA (per-server) or locally generated. Cert subject carries `grantId` + `subjectUserId`. |

### M1 decisions

- **Postgres deployment:** **Containerized** alongside the gateway in M1 (Docker Compose profile). Moving to a dedicated host is a M5+ operational concern, not a v1 feature.
- **Instance signing key:** **Separate** from the Step-CA key. Step-CA signs federation certs; the instance master key seals at-rest secrets (client keys, provider credentials). Different blast-radius, different rotation cadences.

## 15. Acceptance Criteria

- [ ] Two Mosaic Stack gateways on different hosts can establish a federation grant via the CLI-driven onboarding flow.
- [ ] Server A can query Server B for `tasks`, `notes`, `memory` respecting scope filters.
- [ ] A user on B with no grant cannot be queried by A, even if A has a valid grant for another user.
- [ ] Revoking a grant on B causes A's next request to fail with a clear error within one request cycle.
- [ ] Cert rotation happens automatically at T-7 days; an in-progress session survives rotation without user action.
- [ ] Rate-limit enforcement returns 429 with `Retry-After`; client backs off.
- [ ] With B unreachable, a session on A completes using local data and surfaces a "federation offline for `<peer>`" signal once.
- [ ] Every federated request appears in B's `federation_audit_log` within 1 second.
- [ ] A scope excluding `credentials` means credentials are not returnable even via `search` with matching keywords.
- [ ] `mosaic federation status` shows cert expiry, grant status, and last success/failure per peer.

## 16. Implementation Milestones (reference)

Milestones live in `docs/federation/MILESTONES.md` (to be authored next). High-level:

- **M1:** Server A runs `federated` tier standalone (Postgres + Valkey + pgvector, containerized). No peer yet.
- **M2:** Step-CA embedded; `federation_grants` / `federation_peers` schema + admin CLI.
- **M3:** Handshake + `list`/`get` verbs with scope enforcement.
- **M4:** `search` verb, audit log, rate limits.
- **M5:** Cache layer, offline-degradation UX, observability surfaces.
- **M6:** Revocation flows (admin + revoke-on-delete), cert auto-renewal.
- **M7:** Multi-user RBAC hardening on B, team-scoped grants end-to-end, acceptance suite green.

---

**Next step after PRD sign-off:** author `docs/federation/MILESTONES.md` with per-milestone acceptance tests and estimated token budget, then file tracking issues on `git.mosaicstack.dev/mosaicstack/stack`.

76
docs/federation/TASKS.md
Normal file
@@ -0,0 +1,76 @@

# Tasks — Federation v1

> Single-writer: orchestrator only. Workers read but never modify.
>
> **Mission:** federation-v1-20260419
> **Schema:** `| id | status | description | issue | agent | branch | depends_on | estimate | notes |`
> **Status values:** `not-started` | `in-progress` | `done` | `blocked` | `failed` | `needs-qa`
> **Agent values:** `codex` | `glm-5.1` | `haiku` | `sonnet` | `opus` | `—` (auto)
>
> **Scope of this file:** M1 is fully decomposed below. M2–M7 are placeholders pending each milestone's entry into active planning — the orchestrator expands them one milestone at a time to avoid speculative decomposition of work whose shape will depend on what M1 surfaces.

---

## Milestone 1 — Federated tier infrastructure (FED-M1)

Goal: Gateway runs in `federated` tier with containerized PG+pgvector+Valkey. No federation logic yet. Existing standalone behavior does not regress.

| id | status | description | issue | agent | branch | depends_on | estimate | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FED-M1-01 | not-started | Extend `mosaic.config.json` schema: add `"federated"` to `tier` enum in validator + TS types. Keep `local` and `standalone` working. Update schema docs/README where referenced. | #460 | codex | feat/federation-m1-tier-config | — | 4K | Schema lives in `packages/types`; validator in gateway bootstrap. No behavior change yet — enum only. |
| FED-M1-02 | not-started | Author `docker-compose.federated.yml` as an overlay profile: Postgres 16 + pgvector extension (port 5433), Valkey (6380), named volumes, healthchecks. Compose-up should boot cleanly on a clean machine. | #460 | codex | feat/federation-m1-compose | FED-M1-01 | 5K | Overlay on existing `docker-compose.yml`; no changes to base file. Add `profiles: [federated]` gating. |
| FED-M1-03 | not-started | Add pgvector support to `packages/storage/src/adapters/postgres.ts`: create extension on init (idempotent), expose vector column type in schema helpers. No adapter changes for non-federated tiers. | #460 | codex | feat/federation-m1-pgvector | FED-M1-02 | 8K | Extension create is idempotent `CREATE EXTENSION IF NOT EXISTS vector`. Gate on tier = federated. |
| FED-M1-04 | not-started | Implement `apps/gateway/src/bootstrap/tier-detector.ts`: reads config, asserts PG/Valkey/pgvector reachable for `federated`, fail-fast with actionable error message on failure. Unit tests for each failure mode. | #460 | codex | feat/federation-m1-detector | FED-M1-03 | 8K | Structured error type with remediation hints. Logs which service failed, with host:port attempted. |
| FED-M1-05 | not-started | Write `scripts/migrate-to-federated.ts`: one-way migration from `local` (PGlite) / `standalone` (PG without pgvector) → `federated`. Dumps, transforms, loads; dry-run + confirm UX. Idempotent on re-run. | #460 | codex | feat/federation-m1-migrate | FED-M1-04 | 10K | Do NOT run automatically. CLI subcommand `mosaic migrate tier --to federated --dry-run`. Safety rails. |
| FED-M1-06 | not-started | Update `mosaic doctor`: report current tier, required services, actual health per service, pgvector presence, overall green/yellow/red. Machine-readable JSON output flag for CI use. | #460 | sonnet | feat/federation-m1-doctor | FED-M1-04 | 6K | Existing doctor output evolves; add `--json` flag. Green/yellow/red + remediation suggestions per issue. |
| FED-M1-07 | not-started | Integration test: gateway boots in `federated` tier with docker-compose `federated` profile; refuses to boot when PG unreachable (asserts fail-fast); pgvector extension query succeeds. | #460 | sonnet | feat/federation-m1-integration | FED-M1-04 | 8K | Vitest + docker-compose test profile. One test file per assertion; real services, no mocks. |
| FED-M1-08 | not-started | Integration test for migration script: seed a local PGlite with representative data (tasks, notes, users, teams), run migration, assert row counts + key samples equal on federated PG. | #460 | sonnet | feat/federation-m1-migrate-test | FED-M1-05 | 6K | Runs against docker-compose federated profile; uses temp PGlite file; deterministic seed. |
| FED-M1-09 | not-started | Standalone regression: full agent-session E2E on existing `standalone` tier with a gateway built from this branch. Must pass without referencing any federation module. | #460 | haiku | feat/federation-m1-regression | FED-M1-07 | 4K | Reuse existing e2e harness; just re-point at the federation branch build. Canary that we didn't break it. |
| FED-M1-10 | not-started | Code review pass: security-focused on the migration script (data-at-rest during migration) + tier detector (sensitive-data leakage in error messages). Independent reviewer, not authors of tasks 01-09. | #460 | sonnet | — | FED-M1-09 | 8K | Use `feature-dev:code-reviewer` agent. Specifically: no secrets in error messages; no partial-migration footguns. |
| FED-M1-11 | not-started | Docs update: `docs/federation/` operator notes for tier setup; README blurb on federated tier; `docs/guides/` entry for migration. Do NOT touch runbook yet (deferred to FED-M7). | #460 | haiku | feat/federation-m1-docs | FED-M1-10 | 4K | Short, actionable. Link from MISSION-MANIFEST. No decisions captured here — those belong in PRD. |
| FED-M1-12 | not-started | PR, CI green, merge to main, close #460. | #460 | — | (aggregate) | FED-M1-11 | 3K | Queue-guard before push; wait for green; merge squashed; tea `issue-close` #460. |

**M1 total estimate:** ~74K tokens (over-budget vs 20K PRD estimate — explanation below)

**Why over-budget:** PRD's 20K estimate reflected implementation complexity only. The per-task breakdown includes tests, review, and docs as separate tasks per the delivery cycle, which catches the real cost. The final per-milestone budgets in MISSION-MANIFEST will be updated after M1 completes with actuals.

---

## Milestone 2 — Step-CA + grant schema + admin CLI (FED-M2)

_Deferred to mission planning when M1 is complete. Issue #461 tracks scope._

## Milestone 3 — mTLS handshake + list/get + scope enforcement (FED-M3)

_Deferred. Issue #462._

## Milestone 4 — search + audit + rate limit (FED-M4)

_Deferred. Issue #463._

## Milestone 5 — cache + offline + OTEL (FED-M5)

_Deferred. Issue #464._

## Milestone 6 — revocation + auto-renewal + CRL (FED-M6)

_Deferred. Issue #465._

## Milestone 7 — multi-user hardening + acceptance suite (FED-M7)

_Deferred. Issue #466._

---

## Execution Notes

**Agent assignment rationale:**

- `codex` for most implementation tasks (OpenAI credit pool preferred for feature code)
- `sonnet` for tests (pattern-based, moderate complexity), `doctor` work (cross-cutting), and independent code review
- `haiku` for docs and the standalone regression canary (cheapest tier for mechanical/verification work)
- No `opus` in M1 — save for cross-cutting architecture decisions if they surface later

**Branch strategy:** Each task gets its own feature branch off `main`. Tasks within a milestone merge in dependency order. Final aggregate PR (FED-M1-12) isn't a branch of its own — it's the merge of the last upstream task that closes the issue.

**Queue guard:** Every push and every merge in this mission must run `~/.config/mosaic/tools/git/ci-queue-wait.sh --purpose push|merge` per Mosaic hard gate #6.