stack/docs/federation/PRD.md
jason.woltje 46dd799548
docs(federation): PRD, milestones, mission manifest, and M1 task breakdown (#467)
2026-04-19 22:09:20 +00:00

# Mosaic Stack — Federation PRD
**Status:** Draft v1 (locked for implementation)
**Owner:** Jason
**Date:** 2026-04-19
**Scope:** Enables cross-instance data federation between Mosaic Stack gateways with asymmetric trust, multi-tenant scoping, and no cross-boundary data persistence.
---
## 1. Problem Statement
Jarvis operates across 3-4 workstations in two physical locations (home, USC). The user currently reaches back to a single jarvis-brain checkout from every session, and has tried OpenBrain to solve cross-session state, with poor results (cache invalidation, latency, opacity, hard dependency on a remote service).
The goal is a federation model where each user's **home instance** remains the source of truth for their personal data, and **work/shared instances** expose scoped data to that user's home instance on demand — without persisting anything across the boundary.
## 2. Goals
1. A user logged into their **home gateway** (Server A) can query their **work gateway** (Server B) in real time during a session.
2. Data returned from Server B is used in-session only; never written to Server A storage.
3. Server B has multiple users, each with their own Server A. No user's data leaks to another user.
4. Federation works over public HTTPS (no VPN required). Tailscale is a supported optional overlay.
5. Sync latency target: seconds, or at the next data need of the agent.
6. Graceful degradation: if the remote instance is unreachable, the local session continues with local data and a clear "federation offline" signal.
7. Teams exist on both sides. A federation grant can share **team-owned** data without exposing other team members' personal data.
8. Auth and revocation use standard PKI (X.509) so that certificate tooling (Step-CA, rotation, OCSP, CRL) is available out of the box.
## 3. Non-Goals (v1)
- Mesh federation (N-to-N). v1 is strictly A↔B pairs.
- Cross-instance writes. All federation is **read-only** on the remote side.
- Shared agent sessions across instances. Sessions live on one instance; federation is data-plane only.
- Cross-instance SSO. Each instance owns its own BetterAuth identity store; federation is service-to-service, not user-to-user.
- Realtime push from B→A. v1 is pull-only (A pulls from B during a session).
- Global search index. Federation is query-by-query, not index replication.
## 4. User Stories
- **US-1 (Solo user at home):** As the sole user on Server A, I want my agent session on workstation-1 to see the same data it saw on workstation-2, without running OpenBrain.
- **US-2 (Cross-location):** As a user with a home server and a work server, I want a session on my home laptop to transparently pull my USC-owned tasks/notes when I ask for them.
- **US-3 (Work admin):** As the admin of mosaic.uscllc.com, I want to grant each employee's home gateway scoped read access to only their own data plus explicitly-shared team data.
- **US-4 (Privacy boundary):** As employee A on mosaic.uscllc.com, my data must never appear in a session on employee B's home gateway — even if both are federated with uscllc.com.
- **US-5 (Revocation):** As a work admin, when I delete an employee, their home gateway loses access within one request cycle.
- **US-6 (Offline):** As a user in a hotel with flaky wifi, my local session keeps working; federation calls fail fast and are reported as "offline," not hung.
## 5. Architecture Overview
```
┌─────────────────────────────────────┐       mTLS / X.509       ┌─────────────────────────────────────┐
│ Server A: mosaic.woltje.com         │ ───────────────────────► │ Server B: mosaic.uscllc.com         │
│ (home, master for Jason)            │ ◄── JSON over HTTPS      │ (work, multi-tenant)                │
│                                     │                          │                                     │
│ ┌──────────────┐  ┌──────────────┐  │                          │ ┌──────────────┐  ┌──────────────┐  │
│ │ Gateway      │  │ Postgres     │  │                          │ │ Gateway      │  │ Postgres     │  │
│ │ (NestJS)     │──│ (local SSOT) │  │                          │ │ (NestJS)     │──│ (tenant SSOT)│  │
│ └──────┬───────┘  └──────────────┘  │                          │ └──────┬───────┘  └──────────────┘  │
│        │                            │                          │        │                            │
│        │ FederationClient           │                          │        │ FederationServer           │
│        │ (outbound, scoped query)   │                          │        │ (inbound, RBAC-gated)      │
│        └────────────────────────────┼──────────────────────────┼────────┘                            │
│                                     │                          │                                     │
│ Step-CA (issues A's client cert)    │                          │ Step-CA (issues B's server cert,    │
│                                     │                          │          trusts A's CA root on grant)│
└─────────────────────────────────────┘                          └─────────────────────────────────────┘
```
- Federation is a **transport-layer** concern between two gateways, implemented as a new internal module on each gateway.
- Both sides run the same code. Direction (client vs. server role) is per-request.
- Nothing in the agent runtime changes — agents query the gateway; the gateway decides local vs. remote.
## 6. Transport & Authentication
**Transport:** HTTPS with mutual TLS (mTLS).
**Identity:** X.509 client certificates issued by Step-CA. Each federation grant materializes as a client cert on the requesting side and a trust-anchor entry (CA root or explicit cert) on the serving side.
**Why mTLS over HMAC bearer tokens:**
- Standard rotation/revocation semantics (renew, CRL, OCSP).
- The cert subject carries identity claims (user, grant_id) that don't need a separate DB lookup to verify authenticity.
- Client certs never transit request bodies, so they can't be logged by accident.
- Transport is pinned at the TLS layer, not re-validated per-handler.
**Cert contents (SAN + subject):**
- `CN=grant-<uuid>`
- `O=<requesting-server-hostname>` (e.g., `mosaic.woltje.com`)
- Custom OIDs embedded in SAN otherName:
  - `mosaic.federation.grantId` (UUID)
  - `mosaic.federation.subjectUserId` (user on the **serving** side that this grant acts-as)
- Default lifetime: **30 days**, with auto-renewal at T-7 days if the grant is still active.
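
The lifetime and renewal rule can be sketched as a small date check. This is only an illustration under stated assumptions: the helper names `certExpiry` and `shouldRenew` and the constants are hypothetical, and the actual renewal would go through Step-CA rather than application code.

```typescript
// Hypothetical helpers; only the 30-day lifetime and the T-7 renewal window come from the PRD.
const DAY_MS = 24 * 60 * 60 * 1000;
const CERT_LIFETIME_DAYS = 30;
const RENEW_BEFORE_EXPIRY_DAYS = 7;

// Expiry of a cert issued at `issuedAt` with the default lifetime.
function certExpiry(issuedAt: Date): Date {
  return new Date(issuedAt.getTime() + CERT_LIFETIME_DAYS * DAY_MS);
}

// True once `now` is inside the renewal window, but only while the grant is still active.
function shouldRenew(expiresAt: Date, now: Date, grantActive: boolean): boolean {
  if (!grantActive) return false; // inactive grants are never renewed
  const renewFrom = expiresAt.getTime() - RENEW_BEFORE_EXPIRY_DAYS * DAY_MS;
  return now.getTime() >= renewFrom;
}
```

With these defaults, a cert issued on day 0 becomes eligible for renewal on day 23 and expires on day 30.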
**Step-CA topology (v1):** Each server runs its own Step-CA instance. During onboarding, the serving side imports the requesting side's CA root. A central/shared Step-CA is out of scope for v1.
**Handshake:**
1. Client (A) opens HTTPS to B with its grant cert.
2. B validates cert chain against trusted CA roots for that grant.
3. B extracts `grantId` and `subjectUserId` from the cert.
4. B loads the grant record, checks it is `active`, not revoked, and not expired.
5. B enforces scope and rate-limit for this grant.
6. Request proceeds; response returned.
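
Steps 3 and 4 of the handshake could look roughly like the following on the serving side. This is a sketch, not the gateway's implementation: `checkGrant`, `CertClaims`, and `GrantRecord` are hypothetical names, chain validation (step 2) is assumed to have happened in the TLS layer already, and the comparison against `active_cert_fingerprint` is an assumption drawn from the data model rather than from the handshake list itself.

```typescript
// Hypothetical sketch of the grant check that follows TLS chain validation.
type GrantStatus = "pending" | "active" | "suspended" | "revoked";

interface GrantRecord {
  id: string;
  subjectUserId: string;
  status: GrantStatus;
  activeCertFingerprint: string;
}

interface CertClaims {
  grantId: string; // from the mosaic.federation.grantId OID
  subjectUserId: string; // from the mosaic.federation.subjectUserId OID
  fingerprint: string; // SHA-256 of the presented client cert
  notAfter: Date;
}

type HandshakeResult =
  | { ok: true; grant: GrantRecord }
  | { ok: false; reason: string };

function checkGrant(
  claims: CertClaims,
  lookup: (grantId: string) => GrantRecord | undefined,
  now: Date,
): HandshakeResult {
  const grant = lookup(claims.grantId);
  if (!grant) return { ok: false, reason: "unknown_grant" };
  if (grant.status !== "active") return { ok: false, reason: `grant_${grant.status}` };
  if (claims.notAfter.getTime() <= now.getTime()) return { ok: false, reason: "cert_expired" };
  if (grant.subjectUserId !== claims.subjectUserId) return { ok: false, reason: "subject_mismatch" };
  // Assumption: only the currently active cert for a grant is accepted.
  if (grant.activeCertFingerprint !== claims.fingerprint) return { ok: false, reason: "cert_not_current" };
  return { ok: true, grant };
}
```

Scope and rate-limit enforcement (step 5) would run only after this check returns `ok: true`.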
## 7. Data Model
All tables live on **each instance's own Postgres**. Federation grants are bilateral — each side has a record of the grant.
### 7.1 `federation_grants` (on serving side, Server B)
| Field | Type | Notes |
| --------------------------- | ----------- | ------------------------------------------------- |
| `id` | uuid PK | |
| `subject_user_id` | uuid FK | Which local user this grant acts-as |
| `requesting_server` | text | Hostname of requesting gateway (e.g., woltje.com) |
| `requesting_ca_fingerprint` | text | SHA-256 of trusted CA root |
| `active_cert_fingerprint` | text | SHA-256 of currently valid client cert |
| `scope` | jsonb | See §8 |
| `rate_limit_rpm` | int | Default 60 |
| `status` | enum | `pending`, `active`, `suspended`, `revoked` |
| `created_at` | timestamptz | |
| `activated_at` | timestamptz | |
| `revoked_at` | timestamptz | |
| `last_used_at` | timestamptz | |
| `notes` | text | Admin-visible description |
### 7.2 `federation_peers` (on requesting side, Server A)
| Field | Type | Notes |
| --------------------- | ----------- | ------------------------------------------------ |
| `id` | uuid PK | |
| `peer_hostname` | text | e.g., `mosaic.uscllc.com` |
| `peer_ca_fingerprint` | text | SHA-256 of peer's CA root |
| `grant_id` | uuid | The grant ID assigned by the peer |
| `local_user_id` | uuid FK | Who on Server A this federation belongs to |
| `client_cert_pem` | text (enc) | Current client cert (PEM); rotated automatically |
| `client_key_pem` | text (enc) | Private key (encrypted at rest) |
| `cert_expires_at` | timestamptz | |
| `status` | enum | `pending`, `active`, `degraded`, `revoked` |
| `last_success_at` | timestamptz | |
| `last_failure_at` | timestamptz | |
| `notes` | text | |
### 7.3 `federation_audit_log` (on serving side, Server B)
| Field | Type | Notes |
| ------------- | ----------- | ------------------------------------------------ |
| `id` | uuid PK | |
| `grant_id` | uuid FK | |
| `occurred_at` | timestamptz | indexed |
| `verb` | text | `query`, `handshake`, `rejected`, `rate_limited` |
| `resource` | text | e.g., `tasks`, `notes`, `credentials` |
| `query_hash` | text | SHA-256 of normalized query (no payload stored) |
| `outcome` | text | `ok`, `denied`, `error` |
| `bytes_out` | int | |
| `latency_ms` | int | |
**Audit policy:** Every federation request is logged on the serving side. Because all federated requests are read-only, only metadata and the query hash are recorded; request and response bodies are never captured. Retention: 90 days hot, then roll to cold storage.
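
One way to build an audit row without storing the query itself is to hash a normalized form of it. The sketch below assumes a normalization rule (recursively sorted object keys) that the PRD does not specify; `normalize`, `queryHash`, and `buildAuditEntry` are hypothetical names, and the field names mirror the `federation_audit_log` columns in camelCase.

```typescript
import { createHash } from "node:crypto";

type AuditVerb = "query" | "handshake" | "rejected" | "rate_limited";
type AuditOutcome = "ok" | "denied" | "error";

interface AuditEntry {
  grantId: string;
  occurredAt: Date;
  verb: AuditVerb;
  resource: string;
  queryHash: string;
  outcome: AuditOutcome;
  bytesOut: number;
  latencyMs: number;
}

// Assumed normalization: sort object keys recursively so equivalent queries hash identically.
function normalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(normalize);
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(obj).sort()) sorted[key] = normalize(obj[key]);
    return sorted;
  }
  return value;
}

// SHA-256 (hex) of the normalized query; the query itself is not retained.
function queryHash(query: unknown): string {
  return createHash("sha256").update(JSON.stringify(normalize(query))).digest("hex");
}

function buildAuditEntry(
  grantId: string,
  verb: AuditVerb,
  resource: string,
  query: unknown,
  outcome: AuditOutcome,
  bytesOut: number,
  latencyMs: number,
): AuditEntry {
  return { grantId, occurredAt: new Date(), verb, resource, queryHash: queryHash(query), outcome, bytesOut, latencyMs };
}
```

Because only the hash is stored, two identical queries can be correlated in the log without the log revealing what was asked.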
## 8. RBAC & Scope
Every federation grant has a scope object that answers three questions for every inbound request:
1. **Who is acting?** `subject_user_id` from the cert.
2. **What resources?** — an allowlist of resource types (`tasks`, `notes`, `credentials`, `memory`, `teams/:id/tasks`, …).
3. **Filter expression** — predicates applied on top of the subject's normal RBAC (see below).
### 8.1 Scope schema
```json
{
"resources": ["tasks", "notes", "memory"],
"filters": {
"tasks": { "include_teams": ["team_uuid_1", "team_uuid_2"], "include_personal": true },
"notes": { "include_personal": true, "include_teams": [] },
"memory": { "include_personal": true }
},
"excluded_resources": ["credentials", "api_keys"],
"max_rows_per_query": 500
}
```
### 8.2 Access rule (enforced on serving side)
For every inbound federated query on resource R:
1. Resolve effective identity → `subject_user_id`.
2. Check R is in `scope.resources` and NOT in `scope.excluded_resources`. Otherwise 403.
3. Evaluate the user's **normal RBAC** (what they would see if they logged into Server B directly).
4. Intersect with the scope filter (e.g., only team X, only personal).
5. Apply `max_rows_per_query`.
6. Return; log to audit.
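
Steps 2, 4, and 5 of the access rule can be sketched as a pure function over the rows that native RBAC already allows. Everything here is illustrative: `applyScope`, `Row`, and `ForbiddenError` are hypothetical, a row with `teamId: null` is assumed to be personal data, and a resource with no filter entry is assumed to return nothing.

```typescript
// Hypothetical types mirroring the scope schema in §8.1.
interface ResourceFilter {
  include_personal?: boolean;
  include_teams?: string[];
}

interface Scope {
  resources: string[];
  filters: Record<string, ResourceFilter | undefined>;
  excluded_resources: string[];
  max_rows_per_query: number;
}

// Assumed row shape: teamId === null marks personal data owned by ownerUserId.
interface Row {
  id: string;
  ownerUserId: string;
  teamId: string | null;
}

class ForbiddenError extends Error {}

// rbacVisible: rows the subject would see when logged into Server B directly (step 3).
function applyScope(scope: Scope, resource: string, subjectUserId: string, rbacVisible: Row[]): Row[] {
  // Step 2: allowlist and explicit exclusions; a caller would map this error to 403.
  if (!scope.resources.includes(resource) || scope.excluded_resources.includes(resource)) {
    throw new ForbiddenError(`resource not in scope: ${resource}`);
  }
  const filter: ResourceFilter = scope.filters[resource] ?? {};
  const teams = new Set(filter.include_teams ?? []);
  // Step 4: intersect with the scope filter; this can only remove rows, never add them.
  const allowed = rbacVisible.filter((row) =>
    row.teamId === null
      ? filter.include_personal === true && row.ownerUserId === subjectUserId
      : teams.has(row.teamId),
  );
  // Step 5: cap the result size.
  return allowed.slice(0, scope.max_rows_per_query);
}
```

Since the function only filters its input, the result is always a subset of what the subject's own RBAC permits.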
### 8.3 Team boundary guarantees
- Scope filters only narrow the subject's native RBAC; they never widen it. A grant cannot confer access the user would not have had themselves.
- `include_teams` means "only these teams," not "these teams in addition to all teams."
- `include_personal: false` hides the user's personal data entirely from federation, even if they own it — useful for work-only accounts.
### 8.4 No cross-user leakage
When Server B has multiple users (employees) all federating back to their own Server A:
- Each employee has their own grant with their own `subject_user_id`.
- The cert is bound to a specific grant; there is no mechanism by which one grant's cert can be used to impersonate another.
- Audit log is per-grant.
## 9. Query Model
Federation exposes a **narrow read API**, not arbitrary SQL.
### 9.1 Supported verbs (v1)
| Verb | Purpose | Returns |
| -------------- | ------------------------------------------ | ------------------------------- |
| `list` | Paginated list of a resource type | Array of resources |
| `get` | Fetch a single resource by id | One resource or 404 |
| `search` | Keyword search within allowed resources | Ranked list of hits |
| `capabilities` | What this grant is allowed to do right now | Scope object + rate-limit state |
### 9.2 Not in v1
- Write verbs.
- Aggregations / analytics.
- Streaming / subscriptions (future: see §13).
### 9.3 Agent-facing integration
Agents never call federation directly. Instead:
- The gateway query layer accepts `source: "local" | "federated:<peer_hostname>" | "all"`.
- `"all"` fans out in parallel, merges results, tags each with `_source`.
- Federation results are in-memory only; the gateway does not persist them.
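
The `"all"` fan-out could be sketched as below. The names (`queryAll`, `mergeOutcomes`, `Fetcher`) are hypothetical and not the gateway's API. The sketch assumes a failing peer should degrade to "no rows plus an offline marker" while a local failure still propagates, and it holds results in memory only.

```typescript
// Hypothetical sketch of the "all" source: parallel fan-out, merge, and _source tagging.
type Fetcher<T> = () => Promise<T[]>;
type Tagged<T> = T & { _source: string };

interface PeerOutcome<T> {
  source: string; // "federated:<peer_hostname>"
  rows: T[];
  failed: boolean;
}

interface FanOutResult<T> {
  results: Tagged<T>[];
  offline: string[]; // peers to surface as "federation offline"
}

// Merge step: tag every row with its origin; failed peers contribute no rows.
function mergeOutcomes<T extends object>(localRows: T[], peers: PeerOutcome<T>[]): FanOutResult<T> {
  const results: Tagged<T>[] = localRows.map((row) => ({ ...row, _source: "local" }));
  const offline: string[] = [];
  for (const peer of peers) {
    if (peer.failed) {
      offline.push(peer.source);
    } else {
      for (const row of peer.rows) results.push({ ...row, _source: peer.source });
    }
  }
  return { results, offline };
}

// Fan-out step: query local storage and every peer in parallel.
async function queryAll<T extends object>(
  local: Fetcher<T>,
  peers: Record<string, Fetcher<T>>,
): Promise<FanOutResult<T>> {
  const peerCalls = Object.entries(peers).map(([host, fetchPeer]) => {
    const source = `federated:${host}`;
    return fetchPeer().then(
      (rows): PeerOutcome<T> => ({ source, rows, failed: false }),
      (): PeerOutcome<T> => ({ source, rows: [], failed: true }), // peer error degrades, does not throw
    );
  });
  const [localRows, outcomes] = await Promise.all([local(), Promise.all(peerCalls)]);
  return mergeOutcomes(localRows, outcomes);
}
```

Keeping the merge step separate from the network calls makes the tagging and offline reporting easy to test without any peers running.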
## 10. Caching
- **In-memory response cache** with short TTL (default 30s) for `list` and `get`. `search` is not cached.
- Cache is keyed by `(grant_id, verb, resource, query_hash)`.
- Cache is flushed on cert rotation and on grant revocation.
- No disk cache. No cross-session cache.
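
A minimal sketch of such a cache follows, assuming a plain in-process `Map` and an injectable clock for testing. The class name `FederationResponseCache` and its methods are hypothetical; only the key tuple, the 30s default TTL, and the flush triggers come from the list above.

```typescript
// Hypothetical in-memory TTL cache keyed by (grant_id, verb, resource, query_hash).
interface CacheKey {
  grantId: string;
  verb: "list" | "get"; // `search` is not cached
  resource: string;
  queryHash: string;
}

interface CacheEntry<V> {
  grantId: string;
  value: V;
  expiresAt: number; // epoch milliseconds
}

class FederationResponseCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = 30000, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // JSON array encoding avoids collisions that a simple delimiter join could produce.
  private static keyOf(k: CacheKey): string {
    return JSON.stringify([k.grantId, k.verb, k.resource, k.queryHash]);
  }

  get(key: CacheKey): V | undefined {
    const id = FederationResponseCache.keyOf(key);
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(id); // expired entries are dropped lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: CacheKey, value: V): void {
    const entry: CacheEntry<V> = { grantId: key.grantId, value, expiresAt: this.now() + this.ttlMs };
    this.entries.set(FederationResponseCache.keyOf(key), entry);
  }

  // Called on cert rotation or grant revocation.
  flushGrant(grantId: string): void {
    for (const [id, entry] of this.entries) {
      if (entry.grantId === grantId) this.entries.delete(id);
    }
  }
}
```

Nothing here touches disk, and a new process starts with an empty map, which matches the "no disk cache, no cross-session cache" constraint.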
## 11. Bootstrap & Onboarding
### 11.1 Instance capability tiers
| Tier | Storage | Queue | Memory | Can federate? |
| ------------ | -------- | ------- | -------- | --------------------- |
| `local` | PGlite | in-proc | keyword | No |
| `standalone` | Postgres | Valkey | keyword | No (can be client) |
| `federated` | Postgres | Valkey | pgvector | Yes (server + client) |
Federation requires `federated` tier on **both** sides.
### 11.2 Onboarding flow (admin-driven)
1. Admin on Server B runs `mosaic federation grant create --user <user-id> --peer <peer-hostname> --scope-file scope.json`.
2. Server B generates a `grant_id`, prints a one-time enrollment URL containing the grant ID + B's CA root fingerprint.
3. Admin on Server A (or the user themselves, if allowed) runs `mosaic federation peer add <enrollment-url>`.
4. Server A's Step-CA generates a CSR for the new grant. A submits the CSR to B over a short-lived enrollment endpoint (single-use token in the enrollment URL).
5. B's Step-CA signs the cert (with grant ID embedded in SAN OIDs), returns it.
6. A stores the signed cert + private key (encrypted) in `federation_peers`.
7. Grant status flips from `pending` to `active` on both sides.
8. Cert auto-renews at T-7 days using the standard Step-CA renewal flow as long as the grant remains active.
### 11.3 Revocation
- **Admin-initiated:** `mosaic federation grant revoke <grant-id>` on B flips status to `revoked`, adds the cert to B's CRL, and writes an audit entry.
- **Revoke-on-delete:** Deleting a user on B automatically revokes all grants where that user is the subject.
- Server A learns of revocation on the next request (TLS handshake fails) and flips the peer to `revoked`.
### 11.4 Rate limit
Default `60 req/min` per grant. Configurable per grant. Enforced at the serving side. A rate-limited request returns `429` with `Retry-After`.
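
As an illustration, per-grant enforcement could use a fixed one-minute window. The PRD does not name an algorithm, so the fixed window is an assumption (a token bucket or sliding window would also satisfy the text), and `GrantRateLimiter` and `RateDecision` are hypothetical names.

```typescript
// Hypothetical fixed-window limiter, one window per grant.
type RateDecision =
  | { allowed: true }
  | { allowed: false; status: 429; retryAfterSeconds: number };

const WINDOW_MS = 60000;

class GrantRateLimiter {
  private readonly windows = new Map<string, { windowStart: number; count: number }>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // limitRpm would come from the grant's rate_limit_rpm column (default 60).
  check(grantId: string, limitRpm: number = 60): RateDecision {
    const t = this.now();
    const current = this.windows.get(grantId);
    if (!current || t - current.windowStart >= WINDOW_MS) {
      this.windows.set(grantId, { windowStart: t, count: 1 }); // start a fresh window
      return { allowed: true };
    }
    if (current.count < limitRpm) {
      current.count += 1;
      return { allowed: true };
    }
    // Over the limit: tell the client how long until the window resets.
    const retryAfterSeconds = Math.ceil((current.windowStart + WINDOW_MS - t) / 1000);
    return { allowed: false, status: 429, retryAfterSeconds };
  }
}
```

The `retryAfterSeconds` value is what a handler would place in the `Retry-After` header of the `429` response.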
## 12. Operational Concerns
- **Observability:** Each federation request emits an OTEL span with `grant_id`, `peer`, `verb`, `resource`, `outcome`, `latency_ms`. Traces correlate across both servers via W3C traceparent.
- **Health check:** `mosaic federation status` on each side shows active grants, last-success times, cert expirations, and any CRL mismatches.
- **Backpressure:** If the serving side is overloaded, it returns `503` with a structured body; the client marks the peer `degraded` and falls back to local-only until the next successful handshake.
- **Secrets:** `client_key_pem` in `federation_peers` is encrypted at rest, sealed with the instance's master key (the same mechanism as `provider_credentials`).
- **Credentials do not cross by default:** The `credentials` resource type is in the default excluded list. Sharing it requires an explicit, logged admin action that adds it to a grant's scope, and even then access is per-grant and per-user.
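
On the requesting side, the backpressure rule and the revocation behavior from §11.3 amount to a small state machine over the peer `status` column. The sketch below is an assumption-laden illustration: `nextPeerStatus` and `CallOutcome` are hypothetical, an unreachable peer is assumed to be treated like a `503`, and `revoked` is assumed to be terminal.

```typescript
// Hypothetical client-side transition function for federation_peers.status.
type PeerStatus = "pending" | "active" | "degraded" | "revoked";

type CallOutcome =
  | { kind: "ok" }
  | { kind: "http"; status: number }
  | { kind: "tls_rejected" } // handshake refused, e.g. cert revoked on the peer
  | { kind: "unreachable" }; // timeout or connection failure

function nextPeerStatus(current: PeerStatus, outcome: CallOutcome): PeerStatus {
  if (current === "revoked") return "revoked"; // assumption: revoked is terminal
  switch (outcome.kind) {
    case "ok":
      return "active"; // a successful call clears a degraded state
    case "tls_rejected":
      return "revoked";
    case "unreachable":
      return "degraded"; // assumption: fall back to local-only, same as 503
    case "http":
      return outcome.status === 503 ? "degraded" : current;
  }
}
```

A `429` leaves the status unchanged in this sketch, since rate limiting signals a healthy peer that the client should simply back off from.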
## 13. Future (post-v1)
- B→A push (e.g., "notify A when a task assigned to subject changes") via Socket.IO over mTLS.
- Mesh (N-to-N) federation.
- Write verbs with conflict resolution.
- Shared Step-CA (a "root of roots") so that onboarding doesn't require exchanging CA roots.
- Federated memory search over vector indexes with homomorphic filtering.
## 14. Locked Decisions (was "Open Questions")
| # | Question | Decision |
| --- | ------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| 1 | What happens to a grant when its subject user is deleted? | **Revoke-on-delete.** All grants where the user is subject are auto-revoked and CRL'd. |
| 2 | Do we audit read-only requests? | **Yes.** All federated reads are audited on the serving side. Bodies are not captured; query hash + metadata only. |
| 3 | Default rate limit? | **60 requests per minute per grant,** override-able per grant. |
| 4 | How do we verify the requesting-server's identity beyond the grant token? | **X.509 client cert tied to the user,** issued by Step-CA (per-server) or locally generated. Cert subject carries `grantId` + `subjectUserId`. |
### M1 decisions
- **Postgres deployment:** **Containerized** alongside the gateway in M1 (Docker Compose profile). Moving to a dedicated host is an M5+ operational concern, not a v1 feature.
- **Instance signing key:** **Separate** from the Step-CA key. Step-CA signs federation certs; the instance master key seals at-rest secrets (client keys, provider credentials). Different blast-radius, different rotation cadences.
## 15. Acceptance Criteria
- [ ] Two Mosaic Stack gateways on different hosts can establish a federation grant via the CLI-driven onboarding flow.
- [ ] Server A can query Server B for `tasks`, `notes`, `memory` respecting scope filters.
- [ ] A user on B with no grant cannot be queried by A, even if A has a valid grant for another user.
- [ ] Revoking a grant on B causes A's next request to fail with a clear error within one request cycle.
- [ ] Cert rotation happens automatically at T-7 days; an in-progress session survives rotation without user action.
- [ ] Rate-limit enforcement returns 429 with `Retry-After`; client backs off.
- [ ] With B unreachable, a session on A completes using local data and surfaces a "federation offline for `<peer>`" signal once.
- [ ] Every federated request appears in B's `federation_audit_log` within 1 second.
- [ ] A scope excluding `credentials` means credentials are not returnable even via `search` with matching keywords.
- [ ] `mosaic federation status` shows cert expiry, grant status, and last success/failure per peer.
## 16. Implementation Milestones (reference)
Milestones live in `docs/federation/MILESTONES.md` (to be authored next). High-level:
- **M1:** Server A runs `federated` tier standalone (Postgres + Valkey + pgvector, containerized). No peer yet.
- **M2:** Step-CA embedded; `federation_grants` / `federation_peers` schema + admin CLI.
- **M3:** Handshake + `list`/`get` verbs with scope enforcement.
- **M4:** `search` verb, audit log, rate limits.
- **M5:** Cache layer, offline-degradation UX, observability surfaces.
- **M6:** Revocation flows (admin + revoke-on-delete), cert auto-renewal.
- **M7:** Multi-user RBAC hardening on B, team-scoped grants end-to-end, acceptance suite green.
---
**Next step after PRD sign-off:** author `docs/federation/MILESTONES.md` with per-milestone acceptance tests and estimated token budget, then file tracking issues on `git.mosaicstack.dev/mosaicstack/stack`.