# Mosaic Stack: Federation PRD

**Status:** Draft v1 (locked for implementation)
**Owner:** Jason
**Date:** 2026-04-19
**Scope:** Enables cross-instance data federation between Mosaic Stack gateways with asymmetric trust, multi-tenant scoping, and no cross-boundary data persistence.

---

## 1. Problem Statement

Jarvis operates across 3–4 workstations in two physical locations (home, USC). The user currently reaches back to a single jarvis-brain checkout from every session, and has tried OpenBrain to solve cross-session state, with poor results (cache invalidation, latency, opacity, hard dependency on a remote service).

The goal is a federation model where each user's **home instance** remains the source of truth for their personal data, and **work/shared instances** expose scoped data to that user's home instance on demand, without persisting anything across the boundary.

## 2. Goals

1. A user logged into their **home gateway** (Server A) can query their **work gateway** (Server B) in real time during a session.
2. Data returned from Server B is used in-session only; never written to Server A storage.
3. Server B has multiple users, each with their own Server A. No user's data leaks to another user.
4. Federation works over public HTTPS (no VPN required). Tailscale is a supported optional overlay.
5. Sync latency target: seconds, or at the next data need of the agent.
6. Graceful degradation: if the remote instance is unreachable, the local session continues with local data and a clear "federation offline" signal.
7. Teams exist on both sides. A federation grant can share **team-owned** data without exposing other team members' personal data.
8. Auth and revocation use standard PKI (X.509) so that certificate tooling (Step-CA, rotation, OCSP, CRL) is available out of the box.

## 3. Non-Goals (v1)

- Mesh federation (N-to-N). v1 is strictly A↔B pairs.
- Cross-instance writes. All federation is **read-only** on the remote side.
- Shared agent sessions across instances. Sessions live on one instance; federation is data-plane only.
- Cross-instance SSO. Each instance owns its own BetterAuth identity store; federation is service-to-service, not user-to-user.
- Realtime push from B→A. v1 is pull-only (A pulls from B during a session).
- Global search index. Federation is query-by-query, not index replication.

## 4. User Stories

- **US-1 (Solo user at home):** As the sole user on Server A, I want my agent session on workstation-1 to see the same data it saw on workstation-2, without running OpenBrain.
- **US-2 (Cross-location):** As a user with a home server and a work server, I want a session on my home laptop to transparently pull my USC-owned tasks/notes when I ask for them.
- **US-3 (Work admin):** As the admin of mosaic.uscllc.com, I want to grant each employee's home gateway scoped read access to only their own data plus explicitly-shared team data.
- **US-4 (Privacy boundary):** As employee A on mosaic.uscllc.com, my data must never appear in a session on employee B's home gateway, even if both are federated with uscllc.com.
- **US-5 (Revocation):** As a work admin, when I delete an employee, their home gateway loses access within one request cycle.
- **US-6 (Offline):** As a user in a hotel with flaky wifi, my local session keeps working; federation calls fail fast and are reported as "offline," not hung.

## 5. Architecture Overview

```
┌─────────────────────────────────────┐      mTLS / X.509        ┌─────────────────────────────────────┐
│ Server A: mosaic.woltje.com         │ ───────────────────────► │ Server B: mosaic.uscllc.com         │
│ (home, master for Jason)            │ ◄── JSON over HTTPS      │ (work, multi-tenant)                │
│                                     │                          │                                     │
│  ┌──────────────┐  ┌──────────────┐ │                          │  ┌──────────────┐  ┌──────────────┐ │
│  │  Gateway     │  │  Postgres    │ │                          │  │  Gateway     │  │  Postgres    │ │
│  │  (NestJS)    │──│ (local SSOT) │ │                          │  │  (NestJS)    │──│ (tenant SSOT)│ │
│  └──────┬───────┘  └──────────────┘ │                          │  └──────┬───────┘  └──────────────┘ │
│         │                           │                          │         │                           │
│         │ FederationClient          │                          │         │ FederationServer          │
│         │ (outbound, scoped query)  │                          │         │ (inbound, RBAC-gated)     │
│         └───────────────────────────┼──────────────────────────┼─────────┘                           │
│                                     │                          │                                     │
│ Step-CA (issues A's client cert)    │                          │ Step-CA (issues B's server cert,    │
│                                     │                          │         trusts A's CA root on grant)│
└─────────────────────────────────────┘                          └─────────────────────────────────────┘
```

- Federation is a **transport-layer** concern between two gateways, implemented as a new internal module on each gateway.
- Both sides run the same code. Direction (client vs. server role) is per-request.
- Nothing in the agent runtime changes: agents query the gateway, and the gateway decides local vs. remote.

## 6. Transport & Authentication

**Transport:** HTTPS with mutual TLS (mTLS).

**Identity:** X.509 client certificates issued by Step-CA. Each federation grant materializes as a client cert on the requesting side and a trust-anchor entry (CA root or explicit cert) on the serving side.

**Why mTLS over HMAC bearer tokens:**

- Standard rotation/revocation semantics (renew, CRL, OCSP).
- The cert subject carries identity claims (user, grant_id) that don't need a separate DB lookup to verify authenticity.
- Client certs never transit request bodies, so they can't be logged by accident.
- Transport is pinned at the TLS layer, not re-validated per-handler.

**Cert contents (SAN + subject):**

- `CN=grant-`
- `O=` (e.g., `mosaic.woltje.com`)
- Custom OIDs embedded in SAN otherName:
  - `mosaic.federation.grantId` (UUID)
  - `mosaic.federation.subjectUserId` (user on the **serving** side that this grant acts-as)
- Default lifetime: **30 days**, with auto-renewal at T-7 days if the grant is still active.

**Step-CA topology (v1):** Each server runs its own Step-CA instance. During onboarding, the serving side imports the requesting side's CA root. A central/shared Step-CA is out of scope for v1.

**Handshake:**

1. Client (A) opens HTTPS to B with its grant cert.
2. B validates cert chain against trusted CA roots for that grant.
3. B extracts `grantId` and `subjectUserId` from the cert.
4. B loads the grant record, checks it is `active`, not revoked, and not expired.
5. B enforces scope and rate-limit for this grant.
6. Request proceeds; response returned.

## 7. Data Model

All tables live on **each instance's own Postgres**. Federation grants are bilateral: each side has a record of the grant.
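
Because the serving side keeps its own grant record, it can decide whether to admit a request from its local row alone. The following is a minimal sketch of handshake steps 3 and 4 from §6, not the gateway's actual code. It assumes the TLS layer has already verified the chain and extracted the claims. The names `CertClaims`, `GrantRecord`, `admitRequest`, and the reason strings are hypothetical. The subject and fingerprint comparisons, and reading "not expired" as the cert's `notAfter`, are assumptions rather than requirements stated in this PRD.

```typescript
// Hypothetical shapes; field names loosely mirror the serving-side grant record.
type GrantStatus = "pending" | "active" | "suspended" | "revoked";

interface GrantRecord {
  id: string;
  subjectUserId: string;
  status: GrantStatus;
  activeCertFingerprint: string; // SHA-256 of the currently valid client cert
}

// Claims assumed to be extracted from an already chain-verified client cert.
interface CertClaims {
  grantId: string;
  subjectUserId: string;
  fingerprint: string;
  notAfter: Date;
}

interface Admission {
  ok: boolean;
  reason: string; // "ok" when admitted, otherwise a short rejection code
}

function admitRequest(
  claims: CertClaims,
  lookupGrant: (id: string) => GrantRecord | undefined,
  now: Date,
): Admission {
  // Step 4: load the grant record named by the cert.
  const grant = lookupGrant(claims.grantId);
  if (!grant) return { ok: false, reason: "unknown_grant" };

  // Only `active` grants are admitted; pending, suspended and revoked are rejected.
  if (grant.status !== "active") return { ok: false, reason: "grant_" + grant.status };

  // Assumption: the cert must name the same subject user as the grant row.
  if (grant.subjectUserId !== claims.subjectUserId) {
    return { ok: false, reason: "subject_mismatch" };
  }

  // Assumption: only the cert recorded as active on the grant is accepted.
  if (grant.activeCertFingerprint !== claims.fingerprint) {
    return { ok: false, reason: "stale_cert" };
  }

  // Assumption: "not expired" is read as the cert's own validity window.
  if (claims.notAfter.getTime() <= now.getTime()) {
    return { ok: false, reason: "cert_expired" };
  }

  return { ok: true, reason: "ok" };
}
```

Scope and rate-limit enforcement (step 5) would run only after this check returns `ok: true`. A real implementation would likely need to tolerate both the old and the new fingerprint for a short window during rotation, which this sketch does not model.
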

### 7.1 `federation_grants` (on serving side, Server B)

| Field                       | Type        | Notes                                             |
| --------------------------- | ----------- | ------------------------------------------------- |
| `id`                        | uuid PK     |                                                   |
| `subject_user_id`           | uuid FK     | Which local user this grant acts-as               |
| `requesting_server`         | text        | Hostname of requesting gateway (e.g., woltje.com) |
| `requesting_ca_fingerprint` | text        | SHA-256 of trusted CA root                        |
| `active_cert_fingerprint`   | text        | SHA-256 of currently valid client cert            |
| `scope`                     | jsonb       | See §8                                            |
| `rate_limit_rpm`            | int         | Default 60                                        |
| `status`                    | enum        | `pending`, `active`, `suspended`, `revoked`       |
| `created_at`                | timestamptz |                                                   |
| `activated_at`              | timestamptz |                                                   |
| `revoked_at`                | timestamptz |                                                   |
| `last_used_at`              | timestamptz |                                                   |
| `notes`                     | text        | Admin-visible description                         |

### 7.2 `federation_peers` (on requesting side, Server A)

| Field                 | Type        | Notes                                            |
| --------------------- | ----------- | ------------------------------------------------ |
| `id`                  | uuid PK     |                                                  |
| `peer_hostname`       | text        | e.g., `mosaic.uscllc.com`                        |
| `peer_ca_fingerprint` | text        | SHA-256 of peer's CA root                        |
| `grant_id`            | uuid        | The grant ID assigned by the peer                |
| `local_user_id`       | uuid FK     | Who on Server A this federation belongs to       |
| `client_cert_pem`     | text (enc)  | Current client cert (PEM); rotated automatically |
| `client_key_pem`      | text (enc)  | Private key (encrypted at rest)                  |
| `cert_expires_at`     | timestamptz |                                                  |
| `status`              | enum        | `pending`, `active`, `degraded`, `revoked`       |
| `last_success_at`     | timestamptz |                                                  |
| `last_failure_at`     | timestamptz |                                                  |
| `notes`               | text        |                                                  |

### 7.3 `federation_audit_log` (on serving side, Server B)

| Field         | Type        | Notes                                            |
| ------------- | ----------- | ------------------------------------------------ |
| `id`          | uuid PK     |                                                  |
| `grant_id`    | uuid FK     |                                                  |
| `occurred_at` | timestamptz | indexed                                          |
| `verb`        | text        | `query`, `handshake`, `rejected`, `rate_limited` |
| `resource`    | text        | e.g., `tasks`, `notes`, `credentials`            |
| `query_hash`  | text        | SHA-256 of normalized query (no payload stored)  |
| `outcome`     | text        | `ok`, `denied`, `error`                          |
| `bytes_out`   | int         |                                                  |
| `latency_ms`  | int         |                                                  |

**Audit policy:** Every federation request is logged on the serving side. All v1 requests are read-only, and bodies are never captured; only the query hash and metadata are stored. Retention: 90 days hot, then roll to cold storage.

## 8. RBAC & Scope

Every federation grant has a scope object that answers three questions for every inbound request:

1. **Who is acting?** `subject_user_id` from the cert.
2. **What resources?** An allowlist of resource types (`tasks`, `notes`, `credentials`, `memory`, `teams/:id/tasks`, …).
3. **Which filters?** Predicates applied on top of the subject's normal RBAC (see below).

### 8.1 Scope schema

```json
{
  "resources": ["tasks", "notes", "memory"],
  "filters": {
    "tasks": { "include_teams": ["team_uuid_1", "team_uuid_2"], "include_personal": true },
    "notes": { "include_personal": true, "include_teams": [] },
    "memory": { "include_personal": true }
  },
  "excluded_resources": ["credentials", "api_keys"],
  "max_rows_per_query": 500
}
```

### 8.2 Access rule (enforced on serving side)

For every inbound federated query on resource R:

1. Resolve effective identity → `subject_user_id`.
2. Check R is in `scope.resources` and NOT in `scope.excluded_resources`. Otherwise 403.
3. Evaluate the user's **normal RBAC** (what they would see if they logged into Server B directly).
4. Intersect with the scope filter (e.g., only team X, only personal).
5. Apply `max_rows_per_query`.
6. Return; log to audit.

### 8.3 Team boundary guarantees

- Scope filters only narrow the native RBAC; they never widen it. A grant cannot give access the user would not have had themselves.
- `include_teams` means "only these teams," not "these teams in addition to all teams."
- `include_personal: false` hides the user's personal data entirely from federation, even if they own it, which is useful for work-only accounts.
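
The access rule in §8.2 and the guarantees in §8.3 can be sketched as one pure function. This is an illustrative sketch under stated assumptions, not the gateway's implementation. The `Row` shape (a row is personal when `teamId` is `null`), the function name `applyScope`, and the choice to return nothing for an allowlisted resource with no filter entry are all hypothetical. The caller is assumed to have already evaluated the subject's normal RBAC (step 3) and to pass in only the rows that RBAC allows.

```typescript
interface ResourceFilter {
  include_personal?: boolean;
  include_teams?: string[];
}

// Mirrors the scope schema in §8.1.
interface Scope {
  resources: string[];
  excluded_resources: string[];
  filters: Record<string, ResourceFilter | undefined>;
  max_rows_per_query: number;
}

// Hypothetical row shape: teamId === null marks a personal row.
interface Row {
  id: string;
  ownerUserId: string;
  teamId: string | null;
}

function applyScope(
  scope: Scope,
  resource: string,
  subjectUserId: string,
  rbacVisible: Row[], // rows the subject's normal RBAC already allows on Server B
): { status: number; rows: Row[] } {
  // Step 2: the resource must be allowlisted and not explicitly excluded.
  const allowed = scope.resources.indexOf(resource) !== -1;
  const excluded = scope.excluded_resources.indexOf(resource) !== -1;
  if (!allowed || excluded) return { status: 403, rows: [] };

  // Assumption: a missing filter entry grants nothing (deny by default).
  const filter: ResourceFilter = scope.filters[resource] ?? {};
  const teams = filter.include_teams ?? [];

  // Step 4: intersect RBAC-visible rows with the scope filter.
  // The filter can only drop rows, so it never widens the native RBAC.
  const rows = rbacVisible.filter((row) =>
    row.teamId === null
      ? filter.include_personal === true && row.ownerUserId === subjectUserId
      : teams.indexOf(row.teamId) !== -1,
  );

  // Step 5: cap the result size.
  return { status: 200, rows: rows.slice(0, scope.max_rows_per_query) };
}
```

Because the function only filters its input, a misconfigured scope can hide rows but cannot reveal a row the subject's own RBAC would have denied.
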

### 8.4 No cross-user leakage

When Server B has multiple users (employees) all federating back to their own Server A:

- Each employee has their own grant with their own `subject_user_id`.
- The cert is bound to a specific grant; there is no mechanism by which one grant's cert can be used to impersonate another.
- Audit log is per-grant.

## 9. Query Model

Federation exposes a **narrow read API**, not arbitrary SQL.

### 9.1 Supported verbs (v1)

| Verb           | Purpose                                    | Returns                         |
| -------------- | ------------------------------------------ | ------------------------------- |
| `list`         | Paginated list of a resource type          | Array of resources              |
| `get`          | Fetch a single resource by id              | One resource or 404             |
| `search`       | Keyword search within allowed resources    | Ranked list of hits             |
| `capabilities` | What this grant is allowed to do right now | Scope object + rate-limit state |

### 9.2 Not in v1

- Write verbs.
- Aggregations / analytics.
- Streaming / subscriptions (future: see §13).

### 9.3 Agent-facing integration

Agents never call federation directly. Instead:

- The gateway query layer accepts `source: "local" | "federated:" | "all"`.
- `"all"` fans out in parallel, merges results, tags each with `_source`.
- Federation results are in-memory only; the gateway does not persist them.

## 10. Caching

- **In-memory response cache** with short TTL (default 30s) for `list` and `get`. `search` is not cached.
- Cache is keyed by `(grant_id, verb, resource, query_hash)`.
- Cache is flushed on cert rotation and on grant revocation.
- No disk cache. No cross-session cache.

## 11. Bootstrap & Onboarding

### 11.1 Instance capability tiers

| Tier         | Storage  | Queue   | Memory   | Can federate?         |
| ------------ | -------- | ------- | -------- | --------------------- |
| `local`      | PGlite   | in-proc | keyword  | No                    |
| `standalone` | Postgres | Valkey  | keyword  | No (can be client)    |
| `federated`  | Postgres | Valkey  | pgvector | Yes (server + client) |

Federation requires `federated` tier on **both** sides.

### 11.2 Onboarding flow (admin-driven)

1. Admin on Server B runs `mosaic federation grant create --user --peer --scope-file scope.json`.
2. Server B generates a `grant_id`, prints a one-time enrollment URL containing the grant ID + B's CA root fingerprint.
3. Admin on Server A (or the user themselves, if allowed) runs `mosaic federation peer add `.
4. Server A's Step-CA generates a CSR for the new grant. A submits the CSR to B over a short-lived enrollment endpoint (single-use token in the enrollment URL).
5. B's Step-CA signs the cert (with grant ID embedded in SAN OIDs), returns it.
6. A stores the signed cert + private key (encrypted) in `federation_peers`.
7. Grant status flips from `pending` to `active` on both sides.
8. Cert auto-renews at T-7 days using the standard Step-CA renewal flow as long as the grant remains active.

### 11.3 Revocation

- **Admin-initiated:** `mosaic federation grant revoke ` on B flips status to `revoked`, adds the cert to B's CRL, and writes an audit entry.
- **Revoke-on-delete:** Deleting a user on B automatically revokes all grants where that user is the subject.
- Server A learns of revocation on the next request (TLS handshake fails) and flips the peer to `revoked`.

### 11.4 Rate limit

Default `60 req/min` per grant. Configurable per grant. Enforced at the serving side. A rate-limited request returns `429` with `Retry-After`.

## 12. Operational Concerns

- **Observability:** Each federation request emits an OTEL span with `grant_id`, `peer`, `verb`, `resource`, `outcome`, `latency_ms`. Traces correlate across both servers via W3C traceparent.
- **Health check:** `mosaic federation status` on each side shows active grants, last-success times, cert expirations, and any CRL mismatches.
- **Backpressure:** If the serving side is overloaded, it returns `503` with a structured body; the client marks the peer `degraded` and falls back to local-only until the next successful handshake.
- **Secrets:** `client_key_pem` in `federation_peers` is encrypted with the gateway's key (sealed with the instance's master key, the same mechanism as `provider_credentials`).
- **Credentials never cross:** The `credentials` resource type is in the default excluded list. It must be explicitly added to scope (admin action, logged) and even then is per-grant and per-user.

## 13. Future (post-v1)

- B→A push (e.g., "notify A when a task assigned to subject changes") via Socket.IO over mTLS.
- Mesh (N-to-N) federation.
- Write verbs with conflict resolution.
- Shared Step-CA (a "root of roots") so that onboarding doesn't require exchanging CA roots.
- Federated memory search over vector indexes with homomorphic filtering.

## 14. Locked Decisions (was "Open Questions")

| #   | Question                                                                  | Decision |
| --- | ------------------------------------------------------------------------- | -------- |
| 1   | What happens to a grant when its subject user is deleted?                 | **Revoke-on-delete.** All grants where the user is subject are auto-revoked and CRL'd. |
| 2   | Do we audit read-only requests?                                           | **Yes.** All federated reads are audited on the serving side. Bodies are not captured; query hash + metadata only. |
| 3   | Default rate limit?                                                       | **60 requests per minute per grant,** override-able per grant. |
| 4   | How do we verify the requesting-server's identity beyond the grant token? | **X.509 client cert tied to the user,** issued by Step-CA (per-server) or locally generated. Cert subject carries `grantId` + `subjectUserId`. |

### M1 decisions

- **Postgres deployment:** **Containerized** alongside the gateway in M1 (Docker Compose profile). Moving to a dedicated host is an M5+ operational concern, not a v1 feature.
- **Instance signing key:** **Separate** from the Step-CA key. Step-CA signs federation certs; the instance master key seals at-rest secrets (client keys, provider credentials). Different blast-radius, different rotation cadences.

## 15. Acceptance Criteria

- [ ] Two Mosaic Stack gateways on different hosts can establish a federation grant via the CLI-driven onboarding flow.
- [ ] Server A can query Server B for `tasks`, `notes`, `memory` respecting scope filters.
- [ ] A user on B with no grant cannot be queried by A, even if A has a valid grant for another user.
- [ ] Revoking a grant on B causes A's next request to fail with a clear error within one request cycle.
- [ ] Cert rotation happens automatically at T-7 days; an in-progress session survives rotation without user action.
- [ ] Rate-limit enforcement returns 429 with `Retry-After`; client backs off.
- [ ] With B unreachable, a session on A completes using local data and surfaces a "federation offline for ``" signal once.
- [ ] Every federated request appears in B's `federation_audit_log` within 1 second.
- [ ] A scope excluding `credentials` means credentials are not returnable even via `search` with matching keywords.
- [ ] `mosaic federation status` shows cert expiry, grant status, and last success/failure per peer.

## 16. Implementation Milestones (reference)

Milestones live in `docs/federation/MILESTONES.md` (to be authored next). High-level:

- **M1:** Server A runs `federated` tier standalone (Postgres + Valkey + pgvector, containerized). No peer yet.
- **M2:** Step-CA embedded; `federation_grants` / `federation_peers` schema + admin CLI.
- **M3:** Handshake + `list`/`get` verbs with scope enforcement.
- **M4:** `search` verb, audit log, rate limits.
- **M5:** Cache layer, offline-degradation UX, observability surfaces.
- **M6:** Revocation flows (admin + revoke-on-delete), cert auto-renewal.
- **M7:** Multi-user RBAC hardening on B, team-scoped grants end-to-end, acceptance suite green.

---

**Next step after PRD sign-off:** author `docs/federation/MILESTONES.md` with per-milestone acceptance tests and estimated token budget, then file tracking issues on `git.mosaicstack.dev/mosaicstack/stack`.