# Mosaic Stack — Federation PRD

**Status:** Draft v1 (locked for implementation)
**Owner:** Jason
**Date:** 2026-04-19
**Scope:** Enables cross-instance data federation between Mosaic Stack gateways with asymmetric trust, multi-tenant scoping, and no cross-boundary data persistence.

---

## 1. Problem Statement

Jarvis operates across 3–4 workstations in two physical locations (home, USC). The user currently reaches back to a single jarvis-brain checkout from every session and has tried OpenBrain to solve cross-session state, with poor results: cache invalidation, latency, opacity, and a hard dependency on a remote service.

The goal is a federation model where each user's **home instance** remains the source of truth for their personal data, and **work/shared instances** expose scoped data to that user's home instance on demand, without persisting anything across the boundary.

## 2. Goals

1. A user logged into their **home gateway** (Server A) can query their **work gateway** (Server B) in real time during a session.
2. Data returned from Server B is used in-session only; it is never written to Server A's storage.
3. Server B has multiple users, each with their own Server A. No user's data leaks to another user.
4. Federation works over public HTTPS (no VPN required). Tailscale is a supported optional overlay.
5. Freshness target: remote data is available within seconds, or is fetched at the agent's next data need.
6. Graceful degradation: if the remote instance is unreachable, the local session continues with local data and a clear "federation offline" signal.
7. Teams exist on both sides. A federation grant can share **team-owned** data without exposing other team members' personal data.
8. Auth and revocation use standard PKI (X.509) so that certificate tooling (Step-CA, rotation, OCSP, CRL) is available out of the box.

## 3. Non-Goals (v1)

- Mesh federation (N-to-N). v1 is strictly A↔B pairs.
- Cross-instance writes. All federation is **read-only** on the remote side.
- Shared agent sessions across instances. Sessions live on one instance; federation is data-plane only.
- Cross-instance SSO. Each instance owns its own BetterAuth identity store; federation is service-to-service, not user-to-user.
- Realtime push from B→A. v1 is pull-only (A pulls from B during a session).
- Global search index. Federation is query-by-query, not index replication.

## 4. User Stories

- **US-1 (Solo user at home):** As the sole user on Server A, I want my agent session on workstation-1 to see the same data it saw on workstation-2, without running OpenBrain.
- **US-2 (Cross-location):** As a user with a home server and a work server, I want a session on my home laptop to transparently pull my USC-owned tasks/notes when I ask for them.
- **US-3 (Work admin):** As the admin of mosaic.uscllc.com, I want to grant each employee's home gateway scoped read access to only their own data plus explicitly shared team data.
- **US-4 (Privacy boundary):** As employee A on mosaic.uscllc.com, my data must never appear in a session on employee B's home gateway, even if both are federated with mosaic.uscllc.com.
- **US-5 (Revocation):** As a work admin, when I delete an employee, their home gateway loses access within one request cycle.
- **US-6 (Offline):** As a user in a hotel with flaky wifi, my local session keeps working; federation calls fail fast and are reported as "offline," not hung.

## 5. Architecture Overview

```
┌─────────────────────────────────────┐       mTLS / X.509       ┌─────────────────────────────────────┐
│  Server A: mosaic.woltje.com        │ ───────────────────────► │  Server B: mosaic.uscllc.com        │
│  (home, master for Jason)           │ ◄── JSON over HTTPS      │  (work, multi-tenant)               │
│                                     │                          │                                     │
│  ┌──────────────┐  ┌──────────────┐ │                          │  ┌──────────────┐  ┌──────────────┐ │
│  │  Gateway     │  │  Postgres    │ │                          │  │  Gateway     │  │  Postgres    │ │
│  │  (NestJS)    │──│  (local SSOT)│ │                          │  │  (NestJS)    │──│ (tenant SSOT)│ │
│  └──────┬───────┘  └──────────────┘ │                          │  └──────┬───────┘  └──────────────┘ │
│         │                           │                          │         │                           │
│         │  FederationClient         │                          │         │  FederationServer         │
│         │  (outbound, scoped query) │                          │         │  (inbound, RBAC-gated)    │
│         └───────────────────────────┼──────────────────────────┼─────────┘                           │
│                                     │                          │                                     │
│  Step-CA (issues A's client cert)   │                          │  Step-CA (issues B's server cert,   │
│                                     │                          │         trusts A's CA root on grant)│
└─────────────────────────────────────┘                          └─────────────────────────────────────┘
```

- Federation is a **transport-layer** concern between two gateways, implemented as a new internal module on each gateway.
- Both sides run the same code. Direction (client vs. server role) is per-request.
- Nothing in the agent runtime changes: agents query the gateway, and the gateway decides local vs. remote.

## 6. Transport & Authentication

**Transport:** HTTPS with mutual TLS (mTLS).

**Identity:** X.509 client certificates issued by Step-CA. Each federation grant materializes as a client cert on the requesting side and a trust-anchor entry (CA root or explicit cert) on the serving side.

**Why mTLS over HMAC bearer tokens:**

- Standard rotation/revocation semantics (renew, CRL, OCSP).
- The cert carries identity claims (user, grant_id) whose authenticity can be verified without a separate DB lookup.
- Client certs never transit request bodies, so they can't be logged by accident.
- Transport is pinned at the TLS layer, not re-validated per handler.

**Cert contents (SAN + subject):**

- `CN=grant-<uuid>`
- `O=<requesting-server-hostname>` (e.g., `mosaic.woltje.com`)
- Custom OIDs embedded in SAN otherName:
  - `mosaic.federation.grantId` (UUID)
  - `mosaic.federation.subjectUserId` (user on the **serving** side that this grant acts-as)
- Default lifetime: **30 days**, with auto-renewal at T-7 days if the grant is still active.

**Step-CA topology (v1):** Each server runs its own Step-CA instance. During onboarding, the serving side imports the requesting side's CA root. A central/shared Step-CA is out of scope for v1.

**Handshake:**

1. Client (A) opens HTTPS to B with its grant cert.
2. B validates the cert chain against the trusted CA roots for that grant.
3. B extracts `grantId` and `subjectUserId` from the cert.
4. B loads the grant record and checks that it is `active`, not revoked, and not expired.
5. B enforces scope and rate limit for this grant.
6. Request proceeds; response returned.

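Steps 3 and 4 of the handshake can be sketched as a pure function. This is a minimal sketch under assumptions, not the gateway's actual API: `GrantRecord` mirrors a few `federation_grants` columns, while `CertClaims`, `checkGrant`, and the rejection reason strings are hypothetical names. Chain validation (step 2) is assumed to have already happened at the TLS layer, and comparing against `active_cert_fingerprint` is one possible way to reject superseded certs.

```typescript
// Hypothetical shapes; only the field meanings come from this PRD.
type GrantStatus = "pending" | "active" | "suspended" | "revoked";

interface GrantRecord {
  id: string;
  subjectUserId: string;
  status: GrantStatus;
  activeCertFingerprint: string; // SHA-256 of the currently valid client cert
}

interface CertClaims {
  grantId: string; // from the mosaic.federation.grantId OID
  subjectUserId: string; // from the mosaic.federation.subjectUserId OID
  fingerprint: string; // SHA-256 of the presented cert
  notAfter: Date;
}

type GrantCheck =
  | { ok: true; grant: GrantRecord }
  | { ok: false; reason: string };

// Decide whether a request presenting `claims` may proceed to scope checks.
function checkGrant(
  claims: CertClaims,
  grant: GrantRecord | undefined,
  now: Date,
): GrantCheck {
  if (!grant || grant.id !== claims.grantId) {
    return { ok: false, reason: "unknown_grant" };
  }
  if (grant.status !== "active") {
    return { ok: false, reason: `grant_${grant.status}` };
  }
  if (claims.notAfter.getTime() <= now.getTime()) {
    return { ok: false, reason: "cert_expired" };
  }
  if (grant.subjectUserId !== claims.subjectUserId) {
    return { ok: false, reason: "subject_mismatch" };
  }
  if (grant.activeCertFingerprint !== claims.fingerprint) {
    return { ok: false, reason: "stale_cert" };
  }
  return { ok: true, grant };
}
```

Keeping the check free of I/O makes each rejection path easy to unit-test; loading the grant row and writing the audit entry would wrap around it.
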
## 7. Data Model

All tables live on **each instance's own Postgres**. Federation grants are bilateral: each side has a record of the grant.

### 7.1 `federation_grants` (on serving side, Server B)

| Field                       | Type        | Notes                                                      |
| --------------------------- | ----------- | ---------------------------------------------------------- |
| `id`                        | uuid PK     |                                                            |
| `subject_user_id`           | uuid FK     | Which local user this grant acts-as                        |
| `requesting_server`         | text        | Hostname of requesting gateway (e.g., `mosaic.woltje.com`) |
| `requesting_ca_fingerprint` | text        | SHA-256 of trusted CA root                                 |
| `active_cert_fingerprint`   | text        | SHA-256 of currently valid client cert                     |
| `scope`                     | jsonb       | See §8                                                     |
| `rate_limit_rpm`            | int         | Default 60                                                 |
| `status`                    | enum        | `pending`, `active`, `suspended`, `revoked`                |
| `created_at`                | timestamptz |                                                            |
| `activated_at`              | timestamptz |                                                            |
| `revoked_at`                | timestamptz |                                                            |
| `last_used_at`              | timestamptz |                                                            |
| `notes`                     | text        | Admin-visible description                                  |

### 7.2 `federation_peers` (on requesting side, Server A)

| Field                 | Type        | Notes                                            |
| --------------------- | ----------- | ------------------------------------------------ |
| `id`                  | uuid PK     |                                                  |
| `peer_hostname`       | text        | e.g., `mosaic.uscllc.com`                        |
| `peer_ca_fingerprint` | text        | SHA-256 of peer's CA root                        |
| `grant_id`            | uuid        | The grant ID assigned by the peer                |
| `local_user_id`       | uuid FK     | Who on Server A this federation belongs to       |
| `client_cert_pem`     | text (enc)  | Current client cert (PEM); rotated automatically |
| `client_key_pem`      | text (enc)  | Private key (encrypted at rest)                  |
| `cert_expires_at`     | timestamptz |                                                  |
| `status`              | enum        | `pending`, `active`, `degraded`, `revoked`       |
| `last_success_at`     | timestamptz |                                                  |
| `last_failure_at`     | timestamptz |                                                  |
| `notes`               | text        |                                                  |

### 7.3 `federation_audit_log` (on serving side, Server B)

| Field         | Type        | Notes                                            |
| ------------- | ----------- | ------------------------------------------------ |
| `id`          | uuid PK     |                                                  |
| `grant_id`    | uuid FK     |                                                  |
| `occurred_at` | timestamptz | indexed                                          |
| `verb`        | text        | `query`, `handshake`, `rejected`, `rate_limited` |
| `resource`    | text        | e.g., `tasks`, `notes`, `credentials`            |
| `query_hash`  | text        | SHA-256 of normalized query (no payload stored)  |
| `outcome`     | text        | `ok`, `denied`, `error`                          |
| `bytes_out`   | int         |                                                  |
| `latency_ms`  | int         |                                                  |

**Audit policy:** Every federation request is logged on the serving side. All v1 requests are read-only, and no bodies are captured: only the query hash and metadata are stored. Retention: 90 days hot, then roll to cold storage.

## 8. RBAC & Scope

Every federation grant has a scope object that answers three questions for every inbound request:

1. **Who is acting?** The `subject_user_id` from the cert.
2. **What resources?** An allowlist of resource types (`tasks`, `notes`, `credentials`, `memory`, `teams/:id/tasks`, …).
3. **Which filters?** Predicates applied on top of the subject's normal RBAC (see below).

### 8.1 Scope schema

```json
{
  "resources": ["tasks", "notes", "memory"],
  "filters": {
    "tasks": { "include_teams": ["team_uuid_1", "team_uuid_2"], "include_personal": true },
    "notes": { "include_personal": true, "include_teams": [] },
    "memory": { "include_personal": true }
  },
  "excluded_resources": ["credentials", "api_keys"],
  "max_rows_per_query": 500
}
```

### 8.2 Access rule (enforced on serving side)

For every inbound federated query on resource R:

1. Resolve effective identity → `subject_user_id`.
2. Check that R is in `scope.resources` and NOT in `scope.excluded_resources`. Otherwise return 403.
3. Evaluate the user's **normal RBAC** (what they would see if they logged into Server B directly).
4. Intersect with the scope filter (e.g., only team X, only personal).
5. Apply `max_rows_per_query`.
6. Return; log to audit.

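Steps 2, 4, and 5 of the access rule can be sketched as follows. This is an illustrative sketch, not the gateway's implementation: the `Scope` type is modeled on the schema in §8.1, while `Row`, `isResourceAllowed`, and `applyScope` are hypothetical names. It assumes that a row carrying a `teamId` is team-owned, that a row without one is personal, and that a resource with no filter entry returns nothing.

```typescript
// Hypothetical types modeled on the scope schema in §8.1.
interface ResourceFilter {
  include_personal?: boolean;
  include_teams?: string[];
}

interface Scope {
  resources: string[];
  filters: Record<string, ResourceFilter>;
  excluded_resources: string[];
  max_rows_per_query: number;
}

// A row that has ALREADY passed the subject's normal RBAC on Server B (step 3).
interface Row {
  id: string;
  ownerUserId: string;
  teamId?: string; // assumption: present only on team-owned rows
}

// Step 2: allowlist check, with the exclusion list always winning.
function isResourceAllowed(scope: Scope, resource: string): boolean {
  return (
    scope.resources.includes(resource) &&
    !scope.excluded_resources.includes(resource)
  );
}

// Steps 4 and 5: narrow the RBAC-visible rows by the scope filter, then cap.
function applyScope(
  scope: Scope,
  resource: string,
  subjectUserId: string,
  rbacVisible: Row[],
): Row[] {
  if (!isResourceAllowed(scope, resource)) {
    throw new Error("403: resource not in scope");
  }
  const filter = scope.filters[resource] ?? {};
  const teams = new Set(filter.include_teams ?? []);
  const narrowed = rbacVisible.filter((row) =>
    row.teamId !== undefined
      ? teams.has(row.teamId)
      : filter.include_personal === true && row.ownerUserId === subjectUserId,
  );
  return narrowed.slice(0, scope.max_rows_per_query);
}
```

Because `applyScope` only ever filters the rows that RBAC already allowed, the scope can narrow what the subject sees but cannot widen it.
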
### 8.3 Team boundary guarantees

- Scope filters only narrow the subject's native RBAC; they never widen it. A grant cannot confer access the user would not have had themselves.
- `include_teams` means "only these teams," not "these teams in addition to all teams."
- `include_personal: false` hides the user's personal data entirely from federation, even if they own it. This is useful for work-only accounts.

### 8.4 No cross-user leakage

When Server B has multiple users (employees) all federating back to their own Server A:

- Each employee has their own grant with their own `subject_user_id`.
- The cert is bound to a specific grant; there is no mechanism by which one grant's cert can be used to impersonate another.
- Audit log is per-grant.

## 9. Query Model

Federation exposes a **narrow read API**, not arbitrary SQL.

### 9.1 Supported verbs (v1)

| Verb           | Purpose                                    | Returns                         |
| -------------- | ------------------------------------------ | ------------------------------- |
| `list`         | Paginated list of a resource type          | Array of resources              |
| `get`          | Fetch a single resource by id              | One resource or 404             |
| `search`       | Keyword search within allowed resources    | Ranked list of hits             |
| `capabilities` | What this grant is allowed to do right now | Scope object + rate-limit state |

### 9.2 Not in v1

- Write verbs.
- Aggregations / analytics.
- Streaming / subscriptions (future: see §13).

### 9.3 Agent-facing integration

Agents never call federation directly. Instead:

- The gateway query layer accepts `source: "local" | "federated:<peer_hostname>" | "all"`.
- `"all"` fans out in parallel, merges the results, and tags each result with `_source`.
- Federation results are in-memory only; the gateway does not persist them.

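The `"all"` fan-out can be sketched as below. This is a sketch under assumptions: `Fetcher`, `FanOutResult`, and `queryAll` are hypothetical names, each source is represented by a function returning a promise of rows, and a failing source is reported in an `offline` list rather than failing the whole query. Timeouts are left out for brevity.

```typescript
// Hypothetical: one fetch function per source ("local", "federated:<peer>").
type Fetcher<T> = () => Promise<T[]>;

interface FanOutResult<T> {
  rows: Array<T & { _source: string }>;
  offline: string[]; // sources that failed, to surface as "federation offline"
}

// Query every source in parallel; a failing peer never rejects the whole call.
async function queryAll<T extends object>(
  sources: Record<string, Fetcher<T>>,
): Promise<FanOutResult<T>> {
  const names = Object.keys(sources);
  const settled = await Promise.all(
    names.map((name) =>
      sources[name]().then(
        (rows) => ({ name, rows, ok: true }),
        () => ({ name, rows: [] as T[], ok: false }),
      ),
    ),
  );

  const result: FanOutResult<T> = { rows: [], offline: [] };
  for (const outcome of settled) {
    if (!outcome.ok) {
      result.offline.push(outcome.name);
      continue;
    }
    for (const row of outcome.rows) {
      // Results stay in memory only; nothing here writes to storage.
      result.rows.push({ ...row, _source: outcome.name });
    }
  }
  return result;
}
```

Merging in source order keeps local rows first when `"local"` is listed first; any ranking across sources would be a separate design decision.
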
## 10. Caching

- **In-memory response cache** with short TTL (default 30s) for `list` and `get`. `search` is not cached.
- Cache is keyed by `(grant_id, verb, resource, query_hash)`.
- Cache is flushed on cert rotation and on grant revocation.
- No disk cache. No cross-session cache.

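A minimal sketch of such a cache follows, assuming a plain in-process `Map`. `FederationCache`, `CacheKey`, and `flushGrant` are hypothetical names; the clock is passed in explicitly so that expiry is easy to test. The key parts are joined with `|`, which assumes none of them contains that character.

```typescript
interface CacheKey {
  grantId: string;
  verb: "list" | "get"; // `search` is never cached
  resource: string;
  queryHash: string;
}

interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
  grantId: string;
}

class FederationCache<V> {
  private entries = new Map<string, CacheEntry<V>>();
  private ttlMs: number;

  constructor(ttlMs: number = 30000) {
    this.ttlMs = ttlMs;
  }

  private static keyOf(k: CacheKey): string {
    return [k.grantId, k.verb, k.resource, k.queryHash].join("|");
  }

  get(k: CacheKey, nowMs: number): V | undefined {
    const key = FederationCache.keyOf(k);
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= nowMs) {
      this.entries.delete(key); // lazily evict expired entries
      return undefined;
    }
    return hit.value;
  }

  set(k: CacheKey, value: V, nowMs: number): void {
    this.entries.set(FederationCache.keyOf(k), {
      value,
      expiresAt: nowMs + this.ttlMs,
      grantId: k.grantId,
    });
  }

  // Call on cert rotation and on grant revocation.
  flushGrant(grantId: string): void {
    this.entries.forEach((entry, key) => {
      if (entry.grantId === grantId) this.entries.delete(key);
    });
  }
}
```

Including `grant_id` in the key means two grants never share a cached response, even for an identical query.
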
## 11. Bootstrap & Onboarding

### 11.1 Instance capability tiers

| Tier         | Storage  | Queue   | Memory   | Can federate?         |
| ------------ | -------- | ------- | -------- | --------------------- |
| `local`      | PGlite   | in-proc | keyword  | No                    |
| `standalone` | Postgres | Valkey  | keyword  | No (can be client)    |
| `federated`  | Postgres | Valkey  | pgvector | Yes (server + client) |

Federation requires `federated` tier on **both** sides.

### 11.2 Onboarding flow (admin-driven)

1. Admin on Server B runs `mosaic federation grant create --user <user-id> --peer <peer-hostname> --scope-file scope.json`.
2. Server B generates a `grant_id`, prints a one-time enrollment URL containing the grant ID + B's CA root fingerprint.
3. Admin on Server A (or the user themselves, if allowed) runs `mosaic federation peer add <enrollment-url>`.
4. Server A's Step-CA generates a CSR for the new grant. A submits the CSR to B over a short-lived enrollment endpoint (single-use token in the enrollment URL).
5. B's Step-CA signs the cert (with grant ID embedded in SAN OIDs), returns it.
6. A stores the signed cert + private key (encrypted) in `federation_peers`.
7. Grant status flips from `pending` to `active` on both sides.
8. Cert auto-renews at T-7 days using the standard Step-CA renewal flow as long as the grant remains active.

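The renewal trigger in step 8 reduces to a small date comparison. The sketch below is illustrative only: `shouldRenew` and its constants are hypothetical names, and treating an already-expired cert as needing re-enrollment rather than renewal is an assumption, not something this PRD specifies.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const RENEW_BEFORE_DAYS = 7; // the "T-7 days" window

// True when the cert is inside the renewal window and the grant is still active.
function shouldRenew(
  certExpiresAt: Date,
  now: Date,
  grantActive: boolean,
): boolean {
  if (!grantActive) return false;
  const remainingMs = certExpiresAt.getTime() - now.getTime();
  // Assumption: an expired cert cannot be renewed and must be re-enrolled.
  if (remainingMs <= 0) return false;
  return remainingMs <= RENEW_BEFORE_DAYS * DAY_MS;
}
```

With the default 30-day lifetime, a cert issued on day 0 becomes eligible for renewal on day 23.
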
### 11.3 Revocation

- **Admin-initiated:** `mosaic federation grant revoke <grant-id>` on B flips status to `revoked`, adds the cert to B's CRL, and writes an audit entry.
- **Revoke-on-delete:** Deleting a user on B automatically revokes all grants where that user is the subject.
- Server A learns of revocation on the next request (TLS handshake fails) and flips the peer to `revoked`.

### 11.4 Rate limit

Default `60 req/min` per grant. Configurable per grant. Enforced at the serving side. A rate-limited request returns `429` with `Retry-After`.

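One simple way to enforce a per-grant limit is a fixed one-minute window counter. The sketch below is an assumption-laden illustration, not the chosen algorithm: `GrantRateLimiter` and `RateDecision` are hypothetical names, the state lives in process memory, and a sliding window or token bucket would be equally valid.

```typescript
interface RateDecision {
  allowed: boolean;
  status: 200 | 429;
  retryAfterSeconds?: number; // value for the Retry-After header when denied
}

const WINDOW_MS = 60000;

class GrantRateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();

  check(grantId: string, limitPerMinute: number, nowMs: number): RateDecision {
    const windowStart = Math.floor(nowMs / WINDOW_MS) * WINDOW_MS;
    const current = this.windows.get(grantId);
    // Start a fresh counter whenever the clock moves into a new window.
    const state =
      current && current.windowStart === windowStart
        ? current
        : { windowStart, count: 0 };
    this.windows.set(grantId, state);

    if (state.count >= limitPerMinute) {
      const retryAfterSeconds = Math.ceil((windowStart + WINDOW_MS - nowMs) / 1000);
      return { allowed: false, status: 429, retryAfterSeconds };
    }
    state.count += 1;
    return { allowed: true, status: 200 };
  }
}
```

A fixed window allows short bursts across a window boundary, which may be acceptable for a default of 60 requests per minute; the per-grant `rate_limit_rpm` value would be passed as `limitPerMinute`.
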
## 12. Operational Concerns

- **Observability:** Each federation request emits an OTEL span with `grant_id`, `peer`, `verb`, `resource`, `outcome`, `latency_ms`. Traces correlate across both servers via W3C traceparent.
- **Health check:** `mosaic federation status` on each side shows active grants, last-success times, cert expirations, and any CRL mismatches.
- **Backpressure:** If the serving side is overloaded, it returns `503` with a structured body; the client marks the peer `degraded` and falls back to local-only until the next successful handshake.
- **Secrets:** `client_key_pem` in `federation_peers` is sealed with the instance's master key, the same mechanism used for `provider_credentials`.
- **Credentials never cross by default:** The `credentials` resource type is in the default excluded list. It must be explicitly added to scope (admin action, logged) and even then is per-grant and per-user.

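The backpressure rule can be read as a small state machine over the `status` values of `federation_peers`. The sketch below is one possible reading: `PeerEvent` and `nextPeerStatus` are hypothetical names, and treating an unreachable peer the same as a `503` is an assumption.

```typescript
type PeerStatus = "pending" | "active" | "degraded" | "revoked";
type PeerEvent = "success" | "overloaded_503" | "unreachable" | "revoked";

// Compute the peer's next status from the outcome of one federation call.
function nextPeerStatus(current: PeerStatus, event: PeerEvent): PeerStatus {
  // Revocation is terminal; a later success must not resurrect the peer.
  if (current === "revoked" || event === "revoked") return "revoked";
  // Assumption: a pending peer only becomes active through enrollment.
  if (current === "pending") return "pending";
  return event === "success" ? "active" : "degraded";
}
```

While a peer is `degraded`, the gateway would serve local data only and return to `active` on the next successful call.
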
## 13. Future (post-v1)

- B→A push (e.g., "notify A when a task assigned to subject changes") via Socket.IO over mTLS.
- Mesh (N-to-N) federation.
- Write verbs with conflict resolution.
- Shared Step-CA (a "root of roots") so that onboarding doesn't require exchanging CA roots.
- Federated memory search over vector indexes with homomorphic filtering.

## 14. Locked Decisions (was "Open Questions")

| #   | Question                                                                  | Decision                                                                                                                                   |
| --- | ------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| 1   | What happens to a grant when its subject user is deleted?                 | **Revoke-on-delete.** All grants where the user is subject are auto-revoked and CRL'd.                                                     |
| 2   | Do we audit read-only requests?                                           | **Yes.** All federated reads are audited on the serving side. Bodies are not captured; query hash + metadata only.                         |
| 3   | Default rate limit?                                                       | **60 requests per minute per grant,** override-able per grant.                                                                             |
| 4   | How do we verify the requesting-server's identity beyond the grant token? | **X.509 client cert tied to the user,** issued by Step-CA (per-server) or locally generated. The cert carries `grantId` + `subjectUserId`. |

### M1 decisions

- **Postgres deployment:** **Containerized** alongside the gateway in M1 (Docker Compose profile). Moving to a dedicated host is an M5+ operational concern, not a v1 feature.
- **Instance signing key:** **Separate** from the Step-CA key. Step-CA signs federation certs; the instance master key seals at-rest secrets (client keys, provider credentials). Different blast radius, different rotation cadences.

## 15. Acceptance Criteria

- [ ] Two Mosaic Stack gateways on different hosts can establish a federation grant via the CLI-driven onboarding flow.
- [ ] Server A can query Server B for `tasks`, `notes`, `memory` respecting scope filters.
- [ ] A user on B with no grant cannot be queried by A, even if A has a valid grant for another user.
- [ ] Revoking a grant on B causes A's next request to fail with a clear error within one request cycle.
- [ ] Cert rotation happens automatically at T-7 days; an in-progress session survives rotation without user action.
- [ ] Rate-limit enforcement returns 429 with `Retry-After`; client backs off.
- [ ] With B unreachable, a session on A completes using local data and surfaces a "federation offline for `<peer>`" signal once.
- [ ] Every federated request appears in B's `federation_audit_log` within 1 second.
- [ ] A scope excluding `credentials` means credentials are not returnable even via `search` with matching keywords.
- [ ] `mosaic federation status` shows cert expiry, grant status, and last success/failure per peer.

## 16. Implementation Milestones (reference)

Milestones live in `docs/federation/MILESTONES.md` (to be authored next). High-level:

- **M1:** Server A runs `federated` tier standalone (Postgres + Valkey + pgvector, containerized). No peer yet.
- **M2:** Step-CA embedded; `federation_grants` / `federation_peers` schema + admin CLI.
- **M3:** Handshake + `list`/`get` verbs with scope enforcement.
- **M4:** `search` verb, audit log, rate limits.
- **M5:** Cache layer, offline-degradation UX, observability surfaces.
- **M6:** Revocation flows (admin + revoke-on-delete), cert auto-renewal.
- **M7:** Multi-user RBAC hardening on B, team-scoped grants end-to-end, acceptance suite green.

---

**Next step after PRD sign-off:** author `docs/federation/MILESTONES.md` with per-milestone acceptance tests and estimated token budget, then file tracking issues on `git.mosaicstack.dev/mosaicstack/stack`.