Mosaic Stack — Federation PRD
Status: Draft v1 (locked for implementation)
Owner: Jason
Date: 2026-04-19
Scope: Enables cross-instance data federation between Mosaic Stack gateways with asymmetric trust, multi-tenant scoping, and no cross-boundary data persistence.
1. Problem Statement
Jarvis operates across 3–4 workstations in two physical locations (home, USC). The user currently reaches back to a single jarvis-brain checkout from every session, and has tried OpenBrain to solve cross-session state — with poor results (cache invalidation, latency, opacity, hard dependency on a remote service).
The goal is a federation model where each user's home instance remains the source of truth for their personal data, and work/shared instances expose scoped data to that user's home instance on demand — without persisting anything across the boundary.
2. Goals
- A user logged into their home gateway (Server A) can query their work gateway (Server B) in real time during a session.
- Data returned from Server B is used in-session only; never written to Server A storage.
- Server B has multiple users, each with their own Server A. No user's data leaks to another user.
- Federation works over public HTTPS (no VPN required). Tailscale is a supported optional overlay.
- Sync latency target: seconds, or at the next data need of the agent.
- Graceful degradation: if the remote instance is unreachable, the local session continues with local data and a clear "federation offline" signal.
- Teams exist on both sides. A federation grant can share team-owned data without exposing other team members' personal data.
- Auth and revocation use standard PKI (X.509) so that certificate tooling (Step-CA, rotation, OCSP, CRL) is available out of the box.
3. Non-Goals (v1)
- Mesh federation (N-to-N). v1 is strictly A↔B pairs.
- Cross-instance writes. All federation is read-only on the remote side.
- Shared agent sessions across instances. Sessions live on one instance; federation is data-plane only.
- Cross-instance SSO. Each instance owns its own BetterAuth identity store; federation is service-to-service, not user-to-user.
- Realtime push from B→A. v1 is pull-only (A pulls from B during a session).
- Global search index. Federation is query-by-query, not index replication.
4. User Stories
- US-1 (Solo user at home): As the sole user on Server A, I want my agent session on workstation-1 to see the same data it saw on workstation-2, without running OpenBrain.
- US-2 (Cross-location): As a user with a home server and a work server, I want a session on my home laptop to transparently pull my USC-owned tasks/notes when I ask for them.
- US-3 (Work admin): As the admin of mosaic.uscllc.com, I want to grant each employee's home gateway scoped read access to only their own data plus explicitly-shared team data.
- US-4 (Privacy boundary): As employee A on mosaic.uscllc.com, my data must never appear in a session on employee B's home gateway — even if both are federated with uscllc.com.
- US-5 (Revocation): As a work admin, when I delete an employee, their home gateway loses access within one request cycle.
- US-6 (Offline): As a user in a hotel with flaky wifi, my local session keeps working; federation calls fail fast and are reported as "offline," not hung.
5. Architecture Overview
┌─────────────────────────────────────┐ mTLS / X.509 ┌─────────────────────────────────────┐
│ Server A — mosaic.woltje.com │ ───────────────────────► │ Server B — mosaic.uscllc.com │
│ (home, master for Jason) │ ◄── JSON over HTTPS │ (work, multi-tenant) │
│ │ │ │
│ ┌──────────────┐ ┌──────────────┐ │ │ ┌──────────────┐ ┌──────────────┐ │
│ │ Gateway │ │ Postgres │ │ │ │ Gateway │ │ Postgres │ │
│ │ (NestJS) │──│ (local SSOT)│ │ │ │ (NestJS) │──│ (tenant SSOT)│ │
│ └──────┬───────┘ └──────────────┘ │ │ └──────┬───────┘ └──────────────┘ │
│ │ │ │ │ │
│ │ FederationClient │ │ │ FederationServer │
│ │ (outbound, scoped query) │ │ │ (inbound, RBAC-gated) │
│ └───────────────────────────┼──────────────────────────┼────────┘ │
│ │ │ │
│ Step-CA (issues A's client cert) │ │ Step-CA (issues B's server cert, │
│ │ │ trusts A's CA root on grant)│
└─────────────────────────────────────┘ └──────────────────────────────────────┘
- Federation is a transport-layer concern between two gateways, implemented as a new internal module on each gateway.
- Both sides run the same code. Direction (client vs. server role) is per-request.
- Nothing in the agent runtime changes — agents query the gateway; the gateway decides local vs. remote.
6. Transport & Authentication
Transport: HTTPS with mutual TLS (mTLS).
Identity: X.509 client certificates issued by Step-CA. Each federation grant materializes as a client cert on the requesting side and a trust-anchor entry (CA root or explicit cert) on the serving side.
Why mTLS over HMAC bearer tokens:
- Standard rotation/revocation semantics (renew, CRL, OCSP).
- The cert subject carries identity claims (user, grant_id) that don't need a separate DB lookup to verify authenticity.
- Client certs never transit request bodies, so they can't be logged by accident.
- Transport is pinned at the TLS layer, not re-validated per-handler.
Cert contents (SAN + subject):
- `CN=grant-<uuid>`
- `O=<requesting-server-hostname>` (e.g., `mosaic.woltje.com`)
- Custom OIDs embedded in SAN otherName:
  - `mosaic.federation.grantId` (UUID)
  - `mosaic.federation.subjectUserId` (user on the serving side that this grant acts-as)
- Default lifetime: 30 days, with auto-renewal at T-7 days if the grant is still active.
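The lifetime and renewal window above reduce to simple date arithmetic. The following is a minimal sketch, not the gateway's actual renewal job; `shouldRenew` and `expiryFor` are hypothetical helper names, and the actual renewal would go through Step-CA.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: expiry of a cert issued at `issuedAt`
// with the default 30-day lifetime from this section.
function expiryFor(issuedAt: Date, lifetimeDays: number = 30): Date {
  return new Date(issuedAt.getTime() + lifetimeDays * DAY_MS);
}

// Hypothetical helper: true once we are inside the T-7 renewal window
// and the grant is still active. Inactive grants are never renewed.
function shouldRenew(
  certExpiresAt: Date,
  grantActive: boolean,
  now: Date,
  renewBeforeDays: number = 7,
): boolean {
  if (!grantActive) return false;
  const renewAt = certExpiresAt.getTime() - renewBeforeDays * DAY_MS;
  return now.getTime() >= renewAt;
}
```

A periodic job on the requesting side could call `shouldRenew` with `cert_expires_at` from `federation_peers` and trigger renewal when it returns true.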
Step-CA topology (v1): Each server runs its own Step-CA instance. During onboarding, the serving side imports the requesting side's CA root. A central/shared Step-CA is out of scope for v1.
Handshake:
- Client (A) opens HTTPS to B with its grant cert.
- B validates cert chain against trusted CA roots for that grant.
- B extracts `grantId` and `subjectUserId` from the cert.
- B loads the grant record, checks it is `active`, not revoked, and not expired.
- B enforces scope and rate-limit for this grant.
- Request proceeds; response returned.
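The grant checks in the handshake can be sketched as a pure function that runs after TLS chain validation. This is an illustrative sketch only: `CertClaims`, `GrantRecord`, `checkGrant`, and the reason strings are hypothetical names, "not expired" is interpreted here as the cert's `notAfter`, and comparing against `active_cert_fingerprint` is an assumption about how that column is used.

```typescript
type GrantStatus = "pending" | "active" | "suspended" | "revoked";

// Subset of federation_grants (see the data model); names are illustrative.
interface GrantRecord {
  id: string;
  subjectUserId: string;
  status: GrantStatus;
  activeCertFingerprint: string;
}

// Claims assumed to be already extracted from a chain-validated client cert.
interface CertClaims {
  grantId: string;
  subjectUserId: string;
  fingerprint: string; // SHA-256 of the presented cert
  notAfter: Date;
}

type HandshakeResult =
  | { ok: true; grant: GrantRecord }
  | { ok: false; reason: string };

function checkGrant(
  claims: CertClaims,
  lookup: (grantId: string) => GrantRecord | undefined,
  now: Date,
): HandshakeResult {
  const grant = lookup(claims.grantId);
  if (!grant) return { ok: false, reason: "unknown_grant" };
  // Covers pending, suspended, and revoked grants.
  if (grant.status !== "active") return { ok: false, reason: "grant_" + grant.status };
  if (claims.notAfter.getTime() <= now.getTime()) return { ok: false, reason: "cert_expired" };
  // The cert must act as the same user the grant was issued for.
  if (grant.subjectUserId !== claims.subjectUserId) return { ok: false, reason: "subject_mismatch" };
  // Assumption: only the currently registered cert is accepted.
  if (grant.activeCertFingerprint !== claims.fingerprint) return { ok: false, reason: "stale_cert" };
  return { ok: true, grant };
}
```

Scope and rate-limit enforcement would run only after this function returns `ok: true`.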
7. Data Model
All tables live on each instance's own Postgres. Federation grants are bilateral — each side has a record of the grant.
7.1 federation_grants (on serving side, Server B)
| Field | Type | Notes |
|---|---|---|
| `id` | uuid PK | |
| `subject_user_id` | uuid FK | Which local user this grant acts-as |
| `requesting_server` | text | Hostname of requesting gateway (e.g., woltje.com) |
| `requesting_ca_fingerprint` | text | SHA-256 of trusted CA root |
| `active_cert_fingerprint` | text | SHA-256 of currently valid client cert |
| `scope` | jsonb | See §8 |
| `rate_limit_rpm` | int | Default 60 |
| `status` | enum | pending, active, suspended, revoked |
| `created_at` | timestamptz | |
| `activated_at` | timestamptz | |
| `revoked_at` | timestamptz | |
| `last_used_at` | timestamptz | |
| `notes` | text | Admin-visible description |
7.2 federation_peers (on requesting side, Server A)
| Field | Type | Notes |
|---|---|---|
| `id` | uuid PK | |
| `peer_hostname` | text | e.g., mosaic.uscllc.com |
| `peer_ca_fingerprint` | text | SHA-256 of peer's CA root |
| `grant_id` | uuid | The grant ID assigned by the peer |
| `local_user_id` | uuid FK | Who on Server A this federation belongs to |
| `client_cert_pem` | text (enc) | Current client cert (PEM); rotated automatically |
| `client_key_pem` | text (enc) | Private key (encrypted at rest) |
| `cert_expires_at` | timestamptz | |
| `status` | enum | pending, active, degraded, revoked |
| `last_success_at` | timestamptz | |
| `last_failure_at` | timestamptz | |
| `notes` | text | |
7.3 federation_audit_log (on serving side, Server B)
| Field | Type | Notes |
|---|---|---|
| `id` | uuid PK | |
| `grant_id` | uuid FK | |
| `occurred_at` | timestamptz | indexed |
| `verb` | text | query, handshake, rejected, rate_limited |
| `resource` | text | e.g., tasks, notes, credentials |
| `query_hash` | text | SHA-256 of normalized query (no payload stored) |
| `outcome` | text | ok, denied, error |
| `bytes_out` | int | |
| `latency_ms` | int | |
Audit policy: Every federation request is logged on the serving side. Read-only requests only — no body capture. Retention: 90 days hot, then roll to cold storage.
8. RBAC & Scope
Every federation grant has a scope object that answers three questions for every inbound request:
- Who is acting? The `subject_user_id` from the cert.
- What resources? An allowlist of resource types (`tasks`, `notes`, `credentials`, `memory`, `teams/:id/tasks`, …).
- Which filter expression? Predicates applied on top of the subject's normal RBAC (see below).
8.1 Scope schema
{
  "resources": ["tasks", "notes", "memory"],
  "filters": {
    "tasks": { "include_teams": ["team_uuid_1", "team_uuid_2"], "include_personal": true },
    "notes": { "include_personal": true, "include_teams": [] },
    "memory": { "include_personal": true }
  },
  "excluded_resources": ["credentials", "api_keys"],
  "max_rows_per_query": 500
}
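One way to catch malformed scope files at grant-creation time is a typed shape plus a small consistency check. The sketch below mirrors the JSON fields above; the `GrantScope` and `ResourceFilter` type names, the `validateScope` function, and the specific checks are assumptions for illustration, not part of the locked design.

```typescript
interface ResourceFilter {
  include_personal?: boolean;
  include_teams?: string[];
}

interface GrantScope {
  resources: string[];
  filters: Record<string, ResourceFilter>;
  excluded_resources: string[];
  max_rows_per_query: number;
}

// Returns a list of human-readable problems; an empty list means the
// scope is internally consistent (hypothetical checks).
function validateScope(scope: GrantScope): string[] {
  const errors: string[] = [];
  const excluded = new Set(scope.excluded_resources);
  for (const r of scope.resources) {
    if (excluded.has(r)) errors.push('"' + r + '" is both allowed and excluded');
  }
  for (const r of Object.keys(scope.filters)) {
    if (scope.resources.indexOf(r) === -1) {
      errors.push('filter for "' + r + '" has no matching resource');
    }
  }
  if (!Number.isInteger(scope.max_rows_per_query) || scope.max_rows_per_query <= 0) {
    errors.push("max_rows_per_query must be a positive integer");
  }
  return errors;
}
```

Rejecting a scope that lists a resource as both allowed and excluded keeps the later 403 check unambiguous.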
8.2 Access rule (enforced on serving side)
For every inbound federated query on resource R:
- Resolve effective identity → `subject_user_id`.
- Check R is in `scope.resources` and NOT in `scope.excluded_resources`. Otherwise 403.
- Evaluate the user's normal RBAC (what would they see if they logged into Server B directly?).
- Intersect with the scope filter (e.g., only team X, only personal).
- Apply `max_rows_per_query`.
- Return; log to audit.
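The access rule can be sketched as a function over rows that the subject's native RBAC has already admitted. This is a sketch under stated assumptions: the `Row` shape (a personal row has `teamId === null`), the `ForbiddenError` class, and the deny-by-default behaviour when a resource has no filter entry are all illustrative choices, not confirmed design.

```typescript
interface ResourceFilter {
  include_personal?: boolean;
  include_teams?: string[];
}

interface GrantScope {
  resources: string[];
  filters: Record<string, ResourceFilter>;
  excluded_resources: string[];
  max_rows_per_query: number;
}

// Hypothetical row shape: teamId is null for personal data.
interface Row {
  id: string;
  ownerId: string;
  teamId: string | null;
}

class ForbiddenError extends Error {}

function applyAccessRule(
  resource: string,
  scope: GrantScope,
  subjectUserId: string,
  rbacVisible: Row[], // rows the subject could see when logged into Server B directly
): Row[] {
  const allowed =
    scope.resources.indexOf(resource) !== -1 &&
    scope.excluded_resources.indexOf(resource) === -1;
  if (!allowed) throw new ForbiddenError("403: resource not in scope");

  // Assumption: a resource without a filter entry exposes nothing.
  const filter: ResourceFilter = scope.filters[resource] || {};
  const teams = new Set(filter.include_teams || []);

  // Intersect native RBAC with the scope filter; this can only narrow.
  const scoped = rbacVisible.filter((row) =>
    row.teamId === null
      ? filter.include_personal === true && row.ownerId === subjectUserId
      : teams.has(row.teamId),
  );
  return scoped.slice(0, scope.max_rows_per_query);
}
```

Because the function only filters `rbacVisible`, it cannot return a row the subject would not already see on Server B.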
8.3 Team boundary guarantees
- Scope filters only narrow the subject's native RBAC; they never widen it. A grant cannot confer access the user would not have had themselves.
- `include_teams` means "only these teams," not "these teams in addition to all teams."
- `include_personal: false` hides the user's personal data entirely from federation, even if they own it, which is useful for work-only accounts.
8.4 No cross-user leakage
When Server B has multiple users (employees) all federating back to their own Server A:
- Each employee has their own grant with their own `subject_user_id`.
- The cert is bound to a specific grant; there is no mechanism by which one grant's cert can be used to impersonate another.
- Audit log is per-grant.
9. Query Model
Federation exposes a narrow read API, not arbitrary SQL.
9.1 Supported verbs (v1)
| Verb | Purpose | Returns |
|---|---|---|
| `list` | Paginated list of a resource type | Array of resources |
| `get` | Fetch a single resource by id | One resource or 404 |
| `search` | Keyword search within allowed resources | Ranked list of hits |
| `capabilities` | What this grant is allowed to do right now | Scope object + rate-limit state |
9.2 Not in v1
- Write verbs.
- Aggregations / analytics.
- Streaming / subscriptions (future: see §13).
9.3 Agent-facing integration
Agents never call federation directly. Instead:
- The gateway query layer accepts `source: "local" | "federated:<peer_hostname>" | "all"`.
- `"all"` fans out in parallel, merges results, tags each with `_source`.
- Federation results are in-memory only; the gateway does not persist them.
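The `"all"` fan-out might look like the sketch below: each source is queried in parallel, failures are collected rather than thrown, and every row is tagged with `_source`. The function names `mergeTagged` and `queryAll`, the `SourceResult` shape, and the `offline` list are assumptions for illustration.

```typescript
type Tagged<T> = T & { _source: string };

interface SourceResult<T> {
  source: string;   // "local" or "federated:<peer_hostname>"
  rows: T[] | null; // null means the source failed or was offline
}

interface Merged<T> {
  rows: Tagged<T>[];
  offline: string[];
}

// Pure merge step: tag each row with its origin and note failed sources.
function mergeTagged<T extends object>(results: SourceResult<T>[]): Merged<T> {
  const rows: Tagged<T>[] = [];
  const offline: string[] = [];
  for (const r of results) {
    if (r.rows === null) {
      offline.push(r.source);
      continue;
    }
    for (const row of r.rows) rows.push({ ...row, _source: r.source });
  }
  return { rows, offline };
}

// Parallel fan-out: one fetcher per source; a rejected fetch degrades
// to "offline" instead of failing the whole query.
async function queryAll<T extends object>(
  fetchers: Record<string, () => Promise<T[]>>,
): Promise<Merged<T>> {
  const results = await Promise.all(
    Object.keys(fetchers).map(async (source): Promise<SourceResult<T>> => {
      try {
        return { source, rows: await fetchers[source]() };
      } catch (_err) {
        return { source, rows: null };
      }
    }),
  );
  return mergeTagged(results);
}
```

The merged result lives only in the returned object, which matches the in-memory-only rule: nothing here writes to storage.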
10. Caching
- In-memory response cache with short TTL (default 30s) for `list` and `get`. `search` is not cached.
- Cache is keyed by `(grant_id, verb, resource, query_hash)`.
- Cache is flushed on cert rotation and on grant revocation.
- No disk cache. No cross-session cache.
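A minimal in-memory sketch of the cache described above, assuming a `Map` keyed by the joined tuple and a per-grant flush hook. The class and method names are hypothetical, and the caller is assumed to invoke `flushGrant` on cert rotation and on revocation.

```typescript
interface CacheKey {
  grantId: string;
  verb: "list" | "get"; // search is never cached
  resource: string;
  queryHash: string;
}

// Join the tuple into one string; grant IDs are UUIDs, so "|" is a safe separator.
function keyOf(k: CacheKey): string {
  return [k.grantId, k.verb, k.resource, k.queryHash].join("|");
}

class FederationCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 30000) {
    this.ttlMs = ttlMs;
  }

  set(k: CacheKey, value: V, now: number = Date.now()): void {
    this.entries.set(keyOf(k), { value, expiresAt: now + this.ttlMs });
  }

  get(k: CacheKey, now: number = Date.now()): V | undefined {
    const key = keyOf(k);
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (now >= hit.expiresAt) {
      this.entries.delete(key); // expired: evict lazily
      return undefined;
    }
    return hit.value;
  }

  // Drop every entry for one grant (cert rotation or revocation).
  flushGrant(grantId: string): void {
    const prefix = grantId + "|";
    for (const key of Array.from(this.entries.keys())) {
      if (key.indexOf(prefix) === 0) this.entries.delete(key);
    }
  }
}
```

Keeping the store in a process-local `Map` means it disappears with the process, consistent with the no-disk, no-cross-session rule.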
11. Bootstrap & Onboarding
11.1 Instance capability tiers
| Tier | Storage | Queue | Memory | Can federate? |
|---|---|---|---|---|
| `local` | PGlite | in-proc | keyword | No |
| `standalone` | Postgres | Valkey | keyword | No (can be client) |
| `federated` | Postgres | Valkey | pgvector | Yes (server + client) |
Federation requires federated tier on both sides.
11.2 Onboarding flow (admin-driven)
- Admin on Server B runs `mosaic federation grant create --user <user-id> --peer <peer-hostname> --scope-file scope.json`.
- Server B generates a `grant_id`, prints a one-time enrollment URL containing the grant ID + B's CA root fingerprint.
- Admin on Server A (or the user themselves, if allowed) runs `mosaic federation peer add <enrollment-url>`.
- Server A's Step-CA generates a CSR for the new grant. A submits the CSR to B over a short-lived enrollment endpoint (single-use token in the enrollment URL).
- B's Step-CA signs the cert (with grant ID embedded in SAN OIDs), returns it.
- A stores the signed cert + private key (encrypted) in `federation_peers`.
- Grant status flips from `pending` to `active` on both sides.
- Cert auto-renews at T-7 days using the standard Step-CA renewal flow as long as the grant remains active.
11.3 Revocation
- Admin-initiated: `mosaic federation grant revoke <grant-id>` on B flips status to `revoked`, adds the cert to B's CRL, and writes an audit entry.
- Revoke-on-delete: Deleting a user on B automatically revokes all grants where that user is the subject.
- Server A learns of revocation on the next request (TLS handshake fails) and flips the peer to `revoked`.
11.4 Rate limit
Default 60 req/min per grant. Configurable per grant. Enforced at the serving side. A rate-limited request returns 429 with Retry-After.
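The PRD does not fix a rate-limiting algorithm. As one possible reading, the sketch below uses a fixed one-minute window per grant and computes the `Retry-After` value in seconds; the `GrantRateLimiter` class and `RateDecision` shape are hypothetical.

```typescript
interface RateDecision {
  allowed: boolean;
  status: 200 | 429;
  retryAfterSeconds?: number; // set only when status is 429
}

const WINDOW_MS = 60000;

// Fixed-window counter per grant (an assumed algorithm; a sliding
// window or token bucket would satisfy the same requirement).
class GrantRateLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();

  check(grantId: string, limitRpm: number = 60, now: number = Date.now()): RateDecision {
    const w = this.windows.get(grantId);
    if (!w || now - w.start >= WINDOW_MS) {
      // First request, or the previous window has elapsed: start a new one.
      this.windows.set(grantId, { start: now, count: 1 });
      return { allowed: true, status: 200 };
    }
    if (w.count < limitRpm) {
      w.count += 1;
      return { allowed: true, status: 200 };
    }
    // Over the limit: tell the client when the current window ends.
    const retryAfterSeconds = Math.ceil((w.start + WINDOW_MS - now) / 1000);
    return { allowed: false, status: 429, retryAfterSeconds };
  }
}
```

The serving side would pass `rate_limit_rpm` from the grant record as `limitRpm` and copy `retryAfterSeconds` into the `Retry-After` header.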
12. Operational Concerns
- Observability: Each federation request emits an OTEL span with `grant_id`, `peer`, `verb`, `resource`, `outcome`, `latency_ms`. Traces correlate across both servers via W3C traceparent.
- Health check: `mosaic federation status` on each side shows active grants, last-success times, cert expirations, and any CRL mismatches.
- Backpressure: If the serving side is overloaded, it returns `503` with a structured body; the client marks the peer `degraded` and falls back to local-only until the next successful handshake.
- Secrets: `client_key_pem` in `federation_peers` is encrypted with the gateway's key (sealed with the instance's master key, the same mechanism as `provider_credentials`).
- Credentials never cross: The `credentials` resource type is in the default excluded list. It must be explicitly added to scope (admin action, logged) and even then is per-grant and per-user.
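The client-side peer status changes described in the backpressure and revocation rules can be summarised as a small transition function. This is a sketch under assumptions: the event names are hypothetical, treating an unreachable peer the same as a `503` is an interpretation, and `revoked` is modelled as terminal.

```typescript
type PeerStatus = "pending" | "active" | "degraded" | "revoked";

// Hypothetical event names for outcomes of a federation request.
type PeerEvent = "request_ok" | "overloaded_503" | "unreachable" | "tls_rejected";

function nextPeerStatus(current: PeerStatus, event: PeerEvent): PeerStatus {
  // Revoked is terminal here; pending peers only change via onboarding.
  if (current === "revoked" || current === "pending") return current;
  if (event === "tls_rejected") return "revoked"; // revocation surfaces as a failed handshake
  if (event === "request_ok") return "active";    // next successful handshake restores the peer
  return "degraded";                               // 503 or unreachable: fall back to local-only
}
```

A status of `degraded` is what would drive the "federation offline" signal in the session, while local queries continue unaffected.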
13. Future (post-v1)
- B→A push (e.g., "notify A when a task assigned to subject changes") via Socket.IO over mTLS.
- Mesh (N-to-N) federation.
- Write verbs with conflict resolution.
- Shared Step-CA (a "root of roots") so that onboarding doesn't require exchanging CA roots.
- Federated memory search over vector indexes with homomorphic filtering.
14. Locked Decisions (was "Open Questions")
| # | Question | Decision |
|---|---|---|
| 1 | What happens to a grant when its subject user is deleted? | Revoke-on-delete. All grants where the user is subject are auto-revoked and CRL'd. |
| 2 | Do we audit read-only requests? | Yes. All federated reads are audited on the serving side. Bodies are not captured; query hash + metadata only. |
| 3 | Default rate limit? | 60 requests per minute per grant, override-able per grant. |
| 4 | How do we verify the requesting-server's identity beyond the grant token? | X.509 client cert tied to the user, issued by Step-CA (per-server) or locally generated. Cert subject carries grantId + subjectUserId. |
M1 decisions
- Postgres deployment: Containerized alongside the gateway in M1 (Docker Compose profile). Moving to a dedicated host is an M5+ operational concern, not a v1 feature.
- Instance signing key: Separate from the Step-CA key. Step-CA signs federation certs; the instance master key seals at-rest secrets (client keys, provider credentials). Different blast-radius, different rotation cadences.
15. Acceptance Criteria
- Two Mosaic Stack gateways on different hosts can establish a federation grant via the CLI-driven onboarding flow.
- Server A can query Server B for `tasks`, `notes`, `memory` respecting scope filters.
- A user on B with no grant cannot be queried by A, even if A has a valid grant for another user.
- Revoking a grant on B causes A's next request to fail with a clear error within one request cycle.
- Cert rotation happens automatically at T-7 days; an in-progress session survives rotation without user action.
- Rate-limit enforcement returns 429 with `Retry-After`; client backs off.
- With B unreachable, a session on A completes using local data and surfaces a "federation offline for `<peer>`" signal once.
- Every federated request appears in B's `federation_audit_log` within 1 second.
- A scope excluding `credentials` means credentials are not returnable even via `search` with matching keywords.
- `mosaic federation status` shows cert expiry, grant status, and last success/failure per peer.
16. Implementation Milestones (reference)
Milestones live in docs/federation/MILESTONES.md (to be authored next). High-level:
- M1: Server A runs `federated` tier standalone (Postgres + Valkey + pgvector, containerized). No peer yet.
- M2: Step-CA embedded; `federation_grants` / `federation_peers` schema + admin CLI.
- M3: Handshake + `list` / `get` verbs with scope enforcement.
- M4: `search` verb, audit log, rate limits.
- M5: Cache layer, offline-degradation UX, observability surfaces.
- M6: Revocation flows (admin + revoke-on-delete), cert auto-renewal.
- M7: Multi-user RBAC hardening on B, team-scoped grants end-to-end, acceptance suite green.
Next step after PRD sign-off: author docs/federation/MILESTONES.md with per-milestone acceptance tests and estimated token budget, then file tracking issues on git.mosaicstack.dev/mosaicstack/stack.