stack/docs/federation/PRD.md
jason.woltje 46dd799548
docs(federation): PRD, milestones, mission manifest, and M1 task breakdown (#467)
2026-04-19 22:09:20 +00:00


Mosaic Stack — Federation PRD

Status: Draft v1 (locked for implementation)
Owner: Jason
Date: 2026-04-19
Scope: Enables cross-instance data federation between Mosaic Stack gateways with asymmetric trust, multi-tenant scoping, and no cross-boundary data persistence.


1. Problem Statement

Jarvis operates across 34 workstations in two physical locations (home, USC). The user currently reaches back to a single jarvis-brain checkout from every session, and has tried OpenBrain to solve cross-session state — with poor results (cache invalidation, latency, opacity, hard dependency on a remote service).

The goal is a federation model where each user's home instance remains the source of truth for their personal data, and work/shared instances expose scoped data to that user's home instance on demand — without persisting anything across the boundary.

2. Goals

  1. A user logged into their home gateway (Server A) can query their work gateway (Server B) in real time during a session.
  2. Data returned from Server B is used in-session only; never written to Server A storage.
  3. Server B has multiple users, each with their own Server A. No user's data leaks to another user.
  4. Federation works over public HTTPS (no VPN required). Tailscale is a supported optional overlay.
  5. Sync latency target: remote data is available within seconds, or on demand at the agent's next data need.
  6. Graceful degradation: if the remote instance is unreachable, the local session continues with local data and a clear "federation offline" signal.
  7. Teams exist on both sides. A federation grant can share team-owned data without exposing other team members' personal data.
  8. Auth and revocation use standard PKI (X.509) so that certificate tooling (Step-CA, rotation, OCSP, CRL) is available out of the box.

3. Non-Goals (v1)

  • Mesh federation (N-to-N). v1 is strictly A↔B pairs.
  • Cross-instance writes. All federation is read-only on the remote side.
  • Shared agent sessions across instances. Sessions live on one instance; federation is data-plane only.
  • Cross-instance SSO. Each instance owns its own BetterAuth identity store; federation is service-to-service, not user-to-user.
  • Realtime push from B→A. v1 is pull-only (A pulls from B during a session).
  • Global search index. Federation is query-by-query, not index replication.

4. User Stories

  • US-1 (Solo user at home): As the sole user on Server A, I want my agent session on workstation-1 to see the same data it saw on workstation-2, without running OpenBrain.
  • US-2 (Cross-location): As a user with a home server and a work server, I want a session on my home laptop to transparently pull my USC-owned tasks/notes when I ask for them.
  • US-3 (Work admin): As the admin of mosaic.uscllc.com, I want to grant each employee's home gateway scoped read access to only their own data plus explicitly-shared team data.
  • US-4 (Privacy boundary): As employee A on mosaic.uscllc.com, my data must never appear in a session on employee B's home gateway — even if both are federated with uscllc.com.
  • US-5 (Revocation): As a work admin, when I delete an employee, their home gateway loses access within one request cycle.
  • US-6 (Offline): As a user in a hotel with flaky wifi, my local session keeps working; federation calls fail fast and are reported as "offline," not hung.

5. Architecture Overview

┌─────────────────────────────────────┐     mTLS / X.509         ┌─────────────────────────────────────┐
│  Server A — mosaic.woltje.com       │ ───────────────────────► │  Server B — mosaic.uscllc.com        │
│  (home, master for Jason)           │   ◄── JSON over HTTPS    │  (work, multi-tenant)                │
│                                     │                          │                                      │
│  ┌──────────────┐  ┌──────────────┐ │                          │ ┌──────────────┐  ┌──────────────┐  │
│  │  Gateway     │  │  Postgres    │ │                          │ │  Gateway     │  │  Postgres    │  │
│  │  (NestJS)    │──│  (local SSOT)│ │                          │ │  (NestJS)    │──│  (tenant SSOT)│ │
│  └──────┬───────┘  └──────────────┘ │                          │ └──────┬───────┘  └──────────────┘  │
│         │                           │                          │        │                             │
│         │  FederationClient         │                          │        │  FederationServer           │
│         │  (outbound, scoped query) │                          │        │  (inbound, RBAC-gated)      │
│         └───────────────────────────┼──────────────────────────┼────────┘                             │
│                                     │                          │                                      │
│  Step-CA (issues A's client cert)   │                          │  Step-CA (issues B's server cert,    │
│                                     │                          │           trusts A's CA root on grant)│
└─────────────────────────────────────┘                          └──────────────────────────────────────┘
  • Federation is a transport-layer concern between two gateways, implemented as a new internal module on each gateway.
  • Both sides run the same code. Direction (client vs. server role) is per-request.
  • Nothing in the agent runtime changes — agents query the gateway; the gateway decides local vs. remote.

6. Transport & Authentication

Transport: HTTPS with mutual TLS (mTLS).

Identity: X.509 client certificates issued by Step-CA. Each federation grant materializes as a client cert on the requesting side and a trust-anchor entry (CA root or explicit cert) on the serving side.

Why mTLS over HMAC bearer tokens:

  • Standard rotation/revocation semantics (renew, CRL, OCSP).
  • The cert subject carries identity claims (user, grant_id) that don't need a separate DB lookup to verify authenticity.
  • Client certs never transit request bodies, so they can't be logged by accident.
  • Transport is pinned at the TLS layer, not re-validated per-handler.

Cert contents (SAN + subject):

  • CN=grant-<uuid>
  • O=<requesting-server-hostname> (e.g., mosaic.woltje.com)
  • Custom OIDs embedded in SAN otherName:
    • mosaic.federation.grantId (UUID)
    • mosaic.federation.subjectUserId (user on the serving side that this grant acts-as)
  • Default lifetime: 30 days, with auto-renewal at T-7 days if the grant is still active.
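
The T-7 renewal rule can be sketched as a small date check. This is a minimal illustration only: `shouldRenewCert`, `GrantStatus`, and the constants are hypothetical names, and the actual renewal is expected to go through the Step-CA renewal flow rather than through code like this.

```typescript
// Hypothetical helper: decides whether a grant cert is due for renewal.
// The 7-day window comes from the PRD; every name here is illustrative.
const DAY_MS = 24 * 60 * 60 * 1000;
const RENEW_BEFORE_EXPIRY_DAYS = 7;

type GrantStatus = "pending" | "active" | "suspended" | "revoked";

function shouldRenewCert(certExpiresAt: Date, grantStatus: GrantStatus, now: Date): boolean {
  // Only active grants renew; anything else lets the cert lapse.
  if (grantStatus !== "active") return false;
  const renewAtMs = certExpiresAt.getTime() - RENEW_BEFORE_EXPIRY_DAYS * DAY_MS;
  return now.getTime() >= renewAtMs;
}

// A cert issued 2026-04-19 with the default 30-day lifetime expires 2026-05-19,
// so it becomes eligible for renewal on 2026-05-12.
const expires = new Date("2026-05-19T00:00:00Z");
console.log(shouldRenewCert(expires, "active", new Date("2026-05-11T00:00:00Z"))); // false
console.log(shouldRenewCert(expires, "active", new Date("2026-05-12T00:00:00Z"))); // true
```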

Step-CA topology (v1): Each server runs its own Step-CA instance. During onboarding, the serving side imports the requesting side's CA root. A central/shared Step-CA is out of scope for v1.

Handshake:

  1. Client (A) opens HTTPS to B with its grant cert.
  2. B validates cert chain against trusted CA roots for that grant.
  3. B extracts grantId and subjectUserId from the cert.
  4. B loads the grant record, checks it is active, not revoked, and not expired.
  5. B enforces scope and rate-limit for this grant.
  6. Request proceeds; response returned.
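
Steps 3 and 4 of the handshake could look roughly like the sketch below. It assumes chain validation (step 2) has already happened at the TLS layer. `CertClaims`, `GrantRecord`, `authorizeHandshake`, and the reason strings are hypothetical names for illustration, not the gateway's API.

```typescript
type GrantStatus = "pending" | "active" | "suspended" | "revoked";

// Claims B extracts from the already chain-validated client cert (hypothetical shape).
interface CertClaims {
  grantId: string;
  subjectUserId: string;
  fingerprint: string; // SHA-256 of the presented cert
  notAfter: Date; // cert expiry
}

// Subset of the federation_grants row on the serving side.
interface GrantRecord {
  id: string;
  subjectUserId: string;
  activeCertFingerprint: string;
  status: GrantStatus;
}

type HandshakeResult =
  | { ok: true; grant: GrantRecord }
  | { ok: false; reason: "unknown_grant" | "subject_mismatch" | "stale_cert" | "not_active" | "expired" };

function authorizeHandshake(
  claims: CertClaims,
  loadGrant: (grantId: string) => GrantRecord | undefined,
  now: Date,
): HandshakeResult {
  const grant = loadGrant(claims.grantId);
  if (!grant) return { ok: false, reason: "unknown_grant" };
  // The cert must act as the same user the grant was created for.
  if (grant.subjectUserId !== claims.subjectUserId) return { ok: false, reason: "subject_mismatch" };
  // Only the currently active cert is accepted (assumed check against active_cert_fingerprint).
  if (grant.activeCertFingerprint !== claims.fingerprint) return { ok: false, reason: "stale_cert" };
  // pending, suspended, and revoked grants are all rejected.
  if (grant.status !== "active") return { ok: false, reason: "not_active" };
  if (now.getTime() > claims.notAfter.getTime()) return { ok: false, reason: "expired" };
  return { ok: true, grant };
}
```

Scope and rate-limit enforcement (step 5) would run only after this function returns `ok: true`.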

7. Data Model

All tables live on each instance's own Postgres. Federation grants are bilateral — each side has a record of the grant.

7.1 federation_grants (on serving side, Server B)

| Field | Type | Notes |
|---|---|---|
| id | uuid | PK |
| subject_user_id | uuid FK | Which local user this grant acts-as |
| requesting_server | text | Hostname of requesting gateway (e.g., woltje.com) |
| requesting_ca_fingerprint | text | SHA-256 of trusted CA root |
| active_cert_fingerprint | text | SHA-256 of currently valid client cert |
| scope | jsonb | See §8 |
| rate_limit_rpm | int | Default 60 |
| status | enum | pending, active, suspended, revoked |
| created_at | timestamptz | |
| activated_at | timestamptz | |
| revoked_at | timestamptz | |
| last_used_at | timestamptz | |
| notes | text | Admin-visible description |

7.2 federation_peers (on requesting side, Server A)

| Field | Type | Notes |
|---|---|---|
| id | uuid | PK |
| peer_hostname | text | e.g., mosaic.uscllc.com |
| peer_ca_fingerprint | text | SHA-256 of peer's CA root |
| grant_id | uuid | The grant ID assigned by the peer |
| local_user_id | uuid FK | Who on Server A this federation belongs to |
| client_cert_pem | text (enc) | Current client cert (PEM); rotated automatically |
| client_key_pem | text (enc) | Private key (encrypted at rest) |
| cert_expires_at | timestamptz | |
| status | enum | pending, active, degraded, revoked |
| last_success_at | timestamptz | |
| last_failure_at | timestamptz | |
| notes | text | |

7.3 federation_audit_log (on serving side, Server B)

| Field | Type | Notes |
|---|---|---|
| id | uuid | PK |
| grant_id | uuid FK | |
| occurred_at | timestamptz | indexed |
| verb | text | query, handshake, rejected, rate_limited |
| resource | text | e.g., tasks, notes, credentials |
| query_hash | text | SHA-256 of normalized query (no payload stored) |
| outcome | text | ok, denied, error |
| bytes_out | int | |
| latency_ms | int | |

Audit policy: Every federation request is logged on the serving side. Because federation requests are read-only, no request or response bodies are captured; only the query hash and metadata are stored. Retention: 90 days hot, then roll to cold storage.
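
One way the query_hash column could be computed is sketched below. The PRD does not define "normalized"; this sketch assumes normalization means recursively sorting object keys before hashing, so that two logically identical queries produce the same hash. `normalize` and `queryHash` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

type QueryValue = string | number | boolean | null | QueryValue[] | { [key: string]: QueryValue };

// Assumed normalization: recursively sort object keys so key order does not affect the hash.
function normalize(value: QueryValue): QueryValue {
  if (Array.isArray(value)) return value.map(normalize);
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: QueryValue } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = normalize(value[key] as QueryValue);
    }
    return sorted;
  }
  return value;
}

// SHA-256 over the normalized JSON form, hex-encoded. Only this digest is stored, never the query.
function queryHash(query: QueryValue): string {
  return createHash("sha256").update(JSON.stringify(normalize(query))).digest("hex");
}

const a = queryHash({ resource: "tasks", filter: { team: "t1", done: false } });
const b = queryHash({ filter: { done: false, team: "t1" }, resource: "tasks" });
console.log(a === b); // true: key order does not matter
```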

8. RBAC & Scope

Every federation grant has a scope object that answers three questions for every inbound request:

  1. Who is acting? The subject_user_id from the cert.
  2. What resources? An allowlist of resource types (tasks, notes, credentials, memory, teams/:id/tasks, …).
  3. What filters apply? Predicates applied on top of the subject's normal RBAC (see below).

8.1 Scope schema

{
  "resources": ["tasks", "notes", "memory"],
  "filters": {
    "tasks": { "include_teams": ["team_uuid_1", "team_uuid_2"], "include_personal": true },
    "notes": { "include_personal": true, "include_teams": [] },
    "memory": { "include_personal": true }
  },
  "excluded_resources": ["credentials", "api_keys"],
  "max_rows_per_query": 500
}

8.2 Access rule (enforced on serving side)

For every inbound federated query on resource R:

  1. Resolve effective identity → subject_user_id.
  2. Check R is in scope.resources and NOT in scope.excluded_resources. Otherwise 403.
  3. Evaluate the user's normal RBAC (what they would see if they logged into Server B directly).
  4. Intersect with the scope filter (e.g., only team X, only personal).
  5. Apply max_rows_per_query.
  6. Return; log to audit.
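
Steps 2, 4, and 5 of the access rule can be sketched as a pure function over rows that native RBAC (step 3) has already admitted. The `Row` shape and the `applyScope` name are hypothetical. The sketch also assumes that a resource with no filter entry exposes no rows (fail closed), which the PRD does not specify.

```typescript
interface ResourceFilter {
  include_personal?: boolean;
  include_teams?: string[];
}

interface Scope {
  resources: string[];
  filters: Record<string, ResourceFilter>;
  excluded_resources: string[];
  max_rows_per_query: number;
}

// Hypothetical row shape: teamId is null for personal data.
interface Row {
  id: string;
  ownerUserId: string;
  teamId: string | null;
}

type ScopedResult = { status: 403 } | { status: 200; rows: Row[] };

function applyScope(scope: Scope, resource: string, subjectUserId: string, rbacVisibleRows: Row[]): ScopedResult {
  // Step 2: resource must be allowlisted and not excluded.
  if (!scope.resources.includes(resource) || scope.excluded_resources.includes(resource)) {
    return { status: 403 };
  }
  // Assumption: a missing filter entry means nothing is shared for that resource.
  const filter: ResourceFilter = scope.filters[resource] ?? {};
  const teams = new Set(filter.include_teams ?? []);
  // Step 4: intersect the RBAC-visible rows with the scope filter.
  const allowed = rbacVisibleRows.filter((row) =>
    row.teamId === null
      ? filter.include_personal === true && row.ownerUserId === subjectUserId
      : teams.has(row.teamId),
  );
  // Step 5: cap the result size.
  return { status: 200, rows: allowed.slice(0, scope.max_rows_per_query) };
}
```

Because the function only ever filters its input, it cannot return a row that native RBAC did not already allow.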

8.3 Team boundary guarantees

  • Scope filters only narrow the subject's native RBAC; they never widen it. A grant cannot confer access the user would not have had themselves.
  • include_teams means "only these teams," not "these teams in addition to all teams."
  • include_personal: false hides the user's personal data entirely from federation, even if they own it — useful for work-only accounts.

8.4 No cross-user leakage

When Server B has multiple users (employees) all federating back to their own Server A:

  • Each employee has their own grant with their own subject_user_id.
  • The cert is bound to a specific grant; there is no mechanism by which one grant's cert can be used to impersonate another.
  • Audit log is per-grant.

9. Query Model

Federation exposes a narrow read API, not arbitrary SQL.

9.1 Supported verbs (v1)

| Verb | Purpose | Returns |
|---|---|---|
| list | Paginated list of a resource type | Array of resources |
| get | Fetch a single resource by id | One resource or 404 |
| search | Keyword search within allowed resources | Ranked list of hits |
| capabilities | What this grant is allowed to do right now | Scope object + rate-limit state |

9.2 Not in v1

  • Write verbs.
  • Aggregations / analytics.
  • Streaming / subscriptions (future: see §13).

9.3 Agent-facing integration

Agents never call federation directly. Instead:

  • The gateway query layer accepts source: "local" | "federated:<peer_hostname>" | "all".
  • "all" fans out in parallel, merges results, tags each with _source.
  • Federation results are in-memory only; the gateway does not persist them.
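
The "all" fan-out might be implemented along the lines below. This is a sketch under stated assumptions: `queryAll` and the fetcher map are hypothetical, and treating a failed source as "offline" rather than failing the whole query is one reading of the graceful-degradation goal.

```typescript
type Fetcher<T> = () => Promise<T[]>;

interface FanOutResult<T> {
  results: Array<T & { _source: string }>;
  offline: string[]; // sources that failed; the caller can surface "federation offline"
}

// Queries every source in parallel, tags each item with its source, and keeps
// results in memory only. A failing source does not fail the whole query.
async function queryAll<T extends object>(sources: Record<string, Fetcher<T>>): Promise<FanOutResult<T>> {
  const entries = Object.entries(sources);
  const settled = await Promise.allSettled(entries.map(([, fetch]) => fetch()));
  const results: Array<T & { _source: string }> = [];
  const offline: string[] = [];
  settled.forEach((outcome, i) => {
    const name = entries[i][0];
    if (outcome.status === "fulfilled") {
      for (const item of outcome.value) results.push({ ...item, _source: name });
    } else {
      offline.push(name);
    }
  });
  return { results, offline };
}
```

A caller would pass one fetcher keyed "local" and one per peer keyed "federated:<peer_hostname>", matching the source values accepted by the gateway query layer.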

10. Caching

  • In-memory response cache with short TTL (default 30s) for list and get. search is not cached.
  • Cache is keyed by (grant_id, verb, resource, query_hash).
  • Cache is flushed on cert rotation and on grant revocation.
  • No disk cache. No cross-session cache.
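
A minimal in-memory cache matching these rules might look like the following. `FederationCache` and `CacheKey` are hypothetical names; restricting the verb type to list and get is how this sketch keeps search uncached.

```typescript
interface CacheKey {
  grantId: string;
  verb: "list" | "get"; // search is deliberately not cacheable
  resource: string;
  queryHash: string;
}

interface CacheEntry<V> {
  grantId: string;
  value: V;
  expiresAtMs: number;
}

class FederationCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 30000) {
    this.ttlMs = ttlMs; // default TTL of 30s, per the PRD
  }

  private keyOf(key: CacheKey): string {
    return [key.grantId, key.verb, key.resource, key.queryHash].join("|");
  }

  set(key: CacheKey, value: V, nowMs: number = Date.now()): void {
    this.entries.set(this.keyOf(key), { grantId: key.grantId, value, expiresAtMs: nowMs + this.ttlMs });
  }

  get(key: CacheKey, nowMs: number = Date.now()): V | undefined {
    const id = this.keyOf(key);
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    if (nowMs >= entry.expiresAtMs) {
      this.entries.delete(id); // expired: drop and report a miss
      return undefined;
    }
    return entry.value;
  }

  // Called on cert rotation and on grant revocation.
  flushGrant(grantId: string): void {
    for (const [id, entry] of this.entries) {
      if (entry.grantId === grantId) this.entries.delete(id);
    }
  }
}
```

Holding the map inside a per-process object means nothing reaches disk, and scoping one instance per session would satisfy the no-cross-session rule.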

11. Bootstrap & Onboarding

11.1 Instance capability tiers

| Tier | Storage | Queue | Memory | Can federate? |
|---|---|---|---|---|
| local | PGlite | in-proc | keyword | No |
| standalone | Postgres | Valkey | keyword | No (can be client) |
| federated | Postgres | Valkey | pgvector | Yes (server + client) |

Federation requires federated tier on both sides.

11.2 Onboarding flow (admin-driven)

  1. Admin on Server B runs mosaic federation grant create --user <user-id> --peer <peer-hostname> --scope-file scope.json.
  2. Server B generates a grant_id, prints a one-time enrollment URL containing the grant ID + B's CA root fingerprint.
  3. Admin on Server A (or the user themselves, if allowed) runs mosaic federation peer add <enrollment-url>.
  4. Server A's Step-CA generates a CSR for the new grant. A submits the CSR to B over a short-lived enrollment endpoint (single-use token in the enrollment URL).
  5. B's Step-CA signs the cert (with grant ID embedded in SAN OIDs), returns it.
  6. A stores the signed cert + private key (encrypted) in federation_peers.
  7. Grant status flips from pending to active on both sides.
  8. Cert auto-renews at T-7 days using the standard Step-CA renewal flow as long as the grant remains active.

11.3 Revocation

  • Admin-initiated: mosaic federation grant revoke <grant-id> on B flips status to revoked, adds the cert to B's CRL, and writes an audit entry.
  • Revoke-on-delete: Deleting a user on B automatically revokes all grants where that user is the subject.
  • Server A learns of revocation on the next request (TLS handshake fails) and flips the peer to revoked.

11.4 Rate limit

Default 60 req/min per grant. Configurable per grant. Enforced at the serving side. A rate-limited request returns 429 with Retry-After.
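
Per-grant rate limiting could be sketched with a fixed one-minute window, as below. The PRD does not name an algorithm, so the fixed window is an assumption (a token bucket or sliding window would also fit), and `GrantRateLimiter` is a hypothetical name.

```typescript
type RateDecision = { status: 200 } | { status: 429; retryAfterSeconds: number };

const WINDOW_MS = 60000;

class GrantRateLimiter {
  private readonly windows = new Map<string, { windowStartMs: number; count: number }>();

  // limitRpm would come from the grant's rate_limit_rpm (default 60).
  check(grantId: string, limitRpm: number, nowMs: number): RateDecision {
    const current = this.windows.get(grantId);
    // Start a fresh window on first use or once the previous window has elapsed.
    if (!current || nowMs - current.windowStartMs >= WINDOW_MS) {
      this.windows.set(grantId, { windowStartMs: nowMs, count: 1 });
      return { status: 200 };
    }
    if (current.count < limitRpm) {
      current.count += 1;
      return { status: 200 };
    }
    // Over the limit: tell the client how long until the window resets.
    const retryAfterSeconds = Math.ceil((current.windowStartMs + WINDOW_MS - nowMs) / 1000);
    return { status: 429, retryAfterSeconds };
  }
}

const limiter = new GrantRateLimiter();
console.log(limiter.check("g1", 2, 0)); // { status: 200 }
console.log(limiter.check("g1", 2, 1000)); // { status: 200 }
console.log(limiter.check("g1", 2, 2000)); // { status: 429, retryAfterSeconds: 58 }
```

The `retryAfterSeconds` value maps directly onto the Retry-After response header.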

12. Operational Concerns

  • Observability: Each federation request emits an OTEL span with grant_id, peer, verb, resource, outcome, latency_ms. Traces correlate across both servers via W3C traceparent.
  • Health check: mosaic federation status on each side shows active grants, last-success times, cert expirations, and any CRL mismatches.
  • Backpressure: If the serving side is overloaded, it returns 503 with a structured body; the client marks the peer degraded and falls back to local-only until the next successful handshake.
  • Secrets: client_key_pem in federation_peers is encrypted at rest, sealed with the instance's master key (the same mechanism as provider_credentials).
  • Credentials never cross by default: The credentials resource type is in the default excluded list. Adding it to a scope requires an explicit, logged admin action, and even then access is per-grant and per-user.
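
The backpressure bullet above, together with the revocation behavior in §11.3, implies a small state machine for the peer status on the requesting side. The sketch below is one possible reading: the `CallOutcome` values and `nextPeerStatus` are hypothetical, and treating an unreachable peer the same as an overloaded one is an assumption.

```typescript
type PeerStatus = "pending" | "active" | "degraded" | "revoked";

// Hypothetical classification of a federation call's result on Server A.
type CallOutcome = "ok" | "overloaded" | "unreachable" | "cert_rejected";

function nextPeerStatus(current: PeerStatus, outcome: CallOutcome): PeerStatus {
  // revoked is terminal; pending peers have not finished onboarding yet.
  if (current === "revoked" || current === "pending") return current;
  switch (outcome) {
    case "ok":
      return "active"; // a successful handshake clears the degraded state
    case "overloaded": // 503 from the serving side
    case "unreachable": // assumption: network failure is handled like overload
      return "degraded";
    case "cert_rejected": // TLS handshake fails after revocation on B
      return "revoked";
  }
}

console.log(nextPeerStatus("active", "overloaded")); // "degraded"
console.log(nextPeerStatus("degraded", "ok")); // "active"
```

While a peer is degraded, the gateway would answer from local data only and surface the "federation offline" signal.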

13. Future (post-v1)

  • B→A push (e.g., "notify A when a task assigned to subject changes") via Socket.IO over mTLS.
  • Mesh (N-to-N) federation.
  • Write verbs with conflict resolution.
  • Shared Step-CA (a "root of roots") so that onboarding doesn't require exchanging CA roots.
  • Federated memory search over vector indexes with homomorphic filtering.

14. Locked Decisions (was "Open Questions")

| # | Question | Decision |
|---|---|---|
| 1 | What happens to a grant when its subject user is deleted? | Revoke-on-delete. All grants where the user is subject are auto-revoked and CRL'd. |
| 2 | Do we audit read-only requests? | Yes. All federated reads are audited on the serving side. Bodies are not captured; query hash + metadata only. |
| 3 | Default rate limit? | 60 requests per minute per grant, override-able per grant. |
| 4 | How do we verify the requesting-server's identity beyond the grant token? | X.509 client cert tied to the user, issued by Step-CA (per-server) or locally generated. Cert subject carries grantId + subjectUserId. |

M1 decisions

  • Postgres deployment: Containerized alongside the gateway in M1 (Docker Compose profile). Moving to a dedicated host is an M5+ operational concern, not a v1 feature.
  • Instance signing key: Separate from the Step-CA key. Step-CA signs federation certs; the instance master key seals at-rest secrets (client keys, provider credentials). Different blast-radius, different rotation cadences.

15. Acceptance Criteria

  • Two Mosaic Stack gateways on different hosts can establish a federation grant via the CLI-driven onboarding flow.
  • Server A can query Server B for tasks, notes, memory respecting scope filters.
  • A user on B who is not the subject of a grant cannot have their data queried by A, even if A holds a valid grant for another user.
  • Revoking a grant on B causes A's next request to fail with a clear error within one request cycle.
  • Cert rotation happens automatically at T-7 days; an in-progress session survives rotation without user action.
  • Rate-limit enforcement returns 429 with Retry-After; client backs off.
  • With B unreachable, a session on A completes using local data and surfaces a "federation offline for <peer>" signal once.
  • Every federated request appears in B's federation_audit_log within 1 second.
  • A scope excluding credentials means credentials are not returnable even via search with matching keywords.
  • mosaic federation status shows cert expiry, grant status, and last success/failure per peer.

16. Implementation Milestones (reference)

Milestones live in docs/federation/MILESTONES.md (to be authored next). High-level:

  • M1: Server A runs federated tier standalone (Postgres + Valkey + pgvector, containerized). No peer yet.
  • M2: Step-CA embedded; federation_grants / federation_peers schema + admin CLI.
  • M3: Handshake + list/get verbs with scope enforcement.
  • M4: search verb, audit log, rate limits.
  • M5: Cache layer, offline-degradation UX, observability surfaces.
  • M6: Revocation flows (admin + revoke-on-delete), cert auto-renewal.
  • M7: Multi-user RBAC hardening on B, team-scoped grants end-to-end, acceptance suite green.

Next step after PRD sign-off: author docs/federation/MILESTONES.md with per-milestone acceptance tests and estimated token budget, then file tracking issues on git.mosaicstack.dev/mosaicstack/stack.