stack/docs/fleet/north-star.md
jason.woltje fc90c89913
fix(fleet): durable runtime PATH for detached agent launch (#581)
2026-06-21 17:30:40 +00:00


Mosaic Fleet — North Star

Workstream: W-FLEET (Fleet) under mission mvp-20260312
Umbrella: docs/MISSION-MANIFEST.md · docs/PRD.md (Mosaic Stack v0.1.0)
Status: doctrine, authored 2026-06-20
Owner of this file: Fleet workstream lead. This document does not modify the MVP rollup; a rollup row is proposed, not written here.

Vision

A customizable, multi-tenant fleet of always-on AI agents — each defined by role, materialized as a durable, joinable runtime session, coordinated by the proven orchestrator/worker model, and observable end-to-end across hosts. Coding today; finance, analytics, research as roster entries tomorrow — same primitives, different roster. The fleet is the agent-session execution layer of the Mosaic Stack MVP: the thing federation makes reachable across hosts and the webUI/TUI/CLI make visible.

The USC tmux PoC (durable sessions + agent-send comms) proved the model. This workstream makes it an official, observable, multi-tenant Mosaic Stack capability.

The Fleet as means of production (bootstrapping)

The Fleet has a dual role, and that is the point:

  • As product — a multi-tenant agent-fleet capability of Mosaic Stack (this workstream).
  • As means of production — the orchestrator/worker fleet that actually builds the entire MVP (federation W1, webUI, TUI, CLI, and the Fleet itself).

We are building the system that builds the system. Every other MVP workstream is delivered by the fleet, so fleet observability and control are not merely product features — they are the operational floor of the whole delivery effort. If we cannot see and steer the agents, we cannot trust what they ship. This is why Phase 2 (observability) leads: it is the instrument panel for the factory, dogfooded on the live fleet that is, recursively, building Mosaic Stack.

The discipline that makes this power safe is the same gate chain the fleet enforces: independent review before merge, green CI, honest completion, decide-and-inform cadence, and no irreversible action without authority. The bootstrap is only as trustworthy as those gates.

Alignment with MVP cross-cutting requirements

The Fleet inherits — does not re-invent — the MVP's hard requirements:

| MVP req | What it means for the Fleet |
| --- | --- |
| MVP-X1 three-surface parity | fleet observability/control reachable via CLI + TUI + webUI (CLI first; webUI is required for parity, not optional) |
| MVP-X2 multi-tenant isolation | one tenant = one Linux uid (own systemd --user, socket, ~/.config/mosaic); no cross-tenant leakage |
| MVP-X3 auth (BetterAuth/SSO) | operator→fleet and cross-host views are auth-gated through the platform's existing auth |
| MVP-X4 quality gates | pnpm typecheck/lint/format:check green before any push |
| MVP-X5 federated topology | cross-host fleet visibility rides the federation boundary (W1), not a bespoke broker |
| MVP-X6 OTEL tracing | heartbeats, sends, and lifecycle events emit spans; traceparent crosses the federation boundary |
| MVP-X7 trunk merge | branch from main, squash-merge via PR, never push to main |
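As an illustration of MVP-X6, here is a minimal sketch of carrying a `traceparent` value across a boundary. It follows the W3C Trace Context header layout (`version-traceid-parentid-flags`) but accepts only version `00`, which is a simplification. The function names are hypothetical and are not part of any Mosaic API.

```typescript
// Sketch only: parses and re-emits a W3C `traceparent` header (version 00).
interface TraceParent {
  traceId: string; // 32 lowercase hex chars, must not be all zeros
  parentId: string; // 16 lowercase hex chars, must not be all zeros
  sampled: boolean; // bit 0 of the flags byte
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const traceId = m[1];
  const parentId = m[2];
  const flags = m[3];
  // All-zero ids are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceParent(tp: TraceParent): string {
  return `00-${tp.traceId}-${tp.parentId}-${tp.sampled ? "01" : "00"}`;
}

// A hop keeps the trace id and substitutes the id of the span doing the send.
function childOf(tp: TraceParent, spanId: string): TraceParent {
  return { traceId: tp.traceId, parentId: spanId, sampled: tp.sampled };
}
```

The point of the sketch is that the trace id survives the hop unchanged, so a heartbeat or send that starts on one host can be stitched to spans emitted on another.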

The stack — where every concern lives

One definition is the source of truth; the session is how it runs.

| Layer | Owner | Phase-2 reality | Destination |
| --- | --- | --- | --- |
| Definition + identity + auth | gateway / mosaic-as (scoped tokens, #541) | roster.yaml (tenant-tagged) | one definition; mosaic agent --new materializes it |
| Tenancy boundary | Linux uid per tenant (linger, own systemd --user, own socket, own ~/.config/mosaic) | one tenant: jarvis = tenant zero | uid-per-tenant; federation aggregates across hosts |
| Runtime | per-tenant tmux session on isolated socket | dogfood stub sessions (live now on mosaic-factory) | claude/codex/pi/opencode TUIs |
| Liveness | heartbeat protocol every runtime answers | protocol defined + dogfood stub answers it | all runtimes answer; "healthy" ≠ "pane alive" |
| Observation | read-only watch (native tmux) + pipe-pane stream | CLI watch/ps; explicit opt-in attach for control | + auth-gated webUI streams |
| Control plane | federation across hosts × tenants | records already carry tenant_id + host | federated gateways expose fleet state; webUI in Phase 5 |

Operating model (inherited, not reinvented)

The AI-guide law stands: one accountable orchestrator, isolated workers that stop at PR-open, the serialized gate chain (independent review → green CI → diff-sanity → squash-merge → verify), decide-and-inform cadence, and a durable board so missions survive session death. The Fleet is the infrastructure under this model. See mosaicstack-aiguide whitepapers 01 (inter-agent comms) and 03 (orchestration model) for the rationale.

Invariants — "maximal vision, incremental delivery, zero foreclosure"

Every artifact, starting Phase 2, MUST:

  1. Carry tenant_id + host in schema and message addressing — even with one of each today.
  2. Treat socket isolation ≠ invisibility: anything isolated must be surfaced by one command.
  3. Define healthy = answered a heartbeat within N seconds, never just "pane alive".
  4. Make observation read-only by default; control is an explicit, separate, opt-in verb.
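Invariants 1 and 3 can be sketched in a few lines. This is an illustration under stated assumptions, not the fleet's schema: only `tenant_id` and `host` come from this document, while the other field names, the address format, and the function names are hypothetical.

```typescript
// Hypothetical record shape; only tenant_id and host are taken from the doc.
interface AgentRecord {
  tenant_id: string;
  host: string;
  name: string;
  paneAlive: boolean; // deliberately NOT consulted by isHealthy
  lastHeartbeatMs: number | null; // epoch millis of the last answered heartbeat
}

// Invariant 3: healthy means a heartbeat was answered within the window,
// never merely that the pane exists.
function isHealthy(rec: AgentRecord, nowMs: number, windowSeconds: number): boolean {
  if (rec.lastHeartbeatMs === null) return false;
  const ageMs = nowMs - rec.lastHeartbeatMs;
  return ageMs >= 0 && ageMs <= windowSeconds * 1000;
}

// Invariant 1: every address carries tenant and host, even with one of each.
// The "tenant@host/name" layout is an assumption made for this sketch.
function fleetAddress(rec: AgentRecord): string {
  return `${rec.tenant_id}@${rec.host}/${rec.name}`;
}
```

Note that an agent with a live pane and no recent heartbeat is reported unhealthy, which is the behavior the invariant asks for.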

Observation model

| Verb | Behavior |
| --- | --- |
| `mosaic fleet ps` | one table joining systemd + tmux + process + idle + last-heartbeat, with drift + boot-enable flags |
| `mosaic agent watch <name>` | read-only join (grouped session / -r), no resize tyranny, no keystrokes |
| `mosaic agent attach <name>` | explicit interactive takeover (the only path that can type) |
| `mosaic agent send <name> --verify` | confirms message accepted, not merely keystroke-injected |
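To make the `fleet ps` row concrete, the join and its drift flag might look like the following sketch. The input shapes, field names, and the definition of drift (systemd and tmux disagreeing about whether an agent is running) are assumptions for illustration; the real command may join more sources.

```typescript
// Hypothetical inputs: one list from systemd --user, one from tmux.
interface UnitState { name: string; active: boolean; enabled: boolean }
interface SessionState { name: string; idleSeconds: number }

interface PsRow {
  name: string;
  unit: "active" | "inactive" | "missing";
  session: "present" | "missing";
  idleSeconds: number | null;
  lastHeartbeatAgeSeconds: number | null;
  drift: boolean; // systemd and tmux disagree about this agent
  bootEnabled: boolean; // unit is enabled, so it should survive a reboot
}

function joinFleet(
  units: UnitState[],
  sessions: SessionState[],
  heartbeatAge: Map<string, number>,
): PsRow[] {
  const unitBy = new Map<string, UnitState>();
  units.forEach((u) => unitBy.set(u.name, u));
  const sessionBy = new Map<string, SessionState>();
  sessions.forEach((s) => sessionBy.set(s.name, s));

  // Outer join: every name seen by either source gets a row.
  const names = units.map((u) => u.name)
    .concat(sessions.map((s) => s.name))
    .filter((n, i, all) => all.indexOf(n) === i)
    .sort();

  return names.map((name): PsRow => {
    const u = unitBy.get(name);
    const s = sessionBy.get(name);
    const age = heartbeatAge.get(name);
    const unitActive = u !== undefined && u.active;
    return {
      name,
      unit: u === undefined ? "missing" : u.active ? "active" : "inactive",
      session: s === undefined ? "missing" : "present",
      idleSeconds: s === undefined ? null : s.idleSeconds,
      lastHeartbeatAgeSeconds: age === undefined ? null : age,
      drift: unitActive !== (s !== undefined),
      bootEnabled: u !== undefined && u.enabled,
    };
  });
}
```

An outer join matters here: a session with no unit, or a unit with no session, is exactly the case an operator needs to see rather than have silently dropped.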

Why the current PoC blocks observation: sessions live on the isolated mosaic-factory socket (invisible to default tmux ls), the only sanctioned read is capture-pane (blank for full-screen TUIs), and attach is read-write + resizes the session. The verbs above restore "join and observe" safely.
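The split between `watch` and `attach` can be sketched as two argument vectors for tmux. This assumes the isolated socket is a named socket selected with tmux's `-L` option (it could equally be a path given with `-S`) and relies on `attach-session -r`, tmux's read-only attach flag. The function name and the name validation are hypothetical.

```typescript
type ObserveMode = "watch" | "attach";

// Conservative identifier check so a name can never be parsed as an option
// or as tmux target syntax (":" and "." are meaningful in targets).
const SAFE_NAME = /^[A-Za-z0-9_][A-Za-z0-9_-]*$/;

function tmuxArgv(mode: ObserveMode, socketName: string, session: string): string[] {
  if (!SAFE_NAME.test(socketName) || !SAFE_NAME.test(session)) {
    throw new Error("socket and session names must be plain identifiers");
  }
  const base = ["tmux", "-L", socketName, "attach-session", "-t", session];
  // Read-only is the default path; only "attach" omits -r and can type.
  return mode === "watch" ? base.concat("-r") : base;
}
```

For example, `tmuxArgv("watch", "mosaic-factory", "orchestrator")` yields `tmux -L mosaic-factory attach-session -t orchestrator -r`, where the session name `orchestrator` is illustrative.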

Phased roadmap

| Phase | Outcome | Status |
| --- | --- | --- |
| 0–1 | tmux PoC, hardening, published CLI v0.0.34 (#565–#568) | done |
| 2: Observability | fleet ps (host+tenant aware join), heartbeat protocol + dogfood stub answers it, agent watch (read-only), agent send --verify receipts | ▶ now |
| 3: Real runtimes | claude/codex/pi/opencode answer heartbeat; hybrid lifecycle (core always-on: orchestrator+reviewer; ephemeral workers per lane) | planned |
| 4: Unified definition | one agent schema in gateway; mosaic agent --new → materialized per-tenant session; uid-tenant provisioning | planned |
| 5: Control plane | federation-backed cross-host × cross-tenant fleet view; webUI (surface chosen then) for MVP-X1 parity | planned |
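The Phase 2 `agent send --verify` receipt distinguishes "accepted" from "keystrokes were injected". One way to picture that, as a sketch with hypothetical message and acknowledgement shapes, is to treat a send as verified only when the agent returns an acknowledgement carrying the same message id within a timeout.

```typescript
// Hypothetical shapes; the real protocol is not specified in this document.
interface Outbound { id: string; to: string; body: string }
interface Ack { id: string; acceptedAtMs: number }

type Receipt =
  | { status: "accepted"; id: string; latencyMs: number }
  | { status: "unverified"; id: string };

function verifySend(msg: Outbound, sentAtMs: number, acks: Ack[], timeoutMs: number): Receipt {
  // Only an ack for this exact id, arriving after the send and inside the
  // timeout, counts. Injected keystrokes alone never produce a receipt.
  const match = acks.filter(
    (a) => a.id === msg.id && a.acceptedAtMs >= sentAtMs && a.acceptedAtMs - sentAtMs <= timeoutMs,
  )[0];
  if (match === undefined) return { status: "unverified", id: msg.id };
  return { status: "accepted", id: msg.id, latencyMs: match.acceptedAtMs - sentAtMs };
}
```

Returning an explicit `unverified` status, rather than throwing, lets a caller decide whether to retry or escalate.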

Decisions of record (2026-06-20, with Jason)

  • Agent model: config defines, session runs (gateway = definition/identity/auth; tmux = runtime).
  • Tenancy: multi-tenant from the start; isolation = per-tenant Linux uid.
  • Health: heartbeat required (dogfood stub implements the protocol now).
  • Lifecycle: hybrid — core always-on + ephemeral workers per lane.
  • Observation: read-only default, opt-in takeover.
  • Multi-host: designed-for from day one; control plane rides federation (W1).
  • Delivery: CLI-first now, dogfood against the live stub fleet; webUI deferred to Phase 5.
  • Runtimes: fleet agents default to Codex / pi-on-Codex; Claude is reserved for Claude Code only (avoid alternate-harness API pricing). Validated durable recipe: mosaic yolo pi --model openai-codex/gpt-5.5:high. Durable detached launch requires the runtime-bin on PATH (baked into the pane command) + boot-survival (enable + linger), which fleet init should automate.
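The durable-launch requirement in the Runtimes decision can be sketched as command construction: bake PATH into the pane command, start the session detached, then enable boot survival. The tmux, loginctl, and systemctl invocations use standard options (`new-session -d -s`, `enable-linger`, `--user enable`); the fallback PATH directories, the `-L` named socket, and every function, directory, and unit name are assumptions for illustration, not what `fleet init` does today.

```typescript
// POSIX single-quote escaping for one shell word.
function shQuote(word: string): string {
  return "'" + word.replace(/'/g, "'\\''") + "'";
}

// Pane command with PATH baked in, so a detached launch does not depend on
// whatever environment the tmux server happened to inherit.
function paneCommand(runtimeBinDir: string, argv: string[]): string {
  const path = `${runtimeBinDir}:/usr/local/bin:/usr/bin:/bin`; // assumed fallback dirs
  return ["env", `PATH=${path}`].concat(argv).map(shQuote).join(" ");
}

// argv for a detached session on a named socket.
function detachedLaunch(socketName: string, session: string, runtimeBinDir: string, argv: string[]): string[] {
  return ["tmux", "-L", socketName, "new-session", "-d", "-s", session, paneCommand(runtimeBinDir, argv)];
}

// Boot survival: linger for the tenant user plus an enabled user unit.
function bootSurvival(user: string, unit: string): string[][] {
  return [
    ["loginctl", "enable-linger", user],
    ["systemctl", "--user", "enable", unit],
  ];
}
```

Called with the validated recipe, `paneCommand("/opt/rt/bin", ["mosaic", "yolo", "pi", "--model", "openai-codex/gpt-5.5:high"])` produces a single quoted string that tmux hands to the shell, where `/opt/rt/bin` stands in for wherever the runtime binary actually lives.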

Assumptions (veto-able)

  • ASSUMPTION: first-class runtimes = claude, codex, pi, opencode; a "role" (analyst, finance, researcher) = persona + skills + tools on top of a runtime, shipped as a starter role library in the framework.
  • ASSUMPTION: the cross-host control plane is the federation layer (W1), not a separate fleetd daemon.
  • ASSUMPTION: Fleet is workstream W-FLEET under mvp-20260312; a rollup row in docs/TASKS.md and a workstream declaration in MISSION-MANIFEST.md are proposed to the MVP orchestrator, not written by this workstream.
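The first assumption (a role is persona + skills + tools layered on a runtime) can be pictured as a small type sketch. The four runtime names come from this document; the field names, the `rosterEntry` helper, and the example values are hypothetical.

```typescript
const RUNTIMES = ["claude", "codex", "pi", "opencode"] as const;
type Runtime = (typeof RUNTIMES)[number];

// A role is data layered on a runtime, not a new runtime.
interface Role { name: string; persona: string; skills: string[]; tools: string[] }

interface RosterEntry {
  tenant_id: string;
  host: string;
  agent: string;
  runtime: Runtime;
  role: Role;
}

function isRuntime(value: string): value is Runtime {
  return (RUNTIMES as readonly string[]).indexOf(value) !== -1;
}

// Rejects unknown runtimes so a roster cannot name one the fleet cannot start.
function rosterEntry(tenant_id: string, host: string, agent: string, runtime: string, role: Role): RosterEntry {
  if (!isRuntime(runtime)) throw new Error(`unknown runtime: ${runtime}`);
  return { tenant_id, host, agent, runtime, role };
}
```

Under this shape, adding a finance or research agent is a new `Role` value on an existing runtime, which matches the "same primitives, different roster" framing in the Vision.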