Files
stack/scratchpads/2026-06-19-tmux-fleet-durable-install-plan.md
Jarvis 757f5e6998
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
feat(fleet): add durable tmux fleet poc
2026-06-19 15:50:35 -05:00

756 lines
22 KiB
Markdown

# Durable tmux Fleet Installation Plan
> **For Mosaic/Hermes:** This is an implementation plan for making the tmux-backed Mosaic software-factory fleet durable on this server and reusable in generic Mosaic Stack installs. Keep local USC/Mosaic defaults in profiles; keep framework behavior customizable.
**Goal:** Add a supported Mosaic tmux-fleet installation path: holder-owned tmux server, per-agent reusable sessions, reliable send/reset/status tools, local roster customization, and a documented cutover for this server.
**Architecture:** Mosaic should ship generic tmux fleet primitives in the framework, then layer local rosters through configuration. The holder service owns the tmux socket; each agent service joins the holder-owned server and runs `mosaic yolo <runtime>`. The orchestrator addresses agents through `mosaic agent ...` abstractions so tmux can later be replaced by Matrix-backed agent comms without changing mission flow.
**Reference:** AI Guide `playbooks/tmux-fleet.md` at commit `2a0b0b5` documents the organization-neutral holder-service pattern, exact-match `=<name>` stop targets, and coupled-server cutover/verification sequence. The Stack implementation should treat that as the lifecycle model and keep concrete Mosaic unit/tooling details here.
**Tech Stack:** Bash, tmux, user systemd units, Mosaic CLI/framework installer, JSON/YAML roster config, existing `packages/mosaic/framework/tools/tmux/{agent-send.sh,send-message.sh}`.
---
## Current evidence from this server
Checked 2026-06-19:
- Host: `W-jarvis`
- User: `jarvis`
- tmux: `/usr/bin/tmux`, version `3.4`
- user systemd: active
- existing tmux sessions: `ai-bma-0`, `dyor-1`, `melaniewoltje-3`, `sage-2`
- existing Mosaic runtime: `/home/jarvis/.npm-global/bin/mosaic`, version `0.0.31`
- installed `~/.config/mosaic/tools/tmux` was not present even though the stack repo contains `packages/mosaic/framework/tools/tmux/`
Implication: do not kill the current tmux server casually. This server has active ad-hoc/service sessions. The durable fleet cutover must be planned, with either a separate socket first or a scheduled fleet recycle.
## Design decisions
### 1. Generic framework, local profile
The Mosaic framework should ship:
- systemd unit templates;
- tmux fleet CLI wrappers;
- roster schema and examples;
- install/enable/status/reset commands;
- docs and verification scripts.
Local environments should provide:
- agent names;
- runtime per slot (`claude`, `pi`, `codex`, etc.);
- default role class;
- launch directory;
- optional kickstart prompt;
- model/provider hints;
- transport selection (`tmux` now, `matrix` later).
Do not bake the USC roster into generic install code. Ship it as an example profile.
### 2. Durable sessions, disposable task context
Session names are durable operational addresses. Task persona is disposable. Reusable worker slots should be reset with `/clear` or `/new` and then receive a fresh task kickstart.
Persistent/semi-persistent personas:
- lead orchestrator;
- final/adversarial reviewer;
- architecture/enhancement lane.
Disposable slots:
- implementers;
- ordinary reviewers;
- security reviewers unless actively holding a security mission.
### 3. Transport abstraction now
Add commands around tmux instead of calling tmux directly from orchestration:
```bash
mosaic agent send <agent> --message "..."
mosaic agent status [--json]
mosaic agent reset <agent> [--clear|--new]
mosaic agent roster [--json]
mosaic fleet install|start|stop|restart|status|verify
```
Today these call tmux/systemd. Later the same command surface can target Matrix or per-agent gateways.
### 4. Avoid shared-server ownership bug
Use the AI Guide holder pattern:
```text
mosaic-tmux-holder.service owns the tmux server/socket
mosaic-agent@<name>.service joins the existing holder-owned socket
ExecStop kills only session =<name>
```
Use exact tmux targets: `=<session>`.
### 5. Prefer separate named socket for Mosaic factory
To avoid disturbing existing tmux work, the default fleet should use a named socket such as:
```text
$XDG_RUNTIME_DIR/mosaic-factory.tmux
```
or tmux socket name:
```bash
tmux -L mosaic-factory ...
```
This avoids collision with ordinary `tmux ls` sessions. The send tools need socket support.
---
## Target USC-style roster example
Ship as example only, not default:
```yaml
version: 1
transport: tmux
tmux:
socket_name: mosaic-factory
holder_session: _holder
working_directory: ~/src
agents:
- name: mos-claude
runtime: claude
class: orchestrator
model_hint: Claude Opus
persistent_persona: true
- name: coder0
runtime: claude
class: implementer
model_hint: Claude Opus
reset_between_tasks: true
- name: coder1
runtime: claude
class: implementer
model_hint: Claude Opus
reset_between_tasks: true
- name: coder2
runtime: pi
class: implementer
model_hint: Pi GPT-5.5
reset_between_tasks: true
- name: coder3
runtime: pi
class: implementer
model_hint: Pi GPT-5.5
reset_between_tasks: true
- name: coder4
runtime: claude
class: implementer
model_hint: Claude Opus
reset_between_tasks: true
- name: coder5
runtime: claude
class: implementer
model_hint: Claude Opus
reset_between_tasks: true
- name: enhance
runtime: claude
class: enhancer
model_hint: Claude Opus
persistent_persona: semi
- name: rev0
runtime: pi
class: reviewer
model_hint: Pi GPT-5.5
reset_between_tasks: true
- name: rev1
runtime: pi
class: reviewer
model_hint: Pi GPT-5.5
reset_between_tasks: true
- name: secrev0
runtime: pi
class: security_reviewer
model_hint: Pi GPT-5.5
reset_between_tasks: true
- name: secrev1
runtime: pi
class: security_reviewer
model_hint: Pi GPT-5.5
reset_between_tasks: true
- name: ultron
runtime: pi
class: final_reviewer
model_hint: Pi GPT-5.5
persistent_persona: semi
```
---
## Phase 0 — Confirm install surfaces
### Task 0.1: Inspect installer copy behavior
**Objective:** Confirm how framework files under `packages/mosaic/framework/` become installed under `~/.config/mosaic/`.
**Files:**
- Read: `tools/install.sh`
- Read: `packages/mosaic/framework/install.sh`
- Read: `packages/mosaic/src/runtime/install-manifest.ts`
**Steps:**
1. Verify `packages/mosaic/framework/install.sh` rsyncs `tools/tmux`.
2. Verify whether npm-packaged installs include `framework/tools/tmux`.
3. Confirm whether installed hosts should run `mosaic update`, `bash tools/install.sh`, or `packages/mosaic/framework/install.sh` to receive new tmux tools.
4. Record exact propagation command in docs.
**Verification:**
```bash
bash packages/mosaic/framework/install.sh --help || true
npm pack --dry-run --json | jq '.[0].files[].path' | grep 'framework/tools/tmux'
```
Expected: tmux tools are included in installable package or packaging fix is identified.
### Task 0.2: Inspect current yolo launch semantics
**Objective:** Confirm `mosaic yolo claude` and `mosaic yolo pi` accept optional initial prompt text and behave well under systemd/tmux.
**Files:**
- Read: `packages/mosaic/src/**`
- Read: `packages/mosaic/framework/runtime/claude/RUNTIME.md`
- Read: `packages/mosaic/framework/runtime/pi/RUNTIME.md`
**Verification commands:**
```bash
mosaic yolo claude --help
mosaic yolo pi --help
```
Expected: a systemd `ExecStart` can launch the runtime either with no prompt or with a kickstart prompt file/string.
---
## Phase 1 — Framework tmux primitives
### Task 1.1: Add socket support to send tools
**Objective:** Allow `agent-send.sh` and `send-message.sh` to target a named Mosaic tmux socket without affecting default tmux sessions.
**Files:**
- Modify: `packages/mosaic/framework/tools/tmux/send-message.sh`
- Modify: `packages/mosaic/framework/tools/tmux/agent-send.sh`
- Modify: `packages/mosaic/framework/tools/tmux/README.md`
- Test: `packages/mosaic/framework/tools/tmux/test-send-message.sh` (new)
**Design:**
Add optional flags:
```bash
-L SOCKET_NAME # tmux -L socket name
-SOCKET PATH # optional later if needed; avoid conflict with existing -S source label in agent-send
```
Because `agent-send.sh` already uses `-S` for source label, prefer `-L` for socket name and `-T` or `--socket-path` only if long-option parsing is added.
**Implementation notes:**
- Build a tmux command array:
```bash
tmux_cmd=(tmux)
if [ -n "$SOCKET_NAME" ]; then tmux_cmd+=( -L "$SOCKET_NAME" ); fi
```
- Replace raw `tmux ...` calls with `"${tmux_cmd[@]}" ...`.
- Pass `-L` through remote ssh invocation.
- Include socket name in verbose output.
**Verification:**
```bash
tmux -L mosaic-test new-session -d -s target 'cat'
packages/mosaic/framework/tools/tmux/send-message.sh -L mosaic-test -t target -m 'hello'
tmux -L mosaic-test capture-pane -t target -p | grep hello
tmux -L mosaic-test kill-server
```
Expected: message lands in the named socket session; default `tmux ls` is untouched.
### Task 1.2: Add exact target validation helper
**Objective:** Prevent accidental prefix targeting in all tmux fleet operations.
**Files:**
- Create: `packages/mosaic/framework/tools/tmux/_lib.sh`
- Modify: `send-message.sh`
- Modify: `agent-send.sh`
**Behavior:**
- For session-only agent names, normalize target to `=<name>` before kill/status/reset operations.
- For explicit pane targets like `session:window.pane`, allow as advanced path but document the risk.
**Verification:**
Create sessions `agent` and `agent0`; verify killing/resetting `agent` does not affect `agent0`.
---
## Phase 2 — systemd unit templates
### Task 2.1: Add holder service template
**Objective:** Ship a user systemd unit template that owns the Mosaic factory tmux server.
**Files:**
- Create: `packages/mosaic/framework/systemd/user/mosaic-tmux-holder.service`
- Create: `packages/mosaic/framework/tools/fleet/install-user-units.sh`
**Unit shape:**
```ini
[Unit]
Description=Mosaic tmux fleet holder
Documentation=https://git.mosaicstack.dev/mosaicstack/aiguide
[Service]
Type=oneshot
RemainAfterExit=yes
Environment=MOSAIC_TMUX_SOCKET=mosaic-factory
ExecStart=/usr/bin/tmux -L ${MOSAIC_TMUX_SOCKET} new-session -d -s _holder 'while true; do sleep 3600; done'
ExecStop=-/usr/bin/tmux -L ${MOSAIC_TMUX_SOCKET} kill-server
[Install]
WantedBy=default.target
```
**Important:** systemd environment expansion in `ExecStart` is limited. Verify syntax; if `%E`/environment expansion is awkward, generate concrete units from config instead of relying on dynamic expansion.
**Verification:**
```bash
systemd-analyze --user verify ~/.config/systemd/user/mosaic-tmux-holder.service
systemctl --user daemon-reload
systemctl --user start mosaic-tmux-holder.service
tmux -L mosaic-factory ls | grep _holder
```
### Task 2.2: Add agent service template
**Objective:** Ship a user systemd template that starts one configured agent slot.
**Files:**
- Create: `packages/mosaic/framework/systemd/user/mosaic-agent@.service`
- Modify: `packages/mosaic/framework/tools/fleet/install-user-units.sh`
**Unit shape:**
```ini
[Unit]
Description=Mosaic agent session %i
Requires=mosaic-tmux-holder.service
After=mosaic-tmux-holder.service
PartOf=mosaic-tmux-holder.service
[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=%h/src
Environment=MOSAIC_TMUX_SOCKET=mosaic-factory
ExecStart=/bin/bash -lc 'tmux -L "$MOSAIC_TMUX_SOCKET" new-session -d -s "%i" "mosaic yolo $(mosaic fleet runtime %i)"'
ExecStop=-/usr/bin/tmux -L mosaic-factory kill-session -t '=%i'
[Install]
WantedBy=default.target
```
**Design warning:** command substitution in unit files can become brittle. Prefer a generated per-agent EnvironmentFile:
```text
~/.config/mosaic/fleet/agents/coder0.env
```
with:
```bash
MOSAIC_AGENT_NAME=coder0
MOSAIC_AGENT_RUNTIME=claude
MOSAIC_AGENT_WORKDIR=/home/jarvis/src
MOSAIC_TMUX_SOCKET=mosaic-factory
```
Then `ExecStart` calls a wrapper:
```bash
~/.config/mosaic/tools/fleet/start-agent-session.sh
```
**Verification:**
```bash
systemd-analyze --user verify ~/.config/systemd/user/mosaic-agent@.service
systemctl --user start mosaic-agent@coder0.service
tmux -L mosaic-factory has-session -t '=coder0'
systemctl --user restart mosaic-agent@coder0.service
```
Expected: holder server PID remains unchanged; only `coder0` session recycles.
### Task 2.3: Add start-agent wrapper
**Objective:** Keep systemd units simple by moving config lookup and launch command construction into a script.
**Files:**
- Create: `packages/mosaic/framework/tools/fleet/start-agent-session.sh`
**Behavior:**
Inputs:
```bash
start-agent-session.sh <agent-name>
```
Reads:
```text
$MOSAIC_HOME/fleet/agents/<agent-name>.env
```
Starts:
```bash
tmux -L "$MOSAIC_TMUX_SOCKET" new-session -d -s "$MOSAIC_AGENT_NAME" -c "$MOSAIC_AGENT_WORKDIR" "mosaic yolo $MOSAIC_AGENT_RUNTIME"
```
Guardrails:
- fail if runtime is empty;
- fail if workdir does not exist;
- no duplicate sessions unless `--replace` is passed;
- exact session names only.
---
## Phase 3 — roster config and CLI wrappers
### Task 3.1: Add fleet config schema and examples
**Objective:** Define customizable install-time roster without hardcoding USC.
**Files:**
- Create: `packages/mosaic/framework/fleet/roster.schema.json`
- Create: `packages/mosaic/framework/fleet/examples/minimal.yaml`
- Create: `packages/mosaic/framework/fleet/examples/usc-software-factory.yaml`
- Create: `packages/mosaic/framework/fleet/README.md`
**Schema concepts:**
- `transport`: `tmux` now; `matrix` later.
- `tmux.socket_name`
- `tmux.holder_session`
- `defaults.working_directory`
- `agents[].name`
- `agents[].runtime`
- `agents[].class`
- `agents[].model_hint`
- `agents[].persistent_persona`
- `agents[].reset_between_tasks`
- `agents[].kickstart_template`
**Verification:**
Use `jq` for JSON examples or add a small Python/YAML validator if YAML is chosen. If no YAML parser is guaranteed, store examples as JSON or support both with Python stdlib JSON first.
### Task 3.2: Add `mosaic fleet` commands
**Objective:** Provide operator-safe commands for install/status/start/stop/restart/verify.
**Files:**
- Modify: `packages/mosaic/src/cli.ts` or the current commander entrypoint.
- Create scripts under: `packages/mosaic/framework/tools/fleet/`
**Commands:**
```bash
mosaic fleet init --profile minimal|usc --write
mosaic fleet install-systemd
mosaic fleet start [agent]
mosaic fleet stop [agent]
mosaic fleet restart [agent]
mosaic fleet status --json
mosaic fleet verify
```
**Implementation path:**
Start by wrapping framework shell scripts from the TypeScript CLI. Do not overbuild a TypeScript service manager in the first pass.
### Task 3.3: Add `mosaic agent` commands
**Objective:** Provide transport-stable per-agent operations.
**Files:**
- Modify: Mosaic CLI entrypoint.
- Create: `packages/mosaic/framework/tools/agent/` or reuse `tools/tmux` + `tools/fleet`.
**Commands:**
```bash
mosaic agent roster [--json]
mosaic agent status [agent] [--json]
mosaic agent send <agent> --message "..."
mosaic agent reset <agent> --clear|--new
mosaic agent tail <agent> [-n 80]
```
**Reset behavior:**
For tmux transport, `reset --clear` sends `/clear` then Enter through `send-message.sh`.
For Claude/Pi differences, keep reset command configurable per runtime:
```yaml
runtimes:
claude:
reset_command: /clear
pi:
reset_command: /new
```
If a runtime does not support a known reset command, restart the service and send a fresh kickstart.
---
## Phase 4 — this-server rollout strategy
### Task 4.1: Install on separate socket first
**Objective:** Prove the holder pattern without disturbing existing sessions.
**Commands after implementation lands locally:**
```bash
mosaic fleet init --profile minimal --write
mosaic fleet install-systemd
systemctl --user daemon-reload
systemctl --user start mosaic-tmux-holder.service
mosaic fleet verify
```
Expected:
- `tmux -L mosaic-factory ls` shows `_holder`.
- normal `tmux ls` still shows existing sessions unchanged.
### Task 4.2: Start one canary agent
**Objective:** Validate single-agent start/restart isolation.
Use a harmless canary first, not the full fleet.
Example roster addition:
```yaml
- name: canary-pi
runtime: pi
class: canary
working_directory: /home/jarvis/src
```
Commands:
```bash
systemctl --user start mosaic-agent@canary-pi.service
SRV=$(tmux -L mosaic-factory display-message -p '#{pid}')
systemctl --user restart mosaic-agent@canary-pi.service
test "$SRV" = "$(tmux -L mosaic-factory display-message -p '#{pid}')"
tmux -L mosaic-factory ls
```
Expected: holder PID unchanged; `_holder` remains; `canary-pi` recreated.
### Task 4.3: Configure local Mosaic factory roster
**Objective:** Create the actual local roster for this server after canary passes.
Do not assume USC exact roster is desired here. Create a local profile such as:
```text
~/.config/mosaic/fleet/roster.yaml
```
Initial local recommendation:
- `mos-claude` orchestrator
- `coder0` / `coder1` implementers
- `rev0` reviewer
- `secrev0` security reviewer
- `ultron` final/adversarial reviewer
Scale to full USC-style pool only after resource/budget behavior is understood.
### Task 4.4: Cut over existing ad-hoc tmux sessions only if desired
**Objective:** Avoid data loss.
Existing sessions on this server are not on the proposed `mosaic-factory` socket. They can remain untouched. If we later want them under Mosaic fleet control:
1. list sessions;
2. capture logs/handoffs;
3. stop old processes intentionally;
4. recreate as configured `mosaic-agent@...` services;
5. verify comms and state.
Do not run `tmux kill-server` on the default socket unless Jason explicitly approves that outage.
---
## Phase 5 — docs and AI Guide backfill
### Task 5.1: Stack docs
**Objective:** Document install and customization for Mosaic Stack users.
**Files:**
- Create: `docs/fleet/tmux-fleet.md` or `packages/mosaic/framework/tools/fleet/README.md`
- Modify: top-level `README.md` if appropriate.
Must cover:
- what problem holder service solves;
- install commands;
- customization file;
- example rosters;
- reset/reuse lifecycle;
- exact-target safety;
- separate socket default;
- Matrix migration path.
### Task 5.2: AI Guide docs
**Objective:** Keep generic guidance in AI Guide and implementation details in Stack.
**Files in `mosaicstack/aiguide`:**
- Update: `playbooks/tmux-fleet.md` with named socket, roster/profile, and resettable-slot pattern.
- Add or update: `reference/agent-role-matrix.md` if PR #5 lands.
Do not put Mosaic install commands as the only path in AI Guide. Present them as one implementation profile.
---
## Phase 6 — Matrix migration seam
### Task 6.1: Add transport enum but implement tmux only
**Objective:** Avoid hardcoding tmux into orchestration semantics.
Roster:
```yaml
transport: tmux
```
Future:
```yaml
transport: matrix
matrix:
homeserver: https://matrix.example
room_prefix: mosaic-factory
```
### Task 6.2: Define transport interface docs
**Objective:** Make Matrix plugin work a transport swap, not a rewrite.
Minimum operations:
```text
send(agent, message)
reset(agent, mode)
status(agent)
tail(agent)
listAgents()
```
Any tmux-specific concept must stay below this line.
---
## Acceptance criteria
The implementation is complete when:
- `mosaic fleet init` can write a minimal roster.
- `mosaic fleet install-systemd` installs holder and agent units without hand editing.
- `mosaic fleet start` starts the holder and configured agents on a named tmux socket.
- Restarting one `mosaic-agent@name.service` does not change holder server PID or kill sibling sessions.
- `mosaic agent send` can deliver a message to a named agent with a self-identifying preamble.
- `mosaic agent reset` can clear/new a reusable slot and send a fresh kickstart.
- `mosaic fleet verify` proves holder ownership, exact-target safety, and per-agent restart isolation.
- Existing default tmux sessions on this server are not disturbed by default install.
- Docs explain generic customization and include USC-style roster only as an example.
- AI Guide remains generic; Mosaic Stack docs carry the concrete install path.
## Risks and mitigations
| Risk | Mitigation |
| --------------------------------------------------- | --------------------------------------------------------------------------------- |
| Killing existing tmux sessions | Use named `mosaic-factory` socket; no default `tmux kill-server`. |
| systemd unit quoting/env expansion bugs | Move logic into shell wrappers; verify with `systemd-analyze --user verify`. |
| Runtime reset command mismatch | Make reset command runtime-configurable; fallback to service restart + kickstart. |
| Tool install drift | Ensure npm package includes framework tmux/fleet tools; add packaging test. |
| Mosaic-specific assumptions leak into generic guide | Keep USC roster as example profile; AI Guide documents pattern/options. |
| Matrix migration blocked by tmux coupling | Add `mosaic agent` abstraction now; keep tmux details below transport layer. |
## Suggested first PR split
1. **PR A — tmux tool hardening**
- socket support;
- exact target helpers;
- tests/docs.
2. **PR B — fleet systemd primitives**
- holder unit;
- agent unit;
- start-agent wrapper;
- install-user-units script;
- verify script.
3. **PR C — roster and CLI**
- roster schema/examples;
- `mosaic fleet ...` commands;
- `mosaic agent ...` commands.
4. **PR D — local rollout and docs**
- local roster for this server;
- run canary;
- document verification evidence;
- update AI Guide with generic lessons.
## Immediate next action
Implement PR A first. It is low-risk, improves existing tools, and is required for a safe named-socket rollout on this server.