Compare commits
2 Commits
fix/pr-ci-
...
docs/missi
| Author | SHA1 | Date |
|---|---|---|
| | 5cde3a3b6d | |
| | 079c5597ff | |
99
docs/mission-control/BOARD.md
Normal file
@@ -0,0 +1,99 @@
# Mission Control Plane — Feature Board

> Discussion board for the combined PRD / mission / Kanban workflow.
> Use this to decide scope before implementation.

## Board Legend

- **Must-have** — required for the first usable version
- **Should-have** — strongly preferred, but can ship after the core path
- **Could-have** — valuable later if time permits
- **Won't-have** — explicitly deferred

---

## Feature Board

| Feature Card | Need | Priority | Decision / Notes |
| --- | --- | --- | --- |
| Canonical mission manifest | One durable root object for goal, PRD, board, session | Must-have | Mission manifest becomes the anchor for all downstream state |
| PRD generator integration | PRD should be generated from a feature idea and saved in docs | Must-have | Use Mosaic PRDy format and keep the file human-reviewable |
| Board atomization | Break PRD into assignable tasks with dependencies | Must-have | Each user story should map to one or more tasks |
| Short-cycle detector | Detect compaction churn and repeated tool loops | Must-have | Coordinator should track churn score per session |
| Handoff packet | Preserve actionable context across rotations | Must-have | Use a compact structured summary, not a raw transcript |
| Auto-resume workers | Let new sessions read mission + board on start | Should-have | Makes overnight autonomy realistic |
| Mission status view | Show current phase, blockers, and active session | Should-have | Expose through CLI first, dashboard later |
| Worktree root convention | Keep worktrees off `/tmp` and on the larger persistent drive | Should-have | Prefer `/src/<repo>-worktrees` for repo worktrees and long-lived agent work |
| Review gate | Prevent autonomous work from shipping unreviewed | Should-have | Use reviewer tasks before mission close |
| Rotation policy config | Configure thresholds per mission/profile | Could-have | Keep v1 simple, add tuning later |
| Goal decomposition suggestions | Suggest sub-goals from the PRD | Could-have | Good for planning, not necessary for core path |
| Cross-channel continuity | Continue a mission across CLI/gateway/remote channels | Could-have | Important later, not required for MVP |
| Automatic board sync | Mirror git docs into DB and back | Could-have | Nice-to-have after the file-first flow stabilizes |
| Fully autonomous closeout | Let mission finish without human intervention | Won't-have | Keep an operator-visible review step |

---

## Needs Discussion

### 1) Canonical source of truth

**Question:** Should the PRD, mission manifest, and board all live in git, or should one be the database source of truth?

**Proposed answer:** Keep the human-readable artifacts in git and sync the mission runtime state to the database.

### 2) Scope of automation

**Question:** Should the first version auto-create the board from the PRD, or require a human/orchestrator to approve the split?

**Proposed answer:** Auto-create a draft board, then let the orchestrator approve or adjust it.

### 3) Rotation triggers

**Question:** What should trigger a forced session rotation?

**Candidate signals:**

- repeated compaction
- repeated prompts for permission
- identical tool loops
- no new file/task state after several turns
- task blocked on a missing prerequisite

**Proposed answer:** Use a weighted churn score with a small hard cap on repeated compactions.
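
One way to read the proposed answer is sketched below in Python. The signal names mirror the candidate list above; the weights, the score threshold, and the compaction cap are illustrative assumptions, not values taken from this board.

```python
from dataclasses import dataclass

# Hypothetical weights and limits, chosen only for illustration.
WEIGHTS = {
    "compactions": 2.0,
    "permission_prompts": 1.0,
    "identical_tool_loops": 2.0,
    "stalled_turns": 1.5,
    "blocked_on_prerequisite": 2.0,
}
SCORE_THRESHOLD = 8.0  # rotate once the weighted score reaches this value
MAX_COMPACTIONS = 3    # hard cap: rotate at this many compactions, whatever the score


@dataclass
class SessionSignals:
    """Per-session counters for the candidate churn signals."""
    compactions: int = 0
    permission_prompts: int = 0
    identical_tool_loops: int = 0
    stalled_turns: int = 0
    blocked_on_prerequisite: int = 0


def churn_score(signals: SessionSignals) -> float:
    """Weighted sum of all signal counters."""
    return sum(weight * getattr(signals, name) for name, weight in WEIGHTS.items())


def should_rotate(signals: SessionSignals) -> bool:
    """Rotate on the hard compaction cap first, then on the weighted score."""
    if signals.compactions >= MAX_COMPACTIONS:
        return True
    return churn_score(signals) >= SCORE_THRESHOLD
```

With these assumed values, three compactions force a rotation even though the weighted score (6.0) is still under the threshold, which is the role the hard cap plays in the proposal.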

### 4) Handoff format

**Question:** What should the next session receive?

**Proposed answer:**

- Mission ID
- PRD path
- Active board task
- Completed work
- Blockers
- Next 3 actions
- Non-negotiable constraints
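
A minimal sketch of such a packet as a Python dataclass, assuming JSON as the on-disk encoding (the board does not fix an encoding). The class and field names are hypothetical; only the seven items listed above come from the proposal.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffPacket:
    """Hypothetical handoff packet carrying the seven proposed items."""
    mission_id: str
    prd_path: str
    active_task: str
    completed_work: list[str] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)
    next_actions: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The proposal says "Next 3 actions", so reject anything longer.
        if len(self.next_actions) > 3:
            raise ValueError("a handoff packet carries at most 3 next actions")

    def to_json(self) -> str:
        """Serialize the packet for the next session."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "HandoffPacket":
        """Rebuild a packet from its JSON form."""
        return cls(**json.loads(text))


# Round trip with sample values (the task ID and action text are illustrative).
packet = HandoffPacket(
    mission_id="mission-control-plane-20260506",
    prd_path="docs/mission-control/PRD.md",
    active_task="MC-02-01",
    next_actions=["Define the mission schema"],
)
restored = HandoffPacket.from_json(packet.to_json())
```

Rejecting a fourth action at construction time keeps the packet compact, in line with the "compact structured summary, not a raw transcript" note on the Handoff packet card.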

### 5) Operator control

**Question:** Should the operator be able to force a rotation or pause the mission?

**Proposed answer:** Yes. Human override should win.

---

## Draft Decisions

1. File-first artifacts, DB-backed runtime state.
2. PRD-first planning, board-second execution.
3. Auto-rotation on churn, but human override remains available.
4. Structured handoff packets required on every rotation.
5. Mission close requires a reviewer task.

---

## Open Questions

- What exact data fields belong in the mission manifest?
- Should rotation thresholds vary by agent profile?
- What is the minimum viable status surface for v1?
- Should the board support milestones in addition to tasks?

95
docs/mission-control/MISSION-MANIFEST.md
Normal file
@@ -0,0 +1,95 @@
# Mission Manifest — Mosaic Mission Control Plane

> Persistent document tracking scope, status, and handoff history for the combined PRD / mission / Kanban workflow.

## Mission

**ID:** mission-control-plane-20260506

**Statement:** Combine Mosaic PRDy, coord, and Kanban into one durable workflow so an agent can move from feature idea to PRD to mission to task board and keep working across session rotation, compaction, and restarts with minimal context loss.

**Phase:** planning — MC-01 complete, MC-02 next

**Current Milestone:** MC-02

**Progress:** 1 / 6 milestones

**Status:** active

**Last Updated:** 2026-05-06

**Parent Mission:** None — new mission

---

## Context

This mission exists because overnight autonomy breaks when the working session short-cycles. The system needs durable artifacts and a mechanical coordinator that can:

1. keep a canonical PRD,
2. atomize the PRD into board tasks,
3. track mission state separately from the chat session,
4. detect churn or compaction pressure,
5. rotate to a fresh session, and
6. re-enter from a structured handoff.

Operational convention: repo worktrees and long-lived working directories should use `/src/<repo>-worktrees` instead of `/tmp`.

Design references:

- `docs/mission-control/PRD.md` — product requirements
- `docs/mission-control/BOARD.md` — feature discussion board
- `docs/mission-control/TASKS.md` — atomized execution plan

---

## Success Criteria

- [ ] AC-1: A feature idea can be converted into a PRD, mission, and task board.
- [ ] AC-2: The coordinator can load a mission and its board from durable storage.
- [ ] AC-3: The coordinator can detect short-cycling and rotate sessions automatically.
- [ ] AC-4: A rotated session can resume from a handoff packet without manual re-prompting.
- [ ] AC-5: The board remains traceable back to the PRD user stories.
- [ ] AC-6: Operators can inspect mission state, task state, and latest handoff from one place.
- [ ] AC-7: The system can run overnight without losing the mission goal.

---

## Milestones

| # | ID | Name | Status | Branch | Started | Completed |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | MC-01 | PRD + mission schema foundation | in-progress | docs/mission-control-* | 2026-05-06 | — |
| 2 | MC-02 | Mission runtime model | not-started | — | — | — |
| 3 | MC-03 | Board atomization and task linkage | not-started | — | — | — |
| 4 | MC-04 | Short-cycle detector and rotation engine | not-started | — | — | — |
| 5 | MC-05 | Handoff generation and re-entry | not-started | — | — | — |
| 6 | MC-06 | Operator surface and E2E validation | not-started | — | — | — |

---

## Budget

| Milestone | Est. tokens | Parallelizable? |
| --- | --- | --- |
| MC-01 | 16K | No |
| MC-02 | 20K | No |
| MC-03 | 24K | Mostly after MC-01 |
| MC-04 | 20K | After MC-02 |
| MC-05 | 18K | After MC-04 |
| MC-06 | 26K | After MC-04/05 |
| **Total** | **~124K** | |

---

## Session History

| Session | Date | Runtime | Outcome |
| --- | --- | --- | --- |
| S1 | 2026-05-06 | hermes | PRD, board, task plan, mission manifest, and worktree convention drafted |

---

## Next Step

Kick off MC-02: implement the durable mission runtime model and wire the mission state into the coordinator.

200
docs/mission-control/PRD.md
Normal file
@@ -0,0 +1,200 @@
# PRD: Mosaic Mission Control Plane

## Metadata

- **Owner:** Jason Woltje
- **Date:** 2026-05-06
- **Status:** draft
- **Framework:** Mosaic PRDy + coord + Kanban
- **Target Repo:** `git.mosaicstack.dev/mosaic/mosaic-stack`
- **Primary Modules:** `packages/prdy`, `packages/coord`, `packages/queue`, `apps/gateway`, `packages/brain`, `packages/cli`

---

## Problem Statement

Mosaic already has the ingredients for durable agent work: PRD generation (`prdy`), mission coordination (`coord`), and task execution boards (`Kanban` / `TASKS.md`). Today those systems can still drift apart:

- A PRD can exist without a mission record.
- A mission can exist without a machine-readable execution board.
- Agents can short-cycle or compact repeatedly without a durable handoff.
- The next session may know the goal, but not the exact next step.

The result is brittle overnight autonomy: work continues only as long as a single session remains healthy.

This feature unifies those layers into one durable workflow so a mission can survive session rotation, compaction, and restarts with minimal state loss.

---

## Goals

1. Create one canonical pipeline from idea → PRD → mission → board → execution.
2. Let `prdy` generate a PRD that is immediately usable as a mission input.
3. Let `coord` own mission state, handoffs, and session rotation.
4. Let the board hold atomized tasks with dependencies and assignees.
5. Let agents read the mission and board to learn the next action without extra prompting.
6. Detect short-cycling and rotate sessions before quality degrades.
7. Preserve useful context across handoffs with a structured summary packet.
8. Give operators a single place to see mission status, task state, and the current session.

---

## Non-Goals

1. Replacing the Mosaic agent runtime or gateway architecture.
2. Rewriting `prdy` or `coord` from scratch.
3. Turning the board into a general project-management system.
4. Building a full Gantt/charting product.
5. Removing human review or approval gates.
6. Allowing agents to create arbitrary mission state without schema.

---

## User Stories

### US-001: Create a mission from a feature idea

**Description:** As an orchestrator, I want to turn a feature idea into a PRD and mission so that agents can work from a durable spec instead of a chat transcript.

**Acceptance Criteria:**

- [ ] `prdy` can emit a PRD with goals, non-goals, and requirements.
- [ ] The PRD is linked to a mission ID.
- [ ] The mission manifest references the PRD path.
- [ ] The mission is readable by downstream agent sessions.

### US-002: Atomize work into a board

**Description:** As an orchestrator, I want to split a PRD into board tasks so that work can be assigned to specialists.

**Acceptance Criteria:**

- [ ] Each user story can become one or more tasks.
- [ ] Tasks have assignees, dependencies, and estimates.
- [ ] Tasks are machine-readable and durable.
- [ ] The board can be regenerated from the PRD without ambiguity.

### US-003: Rotate sessions without losing the mission

**Description:** As a coordinator, I want to restart or rotate a session when it short-cycles so that the mission continues with minimal loss.

**Acceptance Criteria:**

- [ ] The coordinator detects compaction pressure or repeated loops.
- [ ] The coordinator writes a handoff summary before rotation.
- [ ] A new session can resume from the handoff packet.
- [ ] The mission state remains intact across the rotation.

### US-004: Let workers read the next step automatically

**Description:** As a worker agent, I want to read the mission and board at startup so I can do the next useful thing without waiting for a human prompt.

**Acceptance Criteria:**

- [ ] Startup loads the active mission manifest.
- [ ] Startup loads the current board/task row.
- [ ] Startup exposes the next action clearly in the prompt.
- [ ] The agent can continue after compaction using the same mission context.

### US-005: Observe mission health from one place

**Description:** As an operator, I want a single view of mission health so that I can see progress, blocked tasks, and session churn.

**Acceptance Criteria:**

- [ ] Mission state shows current phase and progress.
- [ ] Board state shows task status by assignee.
- [ ] Short-cycle/rotation events are visible.
- [ ] Handoffs are inspectable.

---

## Functional Requirements

FR-1. The system must represent a mission as a durable object with an ID, goal, current phase, PRD path, board path, and active session ID.

FR-2. The system must represent a PRD as a markdown document with goals, user stories, functional requirements, non-goals, technical considerations, and success metrics.

FR-3. The system must represent execution work as a board of atomized tasks with status, assignee, dependency, and estimate fields.

FR-4. The coordinator must be able to derive a task board from a PRD.

FR-5. The coordinator must be able to write a handoff packet that includes goal, current state, completed work, blocked work, next steps, and constraints.

FR-6. The coordinator must detect short-cycling signals such as repeated compactions, repeated tool loops, repeated approval prompts, or no progress across several turns.

FR-7. The coordinator must rotate the session when the short-cycle threshold is exceeded.

FR-8. The coordinator must preserve mission continuity across session rotation.

FR-9. The worker session must read the mission state and board state at startup.

FR-10. The worker session must be able to resume from the last handoff summary without the operator rewriting the goal manually.

FR-11. The operator must be able to inspect the mission state, PRD, board, and latest handoff from one place.

FR-12. The mission system must keep a traceable link between PRD requirements and board tasks.

FR-13. The system must not allow a task to become active without a valid mission context.

FR-14. The system must keep durable history for rotation and handoff events.
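
FR-1 and FR-13 can be illustrated together. The sketch below assumes a Python dataclass named `Mission` with hypothetical field names and a hypothetical `activate_task` helper; none of these identifiers come from the Mosaic codebase, and the validity rule is only one plausible reading of "valid mission context".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mission:
    """Durable mission object with the fields listed in FR-1 (names assumed)."""
    mission_id: str
    goal: str
    phase: str
    prd_path: str
    board_path: str
    active_session_id: Optional[str] = None

    def is_valid_context(self) -> bool:
        # Assumption: a mission is usable once its identifying fields are non-empty.
        return all([self.mission_id, self.goal, self.prd_path, self.board_path])


def activate_task(task_id: str, mission: Optional[Mission]) -> str:
    """Refuse to activate a task without a valid mission context (FR-13)."""
    if mission is None or not mission.is_valid_context():
        raise ValueError(f"cannot activate {task_id}: no valid mission context")
    return f"{task_id} active under {mission.mission_id}"


# Sample values only; the goal text is illustrative.
mission = Mission(
    mission_id="mission-control-plane-20260506",
    goal="Unify PRD, mission, and board into one durable workflow",
    phase="planning",
    prd_path="docs/mission-control/PRD.md",
    board_path="docs/mission-control/TASKS.md",
)
message = activate_task("MC-02-01", mission)
```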

---

## Board Discussion: Features and Needs

This is the feature discussion board that should drive the mission design.

| Card | Need | Why it matters | Proposed decision |
| --- | --- | --- | --- |
| Canonical mission record | One source of truth for goal/state | Prevents drift between chat, docs, and queue | Make mission manifest the durable root object |
| PRD → board derivation | Break feature ideas into executable work | Lets the plan be assigned and tracked | Keep PRD as the spec, generate board tasks from user stories |
| Session watchdog | Detect churn/short-cycling | Keeps overnight runs productive | Add short-cycle scoring and forced rotation |
| Structured handoff | Preserve context across session changes | Minimizes restart loss | Use a compact JSON/MD handoff packet |
| Worker auto-read | Let agents resume without human re-prompting | Reduces operator overhead | Load mission + board on session start |
| Status surface | Show progress and blockers clearly | Operators need confidence | Expose mission state via CLI and dashboard |
| Review gate | Keep quality high on autonomous work | Prevents silent regressions | Require review tasks before close |
| Recoverability | Resume after failure or restart | Mission should outlive a process | Persist session and handoff history |

---

## Design Considerations

1. The PRD should stay human-readable markdown, because the board and mission references need to be reviewable in git.
2. The board should be machine-readable enough for automation but still readable by humans.
3. The mission manifest should point to the PRD and board, not duplicate them.
4. Handoff packets should be compact and structured so they can be injected into a new session with minimal token cost.
5. The coordinator should prefer rotation over forced context growth once the session is near the compaction threshold.
6. Existing Mosaic commands should be extended, not replaced, wherever possible.
7. The same mission should be resumable across CLI, gateway, and remote channels.

---

## Technical Considerations

- Likely storage split:
  - PRD/board/manifest in git-backed docs
  - mission/session state in the Mosaic data layer
  - runtime health in queue/session state
- Worktrees and long-lived agent working directories should live under `/src/<repo>-worktrees` rather than `/tmp` so they sit on the larger persistent drive and survive longer-running missions.
- The coordinator needs a stable session identity, even if the active session changes.
- Task dependencies must be enforced so workers do not start early.
- The handoff packet should include the top 3 immediate actions and the strongest constraints.
- Rotation triggers should be configurable per profile or per mission.
- The initial version can be file-first, with dashboard sync added later.

---

## Success Metrics

- A mission can rotate sessions without losing the active goal.
- A new session can resume from the latest handoff within a single turn.
- Board tasks remain aligned to PRD user stories.
- Short-cycling sessions are replaced before repeated compaction harms quality.
- Operators can find mission state without spelunking across multiple chat logs.

---

## Open Questions

1. What should the canonical mission ID format be?
2. Should the board live only in git, or also in the database?
3. Should rotation be automatic by default, or opt-in per mission?
4. What should the short-cycle threshold be initially?
5. Should handoffs be pure text, structured JSON, or both?
6. Which CLI command should be the primary mission entrypoint: `mosaic mission`, `mosaic coord`, or `mosaic prdy`?

113
docs/mission-control/TASKS.md
Normal file
@@ -0,0 +1,113 @@
# Tasks — Mosaic Mission Control Plane

> Single-writer: orchestrator only. Workers read but never modify.
>
> **Mission:** mission-control-plane-20260506
> **Schema:** `| id | status | description | issue | agent | branch | depends_on | estimate | notes |`
> **Status values:** `not-started` | `in-progress` | `done` | `blocked` | `failed` | `needs-qa`
> **Agent values:** `codex` | `glm-5.1` | `haiku` | `sonnet` | `opus` | `—` (auto)
>
> Scope: this file decomposes the combined PRD / mission / board workflow into atomized tasks.
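
As an illustration of how the schema above could be consumed, here is a small Python parser for a single table row. It is a sketch under stated assumptions: the function name `parse_task_row` is hypothetical, cells are assumed not to contain literal `|` characters, and `—` is treated as an empty `depends_on` value because that is how the tables in this file use it.

```python
from typing import Optional

# Column order and status values copied from the schema header of this file.
COLUMNS = ["id", "status", "description", "issue", "agent",
           "branch", "depends_on", "estimate", "notes"]
STATUSES = {"not-started", "in-progress", "done", "blocked", "failed", "needs-qa"}


def parse_task_row(line: str) -> Optional[dict]:
    """Parse one task row; return None for header, delimiter, or non-table lines."""
    stripped = line.strip()
    if not stripped.startswith("|") or not stripped.endswith("|"):
        return None
    cells = [cell.strip() for cell in stripped[1:-1].split("|")]
    if len(cells) != len(COLUMNS):
        return None  # wrong column count, so not a task row under this schema
    if cells == COLUMNS or all(set(cell) <= set("-: ") for cell in cells):
        return None  # header row or delimiter row
    row = dict(zip(COLUMNS, cells))
    if row["status"] not in STATUSES:
        raise ValueError(f"unknown status {row['status']!r} in task {row['id']}")
    # Turn the comma-separated dependency cell into a list of task IDs.
    deps = row["depends_on"]
    row["depends_on"] = [] if deps in ("", "—") else [d.strip() for d in deps.split(",")]
    return row


sample = ("| MC-01-03 | not-started | Write the mission manifest. | — | sonnet "
          "| docs/mission-control-manifest | MC-01-01, MC-01-02 | 4K | Root object. |")
task = parse_task_row(sample)
```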

---

## Milestone 1 — PRD + mission schema foundation

Goal: create the durable doc structure and the minimal mission metadata needed to keep PRD, board, and mission aligned.

| id | status | description | issue | agent | branch | depends_on | estimate | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MC-01-01 | not-started | Write `docs/mission-control/PRD.md` with goals, non-goals, functional requirements, and success metrics. | — | sonnet | docs/mission-control-prd | — | 5K | Human-readable PRD becomes the spec anchor. |
| MC-01-02 | not-started | Write `docs/mission-control/BOARD.md` as a decision board for scope, priority, and open questions. | — | haiku | docs/mission-control-board | MC-01-01 | 3K | Keeps discussion separate from the spec. |
| MC-01-03 | not-started | Write `docs/mission-control/MISSION-MANIFEST.md` linking PRD, board, tasks, and mission identity. | — | sonnet | docs/mission-control-manifest | MC-01-01, MC-01-02 | 4K | Durable mission root object. |
| MC-01-04 | not-started | Write `docs/mission-control/TASKS.md` with the atomized execution plan and dependency graph. | — | sonnet | docs/mission-control-tasks | MC-01-03 | 4K | Board-backed execution plan. |

**Milestone 1 estimate:** ~16K tokens

---

## Milestone 2 — Mission runtime model

Goal: make missions first-class runtime objects that can survive session restarts and compaction.

| id | status | description | issue | agent | branch | depends_on | estimate | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MC-02-01 | not-started | Define mission schema in the data layer: mission ID, goal, phase, PRD path, board path, active session ID, last handoff, and churn score. | — | codex | feat/mission-control-schema | MC-01-03 | 6K | This is the durable root state. |
| MC-02-02 | not-started | Add mission read/write services to `packages/coord` so the coordinator can load and persist mission state. | — | codex | feat/mission-control-coord-store | MC-02-01 | 6K | Keep storage simple and explicit. |
| MC-02-03 | not-started | Add mission status reporting to `mosaic mission` and `mosaic coord status`. | — | codex | feat/mission-control-status-cli | MC-02-02 | 4K | Operators need one obvious status command. |
| MC-02-04 | not-started | Add tests for mission persistence and recovery after restart. | — | haiku | feat/mission-control-persistence-tests | MC-02-02 | 4K | Verify mission survives process churn. |
| MC-02-05 | done | Add a worktree-root convention to the mission runtime notes and startup guidance so agents prefer `/src/<repo>-worktrees` over `/tmp`. | — | haiku | docs/mission-control-worktree-root | MC-01-03 | 3K | Keep long-lived work on the larger persistent drive. |

**Milestone 2 estimate:** ~20K tokens

---

## Milestone 3 — Board atomization and task linkage

Goal: derive assignable tasks from the PRD and keep them linked to mission state.

| id | status | description | issue | agent | branch | depends_on | estimate | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MC-03-01 | not-started | Add a PRD-to-task decomposition rule set: every user story maps to one or more board tasks. | — | sonnet | feat/mission-control-decompose | MC-01-01 | 5K | Start simple and deterministic. |
| MC-03-02 | not-started | Implement board generation from the PRD in a machine-readable format. | — | codex | feat/mission-control-board-gen | MC-03-01 | 6K | Output should be usable by the coordinator. |
| MC-03-03 | not-started | Add dependency validation so tasks cannot start before parent tasks complete. | — | codex | feat/mission-control-deps | MC-03-02 | 5K | Enforces ordering. |
| MC-03-04 | not-started | Add review-task support so a mission cannot close without a reviewer step. | — | sonnet | feat/mission-control-review-gate | MC-03-03 | 4K | Preserves quality. |
| MC-03-05 | not-started | Add tests proving the board stays traceable back to the PRD user stories. | — | haiku | feat/mission-control-trace-tests | MC-03-02, MC-03-03 | 4K | Traceability is the point. |

**Milestone 3 estimate:** ~24K tokens
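
The ordering rule behind MC-03-03 can be sketched as a readiness check: a task may start only when every task in its `depends_on` list is `done`. The helper name `ready_tasks` and the sample statuses are assumptions for illustration; the task dictionaries use the `id`, `status`, and `depends_on` columns from the schema at the top of this file.

```python
def ready_tasks(tasks: list[dict]) -> list[str]:
    """Return IDs of not-started tasks whose dependencies are all done."""
    status = {task["id"]: task["status"] for task in tasks}
    ready = []
    for task in tasks:
        if task["status"] != "not-started":
            continue
        # A dependency that is not on the board is treated as an error.
        missing = [dep for dep in task["depends_on"] if dep not in status]
        if missing:
            raise ValueError(f"{task['id']} depends on unknown tasks: {missing}")
        if all(status[dep] == "done" for dep in task["depends_on"]):
            ready.append(task["id"])
    return ready


# Hypothetical snapshot of the board: only the first task has finished.
board = [
    {"id": "MC-03-01", "status": "done", "depends_on": []},
    {"id": "MC-03-02", "status": "not-started", "depends_on": ["MC-03-01"]},
    {"id": "MC-03-03", "status": "not-started", "depends_on": ["MC-03-02"]},
]
print(ready_tasks(board))  # prints ['MC-03-02']
```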

---

## Milestone 4 — Short-cycle detector and rotation engine

Goal: detect when a session is stuck and rotate to a fresh session before quality falls off.

| id | status | description | issue | agent | branch | depends_on | estimate | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MC-04-01 | not-started | Define churn signals: repeated compaction, identical tool loops, repeated permission prompts, and no progress across several turns. | — | sonnet | feat/mission-control-churn-signals | MC-02-01 | 4K | Keep the rules explicit. |
| MC-04-02 | not-started | Implement churn scoring in the coordinator with configurable thresholds. | — | codex | feat/mission-control-churn-score | MC-04-01 | 6K | Weighted score makes tuning easier. |
| MC-04-03 | not-started | Implement automatic session rotation when churn crosses the threshold. | — | codex | feat/mission-control-rotate-session | MC-04-02 | 6K | The session is disposable; the mission is not. |
| MC-04-04 | not-started | Add tests for rotation triggers and for avoiding premature rotation. | — | haiku | feat/mission-control-rotation-tests | MC-04-03 | 4K | Prevent flapping. |

**Milestone 4 estimate:** ~20K tokens

---

## Milestone 5 — Handoff generation and re-entry

Goal: preserve the best context from the old session and inject it into the new session cleanly.

| id | status | description | issue | agent | branch | depends_on | estimate | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MC-05-01 | not-started | Define the handoff packet schema: mission ID, session ID, completed work, blockers, next 3 actions, and constraints. | — | sonnet | feat/mission-control-handoff-schema | MC-02-01 | 4K | Keep it compact and structured. |
| MC-05-02 | not-started | Implement handoff packet writing during rotation. | — | codex | feat/mission-control-handoff-write | MC-05-01, MC-04-03 | 5K | Persist before the old session exits. |
| MC-05-03 | not-started | Implement handoff packet loading at session startup. | — | codex | feat/mission-control-handoff-load | MC-05-01, MC-04-03 | 5K | New session should know the next action. |
| MC-05-04 | not-started | Add tests proving a rotated session can continue the mission without manual re-prompting. | — | haiku | feat/mission-control-handoff-tests | MC-05-02, MC-05-03 | 4K | Resume quality is the key metric. |

**Milestone 5 estimate:** ~18K tokens

---

## Milestone 6 — Operator surface and E2E validation

Goal: expose the whole workflow through commands and verify it end-to-end.

| id | status | description | issue | agent | branch | depends_on | estimate | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MC-06-01 | not-started | Add a CLI command to inspect the active mission, PRD path, board path, task statuses, and latest handoff. | — | codex | feat/mission-control-inspect-cli | MC-02-03, MC-05-03 | 5K | One place to inspect the whole stack. |
| MC-06-02 | not-started | Add a compact dashboard or TUI summary view for mission health. | — | codex | feat/mission-control-summary-ui | MC-06-01 | 6K | Nice to have, but not before the core works. |
| MC-06-03 | not-started | Build an E2E harness that simulates compaction / rotation and verifies the mission can continue. | — | sonnet | feat/mission-control-e2e-harness | MC-04-03, MC-05-03 | 8K | This is the proof that the design works. |
| MC-06-04 | not-started | Add final docs for operators explaining how PRD, mission, and board fit together. | — | haiku | feat/mission-control-ops-docs | MC-06-03 | 4K | Make it usable by humans. |
| MC-06-05 | not-started | Consolidate review findings and close the mission with a release note. | — | sonnet | chore/mission-control-close | MC-06-04 | 3K | Only after the E2E passes. |

**Milestone 6 estimate:** ~26K tokens

---

## Execution Notes

- `sonnet` is best for planning, decomposition, and the review-gate tasks.
- `codex` is best for schema, coordinator, and CLI implementation.
- `haiku` is best for validation, traceability checks, and docs.
- The first implementation pass should stay file-first and keep the runtime state thin.
- The mission should not close until the PRD, board, mission manifest, and E2E harness all agree.

238
docs/plans/2026-05-06-hermes-mosaic-alignment.md
Normal file
@@ -0,0 +1,238 @@
|
|||||||
|
# Hermes-Mosaic Alignment Plan
|
||||||
|
|
||||||
|
> **For Hermes:** Use subagent-driven-development skill to implement this plan task-by-task.
|
||||||
|
|
||||||
|
**Goal:** Package Mosaic's mechanical coordination primitives as a native Hermes toolset so any Hermes profile gets mission management, task decomposition, handoff, and session continuity without depending on the Mosaic gateway or OpenClaw runtime.
|
||||||
|
|
||||||
|
**Architecture:** Extract the coordination logic from Mosaic's `packages/coord` (TypeScript, file-first) into a Hermes Python toolset that wraps the same file conventions. The Mosaic Stack repo remains the canonical upstream for the existing file formats (TASKS.md schema, mission.json schema); the handoff packet schema is defined by this toolset because Mosaic does not have one yet. Hermes implements native Python tools that read/write those same files, plus two pieces that have no Mosaic equivalent today: a handoff tool-call and churn detection that runs as turn-loop middleware.
|
||||||
|
|
||||||
|
**Tech Stack:** Python (Hermes toolset), SQLite (Hermes Kanban), JSON + Markdown (Mosaic file conventions)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Alignment Map
|
||||||
|
|
||||||
|
### What Mosaic has that Hermes needs

| Mosaic Component | What it does | Natural Hermes home | Why |
|---|---|---|---|
| `packages/coord` (mission.ts) | Mission CRUD, session tracking, milestone state | **Hermes toolset: `mission`** | Mission state is session-scoped, not gateway-scoped. Hermes sessions already have identity, process tracking, and context windows. |
| `packages/coord` (tasks-file.ts) | Parse/write TASKS.md tables | **Hermes toolset: `mission`** (same) | Hermes already reads/writes files. The TASKS.md parser is ~300 lines of pure string manipulation — trivial Python port. |
| `packages/coord` (runner.ts) | Spawn claude/codex workers with continuation prompts | **Already covered by `delegate_task`** | Hermes delegate_task already does isolated subagent spawning with restricted toolsets. The runner's "find next task and build continuation prompt" logic moves into a tool-call. |
| `packages/coord` (status.ts) | Mission health, task progress, next task | **Hermes toolset: `mission`** (same) | Status readout fits naturally as a tool-call. No gateway needed. |
| `packages/prdy` | PRD generation wizard | **Hermes skill: `prdy`** | PRD generation is a prompt + template problem, not infrastructure. A Hermes skill with templates is the right fit. |
| `plugins/mosaic-framework` | before_agent_start + subagent_spawning hooks | **Hermes system prompt injection** | Hermes already injects system context via skills and config. The framework preamble and worktree rules become standard Hermes skills loaded by the orchestrator profile. |
| `plugins/macp` | OpenClaw ACP bridge (spawn codex/claude) | **Already covered by `delegate_task` + ACP** | Hermes already has ACP support and delegate_task. The MACP bridge is redundant when running natively in Hermes. |
| Churn detection (planned) | Detect compaction loops, repeated tool calls, no progress | **Hermes middleware** | This needs to live inside Hermes's turn loop where it can observe tool-call patterns. Mosaic can't see this from outside. |
| Handoff packet (planned) | Structured context summary for session rotation | **Hermes toolset: `mission`** | Handoff is a serialization of mission + session state. Hermes owns the session, so it should own the handoff. |

### What Hermes already has that replaces Mosaic infrastructure

| Mosaic concept | Hermes equivalent | Notes |
|---|---|---|
| Gateway (NestJS) | Hermes gateway | Hermes already has a gateway with WebSocket, Discord, Telegram, CLI. No need for a second one. |
| Pi SDK agent runtime | Hermes agent loop | Hermes IS the agent runtime. OpenClaw's Pi SDK is a different runtime that Mosaic targets. |
| MACP ACP bridge | `delegate_task` + ACP tools | Same capability, already native. |
| Session identity | Hermes session IDs + process_registry | Hermes already tracks session identity, PIDs, and background processes. |
| Task execution board | Hermes Kanban | Fully functional SQLite-backed Kanban with dispatcher, triage, events, comments. |
| Worker spawning | Hermes dispatcher + cron | Kanban dispatcher + cron already handle this. |
| Context injection | Hermes skills + system prompt | Skills are loaded at session start and injected into context. Exactly what mosaic-framework plugin does. |
| File checkpoints | Hermes checkpoint_manager | Already tracks file mutations with shadow git. |

### What Mosaic keeps as its own entity

| Component | Why it stays in Mosaic |
|---|---|
| `apps/gateway` | NestJS API surface — Mosaic's web platform offering |
| `apps/web` | Next.js dashboard — Mosaic's UI offering |
| `packages/types` | Shared TS contracts for Mosaic gateway plugins |
| `packages/db` | Drizzle ORM + PG — Mosaic's data layer |
| `packages/auth` | BetterAuth — Mosaic's auth system |
| `packages/brain` | PG-backed data layer for Mosaic web app |
| `packages/queue` | Valkey task queue for Mosaic gateway |
| `plugins/discord` | OpenClaw Discord plugin |
| `plugins/telegram` | OpenClaw Telegram plugin |
| `packages/mosaic` CLI | The `mosaic` CLI — Mosaic's own command surface |

---
|
||||||
|
|
||||||
|
## Architecture: `mission` Toolset for Hermes
|
||||||
|
|
||||||
|
### New files under `/opt/hermes/tools/`

```
mission_tools.py    — Tool-call surface (mission_create, mission_status,
                      mission_next_task, mission_update_task, mission_handoff,
                      mission_resume)
mission_state.py    — State management (read/write mission.json, parse TASKS.md,
                      parse MISSION-MANIFEST.md)
mission_churn.py    — Churn detection (tool-loop counter, compaction counter,
                      progress scorer)
mission_handoff.py  — Handoff packet generation and loading
```

### Tool-calls exposed to the agent

| Tool | What it does | When the agent calls it |
|---|---|---|
| `mission_create` | Initialize mission.json + TASKS.md + MISSION-MANIFEST.md in a project dir | When starting a new mission |
| `mission_status` | Read current mission state, milestone progress, next task, active session | At session start, or when checking progress |
| `mission_next_task` | Find the next `not-started` task whose dependencies are met, return its full spec | When the agent needs work to do |
| `mission_update_task` | Update a task row status in TASKS.md | When completing or blocking a task |
| `mission_handoff` | Generate a handoff packet from current session context + mission state | Before session rotation or at session end |
| `mission_resume` | Load a handoff packet and inject it as context for the new session | At session start after rotation |

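The dependency check behind `mission_next_task` could look roughly like the sketch below. It is an illustration, not the Mosaic parser: it assumes the nine-column task table used on the mission board, a `done` status value for finished tasks, and comma-separated `depends_on` entries. The names `parse_tasks`, `next_task`, `COLUMNS`, and `NO_DEPS` are hypothetical.

```python
from typing import Optional

# Column order assumed from the mission board's task tables.
COLUMNS = ["id", "status", "description", "issue", "agent",
           "branch", "depends_on", "estimate", "notes"]
NO_DEPS = {"", "-", "\u2014"}  # placeholders treated as "no dependency"


def parse_tasks(markdown: str) -> list[dict]:
    """Turn TASKS.md table rows into dicts, skipping header and divider rows."""
    tasks = []
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue
        if line.endswith("|"):
            line = line[:-1]
        cells = [cell.strip() for cell in line[1:].split("|")]
        if len(cells) != len(COLUMNS):
            continue  # not a task row (or a cell contains a literal pipe)
        if cells[0] == "id" or set(cells[0]) <= {"-", ":"}:
            continue  # header row, divider row, or missing id
        tasks.append(dict(zip(COLUMNS, cells)))
    return tasks


def next_task(tasks: list[dict], done_status: str = "done") -> Optional[dict]:
    """Return the first not-started task whose dependencies are all done."""
    done = {t["id"] for t in tasks if t["status"] == done_status}
    for task in tasks:
        if task["status"] != "not-started":
            continue
        deps = [d.strip() for d in task["depends_on"].split(",")]
        if all(d in done for d in deps if d not in NO_DEPS):
            return task
    return None
```

Returning the first eligible row keeps table order as the implicit priority; a real implementation may want an explicit ordering rule.
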
### Toolset registration
|
||||||
|
|
||||||
|
The `mission` toolset follows the same pattern as `kanban`:
|
||||||
|
|
||||||
|
1. **Gating**: Tools are available when:
   - The profile has `mission` in its toolsets config, OR
   - A `HERMES_MISSION_DIR` env var is set (cron/dispatcher spawned workers)

2. **File conventions**: The toolset reads/writes the same file formats as Mosaic `packages/coord`:
   - `.mosaic/orchestrator/mission.json` — mission state
   - `docs/TASKS.md` — task table
   - `docs/MISSION-MANIFEST.md` — mission manifest
   - `docs/scratchpads/<id>.md` — session scratchpad

3. **Kanban bridge**: Optional sync between mission TASKS.md rows and Kanban task cards, so the dashboard sees mission tasks. The first version is one-way (TASKS.md → Kanban); bidirectional sync is deferred (see Open Questions).
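The gating rule and file conventions above can be sketched in a few lines. This is a sketch under assumptions: `mission_tools_enabled` and `mission_paths` are hypothetical helper names, the real Hermes registry hook is not shown in this plan, and an empty `HERMES_MISSION_DIR` is treated as unset.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional, Sequence

# Relative paths taken from the file conventions listed in this plan.
MISSION_FILES = {
    "state": ".mosaic/orchestrator/mission.json",
    "tasks": "docs/TASKS.md",
    "manifest": "docs/MISSION-MANIFEST.md",
}


def mission_tools_enabled(profile_toolsets: Sequence[str],
                          env: Optional[Mapping[str, str]] = None) -> bool:
    """Expose the tools if the profile opts in OR a spawner set the env var."""
    env = os.environ if env is None else env
    # Assumption: an empty HERMES_MISSION_DIR counts as "not set".
    return "mission" in profile_toolsets or bool(env.get("HERMES_MISSION_DIR"))


def mission_paths(project_dir: str) -> dict[str, str]:
    """Resolve the shared mission files for one project directory."""
    root = PurePosixPath(project_dir)
    return {name: str(root / rel) for name, rel in MISSION_FILES.items()}
```

Passing `env` explicitly keeps the check testable without mutating the process environment.
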
|
||||||
|
|
||||||
|
### Churn detection (middleware)
|
||||||
|
|
||||||
|
Churn detection lives in Hermes's turn loop, NOT as a tool-call. It observes:
|
||||||
|
|
||||||
|
- Repeated compaction events (context window pressure)
|
||||||
|
- Identical tool-call sequences (loop detection)
|
||||||
|
- No file state changes across N turns
|
||||||
|
- Repeated permission denials
|
||||||
|
|
||||||
|
When churn score exceeds threshold:
|
||||||
|
1. `mission_handoff` is called automatically
|
||||||
|
2. Session is rotated (fresh context window)
|
||||||
|
3. `mission_resume` is called in the new session
|
||||||
|
|
||||||
|
This is new infrastructure that only Hermes can provide (Mosaic runs outside the agent loop).
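One way to turn the observed signals into a single churn score is a weighted counter that resets on progress. The sketch below is illustrative only: `ChurnTracker`, `ChurnWeights`, and all weight values are hypothetical, the default threshold of 5 follows the recommendation under Open Questions, and the real turn-loop hook is not specified in this plan.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChurnWeights:
    # Hypothetical weights; the plan leaves the real values to task 3.1.
    compaction: float = 2.0
    repeated_call: float = 1.0
    no_progress_turn: float = 1.0
    permission_denied: float = 1.0


@dataclass
class ChurnTracker:
    threshold: float = 5.0
    weights: ChurnWeights = field(default_factory=ChurnWeights)
    score: float = 0.0
    _last_call: Optional[tuple] = None

    def observe_turn(self, tool_call: tuple, files_changed: bool,
                     compacted: bool = False, denied: bool = False) -> bool:
        """Update the score for one turn; return True when rotation is due."""
        if files_changed:
            self.score = 0.0  # real progress clears accumulated churn
        else:
            self.score += self.weights.no_progress_turn
        if tool_call == self._last_call:
            self.score += self.weights.repeated_call  # same call twice in a row
        if compacted:
            self.score += self.weights.compaction
        if denied:
            self.score += self.weights.permission_denied
        self._last_call = tool_call
        return self.score >= self.threshold
```

When `observe_turn` returns `True`, the middleware would run the handoff, rotate, and resume sequence described above.
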
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Implementation Tasks
|
||||||
|
|
||||||
|
### Phase 1: Core state management (Python port of coord)

| Task | Files | Estimate |
|---|---|---|
| 1.1 Port mission.json read/write to Python | `mission_state.py` | 2h |
| 1.2 Port TASKS.md parser to Python | `mission_state.py` | 2h |
| 1.3 Port MISSION-MANIFEST.md reader to Python | `mission_state.py` | 1h |
| 1.4 Implement `mission_create` tool-call | `mission_tools.py` | 1h |
| 1.5 Implement `mission_status` tool-call | `mission_tools.py` | 1h |
| 1.6 Implement `mission_next_task` tool-call | `mission_tools.py` | 1h |
| 1.7 Implement `mission_update_task` tool-call | `mission_tools.py` | 1h |
| 1.8 Register `mission` toolset in Hermes registry | `tools/registry.py` | 30m |
| 1.9 Add `mission` to orchestrator profile toolsets | `config.yaml` | 10m |
| 1.10 Write unit tests for mission_state | `tests/test_mission_state.py` | 2h |
| 1.11 Write unit tests for TASKS.md parser | `tests/test_tasks_parser.py` | 1h |

**Phase 1 estimate:** ~13h
|
||||||
|
|
||||||
|
### Phase 2: Handoff and session continuity

| Task | Files | Estimate |
|---|---|---|
| 2.1 Define handoff packet schema (JSON) | `mission_handoff.py` | 1h |
| 2.2 Implement `mission_handoff` tool-call | `mission_handoff.py`, `mission_tools.py` | 2h |
| 2.3 Implement `mission_resume` tool-call | `mission_handoff.py`, `mission_tools.py` | 2h |
| 2.4 Wire handoff into session start (auto-resume) | agent loop hook | 2h |
| 2.5 Write tests for handoff round-trip | `tests/test_mission_handoff.py` | 1h |

**Phase 2 estimate:** ~8h
|
||||||
|
|
||||||
|
### Phase 3: Churn detection

| Task | Files | Estimate |
|---|---|---|
| 3.1 Define churn signal weights and thresholds | `mission_churn.py` | 1h |
| 3.2 Implement tool-loop detector (consecutive identical calls) | `mission_churn.py` | 2h |
| 3.3 Implement compaction pressure detector | `mission_churn.py` | 1h |
| 3.4 Implement progress scorer (file state delta) | `mission_churn.py` | 2h |
| 3.5 Wire churn scoring into agent turn loop | agent loop middleware | 2h |
| 3.6 Implement auto-rotation trigger | agent loop + handoff | 2h |
| 3.7 Write tests for churn scoring | `tests/test_mission_churn.py` | 1h |

**Phase 3 estimate:** ~11h
|
||||||
|
|
||||||
|
### Phase 4: Kanban bridge + CLI surface

| Task | Files | Estimate |
|---|---|---|
| 4.1 Implement TASKS.md → Kanban sync (one-way first) | `mission_kanban_sync.py` | 2h |
| 4.2 Add `hermes mission` CLI subcommand | `mission_cli.py` | 2h |
| 4.3 Add `hermes mission status` command | `mission_cli.py` | 1h |
| 4.4 Add `hermes mission init` command | `mission_cli.py` | 1h |
| 4.5 Add `hermes mission handoff` command | `mission_cli.py` | 1h |
| 4.6 Add `hermes mission resume` command | `mission_cli.py` | 1h |

**Phase 4 estimate:** ~8h
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File Format Compatibility
|
||||||
|
|
||||||
|
The Python implementation MUST read and write the exact same file formats as Mosaic's TypeScript `packages/coord`. This means:
|
||||||
|
|
||||||
|
1. **mission.json** schema is identical to `Mission` type in `packages/coord/src/types.ts`
|
||||||
|
2. **TASKS.md** table format is identical to what `packages/coord/src/tasks-file.ts` parses
|
||||||
|
3. **MISSION-MANIFEST.md** is free-form markdown (no parser needed — just read the file)
|
||||||
|
4. **Handoff packets** are a new JSON format defined in this toolset (Mosaic doesn't have them yet)
|
||||||
|
|
||||||
|
This way a project can use Hermes mission tools OR Mosaic `mosaic coord` commands interchangeably. The files are the contract.
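Since the handoff packet is the one format with no upstream definition, a round-trip sketch may help anchor the discussion. Everything here is an assumption except the storage path, which follows the `.mosaic/handoffs/<session-id>.json` recommendation under Open Questions: the `HandoffPacket` fields, `write_handoff`, and `load_handoff` are hypothetical and the real schema is still to be defined in task 2.1.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class HandoffPacket:
    # Hypothetical fields: a compact structured summary, not a transcript.
    mission_id: str
    session_id: str
    current_task: str
    summary: str
    next_steps: list[str] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)
    schema_version: int = 1


def write_handoff(project_dir: Path, packet: HandoffPacket) -> Path:
    """Serialize the packet to .mosaic/handoffs/<session-id>.json."""
    path = project_dir / ".mosaic" / "handoffs" / f"{packet.session_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(packet), indent=2), encoding="utf-8")
    return path


def load_handoff(path: Path) -> HandoffPacket:
    """Load a packet written by write_handoff."""
    return HandoffPacket(**json.loads(path.read_text(encoding="utf-8")))
```

Carrying a `schema_version` field from the start would let either runtime reject packets it does not understand instead of resuming from a misread one.
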
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Relationship Diagram

```
 Mosaic Stack (TypeScript)          Hermes Agent (Python)
┌─────────────────────────┐        ┌─────────────────────────┐
│ packages/coord          │        │ tools/mission_tools.py  │
│ ├─ mission.ts           │◄──────►│ ├─ mission_state.py     │
│ ├─ tasks-file.ts        │  same  │ ├─ mission_handoff.py   │
│ ├─ status.ts            │ files  │ ├─ mission_churn.py     │
│ └─ runner.ts            │        │ └─ mission_tools.py     │
│                         │        │                         │
│ packages/prdy           │        │ skills/prdy/            │
│ └─ templates, wizard    │◄──────►│ └─ SKILL.md + templates │
│                         │        │                         │
│ plugins/mosaic-framework│        │ skills/ (existing)      │
│ └─ context injection    │◄──────►│ └─ kanban-orchestrator  │
│                         │        │    + mosaic-coding-*    │
│ plugins/macp            │        │ tools/delegate_task.py  │
│ └─ ACP bridge           │◄──────►│ └─ already covers this  │
│                         │        │                         │
│ (stays in Mosaic)       │        │ tools/kanban_tools.py   │
│ apps/gateway            │        │ └─ Hermes Kanban DB     │
│ apps/web                │        │                         │
│ packages/db             │        │ tools/cronjob_tools.py  │
│ packages/queue          │        │ └─ already covers cron  │
└─────────────────────────┘        └─────────────────────────┘
```

---
|
||||||
|
|
||||||
|
## Open Questions
|
||||||
|
|
||||||
|
1. **Should the `mission` toolset ship with Hermes core, or as a plugin?**
   - Recommendation: ship as a **built-in toolset** (like `kanban`) since mission coordination is a core agent capability, not an optional integration. The file formats are stable and the code is small.

2. **Should churn detection be per-profile configurable?**
   - Recommendation: yes. Add `mission.churn_threshold` and `mission.churn_weights` to profile config.yaml. Default threshold = 5 consecutive no-progress turns.

3. **Should handoff packets live in the project dir or in Hermes home?**
   - Recommendation: **project dir** (`.mosaic/handoffs/<session-id>.json`). This keeps them version-controlled and accessible regardless of which agent runtime picks up the project.

4. **Bidirectional Kanban sync?**
   - Recommendation: **one-way first** (TASKS.md → Kanban). Bidirectional adds conflict resolution complexity. Ship one-way, add reverse sync in v2 if needed.

5. **PRD generation — skill or tool-call?**
   - Recommendation: **skill** (`prdy`). PRD generation is a prompt engineering problem with templates. Skills already handle this pattern perfectly.
|
||||||
234
docs/plans/2026-05-07-coordination-resilience.md
Normal file
@@ -0,0 +1,234 @@
|
|||||||
|
# Mosaic Stack ↔ Hermes Coordination Resilience
|
||||||
|
|
||||||
|
> Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.
|
||||||
|
|
||||||
|
## SIBKISS operational summary
|
||||||
|
|
||||||
|
- mission on
|
||||||
|
- heartbeat always
|
||||||
|
- resume from packet
|
||||||
|
- block with `[BLOCKED]`
|
||||||
|
- reassign
|
||||||
|
- keep tasks tiny
|
||||||
|
- auto-heal dead workers
|
||||||
|
|
||||||
|
The design has four parts:
|
||||||
|
|
||||||
|
1. Atomic task decomposition — workers operate only within a small, explicit scope.
|
||||||
|
2. Distress signaling — workers create a standardized `[BLOCKED]` card when they encounter a blocker outside their scope.
|
||||||
|
3. Mechanical fallback — if the worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card for them.
|
||||||
|
4. Auto-heal / reassignment — stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.
|
||||||
|
|
||||||
|
## Why this exists
|
||||||
|
|
||||||
|
Observed failure modes:
|
||||||
|
|
||||||
|
- Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
|
||||||
|
- Silent failure / dead worker: the worker PID is gone, but the task is still marked `running` or `blocked`.
|
||||||
|
- Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.
|
||||||
|
|
||||||
|
The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.
|
||||||
|
|
||||||
|
## Core workflow
|
||||||
|
|
||||||
|
### 1) Atomic task boundaries
|
||||||
|
|
||||||
|
Every task should have:
|
||||||
|
|
||||||
|
- one concern
|
||||||
|
- explicit files/packages in scope
|
||||||
|
- explicit files/packages out of scope
|
||||||
|
- a maximum file count if possible
|
||||||
|
- a stated expected iteration budget
|
||||||
|
|
||||||
|
When a worker discovers work outside scope, it must stop fixing it and hand off.
|
||||||
|
|
||||||
|
### 2) Worker-authored distress card
|
||||||
|
|
||||||
|
If the worker can still report status, it creates a card like:
|
||||||
|
|
||||||
|
- Title: `[BLOCKED] t_<source_id> <blocker_type>`
|
||||||
|
- Assignee: `tuesday` / orchestrator role
|
||||||
|
- Status: `ready`
|
||||||
|
- Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action
|
||||||
|
|
||||||
|
The orchestrator receives the card, acts on it, and closes the loop.
|
||||||
|
|
||||||
|
The source task stays linked to the distress card so that the recovery trail is auditable.
|
||||||
|
|
||||||
|
### 3) Mechanical fallback for rate-limited workers
|
||||||
|
|
||||||
|
If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.
|
||||||
|
|
||||||
|
That watcher should:
|
||||||
|
|
||||||
|
- inspect running / blocked tasks
|
||||||
|
- detect repeated 429 / 503 / overload errors
|
||||||
|
- create the same standardized `[BLOCKED]` card on behalf of the worker
|
||||||
|
- link the distress card to the source task
|
||||||
|
- add a comment to the source task
|
||||||
|
- allow the dispatcher to pick up the new card immediately
|
||||||
|
|
||||||
|
This closes the gap in the worker-authored protocol: a worker that cannot phone home no longer strands its task, because the watcher files the distress card mechanically.
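The detection step of such a watcher can be sketched without any agent in the loop. This is a sketch under assumptions: `synthesize_distress_title`, the error-matching pattern, and the `min_hits` default of 3 are hypothetical, and how the watcher reads task rows and failure metadata from the board is left out.

```python
import re
from typing import Optional

# Matches the rate-limit / overload signals the watcher looks for.
RATE_LIMIT_PATTERN = re.compile(r"\b(429|503)\b|overload", re.IGNORECASE)


def synthesize_distress_title(task_id: str, status: str,
                              recent_errors: list[str],
                              min_hits: int = 3) -> Optional[str]:
    """Return a [BLOCKED] card title if the task looks rate-limited, else None."""
    if status not in ("running", "blocked"):
        return None  # only running / blocked tasks are inspected
    hits = sum(1 for err in recent_errors if RATE_LIMIT_PATTERN.search(err))
    if hits < min_hits:
        return None  # not enough repeated failures to act on
    return f"[BLOCKED] t_{task_id} rate_limited"
```

Because the function only reads task metadata, it can run from cron even when every provider the worker depends on is unavailable.
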
|
||||||
|
|
||||||
|
### 4) Auto-heal for dead workers
|
||||||
|
|
||||||
|
A separate no-agent watcher should:
|
||||||
|
|
||||||
|
- reap dead PIDs stuck in `running`
|
||||||
|
- reset crash-loops whose failures are infrastructure-related
|
||||||
|
- escalate tasks that have been reset too many times
|
||||||
|
|
||||||
|
This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.
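The reap, reset, and escalate decision can be illustrated as a pure function plus a PID probe. The sketch assumes a POSIX host (signal 0 is a liveness probe there; `os.kill` behaves differently on Windows), and `pid_alive`, `heal_action`, and the `max_resets` default of 3 are hypothetical. It also skips the check for whether a failure is infrastructure-related.

```python
import os
from typing import Callable, Optional


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def heal_action(status: str, pid: Optional[int], reset_count: int,
                max_resets: int = 3,
                is_alive: Callable[[int], bool] = pid_alive) -> str:
    """Decide what the watcher does with one task: leave, reset, or escalate."""
    if status != "running" or pid is None or is_alive(pid):
        return "leave"  # nothing stale to repair
    if reset_count >= max_resets:
        return "escalate"  # reset too many times, needs a decision-maker
    return "reset"  # dead PID stuck in running: put the task back in the queue
```

Injecting `is_alive` keeps the decision logic testable without spawning and killing real processes.
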
|
||||||
|
|
||||||
|
## Distress card contract
|
||||||
|
|
||||||
|
### Canonical title

```text
[BLOCKED] t_<source_task_id> <blocker_type>
```

### Canonical blocker types
|
||||||
|
|
||||||
|
- `scope_boundary`
|
||||||
|
- `env_blocker`
|
||||||
|
- `credential_failure`
|
||||||
|
- `dependency`
|
||||||
|
- `iteration_budget`
|
||||||
|
- `rate_limited`
|
||||||
|
|
||||||
|
### Canonical body

```markdown
## Distress Signal
- Blocked task: t_xxx
- Worker: <profile_name>
- Branch: <git_branch_name>
- Workspace: <path>
- Blocker type: <type>
- Completed: <what was done>
- Cannot touch: <out-of-scope packages/files>
- Needs: <what the orchestrator should do>
- State: committed | uncommitted | stashed(<stash_name>)

## Scope Guard
DO NOT touch: anything outside diagnosing and remediating the blocker described above
Only fix: assign, split, reassign, or unblock the source task
```

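Because workers and watchers must emit the same card, the contract is a natural fit for one shared rendering function. The sketch below assumes a card is a plain dict handed to whatever creates Kanban cards; `render_distress_card` and its parameter names are hypothetical, while the title format, blocker types, default assignee, and body template come from this document.

```python
BLOCKER_TYPES = {"scope_boundary", "env_blocker", "credential_failure",
                 "dependency", "iteration_budget", "rate_limited"}


def render_distress_card(source_id: str, blocker_type: str, worker: str,
                         branch: str, workspace: str, completed: str,
                         cannot_touch: str, needs: str,
                         state: str = "uncommitted",
                         assignee: str = "tuesday") -> dict:
    """Build the standardized [BLOCKED] card for one source task."""
    if blocker_type not in BLOCKER_TYPES:
        raise ValueError(f"unknown blocker type: {blocker_type!r}")
    body = "\n".join([
        "## Distress Signal",
        f"- Blocked task: t_{source_id}",
        f"- Worker: {worker}",
        f"- Branch: {branch}",
        f"- Workspace: {workspace}",
        f"- Blocker type: {blocker_type}",
        f"- Completed: {completed}",
        f"- Cannot touch: {cannot_touch}",
        f"- Needs: {needs}",
        f"- State: {state}",
        "",
        "## Scope Guard",
        "DO NOT touch: anything outside diagnosing and remediating "
        "the blocker described above",
        "Only fix: assign, split, reassign, or unblock the source task",
    ])
    return {
        "title": f"[BLOCKED] t_{source_id} {blocker_type}",
        "assignee": assignee,
        "status": "ready",
        "body": body,
    }
```

Rejecting unknown blocker types at render time keeps the title prefix and type vocabulary reliable for routing.
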
## Routing rules
|
||||||
|
|
||||||
|
### Distress card routing
|
||||||
|
|
||||||
|
- `[BLOCKED]` title prefix should bypass normal triage.
|
||||||
|
- The card should go directly to the orchestration profile.
|
||||||
|
- The orchestrator should start from a clean session each time.
|
||||||
|
|
||||||
|
### Rate-limit fallback
|
||||||
|
|
||||||
|
When the source task is rate-limited:
|
||||||
|
|
||||||
|
- do not keep retrying in the worker
|
||||||
|
- let the watcher synthesize the distress card
|
||||||
|
- have the orchestrator reassign the source task to a different profile/provider combo
|
||||||
|
|
||||||
|
### Provider fallback principle
|
||||||
|
|
||||||
|
Never reassign rate-limited work back to the same provider if the failure was provider pressure. Use a different provider when possible.
|
||||||
|
|
||||||
|
### Suggested fallback order
|
||||||
|
|
||||||
|
1. Keep the current task body and scope guards intact.
|
||||||
|
2. Reassign to a different profile on a different provider.
|
||||||
|
3. If that is impossible, reassign to a different profile on the same provider only for non-rate-limit blockers.
|
||||||
|
4. If repeated failures continue, split the task into a narrower atomic card.
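The fallback order can be expressed as a small selection function. This is a sketch under assumptions: candidates are modelled as `(profile, provider)` pairs, `pick_fallback` is a hypothetical name, and the provider labels in the usage note are placeholders rather than real configuration.

```python
from typing import Optional

Target = tuple[str, str]  # (profile, provider)


def pick_fallback(current: Target, candidates: list[Target],
                  blocker_type: str) -> Optional[Target]:
    """Choose a reassignment target following the suggested fallback order."""
    cur_profile, cur_provider = current
    others = [c for c in candidates if c[0] != cur_profile]
    # Step 2: prefer a different profile on a different provider.
    for profile, provider in others:
        if provider != cur_provider:
            return (profile, provider)
    # Step 3: same provider is acceptable only for non-rate-limit blockers.
    if blocker_type != "rate_limited" and others:
        return others[0]
    # Step 4: no safe target, so the caller should split the task instead.
    return None
```

For example, with candidates `[("sonnet", "provider-a"), ("haiku", "provider-a"), ("codex", "provider-b")]` and a rate-limited `sonnet` task, the function selects `("codex", "provider-b")`. The task body and scope guards (step 1) are untouched because only the assignment changes.
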
|
||||||
|
|
||||||
|
## Related recovery docs
|
||||||
|
|
||||||
|
- Mission packet recovery contract: `/opt/hermes/docs/mission-toolset-heartbeat.md`
|
||||||
|
- Hermes mission implementation plan: `/opt/hermes/docs/plans/mission-toolset-implementation.md`
|
||||||
|
- The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.
|
||||||
|
- New-session trigger: when a profile config changes, start a fresh session or `/reset` so the updated toolset is actually loaded.
|
||||||
|
|
||||||
|
## Watchers to implement
|
||||||
|
|
||||||
|
### Auto-heal watcher
|
||||||
|
|
||||||
|
Responsibilities:
|
||||||
|
|
||||||
|
- reap stale workers
|
||||||
|
- reset dead-PID crash loops
|
||||||
|
- track reset counts
|
||||||
|
- escalate after repeated resets
|
||||||
|
|
||||||
|
### Distress synthesizer watcher
|
||||||
|
|
||||||
|
Responsibilities:
|
||||||
|
|
||||||
|
- detect rate-limited / stuck workers
|
||||||
|
- create `[BLOCKED]` cards mechanically
|
||||||
|
- link the card to the source task
|
||||||
|
- leave a comment for traceability
|
||||||
|
|
||||||
|
### Iteration-budget watcher
|
||||||
|
|
||||||
|
Responsibilities:
|
||||||
|
|
||||||
|
- detect long-running tasks and repeated failure patterns
|
||||||
|
- recommend splits when a task is clearly over-scoped
|
||||||
|
- report tasks that need human review after multiple resets
|
||||||
|
|
||||||
|
## Operational principle
|
||||||
|
|
||||||
|
If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.
|
||||||
|
|
||||||
|
This is what makes the system robust across compaction, rate limits, and dead workers.
|
||||||
|
|
||||||
|
## Suggested implementation order
|
||||||
|
|
||||||
|
1. Atomic task metadata in task bodies
|
||||||
|
2. Worker-authored distress card protocol
|
||||||
|
3. Mechanical distress synthesizer watcher
|
||||||
|
4. Auto-heal watcher for dead workers
|
||||||
|
5. Orchestrator routing rules for `[BLOCKED]`
|
||||||
|
6. Rate-limit fallback / model reassignment table
|
||||||
|
|
||||||
|
## Where this fits in Hermes
|
||||||
|
|
||||||
|
- Kanban = durable work graph and status engine
|
||||||
|
- Watchers = mechanical healing and distress synthesis
|
||||||
|
- Orchestrator = split / reassign / unblock decision-maker
|
||||||
|
- Workers = execution inside atomic task boundaries
|
||||||
|
|
||||||
|
## Where this fits in Mosaic Stack
|
||||||
|
|
||||||
|
- PRD / coordination infra should encode the same patterns
|
||||||
|
- Mosaic can use the same distress-card contract and watcher logic
|
||||||
|
- The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue
|
||||||
|
|
||||||
|
## Cross-project takeaway
|
||||||
|
|
||||||
|
The important pattern is not the specific tool names. It is the mechanical feedback loop:
|
||||||
|
|
||||||
|
- detect failure without requiring the failing worker to succeed
|
||||||
|
- create a standardized help artifact
|
||||||
|
- route that artifact to a fresh orchestrator context
|
||||||
|
- repair the assignment graph
|
||||||
|
- continue the mission
|
||||||
|
|
||||||
|
That pattern is reusable anywhere.
|
||||||