docs: add mission control and coordination resilience docs
238
docs/plans/2026-05-06-hermes-mosaic-alignment.md
Normal file
@@ -0,0 +1,238 @@
|
||||
# Hermes-Mosaic Alignment Plan
|
||||
|
||||
> **For Hermes:** Use subagent-driven-development skill to implement this plan task-by-task.
|
||||
|
||||
**Goal:** Package Mosaic's mechanical coordination primitives as a native Hermes toolset so any Hermes profile gets mission management, task decomposition, handoff, and session continuity without depending on the Mosaic gateway or OpenClaw runtime.
|
||||
|
||||
**Architecture:** Extract the coordination logic from Mosaic's `packages/coord` (TypeScript, file-first) into a Hermes Python toolset that wraps the same file conventions. The Mosaic Stack repo remains the canonical upstream for the file formats (TASKS.md schema, mission.json schema, handoff packet schema). Hermes implements native Python tools that read/write those same files, plus handoff tool-calls and turn-loop churn detection that have no Mosaic equivalent today.
|
||||
|
||||
**Tech Stack:** Python (Hermes toolset), SQLite (Hermes Kanban), JSON + Markdown (Mosaic file conventions)
|
||||
|
||||
---
|
||||
|
||||
## Alignment Map
|
||||
|
||||
### What Mosaic has that Hermes needs
|
||||
|
||||
| Mosaic Component | What it does | Natural Hermes home | Why |
|---|---|---|---|
| `packages/coord` (mission.ts) | Mission CRUD, session tracking, milestone state | **Hermes toolset: `mission`** | Mission state is session-scoped, not gateway-scoped. Hermes sessions already have identity, process tracking, and context windows. |
| `packages/coord` (tasks-file.ts) | Parse/write TASKS.md tables | **Hermes toolset: `mission`** (same) | Hermes already reads/writes files. The TASKS.md parser is ~300 lines of pure string manipulation, a trivial Python port. |
| `packages/coord` (runner.ts) | Spawn claude/codex workers with continuation prompts | **Already covered by `delegate_task`** | Hermes delegate_task already does isolated subagent spawning with restricted toolsets. The runner's "find next task and build continuation prompt" logic moves into a tool-call. |
| `packages/coord` (status.ts) | Mission health, task progress, next task | **Hermes toolset: `mission`** (same) | Status readout fits naturally as a tool-call. No gateway needed. |
| `packages/prdy` | PRD generation wizard | **Hermes skill: `prdy`** | PRD generation is a prompt + template problem, not infrastructure. A Hermes skill with templates is the right fit. |
| `plugins/mosaic-framework` | before_agent_start + subagent_spawning hooks | **Hermes system prompt injection** | Hermes already injects system context via skills and config. The framework preamble and worktree rules become standard Hermes skills loaded by the orchestrator profile. |
| `plugins/macp` | OpenClaw ACP bridge (spawn codex/claude) | **Already covered by `delegate_task` + ACP** | Hermes already has ACP support and delegate_task. The MACP bridge is redundant when running natively in Hermes. |
| Churn detection (planned) | Detect compaction loops, repeated tool calls, no progress | **Hermes middleware** | This needs to live inside Hermes's turn loop where it can observe tool-call patterns. Mosaic can't see this from outside. |
| Handoff packet (planned) | Structured context summary for session rotation | **Hermes toolset: `mission`** | Handoff is a serialization of mission + session state. Hermes owns the session, so it should own the handoff. |
|
||||
|
||||
### What Hermes already has that replaces Mosaic infrastructure
|
||||
|
||||
| Mosaic concept | Hermes equivalent | Notes |
|---|---|---|
| Gateway (NestJS) | Hermes gateway | Hermes already has a gateway with WebSocket, Discord, Telegram, CLI. No need for a second one. |
| Pi SDK agent runtime | Hermes agent loop | Hermes IS the agent runtime. OpenClaw's Pi SDK is a different runtime that Mosaic targets. |
| MACP ACP bridge | `delegate_task` + ACP tools | Same capability, already native. |
| Session identity | Hermes session IDs + process_registry | Hermes already tracks session identity, PIDs, and background processes. |
| Task execution board | Hermes Kanban | Fully functional SQLite-backed Kanban with dispatcher, triage, events, comments. |
| Worker spawning | Hermes dispatcher + cron | Kanban dispatcher + cron already handle this. |
| Context injection | Hermes skills + system prompt | Skills are loaded at session start and injected into context. Exactly what mosaic-framework plugin does. |
| File checkpoints | Hermes checkpoint_manager | Already tracks file mutations with shadow git. |
|
||||
|
||||
### What Mosaic keeps as its own entity
|
||||
|
||||
| Component | Why it stays in Mosaic |
|---|---|
| `apps/gateway` | NestJS API surface: Mosaic's web platform offering |
| `apps/web` | Next.js dashboard: Mosaic's UI offering |
| `packages/types` | Shared TS contracts for Mosaic gateway plugins |
| `packages/db` | Drizzle ORM + PG: Mosaic's data layer |
| `packages/auth` | BetterAuth: Mosaic's auth system |
| `packages/brain` | PG-backed data layer for Mosaic web app |
| `packages/queue` | Valkey task queue for Mosaic gateway |
| `plugins/discord` | OpenClaw Discord plugin |
| `plugins/telegram` | OpenClaw Telegram plugin |
| `packages/mosaic` CLI | The `mosaic` CLI: Mosaic's own command surface |
|
||||
|
||||
---
|
||||
|
||||
## Architecture: `mission` Toolset for Hermes
|
||||
|
||||
### New files under `/opt/hermes/tools/`
|
||||
|
||||
```
mission_tools.py   - Tool-call surface (mission_create, mission_status,
                     mission_next_task, mission_update_task, mission_handoff,
                     mission_resume)
mission_state.py   - State management (read/write mission.json, parse TASKS.md,
                     parse MISSION-MANIFEST.md)
mission_churn.py   - Churn detection (tool-loop counter, compaction counter,
                     progress scorer)
mission_handoff.py - Handoff packet generation and loading
```
|
||||
|
||||
### Tool-calls exposed to the agent
|
||||
|
||||
| Tool | What it does | When the agent calls it |
|---|---|---|
| `mission_create` | Initialize mission.json + TASKS.md + MISSION-MANIFEST.md in a project dir | When starting a new mission |
| `mission_status` | Read current mission state, milestone progress, next task, active session | At session start, or when checking progress |
| `mission_next_task` | Find the next `not-started` task whose dependencies are met, return its full spec | When the agent needs work to do |
| `mission_update_task` | Update a task row status in TASKS.md | When completing or blocking a task |
| `mission_handoff` | Generate a handoff packet from current session context + mission state | Before session rotation or at session end |
| `mission_resume` | Load a handoff packet and inject it as context for the new session | At session start after rotation |
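
The dependency check behind `mission_next_task` can be sketched in a few lines. This is a minimal illustration, not the `packages/coord` logic: the `TaskRow` fields and the `done` status value are assumptions (only `not-started` appears in this plan), and the real TASKS.md schema remains whatever Mosaic defines.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskRow:
    """One parsed TASKS.md row. Field names are hypothetical, not the Mosaic schema."""
    task_id: str
    status: str
    depends_on: List[str] = field(default_factory=list)


def next_task(rows: List[TaskRow]) -> Optional[TaskRow]:
    """Return the first `not-started` row whose dependencies are all done."""
    done = {row.task_id for row in rows if row.status == "done"}  # "done" is assumed
    for row in rows:
        if row.status == "not-started" and all(dep in done for dep in row.depends_on):
            return row
    return None


rows = [
    TaskRow("T1", "done"),
    TaskRow("T2", "not-started", depends_on=["T3"]),
    TaskRow("T3", "not-started", depends_on=["T1"]),
]
print(next_task(rows).task_id)  # T3: T2 is skipped because T3 is not done yet
```

Scanning in table order keeps the selection deterministic, so two runtimes reading the same file would pick the same task.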
|
||||
|
||||
### Toolset registration
|
||||
|
||||
The `mission` toolset follows the same pattern as `kanban`:
|
||||
|
||||
1. **Gating**: Tools are available when:
   - The profile has `mission` in its toolsets config, OR
   - A `HERMES_MISSION_DIR` env var is set (cron/dispatcher spawned workers)

2. **File conventions**: The toolset reads/writes the same file formats as Mosaic `packages/coord`:
   - `.mosaic/orchestrator/mission.json`: mission state
   - `docs/TASKS.md`: task table
   - `docs/MISSION-MANIFEST.md`: mission manifest
   - `docs/scratchpads/<id>.md`: session scratchpad

3. **Kanban bridge**: Optional sync between mission TASKS.md rows and Kanban task cards, so the dashboard sees mission tasks. The sync starts one-way (TASKS.md → Kanban); bidirectional sync is deferred (see Open Questions).
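
The gating rule and file conventions above could look roughly like this in Python. The function names are hypothetical, and treating an empty `HERMES_MISSION_DIR` as unset is an assumption; how Hermes actually registers toolsets is not shown here.

```python
import os
from pathlib import Path
from typing import Dict, Iterable, Mapping, Optional


def mission_tools_enabled(profile_toolsets: Iterable[str],
                          env: Optional[Mapping[str, str]] = None) -> bool:
    """True when the profile lists `mission` or HERMES_MISSION_DIR is set."""
    env = os.environ if env is None else env
    # Assumption: an empty HERMES_MISSION_DIR counts as "not set".
    return "mission" in set(profile_toolsets) or bool(env.get("HERMES_MISSION_DIR"))


def mission_paths(project_dir: str, scratchpad_id: str) -> Dict[str, Path]:
    """Resolve the file conventions listed above relative to a project dir."""
    root = Path(project_dir)
    return {
        "mission": root / ".mosaic" / "orchestrator" / "mission.json",
        "tasks": root / "docs" / "TASKS.md",
        "manifest": root / "docs" / "MISSION-MANIFEST.md",
        "scratchpad": root / "docs" / "scratchpads" / f"{scratchpad_id}.md",
    }
```

Passing `env` explicitly keeps the gate testable without mutating the process environment.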
|
||||
|
||||
### Churn detection (middleware)
|
||||
|
||||
Churn detection lives in Hermes's turn loop, NOT as a tool-call. It observes:
|
||||
|
||||
- Repeated compaction events (context window pressure)
- Identical tool-call sequences (loop detection)
- No file state changes across N turns
- Repeated permission denials

When the churn score exceeds the threshold:

1. `mission_handoff` is called automatically
2. Session is rotated (fresh context window)
3. `mission_resume` is called in the new session
|
||||
|
||||
This is new infrastructure that only Hermes can provide (Mosaic runs outside the agent loop).
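
A weighted churn score over the four signals might be sketched as follows. The weights, the threshold, and every name here are placeholders (the plan defers the real values to task 3.1); the sketch only shows the shape of the computation.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical weights and threshold; the plan leaves both to task 3.1.
WEIGHTS = {"compactions": 2.0, "repeated_calls": 1.0, "idle_turns": 1.0, "denials": 1.5}
THRESHOLD = 5.0


@dataclass
class ChurnSignals:
    compactions: int = 0      # repeated compaction events
    repeated_calls: int = 0   # consecutive identical tool calls
    idle_turns: int = 0       # turns with no file state change
    denials: int = 0          # repeated permission denials


def trailing_repeats(calls: Sequence[str]) -> int:
    """Count how many times the latest call consecutively repeats its predecessor."""
    repeats = 0
    for prev, cur in zip(reversed(calls[:-1]), reversed(calls[1:])):
        if prev != cur:
            break
        repeats += 1
    return repeats


def churn_score(signals: ChurnSignals) -> float:
    """Weighted sum of the observed signals."""
    return sum(weight * getattr(signals, name) for name, weight in WEIGHTS.items())


def should_rotate(signals: ChurnSignals) -> bool:
    """True when the score strictly exceeds the threshold."""
    return churn_score(signals) > THRESHOLD
```

In a turn loop, `repeated_calls` would be refreshed each turn from something like `trailing_repeats(recent_call_signatures)`, and a true `should_rotate` would trigger the handoff, rotate, resume sequence listed above.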
|
||||
|
||||
---
|
||||
|
||||
## Implementation Tasks
|
||||
|
||||
### Phase 1: Core state management (Python port of coord)
|
||||
|
||||
| Task | Files | Estimate |
|---|---|---|
| 1.1 Port mission.json read/write to Python | `mission_state.py` | 2h |
| 1.2 Port TASKS.md parser to Python | `mission_state.py` | 2h |
| 1.3 Port MISSION-MANIFEST.md reader to Python | `mission_state.py` | 1h |
| 1.4 Implement `mission_create` tool-call | `mission_tools.py` | 1h |
| 1.5 Implement `mission_status` tool-call | `mission_tools.py` | 1h |
| 1.6 Implement `mission_next_task` tool-call | `mission_tools.py` | 1h |
| 1.7 Implement `mission_update_task` tool-call | `mission_tools.py` | 1h |
| 1.8 Register `mission` toolset in Hermes registry | `tools/registry.py` | 30m |
| 1.9 Add `mission` to orchestrator profile toolsets | `config.yaml` | 10m |
| 1.10 Write unit tests for mission_state | `tests/test_mission_state.py` | 2h |
| 1.11 Write unit tests for TASKS.md parser | `tests/test_tasks_parser.py` | 1h |
|
||||
|
||||
**Phase 1 estimate:** ~13h
|
||||
|
||||
### Phase 2: Handoff and session continuity
|
||||
|
||||
| Task | Files | Estimate |
|---|---|---|
| 2.1 Define handoff packet schema (JSON) | `mission_handoff.py` | 1h |
| 2.2 Implement `mission_handoff` tool-call | `mission_handoff.py`, `mission_tools.py` | 2h |
| 2.3 Implement `mission_resume` tool-call | `mission_handoff.py`, `mission_tools.py` | 2h |
| 2.4 Wire handoff into session start (auto-resume) | agent loop hook | 2h |
| 2.5 Write tests for handoff round-trip | `tests/test_mission_handoff.py` | 1h |
|
||||
|
||||
**Phase 2 estimate:** ~8h
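
The handoff round-trip that task 2.5 would test can be sketched as below. Every packet field here (`version`, `summary`, `next_task_id`, and so on) is hypothetical, since task 2.1 has not defined the schema yet; the storage path follows the `.mosaic/handoffs/<session-id>.json` recommendation under Open Questions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def write_handoff(project_dir: str, session_id: str, summary: str,
                  next_task_id: Optional[str]) -> Path:
    """Write a handoff packet; all field names are placeholders, not a schema."""
    packet = {
        "version": 1,
        "session_id": session_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "summary": summary,
        "next_task_id": next_task_id,
    }
    path = Path(project_dir) / ".mosaic" / "handoffs" / f"{session_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps(packet, indent=2), encoding="utf-8")
    tmp.replace(path)  # rename into place so readers never see a half-written packet
    return path


def load_latest_handoff(project_dir: str) -> Optional[dict]:
    """Return the most recently created packet, or None if there is none."""
    handoffs = Path(project_dir) / ".mosaic" / "handoffs"
    packets = [json.loads(p.read_text(encoding="utf-8")) for p in handoffs.glob("*.json")]
    if not packets:
        return None
    return max(packets, key=lambda p: datetime.fromisoformat(p["created_at"]))
```

Writing to a temporary file and renaming is a design choice for the sketch: a watcher or a second session that reads the directory mid-write would otherwise risk parsing a truncated packet.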
|
||||
|
||||
### Phase 3: Churn detection
|
||||
|
||||
| Task | Files | Estimate |
|---|---|---|
| 3.1 Define churn signal weights and thresholds | `mission_churn.py` | 1h |
| 3.2 Implement tool-loop detector (consecutive identical calls) | `mission_churn.py` | 2h |
| 3.3 Implement compaction pressure detector | `mission_churn.py` | 1h |
| 3.4 Implement progress scorer (file state delta) | `mission_churn.py` | 2h |
| 3.5 Wire churn scoring into agent turn loop | agent loop middleware | 2h |
| 3.6 Implement auto-rotation trigger | agent loop + handoff | 2h |
| 3.7 Write tests for churn scoring | `tests/test_mission_churn.py` | 1h |
|
||||
|
||||
**Phase 3 estimate:** ~11h
|
||||
|
||||
### Phase 4: Kanban bridge + CLI surface
|
||||
|
||||
| Task | Files | Estimate |
|---|---|---|
| 4.1 Implement TASKS.md → Kanban sync (one-way first) | `mission_kanban_sync.py` | 2h |
| 4.2 Add `hermes mission` CLI subcommand | `mission_cli.py` | 2h |
| 4.3 Add `hermes mission status` command | `mission_cli.py` | 1h |
| 4.4 Add `hermes mission init` command | `mission_cli.py` | 1h |
| 4.5 Add `hermes mission handoff` command | `mission_cli.py` | 1h |
| 4.6 Add `hermes mission resume` command | `mission_cli.py` | 1h |
|
||||
|
||||
**Phase 4 estimate:** ~8h
|
||||
|
||||
---
|
||||
|
||||
## File Format Compatibility
|
||||
|
||||
The Python implementation MUST read and write the exact same file formats as Mosaic's TypeScript `packages/coord`. This means:
|
||||
|
||||
1. **mission.json** schema is identical to `Mission` type in `packages/coord/src/types.ts`
2. **TASKS.md** table format is identical to what `packages/coord/src/tasks-file.ts` parses
3. **MISSION-MANIFEST.md** is free-form markdown (no parser needed; just read the file)
4. **Handoff packets** are a new JSON format defined in this toolset (Mosaic doesn't have them yet)
|
||||
|
||||
This way a project can use Hermes mission tools OR Mosaic `mosaic coord` commands interchangeably. The files are the contract.
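
To illustrate what "the files are the contract" implies for `mission_update_task`, here is a minimal sketch of a status update that rewrites only the matching row and leaves everything else alone. The `id` and `status` column names are assumptions for illustration; the real layout is whatever `packages/coord/src/tasks-file.ts` parses, and a faithful port would also preserve the original cell padding.

```python
from typing import List, Optional


def split_row(line: str) -> List[str]:
    """Split a Markdown table row into trimmed cells (escaped pipes are not handled)."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def update_status(markdown: str, task_id: str, new_status: str) -> str:
    """Set the status cell of the row whose `id` cell equals task_id."""
    lines = markdown.splitlines()
    header: Optional[List[str]] = None
    for i, line in enumerate(lines):
        if not line.lstrip().startswith("|"):
            continue  # prose outside the table is passed through untouched
        cells = split_row(line)
        if header is None:
            header = [cell.lower() for cell in cells]  # first table row is the header
            continue
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # the |---|---| separator row
        if dict(zip(header, cells)).get("id") == task_id:
            cells[header.index("status")] = new_status
            lines[i] = "| " + " | ".join(cells) + " |"
    return "\n".join(lines) + "\n"


doc = (
    "| id | status | description |\n"
    "|---|---|---|\n"
    "| T1 | not-started | Port parser |\n"
    "| T2 | not-started | Write tests |\n"
)
print(update_status(doc, "T1", "done"))
```

Leaving non-matching lines byte-for-byte unchanged is what would let the TypeScript and Python implementations take turns on the same file without producing spurious diffs.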
|
||||
|
||||
---
|
||||
|
||||
## Relationship Diagram
|
||||
|
||||
```
Mosaic Stack (TypeScript)          Hermes Agent (Python)
┌─────────────────────────┐        ┌─────────────────────────┐
│ packages/coord          │        │ tools/mission_tools.py  │
│ ├─ mission.ts           │◄──────►│ ├─ mission_state.py     │
│ ├─ tasks-file.ts        │  same  │ ├─ mission_handoff.py   │
│ ├─ status.ts            │  files │ ├─ mission_churn.py     │
│ └─ runner.ts            │        │ └─ mission_tools.py     │
│                         │        │                         │
│ packages/prdy           │        │ skills/prdy/            │
│ └─ templates, wizard    │◄──────►│ └─ SKILL.md + templates │
│                         │        │                         │
│ plugins/mosaic-framework│        │ skills/ (existing)      │
│ └─ context injection    │◄──────►│ └─ kanban-orchestrator  │
│                         │        │    + mosaic-coding-*    │
│ plugins/macp            │        │ tools/delegate_task.py  │
│ └─ ACP bridge           │◄──────►│ └─ already covers this  │
│                         │        │                         │
│ (stays in Mosaic)       │        │ tools/kanban_tools.py   │
│ apps/gateway            │        │ └─ Hermes Kanban DB     │
│ apps/web                │        │                         │
│ packages/db             │        │ tools/cronjob_tools.py  │
│ packages/queue          │        │ └─ already covers cron  │
└─────────────────────────┘        └─────────────────────────┘
```
|
||||
|
||||
---
|
||||
|
||||
## Open Questions
|
||||
|
||||
1. **Should the `mission` toolset ship with Hermes core, or as a plugin?**
   - Recommendation: ship as a **built-in toolset** (like `kanban`) since mission coordination is a core agent capability, not an optional integration. The file formats are stable and the code is small.

2. **Should churn detection be per-profile configurable?**
   - Recommendation: yes. Add `mission.churn_threshold` and `mission.churn_weights` to profile config.yaml. Default threshold = 5 consecutive no-progress turns.

3. **Should handoff packets live in the project dir or in Hermes home?**
   - Recommendation: **project dir** (`.mosaic/handoffs/<session-id>.json`). This keeps them version-controlled and accessible regardless of which agent runtime picks up the project.

4. **Bidirectional Kanban sync?**
   - Recommendation: **one-way first** (TASKS.md → Kanban). Bidirectional adds conflict resolution complexity. Ship one-way, add reverse sync in v2 if needed.

5. **PRD generation: skill or tool-call?**
   - Recommendation: **skill** (`prdy`). PRD generation is a prompt engineering problem with templates. Skills already handle this pattern well.
|
||||
233
docs/plans/2026-05-07-coordination-resilience.md
Normal file
@@ -0,0 +1,233 @@
|
||||
# Mosaic Stack ↔ Hermes Coordination Resilience
|
||||
|
||||
> Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.
|
||||
|
||||
## Summary
|
||||
|
||||
The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.
|
||||
|
||||
## SIBKISS operational summary
|
||||
|
||||
- mission on
- heartbeat always
- resume from packet
- block with `[BLOCKED]`
- reassign
- keep tasks tiny
- auto-heal dead workers
|
||||
|
||||
The design has four parts:
|
||||
|
||||
1. Atomic task decomposition: workers operate only within a small, explicit scope.
2. Distress signaling: workers create a standardized `[BLOCKED]` card when they encounter a blocker outside their scope.
3. Mechanical fallback: if the worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card for them.
4. Auto-heal / reassignment: stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.
|
||||
|
||||
## Why this exists
|
||||
|
||||
Observed failure modes:
|
||||
|
||||
- Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
- Silent failure / dead worker: the worker PID is gone, but the task is still marked `running` or `blocked`.
- Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.
|
||||
|
||||
The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.
|
||||
|
||||
## Core workflow
|
||||
|
||||
### 1) Atomic task boundaries
|
||||
|
||||
Every task should have:
|
||||
|
||||
- one concern
- explicit files/packages in scope
- explicit files/packages out of scope
- a maximum file count if possible
- a stated expected iteration budget
|
||||
|
||||
When a worker discovers work outside scope, it must stop fixing it and hand off.
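
One way to make those boundaries machine-checkable is to carry them as structured metadata and test each touched path against them. The `TaskScope` shape and the glob-pattern convention below are assumptions for illustration; the plan only says the metadata lives in the task body.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TaskScope:
    """Hypothetical atomic-task metadata mirroring the list above."""
    concern: str
    in_scope: Tuple[str, ...]            # glob patterns the worker may touch
    out_of_scope: Tuple[str, ...] = ()   # glob patterns that are explicitly off limits
    max_files: Optional[int] = None
    iteration_budget: Optional[int] = None  # carried as metadata, not checked here


def scope_violations(scope: TaskScope, touched: Sequence[str]) -> List[str]:
    """Return a reason for every touched path that breaks the scope."""
    problems = []
    for path in touched:
        if any(fnmatch(path, pattern) for pattern in scope.out_of_scope):
            problems.append(f"out of scope: {path}")
        elif not any(fnmatch(path, pattern) for pattern in scope.in_scope):
            problems.append(f"not in scope: {path}")
    if scope.max_files is not None and len(touched) > scope.max_files:
        problems.append(f"touched {len(touched)} files, limit is {scope.max_files}")
    return problems


scope = TaskScope(
    concern="port TASKS.md parser",
    in_scope=("tools/mission_state.py", "tests/*"),
    out_of_scope=("tools/registry.py",),
    max_files=2,
)
```

A non-empty result would be the worker's cue to stop and file a `scope_boundary` distress card rather than continue.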
|
||||
|
||||
### 2) Worker-authored distress card
|
||||
|
||||
If the worker can still report status, it creates a card like:
|
||||
|
||||
- Title: `[BLOCKED] t_<source_id> <blocker_type>`
- Assignee: `tuesday` / orchestrator role
- Status: `ready`
- Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action
|
||||
|
||||
The orchestrator receives the card, acts on it, and closes the loop.
|
||||
|
||||
The source task stays linked to the distress card so the recovery trail is auditable.
|
||||
|
||||
### 3) Mechanical fallback for rate-limited workers
|
||||
|
||||
If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.
|
||||
|
||||
That watcher should:
|
||||
|
||||
- inspect running / blocked tasks
- detect repeated 429 / 503 / overload errors
- create the same standardized `[BLOCKED]` card on behalf of the worker
- link the distress card to the source task
- add a comment to the source task
- allow the dispatcher to pick up the new card immediately
|
||||
|
||||
This closes the gap in the worker-authored protocol: a worker that cannot phone home no longer has to, because the watcher files the card mechanically.
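
A no-agent synthesizer pass could look like the sketch below. The task dictionary shape (`id`, `status`, `recent_error_codes`), the `min_errors` cutoff, and the `orchestrator` assignee name are all assumptions; the sketch covers only the 429 / 503 case and does not try to recognise provider-specific overload errors.

```python
from typing import Dict, List, Set

RATE_LIMIT_CODES = {429, 503}


def is_rate_limited(task: Dict, min_errors: int = 3) -> bool:
    """True for a running/blocked task with enough recent 429/503 failures."""
    if task["status"] not in {"running", "blocked"}:
        return False
    hits = sum(1 for code in task["recent_error_codes"] if code in RATE_LIMIT_CODES)
    return hits >= min_errors


def synthesize_distress_cards(tasks: List[Dict], existing_titles: Set[str],
                              min_errors: int = 3) -> List[Dict]:
    """Build [BLOCKED] cards on behalf of workers that cannot file their own."""
    cards = []
    for task in tasks:
        title = f"[BLOCKED] t_{task['id']} rate_limited"
        if title in existing_titles or not is_rate_limited(task, min_errors):
            continue  # skip healthy tasks and tasks that already have a card
        cards.append({
            "title": title,
            "assignee": "orchestrator",  # placeholder for the orchestrator role
            "status": "ready",
            "source_task": task["id"],   # keeps the card linked to its source
            "comment": f"Distress card synthesized by watcher for t_{task['id']}",
        })
    return cards


tasks = [
    {"id": "101", "status": "running", "recent_error_codes": [429, 429, 503]},
    {"id": "102", "status": "running", "recent_error_codes": [500]},
    {"id": "103", "status": "done", "recent_error_codes": [429, 429, 429]},
]
```

Checking `existing_titles` makes the pass idempotent, which matters for a cron-style watcher that will see the same stuck task on every run.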
|
||||
|
||||
### 4) Auto-heal for dead workers
|
||||
|
||||
A separate no-agent watcher should:
|
||||
|
||||
- reap dead PIDs stuck in `running`
- reset crash-loops whose failures are infrastructure-related
- escalate tasks that have been reset too many times
|
||||
|
||||
This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.
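
The reap, reset, and escalate steps can be sketched as a single pass over the board. The task fields (`pid`, `resets`), the `max_resets` limit, and the target statuses are assumptions, and the sketch does not distinguish infrastructure failures from other crashes. The liveness probe uses `os.kill(pid, 0)`, which is a POSIX idiom and should not be used on Windows.

```python
import os
from typing import Callable, Dict, List, Tuple


def pid_alive(pid: int) -> bool:
    """POSIX-only liveness probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def heal(tasks: List[Dict], max_resets: int = 3,
         is_alive: Callable[[int], bool] = pid_alive) -> List[Tuple[str, str]]:
    """Reset tasks whose worker PID is gone; escalate after too many resets."""
    actions = []
    for task in tasks:
        if task["status"] != "running" or is_alive(task["pid"]):
            continue
        if task["resets"] >= max_resets:
            task["status"] = "blocked"  # hypothetical escalation state
            actions.append((task["id"], "escalate"))
        else:
            task["status"] = "ready"    # back into the dispatch queue
            task["resets"] += 1
            actions.append((task["id"], "reset"))
    return actions
```

Injecting `is_alive` keeps the watcher logic testable without spawning real processes.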
|
||||
|
||||
## Distress card contract
|
||||
|
||||
### Canonical title
|
||||
|
||||
```text
[BLOCKED] t_<source_task_id> <blocker_type>
```
|
||||
|
||||
### Canonical blocker types
|
||||
|
||||
- `scope_boundary`
- `env_blocker`
- `credential_failure`
- `dependency`
- `iteration_budget`
- `rate_limited`
|
||||
|
||||
### Canonical body
|
||||
|
||||
```markdown
## Distress Signal
- Blocked task: t_xxx
- Worker: <profile_name>
- Branch: <git_branch_name>
- Workspace: <path>
- Blocker type: <type>
- Completed: <what was done>
- Cannot touch: <out-of-scope packages/files>
- Needs: <what the orchestrator should do>
- State: committed | uncommitted | stashed(<stash_name>)

## Scope Guard
DO NOT touch: anything outside diagnosing and remediating the blocker described above
Only fix: assign, split, reassign, or unblock the source task
```
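
Rendering the contract from structured fields keeps worker-authored and watcher-authored cards identical. The sketch below assumes a hypothetical `Distress` record and an `orchestrator` assignee name; only the title format, the blocker types, and the body lines come from the contract above.

```python
from dataclasses import dataclass
from typing import Dict

BLOCKER_TYPES = {
    "scope_boundary", "env_blocker", "credential_failure",
    "dependency", "iteration_budget", "rate_limited",
}


@dataclass
class Distress:
    """Hypothetical input record; task_id is given without the t_ prefix."""
    task_id: str
    worker: str
    branch: str
    workspace: str
    blocker_type: str
    completed: str
    cannot_touch: str
    needs: str
    state: str


def render_card(d: Distress, assignee: str = "orchestrator") -> Dict[str, str]:
    """Build the title and body of a distress card in the canonical layout."""
    if d.blocker_type not in BLOCKER_TYPES:
        raise ValueError(f"unknown blocker type: {d.blocker_type}")
    body = "\n".join([
        "## Distress Signal",
        f"- Blocked task: t_{d.task_id}",
        f"- Worker: {d.worker}",
        f"- Branch: {d.branch}",
        f"- Workspace: {d.workspace}",
        f"- Blocker type: {d.blocker_type}",
        f"- Completed: {d.completed}",
        f"- Cannot touch: {d.cannot_touch}",
        f"- Needs: {d.needs}",
        f"- State: {d.state}",
        "",
        "## Scope Guard",
        "DO NOT touch: anything outside diagnosing and remediating the blocker described above",
        "Only fix: assign, split, reassign, or unblock the source task",
    ])
    return {
        "title": f"[BLOCKED] t_{d.task_id} {d.blocker_type}",
        "assignee": assignee,
        "status": "ready",
        "body": body,
    }
```

Rejecting unknown blocker types at render time keeps the title vocabulary closed, so routing can match on it reliably.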
|
||||
|
||||
## Routing rules
|
||||
|
||||
### Distress card routing
|
||||
|
||||
- `[BLOCKED]` title prefix should bypass normal triage.
- The card should go directly to the orchestration profile.
- The orchestrator should start from a clean session each time.
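
These routing rules reduce to a title match. In the sketch below the returned dictionary keys, the `triage` queue name, and the `orchestrator` profile name are hypothetical; only the title format is taken from the distress card contract.

```python
import re
from typing import Dict

# Matches the canonical title: [BLOCKED] t_<source_task_id> <blocker_type>
BLOCKED_TITLE = re.compile(r"^\[BLOCKED\] t_(?P<source>\S+) (?P<blocker>[a-z_]+)$")


def route(title: str, orchestrator_profile: str = "orchestrator") -> Dict[str, object]:
    """Send [BLOCKED] cards straight to the orchestrator; everything else to triage."""
    match = BLOCKED_TITLE.match(title)
    if match is None:
        return {"queue": "triage"}
    return {
        "queue": "ready",                 # bypasses normal triage
        "assignee": orchestrator_profile,
        "source_task": match["source"],
        "blocker_type": match["blocker"],
        "fresh_session": True,            # orchestrator starts from a clean context
    }
```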
|
||||
|
||||
### Rate-limit fallback
|
||||
|
||||
When the source task is rate-limited:
|
||||
|
||||
- do not keep retrying in the worker
- let the watcher synthesize the distress card
- have the orchestrator reassign the source task to a different profile/provider combo
|
||||
|
||||
### Provider fallback principle
|
||||
|
||||
Never reassign rate-limited work back to the same provider if the failure was provider pressure. Use a different provider when possible.
|
||||
|
||||
### Suggested fallback order
|
||||
|
||||
1. Keep the current task body and scope guards intact.
2. Reassign to a different profile on a different provider.
3. If that is impossible, reassign to a different profile on the same provider only for non-rate-limit blockers.
4. If repeated failures continue, split the task into a narrower atomic card.
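
Steps 2 and 3 of the fallback order can be expressed as a small selection function. The `(profile, provider)` tuple shape and the example names are assumptions; a `None` result stands for "no safe reassignment, consider splitting the task" (step 4).

```python
from typing import List, Optional, Tuple

Assignment = Tuple[str, str]  # (profile, provider); a hypothetical shape


def pick_fallback(current: Assignment, candidates: List[Assignment],
                  blocker_type: str) -> Optional[Assignment]:
    """Prefer another provider; allow the same provider only for non-rate-limit blockers."""
    current_profile, current_provider = current
    others = [c for c in candidates if c[0] != current_profile]
    for candidate in others:
        if candidate[1] != current_provider:
            return candidate  # step 2: different profile on a different provider
    if blocker_type != "rate_limited" and others:
        return others[0]      # step 3: same provider, non-rate-limit blockers only
    return None               # nothing suitable: the caller should split the task


candidates = [("coder-a", "provider-1"), ("coder-b", "provider-1"), ("coder-c", "provider-2")]
print(pick_fallback(("coder-a", "provider-1"), candidates, "rate_limited"))
# ('coder-c', 'provider-2')
```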
|
||||
|
||||
## Related recovery docs
|
||||
|
||||
- Mission packet recovery contract: `/opt/hermes/docs/mission-toolset-heartbeat.md`
- Hermes mission implementation plan: `/opt/hermes/docs/plans/mission-toolset-implementation.md`
- The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.
|
||||
|
||||
## Watchers to implement
|
||||
|
||||
### Auto-heal watcher
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- reap stale workers
- reset dead-PID crash loops
- track reset counts
- escalate after repeated resets
|
||||
|
||||
### Distress synthesizer watcher
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- detect rate-limited / stuck workers
- create `[BLOCKED]` cards mechanically
- link the card to the source task
- leave a comment for traceability
|
||||
|
||||
### Iteration-budget watcher
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- detect long-running tasks and repeated failure patterns
- recommend splits when a task is clearly over-scoped
- report tasks that need human review after multiple resets
|
||||
|
||||
## Operational principle
|
||||
|
||||
If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.
|
||||
|
||||
This is what makes the system robust across compaction, rate limits, and dead workers.
|
||||
|
||||
## Suggested implementation order
|
||||
|
||||
1. Atomic task metadata in task bodies
2. Worker-authored distress card protocol
3. Mechanical distress synthesizer watcher
4. Auto-heal watcher for dead workers
5. Orchestrator routing rules for `[BLOCKED]`
6. Rate-limit fallback / model reassignment table
|
||||
|
||||
## Where this fits in Hermes
|
||||
|
||||
- Kanban = durable work graph and status engine
- Watchers = mechanical healing and distress synthesis
- Orchestrator = split / reassign / unblock decision-maker
- Workers = execution inside atomic task boundaries
|
||||
|
||||
## Where this fits in Mosaic Stack
|
||||
|
||||
- PRD / coordination infra should encode the same patterns
- Mosaic can use the same distress-card contract and watcher logic
- The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue
|
||||
|
||||
## Cross-project takeaway
|
||||
|
||||
The important pattern is not the specific tool names. It is the mechanical feedback loop:
|
||||
|
||||
- detect failure without requiring the failing worker to succeed
- create a standardized help artifact
- route that artifact to a fresh orchestrator context
- repair the assignment graph
- continue the mission
|
||||
|
||||
That pattern is reusable anywhere.
|
||||