docs: add mission control and coordination resilience docs
docs/plans/2026-05-07-coordination-resilience.md
@@ -0,0 +1,233 @@
# Mosaic Stack ↔ Hermes Coordination Resilience

> Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.

## Summary

The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.

## SIBKISS operational summary

- mission on
- heartbeat always
- resume from packet
- block with `[BLOCKED]`
- reassign
- keep tasks tiny
- auto-heal dead workers

The design has four parts:

1. Atomic task decomposition: workers operate only within a small, explicit scope.
2. Distress signaling: workers create a standardized `[BLOCKED]` card when they encounter a blocker outside their scope.
3. Mechanical fallback: if a worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card on its behalf.
4. Auto-heal / reassignment: stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.

## Why this exists

Observed failure modes:

- Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
- Silent failure / dead worker: the worker PID is gone, but the task is still marked `running` or `blocked`.
- Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.

The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.

## Core workflow

### 1) Atomic task boundaries

Every task should have:

- one concern
- explicit files/packages in scope
- explicit files/packages out of scope
- a maximum file count if possible
- a stated expected iteration budget

When a worker discovers work outside its scope, it must stop and hand that work off instead of fixing it.
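
The scope fields above can be checked mechanically before a worker commits. The following is a minimal sketch, assuming scope is expressed as glob patterns over file paths; `TaskScope`, `scope_violations`, and the example paths are hypothetical names chosen for illustration, not part of Hermes or Mosaic Stack.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TaskScope:
    """Hypothetical scope metadata for one atomic task."""
    concern: str                              # the single concern of the task
    in_scope: Tuple[str, ...]                 # glob patterns the worker may edit
    out_of_scope: Tuple[str, ...]             # glob patterns the worker must not edit
    max_files: Optional[int] = None           # optional cap on touched files
    iteration_budget: Optional[int] = None    # expected iteration budget


def scope_violations(scope: TaskScope, touched: List[str]) -> List[str]:
    """Return reasons to stop and hand off; an empty list means the change fits."""
    problems = []
    for path in touched:
        if any(fnmatchcase(path, pat) for pat in scope.out_of_scope):
            problems.append(f"out of scope: {path}")
        elif not any(fnmatchcase(path, pat) for pat in scope.in_scope):
            problems.append(f"not declared in scope: {path}")
    if scope.max_files is not None and len(touched) > scope.max_files:
        problems.append(f"{len(touched)} files touched, cap is {scope.max_files}")
    return problems


# Example values are made up for illustration.
scope = TaskScope(
    concern="fix token refresh",
    in_scope=("packages/api/*",),
    out_of_scope=("packages/web/*",),
    max_files=2,
)
print(scope_violations(scope, ["packages/api/auth.py", "packages/web/login.tsx"]))
```

Any non-empty result is a signal to stop and raise a `scope_boundary` distress card rather than continue editing.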

### 2) Worker-authored distress card

If the worker can still report status, it creates a card like this:

- Title: `[BLOCKED] t_<source_id> <blocker_type>`
- Assignee: `tuesday` / orchestrator role
- Status: `ready`
- Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action

The source task stays linked to the distress card so the recovery trail is auditable.

The orchestrator receives the card, acts on it, and closes the loop.

### 3) Mechanical fallback for rate-limited workers

If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.

That watcher should:

- inspect running / blocked tasks
- detect repeated 429 / 503 / overload errors
- create the same standardized `[BLOCKED]` card on behalf of the worker
- link the distress card to the source task
- add a comment to the source task
- allow the dispatcher to pick up the new card immediately

This closes the gap in the worker-authored protocol: a worker that cannot phone home no longer stalls the loop, because the watcher raises the card mechanically.
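
A minimal sketch of such a watcher pass follows. It assumes the task row exposes its recent HTTP error codes, and it approximates "overload" with 429 and 503 only. `TaskRow`, `MIN_REPEATS`, and the card dictionary layout are hypothetical; only the title format, the `tuesday` assignee, and the `ready` status come from this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

RATE_LIMIT_STATUSES = {429, 503}   # overload is approximated by these codes here
MIN_REPEATS = 3                    # hypothetical threshold for "repeated" errors


@dataclass
class TaskRow:
    """Hypothetical view of one row on the task board."""
    task_id: str                                   # e.g. "t_42"
    status: str                                    # e.g. "running" or "blocked"
    recent_errors: List[int] = field(default_factory=list)  # newest last
    comments: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)


def is_rate_limited(task: TaskRow, min_repeats: int = MIN_REPEATS) -> bool:
    """True when the last `min_repeats` errors were all rate-limit style codes."""
    tail = task.recent_errors[-min_repeats:]
    return len(tail) == min_repeats and all(c in RATE_LIMIT_STATUSES for c in tail)


def synthesize_cards(tasks: List[TaskRow], existing_titles: Set[str]) -> List[Dict[str, str]]:
    """Create [BLOCKED] cards on behalf of workers that cannot phone home."""
    cards = []
    for task in tasks:
        if task.status not in ("running", "blocked") or not is_rate_limited(task):
            continue
        title = f"[BLOCKED] {task.task_id} rate_limited"
        if title in existing_titles:
            continue  # keep the watcher idempotent across cron runs
        cards.append({
            "title": title,
            "assignee": "tuesday",
            "status": "ready",      # ready so the dispatcher can pick it up at once
            "source_task": task.task_id,
        })
        task.links.append(title)    # link the card to the source task
        task.comments.append(f"watcher: synthesized distress card '{title}'")
    return cards
```

Checking `existing_titles` matters because a cron-style watcher sees the same failing task on every pass and would otherwise open a new card each time.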

### 4) Auto-heal for dead workers

A separate no-agent watcher should:

- reap dead PIDs stuck in `running`
- reset crash-loops whose failures are infrastructure-related
- escalate tasks that have been reset too many times

This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.
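
The reap, reset, and escalate steps can be sketched as below. This is an illustration under stated assumptions: it covers only the dead-PID case (not the classification of infrastructure failures), the liveness probe is POSIX-only, and `RunningTask`, `MAX_RESETS`, and the `escalated` status are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Callable

MAX_RESETS = 3  # hypothetical escalation threshold


def pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks for existence without delivering a signal."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


@dataclass
class RunningTask:
    """Hypothetical view of a task row with its worker PID."""
    task_id: str
    pid: int
    status: str = "running"
    reset_count: int = 0


def heal(task: RunningTask,
         alive: Callable[[int], bool] = pid_alive,
         max_resets: int = MAX_RESETS) -> str:
    """Reap a dead worker: reset the task, or escalate after too many resets."""
    if task.status != "running" or alive(task.pid):
        return "ok"
    if task.reset_count >= max_resets:
        task.status = "escalated"   # hand to a human or the orchestrator
        return "escalated"
    task.reset_count += 1
    task.status = "ready"           # back into the dispatch queue
    return "reset"
```

Passing the liveness check in as a parameter keeps the reset logic testable without spawning real processes.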

## Distress card contract

### Canonical title

```text
[BLOCKED] t_<source_task_id> <blocker_type>
```

### Canonical blocker types

- `scope_boundary`
- `env_blocker`
- `credential_failure`
- `dependency`
- `iteration_budget`
- `rate_limited`
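
One way to keep titles and blocker types consistent is to generate and parse them from a single definition. The sketch below assumes the task identifier contains no whitespace; `BlockerType`, `format_title`, and `parse_title` are hypothetical helpers, while the enum values are the canonical types listed above.

```python
import re
from enum import Enum
from typing import Optional, Tuple


class BlockerType(str, Enum):
    SCOPE_BOUNDARY = "scope_boundary"
    ENV_BLOCKER = "env_blocker"
    CREDENTIAL_FAILURE = "credential_failure"
    DEPENDENCY = "dependency"
    ITERATION_BUDGET = "iteration_budget"
    RATE_LIMITED = "rate_limited"


# Assumes the source id has no spaces, so the title splits cleanly into two fields.
_TITLE_RE = re.compile(r"^\[BLOCKED\] t_(\S+) (\S+)$")


def format_title(source_id: str, blocker: BlockerType) -> str:
    """Build the canonical distress card title."""
    return f"[BLOCKED] t_{source_id} {blocker.value}"


def parse_title(title: str) -> Optional[Tuple[str, BlockerType]]:
    """Return (source_id, blocker) for a canonical title, else None."""
    match = _TITLE_RE.match(title)
    if match is None:
        return None
    source_id, raw_type = match.groups()
    try:
        return source_id, BlockerType(raw_type)
    except ValueError:
        return None  # unknown blocker type: treat as a non-canonical card


title = format_title("a1b2", BlockerType.RATE_LIMITED)
print(title, parse_title(title))
```

Rejecting unknown blocker types at parse time keeps a mistyped card from being routed as if it were a valid distress signal.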

### Canonical body

```markdown
## Distress Signal
- Blocked task: t_xxx
- Worker: <profile_name>
- Branch: <git_branch_name>
- Workspace: <path>
- Blocker type: <type>
- Completed: <what was done>
- Cannot touch: <out-of-scope packages/files>
- Needs: <what the orchestrator should do>
- State: committed | uncommitted | stashed(<stash_name>)

## Scope Guard
DO NOT touch: anything outside diagnosing and remediating the blocker described above
Only fix: assign, split, reassign, or unblock the source task
```
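
Because workers and watchers must emit the same body, rendering it from one function avoids drift between the two. This is a minimal sketch; `DistressCard` and `render_body` are hypothetical names, and the field values in the usage example are made up.

```python
from dataclasses import dataclass


@dataclass
class DistressCard:
    """Hypothetical container for the fields of the canonical body."""
    task_id: str
    worker: str
    branch: str
    workspace: str
    blocker_type: str
    completed: str
    cannot_touch: str
    needs: str
    state: str  # "committed", "uncommitted", or "stashed(<stash_name>)"


def render_body(card: DistressCard) -> str:
    """Render the canonical distress card body as Markdown."""
    lines = [
        "## Distress Signal",
        f"- Blocked task: {card.task_id}",
        f"- Worker: {card.worker}",
        f"- Branch: {card.branch}",
        f"- Workspace: {card.workspace}",
        f"- Blocker type: {card.blocker_type}",
        f"- Completed: {card.completed}",
        f"- Cannot touch: {card.cannot_touch}",
        f"- Needs: {card.needs}",
        f"- State: {card.state}",
        "",
        "## Scope Guard",
        "DO NOT touch: anything outside diagnosing and remediating the blocker described above",
        "Only fix: assign, split, reassign, or unblock the source task",
    ]
    return "\n".join(lines)


example = DistressCard(
    task_id="t_42", worker="coder-a", branch="fix/token-refresh",
    workspace="/tmp/ws-42", blocker_type="rate_limited",
    completed="patched refresh handler", cannot_touch="packages/web",
    needs="reassign to another provider", state="uncommitted",
)
print(render_body(example))
```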

## Routing rules

### Distress card routing

- `[BLOCKED]` title prefix should bypass normal triage.
- The card should go directly to the orchestration profile.
- The orchestrator should start from a clean session each time.
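
The three rules above reduce to a prefix check at dispatch time. The sketch below assumes a dispatcher that returns a routing decision as a dictionary; the `triage` profile name and the decision keys are hypothetical, and `tuesday` is the orchestrator assignee used elsewhere in this document.

```python
from typing import Dict, Union

BLOCKED_PREFIX = "[BLOCKED]"


def route(card_title: str,
          orchestrator_profile: str = "tuesday",
          triage_profile: str = "triage") -> Dict[str, Union[str, bool]]:
    """Decide who gets a new card and whether the session must be fresh."""
    if card_title.startswith(BLOCKED_PREFIX):
        # Distress cards skip triage and always get a clean orchestrator session.
        return {"assignee": orchestrator_profile, "skip_triage": True, "fresh_session": True}
    return {"assignee": triage_profile, "skip_triage": False, "fresh_session": False}


print(route("[BLOCKED] t_42 rate_limited"))
print(route("Add pagination to task list"))
```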

### Rate-limit fallback

When the source task is rate-limited:

- do not keep retrying in the worker
- let the watcher synthesize the distress card
- have the orchestrator reassign the source task to a different profile/provider combo

### Provider fallback principle

Do not reassign rate-limited work to the same provider when the failure was caused by provider pressure. Use a different provider whenever one is available.

### Suggested fallback order

1. Keep the current task body and scope guards intact.
2. Reassign to a different profile on a different provider.
3. If no other provider is available, reassign to a different profile on the same provider, but only for non-rate-limit blockers.
4. If repeated failures continue, split the task into a narrower atomic card.
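
Steps 2 to 4 of the fallback order can be sketched as a selection function. `Profile` and `pick_reassignment` are hypothetical names, and the profile and provider strings in the example are placeholders; returning `None` stands for "no valid reassignment, split the task instead".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical worker profile bound to one model provider."""
    name: str
    provider: str


def pick_reassignment(current: Profile,
                      candidates: List[Profile],
                      blocker_type: str) -> Optional[Profile]:
    """Pick the next profile for a blocked task, or None if the task should be split."""
    others = [p for p in candidates if p.name != current.name]
    # Step 2: prefer a different profile on a different provider.
    for profile in others:
        if profile.provider != current.provider:
            return profile
    # Step 3: same provider is acceptable only when the blocker is not a rate limit.
    if blocker_type != "rate_limited" and others:
        return others[0]
    # Step 4: nothing suitable, so the caller should split into a narrower card.
    return None


current = Profile("coder-a", "provider-x")
pool = [current, Profile("coder-b", "provider-x"), Profile("coder-c", "provider-y")]
print(pick_reassignment(current, pool, "rate_limited"))
```

Step 1 is implicit here: the function only chooses an assignee and never touches the task body or its scope guards.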

## Related recovery docs

- Mission packet recovery contract: `/opt/hermes/docs/mission-toolset-heartbeat.md`
- Hermes mission implementation plan: `/opt/hermes/docs/plans/mission-toolset-implementation.md`
- The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.

## Watchers to implement

### Auto-heal watcher

Responsibilities:

- reap stale workers
- reset dead-PID crash loops
- track reset counts
- escalate after repeated resets

### Distress synthesizer watcher

Responsibilities:

- detect rate-limited / stuck workers
- create `[BLOCKED]` cards mechanically
- link the card to the source task
- leave a comment for traceability

### Iteration-budget watcher

Responsibilities:

- detect long-running tasks and repeated failure patterns
- recommend splits when a task is clearly over-scoped
- report tasks that need human review after multiple resets
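
The iteration-budget watcher's decision can be expressed as a small classifier. This sketch assumes each task records the iterations it has used, its stated budget, and its reset count; the function name, the verdict strings, and the default `max_resets` are hypothetical.

```python
def budget_verdict(iterations_used: int,
                   iteration_budget: int,
                   reset_count: int,
                   max_resets: int = 3) -> str:
    """Classify one task for the iteration-budget report."""
    if reset_count >= max_resets:
        return "human_review"      # repeated resets: stop automating and report
    if iterations_used > iteration_budget:
        return "recommend_split"   # over budget: the card is likely over-scoped
    return "ok"


print(budget_verdict(iterations_used=45, iteration_budget=30, reset_count=1))
```

Checking the reset count first means a task that keeps failing is reported for review even when each individual attempt stays under budget.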

## Operational principle

If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.

This is what makes the system robust across compaction, rate limits, and dead workers.

## Suggested implementation order

1. Atomic task metadata in task bodies
2. Worker-authored distress card protocol
3. Mechanical distress synthesizer watcher
4. Auto-heal watcher for dead workers
5. Orchestrator routing rules for `[BLOCKED]`
6. Rate-limit fallback / model reassignment table

## Where this fits in Hermes

- Kanban = durable work graph and status engine
- Watchers = mechanical healing and distress synthesis
- Orchestrator = split / reassign / unblock decision-maker
- Workers = execution inside atomic task boundaries

## Where this fits in Mosaic Stack

- PRD / coordination infra should encode the same patterns
- Mosaic can use the same distress-card contract and watcher logic
- The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue

## Cross-project takeaway

The important pattern is not the specific tool names. It is the mechanical feedback loop:

- detect failure without requiring the failing worker to succeed
- create a standardized help artifact
- route that artifact to a fresh orchestrator context
- repair the assignment graph
- continue the mission

That pattern is reusable anywhere.