# Mosaic Stack ↔ Hermes Coordination Resilience
> Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.
## Summary
The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.
## SIBKISS operational summary
- mission on
- heartbeat always
- resume from packet
- block with `[BLOCKED]`
- reassign
- keep tasks tiny
- auto-heal dead workers

The design has four parts:
1. Atomic task decomposition — workers operate only within a small, explicit scope.
2. Distress signaling — workers create a standardized `[BLOCKED]` card when they encounter a blocker outside their scope.
3. Mechanical fallback — if the worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card for them.
4. Auto-heal / reassignment — stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.
## Why this exists
Observed failure modes:
- Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
- Silent failure / dead worker: the worker PID is gone, but the task is still marked `running` or `blocked`.
- Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.
The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.
## Core workflow
### 1) Atomic task boundaries
Every task should have:
- one concern
- explicit files/packages in scope
- explicit files/packages out of scope
- a maximum file count if possible
- a stated expected iteration budget

When a worker discovers work outside scope, it must stop fixing it and hand off.
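
A minimal sketch of how these boundaries could be made machine-checkable, assuming scope is expressed as glob patterns. `TaskScope`, `check_edit`, and every field name here are hypothetical illustrations, not part of Hermes or Mosaic Stack.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class TaskScope:
    """Hypothetical scope record; field names are illustrative only."""
    concern: str                           # the single concern of the task
    in_scope: tuple[str, ...]              # glob patterns the worker may edit
    out_of_scope: tuple[str, ...]          # glob patterns the worker must not edit
    max_files: Optional[int] = None        # optional cap on distinct files touched
    iteration_budget: Optional[int] = None # expected iteration budget, if stated


def check_edit(scope: TaskScope, path: str, files_touched: set[str]) -> str:
    """Return "ok" if editing `path` stays in scope, else "handoff"."""
    # Explicit exclusions win over inclusions.
    if any(fnmatchcase(path, pat) for pat in scope.out_of_scope):
        return "handoff"
    # Anything not explicitly in scope is treated as out of scope.
    if not any(fnmatchcase(path, pat) for pat in scope.in_scope):
        return "handoff"
    # Enforce the file-count cap, counting the proposed edit.
    would_touch = files_touched | {path}
    if scope.max_files is not None and len(would_touch) > scope.max_files:
        return "handoff"
    return "ok"
```

A worker would call `check_edit` before each write and stop to hand off as soon as it returns `"handoff"`. Treating unlisted paths as out of scope is a design choice of this sketch; it keeps the default conservative.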
### 2) Worker-authored distress card
If the worker can still report status, it creates a card like:
- Title: `[BLOCKED] t_<source_task_id> <blocker_type>`
- Assignee: `tuesday` / orchestrator role
- Status: `ready`
- Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action
- Source task stays linked to the distress card so the recovery trail is auditable

The orchestrator receives the card, acts on it, and closes the loop.
### 3) Mechanical fallback for rate-limited workers
If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.
That watcher should:
- inspect running / blocked tasks
- detect repeated 429 / 503 / overload errors
- create the same standardized `[BLOCKED]` card on behalf of the worker
- link the distress card to the source task
- add a comment to the source task
- allow the dispatcher to pick up the new card immediately

This closes the gap in the worker-authored protocol: a worker that cannot phone home does not need to, because the watcher files the card mechanically.
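
The watcher steps above can be sketched as a pure function over task rows. This is an illustration under stated assumptions: the row shape (`id`, `status`, `errors`), the `MIN_REPEATS` threshold, and the function names are all hypothetical, and task ids are assumed to already carry the `t_` prefix.

```python
RATE_LIMIT_CODES = {429, 503}
MIN_REPEATS = 3  # assumed threshold for "repeated" errors


def is_rate_limited(errors: list[dict]) -> bool:
    """True when enough recent errors look like provider pressure."""
    hits = [
        e for e in errors
        if e.get("status") in RATE_LIMIT_CODES
        or "overload" in str(e.get("message", "")).lower()
    ]
    return len(hits) >= MIN_REPEATS


def synthesize_cards(tasks: list[dict], existing_titles: set[str]) -> list[dict]:
    """Build [BLOCKED] cards for rate-limited tasks that lack one."""
    cards = []
    for task in tasks:
        # Only running / blocked tasks are inspected.
        if task["status"] not in ("running", "blocked"):
            continue
        if not is_rate_limited(task.get("errors", [])):
            continue
        title = f"[BLOCKED] {task['id']} rate_limited"
        if title in existing_titles:
            continue  # keep the watcher idempotent across cron runs
        cards.append({
            "title": title,
            "assignee": "tuesday",
            "status": "ready",
            "linked_task": task["id"],
            "source_comment": f"Watcher created '{title}' on the worker's behalf.",
        })
    return cards
```

Skipping tasks that already have a card is an addition of this sketch, so that a cron-style watcher can run repeatedly without flooding the board.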
### 4) Auto-heal for dead workers
A separate no-agent watcher should:
- reap dead PIDs stuck in `running`
- reset crash-loops whose failures are infrastructure-related
- escalate tasks that have been reset too many times

This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.
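
A minimal sketch of the reap / reset / escalate decision, assuming a POSIX host and a task row with `status`, `pid`, `reset_count`, and `infra_failure` fields. Those field names, the `MAX_RESETS` value, and the `escalated` status are hypothetical. Escalating non-infrastructure failures is also an assumption of the sketch.

```python
import os

MAX_RESETS = 3  # assumed cap before a task is escalated


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def heal(task: dict, alive=pid_alive) -> dict:
    """Return an updated copy of `task`; the input is not mutated."""
    if task["status"] != "running" or alive(task["pid"]):
        return dict(task)  # nothing to heal
    healed = dict(task, pid=None)  # reap the dead PID
    resets = task.get("reset_count", 0)
    if task.get("infra_failure") and resets < MAX_RESETS:
        # Infrastructure-related crash: put the task back in the queue.
        healed.update(status="ready", reset_count=resets + 1)
    else:
        # Too many resets, or not an infrastructure failure: escalate.
        healed.update(status="escalated")
    return healed
```

Passing `alive` as a parameter keeps the decision logic testable without real processes.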
## Distress card contract
### Canonical title
```text
[BLOCKED] t_<source_task_id> <blocker_type>
```
### Canonical blocker types
- `scope_boundary`
- `env_blocker`
- `credential_failure`
- `dependency`
- `iteration_budget`
- `rate_limited`
### Canonical body
```markdown
## Distress Signal
- Blocked task: t_xxx
- Worker: <profile_name>
- Branch: <git_branch_name>
- Workspace: <path>
- Blocker type: <type>
- Completed: <what was done>
- Cannot touch: <out-of-scope packages/files>
- Needs: <what the orchestrator should do>
- State: committed | uncommitted | stashed(<stash_name>)
## Scope Guard
DO NOT touch: anything outside diagnosing and remediating the blocker described above
Only fix: assign, split, reassign, or unblock the source task
```
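
The contract above can be rendered mechanically, so worker-authored and watcher-authored cards come out identical. The sketch below is one possible shape: `Distress` and `render_card` are hypothetical names, and `source_task_id` is assumed to be stored without the `t_` prefix.

```python
from dataclasses import dataclass

BLOCKER_TYPES = frozenset({
    "scope_boundary", "env_blocker", "credential_failure",
    "dependency", "iteration_budget", "rate_limited",
})


@dataclass(frozen=True)
class Distress:
    source_task_id: str  # assumed to exclude the "t_" prefix
    worker: str
    branch: str
    workspace: str
    blocker_type: str
    completed: str
    cannot_touch: str
    needs: str
    state: str           # "committed", "uncommitted", or "stashed(<stash_name>)"


def render_card(d: Distress) -> tuple[str, str]:
    """Return (title, body) following the canonical distress contract."""
    if d.blocker_type not in BLOCKER_TYPES:
        raise ValueError(f"unknown blocker type: {d.blocker_type!r}")
    title = f"[BLOCKED] t_{d.source_task_id} {d.blocker_type}"
    body = "\n".join([
        "## Distress Signal",
        f"- Blocked task: t_{d.source_task_id}",
        f"- Worker: {d.worker}",
        f"- Branch: {d.branch}",
        f"- Workspace: {d.workspace}",
        f"- Blocker type: {d.blocker_type}",
        f"- Completed: {d.completed}",
        f"- Cannot touch: {d.cannot_touch}",
        f"- Needs: {d.needs}",
        f"- State: {d.state}",
        "## Scope Guard",
        "DO NOT touch: anything outside diagnosing and remediating the blocker described above",
        "Only fix: assign, split, reassign, or unblock the source task",
    ])
    return title, body
```

Rejecting unknown blocker types at render time keeps the orchestrator from having to handle free-form values.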
## Routing rules
### Distress card routing
- `[BLOCKED]` title prefix should bypass normal triage.
- The card should go directly to the orchestration profile.
- The orchestrator should start from a clean session each time.
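
As an illustration of the prefix rule, a dispatcher could branch on the title before any triage runs. The `route` function, the queue names, and the returned fields are hypothetical. Only the `[BLOCKED]` prefix, the title format, and the `tuesday` assignee come from this document.

```python
import re

# Matches the canonical title: [BLOCKED] t_<source_task_id> <blocker_type>
BLOCKED_TITLE = re.compile(r"^\[BLOCKED\] (?P<task>t_\S+) (?P<blocker>\S+)$")


def route(title: str, orchestrator: str = "tuesday") -> dict:
    """Decide where a new card goes based on its title alone."""
    if not title.startswith("[BLOCKED]"):
        return {"queue": "triage"}  # normal path
    match = BLOCKED_TITLE.match(title)
    return {
        "queue": "orchestrator",   # bypass normal triage
        "assignee": orchestrator,
        "fresh_session": True,     # orchestrator starts from a clean session
        # Parsed fields are None when the title is malformed; the card is
        # still routed to the orchestrator because the prefix is what matters.
        "source_task": match["task"] if match else None,
        "blocker_type": match["blocker"] if match else None,
    }
```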
### Rate-limit fallback
When the source task is rate-limited:
- do not keep retrying in the worker
- let the watcher synthesize the distress card
- have the orchestrator reassign the source task to a different profile/provider combo
### Provider fallback principle
Never reassign rate-limited work back to the same provider if the failure was provider pressure. Use a different provider when possible.
### Suggested fallback order
1. Keep the current task body and scope guards intact.
2. Reassign to a different profile on a different provider.
3. If that is impossible, reassign to a different profile on the same provider only for non-rate-limit blockers.
4. If repeated failures continue, split the task into a narrower atomic card.
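
The fallback order can be sketched as a small selection function. It assumes profiles are described as `(profile, provider)` pairs; the function name and the convention of returning `None` to signal "split the task" are hypothetical.

```python
from typing import Optional

Slot = tuple[str, str]  # (profile, provider)


def pick_reassignment(
    current: Slot, candidates: list[Slot], blocker_type: str
) -> Optional[Slot]:
    """Pick the next (profile, provider), or None if the task should be split."""
    cur_profile, cur_provider = current
    # Step 2: prefer a different profile on a different provider.
    for profile, provider in candidates:
        if profile != cur_profile and provider != cur_provider:
            return (profile, provider)
    # Step 3: same provider is acceptable only for non-rate-limit blockers.
    if blocker_type != "rate_limited":
        for profile, provider in candidates:
            if profile != cur_profile and provider == cur_provider:
                return (profile, provider)
    # Step 4 is left to the caller: no viable target, so split the task.
    return None
```

Step 1 (keeping the task body and scope guards intact) needs no code here, because the function only selects a target and never edits the card.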
## Related recovery docs
- Mission packet recovery contract: `/opt/hermes/docs/mission-toolset-heartbeat.md`
- Hermes mission implementation plan: `/opt/hermes/docs/plans/mission-toolset-implementation.md`
- The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.
- New-session trigger: when a profile config changes, start a fresh session or run `/reset` so the updated toolset is actually loaded.
## Watchers to implement
### Auto-heal watcher
Responsibilities:
- reap stale workers
- reset dead-PID crash loops
- track reset counts
- escalate after repeated resets
### Distress synthesizer watcher
Responsibilities:
- detect rate-limited / stuck workers
- create `[BLOCKED]` cards mechanically
- link the card to the source task
- leave a comment for traceability
### Iteration-budget watcher
Responsibilities:
- detect long-running tasks and repeated failure patterns
- recommend splits when a task is clearly over-scoped
- report tasks that need human review after multiple resets
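
A rough sketch of the iteration-budget check, assuming each task row records `iterations_used`, `iteration_budget`, and `reset_count`. Those field names, the default `max_resets`, and the returned labels are hypothetical.

```python
from typing import Optional


def review_task(task: dict, max_resets: int = 3) -> Optional[str]:
    """Return a recommendation for `task`, or None if it looks healthy."""
    # Repeated resets take priority: a human should look at the task.
    if task.get("reset_count", 0) >= max_resets:
        return "human_review"
    # Over budget suggests the task is over-scoped and should be split.
    budget = task.get("iteration_budget")
    if budget is not None and task.get("iterations_used", 0) > budget:
        return "split"
    return None
```

For example, a row with `iterations_used=40` and `iteration_budget=25` yields `"split"`, while any row at or above `max_resets` yields `"human_review"` regardless of budget.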
## Operational principle
If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.
This is what makes the system robust across compaction, rate limits, and dead workers.
## Suggested implementation order
1. Atomic task metadata in task bodies
2. Worker-authored distress card protocol
3. Mechanical distress synthesizer watcher
4. Auto-heal watcher for dead workers
5. Orchestrator routing rules for `[BLOCKED]`
6. Rate-limit fallback / model reassignment table
## Where this fits in Hermes
- Kanban = durable work graph and status engine
- Watchers = mechanical healing and distress synthesis
- Orchestrator = split / reassign / unblock decision-maker
- Workers = execution inside atomic task boundaries
## Where this fits in Mosaic Stack
- PRD / coordination infra should encode the same patterns
- Mosaic can use the same distress-card contract and watcher logic
- The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue
## Cross-project takeaway
The important pattern is not the specific tool names. It is the mechanical feedback loop:
- detect failure without requiring the failing worker to succeed
- create a standardized help artifact
- route that artifact to a fresh orchestrator context
- repair the assignment graph
- continue the mission

That pattern is reusable anywhere.