Mosaic Stack ↔ Hermes Coordination Resilience
Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.
Summary
The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.
SIBKISS operational summary
- mission on
- heartbeat always
- resume from packet
- block with [BLOCKED]
- reassign
- keep tasks tiny
- auto-heal dead workers
The design has four parts:
- Atomic task decomposition: workers operate only within a small, explicit scope.
- Distress signaling: workers create a standardized [BLOCKED] card when they encounter a blocker outside their scope.
- Mechanical fallback: if the worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card for them.
- Auto-heal / reassignment: stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.
Why this exists
Observed failure modes:
- Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
- Silent failure / dead worker: the worker PID is gone, but the task remains running or blocked.
- Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.
The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.
Core workflow
1) Atomic task boundaries
Every task should have:
- one concern
- explicit files/packages in scope
- explicit files/packages out of scope
- a maximum file count if possible
- a stated expected iteration budget
When a worker discovers work outside its scope, it must stop rather than fix it, and hand off.
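The boundary fields above could be encoded as a small record with a mechanical check. This is a minimal sketch, not the Hermes implementation: `TaskScope`, `check_scope`, the glob-pattern convention, and the default budget of 20 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional


@dataclass(frozen=True)
class TaskScope:
    """Hypothetical scope record for one atomic task."""
    concern: str                      # the single concern this task covers
    in_scope: tuple                   # glob patterns the worker may edit
    out_of_scope: tuple               # glob patterns the worker must never edit
    max_files: Optional[int] = None   # optional cap on touched files
    iteration_budget: int = 20        # expected iterations (assumed default)


def check_scope(scope: TaskScope, touched: list) -> list:
    """Return a list of violations; an empty list means the change is in scope."""
    violations = []
    for path in touched:
        if any(fnmatch(path, pat) for pat in scope.out_of_scope):
            violations.append(f"out of scope: {path}")
        elif not any(fnmatch(path, pat) for pat in scope.in_scope):
            violations.append(f"not listed in scope: {path}")
    if scope.max_files is not None and len(touched) > scope.max_files:
        violations.append(f"touched {len(touched)} files, cap is {scope.max_files}")
    return violations
```

A worker (or a pre-commit hook) could call `check_scope` on its changed paths and treat any non-empty result as the signal to stop and hand off.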
2) Worker-authored distress card
If the worker can still report status, it creates a card like:
- Title: [BLOCKED] t_<source_id> <blocker_type>
- Assignee: tuesday / orchestrator role
- Status: ready
- Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action
- Source task stays linked to the distress card so the recovery trail is auditable
The orchestrator receives the card, acts on it, and closes the loop.
3) Mechanical fallback for rate-limited workers
If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.
That watcher should:
- inspect running / blocked tasks
- detect repeated 429 / 503 / overload errors
- create the same standardized [BLOCKED] card on behalf of the worker
- link the distress card to the source task
- add a comment to the source task
- allow the dispatcher to pick up the new card immediately
This closes the logic gap in worker-authored distress signaling: the worker does not need to be able to phone home, because the watcher can do it mechanically.
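The watcher steps above can be sketched as a pure function over task rows. Everything here is an assumption for illustration: the task-row shape (`id`, `status`, `errors`), the threshold of three matching errors, and the function names. Only the title format, assignee, and status values come from this document.

```python
import re

# Matches the error classes named above: 429, 503, or "overload".
RATE_LIMIT_PATTERN = re.compile(r"\b(429|503)\b|overload", re.IGNORECASE)


def is_rate_limited(errors: list, threshold: int = 3) -> bool:
    """True when at least `threshold` error lines look like provider pressure."""
    hits = sum(1 for line in errors if RATE_LIMIT_PATTERN.search(line))
    return hits >= threshold


def synthesize_cards(tasks: list, existing_titles: set, threshold: int = 3) -> list:
    """Build [BLOCKED] cards on behalf of workers that cannot phone home."""
    cards = []
    for task in tasks:
        if task["status"] not in ("running", "blocked"):
            continue
        if not is_rate_limited(task.get("errors", []), threshold):
            continue
        title = f"[BLOCKED] t_{task['id']} rate_limited"
        if title in existing_titles:
            continue  # idempotent: never file the same distress card twice
        cards.append({
            "title": title,
            "assignee": "tuesday",          # orchestrator role
            "status": "ready",              # dispatcher can pick it up immediately
            "source_task": task["id"],      # link back for the audit trail
            "source_comment": f"Watcher synthesized distress card: {title}",
        })
    return cards
```

Because the function only reads task rows and returns card dicts, it can run from a cron-style loop with no agent context; persisting the cards and posting the comment are left to whatever board API the platform provides.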
4) Auto-heal for dead workers
A separate no-agent watcher should:
- reap dead PIDs stuck in running
- reset crash-loops whose failures are infrastructure-related
- escalate tasks that have been reset too many times
This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.
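A minimal sketch of the auto-heal decision, assuming a POSIX host and a hypothetical task row with `status`, `pid`, `failure_kind`, and `reset_count` fields. The `needs_review` status and the reset cap of 3 are illustrative choices, not values taken from Hermes.

```python
import os


def pid_alive(pid: int) -> bool:
    """POSIX-only liveness probe: signal 0 checks existence without signaling."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def heal(task: dict, alive=pid_alive, max_resets: int = 3) -> str:
    """Mutate one task row and return the action taken."""
    if task["status"] != "running" or alive(task["pid"]):
        return "ok"
    task["pid"] = None  # the worker is gone; drop the stale PID
    if task.get("reset_count", 0) >= max_resets:
        task["status"] = "needs_review"   # escalate instead of looping forever
        return "escalated"
    if task.get("failure_kind") == "infrastructure":
        task["reset_count"] = task.get("reset_count", 0) + 1
        task["status"] = "ready"          # safe to redispatch
        return "reset"
    task["status"] = "blocked"            # reaped, but needs a decision
    return "reaped"
```

Passing `alive` as a parameter keeps the decision logic testable without real processes, and tracking `reset_count` on the row is what makes escalation possible.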
Distress card contract
Canonical title
[BLOCKED] t_<source_task_id> <blocker_type>
Canonical blocker types
- scope_boundary
- env_blocker
- credential_failure
- dependency
- iteration_budget
- rate_limited
Canonical body
## Distress Signal
- Blocked task: t_xxx
- Worker: <profile_name>
- Branch: <git_branch_name>
- Workspace: <path>
- Blocker type: <type>
- Completed: <what was done>
- Cannot touch: <out-of-scope packages/files>
- Needs: <what the orchestrator should do>
- State: committed | uncommitted | stashed(<stash_name>)
## Scope Guard
DO NOT touch: anything outside diagnosing and remediating the blocker described above
Only fix: assign, split, reassign, or unblock the source task
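The contract above lends itself to a single renderer shared by workers and watchers, so both produce identical cards. The sketch below assumes a hypothetical `Distress` record and `render_card` helper; the title format, blocker types, and body lines follow the contract in this section.

```python
from dataclasses import dataclass

BLOCKER_TYPES = (
    "scope_boundary", "env_blocker", "credential_failure",
    "dependency", "iteration_budget", "rate_limited",
)


@dataclass
class Distress:
    source_task_id: str   # the part after "t_"
    worker: str
    branch: str
    workspace: str
    blocker_type: str
    completed: str
    cannot_touch: str
    needs: str
    state: str            # "committed", "uncommitted", or "stashed(<stash_name>)"


def render_card(d: Distress, assignee: str = "tuesday") -> dict:
    """Render the canonical title and body; reject unknown blocker types."""
    if d.blocker_type not in BLOCKER_TYPES:
        raise ValueError(f"unknown blocker type: {d.blocker_type}")
    body = "\n".join([
        "## Distress Signal",
        f"- Blocked task: t_{d.source_task_id}",
        f"- Worker: {d.worker}",
        f"- Branch: {d.branch}",
        f"- Workspace: {d.workspace}",
        f"- Blocker type: {d.blocker_type}",
        f"- Completed: {d.completed}",
        f"- Cannot touch: {d.cannot_touch}",
        f"- Needs: {d.needs}",
        f"- State: {d.state}",
        "## Scope Guard",
        "DO NOT touch: anything outside diagnosing and remediating the blocker described above",
        "Only fix: assign, split, reassign, or unblock the source task",
    ])
    return {
        "title": f"[BLOCKED] t_{d.source_task_id} {d.blocker_type}",
        "assignee": assignee,
        "status": "ready",
        "body": body,
    }
```

Validating the blocker type at render time keeps free-form types off the board, which matters if the orchestrator routes on that field.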
Routing rules
Distress card routing
- The [BLOCKED] title prefix should bypass normal triage.
- The card should go directly to the orchestration profile.
- The orchestrator should start from a clean session each time.
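These routing rules reduce to a prefix check plus a title parse. The sketch below is illustrative only: `route`, the queue names, and the `fresh_session` flag are hypothetical, and the regular expression assumes the canonical title format from the distress card contract.

```python
import re

# Canonical title: [BLOCKED] t_<source_task_id> <blocker_type>
BLOCKED_TITLE = re.compile(r"^\[BLOCKED\] t_(\S+) (\w+)$")


def route(card: dict, orchestrator: str = "tuesday") -> dict:
    """Send [BLOCKED] cards straight to the orchestrator; everything else to triage."""
    title = card["title"]
    if not title.startswith("[BLOCKED]"):
        return {"queue": "triage", "assignee": None, "fresh_session": False}
    match = BLOCKED_TITLE.match(title)
    return {
        "queue": "orchestrator",
        "assignee": orchestrator,
        "fresh_session": True,   # the orchestrator starts clean each time
        # Parsed fields are None when the title is malformed but still prefixed.
        "source_task": match.group(1) if match else None,
        "blocker_type": match.group(2) if match else None,
    }
```

A malformed but prefixed title still reaches the orchestrator in this sketch, on the assumption that a distress card with a bad title is safer escalated than left in triage.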
Rate-limit fallback
When the source task is rate-limited:
- do not keep retrying in the worker
- let the watcher synthesize the distress card
- have the orchestrator reassign the source task to a different profile/provider combo
Provider fallback principle
Never reassign rate-limited work back to the same provider if the failure was provider pressure. Use a different provider when possible.
Suggested fallback order
- Keep the current task body and scope guards intact.
- Reassign to a different profile on a different provider.
- If that is impossible, reassign to a different profile on the same provider only for non-rate-limit blockers.
- If repeated failures continue, split the task into a narrower atomic card.
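The fallback order can be expressed as a small selection function. This is a sketch under assumptions: profiles and providers are modeled as plain `(profile, provider)` string pairs, and `pick_reassignment` is a hypothetical name. Returning `None` stands for the last step, where the orchestrator should split the task instead.

```python
from typing import Optional


def pick_reassignment(current: tuple, candidates: list, blocker_type: str) -> Optional[tuple]:
    """Pick the next (profile, provider) pair, or None if the task should be split."""
    profile, provider = current
    # First choice: a different profile on a different provider.
    for cand_profile, cand_provider in candidates:
        if cand_profile != profile and cand_provider != provider:
            return (cand_profile, cand_provider)
    # Second choice: same provider, but only when the blocker was not rate limiting.
    if blocker_type != "rate_limited":
        for cand_profile, cand_provider in candidates:
            if cand_profile != profile and cand_provider == provider:
                return (cand_profile, cand_provider)
    return None  # nothing suitable: split into a narrower atomic card
```

The task body and scope guards are deliberately not inputs here: reassignment changes who runs the task, not what the task is.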
Related recovery docs
- Mission packet recovery contract: /opt/hermes/docs/mission-toolset-heartbeat.md
- Hermes mission implementation plan: /opt/hermes/docs/plans/mission-toolset-implementation.md
- The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.
- New-session trigger: when a profile config changes, start a fresh session or /reset so the updated toolset is actually loaded.
Watchers to implement
Auto-heal watcher
Responsibilities:
- reap stale workers
- reset dead-PID crash loops
- track reset counts
- escalate after repeated resets
Distress synthesizer watcher
Responsibilities:
- detect rate-limited / stuck workers
- create [BLOCKED] cards mechanically
- link the card to the source task
- leave a comment for traceability
Iteration-budget watcher
Responsibilities:
- detect long-running tasks and repeated failure patterns
- recommend splits when a task is clearly over-scoped
- report tasks that need human review after multiple resets
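One way to sketch the iteration-budget watcher's decision is a three-way classification per task. The field names (`iterations_used`, `iteration_budget`, `reset_count`) and both thresholds are assumptions chosen for illustration, not values from Hermes.

```python
def assess(task: dict, overrun_factor: float = 2.0, max_resets: int = 3) -> str:
    """Classify one task as 'ok', 'split', or 'human_review'."""
    # Repeated resets take priority: more retries will not help.
    if task.get("reset_count", 0) >= max_resets:
        return "human_review"
    budget = task.get("iteration_budget")
    used = task.get("iterations_used", 0)
    # A task far past its stated budget is probably over-scoped.
    if budget and used > budget * overrun_factor:
        return "split"
    return "ok"
```

The watcher only recommends; acting on a `split` or `human_review` result stays with the orchestrator or a person.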
Operational principle
If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.
This is what makes the system robust across compaction, rate limits, and dead workers.
Suggested implementation order
- Atomic task metadata in task bodies
- Worker-authored distress card protocol
- Mechanical distress synthesizer watcher
- Auto-heal watcher for dead workers
- Orchestrator routing rules for [BLOCKED]
- Rate-limit fallback / model reassignment table
Where this fits in Hermes
- Kanban = durable work graph and status engine
- Watchers = mechanical healing and distress synthesis
- Orchestrator = split / reassign / unblock decision-maker
- Workers = execution inside atomic task boundaries
Where this fits in Mosaic Stack
- PRD / coordination infra should encode the same patterns
- Mosaic can use the same distress-card contract and watcher logic
- The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue
Cross-project takeaway
The important pattern is not the specific tool names. It is the mechanical feedback loop:
- detect failure without requiring the failing worker to succeed
- create a standardized help artifact
- route that artifact to a fresh orchestrator context
- repair the assignment graph
- continue the mission
That pattern is reusable anywhere.