stack/docs/plans/2026-05-07-coordination-resilience.md

Mosaic Stack ↔ Hermes Coordination Resilience

Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.

Summary

The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.

SIBKISS operational summary

  • mission on
  • heartbeat always
  • resume from packet
  • block with [BLOCKED]
  • reassign
  • keep tasks tiny
  • auto-heal dead workers

The design has four parts:

  1. Atomic task decomposition — workers operate only within a small, explicit scope.
  2. Distress signaling — workers create a standardized [BLOCKED] card when they encounter a blocker outside their scope.
  3. Mechanical fallback — if the worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card for them.
  4. Auto-heal / reassignment — stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.

Why this exists

Observed failure modes:

  • Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
  • Silent failure / dead worker: the worker PID is gone, but the task is still marked running or blocked.
  • Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.

The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.

Core workflow

1) Atomic task boundaries

Every task should have:

  • one concern
  • explicit files/packages in scope
  • explicit files/packages out of scope
  • a maximum file count if possible
  • a stated expected iteration budget

When a worker discovers work outside scope, it must stop fixing it and hand off.
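
The boundary fields above can be carried as structured metadata and checked mechanically. The following is a minimal sketch, not an existing Hermes or Mosaic Stack API: the `TaskScope` class, its field names, and the use of glob patterns are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class TaskScope:
    """Hypothetical atomic-task metadata; every field name is an assumption."""
    concern: str                          # the single concern of the task
    in_scope: tuple[str, ...]             # glob patterns the worker may touch
    out_of_scope: tuple[str, ...]         # glob patterns the worker must not touch
    max_files: Optional[int] = None       # optional cap on files changed
    iteration_budget: Optional[int] = None

    def allows(self, path: str) -> bool:
        # Out-of-scope patterns take priority over in-scope patterns.
        if any(fnmatchcase(path, pat) for pat in self.out_of_scope):
            return False
        return any(fnmatchcase(path, pat) for pat in self.in_scope)

    def violations(self, changed: list[str]) -> list[str]:
        """Return the reasons a change set breaks the scope (empty if none)."""
        problems = [f"out of scope: {p}" for p in changed if not self.allows(p)]
        if self.max_files is not None and len(changed) > self.max_files:
            problems.append(f"too many files: {len(changed)} > {self.max_files}")
        return problems
```

A worker or pre-commit hook could call `violations()` on its change set and, if the list is non-empty, stop and hand off instead of continuing.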

2) Worker-authored distress card

If the worker can still report status, it creates a card like:

  • Title: [BLOCKED] t_<source_id> <blocker_type>
  • Assignee: tuesday / orchestrator role
  • Status: ready
  • Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action

The orchestrator receives the card, acts on it, and closes the loop. The source task stays linked to the distress card so the recovery trail is auditable.

3) Mechanical fallback for rate-limited workers

If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.

That watcher should:

  • inspect running / blocked tasks
  • detect repeated 429 / 503 / overload errors
  • create the same standardized [BLOCKED] card on behalf of the worker
  • link the distress card to the source task
  • add a comment to the source task
  • allow the dispatcher to pick up the new card immediately

This is the key fix for the phone-home problem: a worker that cannot report its own failure no longer blocks recovery, because the watcher reports it mechanically.
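
The watcher steps listed above could be sketched as a pure function over task rows. This is an illustration under stated assumptions: the `TaskRow` schema, the `threshold` of three consecutive errors, and the card dictionary shape are hypothetical, and "overload" errors would need a platform-specific mapping that is not shown.

```python
from dataclasses import dataclass, field

# Assumed set of status codes treated as provider pressure.
RATE_LIMIT_CODES = {429, 503}


@dataclass
class TaskRow:
    """Hypothetical task row; a real board would supply its own schema."""
    task_id: str
    status: str                                              # e.g. "running", "blocked"
    recent_errors: list[int] = field(default_factory=list)   # newest last
    comments: list[str] = field(default_factory=list)


def synthesize_distress(tasks, existing_titles, threshold=3):
    """Return new card dicts for tasks whose last `threshold` errors were all rate limits."""
    cards = []
    for task in tasks:
        if task.status not in ("running", "blocked"):
            continue
        tail = task.recent_errors[-threshold:]
        if len(tail) < threshold or not all(code in RATE_LIMIT_CODES for code in tail):
            continue
        title = f"[BLOCKED] t_{task.task_id} rate_limited"
        if title in existing_titles:
            continue  # idempotent: at most one card per source task
        # Link the card to its source task and leave a trace on the source.
        cards.append({"title": title, "status": "ready", "source_task": task.task_id})
        task.comments.append(f"watcher: synthesized distress card '{title}'")
    return cards
```

Because the function only reads task rows and failure metadata, it can run from a cron-style job without starting an agent session.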

4) Auto-heal for dead workers

A separate no-agent watcher should:

  • reap dead PIDs stuck in running
  • reset crash-loops whose failures are infrastructure-related
  • escalate tasks that have been reset too many times

This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.
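
A minimal sketch of the reap / reset / escalate decision follows. The task dictionary keys, the `MAX_RESETS` threshold, and the status names are assumptions; the liveness probe is POSIX-only, and the check that a failure is infrastructure-related is deliberately left out.

```python
import os

MAX_RESETS = 3  # assumed escalation threshold; tune per deployment


def pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks that a process exists without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def heal(task: dict, is_alive=pid_alive) -> str:
    """Mutate one hypothetical task dict and return the action taken."""
    if task["status"] != "running" or is_alive(task["pid"]):
        return "ok"
    task["pid"] = None                      # reap the dead worker
    if task.get("resets", 0) >= MAX_RESETS:
        task["status"] = "blocked"          # stop the crash loop and escalate
        return "escalated"
    task["resets"] = task.get("resets", 0) + 1
    task["status"] = "ready"                # let the dispatcher pick it up again
    return "reset"
```

Passing `is_alive` as a parameter keeps the decision logic testable without real processes.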

Distress card contract

Canonical title

[BLOCKED] t_<source_task_id> <blocker_type>

Canonical blocker types

  • scope_boundary
  • env_blocker
  • credential_failure
  • dependency
  • iteration_budget
  • rate_limited

Canonical body

## Distress Signal
- Blocked task: t_xxx
- Worker: <profile_name>
- Branch: <git_branch_name>
- Workspace: <path>
- Blocker type: <type>
- Completed: <what was done>
- Cannot touch: <out-of-scope packages/files>
- Needs: <what the orchestrator should do>
- State: committed | uncommitted | stashed(<stash_name>)

## Scope Guard
DO NOT touch: anything outside diagnosing and remediating the blocker described above
Only fix: assign, split, reassign, or unblock the source task
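
Since workers and watchers must both emit the same card, one shared renderer helps keep the contract consistent. The sketch below follows the title, blocker types, and body given above; the `Distress` class and `render_card` function are hypothetical names, not part of any existing toolset.

```python
from dataclasses import dataclass

BLOCKER_TYPES = frozenset({
    "scope_boundary", "env_blocker", "credential_failure",
    "dependency", "iteration_budget", "rate_limited",
})


@dataclass(frozen=True)
class Distress:
    """Hypothetical container for the fields of the canonical body."""
    task_id: str        # source task id without the t_ prefix
    worker: str
    branch: str
    workspace: str
    blocker_type: str
    completed: str
    cannot_touch: str
    needs: str
    state: str          # committed | uncommitted | stashed(<stash_name>)


def render_card(d: Distress) -> tuple[str, str]:
    """Return (title, body) following the contract above."""
    if d.blocker_type not in BLOCKER_TYPES:
        raise ValueError(f"unknown blocker type: {d.blocker_type}")
    title = f"[BLOCKED] t_{d.task_id} {d.blocker_type}"
    body = "\n".join([
        "## Distress Signal",
        f"- Blocked task: t_{d.task_id}",
        f"- Worker: {d.worker}",
        f"- Branch: {d.branch}",
        f"- Workspace: {d.workspace}",
        f"- Blocker type: {d.blocker_type}",
        f"- Completed: {d.completed}",
        f"- Cannot touch: {d.cannot_touch}",
        f"- Needs: {d.needs}",
        f"- State: {d.state}",
        "",
        "## Scope Guard",
        "DO NOT touch: anything outside diagnosing and remediating the blocker described above",
        "Only fix: assign, split, reassign, or unblock the source task",
    ])
    return title, body
```

Rejecting unknown blocker types at render time keeps free-form values out of the title, which routing depends on.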

Routing rules

Distress card routing

  • [BLOCKED] title prefix should bypass normal triage.
  • The card should go directly to the orchestration profile.
  • The orchestrator should start from a clean session each time.
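
The routing rules above amount to a prefix match on the card title. A minimal sketch, assuming a hypothetical `route` function, a default triage queue named "triage", and an orchestration profile passed in by name:

```python
import re

# Matches the canonical title: [BLOCKED] t_<source_task_id> <blocker_type>
BLOCKED_TITLE = re.compile(r"^\[BLOCKED\] t_(?P<task_id>\S+) (?P<blocker_type>\S+)$")


def route(title: str, orchestrator: str = "orchestrator") -> dict:
    """Decide where a card goes; the returned keys are illustrative, not a real API."""
    match = BLOCKED_TITLE.match(title)
    if match is None:
        # Ordinary cards go through normal triage.
        return {"queue": "triage", "assignee": None, "fresh_session": False}
    return {
        "queue": "ready",                 # bypass triage
        "assignee": orchestrator,         # go directly to the orchestration profile
        "fresh_session": True,            # the orchestrator starts from a clean session
        "source_task": match["task_id"],
        "blocker_type": match["blocker_type"],
    }
```

For example, `route("[BLOCKED] t_42 rate_limited")` yields an assignment to the orchestrator with `source_task` set to `"42"`, while any other title falls through to triage.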

Rate-limit fallback

When the source task is rate-limited:

  • do not keep retrying in the worker
  • let the watcher synthesize the distress card
  • have the orchestrator reassign the source task to a different profile/provider combo

Provider fallback principle

Never reassign rate-limited work back to the same provider if the failure was provider pressure. Use a different provider when possible.

Suggested fallback order

  1. Keep the current task body and scope guards intact.
  2. Reassign to a different profile on a different provider.
  3. If no other provider is available, reassign to a different profile on the same provider, but only for non-rate-limit blockers.
  4. If repeated failures continue, split the task into a narrower atomic card.
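
The fallback order above can be sketched as a small decision function. The `(profile, provider)` tuples, the `max_failures` threshold, and the choice to recommend a split when rate-limited work has no alternative provider are assumptions for illustration. The task body and scope guards are never modified here, which corresponds to step 1.

```python
def pick_reassignment(current, candidates, blocker_type, failures, max_failures=2):
    """Return ("reassign", (profile, provider)) or ("split", None).

    `current` and each entry of `candidates` are hypothetical
    (profile, provider) tuples.
    """
    if failures >= max_failures:
        return ("split", None)            # step 4: repeated failures, narrow the task
    cur_profile, cur_provider = current
    # Step 2: prefer a different profile on a different provider.
    for profile, provider in candidates:
        if profile != cur_profile and provider != cur_provider:
            return ("reassign", (profile, provider))
    # Step 3: same provider is acceptable only for non-rate-limit blockers.
    if blocker_type != "rate_limited":
        for profile, provider in candidates:
            if profile != cur_profile and provider == cur_provider:
                return ("reassign", (profile, provider))
    # Assumption: with nowhere valid to go, recommend a split.
    return ("split", None)
```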

Related Hermes references

  • Mission packet recovery contract: /opt/hermes/docs/mission-toolset-heartbeat.md
  • Hermes mission implementation plan: /opt/hermes/docs/plans/mission-toolset-implementation.md
  • The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.
  • New-session trigger: when a profile config changes, start a fresh session or run /reset so the updated toolset is actually loaded.

Watchers to implement

Auto-heal watcher

Responsibilities:

  • reap stale workers
  • reset dead-PID crash loops
  • track reset counts
  • escalate after repeated resets

Distress synthesizer watcher

Responsibilities:

  • detect rate-limited / stuck workers
  • create [BLOCKED] cards mechanically
  • link the card to the source task
  • leave a comment for traceability

Iteration-budget watcher

Responsibilities:

  • detect long-running tasks and repeated failure patterns
  • recommend splits when a task is clearly over-scoped
  • report tasks that need human review after multiple resets
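
A minimal sketch of the iteration-budget check, assuming hypothetical task keys (`iteration_budget`, `iterations_used`, `resets`) and illustrative thresholds; the recommendation labels are not part of any existing contract.

```python
def review_budget(task: dict, overrun_factor: float = 1.5, max_resets: int = 3) -> list[str]:
    """Return recommendations for one hypothetical task dict."""
    recommendations = []
    budget = task.get("iteration_budget")
    used = task.get("iterations_used", 0)
    # A task well past its stated budget is probably over-scoped.
    if budget and used > budget * overrun_factor:
        recommendations.append("split")
    # Multiple resets suggest the problem needs a human to look at it.
    if task.get("resets", 0) >= max_resets:
        recommendations.append("human_review")
    return recommendations
```

The function only recommends; acting on a split or review request stays with the orchestrator.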

Operational principle

If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.

This is what makes the system robust across compaction, rate limits, and dead workers.

Suggested implementation order

  1. Atomic task metadata in task bodies
  2. Worker-authored distress card protocol
  3. Mechanical distress synthesizer watcher
  4. Auto-heal watcher for dead workers
  5. Orchestrator routing rules for [BLOCKED]
  6. Rate-limit fallback / model reassignment table

Where this fits in Hermes

  • Kanban = durable work graph and status engine
  • Watchers = mechanical healing and distress synthesis
  • Orchestrator = split / reassign / unblock decision-maker
  • Workers = execution inside atomic task boundaries

Where this fits in Mosaic Stack

  • PRD / coordination infra should encode the same patterns
  • Mosaic can use the same distress-card contract and watcher logic
  • The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue

Cross-project takeaway

The important pattern is not the specific tool names. It is the mechanical feedback loop:

  • detect failure without requiring the failing worker to succeed
  • create a standardized help artifact
  • route that artifact to a fresh orchestrator context
  • repair the assignment graph
  • continue the mission

That pattern is reusable anywhere.