stack/docs/plans/2026-05-07-coordination-resilience.md

# Mosaic Stack ↔ Hermes Coordination Resilience

> Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.

## Summary

The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.

## SIBKISS operational summary

- mission on
- heartbeat always
- resume from packet
- block with `[BLOCKED]`
- reassign
- keep tasks tiny
- auto-heal dead workers

The design has four parts:

1. Atomic task decomposition — workers operate only within a small, explicit scope.
2. Distress signaling — workers create a standardized `[BLOCKED]` card when they encounter a blocker outside their scope.
3. Mechanical fallback — if the worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card for them.
4. Auto-heal / reassignment — stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.

## Why this exists

Observed failure modes:

- Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
- Silent failure / dead worker: the worker PID is gone, but the task is still marked `running` or `blocked`.
- Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.

The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.

## Core workflow

### 1) Atomic task boundaries

Every task should have:

- one concern
- explicit files/packages in scope
- explicit files/packages out of scope
- a maximum file count if possible
- a stated expected iteration budget

When a worker discovers work outside scope, it must stop fixing it and hand off.
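The boundary check above can be made mechanical. The following is a minimal sketch, assuming scope is expressed as glob patterns over file paths; `TaskScope`, `check_changes`, and every field name are hypothetical and not part of any existing Hermes or Mosaic API.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class TaskScope:
    """Hypothetical scope metadata for one atomic task."""
    concern: str
    in_scope: tuple[str, ...]            # glob patterns the worker may edit
    out_of_scope: tuple[str, ...] = ()   # glob patterns the worker must not edit
    max_files: int | None = None
    iteration_budget: int | None = None

    def allows(self, path: str) -> bool:
        # Out-of-scope patterns win over in-scope patterns.
        if any(fnmatch(path, pat) for pat in self.out_of_scope):
            return False
        return any(fnmatch(path, pat) for pat in self.in_scope)


def check_changes(scope: TaskScope, changed: list[str]) -> list[str]:
    """Return reasons to stop and hand off; an empty list means in bounds."""
    problems = [f"out of scope: {p}" for p in changed if not scope.allows(p)]
    if scope.max_files is not None and len(changed) > scope.max_files:
        problems.append(f"file count {len(changed)} exceeds max {scope.max_files}")
    return problems
```

A worker (or a pre-commit hook) could call `check_changes` with its list of touched files and open a distress card as soon as the result is non-empty.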

### 2) Worker-authored distress card

If the worker can still report status, it creates a card like:

- Title: `[BLOCKED] t_<source_task_id> <blocker_type>`
- Assignee: `tuesday` / orchestrator role
- Status: `ready`
- Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action
- Link: the source task stays linked to the distress card so the recovery trail is auditable

The orchestrator receives the card, acts on it, and closes the loop.

### 3) Mechanical fallback for rate-limited workers

If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.

That watcher should:

- inspect running / blocked tasks
- detect repeated 429 / 503 / overload errors
- create the same standardized `[BLOCKED]` card on behalf of the worker
- link the distress card to the source task
- add a comment to the source task
- allow the dispatcher to pick up the new card immediately

This closes the gap in the worker-authored protocol: a worker that cannot phone home still gets a distress card, because the watcher creates it mechanically.
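The watcher steps above might look roughly like the sketch below. It is illustrative only: `TaskRow`, its fields, the threshold of three errors, and the `already_carded` set used to avoid duplicate cards are all assumptions, and task IDs are assumed to be stored with their `t_` prefix.

```python
from dataclasses import dataclass

RATE_LIMIT_MARKERS = ("429", "503", "overload")  # error patterns listed above
REPEAT_THRESHOLD = 3  # assumed: how many matching errors count as "repeated"


@dataclass
class TaskRow:
    """Hypothetical shape of a task row; a real board will differ."""
    task_id: str
    status: str
    worker: str
    recent_errors: list[str]


def is_rate_limited(task: TaskRow, threshold: int = REPEAT_THRESHOLD) -> bool:
    # Crude substring match; a real watcher would parse structured metadata.
    hits = sum(
        1 for err in task.recent_errors
        if any(marker in err.lower() for marker in RATE_LIMIT_MARKERS)
    )
    return hits >= threshold


def synthesize_distress(tasks: list[TaskRow], already_carded: set[str]) -> list[dict]:
    """Build one [BLOCKED] card per rate-limited task that has no card yet."""
    cards = []
    for task in tasks:
        if task.status not in ("running", "blocked"):
            continue
        if task.task_id in already_carded or not is_rate_limited(task):
            continue
        cards.append({
            "title": f"[BLOCKED] {task.task_id} rate_limited",
            "assignee": "tuesday",
            "status": "ready",
            "source_task": task.task_id,  # link back for the audit trail
            "source_comment": f"Watcher opened a distress card for {task.worker}.",
        })
    return cards
```

Because the function only reads task rows and returns plain dictionaries, it can run from a cron job with no agent session and no model calls.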

### 4) Auto-heal for dead workers

A separate no-agent watcher should:

- reap dead PIDs stuck in `running`
- reset crash-loops whose failures are infrastructure-related
- escalate tasks that have been reset too many times

This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.
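One possible shape for the reap / reset / escalate decision is sketched below. The task dictionary keys, the `MAX_RESETS` threshold, and the choice to mark escalated tasks `blocked` are assumptions. The liveness probe relies on POSIX semantics for signal 0 and should not be used on Windows, where `os.kill(pid, 0)` behaves differently. The sketch also omits the check that a crash-loop is infrastructure-related.

```python
import os

MAX_RESETS = 3  # assumed escalation threshold


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def heal(task: dict, alive=pid_alive, max_resets: int = MAX_RESETS) -> str:
    """Mutate the task in place and return 'ok', 'reset', or 'escalate'."""
    if task["status"] != "running" or alive(task["pid"]):
        return "ok"
    if task["reset_count"] >= max_resets:
        # Too many resets: stop recycling the task and hand it upward.
        task["status"] = "blocked"
        return "escalate"
    # Dead PID stuck in `running`: return the task to the queue.
    task["status"] = "ready"
    task["pid"] = None
    task["reset_count"] += 1
    return "reset"
```

Passing the liveness check in as the `alive` parameter keeps the decision logic testable without spawning real processes.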

## Distress card contract

### Canonical title

```text
[BLOCKED] t_<source_task_id> <blocker_type>
```

### Canonical blocker types

- `scope_boundary`
- `env_blocker`
- `credential_failure`
- `dependency`
- `iteration_budget`
- `rate_limited`
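Encoding the title and blocker types in one place keeps workers and watchers from drifting apart. This is a small sketch, not an existing API: `BlockerType`, `format_title`, and `parse_title` are hypothetical names, and task IDs are assumed to already carry the `t_` prefix.

```python
from __future__ import annotations

import re
from enum import Enum


class BlockerType(str, Enum):
    SCOPE_BOUNDARY = "scope_boundary"
    ENV_BLOCKER = "env_blocker"
    CREDENTIAL_FAILURE = "credential_failure"
    DEPENDENCY = "dependency"
    ITERATION_BUDGET = "iteration_budget"
    RATE_LIMITED = "rate_limited"


TITLE_RE = re.compile(r"^\[BLOCKED\] (t_\S+) (\w+)$")


def format_title(source_task_id: str, blocker: BlockerType) -> str:
    # Assumes source_task_id looks like "t_42".
    return f"[BLOCKED] {source_task_id} {blocker.value}"


def parse_title(title: str) -> tuple[str, BlockerType] | None:
    """Return (source_task_id, blocker) or None if the title is not canonical."""
    match = TITLE_RE.match(title)
    if match is None:
        return None
    try:
        return match.group(1), BlockerType(match.group(2))
    except ValueError:
        return None  # unknown blocker type
```

Rejecting unknown blocker types at parse time means a malformed card falls back to normal triage instead of reaching the orchestrator with ambiguous intent.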

### Canonical body

```markdown
## Distress Signal
- Blocked task: t_xxx
- Worker: <profile_name>
- Branch: <git_branch_name>
- Workspace: <path>
- Blocker type: <type>
- Completed: <what was done>
- Cannot touch: <out-of-scope packages/files>
- Needs: <what the orchestrator should do>
- State: committed | uncommitted | stashed(<stash_name>)

## Scope Guard
DO NOT touch: anything outside diagnosing and remediating the blocker described above
Only fix: assign, split, reassign, or unblock the source task
```
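Workers and watchers can share one renderer so that every card body matches the template above. The sketch below assumes a hypothetical `Distress` record; only the rendered field labels come from the template.

```python
from dataclasses import dataclass


@dataclass
class Distress:
    """Hypothetical record holding the template fields."""
    task_id: str
    worker: str
    branch: str
    workspace: str
    blocker_type: str
    completed: str
    cannot_touch: str
    needs: str
    state: str  # "committed", "uncommitted", or "stashed(<stash_name>)"


def render_body(d: Distress) -> str:
    """Render the canonical distress card body as markdown."""
    lines = [
        "## Distress Signal",
        f"- Blocked task: {d.task_id}",
        f"- Worker: {d.worker}",
        f"- Branch: {d.branch}",
        f"- Workspace: {d.workspace}",
        f"- Blocker type: {d.blocker_type}",
        f"- Completed: {d.completed}",
        f"- Cannot touch: {d.cannot_touch}",
        f"- Needs: {d.needs}",
        f"- State: {d.state}",
        "",
        "## Scope Guard",
        "DO NOT touch: anything outside diagnosing and remediating the blocker described above",
        "Only fix: assign, split, reassign, or unblock the source task",
    ]
    return "\n".join(lines)
```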

## Routing rules

### Distress card routing

- `[BLOCKED]` title prefix should bypass normal triage.
- The card should go directly to the orchestration profile.
- The orchestrator should start from a clean session each time.
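These three rules reduce to a small routing function. The sketch below is an assumption about how a dispatcher could express them; the `skip_triage` and `fresh_session` flags and the `"triage"` fallback queue are hypothetical names.

```python
BLOCKED_PREFIX = "[BLOCKED]"
ORCHESTRATOR = "tuesday"  # orchestrator profile named in the distress card fields


def route(card: dict) -> dict:
    """Decide where a ready card goes and how its session should start."""
    if card["title"].startswith(BLOCKED_PREFIX):
        # Distress cards bypass triage and always get a clean orchestrator session.
        return {"assignee": ORCHESTRATOR, "skip_triage": True, "fresh_session": True}
    return {
        "assignee": card.get("assignee") or "triage",
        "skip_triage": False,
        "fresh_session": False,
    }
```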

### Rate-limit fallback

When the source task is rate-limited:

- do not keep retrying in the worker
- let the watcher synthesize the distress card
- have the orchestrator reassign the source task to a different profile/provider combo

### Provider fallback principle

If the failure was caused by provider pressure, do not reassign the work to the same provider. Pick a different provider whenever one is available.

### Suggested fallback order

1. Keep the current task body and scope guards intact.
2. Reassign to a different profile on a different provider.
3. If that is impossible, reassign to a different profile on the same provider only for non-rate-limit blockers.
4. If repeated failures continue, split the task into a narrower atomic card.
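The fallback order can be sketched as a selection function. `Profile` and `pick_reassignment` are hypothetical; returning `None` is used here to signal that the orchestrator should fall through to step 4 and split the task. Step 1 is implicit because the function never touches the task body.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    provider: str


def pick_reassignment(
    current: Profile, blocker_type: str, candidates: list[Profile]
) -> Profile | None:
    """Choose the next profile for a blocked task, or None if it should be split."""
    others = [p for p in candidates if p.name != current.name]
    # Step 2: prefer a different profile on a different provider.
    for profile in others:
        if profile.provider != current.provider:
            return profile
    # Step 3: the same provider is acceptable only for non-rate-limit blockers.
    if blocker_type != "rate_limited" and others:
        return others[0]
    # Step 4: nothing suitable; the caller should split the task instead.
    return None
```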

## Related recovery docs

- Mission packet recovery contract: `/opt/hermes/docs/mission-toolset-heartbeat.md`
- Hermes mission implementation plan: `/opt/hermes/docs/plans/mission-toolset-implementation.md`
- The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.
- New-session trigger: when a profile config changes, start a fresh session or `/reset` so the updated toolset is actually loaded.

## Watchers to implement

### Auto-heal watcher

Responsibilities:

- reap stale workers
- reset dead-PID crash loops
- track reset counts
- escalate after repeated resets

### Distress synthesizer watcher

Responsibilities:

- detect rate-limited / stuck workers
- create `[BLOCKED]` cards mechanically
- link the card to the source task
- leave a comment for traceability

### Iteration-budget watcher

Responsibilities:

- detect long-running tasks and repeated failure patterns
- recommend splits when a task is clearly over-scoped
- report tasks that need human review after multiple resets
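A rough sketch of the budget check follows. The task keys, the 1.5x overrun factor, and the reset threshold are assumptions chosen for illustration, not values from the Hermes implementation.

```python
def review_budget(task: dict, overrun_factor: float = 1.5, max_resets: int = 3) -> str:
    """Return 'human_review', 'split', or 'ok' for one task."""
    # Repeated resets take priority: a human should look before more retries.
    if task["reset_count"] >= max_resets:
        return "human_review"
    # Well past the stated budget suggests the task was over-scoped.
    if task["iterations_used"] > task["iteration_budget"] * overrun_factor:
        return "split"
    return "ok"
```

The watcher only recommends an action; the orchestrator remains the component that actually splits or reassigns the task.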

## Operational principle

If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.

This is what makes the system robust across compaction, rate limits, and dead workers.

## Suggested implementation order

1. Atomic task metadata in task bodies
2. Worker-authored distress card protocol
3. Mechanical distress synthesizer watcher
4. Auto-heal watcher for dead workers
5. Orchestrator routing rules for `[BLOCKED]`
6. Rate-limit fallback / model reassignment table

## Where this fits in Hermes

- Kanban = durable work graph and status engine
- Watchers = mechanical healing and distress synthesis
- Orchestrator = split / reassign / unblock decision-maker
- Workers = execution inside atomic task boundaries

## Where this fits in Mosaic Stack

- PRD / coordination infra should encode the same patterns
- Mosaic can use the same distress-card contract and watcher logic
- The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue

## Cross-project takeaway

The important pattern is not the specific tool names. It is the mechanical feedback loop:

- detect failure without requiring the failing worker to succeed
- create a standardized help artifact
- route that artifact to a fresh orchestrator context
- repair the assignment graph
- continue the mission

That pattern is reusable anywhere.