# Mosaic Stack ↔ Hermes Coordination Resilience

> Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.

## Summary

The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.

## SIBKISS operational summary

- mission on
- heartbeat always
- resume from packet
- block with `[BLOCKED]`
- reassign
- keep tasks tiny
- auto-heal dead workers

The design has four parts:

1. Atomic task decomposition: workers operate only within a small, explicit scope.
2. Distress signaling: workers create a standardized `[BLOCKED]` card when they encounter a blocker outside their scope.
3. Mechanical fallback: if the worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card on its behalf.
4. Auto-heal / reassignment: stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.

## Why this exists

Observed failure modes:

- Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
- Silent failure / dead worker: the worker PID is gone, but the task remains in `running` or `blocked`.
- Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.

The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.

## Core workflow

### 1) Atomic task boundaries

Every task should have:

- one concern
- explicit files/packages in scope
- explicit files/packages out of scope
- a maximum file count if possible
- a stated expected iteration budget

When a worker discovers work outside its scope, it must stop and hand that work off instead of fixing it.

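The task metadata above could be encoded roughly as follows. This is a minimal sketch, not the Hermes schema: the `TaskScope` class, its field names, the glob-pattern convention, and the default budget of 20 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class TaskScope:
    """Hypothetical scope metadata attached to one atomic task."""
    concern: str
    in_scope: tuple[str, ...]        # glob patterns the worker may edit
    out_of_scope: tuple[str, ...]    # glob patterns the worker must never edit
    max_files: Optional[int] = None  # optional cap on touched files
    iteration_budget: int = 20       # expected iterations (illustrative default)

    def allows(self, path: str) -> bool:
        """A path is allowed only if it matches in_scope and no out_of_scope pattern."""
        if any(fnmatchcase(path, pat) for pat in self.out_of_scope):
            return False
        return any(fnmatchcase(path, pat) for pat in self.in_scope)

    def violations(self, touched: list[str]) -> list[str]:
        """Return human-readable reasons the touched file set breaks the scope."""
        problems = [f"out of scope: {p}" for p in touched if not self.allows(p)]
        if self.max_files is not None and len(touched) > self.max_files:
            problems.append(f"touched {len(touched)} files, cap is {self.max_files}")
        return problems
```

A worker (or a pre-commit style check) could call `violations()` on its touched-file list and open a `scope_boundary` distress card whenever the result is non-empty.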
### 2) Worker-authored distress card

If the worker can still report status, it creates a card like:

- Title: `[BLOCKED] t_<source_id> <blocker_type>`
- Assignee: `tuesday` / orchestrator role
- Status: `ready`
- Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action
- Link: the source task stays linked to the distress card so the recovery trail is auditable

The orchestrator receives the card, acts on it, and closes the loop.

### 3) Mechanical fallback for rate-limited workers

If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.

That watcher should:

- inspect running / blocked tasks
- detect repeated 429 / 503 / overload errors
- create the same standardized `[BLOCKED]` card on behalf of the worker
- link the distress card to the source task
- add a comment to the source task
- allow the dispatcher to pick up the new card immediately

This closes the gap in the worker-authored protocol: the worker does not need to be able to phone home if the watcher can do it mechanically.

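One way to sketch the synthesizer's detection and card-building step is shown below. The `TaskRow` shape, the threshold of three matching errors, and the dictionary returned for each card are hypothetical; the sketch also assumes failure metadata is available as plain error strings. Only the title format, assignee, and status come from the protocol above.

```python
import re
from dataclasses import dataclass

# Assumption: provider pressure shows up as 429 / 503 / "overload" in error text.
RATE_LIMIT_PATTERN = re.compile(r"\b(429|503)\b|overload", re.IGNORECASE)


@dataclass
class TaskRow:
    """Hypothetical task row; field names are illustrative."""
    source_id: str
    status: str                # e.g. "running" or "blocked"
    recent_errors: list[str]   # newest last


def is_rate_limited(task: TaskRow, threshold: int = 3) -> bool:
    """True when at least `threshold` recent errors look like provider pressure."""
    if task.status not in {"running", "blocked"}:
        return False
    hits = sum(1 for err in task.recent_errors if RATE_LIMIT_PATTERN.search(err))
    return hits >= threshold


def synthesize_cards(tasks: list[TaskRow], existing_titles: set[str]) -> list[dict]:
    """Build one [BLOCKED] card per rate-limited task, skipping duplicates."""
    cards = []
    for task in tasks:
        title = f"[BLOCKED] t_{task.source_id} rate_limited"
        if not is_rate_limited(task) or title in existing_titles:
            continue
        cards.append({
            "title": title,
            "assignee": "tuesday",
            "status": "ready",
            "source_task": task.source_id,  # link back for the audit trail
            "comment_on_source": f"Watcher opened distress card: {title}",
        })
    return cards
```

Checking `existing_titles` keeps the watcher idempotent, so running it on a cron-style schedule does not flood the board with repeat cards for the same source task.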
### 4) Auto-heal for dead workers

A separate no-agent watcher should:

- reap dead PIDs stuck in `running`
- reset crash-loops whose failures are infrastructure-related
- escalate tasks that have been reset too many times

This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.

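A minimal sketch of the reap / reset / escalate decision follows. It assumes a POSIX host (signal 0 is used as a liveness probe) and treats a dead PID as an infrastructure failure. The `RunningTask` fields, the `max_resets` default, and the choice of `ready` and `blocked` as target statuses are illustrative assumptions.

```python
import os
from dataclasses import dataclass


@dataclass
class RunningTask:
    """Hypothetical task row; field names are illustrative."""
    task_id: str
    status: str
    pid: int
    reset_count: int = 0


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def heal(task: RunningTask, max_resets: int = 3, alive=pid_alive) -> str:
    """Return the action taken: 'ok', 'reset', or 'escalated'."""
    if task.status != "running" or alive(task.pid):
        return "ok"
    if task.reset_count >= max_resets:
        task.status = "blocked"  # stop the loop and surface it for review
        return "escalated"
    task.reset_count += 1
    task.status = "ready"        # let the dispatcher pick it up again
    return "reset"
```

Passing the liveness check in as the `alive` parameter keeps the decision logic testable without spawning real processes.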
## Distress card contract

### Canonical title

```text
[BLOCKED] t_<source_task_id> <blocker_type>
```

### Canonical blocker types

- `scope_boundary`
- `env_blocker`
- `credential_failure`
- `dependency`
- `iteration_budget`
- `rate_limited`

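The title format and blocker types can be validated mechanically. The sketch below is one possible encoding: the `BlockerType` enum mirrors the list above, while the regular expression (in particular, treating the task id as any run of non-whitespace characters) and the helper names are assumptions.

```python
import re
from enum import Enum
from typing import Optional


class BlockerType(str, Enum):
    SCOPE_BOUNDARY = "scope_boundary"
    ENV_BLOCKER = "env_blocker"
    CREDENTIAL_FAILURE = "credential_failure"
    DEPENDENCY = "dependency"
    ITERATION_BUDGET = "iteration_budget"
    RATE_LIMITED = "rate_limited"


# Assumption: the task id contains no whitespace.
TITLE_RE = re.compile(r"^\[BLOCKED\] t_(?P<source_id>\S+) (?P<blocker>[a-z_]+)$")


def format_title(source_id: str, blocker: BlockerType) -> str:
    """Build the canonical distress-card title."""
    return f"[BLOCKED] t_{source_id} {blocker.value}"


def parse_title(title: str) -> Optional[tuple[str, BlockerType]]:
    """Return (source_id, blocker) for a canonical title, else None."""
    match = TITLE_RE.match(title)
    if match is None:
        return None
    try:
        blocker = BlockerType(match["blocker"])
    except ValueError:
        return None  # unknown blocker type: treat the title as non-canonical
    return match["source_id"], blocker
```

Rejecting unknown blocker types at parse time keeps worker-authored and watcher-authored cards interchangeable for the orchestrator.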
### Canonical body

```markdown
## Distress Signal
- Blocked task: t_xxx
- Worker: <profile_name>
- Branch: <git_branch_name>
- Workspace: <path>
- Blocker type: <type>
- Completed: <what was done>
- Cannot touch: <out-of-scope packages/files>
- Needs: <what the orchestrator should do>
- State: committed | uncommitted | stashed(<stash_name>)

## Scope Guard
DO NOT touch: anything outside diagnosing and remediating the blocker described above
Only fix: assign, split, reassign, or unblock the source task
```

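Because workers and watchers must emit the same body, it helps to render it from one function. The following is a sketch only: the `Distress` dataclass and `render_body` helper are hypothetical names, and the field layout simply follows the template above.

```python
from dataclasses import dataclass


@dataclass
class Distress:
    """Hypothetical container for the fields of the canonical body."""
    task_id: str
    worker: str
    branch: str
    workspace: str
    blocker_type: str
    completed: str
    cannot_touch: str
    needs: str
    state: str  # "committed", "uncommitted", or "stashed(<stash_name>)"


def render_body(d: Distress) -> str:
    """Render the canonical distress-card body as Markdown."""
    lines = [
        "## Distress Signal",
        f"- Blocked task: t_{d.task_id}",
        f"- Worker: {d.worker}",
        f"- Branch: {d.branch}",
        f"- Workspace: {d.workspace}",
        f"- Blocker type: {d.blocker_type}",
        f"- Completed: {d.completed}",
        f"- Cannot touch: {d.cannot_touch}",
        f"- Needs: {d.needs}",
        f"- State: {d.state}",
        "",
        "## Scope Guard",
        "DO NOT touch: anything outside diagnosing and remediating the blocker described above",
        "Only fix: assign, split, reassign, or unblock the source task",
    ]
    return "\n".join(lines)
```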
## Routing rules

### Distress card routing

- `[BLOCKED]` title prefix should bypass normal triage.
- The card should go directly to the orchestration profile.
- The orchestrator should start from a clean session each time.

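The three routing rules reduce to a prefix check. In this sketch the queue names and the shape of the returned dictionary are assumptions; `tuesday` is the orchestrator assignee used elsewhere in this document.

```python
BLOCKED_PREFIX = "[BLOCKED]"
ORCHESTRATOR_PROFILE = "tuesday"


def route(card_title: str) -> dict:
    """Decide where a new card goes based only on its title prefix."""
    if card_title.startswith(BLOCKED_PREFIX):
        return {
            "queue": "orchestrator",       # bypasses normal triage
            "assignee": ORCHESTRATOR_PROFILE,
            "fresh_session": True,         # orchestrator starts from a clean context
        }
    return {"queue": "triage", "assignee": None, "fresh_session": False}
```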
### Rate-limit fallback

When the source task is rate-limited:

- do not keep retrying in the worker
- let the watcher synthesize the distress card
- have the orchestrator reassign the source task to a different profile/provider combo

### Provider fallback principle

Never reassign rate-limited work back to the same provider if the failure was provider pressure. Use a different provider when possible.

### Suggested fallback order

1. Keep the current task body and scope guards intact.
2. Reassign to a different profile on a different provider.
3. If that is impossible, reassign to a different profile on the same provider only for non-rate-limit blockers.
4. If repeated failures continue, split the task into a narrower atomic card.

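The fallback order can be expressed as a small selection function. This is a sketch under assumptions: the `Profile` type and the profile and provider names in the usage note are invented, and returning `None` is used here to signal that no suitable profile exists and the caller should consider splitting the task.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical worker profile bound to one provider."""
    name: str
    provider: str


def pick_fallback(
    current: Profile, candidates: list[Profile], blocker_type: str
) -> Optional[Profile]:
    """Pick a reassignment target; None means no suitable profile was found."""
    others = [p for p in candidates if p.name != current.name]
    # Prefer a different profile on a different provider.
    for p in others:
        if p.provider != current.provider:
            return p
    # Same provider is acceptable only when the blocker is not provider pressure.
    if blocker_type != "rate_limited":
        for p in others:
            if p.provider == current.provider:
                return p
    return None
```

For example, with `worker-a` and `worker-b` both on `provider-x`, a `rate_limited` blocker yields `None`, while an `env_blocker` yields `worker-b`. The task body and scope guards are untouched in either case, since the function only chooses an assignee.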
## Related recovery docs

- Mission packet recovery contract: `/opt/hermes/docs/mission-toolset-heartbeat.md`
- Hermes mission implementation plan: `/opt/hermes/docs/plans/mission-toolset-implementation.md`
- The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.
- New-session trigger: when a profile config changes, start a fresh session or `/reset` so the updated toolset is actually loaded.

## Watchers to implement

### Auto-heal watcher

Responsibilities:

- reap stale workers
- reset dead-PID crash loops
- track reset counts
- escalate after repeated resets

### Distress synthesizer watcher

Responsibilities:

- detect rate-limited / stuck workers
- create `[BLOCKED]` cards mechanically
- link the card to the source task
- leave a comment for traceability

### Iteration-budget watcher

Responsibilities:

- detect long-running tasks and repeated failure patterns
- recommend splits when a task is clearly over-scoped
- report tasks that need human review after multiple resets

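The iteration-budget watcher's decision could look roughly like the sketch below. The `TaskStats` fields, the 1.5x overrun factor, the reset threshold, and the returned labels are all illustrative assumptions, not values taken from Hermes.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Hypothetical per-task counters collected by the watcher."""
    task_id: str
    iterations_used: int
    iteration_budget: int
    reset_count: int


def review(stats: TaskStats, overrun_factor: float = 1.5, max_resets: int = 3) -> str:
    """Classify a task as 'ok', 'recommend_split', or 'needs_human_review'."""
    if stats.reset_count >= max_resets:
        return "needs_human_review"  # repeated resets outrank budget overruns
    if stats.iterations_used > stats.iteration_budget * overrun_factor:
        return "recommend_split"     # clearly over-scoped for its stated budget
    return "ok"
```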
## Operational principle

If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.

This is what makes the system robust across compaction, rate limits, and dead workers.

## Suggested implementation order

1. Atomic task metadata in task bodies
2. Worker-authored distress card protocol
3. Mechanical distress synthesizer watcher
4. Auto-heal watcher for dead workers
5. Orchestrator routing rules for `[BLOCKED]`
6. Rate-limit fallback / model reassignment table

## Where this fits in Hermes

- Kanban = durable work graph and status engine
- Watchers = mechanical healing and distress synthesis
- Orchestrator = split / reassign / unblock decision-maker
- Workers = execution inside atomic task boundaries

## Where this fits in Mosaic Stack

- PRD / coordination infra should encode the same patterns
- Mosaic can use the same distress-card contract and watcher logic
- The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue

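The runtime-agnostic requirement amounts to a two-method interface. The sketch below expresses it as a `typing.Protocol`. The `Board` name, the method names, and the in-memory implementation are hypothetical and stand in for whatever task store a given platform provides.

```python
from collections import deque
from typing import Optional, Protocol


class Board(Protocol):
    """Minimal capability set: write a task card, react to a ready queue."""

    def create_card(self, title: str, body: str, assignee: str) -> None: ...

    def next_ready(self) -> Optional[dict]: ...


class InMemoryBoard:
    """Toy implementation used only to exercise the interface."""

    def __init__(self) -> None:
        self._ready: deque = deque()

    def create_card(self, title: str, body: str, assignee: str) -> None:
        self._ready.append({"title": title, "body": body, "assignee": assignee})

    def next_ready(self) -> Optional[dict]:
        return self._ready.popleft() if self._ready else None


def drain(board: Board) -> list[str]:
    """Consume the ready queue in order and return the handled titles."""
    handled = []
    while (card := board.next_ready()) is not None:
        handled.append(card["title"])
    return handled
```

Any backend that satisfies `Board` structurally, whether a Kanban table or a message queue, could sit behind the same watchers and orchestrator logic.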
## Cross-project takeaway

The important pattern is not the specific tool names. It is the mechanical feedback loop:

- detect failure without requiring the failing worker to succeed
- create a standardized help artifact
- route that artifact to a fresh orchestrator context
- repair the assignment graph
- continue the mission

That pattern is reusable anywhere.