Jason Woltje 5cde3a3b6d docs: note new-session trigger for mission toolset reload

2026-05-07 13:31:43 -05:00

Mosaic Stack ↔ Hermes Coordination Resilience

Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.

Summary

The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.

SIBKISS operational summary

mission on
heartbeat always
resume from packet
block with [BLOCKED]
reassign
keep tasks tiny
auto-heal dead workers

The design has four parts:

Atomic task decomposition — workers operate only within a small, explicit scope.
Distress signaling — workers create a standardized [BLOCKED] card when they encounter a blocker outside their scope.
Mechanical fallback — if the worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card for them.
Auto-heal / reassignment — stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.

Why this exists

Observed failure modes:

Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
Silent failure / dead worker: the worker PID is gone, but the task is still marked running or blocked.
Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.

The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.

Core workflow

1) Atomic task boundaries

Every task should have:

one concern
explicit files/packages in scope
explicit files/packages out of scope
a maximum file count if possible
a stated expected iteration budget

When a worker discovers work outside scope, it must stop fixing it and hand off.
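The handoff rule above can be made mechanical if scope is stored as data. The following is a minimal sketch, assuming scope is expressed as repository-relative path prefixes; `TaskScope`, its fields, and `must_hand_off` are hypothetical names for illustration, not part of Hermes or Mosaic Stack.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class TaskScope:
    """Hypothetical atomic-scope metadata carried in a task body."""
    concern: str
    in_scope: tuple[str, ...]        # path prefixes the worker may edit
    out_of_scope: tuple[str, ...]    # path prefixes the worker must not edit
    max_files: Optional[int] = None  # optional cap on touched files
    iteration_budget: Optional[int] = None

    def classify(self, path: str) -> str:
        """Return 'in', 'out', or 'unknown' for a repository-relative path."""
        p = PurePosixPath(path)
        # Exclusions are checked first so an explicit "out" is never overridden.
        if any(p.is_relative_to(prefix) for prefix in self.out_of_scope):
            return "out"
        if any(p.is_relative_to(prefix) for prefix in self.in_scope):
            return "in"
        return "unknown"


def must_hand_off(scope: TaskScope, touched: list[str]) -> bool:
    """True when the worker should stop fixing and raise a distress card."""
    if any(scope.classify(path) != "in" for path in touched):
        return True
    return scope.max_files is not None and len(set(touched)) > scope.max_files
```

Treating "unknown" paths the same as "out" is a choice made in this sketch: anything not explicitly in scope triggers a handoff rather than a guess.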

2) Worker-authored distress card

If the worker can still report status, it creates a card like:

Title: [BLOCKED] t_<source_id> <blocker_type>
Assignee: tuesday / orchestrator role
Status: ready
Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action

The orchestrator receives the card, acts on it, and closes the loop.

Routing rules

Distress card routing

The source task stays linked to the distress card so the recovery trail is auditable.

3) Mechanical fallback for rate-limited workers

If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.

That watcher should:

inspect running / blocked tasks
detect repeated 429 / 503 / overload errors
create the same standardized [BLOCKED] card on behalf of the worker
link the distress card to the source task
add a comment to the source task
allow the dispatcher to pick up the new card immediately

This closes the gap in the worker-authored protocol: a worker that cannot phone home no longer has to, because the watcher files the card mechanically.
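A minimal sketch of the synthesizer's core loop follows. It assumes each task row exposes a list of recent error records with an optional HTTP `status` and `message`; `TaskRow`, `REPEAT_THRESHOLD`, the `"orchestrator"` assignee string, and the returned card shape are all hypothetical. The skip on existing titles is an addition of this sketch so that repeated watcher runs do not file duplicate cards.

```python
from dataclasses import dataclass, field

RATE_LIMIT_STATUSES = {429, 503}
# Assumption: three matching errors in the recent window count as "repeated".
REPEAT_THRESHOLD = 3


@dataclass
class TaskRow:
    """Hypothetical view of a task row plus its failure metadata."""
    task_id: str
    status: str
    recent_errors: list = field(default_factory=list)


def looks_rate_limited(task: TaskRow, threshold: int = REPEAT_THRESHOLD) -> bool:
    """Count 429 / 503 / overload errors and compare against the threshold."""
    hits = sum(
        1
        for err in task.recent_errors
        if err.get("status") in RATE_LIMIT_STATUSES
        or "overload" in str(err.get("message", "")).lower()
    )
    return hits >= threshold


def synthesize_cards(tasks: list, existing_titles: set) -> list:
    """Build [BLOCKED] cards on behalf of workers that cannot file their own."""
    cards = []
    for task in tasks:
        if task.status not in {"running", "blocked"}:
            continue
        if not looks_rate_limited(task):
            continue
        title = f"[BLOCKED] {task.task_id} rate_limited"
        if title in existing_titles:
            continue  # a card for this source task already exists
        cards.append({
            "title": title,
            "assignee": "orchestrator",   # placeholder for the orchestrator role
            "status": "ready",            # ready, so the dispatcher picks it up
            "source_task": task.task_id,  # link back to the source task
            "source_comment": f"Watcher filed distress card for {task.task_id}",
        })
    return cards
```

The function only builds card payloads; writing them to the board and posting the comment on the source task are left to whatever storage layer the platform uses.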

4) Auto-heal for dead workers

A separate no-agent watcher should:

reap dead PIDs stuck in running
reset crash-loops whose failures are infrastructure-related
escalate tasks that have been reset too many times

This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.
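The three responsibilities above can be sketched as a pure planning function. This assumes a POSIX host (signal 0 is used as a liveness probe) and task rows shaped as dictionaries with `id`, `status`, `pid`, `failure_kind`, and `reset_count` keys; those keys, the action names, and `max_resets` are hypothetical.

```python
import os
from typing import Callable


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def plan_heal(
    tasks: list,
    alive: Callable[[int], bool] = pid_alive,
    max_resets: int = 3,
) -> list:
    """Return (task_id, action) pairs for running tasks whose worker PID is gone."""
    actions = []
    for task in tasks:
        if task["status"] != "running" or alive(task["pid"]):
            continue
        if task.get("reset_count", 0) >= max_resets:
            actions.append((task["id"], "escalate"))  # reset too many times
        elif task.get("failure_kind") == "infrastructure":
            actions.append((task["id"], "reset"))     # safe to retry as-is
        else:
            actions.append((task["id"], "reap"))      # clear the stale state only
    return actions
```

Passing the liveness check in as a parameter keeps the planner testable without real processes; the watcher itself would apply the returned actions and increment the reset count.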

Distress card contract

Canonical title

[BLOCKED] t_<source_task_id> <blocker_type>

Canonical blocker types

scope_boundary
env_blocker
credential_failure
dependency
iteration_budget
rate_limited
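Because routing keys off the title, it helps to build and parse titles in one place. A minimal sketch, assuming task ids are whitespace-free strings that already carry the `t_` prefix; `BlockerType`, `format_title`, and `parse_title` are hypothetical helper names.

```python
import re
from enum import Enum
from typing import Optional


class BlockerType(str, Enum):
    SCOPE_BOUNDARY = "scope_boundary"
    ENV_BLOCKER = "env_blocker"
    CREDENTIAL_FAILURE = "credential_failure"
    DEPENDENCY = "dependency"
    ITERATION_BUDGET = "iteration_budget"
    RATE_LIMITED = "rate_limited"


# Assumption: the id is everything between the prefix and the final token.
_TITLE_RE = re.compile(r"^\[BLOCKED\] (t_\S+) (\w+)$")


def format_title(source_task_id: str, blocker: BlockerType) -> str:
    """Build the canonical distress-card title."""
    return f"[BLOCKED] {source_task_id} {blocker.value}"


def parse_title(title: str) -> Optional[tuple]:
    """Return (source_task_id, BlockerType), or None if the title is not canonical."""
    match = _TITLE_RE.match(title)
    if match is None:
        return None
    try:
        return match.group(1), BlockerType(match.group(2))
    except ValueError:
        return None  # well-formed title, but not a canonical blocker type
```

Rejecting unknown blocker types at parse time keeps the list above authoritative: a new type has to be added to the enum before cards can use it.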

Canonical body

## Distress Signal
- Blocked task: t_xxx
- Worker: <profile_name>
- Branch: <git_branch_name>
- Workspace: <path>
- Blocker type: <type>
- Completed: <what was done>
- Cannot touch: <out-of-scope packages/files>
- Needs: <what the orchestrator should do>
- State: committed | uncommitted | stashed(<stash_name>)

## Scope Guard
DO NOT touch: anything outside diagnosing and remediating the blocker described above
Only fix: assign, split, reassign, or unblock the source task
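Workers and the synthesizer watcher should emit the same body, so a shared renderer is one way to keep them from drifting. The sketch below mirrors the template above field for field; the `DistressSignal` class and `render_body` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DistressSignal:
    """Hypothetical container for the fields of the canonical body."""
    task_id: str
    worker: str
    branch: str
    workspace: str
    blocker_type: str
    completed: str
    cannot_touch: str
    needs: str
    state: str  # "committed", "uncommitted", or "stashed(<stash_name>)"


def render_body(sig: DistressSignal) -> str:
    """Render the canonical distress-card body as a single string."""
    lines = [
        "## Distress Signal",
        f"- Blocked task: {sig.task_id}",
        f"- Worker: {sig.worker}",
        f"- Branch: {sig.branch}",
        f"- Workspace: {sig.workspace}",
        f"- Blocker type: {sig.blocker_type}",
        f"- Completed: {sig.completed}",
        f"- Cannot touch: {sig.cannot_touch}",
        f"- Needs: {sig.needs}",
        f"- State: {sig.state}",
        "",
        "## Scope Guard",
        "DO NOT touch: anything outside diagnosing and remediating the blocker described above",
        "Only fix: assign, split, reassign, or unblock the source task",
    ]
    return "\n".join(lines)
```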

Routing rules

Distress card routing

Cards with a [BLOCKED] title prefix should bypass normal triage.
The card should go directly to the orchestration profile.
The orchestrator should start from a clean session each time.
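These three rules reduce to a single dispatch decision. A minimal sketch, where the queue names `"orchestrator"` and `"triage"` and the returned keys are hypothetical:

```python
def route(card_title: str) -> dict:
    """Decide where a new card goes based only on its title prefix."""
    if card_title.startswith("[BLOCKED]"):
        return {
            "queue": "orchestrator",  # straight to the orchestration profile
            "skip_triage": True,
            "fresh_session": True,    # never reuse a prior orchestrator context
        }
    return {"queue": "triage", "skip_triage": False, "fresh_session": False}
```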

Rate-limit fallback

When the source task is rate-limited:

do not keep retrying in the worker
let the watcher synthesize the distress card
have the orchestrator reassign the source task to a different profile/provider combo

Provider fallback principle

Never reassign rate-limited work back to the same provider if the failure was provider pressure. Use a different provider when possible.

Suggested fallback order

Keep the current task body and scope guards intact.
Reassign to a different profile on a different provider.
If that is impossible, reassign to a different profile on the same provider only for non-rate-limit blockers.
If repeated failures continue, split the task into a narrower atomic card.
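The fallback order can be sketched as a selection function that only chooses the next assignee and never edits the task body. `Profile`, `pick_reassignment`, and `split_after` are hypothetical, and the `"escalate"` result for rate-limited work with no alternative provider is an assumption of this sketch, since the list above does not cover that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    name: str
    provider: str


def pick_reassignment(
    current: Profile,
    candidates: list[Profile],
    blocker_type: str,
    failures: int,
    split_after: int = 2,
) -> tuple[str, Optional[Profile]]:
    """Return (action, profile) where action is 'reassign', 'split', or 'escalate'."""
    # Repeated failures suggest the task is too wide, whoever runs it.
    if failures >= split_after:
        return "split", None
    others = [p for p in candidates if p.name != current.name]
    # Preferred: a different profile on a different provider.
    for profile in others:
        if profile.provider != current.provider:
            return "reassign", profile
    # Same provider is acceptable only when the blocker is not provider pressure.
    if blocker_type != "rate_limited":
        for profile in others:
            return "reassign", profile
    # Assumption: nothing safe to reassign to, so hand the decision upward.
    return "escalate", None
```

Checking the failure count first means a task that has already bounced between profiles is narrowed rather than reassigned again.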

Related recovery docs

Mission packet recovery contract: /opt/hermes/docs/mission-toolset-heartbeat.md
Hermes mission implementation plan: /opt/hermes/docs/plans/mission-toolset-implementation.md
The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.
New-session trigger: when a profile config changes, start a fresh session or /reset so the updated toolset is actually loaded.

Watchers to implement

Auto-heal watcher

Responsibilities:

reap stale workers
reset dead-PID crash loops
track reset counts
escalate after repeated resets

Distress synthesizer watcher

Responsibilities:

detect rate-limited / stuck workers
create [BLOCKED] cards mechanically
link the card to the source task
leave a comment for traceability

Iteration-budget watcher

Responsibilities:

detect long-running tasks and repeated failure patterns
recommend splits when a task is clearly over-scoped
report tasks that need human review after multiple resets
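A minimal sketch of this watcher as a report generator, assuming each task row records `iterations_used`, `iteration_budget`, and `reset_count`. Those keys, the `overrun_factor` of 1.5, and the `max_resets` of 3 are hypothetical values chosen for illustration.

```python
def review_budget(tasks: list, overrun_factor: float = 1.5, max_resets: int = 3) -> list:
    """Return (task_id, recommendation) pairs for tasks that need attention."""
    report = []
    for task in tasks:
        if task.get("reset_count", 0) >= max_resets:
            # Repeated resets did not help, so a person should look at it.
            report.append((task["id"], "human_review"))
        elif task["iterations_used"] > overrun_factor * task["iteration_budget"]:
            # Well past the stated budget: the task is probably over-scoped.
            report.append((task["id"], "recommend_split"))
    return report
```

The watcher only reports; acting on a recommendation stays with the orchestrator.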

Operational principle

If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.

This is what makes the system robust across compaction, rate limits, and dead workers.

Suggested implementation order

Atomic task metadata in task bodies
Worker-authored distress card protocol
Mechanical distress synthesizer watcher
Auto-heal watcher for dead workers
Orchestrator routing rules for [BLOCKED]
Rate-limit fallback / model reassignment table

Where this fits in Hermes

Kanban = durable work graph and status engine
Watchers = mechanical healing and distress synthesis
Orchestrator = split / reassign / unblock decision-maker
Workers = execution inside atomic task boundaries

Where this fits in Mosaic Stack

PRD / coordination infra should encode the same patterns
Mosaic can use the same distress-card contract and watcher logic
The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue

Cross-project takeaway

The important pattern is not the specific tool names. It is the mechanical feedback loop:

detect failure without requiring the failing worker to succeed
create a standardized help artifact
route that artifact to a fresh orchestrator context
repair the assignment graph
continue the mission

That pattern is reusable anywhere.
