[ORCH-005] ClawdBot Failure Handling #100

Closed
opened 2026-01-29 23:30:25 +00:00 by jason.woltje · 0 comments

Phase 3: Failure Handling

Handle failures reported by the Orchestrator service (`apps/orchestrator/`) rather than detecting them ourselves.

Deliverables

[ ] Failure callback handler (Orchestrator reports task failed)
[ ] Retry configuration per task type (max retries, backoff)
[ ] Automatic retry dispatch (if retries remaining)
[ ] Escalation logic (notify user after max retries)
[ ] Checkpoint preservation (save last known state for potential resume)
[ ] Failure audit trail (full history in AgentTaskLog)
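The "retry configuration per task type" deliverable could be sketched roughly as below. This is only an illustration under stated assumptions: the `RetryPolicy` class, the task type names, and all numeric values are hypothetical, and exponential backoff is one possible strategy, not something this issue specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical per-task-type retry settings (not from the issue)."""
    max_retries: int
    base_delay_s: float
    backoff_factor: float = 2.0   # assumed: exponential backoff
    max_delay_s: float = 300.0    # assumed: cap on any single delay

    def delay_for(self, retry_number: int) -> float:
        """Seconds to wait before retry `retry_number` (1-based), capped."""
        raw = self.base_delay_s * self.backoff_factor ** (retry_number - 1)
        return min(raw, self.max_delay_s)


# Hypothetical task types and values, for illustration only.
RETRY_POLICIES = {
    "code-review": RetryPolicy(max_retries=3, base_delay_s=5.0),
    "deploy": RetryPolicy(max_retries=1, base_delay_s=30.0),
}
DEFAULT_POLICY = RetryPolicy(max_retries=2, base_delay_s=10.0)


def policy_for(task_type: str) -> RetryPolicy:
    """Look up the policy for a task type, falling back to the default."""
    return RETRY_POLICIES.get(task_type, DEFAULT_POLICY)
```

Keeping the policy as plain data makes the "check retry config" step a single lookup, and an unknown task type falls back to a conservative default instead of raising.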

Failure Flow

Orchestrator reports failure → Callback received
                            → Log failure details
                            → Check retry config
                            → If retries remaining: re-dispatch
                            → If max retries: mark FAILED, notify user
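The flow above might look something like the following minimal sketch. Everything named here is an assumption made for illustration: the `Task` shape, the `handle_failure` signature, the `RETRYING` status, the retry limits, and the in-memory `log` list standing in for AgentTaskLog. The real callback payload from the Orchestrator is not described in this issue.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    """Hypothetical task record; field names are not from the issue."""
    task_id: str
    task_type: str
    failures: int = 0                 # failed attempts reported so far
    status: str = "RUNNING"
    checkpoint: Optional[dict] = None
    log: List[str] = field(default_factory=list)  # stand-in for AgentTaskLog


# Hypothetical retry limits per task type.
MAX_RETRIES: Dict[str, int] = {"code-review": 2}
DEFAULT_MAX_RETRIES = 1


def handle_failure(
    task: Task,
    error: str,
    checkpoint: Optional[dict],
    redispatch: Callable[[Task], None],
    notify_user: Callable[[Task], None],
) -> str:
    """Process one failure reported by the Orchestrator; return new status."""
    # Log failure details.
    task.failures += 1
    task.log.append(f"attempt {task.failures} failed: {error}")

    # Preserve the last known state for a potential resume.
    if checkpoint is not None:
        task.checkpoint = checkpoint

    # Check retry config.
    max_retries = MAX_RETRIES.get(task.task_type, DEFAULT_MAX_RETRIES)
    retries_used = task.failures - 1  # the first attempt is not a retry

    if retries_used < max_retries:
        # Retries remaining: re-dispatch.
        task.status = "RETRYING"
        task.log.append(f"re-dispatching (retry {retries_used + 1}/{max_retries})")
        redispatch(task)
    else:
        # Max retries reached: mark FAILED and escalate to the user.
        task.status = "FAILED"
        task.log.append("max retries reached; user notified")
        notify_user(task)
    return task.status
```

Passing `redispatch` and `notify_user` in as callables keeps the handler independent of the dispatcher (#99) and of any notification channel, so the decision logic can be tested without either.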

Removed (handled by Orchestrator service)

Stale agent detection — Orchestrator monitors its own agents
Direct health monitoring — Orchestrator handles heartbeats
Agent-level recovery — Orchestrator restarts failed agents

Dependencies

#99 Task Dispatcher Service

Related

#95 Agent Orchestration EPIC
#114 Kill Authority Implementation
ORCH-118 (Orchestrator resource cleanup)

jason.woltje added this to the M6-AgentOrchestration (0.0.6) milestone 2026-01-29 23:30:25 +00:00
jason.woltje added the api, p0, phase-3 labels 2026-01-29 23:30:25 +00:00
jason.woltje changed title from [ORCH-005] Agent Failure Recovery to [ORCH-005] ClawdBot Failure Handling 2026-01-30 03:04:05 +00:00
Reference: mosaic/stack#100