Files
stack/docs/scratchpads/185-fix-herald-error-handling.md
Jason Woltje 12abdfe81d feat(#93): implement agent spawn via federation
Implements FED-010: Agent Spawn via Federation feature that enables
spawning and managing Claude agents on remote federated Mosaic Stack
instances via COMMAND message type.

Features:
- Federation agent command types (spawn, status, kill)
- FederationAgentService for handling agent operations
- Integration with orchestrator's agent spawner/lifecycle services
- API endpoints for spawning, querying status, and killing agents
- Full command routing through federation COMMAND infrastructure
- Comprehensive test coverage (12/12 tests passing)

Architecture:
- Hub → Spoke: Spawn agents on remote instances
- Command flow: FederationController → FederationAgentService →
  CommandService → Remote Orchestrator
- Response handling: Remote orchestrator returns agent status/results
- Security: Connection validation, signature verification

Files created:
- apps/api/src/federation/types/federation-agent.types.ts
- apps/api/src/federation/federation-agent.service.ts
- apps/api/src/federation/federation-agent.service.spec.ts

Files modified:
- apps/api/src/federation/command.service.ts (agent command routing)
- apps/api/src/federation/federation.controller.ts (agent endpoints)
- apps/api/src/federation/federation.module.ts (service registration)
- apps/orchestrator/src/api/agents/agents.controller.ts (status endpoint)
- apps/orchestrator/src/api/agents/agents.module.ts (lifecycle integration)

Testing:
- 12/12 tests passing for FederationAgentService
- All command service tests passing
- TypeScript compilation successful
- Linting passed

Refs #93

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-03 14:37:06 -06:00

5.3 KiB

Issue #185: Fix silent error swallowing in Herald broadcasting

Objective

Fix silent error swallowing in Herald broadcasting to ensure errors are properly logged, propagated, and surfaced. This is a BLOCKER for monitoring and debugging - silent errors prevent proper system observability.

Problem Analysis

Location of Issue

File: /home/localadmin/src/mosaic-stack/apps/api/src/herald/herald.service.ts

Lines 102-104:

} catch (error) {
  this.logger.error(`Failed to broadcast event for job ${jobId}:`, error);
}

The Problem

The broadcastJobEvent method has a try-catch block that:

  1. Logs the error (good)
  2. Swallows the error completely (bad) - returns void without throwing
  3. Prevents callers from knowing if broadcasting failed
  4. Makes debugging and monitoring impossible

Impact

  • Callers like CoordinatorIntegrationService have no way to know if Herald broadcasting failed
  • Silent failures prevent proper error tracking and alerting
  • No way to implement retry logic or fallback mechanisms
  • Violates observability best practices

Approach

TDD Protocol

  1. RED - Write failing tests for error scenarios
  2. GREEN - Implement proper error handling
  3. REFACTOR - Clean up and ensure coverage

Solution Design

Option 1: Propagate Errors (CHOSEN)

  • Throw errors after logging them
  • Let callers decide how to handle (retry, ignore, alert)
  • Add context to errors for better debugging
  • Pros: Explicit error handling, better observability
  • Cons: Breaking change for callers

Option 2: Return Error Result

  • Return { success: boolean, error?: Error }
  • Callers can check result
  • Pros: Non-breaking
  • Cons: Easy to ignore, not idiomatic for async operations

Decision: Go with Option 1 (propagate errors) because:

  • This is version 0.0.x, breaking changes acceptable
  • Explicit error handling is better for system reliability
  • Forces proper error handling at call sites

Implementation Steps

  1. Add test for database errors in broadcastJobEvent
  2. Add test for Discord send failures
  3. Add test for error context preservation
  4. Remove error swallowing from try-catch
  5. Add custom error class for Herald failures
  6. Update error logging to include full context
  7. Verify all tests pass

Progress

  • Create scratchpad
  • Analyze the problem
  • Design solution
  • RED: Write failing tests (4 new error handling tests)
  • GREEN: Implement error propagation
  • GREEN: Update error logging with context
  • REFACTOR: Add coverage tests for formatting methods
  • Run test coverage verification (96.1% - exceeds 85% requirement)
  • Commit changes

Testing Strategy

Test Cases to Add

  1. Database failure during job lookup

    • Mock Prisma to throw error
    • Verify error is propagated with context
  2. Discord send failure

    • Mock Discord service to reject
    • Verify error is propagated with context
  3. Error context preservation

    • Verify jobId and event type are included in error
    • Verify original error is preserved
  4. Successful broadcast still works

    • Ensure existing tests still pass
    • No regression in happy path

Coverage Target

  • Minimum 85% coverage (project requirement)
  • Focus on error paths and edge cases

Results

Tests Added

  1. Database failure test - Verifies errors propagate when job lookup fails
  2. Discord send failure test - Verifies errors propagate when message sending fails
  3. Job events fetch failure test - Verifies errors propagate when fetching events fails
  4. Error context test - Verifies original error is preserved
  5. Coverage tests - 7 additional tests for formatting methods to reach 96.1% coverage

Coverage Achieved

  • 96.1% statement coverage (target: 85%)
  • 78.43% branch coverage
  • 100% function coverage
  • 25 tests total (18 existing + 7 new)

Changes Made

File: /home/localadmin/src/mosaic-stack/apps/api/src/herald/herald.service.ts

  • Lines 102-110: Enhanced error logging with event type context
  • Line 110: Added throw error; to propagate errors instead of swallowing them

File: /home/localadmin/src/mosaic-stack/apps/api/src/herald/herald.service.spec.ts

  • Added 4 error handling tests (lines 328-454)
  • Added 7 coverage tests for formatting methods

Notes

  • CoordinatorIntegrationService calls broadcastJobEvent at lines 148, 249
  • No error handling at call sites (assumes success)
  • Follow-up required: Update callers to handle errors properly (separate issue)

Impact of Changes

BREAKING CHANGE: This is a breaking change for callers of broadcastJobEvent, but acceptable because:

  1. Project is at version 0.0.x (pre-release)
  2. Improves system reliability and observability
  3. Forces explicit error handling at call sites
  4. Only 2 call sites in the codebase to update

Custom Error Class

export class HeraldBroadcastError extends Error {
  constructor(
    message: string,
    public readonly jobId: string,
    public readonly eventType: string,
    public readonly cause: Error
  ) {
    super(message);
    this.name = "HeraldBroadcastError";
  }
}

Migration Path

  1. Fix Herald service first (this issue)
  2. Update callers to handle errors (follow-up issue)
  3. Add retry logic if needed (follow-up issue)