[BLOCKER] Fix silent error swallowing in Herald broadcasting #185

Closed
opened 2026-02-02 17:23:50 +00:00 by jason.woltje · 0 comments
Owner

Problem

Herald service catches ALL errors during job event broadcasting but only logs them - never notifies users or re-throws. This creates a silent notification failure where users expect job updates in Discord but receive nothing.

Location

apps/api/src/herald/herald.service.ts:102-104

async broadcastJobEvent(jobId: string, event: {...}): Promise<void> {
  try {
    // ... broadcasting logic ...
  } catch (error) {
    this.logger.error(`Failed to broadcast event for job ${jobId}:`, error);
    // ❌ CRITICAL: Error is logged but execution continues
    // No error is thrown, no user notification, no retry
  }
}

Hidden Failures

  • Discord API authentication failures (expired token)
  • Network failures (Discord service down)
  • Thread not found errors (deleted thread)
  • Rate limiting errors (too many messages)
  • Database connection failures (job lookup)
  • Permission errors (bot kicked from server)

User Impact

  • Users expect job updates but receive nothing
  • No indication that notifications are failing
  • Jobs may complete but users think they're still running
  • Debugging requires checking server logs
  • Critical failures never communicated

Acceptance Criteria

  • Store failed broadcasts for retry
  • Implement retry mechanism with exponential backoff
  • Send fallback notification to user on broadcast failure
  • Re-throw critical events (job.failed, job.completed)
  • Emit monitoring metrics for failed broadcasts
  • Add circuit breaker for Discord API
  • Add tests for error scenarios

Implementation Notes

catch (error) {
  this.logger.error(`Failed to broadcast event for job ${jobId}:`, error);
  
  // NEW: Store failed broadcast for retry
  await this.storeFailedBroadcast(jobId, event, error);
  
  // NEW: Fallback notification
  await this.sendFallbackNotification(jobId, 
    "Job notifications temporarily unavailable"
  );
  
  // NEW: Emit monitoring alert
  this.metrics.incrementFailedBroadcasts();
  
  // NEW: Re-throw if critical event
  if (this.isCriticalEvent(event.type)) {
    throw new BroadcastFailureError(...);
  }
}

References

M4.2-Infrastructure verification report (2026-02-02)
Security review agent ID: a1b8b3f

## Problem Herald service catches ALL errors during job event broadcasting but only logs them - never notifies users or re-throws. This creates a silent notification failure where users expect job updates in Discord but receive nothing. ## Location apps/api/src/herald/herald.service.ts:102-104 ```typescript async broadcastJobEvent(jobId: string, event: {...}): Promise<void> { try { // ... broadcasting logic ... } catch (error) { this.logger.error(`Failed to broadcast event for job ${jobId}:`, error); // ❌ CRITICAL: Error is logged but execution continues // No error is thrown, no user notification, no retry } } ``` ## Hidden Failures - Discord API authentication failures (expired token) - Network failures (Discord service down) - Thread not found errors (deleted thread) - Rate limiting errors (too many messages) - Database connection failures (job lookup) - Permission errors (bot kicked from server) ## User Impact - Users expect job updates but receive **nothing** - No indication that notifications are failing - Jobs may complete but users think they're still running - Debugging requires checking server logs - Critical failures never communicated ## Acceptance Criteria - [ ] Store failed broadcasts for retry - [ ] Implement retry mechanism with exponential backoff - [ ] Send fallback notification to user on broadcast failure - [ ] Re-throw critical events (job.failed, job.completed) - [ ] Emit monitoring metrics for failed broadcasts - [ ] Add circuit breaker for Discord API - [ ] Add tests for error scenarios ## Implementation Notes ```typescript catch (error) { this.logger.error(`Failed to broadcast event for job ${jobId}:`, error); // NEW: Store failed broadcast for retry await this.storeFailedBroadcast(jobId, event, error); // NEW: Fallback notification await this.sendFallbackNotification(jobId, "Job notifications temporarily unavailable" ); // NEW: Emit monitoring alert this.metrics.incrementFailedBroadcasts(); // NEW: Re-throw if critical event if (this.isCriticalEvent(event.type)) { throw new BroadcastFailureError(...); } } ``` ## References M4.2-Infrastructure verification report (2026-02-02) Security review agent ID: a1b8b3f
jason.woltje added this to the M4.2-Infrastructure (0.0.4) milestone 2026-02-02 17:23:50 +00:00
jason.woltje added the apiapip0 labels 2026-02-02 17:23:50 +00:00
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: mosaic/stack#185