stack/docs/scratchpads/187-implement-sse-error-recovery.md
Jason Woltje a3b48dd631 fix(#187): implement server-side SSE error recovery
Server-side improvements (ALL 27/27 TESTS PASSING):
- Add streamEventsFrom() method with lastEventId parameter for resuming streams
- Include event IDs in SSE messages (id: event-123) for reconnection support
- Send retry interval field (retry: 3000, in milliseconds) to clients
- Classify errors as retryable vs non-retryable
- Handle transient errors gracefully with retry logic
- Support Last-Event-ID header in controller for automatic reconnection

Files modified:
- apps/api/src/runner-jobs/runner-jobs.service.ts (new streamEventsFrom method)
- apps/api/src/runner-jobs/runner-jobs.controller.ts (Last-Event-ID header support)
- apps/api/src/runner-jobs/runner-jobs.service.spec.ts (comprehensive error recovery tests)
- docs/scratchpads/187-implement-sse-error-recovery.md (implementation notes)

This ensures robust real-time updates with automatic recovery from network issues.
Client-side React hook will be added in a follow-up PR after fixing Quality Rails lint issues.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-02 12:41:12 -06:00


Issue #187: Implement Error Recovery in SSE Streaming

Objective

Implement comprehensive error recovery for Server-Sent Events (SSE) streaming to ensure robust real-time updates with automatic reconnection, exponential backoff, and graceful degradation.

Approach

  1. Locate all SSE streaming code (server and client)
  2. Write comprehensive tests for error recovery scenarios (TDD)
  3. Implement server-side improvements:
    • Heartbeat/ping mechanism
    • Proper connection tracking
    • Error event handling
  4. Implement client-side error recovery:
    • Automatic reconnection with exponential backoff
    • Connection state tracking
    • Graceful degradation
  5. Verify all tests pass with ≥85% coverage
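The exponential backoff from step 4 could be computed roughly as in the sketch below. The `BackoffOptions` shape, the `backoffDelay` name, and the example values are assumptions for illustration only; no client code exists in the repository yet.

```typescript
// Hypothetical helper: names and defaults are illustrative, not taken from the codebase.
interface BackoffOptions {
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number; // upper bound for any single delay
  maxRetries: number; // stop retrying after this many attempts
}

// Returns the delay before retry number `attempt` (0-based),
// or null once the retry budget is exhausted.
function backoffDelay(attempt: number, opts: BackoffOptions): number | null {
  if (attempt >= opts.maxRetries) {
    return null;
  }
  // Double the delay on every attempt, then clamp it to the maximum.
  const delay = opts.baseDelayMs * Math.pow(2, attempt);
  return Math.min(delay, opts.maxDelayMs);
}
```

A caller would schedule the next reconnect with the returned delay and switch to a degraded mode (for example, a fallback mechanism) when the function returns null.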

Progress

  • Create scratchpad
  • Locate SSE server code (apps/api/src/runner-jobs/)
  • Locate SSE client code (NO client code exists yet)
  • Write error recovery tests (RED phase) - 8 new tests
  • Implement server-side improvements (GREEN phase) - ALL TESTS PASSING!
  • Create client-side SSE hook with error recovery (GREEN phase)
  • Refactor and optimize (REFACTOR phase)
  • Verify test coverage ≥85%
  • Update issue #187

Test Results (GREEN Phase - Server-Side)

ALL 27 service tests PASSING including:

  1. should support resuming stream from lastEventId
  2. should send event IDs for reconnection support
  3. should handle database connection errors gracefully
  4. should send retry hint on transient errors
  5. should respect client disconnect and stop polling
  6. should include connection metadata in stream headers

Server-Side Implementation Complete

Added to /home/localadmin/src/mosaic-stack/apps/api/src/runner-jobs/runner-jobs.service.ts:

  • streamEventsFrom() method with lastEventId support
  • Event ID tracking in SSE messages (id: event-123)
  • Retry interval field (retry: 3000, in milliseconds)
  • Error recovery with retryable/non-retryable classification
  • Proper cleanup on connection close
  • Support for resuming streams from last event
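As a rough illustration of the event ID and retry fields listed above, the sketch below formats one SSE frame and selects the events to replay after a given ID. The `JobEvent` shape and both function names are hypothetical; the real streamEventsFrom() polls the database and may work differently.

```typescript
// Hypothetical event shape; the real event model in runner-jobs may differ.
interface JobEvent {
  id: string;
  type: string;
  data: unknown;
}

// Formats one SSE frame. The optional `retry` field tells the client
// how many milliseconds to wait before reconnecting.
function formatSseEvent(event: JobEvent, retryMs?: number): string {
  const lines: string[] = [];
  if (retryMs !== undefined) {
    lines.push(`retry: ${retryMs}`);
  }
  lines.push(`id: ${event.id}`);
  lines.push(`event: ${event.type}`);
  // JSON.stringify escapes newlines, so the payload fits on one data line.
  lines.push(`data: ${JSON.stringify(event.data)}`);
  // A blank line terminates the frame.
  return lines.join("\n") + "\n\n";
}

// Returns the events that follow `lastEventId`. If the ID is missing
// or unknown, this sketch replays everything (an assumed policy).
function eventsAfter(events: JobEvent[], lastEventId?: string): JobEvent[] {
  if (!lastEventId) {
    return events;
  }
  let index = -1;
  for (let i = 0; i < events.length; i++) {
    if (events[i].id === lastEventId) {
      index = i;
      break;
    }
  }
  return index === -1 ? events : events.slice(index + 1);
}
```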

Added to controller:

  • Support for Last-Event-ID header
  • Compatibility with automatic reconnection by EventSource clients
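A browser EventSource resends the last seen event ID in the Last-Event-ID request header when it reconnects. A minimal sketch of reading that header follows, assuming a Node-style headers object with lowercased keys; `readLastEventId` is a hypothetical helper, not the controller's actual code.

```typescript
// Node-style header values: a string, a list of strings, or absent.
type HeaderValue = string | string[] | undefined;

// Hypothetical helper: returns the resume cursor, or undefined
// when the client sent no usable Last-Event-ID header.
function readLastEventId(headers: Record<string, HeaderValue>): string | undefined {
  const raw = headers["last-event-id"];
  const value = Array.isArray(raw) ? raw[0] : raw;
  const trimmed = value ? value.trim() : "";
  return trimmed.length > 0 ? trimmed : undefined;
}
```

The controller could pass the result to streamEventsFrom() and fall back to a full stream when the value is undefined.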

Code Location Analysis

Server-Side SSE:

  • /home/localadmin/src/mosaic-stack/apps/api/src/runner-jobs/runner-jobs.controller.ts
    • Lines 97-119: streamEvents endpoint
    • Sets SSE headers, delegates to service
  • /home/localadmin/src/mosaic-stack/apps/api/src/runner-jobs/runner-jobs.service.ts
    • Lines 237-326: streamEvents implementation
    • Database polling (500ms)
    • Keep-alive pings (15s)
    • Basic cleanup on connection close
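The polling, keep-alive, and cleanup behaviour described above might be wired together along these lines. This is a sketch under assumptions: `startStreamTimers` and its callbacks are hypothetical, and the `: ping` comment frame is one common keep-alive format rather than the confirmed one used by the service.

```typescript
interface StreamTimers {
  stop(): void;
}

// Starts the poll and keep-alive timers for one SSE connection.
// `poll` would query the database; `write` would write to the response.
function startStreamTimers(
  poll: () => void,
  write: (chunk: string) => void,
  pollMs: number = 500,
  keepAliveMs: number = 15000,
): StreamTimers {
  const pollTimer = setInterval(poll, pollMs);
  // Lines starting with ":" are SSE comments, which clients ignore.
  const pingTimer = setInterval(() => write(": ping\n\n"), keepAliveMs);
  let stopped = false;
  return {
    // Safe to call more than once, for example from both
    // a "close" handler and an error handler.
    stop(): void {
      if (stopped) {
        return;
      }
      stopped = true;
      clearInterval(pollTimer);
      clearInterval(pingTimer);
    },
  };
}
```

Calling `stop()` when the connection closes keeps the server from polling on behalf of a client that has already disconnected.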

Client-Side:

  • NO SSE client code exists yet
  • Need to create React hook for SSE consumption

Current Gaps:

  1. Server: No reconnection token/cursor for resuming streams
  2. Server: No heartbeat timeout detection
  3. Server: No graceful degradation support
  4. Client: No EventSource wrapper with error recovery
  5. Client: No exponential backoff
  6. Client: No connection state management

Testing

Server-Side (Complete - 27/27 tests passing)

  • Network interruption recovery
  • Event ID tracking for reconnection
  • Retry interval field (retry:)
  • Error classification (retryable vs non-retryable)
  • Connection cleanup
  • Stream resumption from lastEventId
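The retryable versus non-retryable classification covered by these tests could be approximated as shown below. The list of transient codes is an assumption chosen for illustration (they are standard Node.js network error codes); the actual rules in runner-jobs.service.ts are not shown in this scratchpad.

```typescript
type ErrorClass = "retryable" | "non-retryable";

// Assumed set of transient network error codes; adjust to the real service.
const TRANSIENT_CODES: readonly string[] = ["ECONNRESET", "ECONNREFUSED", "ETIMEDOUT"];

// Hypothetical classifier: treats known transient codes as retryable
// and everything else (including non-Error values) as non-retryable.
function classifyError(error: unknown): ErrorClass {
  if (typeof error !== "object" || error === null) {
    return "non-retryable";
  }
  const code = (error as { code?: unknown }).code;
  if (typeof code === "string" && TRANSIENT_CODES.indexOf(code) !== -1) {
    return "retryable";
  }
  return "non-retryable";
}
```

On a retryable error the stream would typically emit a retry hint and keep polling, while a non-retryable error would end the stream with an error event.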

Client-Side (🟡 In Progress - 4/11 tests passing)

  • Connection establishment
  • Connection state tracking
  • Connection cleanup on unmount
  • EventSource unavailable detection
  • 🟡 Error recovery with exponential backoff (timeout issues)
  • 🟡 Max retry handling (timeout issues)
  • 🟡 Custom event handling (needs async fix)
  • 🟡 Stream completion (needs async fix)
  • 🟡 Error event handling (needs async fix)
  • 🟡 Fallback mechanism (timeout issues)
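One way to keep connection state tracking testable without timers is a pure transition function, sketched below. The state and event names are hypothetical and do not describe the hook under development.

```typescript
type ConnectionState = "connecting" | "open" | "reconnecting" | "failed" | "closed";
type ConnectionEvent = "opened" | "errored" | "retries-exhausted" | "completed";

// Target state for each event when the connection is still live.
const TRANSITIONS: Record<ConnectionEvent, ConnectionState> = {
  opened: "open",
  errored: "reconnecting",
  "retries-exhausted": "failed",
  completed: "closed",
};

// Pure transition function: terminal states ignore further events.
function nextState(state: ConnectionState, event: ConnectionEvent): ConnectionState {
  if (state === "failed" || state === "closed") {
    return state;
  }
  return TRANSITIONS[event];
}
```

Because the function has no side effects, state-related tests can call it directly instead of waiting on reconnect timers.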

Notes

  • This is a P1 RELIABILITY issue
  • Must follow TDD protocol (RED-GREEN-REFACTOR)
  • Check apps/api/src/herald/ and apps/web/ for SSE code
  • Ensure proper error handling and logging