Server-side improvements (ALL 27/27 TESTS PASSING):
- Add streamEventsFrom() method with lastEventId parameter for resuming streams
- Include event IDs in SSE messages (id: event-123) for reconnection support
- Send retry interval (retry: 3000, in milliseconds) to clients
- Classify errors as retryable vs non-retryable
- Handle transient errors gracefully with retry logic
- Support Last-Event-ID header in controller for automatic reconnection

Files modified:
- apps/api/src/runner-jobs/runner-jobs.service.ts (new streamEventsFrom method)
- apps/api/src/runner-jobs/runner-jobs.controller.ts (Last-Event-ID header support)
- apps/api/src/runner-jobs/runner-jobs.service.spec.ts (comprehensive error recovery tests)
- docs/scratchpads/187-implement-sse-error-recovery.md (implementation notes)

This ensures robust real-time updates with automatic recovery from network issues. The client-side React hook will be added in a follow-up PR after the Quality Rails lint issues are fixed.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Issue #187: Implement Error Recovery in SSE Streaming
Objective
Implement comprehensive error recovery for Server-Sent Events (SSE) streaming to ensure robust real-time updates with automatic reconnection, exponential backoff, and graceful degradation.
Approach
- Locate all SSE streaming code (server and client)
- Write comprehensive tests for error recovery scenarios (TDD)
- Implement server-side improvements:
  - Heartbeat/ping mechanism
  - Proper connection tracking
  - Error event handling
- Implement client-side error recovery:
  - Automatic reconnection with exponential backoff
  - Connection state tracking
  - Graceful degradation
- Verify all tests pass with ≥85% coverage
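The "automatic reconnection with exponential backoff" step could be sketched as below. This is a minimal illustration, not the hook planned for the follow-up PR: the function name `backoffDelayMs` and the 1 s base and 30 s cap are assumptions chosen for the example.

```typescript
// Hypothetical helper, not taken from the repository.
// Returns the delay before reconnect attempt `attempt` (0-based):
// the base delay doubles on every attempt and is capped at `maxMs`.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000, // assumed base delay
  maxMs: number = 30000, // assumed upper bound
): number {
  const exponential = baseMs * 2 ** attempt;
  return Math.min(exponential, maxMs);
}

// Delays for the first five attempts with the assumed defaults.
const delays: number[] = [0, 1, 2, 3, 4].map((n) => backoffDelayMs(n));
console.log(delays);
```

In practice a small random jitter is often added to the delay so that many clients do not reconnect at the same moment. It is left out here to keep the function deterministic and easy to test.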
Progress
- Create scratchpad
- Locate SSE server code (apps/api/src/runner-jobs/)
- Locate SSE client code (NO client code exists yet)
- Write error recovery tests (RED phase) - 8 new tests
- Implement server-side improvements (GREEN phase) - ALL TESTS PASSING!
- Create client-side SSE hook with error recovery (GREEN phase)
- Refactor and optimize (REFACTOR phase)
- Verify test coverage ≥85%
- Update issue #187
Test Results (GREEN Phase - Server-Side)
✅ ALL 27 service tests PASSING including:
- ✅ should support resuming stream from lastEventId
- ✅ should send event IDs for reconnection support
- ✅ should handle database connection errors gracefully
- ✅ should send retry hint on transient errors
- ✅ should respect client disconnect and stop polling
- ✅ should include connection metadata in stream headers
Server-Side Implementation Complete
Added to /home/localadmin/src/mosaic-stack/apps/api/src/runner-jobs/runner-jobs.service.ts:
- streamEventsFrom() method with lastEventId support
- Event ID tracking in SSE messages (id: event-123)
- Retry interval header (retry: 3000)
- Error recovery with retryable/non-retryable classification
- Proper cleanup on connection close
- Support for resuming streams from last event
Added to controller:
- Support for Last-Event-ID header
- Automatic reconnection via EventSource
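To make the `id:` and `retry:` lines above concrete, here is a minimal sketch of how one SSE message could be serialized. The `SseFrame` type and `formatSseFrame` function are hypothetical names used for illustration and are not the code in runner-jobs.service.ts. Only the wire format is standard: `retry:`, `id:`, `event:` and `data:` fields, with a blank line ending the message.

```typescript
// Hypothetical types and helper for illustration only.
interface SseFrame {
  id?: string; // becomes the client's last event ID
  event?: string; // optional named event type
  retry?: number; // reconnection delay in milliseconds (digits only on the wire)
  data: string; // payload; may span several lines
}

function formatSseFrame(frame: SseFrame): string {
  const lines: string[] = [];
  if (frame.retry !== undefined) lines.push(`retry: ${frame.retry}`);
  if (frame.id !== undefined) lines.push(`id: ${frame.id}`);
  if (frame.event !== undefined) lines.push(`event: ${frame.event}`);
  // Each payload line needs its own "data:" field.
  for (const line of frame.data.split("\n")) {
    lines.push(`data: ${line}`);
  }
  // A blank line terminates the message.
  return lines.join("\n") + "\n\n";
}

const frame: string = formatSseFrame({ id: "event-123", retry: 3000, data: "hello" });
console.log(JSON.stringify(frame));
```

A browser `EventSource` that has received an `id:` field sends it back in the `Last-Event-ID` request header when it reconnects, which is what lets the controller resume the stream.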
Code Location Analysis
Server-Side SSE:
- /home/localadmin/src/mosaic-stack/apps/api/src/runner-jobs/runner-jobs.controller.ts
  - Line 97-119: streamEvents endpoint
  - Sets SSE headers, delegates to service
- /home/localadmin/src/mosaic-stack/apps/api/src/runner-jobs/runner-jobs.service.ts
  - Line 237-326: streamEvents implementation
  - Database polling (500ms)
  - Keep-alive pings (15s)
  - Basic cleanup on connection close
Client-Side:
- NO SSE client code exists yet
- Need to create React hook for SSE consumption
Current Gaps:
- Server: No reconnection token/cursor for resuming streams
- Server: No heartbeat timeout detection on server side
- Server: No graceful degradation support
- Client: No EventSource wrapper with error recovery
- Client: No exponential backoff
- Client: No connection state management
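The first gap, a cursor for resuming streams, might be closed along the following lines. This is a sketch under assumptions: `StoredEvent` and `eventsAfter` are hypothetical names, events are assumed to be held in emission order, and the real streamEventsFrom() method may query the database quite differently.

```typescript
// Hypothetical shapes for illustration; not the repository's types.
interface StoredEvent {
  id: string;
  payload: string;
}

// Returns the events a reconnecting client has not seen yet.
// `events` is assumed to be ordered oldest to newest.
function eventsAfter(
  events: readonly StoredEvent[],
  lastEventId?: string,
): StoredEvent[] {
  if (lastEventId === undefined) return [...events]; // fresh connection
  const index = events.findIndex((e) => e.id === lastEventId);
  // Unknown cursor: replay everything rather than silently dropping events.
  if (index === -1) return [...events];
  return events.slice(index + 1);
}

const log: StoredEvent[] = [
  { id: "event-1", payload: "queued" },
  { id: "event-2", payload: "running" },
  { id: "event-3", payload: "completed" },
];
console.log(eventsAfter(log, "event-1").map((e) => e.id));
```

Replaying the full log for an unknown cursor trades duplicate delivery for never losing an event. Whether that fits here depends on whether consumers can tolerate duplicates.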
Testing
Server-Side (✅ Complete - 27/27 tests passing)
- ✅ Network interruption recovery
- ✅ Event ID tracking for reconnection
- ✅ Retry interval headers
- ✅ Error classification (retryable vs non-retryable)
- ✅ Connection cleanup
- ✅ Stream resumption from lastEventId
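The retryable vs non-retryable split tested above could look roughly like this. The sketch is an assumption about one reasonable policy, not the classification used in runner-jobs.service.ts: `isRetryable` is a hypothetical name, and the set of transient codes (standard Node.js system error codes) is picked for illustration.

```typescript
// Assumed set of transient network error codes; adjust to the real failure modes.
const TRANSIENT_CODES: ReadonlySet<string> = new Set([
  "ECONNRESET",
  "ECONNREFUSED",
  "ETIMEDOUT",
]);

// Hypothetical classifier: an error is retryable only if it carries
// a string `code` property that is in the transient set.
function isRetryable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return typeof code === "string" && TRANSIENT_CODES.has(code);
}

const timeout = Object.assign(new Error("db timeout"), { code: "ETIMEDOUT" });
console.log(isRetryable(timeout), isRetryable(new Error("job not found")));
```

With a classifier like this, the stream can send a retry hint and keep polling on transient failures, and close the connection with an error event on everything else.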
Client-Side (🟡 In Progress - 4/11 tests passing)
- ✅ Connection establishment
- ✅ Connection state tracking
- ✅ Connection cleanup on unmount
- ✅ EventSource unavailable detection
- 🟡 Error recovery with exponential backoff (timeout issues)
- 🟡 Max retry handling (timeout issues)
- 🟡 Custom event handling (needs async fix)
- 🟡 Stream completion (needs async fix)
- 🟡 Error event handling (needs async fix)
- 🟡 Fallback mechanism (timeout issues)
Notes
- This is a P1 RELIABILITY issue
- Must follow TDD protocol (RED-GREEN-REFACTOR)
- Check apps/api/src/herald/ and apps/web/ for SSE code
- Ensure proper error handling and logging