API:
- Add AuthModule import to RunnerJobsModule
- Fixes: Nest can't resolve dependencies of AuthGuard
Orchestrator:
- Remove --prod flag from dependency installation
- Copy full node_modules tree to production stage
- Align Dockerfile with API pattern for monorepo builds
- Fixes: Cannot find module '@nestjs/core'
Both services now match the working API Dockerfile pattern.
Fixes 4 test failures identified in pipeline run 239:
1. RunnerJobsService cancel tests:
- Use updateMany mock instead of update (service uses optimistic locking)
- Add version field to mock objects
- Use mockResolvedValueOnce for sequential findUnique calls
2. ActivityService error handling tests:
- Update tests to expect null return (fire-and-forget pattern)
- Activity logging now returns null on DB errors per security fix
3. SecretScannerService unreadable file test:
- Handle root user case where chmod 0o000 doesn't prevent reads
- Test now adapts expectations based on runtime permissions
Quality gates: lint ✓ typecheck ✓ tests ✓
- @mosaic/orchestrator: 612 tests passing
- @mosaic/web: 650 tests passing
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add test verifying clearInterval is called in finally block
- Add test verifying interval is cleared even when stream throws error
- Prevents memory leaks from leaked intervals
The clearInterval was already present in the codebase at line 409 of
runner-jobs.service.ts. These tests provide explicit verification
of the cleanup behavior.
Refs #338
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implemented optimistic locking with version field and SELECT FOR UPDATE
transactions to prevent data corruption from concurrent job status updates.
Changes:
- Added version field to RunnerJob schema for optimistic locking
- Created migration 20260202_add_runner_job_version_for_concurrency
- Implemented ConcurrentUpdateException for conflict detection
- Updated RunnerJobsService methods with optimistic locking:
* updateStatus() - with version checking and retry logic
* updateProgress() - with version checking and retry logic
* cancel() - with version checking and retry logic
- Updated CoordinatorIntegrationService with SELECT FOR UPDATE:
* updateJobStatus() - transaction with row locking
* completeJob() - transaction with row locking
* failJob() - transaction with row locking
* updateJobProgress() - optimistic locking
- Added retry mechanism (3 attempts) with exponential backoff
- Added comprehensive concurrency tests (10 tests, all passing)
- Updated existing test mocks to support updateMany
Test Results:
- All 10 concurrency tests passing ✓
- Tests cover concurrent status updates, progress updates, completions,
cancellations, retry logic, and exponential backoff
This fix prevents race conditions that could cause:
- Lost job results (double completion)
- Lost progress updates
- Invalid status transitions
- Data corruption under concurrent access
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Server-side improvements (ALL 27/27 TESTS PASSING):
- Add streamEventsFrom() method with lastEventId parameter for resuming streams
- Include event IDs in SSE messages (id: event-123) for reconnection support
- Send retry interval header (retry: 3000ms) to clients
- Classify errors as retryable vs non-retryable
- Handle transient errors gracefully with retry logic
- Support Last-Event-ID header in controller for automatic reconnection
Files modified:
- apps/api/src/runner-jobs/runner-jobs.service.ts (new streamEventsFrom method)
- apps/api/src/runner-jobs/runner-jobs.controller.ts (Last-Event-ID header support)
- apps/api/src/runner-jobs/runner-jobs.service.spec.ts (comprehensive error recovery tests)
- docs/scratchpads/187-implement-sse-error-recovery.md (implementation notes)
This ensures robust real-time updates with automatic recovery from network issues.
Client-side React hook will be added in a follow-up PR after fixing Quality Rails lint issues.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add composite index [jobId, timestamp] to improve query performance
for the most common job_events access patterns.
Changes:
- Add @@index([jobId, timestamp]) to JobEvent model in schema.prisma
- Create migration 20260202122655_add_job_events_composite_index
- Add performance tests to validate index effectiveness
- Document index design rationale in scratchpad
- Fix lint errors in api-key.guard, herald.service, runner-jobs.service
Rationale:
The composite index [jobId, timestamp] optimizes the dominant query
pattern used across all services:
- JobEventsService.getEventsByJobId (WHERE jobId, ORDER BY timestamp)
- RunnerJobsService.streamEvents (WHERE jobId + timestamp range)
- RunnerJobsService.findOne (implicit jobId filter + timestamp order)
This index provides:
- Fast filtering by jobId (highly selective)
- Efficient timestamp-based ordering
- Optimal support for timestamp range queries
- Backward compatibility with jobId-only queries
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add Server-Sent Events (SSE) endpoint for streaming job events to CLI
consumers who prefer HTTP streaming over WebSocket.
Endpoint: GET /runner-jobs/:id/events/stream
Features:
- Database polling (500ms interval) for new events
- Keep-alive pings (15s interval) to prevent timeout
- Auto-cleanup on connection close or job completion
- Authentication required (workspace member)
- SSE format: event: <type>\ndata: <json>\n\n
Implementation:
- Added streamEvents method to RunnerJobsService
- Added streamEvents endpoint to RunnerJobsController
- Comprehensive unit tests for both controller and service
- All quality gates pass (typecheck, lint, build, test)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements runner-jobs module for job lifecycle management and queue submission.
Changes:
- Created RunnerJobsModule with service, controller, and DTOs
- Implemented job creation with BullMQ queue submission
- Implemented job listing with filters (status, type, agentTaskId)
- Implemented job detail retrieval with steps and events
- Implemented cancel operation for pending/queued jobs
- Implemented retry operation for failed jobs
- Added comprehensive unit tests (24 tests, 100% coverage)
- Integrated with BullMQ for async job processing
- Integrated with Prisma for database operations
- Followed existing CRUD patterns from tasks/events modules
API Endpoints:
- POST /runner-jobs - Create and queue a new job
- GET /runner-jobs - List jobs (with filters)
- GET /runner-jobs/:id - Get job details
- POST /runner-jobs/:id/cancel - Cancel a running job
- POST /runner-jobs/:id/retry - Retry a failed job
Quality Gates:
- Typecheck: ✅ PASSED
- Lint: ✅ PASSED
- Build: ✅ PASSED
- Tests: ✅ PASSED (24/24 tests)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>