Implemented optimistic locking with version field and SELECT FOR UPDATE transactions to prevent data corruption from concurrent job status updates. Changes: - Added version field to RunnerJob schema for optimistic locking - Created migration 20260202_add_runner_job_version_for_concurrency - Implemented ConcurrentUpdateException for conflict detection - Updated RunnerJobsService methods with optimistic locking: * updateStatus() - with version checking and retry logic * updateProgress() - with version checking and retry logic * cancel() - with version checking and retry logic - Updated CoordinatorIntegrationService with SELECT FOR UPDATE: * updateJobStatus() - transaction with row locking * completeJob() - transaction with row locking * failJob() - transaction with row locking * updateJobProgress() - optimistic locking - Added retry mechanism (3 attempts) with exponential backoff - Added comprehensive concurrency tests (10 tests, all passing) - Updated existing test mocks to support updateMany Test Results: - All 10 concurrency tests passing ✓ - Tests cover concurrent status updates, progress updates, completions, cancellations, retry logic, and exponential backoff This fix prevents race conditions that could cause: - Lost job results (double completion) - Lost progress updates - Invalid status transitions - Data corruption under concurrent access Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
7.1 KiB
Issue #196: Fix race condition in job status updates
Objective
Fix race condition in job status update logic that can cause data corruption when multiple processes attempt to update the same job simultaneously. This is a P2 RELIABILITY issue.
Race Condition Analysis
Current Implementation Problems
-
RunnerJobsService.updateStatus() (lines 418-462)
- Read job:
prisma.runnerJob.findUnique() - Make decision based on read data
- Update job:
prisma.runnerJob.update() - RACE CONDITION: Between read and update, another process can modify the job
- Read job:
-
RunnerJobsService.updateProgress() (lines 467-485)
- Same pattern: read, check, update
- RACE CONDITION: Progress updates can be lost or overwritten
-
CoordinatorIntegrationService.updateJobStatus() (lines 103-152)
- Reads job to validate status transition
- RACE CONDITION: Status can change between validation and update
-
RunnerJobsService.cancel() (lines 149-178)
- Similar pattern with race condition
Concurrent Scenarios That Cause Issues
Scenario 1: Double completion
- Process A: Reads job (status=RUNNING), decides to complete it
- Process B: Reads job (status=RUNNING), decides to complete it
- Process A: Updates job to COMPLETED with resultA
- Process B: Updates job to COMPLETED with resultB (overwrites resultA)
- Result: Lost data (resultA lost)
Scenario 2: Progress updates lost
- Process A: Updates progress to 50%
- Process B: Updates progress to 75% (concurrent)
- Result: One update lost depending on race timing
Scenario 3: Invalid status transitions
- Process A: Reads job (status=RUNNING), validates transition to COMPLETED
- Process B: Reads job (status=RUNNING), validates transition to FAILED
- Process A: Updates to COMPLETED
- Process B: Updates to FAILED (overwrites COMPLETED)
- Result: Invalid state - job marked as FAILED when it actually completed
Approach
Solution 1: Add Version Field (Optimistic Locking)
Add a version field to RunnerJob model:
model RunnerJob {
// ... existing fields
version Int @default(0)
}
Update pattern:
const result = await prisma.runnerJob.updateMany({
where: {
id: jobId,
workspaceId: workspaceId,
version: currentVersion, // Only update if version matches
},
data: {
status: newStatus,
version: { increment: 1 },
},
});
if (result.count === 0) {
// Concurrent update detected - retry or throw error
}
Solution 2: Use Database Transactions with SELECT FOR UPDATE
await prisma.$transaction(async (tx) => {
// Lock the row
const job = await tx.$queryRaw`
SELECT * FROM "RunnerJob"
WHERE id = ${jobId} AND workspace_id = ${workspaceId}
FOR UPDATE
`;
// Validate and update
// Row is locked until transaction commits
});
Solution 3: Hybrid Approach (Recommended)
- Use optimistic locking (version field) for most updates (better performance)
- Use SELECT FOR UPDATE for critical sections (status transitions)
- Implement retry logic for optimistic lock failures
Progress
- Analyze current implementation
- Identify race conditions
- Design solution approach
- Write concurrency tests (RED phase)
- Add version field to schema
- Create migration for version field
- Implement optimistic locking in updateStatus()
- Implement optimistic locking in updateProgress()
- Implement optimistic locking in cancel()
- Implement SELECT FOR UPDATE for coordinator updates (updateJobStatus, completeJob, failJob)
- Add retry logic for concurrent update conflicts
- Create ConcurrentUpdateException
- Verify all tests pass
- Run coverage check (≥85%)
- Commit changes
Testing Strategy
Concurrency Tests to Write
-
Test concurrent status updates
- Simulate 2+ processes updating same job status
- Verify only one succeeds or updates are properly serialized
- Verify no data loss
-
Test concurrent progress updates
- Simulate rapid progress updates
- Verify all updates are recorded or properly merged
-
Test status transition validation with concurrency
- Simulate concurrent invalid transitions
- Verify invalid transitions are rejected
-
Test completion race
- Simulate concurrent completion with different results
- Verify only one completion succeeds and data isn't lost
-
Test optimistic lock retry logic
- Simulate version conflicts
- Verify retry mechanism works correctly
Implementation Plan
Phase 1: Schema Changes (with migration)
- Add
versionfield to RunnerJob model - Create migration
- Run migration
Phase 2: Update Methods (TDD)
- updateStatus() - Add optimistic locking
- updateProgress() - Add optimistic locking
- completeJob() - Add optimistic locking
- failJob() - Add optimistic locking
- cancel() - Add optimistic locking
Phase 3: Critical Sections
- updateJobStatus() in coordinator integration - Add transaction with SELECT FOR UPDATE
- Add retry logic wrapper
Phase 4: Error Handling
- Add custom exception for concurrent update conflicts
- Implement retry logic (max 3 retries with exponential backoff)
- Log concurrent update conflicts for monitoring
Notes
Version Field vs SELECT FOR UPDATE
Optimistic Locking (version field):
- ✅ Better performance (no row locks)
- ✅ Works well for high-concurrency scenarios
- ✅ Simple to implement
- ❌ Requires retry logic
- ❌ Client must handle conflicts
Pessimistic Locking (SELECT FOR UPDATE):
- ✅ Guarantees no conflicts
- ✅ No retry logic needed
- ❌ Locks rows (can cause contention)
- ❌ Risk of deadlocks if not careful
- ❌ Lower throughput under high concurrency
Recommendation: Use optimistic locking as default, SELECT FOR UPDATE only for critical status transitions.
Prisma Limitations
Prisma doesn't have native optimistic locking support. We need to:
- Add version field manually
- Use
updateMany()with version check (returns count) - Handle count=0 as conflict
Retry Strategy
For optimistic lock failures:
async function retryOnConflict<T>(operation: () => Promise<T>, maxRetries = 3): Promise<T> {
for (let i = 0; i < maxRetries; i++) {
try {
return await operation();
} catch (error) {
if (error instanceof ConcurrentUpdateError && i < maxRetries - 1) {
await sleep(Math.pow(2, i) * 100); // Exponential backoff
continue;
}
throw error;
}
}
}
Findings
Current State
- No concurrency protection exists
- All update methods are vulnerable to race conditions
- No version tracking or locking mechanism
- High risk under concurrent job processing
Risk Assessment
- P2 RELIABILITY is correct - can cause data corruption
- Most likely to occur when:
- Multiple workers process same job queue
- Coordinator and API update job simultaneously
- Retry logic causes concurrent updates
Next Steps
- Write failing concurrency tests
- Implement version field with migration
- Update all job update methods
- Verify tests pass
- Document behavior for developers