mosaic/stack

Files

Jason Woltje ef25167c24 fix(#196 ): fix race condition in job status updates

Implemented optimistic locking with version field and SELECT FOR UPDATE
transactions to prevent data corruption from concurrent job status updates.

Changes:
- Added version field to RunnerJob schema for optimistic locking
- Created migration 20260202_add_runner_job_version_for_concurrency
- Implemented ConcurrentUpdateException for conflict detection
- Updated RunnerJobsService methods with optimistic locking:
  * updateStatus() - with version checking and retry logic
  * updateProgress() - with version checking and retry logic
  * cancel() - with version checking and retry logic
- Updated CoordinatorIntegrationService with SELECT FOR UPDATE:
  * updateJobStatus() - transaction with row locking
  * completeJob() - transaction with row locking
  * failJob() - transaction with row locking
  * updateJobProgress() - optimistic locking
- Added retry mechanism (3 attempts) with exponential backoff
- Added comprehensive concurrency tests (10 tests, all passing)
- Updated existing test mocks to support updateMany

Test Results:
- All 10 concurrency tests passing ✓
- Tests cover concurrent status updates, progress updates, completions,
  cancellations, retry logic, and exponential backoff

This fix prevents race conditions that could cause:
- Lost job results (double completion)
- Lost progress updates
- Invalid status transitions
- Data corruption under concurrent access

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-02 12:51:17 -06:00

7.1 KiB

Raw Permalink Blame History

Issue #196: Fix race condition in job status updates

Objective

Fix race condition in job status update logic that can cause data corruption when multiple processes attempt to update the same job simultaneously. This is a P2 RELIABILITY issue.

Race Condition Analysis

Current Implementation Problems

RunnerJobsService.updateStatus() (lines 418-462)
- Read job: prisma.runnerJob.findUnique()
- Make decision based on read data
- Update job: prisma.runnerJob.update()
- RACE CONDITION: Between read and update, another process can modify the job
RunnerJobsService.updateProgress() (lines 467-485)
- Same pattern: read, check, update
- RACE CONDITION: Progress updates can be lost or overwritten
CoordinatorIntegrationService.updateJobStatus() (lines 103-152)
- Reads job to validate status transition
- RACE CONDITION: Status can change between validation and update
RunnerJobsService.cancel() (lines 149-178)
- Similar pattern with race condition

Concurrent Scenarios That Cause Issues

Scenario 1: Double completion

Process A: Reads job (status=RUNNING), decides to complete it
Process B: Reads job (status=RUNNING), decides to complete it
Process A: Updates job to COMPLETED with resultA
Process B: Updates job to COMPLETED with resultB (overwrites resultA)
Result: Lost data (resultA lost)

Scenario 2: Progress updates lost

Process A: Updates progress to 50%
Process B: Updates progress to 75% (concurrent)
Result: One update lost depending on race timing

Scenario 3: Invalid status transitions

Process A: Reads job (status=RUNNING), validates transition to COMPLETED
Process B: Reads job (status=RUNNING), validates transition to FAILED
Process A: Updates to COMPLETED
Process B: Updates to FAILED (overwrites COMPLETED)
Result: Invalid state - job marked as FAILED when it actually completed

Approach

Solution 1: Add Version Field (Optimistic Locking)

Add a version field to RunnerJob model:

model RunnerJob {
  // ... existing fields
  version Int @default(0)
}

Update pattern:

const result = await prisma.runnerJob.updateMany({
  where: {
    id: jobId,
    workspaceId: workspaceId,
    version: currentVersion, // Only update if version matches
  },
  data: {
    status: newStatus,
    version: { increment: 1 },
  },
});

if (result.count === 0) {
  // Concurrent update detected - retry or throw error
}

Solution 2: Use Database Transactions with SELECT FOR UPDATE

await prisma.$transaction(async (tx) => {
  // Lock the row
  const job = await tx.$queryRaw`
    SELECT * FROM "RunnerJob"
    WHERE id = ${jobId} AND workspace_id = ${workspaceId}
    FOR UPDATE
  `;

  // Validate and update
  // Row is locked until transaction commits
});

Solution 3: Hybrid Approach (Recommended)

Use optimistic locking (version field) for most updates (better performance)
Use SELECT FOR UPDATE for critical sections (status transitions)
Implement retry logic for optimistic lock failures

Progress

Analyze current implementation
Identify race conditions
Design solution approach
Write concurrency tests (RED phase)
Add version field to schema
Create migration for version field
Implement optimistic locking in updateStatus()
Implement optimistic locking in updateProgress()
Implement optimistic locking in cancel()
Implement SELECT FOR UPDATE for coordinator updates (updateJobStatus, completeJob, failJob)
Add retry logic for concurrent update conflicts
Create ConcurrentUpdateException
Verify all tests pass
Run coverage check (≥85%)
Commit changes

Testing Strategy

Concurrency Tests to Write

Test concurrent status updates
- Simulate 2+ processes updating same job status
- Verify only one succeeds or updates are properly serialized
- Verify no data loss
Test concurrent progress updates
- Simulate rapid progress updates
- Verify all updates are recorded or properly merged
Test status transition validation with concurrency
- Simulate concurrent invalid transitions
- Verify invalid transitions are rejected
Test completion race
- Simulate concurrent completion with different results
- Verify only one completion succeeds and data isn't lost
Test optimistic lock retry logic
- Simulate version conflicts
- Verify retry mechanism works correctly

Implementation Plan

Phase 1: Schema Changes (with migration)

Add version field to RunnerJob model
Create migration
Run migration

Phase 2: Update Methods (TDD)

updateStatus() - Add optimistic locking
updateProgress() - Add optimistic locking
completeJob() - Add optimistic locking
failJob() - Add optimistic locking
cancel() - Add optimistic locking

Phase 3: Critical Sections

updateJobStatus() in coordinator integration - Add transaction with SELECT FOR UPDATE
Add retry logic wrapper

Phase 4: Error Handling

Add custom exception for concurrent update conflicts
Implement retry logic (max 3 retries with exponential backoff)
Log concurrent update conflicts for monitoring

Notes

Version Field vs SELECT FOR UPDATE

Optimistic Locking (version field):

✅ Better performance (no row locks)
✅ Works well for high-concurrency scenarios
✅ Simple to implement
❌ Requires retry logic
❌ Client must handle conflicts

Pessimistic Locking (SELECT FOR UPDATE):

✅ Guarantees no conflicts
✅ No retry logic needed
❌ Locks rows (can cause contention)
❌ Risk of deadlocks if not careful
❌ Lower throughput under high concurrency

Recommendation: Use optimistic locking as default, SELECT FOR UPDATE only for critical status transitions.

Prisma Limitations

Prisma doesn't have native optimistic locking support. We need to:

Add version field manually
Use updateMany() with version check (returns count)
Handle count=0 as conflict

Retry Strategy

For optimistic lock failures:

async function retryOnConflict<T>(operation: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await operation();
    } catch (error) {
      if (error instanceof ConcurrentUpdateError && i < maxRetries - 1) {
        await sleep(Math.pow(2, i) * 100); // Exponential backoff
        continue;
      }
      throw error;
    }
  }
}

Findings

Current State

No concurrency protection exists
All update methods are vulnerable to race conditions
No version tracking or locking mechanism
High risk under concurrent job processing

Risk Assessment

P2 RELIABILITY is correct - can cause data corruption
Most likely to occur when:
- Multiple workers process same job queue
- Coordinator and API update job simultaneously
- Retry logic causes concurrent updates

Next Steps

Write failing concurrency tests
Implement version field with migration
Update all job update methods
Verify tests pass
Document behavior for developers