Fix race condition in job status updates #196

Closed
opened 2026-02-02 17:27:09 +00:00 by jason.woltje · 0 comments
Owner

Problem

Job status updates use read-then-update pattern that allows concurrent updates to violate state machine invariants.

Location

apps/api/src/coordinator-integration/coordinator-integration.service.ts:103-152

// T1: Read current status
const job = await this.prisma.runnerJob.findUnique({...});

// T2: Validate transition  
if (!this.isValidStatusTransition(job.status, dto.status)) {
  throw new BadRequestException();
}

// T3: Update (RACE: Another request could update between read and write)
const updatedJob = await this.prisma.runnerJob.update({...});

Race Scenario

Time | Request A (→RUNNING)    | Request B (→CANCELLED)
-----|-------------------------|------------------------
T1   | Read: status=PENDING    |
T2   |                         | Read: status=PENDING
T3   | Validate: PENDING→RUNNING ✓ |
T4   |                         | Validate: PENDING→CANCELLED ✓
T5   | Update: RUNNING         |
T6   |                         | Update: CANCELLED ✓
T7   | Job=CANCELLED (wrong!)  |

Impact

  • Invalid state transitions allowed
  • Lost updates (RUNNING overwritten by stale CANCELLED)
  • Event ordering confusion
  • Herald broadcasts conflicting statuses

Acceptance Criteria

  • Implement optimistic locking with database constraint
  • Use atomic check-and-set operations
  • Return conflict errors with current state
  • Add tests for concurrent updates
  • Add retry logic in coordinator

Implementation

async updateJobStatus(jobId: string, dto: UpdateJobStatusDto) {
  // Atomic check-and-set with raw SQL
  const result = await this.prisma.$executeRaw`
    UPDATE runner_jobs 
    SET status = ${dto.status}, updated_at = NOW()
    WHERE id = ${jobId} 
      AND status = ${expectedCurrentStatus}
    RETURNING *
  `;
  
  if (result.rowCount === 0) {
    const current = await this.prisma.runnerJob.findUnique({...});
    throw new ConflictException(
      `Status changed from ${expectedCurrentStatus} to ${current.status}`
    );
  }
  
  await this.emitStatusChange(jobId, dto.status);
}

Testing

  • Test concurrent updates cause conflicts
  • Test retry succeeds with current state
  • Test valid transitions allowed
  • Test invalid transitions blocked

References

M4.2-Infrastructure verification report (2026-02-02)

## Problem Job status updates use read-then-update pattern that allows concurrent updates to violate state machine invariants. ## Location apps/api/src/coordinator-integration/coordinator-integration.service.ts:103-152 ```typescript // T1: Read current status const job = await this.prisma.runnerJob.findUnique({...}); // T2: Validate transition if (!this.isValidStatusTransition(job.status, dto.status)) { throw new BadRequestException(); } // T3: Update (RACE: Another request could update between read and write) const updatedJob = await this.prisma.runnerJob.update({...}); ``` ## Race Scenario ``` Time | Request A (→RUNNING) | Request B (→CANCELLED) -----|-------------------------|------------------------ T1 | Read: status=PENDING | T2 | | Read: status=PENDING T3 | Validate: PENDING→RUNNING ✓ | T4 | | Validate: PENDING→CANCELLED ✓ T5 | Update: RUNNING | T6 | | Update: CANCELLED ✓ T7 | Job=CANCELLED (wrong!) | ``` ## Impact - Invalid state transitions allowed - Lost updates (RUNNING overwritten by stale CANCELLED) - Event ordering confusion - Herald broadcasts conflicting statuses ## Acceptance Criteria - [ ] Implement optimistic locking with database constraint - [ ] Use atomic check-and-set operations - [ ] Return conflict errors with current state - [ ] Add tests for concurrent updates - [ ] Add retry logic in coordinator ## Implementation ```typescript async updateJobStatus(jobId: string, dto: UpdateJobStatusDto) { // Atomic check-and-set with raw SQL const result = await this.prisma.$executeRaw` UPDATE runner_jobs SET status = ${dto.status}, updated_at = NOW() WHERE id = ${jobId} AND status = ${expectedCurrentStatus} RETURNING * `; if (result.rowCount === 0) { const current = await this.prisma.runnerJob.findUnique({...}); throw new ConflictException( `Status changed from ${expectedCurrentStatus} to ${current.status}` ); } await this.emitStatusChange(jobId, dto.status); } ``` ## Testing - [ ] Test concurrent updates cause conflicts - [ ] Test retry succeeds with current state - [ ] Test valid transitions allowed - [ ] Test invalid transitions blocked ## References M4.2-Infrastructure verification report (2026-02-02)
jason.woltje added this to the M4.2-Infrastructure (0.0.4) milestone 2026-02-02 17:27:09 +00:00
jason.woltje added the p2apiapi labels 2026-02-02 17:27:09 +00:00
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: mosaic/stack#196