stack/docs/M6-ISSUE-AUDIT.md

M6-AgentOrchestration Issue Audit

Date: 2026-02-02
Milestone: M6-AgentOrchestration (0.0.6)
Status: 6 open / 3 closed issues
Audit Purpose: Review existing issues against confirmed orchestrator-in-monorepo architecture


Executive Summary

Current State:

  • M6 milestone has 9 issues (6 open, 3 closed)
  • Issues are based on "ClawdBot integration" architecture
  • New architecture: Orchestrator is apps/orchestrator/ in monorepo (NOT ClawdBot)

Key Finding:

  • CONFLICT: The M6 issues were written against "ClawdBot" as an external execution backend (6 of the 9 need description updates)
  • REALITY: Orchestrator is now an internal monorepo service at apps/orchestrator/

Recommendation:

  • Keep existing M6 issues - they represent the control plane (Mosaic Stack's responsibility)
  • Create 34 new issues - for the execution plane (apps/orchestrator/ implementation)
  • Update issue descriptions - replace "ClawdBot" references with "Orchestrator service"

Architecture Comparison

Old Architecture (Current M6 Issues)

Mosaic Stack (Control Plane)
  ↓
ClawdBot Gateway (External service, separate repo)
  ↓
Worker Agents

New Architecture (Confirmed 2026-02-02)

Mosaic Stack Monorepo
├── apps/api/ (Control Plane - task CRUD, dispatch)
├── apps/coordinator/ (Quality gates, 50% rule)
├── apps/orchestrator/ (NEW - Execution plane)
│   ├── Agent spawning
│   ├── Task queue (Valkey/BullMQ)
│   ├── Git operations
│   ├── Health monitoring
│   └── Killswitch responder
└── apps/web/ (Dashboard, agent monitoring)

Key Difference: Orchestrator is IN the monorepo at apps/orchestrator/, not external "ClawdBot".


Existing M6 Issues Analysis

Epic

#95 [EPIC] Agent Orchestration - Persistent task management

  • Status: Open
  • Architecture: Based on ClawdBot integration
  • Recommendation: UPDATE - Keep as overall epic, but update description:
    • Replace "ClawdBot" with "Orchestrator service (apps/orchestrator/)"
    • Update delegation model to reflect monorepo architecture
    • Reference ORCHESTRATOR-MONOREPO-SETUP.md instead of CLAWDBOT-INTEGRATION.md
  • Action: Update issue description

Phase 1: Foundation (Control Plane)

#96 [ORCH-001] Agent Task Database Schema

  • Status: Closed
  • Scope: Database schema for task orchestration
  • Architecture Fit: KEEP AS-IS
  • Reason: Control plane (Mosaic Stack) still needs task database
  • Notes:
    • agent_tasks table - Still needed
    • agent_task_logs - Still needed
    • clawdbot_backends - ⚠️ Rename to orchestrator_instances (if multi-instance)
  • Action: No changes needed (already closed)

#97 [ORCH-002] Task CRUD API

  • Status: Closed
  • Scope: REST API for task management
  • Architecture Fit: KEEP AS-IS
  • Reason: Control plane API (Mosaic Stack) manages tasks
  • Notes:
    • POST/GET/PATCH endpoints - Still needed
    • Dispatch handled in #99 - Correct
  • Action: No changes needed (already closed)

Phase 2: Integration (Control Plane ↔ Execution Plane)

#98 [ORCH-003] Valkey Integration

  • Status: Closed
  • Scope: Valkey for runtime state
  • Architecture Fit: KEEP AS-IS
  • Reason: Shared state between control plane and orchestrator
  • Notes:
    • Task status caching - Control plane needs this
    • Pub/Sub for progress - Still needed
    • Backend health cache - ⚠️ Update to "Orchestrator health cache"
  • Action: No changes needed (already closed)

#99 [ORCH-004] Task Dispatcher Service

  • Status: Open
  • Scope: Dispatch tasks to execution backend
  • Architecture Fit: ⚠️ UPDATE REQUIRED
  • Current Description: "Dispatcher service for delegating work to ClawdBot"
  • Should Be: "Dispatcher service for delegating work to Orchestrator (apps/orchestrator/)"
  • Changes Needed:
    • Replace "ClawdBot Gateway API client" with "Orchestrator API client"
    • Update endpoint references (ClawdBot → Orchestrator)
    • Internal service call, not external HTTP (unless orchestrator runs separately)
  • Action: Update issue description, replace ClawdBot → Orchestrator

#102 [ORCH-007] Gateway Integration

  • Status: Open
  • Scope: Integration with execution backend
  • Architecture Fit: ⚠️ UPDATE REQUIRED
  • Current Description: "Core integration with ClawdBot Gateway API"
  • Should Be: "Integration with Orchestrator service (apps/orchestrator/)"
  • Changes Needed:
    • API endpoints: /orchestrator/agents/spawn, /orchestrator/agents/kill
    • Monorepo service-to-service communication (direct call or internal HTTP, not an external HTTP API)
    • Session management handled by orchestrator
  • Action: Update issue description, replace ClawdBot → Orchestrator
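
The changes to #99 and #102 can be sketched as a small client whose transport is injected, so the same calls work whether the orchestrator is reached in-process or over internal HTTP. This is a minimal sketch, not the project's implementation: only the two endpoint paths come from this audit. The `POST` method, the `taskId`/`agentId` body fields, and the names `OrchestratorRequest`, `Transport`, and `createOrchestratorClient` are assumptions made for illustration.

```typescript
// Hypothetical request shape; only the endpoint paths are taken from the audit.
interface OrchestratorRequest {
  method: "POST"; // assumed HTTP method
  path: string;
  body: Record<string, string>;
}

// Builds the request for /orchestrator/agents/spawn (body field is an assumption).
function spawnAgentRequest(taskId: string): OrchestratorRequest {
  return { method: "POST", path: "/orchestrator/agents/spawn", body: { taskId: taskId } };
}

// Builds the request for /orchestrator/agents/kill (body field is an assumption).
function killAgentRequest(agentId: string): OrchestratorRequest {
  return { method: "POST", path: "/orchestrator/agents/kill", body: { agentId: agentId } };
}

// The transport decides how a request is delivered: a direct service call,
// internal HTTP, or a test double. R is whatever the transport returns.
type Transport<R> = (request: OrchestratorRequest) => R;

function createOrchestratorClient<R>(transport: Transport<R>) {
  return {
    spawnAgent: (taskId: string): R => transport(spawnAgentRequest(taskId)),
    killAgent: (agentId: string): R => transport(killAgentRequest(agentId)),
  };
}

// Usage with a recording transport instead of a real network call.
const sent: OrchestratorRequest[] = [];
const client = createOrchestratorClient((request) => {
  sent.push(request);
  return request.path;
});
client.spawnAgent("task-1");
client.killAgent("agent-1");
```

Keeping the request builders separate from the transport leaves the "internal call vs. HTTP" question open until the deployment model is decided.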

Phase 3: Failure Handling (Control Plane)

#100 [ORCH-005] ClawdBot Failure Handling

  • Status: Open
  • Scope: Handle failures reported by execution backend
  • Architecture Fit: ⚠️ UPDATE REQUIRED
  • Current Description: "Handle failures reported by ClawdBot"
  • Should Be: "Handle failures reported by Orchestrator"
  • Changes Needed:
    • Callback handler receives failures from orchestrator
    • Retry/escalation logic - Still valid
    • Orchestrator reports failures, control plane decides retry
  • Action: Update issue description, replace ClawdBot → Orchestrator

Phase 4: Observability (Control Plane UI)

#101 [ORCH-006] Task Progress UI

  • Status: Open
  • Scope: Dashboard for monitoring task execution
  • Architecture Fit: KEEP - MINOR UPDATES
  • Current Description: Dashboard with kill controls
  • Should Be: Same, but backend is Orchestrator
  • Changes Needed:
    • Backend health indicators - ⚠️ Update to "Orchestrator health"
    • Real-time progress from Orchestrator via Valkey pub/sub - Correct
  • Action: Minor update to issue description (backend = Orchestrator)

Safety Critical

#114 [ORCH-008] Kill Authority Implementation

  • Status: Open
  • Scope: Control plane kill authority over execution backend
  • Architecture Fit: KEEP - CRITICAL
  • Current Description: "Mosaic Stack MUST retain the ability to terminate any ClawdBot operation"
  • Should Be: "Mosaic Stack MUST retain the ability to terminate any Orchestrator operation"
  • Changes Needed:
    • Endpoints: /api/orchestrator/tasks/:id/kill (not /api/clawdbot/...)
    • Kill signal to orchestrator service
    • Audit trail - Still valid
  • Action: Update issue description, replace ClawdBot → Orchestrator
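
A rough sketch of how the three requirements for #114 (endpoint, kill signal, audit trail) could fit together is shown here. The endpoint pattern comes from this audit; the `KillAuditEntry` fields, the `killTask` helper, and the choice to write the audit record before sending the signal are assumptions, not confirmed design.

```typescript
// Hypothetical audit record; the audit only states that an audit trail is required.
interface KillAuditEntry {
  taskId: string;
  requestedBy: string;
  reason: string;
  at: string; // ISO-8601 timestamp
}

// Stand-in for whatever delivers the kill signal to the orchestrator service.
type KillSignal = (taskId: string) => void;

// Path for the control plane kill endpoint named in the audit.
function killPath(taskId: string): string {
  return `/api/orchestrator/tasks/${encodeURIComponent(taskId)}/kill`;
}

function killTask(
  taskId: string,
  requestedBy: string,
  reason: string,
  sendKill: KillSignal,
  auditLog: KillAuditEntry[],
  now: Date = new Date()
): KillAuditEntry {
  // Record first so the trail exists even if delivering the signal throws.
  const entry: KillAuditEntry = {
    taskId: taskId,
    requestedBy: requestedBy,
    reason: reason,
    at: now.toISOString(),
  };
  auditLog.push(entry);
  sendKill(taskId);
  return entry;
}

// Usage with in-memory stand-ins for the signal and the audit store.
const killed: string[] = [];
const auditLog: KillAuditEntry[] = [];
killTask("42", "jason", "manual stop", (id) => killed.push(id), auditLog);
```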

New Orchestrator Issues (Execution Plane)

The existing M6 issues cover the control plane (Mosaic Stack). We need 34 new issues for the execution plane (apps/orchestrator/).

Source: ORCHESTRATOR-MONOREPO-SETUP.md Section 10.

Foundation (Days 1-2)

  1. [ORCH-101] Set up apps/orchestrator structure

    • Labels: task, setup, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Create directory structure, package.json, tsconfig.json
    • Dependencies: None
    • Conflicts: None (new code)
  2. [ORCH-102] Create Fastify server with health checks

    • Labels: feature, api, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Basic HTTP server with /health endpoint
    • Dependencies: #[ORCH-101]
    • Conflicts: None
  3. [ORCH-103] Docker Compose integration for orchestrator

    • Labels: task, infrastructure, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Add orchestrator service to docker-compose.yml
    • Dependencies: #[ORCH-101]
    • Conflicts: None
  4. [ORCH-104] Monorepo build pipeline for orchestrator

    • Labels: task, infrastructure, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Update turbo.json, ensure orchestrator builds correctly
    • Dependencies: #[ORCH-101]
    • Conflicts: None

Agent Spawning (Days 3-4)

  1. [ORCH-105] Implement agent spawner (Claude SDK)

    • Labels: feature, core, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Spawn Claude agents via Anthropic SDK
    • Dependencies: #[ORCH-102]
    • Conflicts: None
  2. [ORCH-106] Docker sandbox isolation

    • Labels: feature, security, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Isolate agents in Docker containers
    • Dependencies: #[ORCH-105]
    • Conflicts: None
  3. [ORCH-107] Valkey client and state management

    • Labels: feature, core, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Valkey client, state schema implementation
    • Dependencies: #98 (Valkey Integration), #[ORCH-102]
    • Conflicts: None (orchestrator's own Valkey client)
  4. [ORCH-108] BullMQ task queue

    • Labels: feature, core, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Task queue with priority, retry logic
    • Dependencies: #[ORCH-107]
    • Conflicts: None
  5. [ORCH-109] Agent lifecycle management

    • Labels: feature, core, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Manage agent states (spawning, running, completed, failed)
    • Dependencies: #[ORCH-105], #[ORCH-108]
    • Conflicts: None
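
The four agent states listed for ORCH-109 suggest a small state machine. The sketch below is one possible shape: the state names come from the issue description, while the allowed transitions and the function names are assumptions that ORCHESTRATOR-MONOREPO-SETUP.md may define differently.

```typescript
// States as listed in ORCH-109.
type AgentState = "spawning" | "running" | "completed" | "failed";

// Assumed transition table: completed and failed are treated as terminal.
const ALLOWED_TRANSITIONS: Record<AgentState, AgentState[]> = {
  spawning: ["running", "failed"],
  running: ["completed", "failed"],
  completed: [],
  failed: [],
};

function canTransition(from: AgentState, to: AgentState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: AgentState, to: AgentState): AgentState {
  if (!canTransition(from, to)) {
    throw new Error(`Invalid agent transition: ${from} -> ${to}`);
  }
  return to;
}

// Usage: the normal path of a successful agent.
let state: AgentState = "spawning";
state = transition(state, "running");
state = transition(state, "completed");
```

Rejecting invalid transitions in one place keeps lifecycle bugs (for example, a completed agent reported as running again) out of the queue and killswitch code.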

Git Integration (Days 5-6)

  1. [ORCH-110] Git operations (clone, commit, push)

    • Labels: feature, git, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Implement git-operations.ts with simple-git
    • Dependencies: #[ORCH-105]
    • Conflicts: None
  2. [ORCH-111] Git worktree management

    • Labels: feature, git, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Create and manage git worktrees for isolation
    • Dependencies: #[ORCH-110]
    • Conflicts: None
  3. [ORCH-112] Conflict detection

    • Labels: feature, git, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Detect merge conflicts before pushing
    • Dependencies: #[ORCH-110]
    • Conflicts: None

Coordinator Integration (Days 7-8)

  1. [ORCH-113] Coordinator API client

    • Labels: feature, integration, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: HTTP client for coordinator callbacks
    • Dependencies: #[ORCH-102]
    • Related: Existing coordinator in apps/coordinator/
  2. [ORCH-114] Quality gate callbacks

    • Labels: feature, quality, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Call coordinator quality gates (pre-commit, post-commit)
    • Dependencies: #[ORCH-113]
    • Related: Coordinator implements gates
  3. [ORCH-115] Task dispatch from coordinator

    • Labels: feature, integration, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Coordinator dispatches tasks to orchestrator
    • Dependencies: #99 (Task Dispatcher), #[ORCH-113]
    • Conflicts: None (complements #99)
  4. [ORCH-116] 50% rule enforcement

    • Labels: feature, quality, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Mechanical gates + AI confirmation
    • Dependencies: #[ORCH-114]
    • Related: Coordinator enforces, orchestrator calls

Killswitch + Security (Days 9-10)

  1. [ORCH-117] Killswitch implementation

    • Labels: feature, security, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Kill single agent or all agents (emergency stop)
    • Dependencies: #[ORCH-109]
    • Related: #114 (Kill Authority in control plane)
  2. [ORCH-118] Resource cleanup

    • Labels: task, infrastructure, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Clean up Docker containers, git worktrees
    • Dependencies: #[ORCH-117]
    • Conflicts: None
  3. [ORCH-119] Docker security hardening

    • Labels: security, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Non-root user, minimal image, security scanning
    • Dependencies: #[ORCH-106]
    • Conflicts: None
  4. [ORCH-120] Secret scanning

    • Labels: security, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: git-secrets integration, pre-commit hooks
    • Dependencies: #[ORCH-110]
    • Conflicts: None

Quality Gates (Days 11-12)

  1. [ORCH-121] Mechanical quality gates

    • Labels: feature, quality, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: TypeScript, ESLint, tests, coverage
    • Dependencies: #[ORCH-114]
    • Related: Coordinator has gate implementations
  2. [ORCH-122] AI agent confirmation

    • Labels: feature, quality, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Independent AI agent reviews changes
    • Dependencies: #[ORCH-114]
    • Related: Coordinator calls AI reviewer
  3. [ORCH-123] YOLO mode (gate bypass)

    • Labels: feature, configuration, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: User-configurable approval gates
    • Dependencies: #[ORCH-114]
    • Conflicts: None
  4. [ORCH-124] Gate configuration per-task

    • Labels: feature, configuration, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Different quality gates for different tasks
    • Dependencies: #[ORCH-114]
    • Conflicts: None
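
ORCH-121 through ORCH-124 together imply a per-task gate configuration with an optional bypass. The following is a sketch under stated assumptions: the gate names are derived from the descriptions above (TypeScript, ESLint, tests, coverage, AI confirmation), while the `TaskGateConfig` shape, the `resolveGates` helper, and the rule that YOLO mode bypasses every gate are hypothetical.

```typescript
// Gate names derived from ORCH-121 (mechanical) and ORCH-122 (AI confirmation).
type Gate = "typescript" | "eslint" | "tests" | "coverage" | "ai-confirmation";

// Hypothetical per-task configuration (ORCH-123, ORCH-124).
interface TaskGateConfig {
  gates?: Gate[]; // per-task override of the default gate list
  yolo?: boolean; // assumed to bypass all gates when true
}

const DEFAULT_GATES: Gate[] = ["typescript", "eslint", "tests", "coverage", "ai-confirmation"];

// Returns the gates that should run for a task.
function resolveGates(config: TaskGateConfig = {}): Gate[] {
  if (config.yolo === true) {
    return []; // bypass: no gates run
  }
  const selected = config.gates !== undefined ? config.gates : DEFAULT_GATES;
  return selected.slice(); // copy so callers cannot mutate the defaults
}

// Usage: a docs-only task that skips everything except tests.
const docsTaskGates = resolveGates({ gates: ["tests"] });
```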

Testing (Days 13-14)

  1. [ORCH-125] E2E test: Full agent lifecycle

    • Labels: test, e2e, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Spawn → Git → Quality → Complete
    • Dependencies: All above
    • Conflicts: None
  2. [ORCH-126] E2E test: Killswitch

    • Labels: test, e2e, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Kill single and all agents
    • Dependencies: #[ORCH-117]
    • Conflicts: None
  3. [ORCH-127] E2E test: Concurrent agents

    • Labels: test, e2e, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: 10 concurrent agents
    • Dependencies: #[ORCH-109]
    • Conflicts: None
  4. [ORCH-128] Performance testing

    • Labels: test, performance, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Load testing, resource monitoring
    • Dependencies: #[ORCH-125]
    • Conflicts: None
  5. [ORCH-129] Documentation

    • Labels: documentation, orchestrator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: API docs, architecture diagrams, runbooks
    • Dependencies: All above
    • Conflicts: None

Integration Issues (Existing Apps)

  1. [ORCH-130] apps/api: Add orchestrator client

    • Labels: feature, integration, api
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: HTTP client for orchestrator API
    • Dependencies: #[ORCH-102], #99 (uses this client)
    • Conflicts: None (extends #99)
  2. [ORCH-131] apps/coordinator: Add orchestrator dispatcher

    • Labels: feature, integration, coordinator
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Dispatch tasks to orchestrator after quality pre-check
    • Dependencies: #[ORCH-102], #99
    • Related: Coordinator already exists
  3. [ORCH-132] apps/web: Add agent dashboard

    • Labels: feature, ui, web
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Real-time agent status dashboard
    • Dependencies: #101 (extends this), #[ORCH-102]
    • Related: Extends #101
  4. [ORCH-133] docker-compose: Add orchestrator service

    • Labels: task, infrastructure
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Integrate orchestrator into docker-compose.yml
    • Dependencies: #[ORCH-103]
    • Conflicts: None
  5. [ORCH-134] Update root documentation

    • Labels: documentation
    • Milestone: M6-AgentOrchestration (0.0.6)
    • Description: Update README, ARCHITECTURE.md
    • Dependencies: #[ORCH-129]
    • Conflicts: None

Integration Matrix

Existing M6 Issues (Control Plane)

| Issue | Keep? | Update? | Reason |
| --- | --- | --- | --- |
| #95 (Epic) | Yes | ⚠️ Yes | Update ClawdBot → Orchestrator |
| #96 (Schema) | Yes | No | Already closed, no changes |
| #97 (CRUD API) | Yes | No | Already closed, no changes |
| #98 (Valkey) | Yes | No | Already closed, no changes |
| #99 (Dispatcher) | Yes | ⚠️ Yes | Update ClawdBot → Orchestrator |
| #100 (Failure Handling) | Yes | ⚠️ Yes | Update ClawdBot → Orchestrator |
| #101 (Progress UI) | Yes | ⚠️ Yes | Minor update (backend = Orchestrator) |
| #102 (Gateway Integration) | Yes | ⚠️ Yes | Update ClawdBot → Orchestrator |
| #114 (Kill Authority) | Yes | ⚠️ Yes | Update ClawdBot → Orchestrator |

New Orchestrator Issues (Execution Plane)

| Issue | Phase | Dependencies | Conflicts |
| --- | --- | --- | --- |
| ORCH-101 to ORCH-104 | Foundation | None | None |
| ORCH-105 to ORCH-109 | Spawning | Foundation | None |
| ORCH-110 to ORCH-112 | Git | Spawning | None |
| ORCH-113 to ORCH-116 | Coordinator | Git | None |
| ORCH-117 to ORCH-120 | Security | Coordinator | None |
| ORCH-121 to ORCH-124 | Quality | Security | None |
| ORCH-125 to ORCH-129 | Testing | All above | None |
| ORCH-130 to ORCH-134 | Integration | Testing | Extends existing |

No conflicts. New issues are additive (execution plane). Existing issues are control plane.


Immediate (Before Creating New Issues)

  1. Update Existing M6 Issues (6 issues to update)

    • #95: Update epic description (ClawdBot → Orchestrator service)
    • #99: Update dispatcher description
    • #100: Update failure handling description
    • #101: Minor update (backend = Orchestrator)
    • #102: Update gateway integration description
    • #114: Update kill authority description

    Script:

    # For each issue, use tea CLI:
    tea issues edit <issue-number> --description "<updated description>"
    
  2. Add Architecture Reference to Epic

    • Update #95 to reference:
      • ORCHESTRATOR-MONOREPO-SETUP.md
      • ARCHITECTURE-CLARIFICATION.md
    • Remove reference to CLAWDBOT-INTEGRATION.md (obsolete)
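
For the six description updates, a first draft of the new wording can be produced mechanically and then reviewed by hand. This is only a sketch: `draftOrchestratorWording` is a hypothetical helper, and plain substitution does not cover rewrites such as #102, where the target wording in this audit differs from a simple rename.

```typescript
// Replaces both capitalisations seen in this audit:
// "ClawdBot" in prose and "clawdbot" in paths such as /api/clawdbot/.
function draftOrchestratorWording(text: string): string {
  return text
    .split("ClawdBot").join("Orchestrator")
    .split("clawdbot").join("orchestrator");
}

// Current descriptions quoted in this audit for #100 and #114.
const currentDescriptions: string[] = [
  "Handle failures reported by ClawdBot",
  "Mosaic Stack MUST retain the ability to terminate any ClawdBot operation",
];

const draftedDescriptions = currentDescriptions.map(draftOrchestratorWording);
// draftedDescriptions[0] is "Handle failures reported by Orchestrator"
```

Each drafted description still needs a manual pass before it is passed to `tea issues edit`.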

After Updates

  1. Create 34 New Orchestrator Issues

    • Use template:

      # [ORCH-XXX] Title
      
      ## Description
      
      [What needs to be done]
      
      ## Acceptance Criteria
      
      - [ ] Criterion 1
      - [ ] Criterion 2
      
      ## Dependencies
      
      - Blocks: #X
      - Blocked by: #Y
      
      ## Technical Notes
      
      [Implementation details from ORCHESTRATOR-MONOREPO-SETUP.md]
      
  2. Create Label: orchestrator

    tea labels create --name orchestrator --color "#FF6B35" --description "Orchestrator service (apps/orchestrator/)"
    
  3. Link Issues

    • New orchestrator issues should reference control plane issues:
      • ORCH-130 extends #99 (API client for dispatcher)
      • ORCH-131 extends #99 (Coordinator dispatcher)
      • ORCH-132 extends #101 (Agent dashboard)
    • Use "Blocks:" and "Blocked by:" in issue descriptions
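
Creating 34 issues by hand from the template is error-prone, so the body could be rendered from structured data instead. The sketch below mirrors the template headings above; the `IssueDraft` shape and `renderIssueBody` name are hypothetical, and the acceptance criteria in the usage example are placeholders rather than agreed criteria.

```typescript
// Hypothetical structured form of one issue from this audit.
interface IssueDraft {
  id: string;
  title: string;
  description: string;
  acceptanceCriteria: string[];
  blocks: string[];
  blockedBy: string[];
  technicalNotes: string;
}

// Renders the issue template used in this audit as Markdown text.
function renderIssueBody(draft: IssueDraft): string {
  const criteria = draft.acceptanceCriteria.map((item) => `- [ ] ${item}`);
  const lines: string[] = [
    `# [${draft.id}] ${draft.title}`,
    "",
    "## Description",
    "",
    draft.description,
    "",
    "## Acceptance Criteria",
    "",
    ...criteria,
    "",
    "## Dependencies",
    "",
    `- Blocks: ${draft.blocks.length > 0 ? draft.blocks.join(", ") : "None"}`,
    `- Blocked by: ${draft.blockedBy.length > 0 ? draft.blockedBy.join(", ") : "None"}`,
    "",
    "## Technical Notes",
    "",
    draft.technicalNotes,
  ];
  return lines.join("\n");
}

// Usage: ORCH-102 as described above; the criteria are placeholders.
const orch102Body = renderIssueBody({
  id: "ORCH-102",
  title: "Create Fastify server with health checks",
  description: "Basic HTTP server with /health endpoint",
  acceptanceCriteria: ["Criterion 1", "Criterion 2"],
  blocks: [],
  blockedBy: ["ORCH-101"],
  technicalNotes: "[Implementation details from ORCHESTRATOR-MONOREPO-SETUP.md]",
});
```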

Issue Creation Priority

Phase 1: Foundation (Create First)

  • ORCH-101 to ORCH-104 (no dependencies outside this phase; ORCH-102 to ORCH-104 depend on ORCH-101)

Phase 2: Core Features

  • ORCH-105 to ORCH-109 (spawning)
  • ORCH-110 to ORCH-112 (git)
  • ORCH-113 to ORCH-116 (coordinator)

Phase 3: Security & Quality

  • ORCH-117 to ORCH-120 (security)
  • ORCH-121 to ORCH-124 (quality)

Phase 4: Testing & Integration

  • ORCH-125 to ORCH-129 (testing)
  • ORCH-130 to ORCH-134 (integration)
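
The phase order above can be cross-checked against the per-issue dependencies with a simple topological ordering. The sketch below uses only the Foundation and Agent Spawning dependencies as listed in this audit (the already-closed #98 is omitted); `creationOrder` is a hypothetical helper that emits issues in waves, where each wave contains the issues whose dependencies have all been created.

```typescript
// Dependencies copied from the Foundation and Agent Spawning issue lists.
const dependencies: Record<string, string[]> = {
  "ORCH-101": [],
  "ORCH-102": ["ORCH-101"],
  "ORCH-103": ["ORCH-101"],
  "ORCH-104": ["ORCH-101"],
  "ORCH-105": ["ORCH-102"],
  "ORCH-106": ["ORCH-105"],
  "ORCH-107": ["ORCH-102"],
  "ORCH-108": ["ORCH-107"],
  "ORCH-109": ["ORCH-105", "ORCH-108"],
};

// Returns issue IDs in an order where every issue follows its dependencies.
function creationOrder(graph: Record<string, string[]>): string[] {
  const ids = Object.keys(graph);
  const done: string[] = [];
  while (done.length < ids.length) {
    // Ready = not yet created, and every dependency inside the graph is created.
    // Dependencies outside the graph (e.g. existing issues) are ignored.
    const ready = ids.filter(
      (id) =>
        done.indexOf(id) === -1 &&
        graph[id].every((dep) => !(dep in graph) || done.indexOf(dep) !== -1)
    );
    if (ready.length === 0) {
      throw new Error("Dependency cycle detected");
    }
    for (const id of ready) {
      done.push(id);
    }
  }
  return done;
}

const order = creationOrder(dependencies);
console.log(order.join(" "));
// prints "ORCH-101 ORCH-102 ORCH-103 ORCH-104 ORCH-105 ORCH-107 ORCH-106 ORCH-108 ORCH-109"
```

Extending the `dependencies` map to all 34 issues would also surface any cycle introduced when the issues are written up.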

Summary

Existing M6 Issues: 9 total

  • Keep: 9 (all control plane work)
  • Update: 6 (replace ClawdBot → Orchestrator)
  • Close: 0 (all still valid)

New Orchestrator Issues: 34 total

  • Foundation: 4 issues
  • Spawning: 5 issues
  • Git: 3 issues
  • Coordinator: 4 issues
  • Security: 4 issues
  • Quality: 4 issues
  • Testing: 5 issues
  • Integration: 5 issues

Total M6 Issues After Audit: 43 issues

  • 9 control plane (existing, updated)
  • 34 execution plane (new)

Conflicts: None (clean separation between control plane and execution plane)

Blockers: None

Questions for Jason:

  1. Approve update of existing 6 issues? (replace ClawdBot → Orchestrator)
  2. Approve creation of 34 new orchestrator issues?
  3. Create orchestrator label?
  4. Any additional issues needed?

Next Steps

  1. Review this audit
  2. ⏸️ Get Jason's approval
  3. ⏸️ Update existing 6 M6 issues
  4. ⏸️ Create orchestrator label
  5. ⏸️ Create 34 new orchestrator issues
  6. ⏸️ Link issues (dependencies, blocks)
  7. ⏸️ Update M6 milestone (43 total issues)

Ready to proceed?