Jason Woltje a0062494b7
fix(CQ-ORCH-7): Graceful Docker container shutdown before force remove
Replace the always-force container removal (SIGKILL) with a two-phase
approach: first attempt graceful stop (SIGTERM with configurable timeout),
then remove without force. Falls back to force remove only if the graceful
path fails. The graceful stop timeout is configurable via
orchestrator.sandbox.gracefulStopTimeoutSeconds (default: 10s).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 14:05:53 -06:00

Mosaic Orchestrator

Agent orchestration service for Mosaic Stack built with NestJS.

Overview

The Orchestrator is the execution plane of Mosaic Stack, responsible for:

  • Spawning and managing Claude agents (worker, reviewer, tester)
  • Task queue management via BullMQ with Valkey backend
  • Agent lifecycle state machine (spawning → running → completed/failed/killed)
  • Git workflow automation with worktree isolation per agent
  • Quality gate enforcement via Coordinator integration
  • Killswitch emergency stop with cleanup
  • Docker sandbox isolation (optional)
  • Secret scanning on agent commits
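
The lifecycle listed above (spawning → running → completed/failed/killed) can be pictured as a small transition table. The following is a minimal sketch, not the actual AgentLifecycleService: the names `ALLOWED`, `canTransition`, `transition`, and `isTerminal` are hypothetical, and allowing spawning → failed and spawning → killed directly is an assumption.

```typescript
// Hypothetical sketch of the agent lifecycle FSM; names are illustrative only.
type AgentStatus = "spawning" | "running" | "completed" | "failed" | "killed";

// Assumed transition table: terminal states have no outgoing transitions.
const ALLOWED: Record<AgentStatus, readonly AgentStatus[]> = {
  spawning: ["running", "failed", "killed"],
  running: ["completed", "failed", "killed"],
  completed: [],
  failed: [],
  killed: [],
};

// True when moving from `from` to `to` is listed in the table.
function canTransition(from: AgentStatus, to: AgentStatus): boolean {
  return ALLOWED[from].includes(to);
}

// Returns the new status, or throws when the move is not allowed.
function transition(from: AgentStatus, to: AgentStatus): AgentStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Invalid transition: ${from} -> ${to}`);
  }
  return to;
}

// A status is terminal when nothing can follow it.
function isTerminal(status: AgentStatus): boolean {
  return ALLOWED[status].length === 0;
}
```

Keeping the rules in one table makes it easy to reject an out-of-order event (for example a late "running" after "killed") before any state is persisted.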

Architecture

AppModule
├── HealthModule          → GET /health, GET /health/ready
├── AgentsModule          → POST /agents/spawn, GET /agents/:agentId/status, kill endpoints
│   ├── QueueModule       → BullMQ task queue (priority 1-10, retry with backoff)
│   ├── SpawnerModule     → Agent session management, Docker sandbox, lifecycle FSM
│   ├── KillswitchModule  → Emergency kill + cleanup (Docker, worktree, Valkey state)
│   └── ValkeyModule      → Distributed state persistence and pub/sub events
├── CoordinatorModule     → Quality gate checks (typecheck, lint, tests, coverage, AI review)
├── GitModule             → Clone, branch, commit, push, conflict detection, secret scanning
└── MonitorModule         → Agent health monitoring (placeholder)

The Orchestrator lives at apps/orchestrator/ in the Mosaic Stack monorepo. It is controlled by apps/coordinator/ (Quality Coordinator) and monitored via apps/web/ (Agent Dashboard).
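
The architecture tree mentions a task queue with priorities 1-10 and retry with backoff. As a rough sketch, the per-job options could be built as below. The object shape follows the `priority`, `attempts`, and `backoff` fields that BullMQ accepts as job options, but it is declared locally so the example runs without the bullmq package. `buildJobOptions`, `exponentialDelayMs`, the attempt count of 3, and the 1000 ms base delay are all assumptions, not values taken from the service.

```typescript
// Local declaration of the job-option fields used here (mirrors BullMQ's shape).
interface RetryJobOptions {
  priority: number;
  attempts: number;
  backoff: { type: "exponential"; delay: number };
}

// Assumed defaults; the real QueueService may use different values.
const DEFAULT_ATTEMPTS = 3;
const DEFAULT_BASE_DELAY_MS = 1000;

// Validates the 1-10 priority range and returns options for one job.
function buildJobOptions(priority: number): RetryJobOptions {
  if (!Number.isInteger(priority) || priority < 1 || priority > 10) {
    throw new RangeError(`priority must be an integer from 1 to 10, got ${priority}`);
  }
  return {
    priority,
    attempts: DEFAULT_ATTEMPTS,
    backoff: { type: "exponential", delay: DEFAULT_BASE_DELAY_MS },
  };
}

// Delay before retry number `retry` (1-based) under a doubling schedule.
function exponentialDelayMs(baseDelayMs: number, retry: number): number {
  return baseDelayMs * Math.pow(2, retry - 1);
}
```

With BullMQ, an object like this would be passed as the third argument to `queue.add(name, data, options)`. Under the assumed defaults the retries wait 1000 ms, 2000 ms, then 4000 ms.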

API Reference

Health

Method  Path           Description
GET     /health        Uptime and status
GET     /health/ready  Readiness check

Agents

Method  Path                     Description
POST    /agents/spawn            Spawn a new agent
GET     /agents/:agentId/status  Get agent status
POST    /agents/:agentId/kill    Kill a single agent
POST    /agents/kill-all         Kill all active agents

POST /agents/spawn

{
  "taskId": "string (required)",
  "agentType": "worker | reviewer | tester",
  "gateProfile": "strict | standard | minimal | custom (optional)",
  "context": {
    "repository": "https://git.example.com/repo.git",
    "branch": "main",
    "workItems": ["US-001"],
    "skills": ["typescript"]
  }
}

Response:

{
  "agentId": "uuid",
  "status": "spawning"
}
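
A client may want to check a spawn request before sending it. The sketch below validates only what the request body above states: `taskId` is required, `agentType` is one of three values, and `gateProfile` is optional. `validateSpawnRequest` is a hypothetical helper, not part of the service, and treating `agentType` as required is an assumption.

```typescript
const AGENT_TYPES: readonly string[] = ["worker", "reviewer", "tester"];
const GATE_PROFILES: readonly string[] = ["strict", "standard", "minimal", "custom"];

// Returns a list of problems; an empty list means the body looks valid.
function validateSpawnRequest(body: unknown): string[] {
  if (typeof body !== "object" || body === null) {
    return ["body must be an object"];
  }
  const { taskId, agentType, gateProfile } = body as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof taskId !== "string" || taskId.length === 0) {
    errors.push("taskId is required");
  }
  // Assumption: agentType is mandatory even though only taskId is marked required.
  if (typeof agentType !== "string" || !AGENT_TYPES.includes(agentType)) {
    errors.push("agentType must be worker, reviewer, or tester");
  }
  // gateProfile is optional, so only check it when present.
  if (
    gateProfile !== undefined &&
    (typeof gateProfile !== "string" || !GATE_PROFILES.includes(gateProfile))
  ) {
    errors.push("gateProfile must be strict, standard, minimal, or custom");
  }
  return errors;
}
```

A caller would send the body to POST /agents/spawn only when the returned list is empty.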

GET /agents/:agentId/status

Response:

{
  "agentId": "uuid",
  "taskId": "string",
  "status": "spawning | running | completed | failed | killed",
  "spawnedAt": "ISO timestamp",
  "startedAt": "ISO timestamp (optional)",
  "completedAt": "ISO timestamp (optional)",
  "error": "string (optional)"
}
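
Because an agent starts in "spawning", a caller usually polls the status endpoint until a terminal status appears. The loop below is a sketch under assumptions: `waitForAgent` is a hypothetical helper, the 2000 ms interval and 30-attempt limit are illustrative, and `fetchStatus` and `sleep` are injected so the loop can run without a live server.

```typescript
interface AgentStatusResponse {
  agentId: string;
  taskId: string;
  status: "spawning" | "running" | "completed" | "failed" | "killed";
  spawnedAt: string;
  startedAt?: string;
  completedAt?: string;
  error?: string;
}

// Statuses after which no further change is expected.
const TERMINAL_STATUSES = new Set<string>(["completed", "failed", "killed"]);

// Polls until the agent reaches a terminal status or the attempt limit is hit.
async function waitForAgent(
  agentId: string,
  fetchStatus: (agentId: string) => Promise<AgentStatusResponse>,
  sleep: (ms: number) => Promise<void>,
  intervalMs = 2000,
  maxAttempts = 30,
): Promise<AgentStatusResponse> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const current = await fetchStatus(agentId);
    if (TERMINAL_STATUSES.has(current.status)) {
      return current;
    }
    // Skip the final sleep so the timeout error is raised immediately.
    if (attempt < maxAttempts) {
      await sleep(intervalMs);
    }
  }
  throw new Error(`Agent ${agentId} did not reach a terminal state after ${maxAttempts} polls`);
}
```

In real use, `fetchStatus` would issue GET /agents/:agentId/status and parse the JSON body, and `sleep` would wrap a timer.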

POST /agents/kill-all

Response:

{
  "message": "Kill all completed: 3 killed, 0 failed",
  "total": 3,
  "killed": 3,
  "failed": 0,
  "errors": []
}
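
The response fields above are straightforward to derive from per-agent outcomes. The sketch below shows one way to do it. `KillResult` and `summarizeKillAll` are hypothetical names, and the format of the `errors` entries is an assumption, since the example response only shows an empty array.

```typescript
// Outcome of attempting to kill a single agent (hypothetical shape).
interface KillResult {
  agentId: string;
  ok: boolean;
  error?: string;
}

interface KillAllResponse {
  message: string;
  total: number;
  killed: number;
  failed: number;
  errors: string[];
}

// Aggregates individual kill outcomes into the kill-all response body.
function summarizeKillAll(results: KillResult[]): KillAllResponse {
  const failures = results.filter((r) => !r.ok);
  const killed = results.length - failures.length;
  return {
    message: `Kill all completed: ${killed} killed, ${failures.length} failed`,
    total: results.length,
    killed,
    failed: failures.length,
    // Assumed format: "<agentId>: <reason>".
    errors: failures.map((f) => `${f.agentId}: ${f.error ?? "unknown error"}`),
  };
}
```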

Services

Service                   Module       Responsibility
AgentSpawnerService       Spawner      Create agent sessions, generate UUIDs, track state
AgentLifecycleService     Spawner      State machine transitions with Valkey pub/sub events
DockerSandboxService      Spawner      Container creation with memory/CPU limits
QueueService              Queue        BullMQ priority queue with exponential backoff retry
KillswitchService         Killswitch   Emergency agent termination with audit logging
CleanupService            Killswitch   Multi-step cleanup (Docker, worktree, Valkey state)
GitOperationsService      Git          Clone, branch, commit, push operations
WorktreeManagerService    Git          Per-agent worktree isolation
ConflictDetectionService  Git          Merge conflict detection before push
SecretScannerService      Git          Detect hardcoded secrets (AWS, API keys, JWTs, etc.)
ValkeyService             Valkey       Distributed state and event pub/sub
CoordinatorClientService  Coordinator  HTTP client for quality gate API with retry
QualityGatesService       Coordinator  Pre-commit and post-commit gate evaluation
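
To illustrate the kind of check a secret scanner performs, the sketch below matches each line of a file against a short list of regular expressions. This is not the SecretScannerService rule set: `RULES` and `scanForSecrets` are hypothetical, and the two patterns (an AWS access key ID prefix and a three-segment JWT) are simplified assumptions that would need tuning against false positives.

```typescript
interface SecretFinding {
  rule: string;
  line: number; // 1-based line number within the scanned content
}

// Simplified, assumed patterns; a real scanner would carry many more rules.
const RULES: { name: string; pattern: RegExp }[] = [
  { name: "aws-access-key-id", pattern: /\bAKIA[0-9A-Z]{16}\b/ },
  { name: "jwt", pattern: /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/ },
];

// Reports every (rule, line) pair where a pattern matches.
function scanForSecrets(content: string): SecretFinding[] {
  const findings: SecretFinding[] = [];
  content.split("\n").forEach((text, index) => {
    for (const rule of RULES) {
      // Patterns have no `g` flag, so test() keeps no state between calls.
      if (rule.pattern.test(text)) {
        findings.push({ rule: rule.name, line: index + 1 });
      }
    }
  });
  return findings;
}
```

Scanning line by line keeps the findings easy to report back to the agent that produced the commit.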

Valkey State Keys

orchestrator:task:{taskId}    → TaskState (status, agentId, context, timestamps)
orchestrator:agent:{agentId}  → AgentState (status, taskId, timestamps, error)
orchestrator:events           → Pub/sub channel for lifecycle events
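
Centralizing the key layout above in small helpers avoids typos in string literals scattered across services. This is a minimal sketch; `taskKey`, `agentKey`, and `EVENTS_CHANNEL` are hypothetical names.

```typescript
// Shared prefix for all orchestrator keys, as shown in the key layout.
const KEY_PREFIX = "orchestrator";

// Pub/sub channel for lifecycle events.
const EVENTS_CHANNEL = `${KEY_PREFIX}:events`;

// Key holding the TaskState for a task.
function taskKey(taskId: string): string {
  return `${KEY_PREFIX}:task:${taskId}`;
}

// Key holding the AgentState for an agent.
function agentKey(agentId: string): string {
  return `${KEY_PREFIX}:agent:${agentId}`;
}
```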

Quality Gate Profiles

Profile   Default For  Gates
strict    reviewer     typecheck, lint, tests, coverage (85%), build, integration, AI review
standard  worker       typecheck, lint, tests, coverage (85%)
minimal   tester       tests only
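
The profile table maps naturally onto two lookup tables: one from profile to gates, one from agent type to default profile. The sketch below is illustrative only. The gate identifiers (for example "ai-review") and the `resolveGates` helper are hypothetical, and the custom profile is left out because the table does not define its gates.

```typescript
type AgentType = "worker" | "reviewer" | "tester";
type BuiltInProfile = "strict" | "standard" | "minimal";

// Gates per profile, following the table; identifiers are assumed spellings.
const PROFILE_GATES: Record<BuiltInProfile, readonly string[]> = {
  strict: ["typecheck", "lint", "tests", "coverage", "build", "integration", "ai-review"],
  standard: ["typecheck", "lint", "tests", "coverage"],
  minimal: ["tests"],
};

// Default profile for each agent type, following the "Default For" column.
const DEFAULT_PROFILE: Record<AgentType, BuiltInProfile> = {
  reviewer: "strict",
  worker: "standard",
  tester: "minimal",
};

// Uses the explicit profile when given, otherwise the agent type's default.
function resolveGates(agentType: AgentType, profile?: BuiltInProfile): readonly string[] {
  return PROFILE_GATES[profile ?? DEFAULT_PROFILE[agentType]];
}
```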

Development

# Install dependencies (from monorepo root)
pnpm install

# Run in dev mode
pnpm --filter @mosaic/orchestrator dev

# Build
pnpm --filter @mosaic/orchestrator build

# Run unit tests
pnpm --filter @mosaic/orchestrator test

# Run E2E/integration tests
pnpm --filter @mosaic/orchestrator test:e2e

# Type check
pnpm --filter @mosaic/orchestrator typecheck

# Lint
pnpm --filter @mosaic/orchestrator lint

Testing

  • Unit tests: Co-located *.spec.ts files (19 test files, 447+ tests)
  • Integration tests: tests/integration/*.e2e-spec.ts (17 E2E tests)
  • Coverage threshold: 85% (lines, functions, branches, statements)

Configuration

Environment variables are loaded via @nestjs/config. Key variables:

Variable           Description
ORCHESTRATOR_PORT  HTTP port (default: 3001)
CLAUDE_API_KEY     Claude API key for agents
VALKEY_HOST        Valkey/Redis host (default: localhost)
VALKEY_PORT        Valkey/Redis port (default: 6379)
COORDINATOR_URL    Quality Coordinator base URL
SANDBOX_ENABLED    Enable Docker sandbox (true/false)

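
As an illustration of how these variables and their defaults fit together, the sketch below reads them from a plain environment map into a typed object. It does not use @nestjs/config and is not the service's own loader: `OrchestratorConfig` and `loadConfig` are hypothetical, and treating an unset SANDBOX_ENABLED as false is an assumption.

```typescript
interface OrchestratorConfig {
  port: number;
  claudeApiKey?: string;
  valkeyHost: string;
  valkeyPort: number;
  coordinatorUrl?: string;
  sandboxEnabled: boolean;
}

// Reads the documented variables, applying the documented defaults.
function loadConfig(env: Record<string, string | undefined>): OrchestratorConfig {
  return {
    port: Number(env.ORCHESTRATOR_PORT ?? 3001),
    claudeApiKey: env.CLAUDE_API_KEY,
    valkeyHost: env.VALKEY_HOST ?? "localhost",
    valkeyPort: Number(env.VALKEY_PORT ?? 6379),
    coordinatorUrl: env.COORDINATOR_URL,
    // Assumption: the sandbox stays off unless the value is exactly "true".
    sandboxEnabled: env.SANDBOX_ENABLED === "true",
  };
}
```

In a Node process this could be called as `loadConfig(process.env)`.
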
  • Design: docs/design/agent-orchestration.md
  • Setup: docs/ORCHESTRATOR-MONOREPO-SETUP.md
  • Milestone: M6-AgentOrchestration (0.0.6)