feat(#93): implement agent spawn via federation
Implements FED-010: Agent Spawn via Federation, enabling spawning and managing Claude agents on remote federated Mosaic Stack instances via the COMMAND message type.

Features:
- Federation agent command types (spawn, status, kill)
- FederationAgentService for handling agent operations
- Integration with orchestrator's agent spawner/lifecycle services
- API endpoints for spawning, querying status, and killing agents
- Full command routing through federation COMMAND infrastructure
- Comprehensive test coverage (12/12 tests passing)

Architecture:
- Hub → Spoke: Spawn agents on remote instances
- Command flow: FederationController → FederationAgentService → CommandService → Remote Orchestrator
- Response handling: Remote orchestrator returns agent status/results
- Security: Connection validation, signature verification

Files created:
- apps/api/src/federation/types/federation-agent.types.ts
- apps/api/src/federation/federation-agent.service.ts
- apps/api/src/federation/federation-agent.service.spec.ts

Files modified:
- apps/api/src/federation/command.service.ts (agent command routing)
- apps/api/src/federation/federation.controller.ts (agent endpoints)
- apps/api/src/federation/federation.module.ts (service registration)
- apps/orchestrator/src/api/agents/agents.controller.ts (status endpoint)
- apps/orchestrator/src/api/agents/agents.module.ts (lifecycle integration)

Testing:
- 12/12 tests passing for FederationAgentService
- All command service tests passing
- TypeScript compilation successful
- Linting passed

Refs #93

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@@ -87,14 +87,14 @@ The Agent Orchestration Layer must provide:

### Component Responsibilities

| Component         | Responsibility                                                  |
| ----------------- | --------------------------------------------------------------- |
| **Task Manager**  | CRUD operations on tasks, state transitions, assignment logic   |
| **Agent Manager** | Agent lifecycle, health tracking, session management            |
| **Coordinator**   | Heartbeat processing, failure detection, recovery orchestration |
| **PostgreSQL**    | Persistent storage of tasks, agents, sessions, logs             |
| **Valkey/Redis**  | Runtime state, heartbeats, quick lookups, pub/sub               |
| **Gateway**       | Agent spawning, session management, message routing             |

---
@@ -156,44 +156,44 @@ CREATE TYPE task_orchestration_status AS ENUM (

CREATE TABLE agent_tasks (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,

  -- Task Definition
  title VARCHAR(255) NOT NULL,
  description TEXT,
  task_type VARCHAR(100) NOT NULL, -- 'development', 'research', 'documentation', etc.

  -- Status & Priority
  status task_orchestration_status DEFAULT 'PENDING',
  priority INT DEFAULT 5, -- 1 (low) to 10 (high)

  -- Assignment
  agent_id UUID REFERENCES agents(id) ON DELETE SET NULL,
  session_key VARCHAR(255), -- Current active session

  -- Progress Tracking
  progress_percent INT DEFAULT 0 CHECK (progress_percent BETWEEN 0 AND 100),
  current_step TEXT,
  estimated_completion_at TIMESTAMPTZ,

  -- Retry Logic
  retry_count INT DEFAULT 0,
  max_retries INT DEFAULT 3,
  retry_backoff_seconds INT DEFAULT 300, -- 5 minutes
  last_error TEXT,

  -- Dependencies
  depends_on UUID[] DEFAULT ARRAY[]::UUID[], -- Array of task IDs
  blocks UUID[] DEFAULT ARRAY[]::UUID[], -- Tasks blocked by this one

  -- Context
  input_context JSONB DEFAULT '{}', -- Input data/params for agent
  output_result JSONB, -- Final result from agent
  checkpoint_data JSONB DEFAULT '{}', -- Resumable state

  -- Metadata
  metadata JSONB DEFAULT '{}',
  tags VARCHAR(100)[] DEFAULT ARRAY[]::VARCHAR[],

  -- Audit
  created_by UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  created_at TIMESTAMPTZ DEFAULT NOW(),
@@ -202,7 +202,7 @@ CREATE TABLE agent_tasks (
  started_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  failed_at TIMESTAMPTZ,

  CONSTRAINT fk_workspace FOREIGN KEY (workspace_id) REFERENCES workspaces(id),
  CONSTRAINT fk_agent FOREIGN KEY (agent_id) REFERENCES agents(id)
);
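The status lifecycle implied by this schema (a `PENDING` default, assignment and run timestamps, abort/retry columns) can be sketched as a small state machine. The transition table below is inferred from this document's lifecycle description and is an assumption, not the shipped rules; `COMPLETED` is inferred from the `completed_at` column.

```typescript
// Illustrative task state machine for task_orchestration_status.
// The allowed transitions are an assumption inferred from this document.
type TaskStatus = "PENDING" | "ASSIGNED" | "RUNNING" | "COMPLETED" | "ABORTED" | "FAILED";

const TRANSITIONS: Record<TaskStatus, TaskStatus[]> = {
  PENDING: ["ASSIGNED"],
  ASSIGNED: ["RUNNING", "ABORTED"],
  RUNNING: ["COMPLETED", "ABORTED", "FAILED"],
  ABORTED: ["PENDING", "FAILED"], // requeue for retry, or give up
  COMPLETED: [],
  FAILED: [],
};

// Returns true when the transition is allowed by the table above
function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

A validator like this is the natural target for the "unit tests for task state machine" listed under Phase 1 testing.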
@@ -227,21 +227,21 @@ CREATE TABLE agent_task_logs (
  workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
  task_id UUID NOT NULL REFERENCES agent_tasks(id) ON DELETE CASCADE,
  agent_id UUID REFERENCES agents(id) ON DELETE SET NULL,

  -- Log Entry
  level task_log_level DEFAULT 'INFO',
  event VARCHAR(100) NOT NULL, -- 'state_transition', 'progress_update', 'error', etc.
  message TEXT,
  details JSONB DEFAULT '{}',

  -- State Snapshot
  previous_status task_orchestration_status,
  new_status task_orchestration_status,

  -- Context
  session_key VARCHAR(255),
  stack_trace TEXT,

  created_at TIMESTAMPTZ DEFAULT NOW()
);
@@ -260,20 +260,20 @@ CREATE TABLE agent_heartbeats (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
  agent_id UUID NOT NULL REFERENCES agents(id) ON DELETE CASCADE,

  -- Health Data
  status VARCHAR(50) NOT NULL, -- 'healthy', 'degraded', 'stale'
  current_task_id UUID REFERENCES agent_tasks(id) ON DELETE SET NULL,
  progress_percent INT,

  -- Resource Usage
  memory_mb INT,
  cpu_percent INT,

  -- Timing
  last_seen_at TIMESTAMPTZ DEFAULT NOW(),
  next_expected_at TIMESTAMPTZ,

  metadata JSONB DEFAULT '{}'
);
@@ -284,7 +284,7 @@ CREATE INDEX idx_heartbeats_stale ON agent_heartbeats(last_seen_at) WHERE status

CREATE OR REPLACE FUNCTION cleanup_old_heartbeats()
RETURNS void AS $$
BEGIN
  DELETE FROM agent_heartbeats
  WHERE last_seen_at < NOW() - INTERVAL '1 hour';
END;
$$ LANGUAGE plpgsql;
@@ -310,6 +310,7 @@ CREATE INDEX idx_agents_coordinator ON agents(coordinator_enabled) WHERE coordin

## Valkey/Redis Key Patterns

Valkey is used for:

- **Real-time state** (fast reads/writes)
- **Pub/Sub messaging** (coordination events)
- **Distributed locks** (prevent race conditions)
@@ -343,9 +344,9 @@ SADD tasks:active:{workspace_id} {task_id}:{session_key}
SETEX agent:heartbeat:{agent_id} 60 "{\"status\":\"running\",\"task_id\":\"{task_id}\",\"timestamp\":{ts}}"

# Agent status (hash)
HSET agent:status:{agent_id}
  status "running"
  current_task "{task_id}"
  last_heartbeat {timestamp}

# Stale agents (sorted set by last heartbeat)
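A reader of the heartbeat key decides staleness by comparing the embedded timestamp against a threshold matching the key's 60s TTL. A minimal sketch, assuming the payload shape shown in the `SETEX` example above:

```typescript
// Shape assumed from the SETEX agent:heartbeat:{agent_id} payload above
interface HeartbeatPayload {
  status: string;
  task_id: string;
  timestamp: number; // epoch milliseconds
}

// An expired key (null) means the agent is stale; otherwise compare ages.
// The 60s default mirrors the key's TTL in this design.
function isStale(raw: string | null, nowMs: number, thresholdMs = 60_000): boolean {
  if (raw === null) return true;
  const hb: HeartbeatPayload = JSON.parse(raw);
  return nowMs - hb.timestamp > thresholdMs;
}
```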
@@ -382,7 +383,7 @@ PUBLISH coordinator:commands:{workspace_id} "{\"command\":\"reassign_task\",\"ta

```redis
# Session context (hash with TTL)
HSET session:context:{session_key}
  workspace_id "{workspace_id}"
  agent_id "{agent_id}"
  task_id "{task_id}"
```
@@ -392,14 +393,14 @@ EXPIRE session:context:{session_key} 3600

### Data Lifecycle

| Key Type             | TTL  | Cleanup Strategy                            |
| -------------------- | ---- | ------------------------------------------- |
| `agent:heartbeat:*`  | 60s  | Auto-expire                                 |
| `agent:status:*`     | None | Delete on agent termination                 |
| `session:context:*`  | 1h   | Auto-expire                                 |
| `tasks:pending:*`    | None | Remove on assignment                        |
| `coordinator:lock:*` | 30s  | Auto-expire (renewed by active coordinator) |
| `task:assign_lock:*` | 5s   | Auto-expire after assignment                |

---
@@ -546,7 +547,7 @@ POST {configured_webhook_url}

export class CoordinatorService {
  private coordinatorLock: string;
  private isRunning: boolean = false;

  constructor(
    private readonly taskManager: TaskManagerService,
    private readonly agentManager: AgentManagerService,
@@ -554,191 +555,190 @@ export class CoordinatorService {
    private readonly prisma: PrismaService,
    private readonly logger: Logger
  ) {}

  // Main coordination loop
  @Cron("*/30 * * * * *") // Every 30 seconds
  async coordinate() {
    if (!(await this.acquireLock())) {
      return; // Another coordinator is active
    }

    try {
      await this.checkAgentHealth();
      await this.assignPendingTasks();
      await this.resolveDependencies();
      await this.recoverFailedTasks();
    } catch (error) {
      this.logger.error("Coordination cycle failed", error);
    } finally {
      await this.releaseLock();
    }
  }
  // Distributed lock to prevent multiple coordinators
  private async acquireLock(): Promise<boolean> {
    const lockKey = `coordinator:lock:global`;
    const result = await this.valkey.set(
      lockKey,
      process.env.HOSTNAME || "coordinator",
      "NX",
      "EX",
      30
    );
    return result === "OK";
  }
  // Check agent heartbeats and mark stale
  private async checkAgentHealth() {
    const agents = await this.agentManager.getCoordinatorManagedAgents();
    const now = Date.now();

    for (const agent of agents) {
      const heartbeatKey = `agent:heartbeat:${agent.id}`;
      const lastHeartbeat = await this.valkey.get(heartbeatKey);

      if (!lastHeartbeat) {
        // No heartbeat - agent is stale
        await this.handleStaleAgent(agent);
      } else {
        const heartbeatData = JSON.parse(lastHeartbeat);
        const age = now - heartbeatData.timestamp;

        if (age > agent.staleThresholdSeconds * 1000) {
          await this.handleStaleAgent(agent);
        }
      }
    }
  }
  // Assign pending tasks to available agents
  private async assignPendingTasks() {
    const workspaces = await this.getActiveWorkspaces();

    for (const workspace of workspaces) {
      const pendingTasks = await this.taskManager.getPendingTasks(workspace.id, {
        orderBy: { priority: "desc", createdAt: "asc" },
      });

      for (const task of pendingTasks) {
        // Check dependencies
        if (!(await this.areDependenciesMet(task))) {
          continue;
        }

        // Find available agent
        const agent = await this.agentManager.findAvailableAgent(
          workspace.id,
          task.taskType,
          task.metadata.requiredCapabilities
        );

        if (agent) {
          await this.assignTask(task, agent);
        }
      }
    }
  }
  // Handle stale agents
  private async handleStaleAgent(agent: Agent) {
    this.logger.warn(`Agent ${agent.id} is stale - recovering tasks`);

    // Mark agent as ERROR
    await this.agentManager.updateAgentStatus(agent.id, AgentStatus.ERROR);

    // Get assigned tasks
    const tasks = await this.taskManager.getTasksForAgent(agent.id);

    for (const task of tasks) {
      await this.recoverTask(task, "agent_stale");
    }
  }
  // Recover a task from failure
  private async recoverTask(task: AgentTask, reason: string) {
    // Log the failure
    await this.taskManager.logTaskEvent(task.id, {
      level: "ERROR",
      event: "task_recovery",
      message: `Task recovery initiated: ${reason}`,
      previousStatus: task.status,
      newStatus: "ABORTED",
    });

    // Check retry limit
    if (task.retryCount >= task.maxRetries) {
      await this.taskManager.updateTask(task.id, {
        status: "FAILED",
        lastError: `Max retries exceeded (${task.retryCount}/${task.maxRetries})`,
        failedAt: new Date(),
      });
      return;
    }

    // Abort current assignment
    await this.taskManager.updateTask(task.id, {
      status: "ABORTED",
      agentId: null,
      sessionKey: null,
      retryCount: task.retryCount + 1,
    });

    // Wait for backoff period before requeuing
    const backoffMs = task.retryBackoffSeconds * 1000 * Math.pow(2, task.retryCount);
    setTimeout(async () => {
      await this.taskManager.updateTask(task.id, {
        status: "PENDING",
      });
    }, backoffMs);
  }
  // Assign task to agent
  private async assignTask(task: AgentTask, agent: Agent) {
    // Acquire assignment lock
    const lockKey = `task:assign_lock:${task.id}`;
    const locked = await this.valkey.set(lockKey, agent.id, "NX", "EX", 5);

    if (!locked) {
      return; // Another coordinator already assigned this task
    }

    try {
      // Update task
      await this.taskManager.updateTask(task.id, {
        status: "ASSIGNED",
        agentId: agent.id,
        assignedAt: new Date(),
      });

      // Spawn agent session via Gateway
      const session = await this.spawnAgentSession(agent, task);

      // Update task with session
      await this.taskManager.updateTask(task.id, {
        sessionKey: session.sessionKey,
        status: "RUNNING",
        startedAt: new Date(),
      });

      // Log assignment
      await this.taskManager.logTaskEvent(task.id, {
        level: "INFO",
        event: "task_assigned",
        message: `Task assigned to agent ${agent.id}`,
        details: { agentId: agent.id, sessionKey: session.sessionKey },
      });
    } finally {
      await this.valkey.del(lockKey);
    }
  }
  // Spawn agent session via Gateway
  private async spawnAgentSession(agent: Agent, task: AgentTask): Promise<AgentSession> {
    // Call Gateway API to spawn subagent with task context
    const response = await fetch(`${process.env.GATEWAY_URL}/api/agents/spawn`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        workspaceId: task.workspaceId,
        agentId: agent.id,
@@ -748,11 +748,11 @@ export class CoordinatorService {
        taskTitle: task.title,
        taskDescription: task.description,
        inputContext: task.inputContext,
        checkpointData: task.checkpointData,
        },
      }),
    });

    const data = await response.json();
    return data.session;
  }
@@ -811,10 +811,12 @@ export class CoordinatorService {
**Scenario:** Agent crashes mid-task.

**Detection:**

- Heartbeat TTL expires in Valkey
- Coordinator detects missing heartbeat

**Recovery:**

1. Mark agent as `ERROR` in database
2. Abort assigned tasks with `status = ABORTED`
3. Log failure with stack trace (if available)
@@ -827,10 +829,12 @@ export class CoordinatorService {
**Scenario:** Gateway restarts, killing all agent sessions.

**Detection:**

- All agent heartbeats stop simultaneously
- Coordinator detects mass stale agents

**Recovery:**

1. Coordinator marks all `RUNNING` tasks as `ABORTED`
2. Tasks with `checkpointData` can resume from last checkpoint
3. Tasks without checkpoints restart from scratch
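Step 2 above (resume from the last checkpoint) could look roughly like this; `buildResumePrompt` and the checkpoint shape are hypothetical illustrations, not the actual Gateway API:

```typescript
// Hypothetical checkpoint shape; fields mirror the checkpointData example
// used elsewhere in this document.
interface Checkpoint {
  filesProcessed?: number;
  errorsFixed?: number;
  remainingFiles?: number;
}

// When a task is respawned after a Gateway restart, tell the new agent where
// the previous run stopped; with no checkpoint, it restarts from scratch.
function buildResumePrompt(taskTitle: string, checkpoint: Checkpoint | null): string {
  if (!checkpoint || Object.keys(checkpoint).length === 0) {
    return `Start task: ${taskTitle}`;
  }
  return `Resume task: ${taskTitle}. Prior progress: ${JSON.stringify(checkpoint)}`;
}
```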
@@ -841,10 +845,12 @@ export class CoordinatorService {
**Scenario:** Task A depends on Task B, which depends on Task A (circular dependency).

**Detection:**

- Coordinator builds dependency graph
- Detects cycles during `resolveDependencies()`

**Recovery:**

1. Log `ERROR` with cycle details
2. Mark all tasks in cycle as `FAILED` with reason `dependency_cycle`
3. Notify workspace owner via webhook
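The cycle detection in `resolveDependencies()` over the `depends_on` arrays can be sketched with a depth-first search; the `TaskNode` shape is assumed, not the actual Prisma model:

```typescript
// Minimal cycle detection over depends_on edges (assumed TaskNode shape)
type TaskNode = { id: string; dependsOn: string[] };

// Returns ids of tasks involved in a dependency cycle (empty if acyclic),
// using recursive DFS with visiting/done marking.
function findCycle(tasks: TaskNode[]): string[] {
  const graph = new Map<string, string[]>(
    tasks.map((t) => [t.id, t.dependsOn] as [string, string[]])
  );
  const state = new Map<string, "visiting" | "done">();
  const cycle: string[] = [];

  const visit = (id: string, path: string[]): boolean => {
    if (state.get(id) === "done") return false;
    if (state.get(id) === "visiting") {
      // Back edge found: everything from the first occurrence forms the cycle
      cycle.push(...path.slice(path.indexOf(id)));
      return true;
    }
    state.set(id, "visiting");
    for (const dep of graph.get(id) ?? []) {
      if (visit(dep, [...path, id])) return true;
    }
    state.set(id, "done");
    return false;
  };

  for (const t of tasks) {
    if (visit(t.id, [])) break;
  }
  return cycle;
}
```

Every id in the returned cycle would then be marked `FAILED` with reason `dependency_cycle`.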
@@ -854,9 +860,11 @@ export class CoordinatorService {
**Scenario:** PostgreSQL becomes unavailable.

**Detection:**

- Prisma query fails with connection error

**Recovery:**

1. Coordinator catches error, logs to stderr
2. Releases lock (allowing failover to another instance)
3. Retries with exponential backoff: 5s, 10s, 20s, 40s
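The 5s/10s/20s/40s schedule in step 3 is plain exponential doubling; a minimal sketch, with a hypothetical helper name:

```typescript
// Illustrative retry helper for the 5s/10s/20s/40s schedule; not from the
// actual codebase. Delays double each attempt starting at baseDelayMs.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  baseDelayMs = 5000,
  maxAttempts = 4
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = baseDelayMs * 2 ** attempt; // 5s, 10s, 20s, ...
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```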
@@ -867,14 +875,17 @@ export class CoordinatorService {
**Scenario:** Network partition causes two coordinators to run simultaneously.

**Prevention:**

- Distributed lock in Valkey with 30s TTL
- Coordinators must renew lock every cycle
- Only one coordinator can hold lock at a time

**Detection:**

- Task assigned to multiple agents (conflict detection)

**Recovery:**

1. Newer assignment wins (based on `assignedAt` timestamp)
2. Cancel older session
3. Log conflict for investigation
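The "newer assignment wins" rule in step 1 reduces to a pure function over the conflicting assignments; the `Assignment` shape is assumed:

```typescript
// Assumed shape of a duplicate assignment record for one task
interface Assignment {
  agentId: string;
  sessionKey: string;
  assignedAt: Date;
}

// Keep the newest assignment and return the stale ones whose sessions
// should be cancelled (assumes at least one assignment is given).
function resolveConflict(assignments: Assignment[]): {
  winner: Assignment;
  toCancel: Assignment[];
} {
  const sorted = [...assignments].sort(
    (a, b) => b.assignedAt.getTime() - a.assignedAt.getTime()
  );
  return { winner: sorted[0], toCancel: sorted.slice(1) };
}
```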
@@ -884,10 +895,12 @@ export class CoordinatorService {
**Scenario:** Task runs longer than `estimatedCompletionAt + grace period`.

**Detection:**

- Coordinator checks `estimatedCompletionAt` field
- If exceeded by >30 minutes, mark as potentially hung

**Recovery:**

1. Send warning to agent session (via pub/sub)
2. If no progress update in 10 minutes, abort task
3. Log timeout error
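The detection predicate above (`estimatedCompletionAt` exceeded by more than 30 minutes) is a timestamp comparison; a sketch with the grace period as stated, and a hypothetical function name:

```typescript
// 30-minute grace period, as described in the detection step above
const GRACE_PERIOD_MS = 30 * 60 * 1000;

// A task with no estimate is never flagged; otherwise it is potentially
// hung once `now` is past the estimate by more than the grace period.
function isPotentiallyHung(
  estimatedCompletionAt: Date | null,
  now: Date = new Date()
): boolean {
  if (!estimatedCompletionAt) return false;
  return now.getTime() - estimatedCompletionAt.getTime() > GRACE_PERIOD_MS;
}
```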
@@ -902,6 +915,7 @@ export class CoordinatorService {
**Goal:** Basic task and agent models, no coordination yet.

**Deliverables:**

- [ ] Database schema migration (tables, indexes)
- [ ] Prisma models for `AgentTask`, `AgentTaskLog`, `AgentHeartbeat`
- [ ] Basic CRUD API endpoints for tasks
@@ -909,6 +923,7 @@ export class CoordinatorService {
- [ ] Manual task assignment (no automation)

**Testing:**

- Unit tests for task state machine
- Integration tests for task CRUD
- Manual testing: create task, assign to agent, complete
@@ -918,6 +933,7 @@ export class CoordinatorService {
**Goal:** Autonomous coordinator with health monitoring.

**Deliverables:**

- [ ] `CoordinatorService` with distributed locking
- [ ] Health monitoring (heartbeat TTL checks)
- [ ] Automatic task assignment to available agents
@@ -925,6 +941,7 @@ export class CoordinatorService {
- [ ] Pub/Sub for coordination events

**Testing:**

- Unit tests for coordinator logic
- Integration tests with Valkey
- Chaos testing: kill agents, verify recovery
@@ -935,6 +952,7 @@ export class CoordinatorService {
**Goal:** Fault-tolerant operation with automatic recovery.

**Deliverables:**

- [ ] Agent failure detection and task recovery
- [ ] Exponential backoff for retries
- [ ] Checkpoint/resume support for long-running tasks
@@ -942,6 +960,7 @@ export class CoordinatorService {
- [ ] Deadlock detection

**Testing:**

- Fault injection: kill agents, restart Gateway
- Dependency cycle testing
- Retry exhaustion testing
@@ -952,6 +971,7 @@ export class CoordinatorService {
**Goal:** Full visibility into orchestration state.

**Deliverables:**

- [ ] Coordinator status dashboard
- [ ] Task progress tracking UI
- [ ] Real-time logs API
@@ -959,6 +979,7 @@ export class CoordinatorService {
- [ ] Webhook integration for external monitoring

**Testing:**

- Load testing with metrics collection
- Dashboard usability testing
- Webhook reliability testing
@@ -968,6 +989,7 @@ export class CoordinatorService {
**Goal:** Production-grade features.

**Deliverables:**

- [ ] Task prioritization algorithms (SJF, priority queues)
- [ ] Agent capability matching (skills-based routing)
- [ ] Task batching (group similar tasks)
@@ -988,11 +1010,11 @@ async getTasks(userId: string, workspaceId: string) {
  const membership = await this.prisma.workspaceMember.findUnique({
    where: { workspaceId_userId: { workspaceId, userId } }
  });

  if (!membership) {
    throw new ForbiddenException('Not a member of this workspace');
  }

  return this.prisma.agentTask.findMany({
    where: { workspaceId }
  });
@@ -1018,15 +1040,15 @@ async getTasks(userId: string, workspaceId: string) {

### Key Metrics

| Metric                           | Description                       | Alert Threshold |
| -------------------------------- | --------------------------------- | --------------- |
| `coordinator.cycle.duration_ms`  | Coordination cycle execution time | >5000ms         |
| `coordinator.stale_agents.count` | Number of stale agents detected   | >5              |
| `tasks.pending.count`            | Tasks waiting for assignment      | >50             |
| `tasks.failed.count`             | Total failed tasks (last 1h)      | >10             |
| `tasks.retry.exhausted.count`    | Tasks exceeding max retries       | >0              |
| `agents.spawned.count`           | Agent spawn rate                  | >100/min        |
| `valkey.connection.errors`       | Valkey connection failures        | >0              |

### Health Checks
@@ -1053,25 +1075,25 @@ GET /health/coordinator

```typescript
// Main agent creates a development task
const task = await fetch("/api/v1/agent-tasks", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${sessionToken}`,
  },
  body: JSON.stringify({
    title: "Fix TypeScript strict errors in U-Connect",
    description: "Run tsc --noEmit, fix all errors, commit changes",
    taskType: "development",
    priority: 8,
    inputContext: {
      repository: "u-connect",
      branch: "main",
      commands: ["pnpm install", "pnpm tsc:check"],
    },
    maxRetries: 2,
    estimatedDurationMinutes: 30,
  }),
});

const { id } = await task.json();
```
@@ -1084,19 +1106,19 @@ console.log(`Task created: ${id}`);

```typescript
// Subagent sends heartbeat every 30s
setInterval(async () => {
  await fetch(`/api/v1/agents/${agentId}/heartbeat`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${agentToken}`,
    },
    body: JSON.stringify({
      status: "healthy",
      currentTaskId: taskId,
      progressPercent: 45,
      currentStep: "Running tsc --noEmit",
      memoryMb: 512,
      cpuPercent: 35,
    }),
  });
}, 30000);
```
@@ -1106,20 +1128,20 @@ setInterval(async () => {

```typescript
// Agent updates task progress
await fetch(`/api/v1/agent-tasks/${taskId}/progress`, {
  method: "PATCH",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${agentToken}`,
  },
  body: JSON.stringify({
    progressPercent: 70,
    currentStep: "Fixing type errors in packages/shared",
    checkpointData: {
      filesProcessed: 15,
      errorsFixed: 8,
      remainingFiles: 5,
    },
  }),
});
```
@@ -1128,20 +1150,20 @@ await fetch(`/api/v1/agent-tasks/${taskId}/progress`, {

```typescript
// Agent marks task complete
await fetch(`/api/v1/agent-tasks/${taskId}/complete`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${agentToken}`,
  },
  body: JSON.stringify({
    outputResult: {
      filesModified: 20,
      errorsFixed: 23,
      commitHash: "abc123",
      buildStatus: "passing",
    },
    summary: "All TypeScript strict errors resolved. Build passing.",
  }),
});
```
@@ -1149,17 +1171,17 @@ await fetch(`/api/v1/agent-tasks/${taskId}/complete`, {

## Glossary

| Term              | Definition                                                         |
| ----------------- | ------------------------------------------------------------------ |
| **Agent**         | Autonomous AI instance (e.g., Claude subagent) that executes tasks |
| **Task**          | Unit of work to be executed by an agent                            |
| **Coordinator**   | Background service that assigns tasks and monitors agent health    |
| **Heartbeat**     | Periodic signal from agent indicating it's alive and working       |
| **Stale Agent**   | Agent that has stopped sending heartbeats (assumed dead)           |
| **Checkpoint**    | Snapshot of task state allowing resumption after failure           |
| **Workspace**     | Tenant isolation boundary (all tasks belong to a workspace)        |
| **Session**       | Gateway-managed connection between user and agent                  |
| **Orchestration** | Automated coordination of multiple agents working on tasks         |

---
@@ -1174,6 +1196,7 @@ await fetch(`/api/v1/agent-tasks/${taskId}/complete`, {
---

**Next Steps:**

1. Review and approve this design document
2. Create GitHub issues for Phase 1 tasks
3. Set up development branch: `feature/agent-orchestration`