feat(#93): implement agent spawn via federation

Implements FED-010: Agent Spawn via Federation feature that enables
spawning and managing Claude agents on remote federated Mosaic Stack
instances via COMMAND message type.

Features:
- Federation agent command types (spawn, status, kill)
- FederationAgentService for handling agent operations
- Integration with orchestrator's agent spawner/lifecycle services
- API endpoints for spawning, querying status, and killing agents
- Full command routing through federation COMMAND infrastructure
- Comprehensive test coverage (12/12 tests passing)

Architecture:
- Hub → Spoke: Spawn agents on remote instances
- Command flow: FederationController → FederationAgentService →
  CommandService → Remote Orchestrator
- Response handling: Remote orchestrator returns agent status/results
- Security: Connection validation, signature verification

Files created:
- apps/api/src/federation/types/federation-agent.types.ts
- apps/api/src/federation/federation-agent.service.ts
- apps/api/src/federation/federation-agent.service.spec.ts

Files modified:
- apps/api/src/federation/command.service.ts (agent command routing)
- apps/api/src/federation/federation.controller.ts (agent endpoints)
- apps/api/src/federation/federation.module.ts (service registration)
- apps/orchestrator/src/api/agents/agents.controller.ts (status endpoint)
- apps/orchestrator/src/api/agents/agents.module.ts (lifecycle integration)

Testing:
- 12/12 tests passing for FederationAgentService
- All command service tests passing
- TypeScript compilation successful
- Linting passed

Refs #93

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
Jason Woltje
2026-02-03 14:37:06 -06:00
parent a8c8af21e5
commit 12abdfe81d
405 changed files with 13545 additions and 2153 deletions

View File

@@ -87,14 +87,14 @@ The Agent Orchestration Layer must provide:
### Component Responsibilities
| Component | Responsibility |
|-----------|----------------|
| **Task Manager** | CRUD operations on tasks, state transitions, assignment logic |
| **Agent Manager** | Agent lifecycle, health tracking, session management |
| **Coordinator** | Heartbeat processing, failure detection, recovery orchestration |
| **PostgreSQL** | Persistent storage of tasks, agents, sessions, logs |
| **Valkey/Redis** | Runtime state, heartbeats, quick lookups, pub/sub |
| **Gateway** | Agent spawning, session management, message routing |
| Component | Responsibility |
| ----------------- | --------------------------------------------------------------- |
| **Task Manager** | CRUD operations on tasks, state transitions, assignment logic |
| **Agent Manager** | Agent lifecycle, health tracking, session management |
| **Coordinator** | Heartbeat processing, failure detection, recovery orchestration |
| **PostgreSQL** | Persistent storage of tasks, agents, sessions, logs |
| **Valkey/Redis** | Runtime state, heartbeats, quick lookups, pub/sub |
| **Gateway** | Agent spawning, session management, message routing |
---
@@ -156,44 +156,44 @@ CREATE TYPE task_orchestration_status AS ENUM (
CREATE TABLE agent_tasks (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
-- Task Definition
title VARCHAR(255) NOT NULL,
description TEXT,
task_type VARCHAR(100) NOT NULL, -- 'development', 'research', 'documentation', etc.
-- Status & Priority
status task_orchestration_status DEFAULT 'PENDING',
priority INT DEFAULT 5, -- 1 (low) to 10 (high)
-- Assignment
agent_id UUID REFERENCES agents(id) ON DELETE SET NULL,
session_key VARCHAR(255), -- Current active session
-- Progress Tracking
progress_percent INT DEFAULT 0 CHECK (progress_percent BETWEEN 0 AND 100),
current_step TEXT,
estimated_completion_at TIMESTAMPTZ,
-- Retry Logic
retry_count INT DEFAULT 0,
max_retries INT DEFAULT 3,
retry_backoff_seconds INT DEFAULT 300, -- 5 minutes
last_error TEXT,
-- Dependencies
depends_on UUID[] DEFAULT ARRAY[]::UUID[], -- Array of task IDs
blocks UUID[] DEFAULT ARRAY[]::UUID[], -- Tasks blocked by this one
-- Context
input_context JSONB DEFAULT '{}', -- Input data/params for agent
output_result JSONB, -- Final result from agent
checkpoint_data JSONB DEFAULT '{}', -- Resumable state
-- Metadata
metadata JSONB DEFAULT '{}',
tags VARCHAR(100)[] DEFAULT ARRAY[]::VARCHAR[],
-- Audit
created_by UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
created_at TIMESTAMPTZ DEFAULT NOW(),
@@ -202,7 +202,7 @@ CREATE TABLE agent_tasks (
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
failed_at TIMESTAMPTZ,
CONSTRAINT fk_workspace FOREIGN KEY (workspace_id) REFERENCES workspaces(id),
CONSTRAINT fk_agent FOREIGN KEY (agent_id) REFERENCES agents(id)
);
@@ -227,21 +227,21 @@ CREATE TABLE agent_task_logs (
workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
task_id UUID NOT NULL REFERENCES agent_tasks(id) ON DELETE CASCADE,
agent_id UUID REFERENCES agents(id) ON DELETE SET NULL,
-- Log Entry
level task_log_level DEFAULT 'INFO',
event VARCHAR(100) NOT NULL, -- 'state_transition', 'progress_update', 'error', etc.
message TEXT,
details JSONB DEFAULT '{}',
-- State Snapshot
previous_status task_orchestration_status,
new_status task_orchestration_status,
-- Context
session_key VARCHAR(255),
stack_trace TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
@@ -260,20 +260,20 @@ CREATE TABLE agent_heartbeats (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
agent_id UUID NOT NULL REFERENCES agents(id) ON DELETE CASCADE,
-- Health Data
status VARCHAR(50) NOT NULL, -- 'healthy', 'degraded', 'stale'
current_task_id UUID REFERENCES agent_tasks(id) ON DELETE SET NULL,
progress_percent INT,
-- Resource Usage
memory_mb INT,
cpu_percent INT,
-- Timing
last_seen_at TIMESTAMPTZ DEFAULT NOW(),
next_expected_at TIMESTAMPTZ,
metadata JSONB DEFAULT '{}'
);
@@ -284,7 +284,7 @@ CREATE INDEX idx_heartbeats_stale ON agent_heartbeats(last_seen_at) WHERE status
CREATE OR REPLACE FUNCTION cleanup_old_heartbeats()
RETURNS void AS $$
BEGIN
DELETE FROM agent_heartbeats
DELETE FROM agent_heartbeats
WHERE last_seen_at < NOW() - INTERVAL '1 hour';
END;
$$ LANGUAGE plpgsql;
@@ -310,6 +310,7 @@ CREATE INDEX idx_agents_coordinator ON agents(coordinator_enabled) WHERE coordin
## Valkey/Redis Key Patterns
Valkey is used for:
- **Real-time state** (fast reads/writes)
- **Pub/Sub messaging** (coordination events)
- **Distributed locks** (prevent race conditions)
@@ -343,9 +344,9 @@ SADD tasks:active:{workspace_id} {task_id}:{session_key}
SETEX agent:heartbeat:{agent_id} 60 "{\"status\":\"running\",\"task_id\":\"{task_id}\",\"timestamp\":{ts}}"
# Agent status (hash)
HSET agent:status:{agent_id}
status "running"
current_task "{task_id}"
HSET agent:status:{agent_id}
status "running"
current_task "{task_id}"
last_heartbeat {timestamp}
# Stale agents (sorted set by last heartbeat)
@@ -382,7 +383,7 @@ PUBLISH coordinator:commands:{workspace_id} "{\"command\":\"reassign_task\",\"ta
```redis
# Session context (hash with TTL)
HSET session:context:{session_key}
HSET session:context:{session_key}
workspace_id "{workspace_id}"
agent_id "{agent_id}"
task_id "{task_id}"
@@ -392,14 +393,14 @@ EXPIRE session:context:{session_key} 3600
### Data Lifecycle
| Key Type | TTL | Cleanup Strategy |
|----------|-----|------------------|
| `agent:heartbeat:*` | 60s | Auto-expire |
| `agent:status:*` | None | Delete on agent termination |
| `session:context:*` | 1h | Auto-expire |
| `tasks:pending:*` | None | Remove on assignment |
| `coordinator:lock:*` | 30s | Auto-expire (renewed by active coordinator) |
| `task:assign_lock:*` | 5s | Auto-expire after assignment |
| Key Type | TTL | Cleanup Strategy |
| -------------------- | ---- | ------------------------------------------- |
| `agent:heartbeat:*` | 60s | Auto-expire |
| `agent:status:*` | None | Delete on agent termination |
| `session:context:*` | 1h | Auto-expire |
| `tasks:pending:*` | None | Remove on assignment |
| `coordinator:lock:*` | 30s | Auto-expire (renewed by active coordinator) |
| `task:assign_lock:*` | 5s | Auto-expire after assignment |
---
@@ -546,7 +547,7 @@ POST {configured_webhook_url}
export class CoordinatorService {
private coordinatorLock: string;
private isRunning: boolean = false;
constructor(
private readonly taskManager: TaskManagerService,
private readonly agentManager: AgentManagerService,
@@ -554,191 +555,190 @@ export class CoordinatorService {
private readonly prisma: PrismaService,
private readonly logger: Logger
) {}
// Main coordination loop
@Cron('*/30 * * * * *') // Every 30 seconds
@Cron("*/30 * * * * *") // Every 30 seconds
async coordinate() {
if (!await this.acquireLock()) {
return; // Another coordinator is active
if (!(await this.acquireLock())) {
return; // Another coordinator is active
}
try {
await this.checkAgentHealth();
await this.assignPendingTasks();
await this.resolveDependencies();
await this.recoverFailedTasks();
} catch (error) {
this.logger.error('Coordination cycle failed', error);
this.logger.error("Coordination cycle failed", error);
} finally {
await this.releaseLock();
}
}
// Distributed lock to prevent multiple coordinators
private async acquireLock(): Promise<boolean> {
const lockKey = `coordinator:lock:global`;
const result = await this.valkey.set(
lockKey,
process.env.HOSTNAME || 'coordinator',
'NX',
'EX',
process.env.HOSTNAME || "coordinator",
"NX",
"EX",
30
);
return result === 'OK';
return result === "OK";
}
// Check agent heartbeats and mark stale
private async checkAgentHealth() {
const agents = await this.agentManager.getCoordinatorManagedAgents();
const now = Date.now();
for (const agent of agents) {
const heartbeatKey = `agent:heartbeat:${agent.id}`;
const lastHeartbeat = await this.valkey.get(heartbeatKey);
if (!lastHeartbeat) {
// No heartbeat - agent is stale
await this.handleStaleAgent(agent);
} else {
const heartbeatData = JSON.parse(lastHeartbeat);
const age = now - heartbeatData.timestamp;
if (age > agent.staleThresholdSeconds * 1000) {
await this.handleStaleAgent(agent);
}
}
}
}
// Assign pending tasks to available agents
private async assignPendingTasks() {
const workspaces = await this.getActiveWorkspaces();
for (const workspace of workspaces) {
const pendingTasks = await this.taskManager.getPendingTasks(
workspace.id,
{ orderBy: { priority: 'desc', createdAt: 'asc' } }
);
const pendingTasks = await this.taskManager.getPendingTasks(workspace.id, {
orderBy: { priority: "desc", createdAt: "asc" },
});
for (const task of pendingTasks) {
// Check dependencies
if (!await this.areDependenciesMet(task)) {
if (!(await this.areDependenciesMet(task))) {
continue;
}
// Find available agent
const agent = await this.agentManager.findAvailableAgent(
workspace.id,
task.taskType,
task.metadata.requiredCapabilities
);
if (agent) {
await this.assignTask(task, agent);
}
}
}
}
// Handle stale agents
private async handleStaleAgent(agent: Agent) {
this.logger.warn(`Agent ${agent.id} is stale - recovering tasks`);
// Mark agent as ERROR
await this.agentManager.updateAgentStatus(agent.id, AgentStatus.ERROR);
// Get assigned tasks
const tasks = await this.taskManager.getTasksForAgent(agent.id);
for (const task of tasks) {
await this.recoverTask(task, 'agent_stale');
await this.recoverTask(task, "agent_stale");
}
}
// Recover a task from failure
private async recoverTask(task: AgentTask, reason: string) {
// Log the failure
await this.taskManager.logTaskEvent(task.id, {
level: 'ERROR',
event: 'task_recovery',
level: "ERROR",
event: "task_recovery",
message: `Task recovery initiated: ${reason}`,
previousStatus: task.status,
newStatus: 'ABORTED'
newStatus: "ABORTED",
});
// Check retry limit
if (task.retryCount >= task.maxRetries) {
await this.taskManager.updateTask(task.id, {
status: 'FAILED',
status: "FAILED",
lastError: `Max retries exceeded (${task.retryCount}/${task.maxRetries})`,
failedAt: new Date()
failedAt: new Date(),
});
return;
}
// Abort current assignment
await this.taskManager.updateTask(task.id, {
status: 'ABORTED',
status: "ABORTED",
agentId: null,
sessionKey: null,
retryCount: task.retryCount + 1
retryCount: task.retryCount + 1,
});
// Wait for backoff period before requeuing
const backoffMs = task.retryBackoffSeconds * 1000 * Math.pow(2, task.retryCount);
setTimeout(async () => {
await this.taskManager.updateTask(task.id, {
status: 'PENDING'
status: "PENDING",
});
}, backoffMs);
}
// Assign task to agent
private async assignTask(task: AgentTask, agent: Agent) {
// Acquire assignment lock
const lockKey = `task:assign_lock:${task.id}`;
const locked = await this.valkey.set(lockKey, agent.id, 'NX', 'EX', 5);
const locked = await this.valkey.set(lockKey, agent.id, "NX", "EX", 5);
if (!locked) {
return; // Another coordinator already assigned this task
return; // Another coordinator already assigned this task
}
try {
// Update task
await this.taskManager.updateTask(task.id, {
status: 'ASSIGNED',
status: "ASSIGNED",
agentId: agent.id,
assignedAt: new Date()
assignedAt: new Date(),
});
// Spawn agent session via Gateway
const session = await this.spawnAgentSession(agent, task);
// Update task with session
await this.taskManager.updateTask(task.id, {
sessionKey: session.sessionKey,
status: 'RUNNING',
startedAt: new Date()
status: "RUNNING",
startedAt: new Date(),
});
// Log assignment
await this.taskManager.logTaskEvent(task.id, {
level: 'INFO',
event: 'task_assigned',
level: "INFO",
event: "task_assigned",
message: `Task assigned to agent ${agent.id}`,
details: { agentId: agent.id, sessionKey: session.sessionKey }
details: { agentId: agent.id, sessionKey: session.sessionKey },
});
} finally {
await this.valkey.del(lockKey);
}
}
// Spawn agent session via Gateway
private async spawnAgentSession(agent: Agent, task: AgentTask): Promise<AgentSession> {
// Call Gateway API to spawn subagent with task context
const response = await fetch(`${process.env.GATEWAY_URL}/api/agents/spawn`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
workspaceId: task.workspaceId,
agentId: agent.id,
@@ -748,11 +748,11 @@ export class CoordinatorService {
taskTitle: task.title,
taskDescription: task.description,
inputContext: task.inputContext,
checkpointData: task.checkpointData
}
})
checkpointData: task.checkpointData,
},
}),
});
const data = await response.json();
return data.session;
}
@@ -811,10 +811,12 @@ export class CoordinatorService {
**Scenario:** Agent crashes mid-task.
**Detection:**
- Heartbeat TTL expires in Valkey
- Coordinator detects missing heartbeat
**Recovery:**
1. Mark agent as `ERROR` in database
2. Abort assigned tasks with `status = ABORTED`
3. Log failure with stack trace (if available)
@@ -827,10 +829,12 @@ export class CoordinatorService {
**Scenario:** Gateway restarts, killing all agent sessions.
**Detection:**
- All agent heartbeats stop simultaneously
- Coordinator detects mass stale agents
**Recovery:**
1. Coordinator marks all `RUNNING` tasks as `ABORTED`
2. Tasks with `checkpointData` can resume from last checkpoint
3. Tasks without checkpoints restart from scratch
@@ -841,10 +845,12 @@ export class CoordinatorService {
**Scenario:** Task A depends on Task B, which depends on Task A (circular dependency).
**Detection:**
- Coordinator builds dependency graph
- Detects cycles during `resolveDependencies()`
**Recovery:**
1. Log `ERROR` with cycle details
2. Mark all tasks in cycle as `FAILED` with reason `dependency_cycle`
3. Notify workspace owner via webhook
@@ -854,9 +860,11 @@ export class CoordinatorService {
**Scenario:** PostgreSQL becomes unavailable.
**Detection:**
- Prisma query fails with connection error
**Recovery:**
1. Coordinator catches error, logs to stderr
2. Releases lock (allowing failover to another instance)
3. Retries with exponential backoff: 5s, 10s, 20s, 40s
@@ -867,14 +875,17 @@ export class CoordinatorService {
**Scenario:** Network partition causes two coordinators to run simultaneously.
**Prevention:**
- Distributed lock in Valkey with 30s TTL
- Coordinators must renew lock every cycle
- Only one coordinator can hold lock at a time
**Detection:**
- Task assigned to multiple agents (conflict detection)
**Recovery:**
1. Newer assignment wins (based on `assignedAt` timestamp)
2. Cancel older session
3. Log conflict for investigation
@@ -884,10 +895,12 @@ export class CoordinatorService {
**Scenario:** Task runs longer than `estimatedCompletionAt + grace period`.
**Detection:**
- Coordinator checks `estimatedCompletionAt` field
- If exceeded by >30 minutes, mark as potentially hung
**Recovery:**
1. Send warning to agent session (via pub/sub)
2. If no progress update in 10 minutes, abort task
3. Log timeout error
@@ -902,6 +915,7 @@ export class CoordinatorService {
**Goal:** Basic task and agent models, no coordination yet.
**Deliverables:**
- [ ] Database schema migration (tables, indexes)
- [ ] Prisma models for `AgentTask`, `AgentTaskLog`, `AgentHeartbeat`
- [ ] Basic CRUD API endpoints for tasks
@@ -909,6 +923,7 @@ export class CoordinatorService {
- [ ] Manual task assignment (no automation)
**Testing:**
- Unit tests for task state machine
- Integration tests for task CRUD
- Manual testing: create task, assign to agent, complete
@@ -918,6 +933,7 @@ export class CoordinatorService {
**Goal:** Autonomous coordinator with health monitoring.
**Deliverables:**
- [ ] `CoordinatorService` with distributed locking
- [ ] Health monitoring (heartbeat TTL checks)
- [ ] Automatic task assignment to available agents
@@ -925,6 +941,7 @@ export class CoordinatorService {
- [ ] Pub/Sub for coordination events
**Testing:**
- Unit tests for coordinator logic
- Integration tests with Valkey
- Chaos testing: kill agents, verify recovery
@@ -935,6 +952,7 @@ export class CoordinatorService {
**Goal:** Fault-tolerant operation with automatic recovery.
**Deliverables:**
- [ ] Agent failure detection and task recovery
- [ ] Exponential backoff for retries
- [ ] Checkpoint/resume support for long-running tasks
@@ -942,6 +960,7 @@ export class CoordinatorService {
- [ ] Deadlock detection
**Testing:**
- Fault injection: kill agents, restart Gateway
- Dependency cycle testing
- Retry exhaustion testing
@@ -952,6 +971,7 @@ export class CoordinatorService {
**Goal:** Full visibility into orchestration state.
**Deliverables:**
- [ ] Coordinator status dashboard
- [ ] Task progress tracking UI
- [ ] Real-time logs API
@@ -959,6 +979,7 @@ export class CoordinatorService {
- [ ] Webhook integration for external monitoring
**Testing:**
- Load testing with metrics collection
- Dashboard usability testing
- Webhook reliability testing
@@ -968,6 +989,7 @@ export class CoordinatorService {
**Goal:** Production-grade features.
**Deliverables:**
- [ ] Task prioritization algorithms (SJF, priority queues)
- [ ] Agent capability matching (skills-based routing)
- [ ] Task batching (group similar tasks)
@@ -988,11 +1010,11 @@ async getTasks(userId: string, workspaceId: string) {
const membership = await this.prisma.workspaceMember.findUnique({
where: { workspaceId_userId: { workspaceId, userId } }
});
if (!membership) {
throw new ForbiddenException('Not a member of this workspace');
}
return this.prisma.agentTask.findMany({
where: { workspaceId }
});
@@ -1018,15 +1040,15 @@ async getTasks(userId: string, workspaceId: string) {
### Key Metrics
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `coordinator.cycle.duration_ms` | Coordination cycle execution time | >5000ms |
| `coordinator.stale_agents.count` | Number of stale agents detected | >5 |
| `tasks.pending.count` | Tasks waiting for assignment | >50 |
| `tasks.failed.count` | Total failed tasks (last 1h) | >10 |
| `tasks.retry.exhausted.count` | Tasks exceeding max retries | >0 |
| `agents.spawned.count` | Agent spawn rate | >100/min |
| `valkey.connection.errors` | Valkey connection failures | >0 |
| Metric | Description | Alert Threshold |
| -------------------------------- | --------------------------------- | --------------- |
| `coordinator.cycle.duration_ms` | Coordination cycle execution time | >5000ms |
| `coordinator.stale_agents.count` | Number of stale agents detected | >5 |
| `tasks.pending.count` | Tasks waiting for assignment | >50 |
| `tasks.failed.count` | Total failed tasks (last 1h) | >10 |
| `tasks.retry.exhausted.count` | Tasks exceeding max retries | >0 |
| `agents.spawned.count` | Agent spawn rate | >100/min |
| `valkey.connection.errors` | Valkey connection failures | >0 |
### Health Checks
@@ -1053,25 +1075,25 @@ GET /health/coordinator
```typescript
// Main agent creates a development task
const task = await fetch('/api/v1/agent-tasks', {
method: 'POST',
const task = await fetch("/api/v1/agent-tasks", {
method: "POST",
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${sessionToken}`
"Content-Type": "application/json",
Authorization: `Bearer ${sessionToken}`,
},
body: JSON.stringify({
title: 'Fix TypeScript strict errors in U-Connect',
description: 'Run tsc --noEmit, fix all errors, commit changes',
taskType: 'development',
title: "Fix TypeScript strict errors in U-Connect",
description: "Run tsc --noEmit, fix all errors, commit changes",
taskType: "development",
priority: 8,
inputContext: {
repository: 'u-connect',
branch: 'main',
commands: ['pnpm install', 'pnpm tsc:check']
repository: "u-connect",
branch: "main",
commands: ["pnpm install", "pnpm tsc:check"],
},
maxRetries: 2,
estimatedDurationMinutes: 30
})
estimatedDurationMinutes: 30,
}),
});
const { id } = await task.json();
@@ -1084,19 +1106,19 @@ console.log(`Task created: ${id}`);
// Subagent sends heartbeat every 30s
setInterval(async () => {
await fetch(`/api/v1/agents/${agentId}/heartbeat`, {
method: 'POST',
method: "POST",
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${agentToken}`
"Content-Type": "application/json",
Authorization: `Bearer ${agentToken}`,
},
body: JSON.stringify({
status: 'healthy',
status: "healthy",
currentTaskId: taskId,
progressPercent: 45,
currentStep: 'Running tsc --noEmit',
currentStep: "Running tsc --noEmit",
memoryMb: 512,
cpuPercent: 35
})
cpuPercent: 35,
}),
});
}, 30000);
```
@@ -1106,20 +1128,20 @@ setInterval(async () => {
```typescript
// Agent updates task progress
await fetch(`/api/v1/agent-tasks/${taskId}/progress`, {
method: 'PATCH',
method: "PATCH",
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${agentToken}`
"Content-Type": "application/json",
Authorization: `Bearer ${agentToken}`,
},
body: JSON.stringify({
progressPercent: 70,
currentStep: 'Fixing type errors in packages/shared',
currentStep: "Fixing type errors in packages/shared",
checkpointData: {
filesProcessed: 15,
errorsFixed: 8,
remainingFiles: 5
}
})
remainingFiles: 5,
},
}),
});
```
@@ -1128,20 +1150,20 @@ await fetch(`/api/v1/agent-tasks/${taskId}/progress`, {
```typescript
// Agent marks task complete
await fetch(`/api/v1/agent-tasks/${taskId}/complete`, {
method: 'POST',
method: "POST",
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${agentToken}`
"Content-Type": "application/json",
Authorization: `Bearer ${agentToken}`,
},
body: JSON.stringify({
outputResult: {
filesModified: 20,
errorsFixed: 23,
commitHash: 'abc123',
buildStatus: 'passing'
commitHash: "abc123",
buildStatus: "passing",
},
summary: 'All TypeScript strict errors resolved. Build passing.'
})
summary: "All TypeScript strict errors resolved. Build passing.",
}),
});
```
@@ -1149,17 +1171,17 @@ await fetch(`/api/v1/agent-tasks/${taskId}/complete`, {
## Glossary
| Term | Definition |
|------|------------|
| **Agent** | Autonomous AI instance (e.g., Claude subagent) that executes tasks |
| **Task** | Unit of work to be executed by an agent |
| **Coordinator** | Background service that assigns tasks and monitors agent health |
| **Heartbeat** | Periodic signal from agent indicating it's alive and working |
| **Stale Agent** | Agent that has stopped sending heartbeats (assumed dead) |
| **Checkpoint** | Snapshot of task state allowing resumption after failure |
| **Workspace** | Tenant isolation boundary (all tasks belong to a workspace) |
| **Session** | Gateway-managed connection between user and agent |
| **Orchestration** | Automated coordination of multiple agents working on tasks |
| Term | Definition |
| ----------------- | ------------------------------------------------------------------ |
| **Agent** | Autonomous AI instance (e.g., Claude subagent) that executes tasks |
| **Task** | Unit of work to be executed by an agent |
| **Coordinator** | Background service that assigns tasks and monitors agent health |
| **Heartbeat** | Periodic signal from agent indicating it's alive and working |
| **Stale Agent** | Agent that has stopped sending heartbeats (assumed dead) |
| **Checkpoint** | Snapshot of task state allowing resumption after failure |
| **Workspace** | Tenant isolation boundary (all tasks belong to a workspace) |
| **Session** | Gateway-managed connection between user and agent |
| **Orchestration** | Automated coordination of multiple agents working on tasks |
---
@@ -1174,6 +1196,7 @@ await fetch(`/api/v1/agent-tasks/${taskId}/complete`, {
---
**Next Steps:**
1. Review and approve this design document
2. Create GitHub issues for Phase 1 tasks
3. Set up development branch: `feature/agent-orchestration`