feat(#93): implement agent spawn via federation
Implements FED-010: Agent Spawn via Federation, enabling spawning and managing Claude agents on remote federated Mosaic Stack instances via the COMMAND message type.

Features:
- Federation agent command types (spawn, status, kill)
- FederationAgentService for handling agent operations
- Integration with orchestrator's agent spawner/lifecycle services
- API endpoints for spawning, querying status, and killing agents
- Full command routing through federation COMMAND infrastructure
- Comprehensive test coverage (12/12 tests passing)

Architecture:
- Hub → Spoke: Spawn agents on remote instances
- Command flow: FederationController → FederationAgentService → CommandService → Remote Orchestrator
- Response handling: Remote orchestrator returns agent status/results
- Security: Connection validation, signature verification

Files created:
- apps/api/src/federation/types/federation-agent.types.ts
- apps/api/src/federation/federation-agent.service.ts
- apps/api/src/federation/federation-agent.service.spec.ts

Files modified:
- apps/api/src/federation/command.service.ts (agent command routing)
- apps/api/src/federation/federation.controller.ts (agent endpoints)
- apps/api/src/federation/federation.module.ts (service registration)
- apps/orchestrator/src/api/agents/agents.controller.ts (status endpoint)
- apps/orchestrator/src/api/agents/agents.module.ts (lifecycle integration)

Testing:
- 12/12 tests passing for FederationAgentService
- All command service tests passing
- TypeScript compilation successful
- Linting passed

Refs #93

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@@ -87,14 +87,14 @@ The Agent Orchestration Layer must provide:

### Component Responsibilities

| Component         | Responsibility                                                  |
| ----------------- | --------------------------------------------------------------- |
| **Task Manager**  | CRUD operations on tasks, state transitions, assignment logic   |
| **Agent Manager** | Agent lifecycle, health tracking, session management            |
| **Coordinator**   | Heartbeat processing, failure detection, recovery orchestration |
| **PostgreSQL**    | Persistent storage of tasks, agents, sessions, logs             |
| **Valkey/Redis**  | Runtime state, heartbeats, quick lookups, pub/sub               |
| **Gateway**       | Agent spawning, session management, message routing             |

---
@@ -156,44 +156,44 @@ CREATE TYPE task_orchestration_status AS ENUM (

CREATE TABLE agent_tasks (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,

  -- Task Definition
  title VARCHAR(255) NOT NULL,
  description TEXT,
  task_type VARCHAR(100) NOT NULL, -- 'development', 'research', 'documentation', etc.

  -- Status & Priority
  status task_orchestration_status DEFAULT 'PENDING',
  priority INT DEFAULT 5, -- 1 (low) to 10 (high)

  -- Assignment
  agent_id UUID REFERENCES agents(id) ON DELETE SET NULL,
  session_key VARCHAR(255), -- Current active session

  -- Progress Tracking
  progress_percent INT DEFAULT 0 CHECK (progress_percent BETWEEN 0 AND 100),
  current_step TEXT,
  estimated_completion_at TIMESTAMPTZ,

  -- Retry Logic
  retry_count INT DEFAULT 0,
  max_retries INT DEFAULT 3,
  retry_backoff_seconds INT DEFAULT 300, -- 5 minutes
  last_error TEXT,

  -- Dependencies
  depends_on UUID[] DEFAULT ARRAY[]::UUID[], -- Array of task IDs
  blocks UUID[] DEFAULT ARRAY[]::UUID[], -- Tasks blocked by this one

  -- Context
  input_context JSONB DEFAULT '{}', -- Input data/params for agent
  output_result JSONB, -- Final result from agent
  checkpoint_data JSONB DEFAULT '{}', -- Resumable state

  -- Metadata
  metadata JSONB DEFAULT '{}',
  tags VARCHAR(100)[] DEFAULT ARRAY[]::VARCHAR[],

  -- Audit
  created_by UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  created_at TIMESTAMPTZ DEFAULT NOW(),
@@ -202,7 +202,7 @@ CREATE TABLE agent_tasks (
  started_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  failed_at TIMESTAMPTZ,

  CONSTRAINT fk_workspace FOREIGN KEY (workspace_id) REFERENCES workspaces(id),
  CONSTRAINT fk_agent FOREIGN KEY (agent_id) REFERENCES agents(id)
);
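The status lifecycle implied by this schema (a `PENDING` default, assignment and run timestamps, abort/retry columns) can be sketched as a small state machine. The transition table below is inferred from this document's lifecycle description and is an assumption, not the shipped rules; `COMPLETED` is inferred from the `completed_at` column.

```typescript
// Illustrative task state machine for task_orchestration_status.
// The allowed transitions are an assumption inferred from this document.
type TaskStatus = "PENDING" | "ASSIGNED" | "RUNNING" | "COMPLETED" | "ABORTED" | "FAILED";

const TRANSITIONS: Record<TaskStatus, TaskStatus[]> = {
  PENDING: ["ASSIGNED"],
  ASSIGNED: ["RUNNING", "ABORTED"],
  RUNNING: ["COMPLETED", "ABORTED", "FAILED"],
  ABORTED: ["PENDING", "FAILED"], // requeue for retry, or give up
  COMPLETED: [],
  FAILED: [],
};

// Returns true when the transition is allowed by the table above
function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

A validator like this is the natural target for the "unit tests for task state machine" listed under Phase 1 testing.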
@@ -227,21 +227,21 @@ CREATE TABLE agent_task_logs (
  workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
  task_id UUID NOT NULL REFERENCES agent_tasks(id) ON DELETE CASCADE,
  agent_id UUID REFERENCES agents(id) ON DELETE SET NULL,

  -- Log Entry
  level task_log_level DEFAULT 'INFO',
  event VARCHAR(100) NOT NULL, -- 'state_transition', 'progress_update', 'error', etc.
  message TEXT,
  details JSONB DEFAULT '{}',

  -- State Snapshot
  previous_status task_orchestration_status,
  new_status task_orchestration_status,

  -- Context
  session_key VARCHAR(255),
  stack_trace TEXT,

  created_at TIMESTAMPTZ DEFAULT NOW()
);
@@ -260,20 +260,20 @@ CREATE TABLE agent_heartbeats (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
  agent_id UUID NOT NULL REFERENCES agents(id) ON DELETE CASCADE,

  -- Health Data
  status VARCHAR(50) NOT NULL, -- 'healthy', 'degraded', 'stale'
  current_task_id UUID REFERENCES agent_tasks(id) ON DELETE SET NULL,
  progress_percent INT,

  -- Resource Usage
  memory_mb INT,
  cpu_percent INT,

  -- Timing
  last_seen_at TIMESTAMPTZ DEFAULT NOW(),
  next_expected_at TIMESTAMPTZ,

  metadata JSONB DEFAULT '{}'
);
@@ -284,7 +284,7 @@ CREATE INDEX idx_heartbeats_stale ON agent_heartbeats(last_seen_at) WHERE status

CREATE OR REPLACE FUNCTION cleanup_old_heartbeats()
RETURNS void AS $$
BEGIN
  DELETE FROM agent_heartbeats
  WHERE last_seen_at < NOW() - INTERVAL '1 hour';
END;
$$ LANGUAGE plpgsql;
@@ -310,6 +310,7 @@ CREATE INDEX idx_agents_coordinator ON agents(coordinator_enabled) WHERE coordin

## Valkey/Redis Key Patterns

Valkey is used for:

- **Real-time state** (fast reads/writes)
- **Pub/Sub messaging** (coordination events)
- **Distributed locks** (prevent race conditions)
@@ -343,9 +344,9 @@ SADD tasks:active:{workspace_id} {task_id}:{session_key}
SETEX agent:heartbeat:{agent_id} 60 "{\"status\":\"running\",\"task_id\":\"{task_id}\",\"timestamp\":{ts}}"

# Agent status (hash)
HSET agent:status:{agent_id}
  status "running"
  current_task "{task_id}"
  last_heartbeat {timestamp}

# Stale agents (sorted set by last heartbeat)
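A reader of the heartbeat key decides staleness by comparing the embedded timestamp against a threshold matching the key's 60s TTL. A minimal sketch, assuming the payload shape shown in the `SETEX` example above:

```typescript
// Shape assumed from the SETEX agent:heartbeat:{agent_id} payload above
interface HeartbeatPayload {
  status: string;
  task_id: string;
  timestamp: number; // epoch milliseconds
}

// An expired key (null) means the agent is stale; otherwise compare ages.
// The 60s default mirrors the key's TTL in this design.
function isStale(raw: string | null, nowMs: number, thresholdMs = 60_000): boolean {
  if (raw === null) return true;
  const hb: HeartbeatPayload = JSON.parse(raw);
  return nowMs - hb.timestamp > thresholdMs;
}
```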
@@ -382,7 +383,7 @@ PUBLISH coordinator:commands:{workspace_id} "{\"command\":\"reassign_task\",\"ta

```redis
# Session context (hash with TTL)
HSET session:context:{session_key}
  workspace_id "{workspace_id}"
  agent_id "{agent_id}"
  task_id "{task_id}"
```
@@ -392,14 +393,14 @@ EXPIRE session:context:{session_key} 3600

### Data Lifecycle

| Key Type             | TTL  | Cleanup Strategy                            |
| -------------------- | ---- | ------------------------------------------- |
| `agent:heartbeat:*`  | 60s  | Auto-expire                                 |
| `agent:status:*`     | None | Delete on agent termination                 |
| `session:context:*`  | 1h   | Auto-expire                                 |
| `tasks:pending:*`    | None | Remove on assignment                        |
| `coordinator:lock:*` | 30s  | Auto-expire (renewed by active coordinator) |
| `task:assign_lock:*` | 5s   | Auto-expire after assignment                |

---
@@ -546,7 +547,7 @@ POST {configured_webhook_url}

export class CoordinatorService {
  private coordinatorLock: string;
  private isRunning: boolean = false;

  constructor(
    private readonly taskManager: TaskManagerService,
    private readonly agentManager: AgentManagerService,
@@ -554,191 +555,190 @@ export class CoordinatorService {
    private readonly prisma: PrismaService,
    private readonly logger: Logger
  ) {}

  // Main coordination loop
  @Cron("*/30 * * * * *") // Every 30 seconds
  async coordinate() {
    if (!(await this.acquireLock())) {
      return; // Another coordinator is active
    }

    try {
      await this.checkAgentHealth();
      await this.assignPendingTasks();
      await this.resolveDependencies();
      await this.recoverFailedTasks();
    } catch (error) {
      this.logger.error("Coordination cycle failed", error);
    } finally {
      await this.releaseLock();
    }
  }
  // Distributed lock to prevent multiple coordinators
  private async acquireLock(): Promise<boolean> {
    const lockKey = `coordinator:lock:global`;
    const result = await this.valkey.set(
      lockKey,
      process.env.HOSTNAME || "coordinator",
      "NX",
      "EX",
      30
    );
    return result === "OK";
  }
  // Check agent heartbeats and mark stale
  private async checkAgentHealth() {
    const agents = await this.agentManager.getCoordinatorManagedAgents();
    const now = Date.now();

    for (const agent of agents) {
      const heartbeatKey = `agent:heartbeat:${agent.id}`;
      const lastHeartbeat = await this.valkey.get(heartbeatKey);

      if (!lastHeartbeat) {
        // No heartbeat - agent is stale
        await this.handleStaleAgent(agent);
      } else {
        const heartbeatData = JSON.parse(lastHeartbeat);
        const age = now - heartbeatData.timestamp;

        if (age > agent.staleThresholdSeconds * 1000) {
          await this.handleStaleAgent(agent);
        }
      }
    }
  }
  // Assign pending tasks to available agents
  private async assignPendingTasks() {
    const workspaces = await this.getActiveWorkspaces();

    for (const workspace of workspaces) {
      const pendingTasks = await this.taskManager.getPendingTasks(workspace.id, {
        orderBy: { priority: "desc", createdAt: "asc" },
      });

      for (const task of pendingTasks) {
        // Check dependencies
        if (!(await this.areDependenciesMet(task))) {
          continue;
        }

        // Find available agent
        const agent = await this.agentManager.findAvailableAgent(
          workspace.id,
          task.taskType,
          task.metadata.requiredCapabilities
        );

        if (agent) {
          await this.assignTask(task, agent);
        }
      }
    }
  }
  // Handle stale agents
  private async handleStaleAgent(agent: Agent) {
    this.logger.warn(`Agent ${agent.id} is stale - recovering tasks`);

    // Mark agent as ERROR
    await this.agentManager.updateAgentStatus(agent.id, AgentStatus.ERROR);

    // Get assigned tasks
    const tasks = await this.taskManager.getTasksForAgent(agent.id);

    for (const task of tasks) {
      await this.recoverTask(task, "agent_stale");
    }
  }
  // Recover a task from failure
  private async recoverTask(task: AgentTask, reason: string) {
    // Log the failure
    await this.taskManager.logTaskEvent(task.id, {
      level: "ERROR",
      event: "task_recovery",
      message: `Task recovery initiated: ${reason}`,
      previousStatus: task.status,
      newStatus: "ABORTED",
    });

    // Check retry limit
    if (task.retryCount >= task.maxRetries) {
      await this.taskManager.updateTask(task.id, {
        status: "FAILED",
        lastError: `Max retries exceeded (${task.retryCount}/${task.maxRetries})`,
        failedAt: new Date(),
      });
      return;
    }

    // Abort current assignment
    await this.taskManager.updateTask(task.id, {
      status: "ABORTED",
      agentId: null,
      sessionKey: null,
      retryCount: task.retryCount + 1,
    });

    // Wait for backoff period before requeuing
    const backoffMs = task.retryBackoffSeconds * 1000 * Math.pow(2, task.retryCount);
    setTimeout(async () => {
      await this.taskManager.updateTask(task.id, {
        status: "PENDING",
      });
    }, backoffMs);
  }
  // Assign task to agent
  private async assignTask(task: AgentTask, agent: Agent) {
    // Acquire assignment lock
    const lockKey = `task:assign_lock:${task.id}`;
    const locked = await this.valkey.set(lockKey, agent.id, "NX", "EX", 5);

    if (!locked) {
      return; // Another coordinator already assigned this task
    }

    try {
      // Update task
      await this.taskManager.updateTask(task.id, {
        status: "ASSIGNED",
        agentId: agent.id,
        assignedAt: new Date(),
      });

      // Spawn agent session via Gateway
      const session = await this.spawnAgentSession(agent, task);

      // Update task with session
      await this.taskManager.updateTask(task.id, {
        sessionKey: session.sessionKey,
        status: "RUNNING",
        startedAt: new Date(),
      });

      // Log assignment
      await this.taskManager.logTaskEvent(task.id, {
        level: "INFO",
        event: "task_assigned",
        message: `Task assigned to agent ${agent.id}`,
        details: { agentId: agent.id, sessionKey: session.sessionKey },
      });
    } finally {
      await this.valkey.del(lockKey);
    }
  }
  // Spawn agent session via Gateway
  private async spawnAgentSession(agent: Agent, task: AgentTask): Promise<AgentSession> {
    // Call Gateway API to spawn subagent with task context
    const response = await fetch(`${process.env.GATEWAY_URL}/api/agents/spawn`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        workspaceId: task.workspaceId,
        agentId: agent.id,
@@ -748,11 +748,11 @@ export class CoordinatorService {
        taskTitle: task.title,
        taskDescription: task.description,
        inputContext: task.inputContext,
        checkpointData: task.checkpointData,
        },
      }),
    });

    const data = await response.json();
    return data.session;
  }
@@ -811,10 +811,12 @@ export class CoordinatorService {
**Scenario:** Agent crashes mid-task.

**Detection:**

- Heartbeat TTL expires in Valkey
- Coordinator detects missing heartbeat

**Recovery:**

1. Mark agent as `ERROR` in database
2. Abort assigned tasks with `status = ABORTED`
3. Log failure with stack trace (if available)
@@ -827,10 +829,12 @@ export class CoordinatorService {
**Scenario:** Gateway restarts, killing all agent sessions.

**Detection:**

- All agent heartbeats stop simultaneously
- Coordinator detects mass stale agents

**Recovery:**

1. Coordinator marks all `RUNNING` tasks as `ABORTED`
2. Tasks with `checkpointData` can resume from last checkpoint
3. Tasks without checkpoints restart from scratch
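Step 2 above (resume from the last checkpoint) could look roughly like this; `buildResumePrompt` and the checkpoint shape are hypothetical illustrations, not the actual Gateway API:

```typescript
// Hypothetical checkpoint shape; fields mirror the checkpointData example
// used elsewhere in this document.
interface Checkpoint {
  filesProcessed?: number;
  errorsFixed?: number;
  remainingFiles?: number;
}

// When a task is respawned after a Gateway restart, tell the new agent where
// the previous run stopped; with no checkpoint, it restarts from scratch.
function buildResumePrompt(taskTitle: string, checkpoint: Checkpoint | null): string {
  if (!checkpoint || Object.keys(checkpoint).length === 0) {
    return `Start task: ${taskTitle}`;
  }
  return `Resume task: ${taskTitle}. Prior progress: ${JSON.stringify(checkpoint)}`;
}
```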
@@ -841,10 +845,12 @@ export class CoordinatorService {
**Scenario:** Task A depends on Task B, which depends on Task A (circular dependency).

**Detection:**

- Coordinator builds dependency graph
- Detects cycles during `resolveDependencies()`

**Recovery:**

1. Log `ERROR` with cycle details
2. Mark all tasks in cycle as `FAILED` with reason `dependency_cycle`
3. Notify workspace owner via webhook
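The cycle detection in `resolveDependencies()` over the `depends_on` arrays can be sketched with a depth-first search; the `TaskNode` shape is assumed, not the actual Prisma model:

```typescript
// Minimal cycle detection over depends_on edges (assumed TaskNode shape)
type TaskNode = { id: string; dependsOn: string[] };

// Returns ids of tasks involved in a dependency cycle (empty if acyclic),
// using recursive DFS with visiting/done marking.
function findCycle(tasks: TaskNode[]): string[] {
  const graph = new Map<string, string[]>(
    tasks.map((t) => [t.id, t.dependsOn] as [string, string[]])
  );
  const state = new Map<string, "visiting" | "done">();
  const cycle: string[] = [];

  const visit = (id: string, path: string[]): boolean => {
    if (state.get(id) === "done") return false;
    if (state.get(id) === "visiting") {
      // Back edge found: everything from the first occurrence forms the cycle
      cycle.push(...path.slice(path.indexOf(id)));
      return true;
    }
    state.set(id, "visiting");
    for (const dep of graph.get(id) ?? []) {
      if (visit(dep, [...path, id])) return true;
    }
    state.set(id, "done");
    return false;
  };

  for (const t of tasks) {
    if (visit(t.id, [])) break;
  }
  return cycle;
}
```

Every id in the returned cycle would then be marked `FAILED` with reason `dependency_cycle`.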
@@ -854,9 +860,11 @@ export class CoordinatorService {
**Scenario:** PostgreSQL becomes unavailable.

**Detection:**

- Prisma query fails with connection error

**Recovery:**

1. Coordinator catches error, logs to stderr
2. Releases lock (allowing failover to another instance)
3. Retries with exponential backoff: 5s, 10s, 20s, 40s
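The 5s/10s/20s/40s schedule in step 3 is plain exponential doubling; a minimal sketch, with a hypothetical helper name:

```typescript
// Illustrative retry helper for the 5s/10s/20s/40s schedule; not from the
// actual codebase. Delays double each attempt starting at baseDelayMs.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  baseDelayMs = 5000,
  maxAttempts = 4
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = baseDelayMs * 2 ** attempt; // 5s, 10s, 20s, ...
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```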
@@ -867,14 +875,17 @@ export class CoordinatorService {
**Scenario:** Network partition causes two coordinators to run simultaneously.

**Prevention:**

- Distributed lock in Valkey with 30s TTL
- Coordinators must renew lock every cycle
- Only one coordinator can hold lock at a time

**Detection:**

- Task assigned to multiple agents (conflict detection)

**Recovery:**

1. Newer assignment wins (based on `assignedAt` timestamp)
2. Cancel older session
3. Log conflict for investigation
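The "newer assignment wins" rule in step 1 reduces to a pure function over the conflicting assignments; the `Assignment` shape is assumed:

```typescript
// Assumed shape of a duplicate assignment record for one task
interface Assignment {
  agentId: string;
  sessionKey: string;
  assignedAt: Date;
}

// Keep the newest assignment and return the stale ones whose sessions
// should be cancelled (assumes at least one assignment is given).
function resolveConflict(assignments: Assignment[]): {
  winner: Assignment;
  toCancel: Assignment[];
} {
  const sorted = [...assignments].sort(
    (a, b) => b.assignedAt.getTime() - a.assignedAt.getTime()
  );
  return { winner: sorted[0], toCancel: sorted.slice(1) };
}
```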
@@ -884,10 +895,12 @@ export class CoordinatorService {
**Scenario:** Task runs longer than `estimatedCompletionAt + grace period`.

**Detection:**

- Coordinator checks `estimatedCompletionAt` field
- If exceeded by >30 minutes, mark as potentially hung

**Recovery:**

1. Send warning to agent session (via pub/sub)
2. If no progress update in 10 minutes, abort task
3. Log timeout error
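The detection predicate above (`estimatedCompletionAt` exceeded by more than 30 minutes) is a timestamp comparison; a sketch with the grace period as stated, and a hypothetical function name:

```typescript
// 30-minute grace period, as described in the detection step above
const GRACE_PERIOD_MS = 30 * 60 * 1000;

// A task with no estimate is never flagged; otherwise it is potentially
// hung once `now` is past the estimate by more than the grace period.
function isPotentiallyHung(
  estimatedCompletionAt: Date | null,
  now: Date = new Date()
): boolean {
  if (!estimatedCompletionAt) return false;
  return now.getTime() - estimatedCompletionAt.getTime() > GRACE_PERIOD_MS;
}
```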
@@ -902,6 +915,7 @@ export class CoordinatorService {
**Goal:** Basic task and agent models, no coordination yet.

**Deliverables:**

- [ ] Database schema migration (tables, indexes)
- [ ] Prisma models for `AgentTask`, `AgentTaskLog`, `AgentHeartbeat`
- [ ] Basic CRUD API endpoints for tasks
@@ -909,6 +923,7 @@ export class CoordinatorService {
- [ ] Manual task assignment (no automation)

**Testing:**

- Unit tests for task state machine
- Integration tests for task CRUD
- Manual testing: create task, assign to agent, complete
@@ -918,6 +933,7 @@ export class CoordinatorService {
**Goal:** Autonomous coordinator with health monitoring.

**Deliverables:**

- [ ] `CoordinatorService` with distributed locking
- [ ] Health monitoring (heartbeat TTL checks)
- [ ] Automatic task assignment to available agents
@@ -925,6 +941,7 @@ export class CoordinatorService {
- [ ] Pub/Sub for coordination events

**Testing:**

- Unit tests for coordinator logic
- Integration tests with Valkey
- Chaos testing: kill agents, verify recovery
@@ -935,6 +952,7 @@ export class CoordinatorService {
**Goal:** Fault-tolerant operation with automatic recovery.

**Deliverables:**

- [ ] Agent failure detection and task recovery
- [ ] Exponential backoff for retries
- [ ] Checkpoint/resume support for long-running tasks
@@ -942,6 +960,7 @@ export class CoordinatorService {
- [ ] Deadlock detection

**Testing:**

- Fault injection: kill agents, restart Gateway
- Dependency cycle testing
- Retry exhaustion testing
@@ -952,6 +971,7 @@ export class CoordinatorService {
**Goal:** Full visibility into orchestration state.

**Deliverables:**

- [ ] Coordinator status dashboard
- [ ] Task progress tracking UI
- [ ] Real-time logs API
@@ -959,6 +979,7 @@ export class CoordinatorService {
- [ ] Webhook integration for external monitoring

**Testing:**

- Load testing with metrics collection
- Dashboard usability testing
- Webhook reliability testing
@@ -968,6 +989,7 @@ export class CoordinatorService {
**Goal:** Production-grade features.

**Deliverables:**

- [ ] Task prioritization algorithms (SJF, priority queues)
- [ ] Agent capability matching (skills-based routing)
- [ ] Task batching (group similar tasks)
@@ -988,11 +1010,11 @@ async getTasks(userId: string, workspaceId: string) {
  const membership = await this.prisma.workspaceMember.findUnique({
    where: { workspaceId_userId: { workspaceId, userId } }
  });

  if (!membership) {
    throw new ForbiddenException('Not a member of this workspace');
  }

  return this.prisma.agentTask.findMany({
    where: { workspaceId }
  });
@@ -1018,15 +1040,15 @@ async getTasks(userId: string, workspaceId: string) {

### Key Metrics

| Metric                           | Description                       | Alert Threshold |
| -------------------------------- | --------------------------------- | --------------- |
| `coordinator.cycle.duration_ms`  | Coordination cycle execution time | >5000ms         |
| `coordinator.stale_agents.count` | Number of stale agents detected   | >5              |
| `tasks.pending.count`            | Tasks waiting for assignment      | >50             |
| `tasks.failed.count`             | Total failed tasks (last 1h)      | >10             |
| `tasks.retry.exhausted.count`    | Tasks exceeding max retries       | >0              |
| `agents.spawned.count`           | Agent spawn rate                  | >100/min        |
| `valkey.connection.errors`       | Valkey connection failures        | >0              |

### Health Checks
@@ -1053,25 +1075,25 @@ GET /health/coordinator

```typescript
// Main agent creates a development task
const task = await fetch("/api/v1/agent-tasks", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${sessionToken}`,
  },
  body: JSON.stringify({
    title: "Fix TypeScript strict errors in U-Connect",
    description: "Run tsc --noEmit, fix all errors, commit changes",
    taskType: "development",
    priority: 8,
    inputContext: {
      repository: "u-connect",
      branch: "main",
      commands: ["pnpm install", "pnpm tsc:check"],
    },
    maxRetries: 2,
    estimatedDurationMinutes: 30,
  }),
});

const { id } = await task.json();
```
@@ -1084,19 +1106,19 @@ console.log(`Task created: ${id}`);

```typescript
// Subagent sends heartbeat every 30s
setInterval(async () => {
  await fetch(`/api/v1/agents/${agentId}/heartbeat`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${agentToken}`,
    },
    body: JSON.stringify({
      status: "healthy",
      currentTaskId: taskId,
      progressPercent: 45,
      currentStep: "Running tsc --noEmit",
      memoryMb: 512,
      cpuPercent: 35,
    }),
  });
}, 30000);
```
@@ -1106,20 +1128,20 @@ setInterval(async () => {

```typescript
// Agent updates task progress
await fetch(`/api/v1/agent-tasks/${taskId}/progress`, {
  method: "PATCH",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${agentToken}`,
  },
  body: JSON.stringify({
    progressPercent: 70,
    currentStep: "Fixing type errors in packages/shared",
    checkpointData: {
      filesProcessed: 15,
      errorsFixed: 8,
      remainingFiles: 5,
    },
  }),
});
```
@@ -1128,20 +1150,20 @@ await fetch(`/api/v1/agent-tasks/${taskId}/progress`, {

```typescript
// Agent marks task complete
await fetch(`/api/v1/agent-tasks/${taskId}/complete`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${agentToken}`,
  },
  body: JSON.stringify({
    outputResult: {
      filesModified: 20,
      errorsFixed: 23,
      commitHash: "abc123",
      buildStatus: "passing",
    },
    summary: "All TypeScript strict errors resolved. Build passing.",
  }),
});
```
@@ -1149,17 +1171,17 @@ await fetch(`/api/v1/agent-tasks/${taskId}/complete`, {

## Glossary

| Term              | Definition                                                         |
| ----------------- | ------------------------------------------------------------------ |
| **Agent**         | Autonomous AI instance (e.g., Claude subagent) that executes tasks |
| **Task**          | Unit of work to be executed by an agent                            |
| **Coordinator**   | Background service that assigns tasks and monitors agent health    |
| **Heartbeat**     | Periodic signal from agent indicating it's alive and working       |
| **Stale Agent**   | Agent that has stopped sending heartbeats (assumed dead)           |
| **Checkpoint**    | Snapshot of task state allowing resumption after failure           |
| **Workspace**     | Tenant isolation boundary (all tasks belong to a workspace)        |
| **Session**       | Gateway-managed connection between user and agent                  |
| **Orchestration** | Automated coordination of multiple agents working on tasks         |

---
@@ -1174,6 +1196,7 @@ await fetch(`/api/v1/agent-tasks/${taskId}/complete`, {
---

**Next Steps:**

1. Review and approve this design document
2. Create GitHub issues for Phase 1 tasks
3. Set up development branch: `feature/agent-orchestration`