Commit Graph

53 Commits

Author SHA1 Message Date
e20aea99b9 test(#344): Add comprehensive tests for CI operations service
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
- Add 52 tests achieving 99.3% coverage
- Test all public methods: getLatestPipeline, getPipeline, waitForPipeline, getPipelineLogs
- Test auto-diagnosis for all failure categories
- Test pipeline parsing and status handling
- Mock ConfigService and child_process exec
- All tests passing with >85% coverage requirement met

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-07 11:27:35 -06:00
7feb686d73 feat(#344): Add CI operations service to orchestrator
- Add CIOperationsService for Woodpecker CI integration
- Add types for pipeline status, failure diagnosis
- Add waitForPipeline with auto-diagnosis on failure
- Add getPipelineLogs for log retrieval
- Integrate CIModule into orchestrator app

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-07 11:21:38 -06:00
Jason Woltje
dfef71b660 fix(CQ-ORCH-10): Make BullMQ job retention configurable via env vars
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Replace hardcoded BullMQ job retention values (completed: 100 jobs / 1h,
failed: 1000 jobs / 24h) with configurable env vars to prevent memory
growth under load. Adds QUEUE_COMPLETED_RETENTION_COUNT,
QUEUE_COMPLETED_RETENTION_AGE_S, QUEUE_FAILED_RETENTION_COUNT, and
QUEUE_FAILED_RETENTION_AGE_S to orchestrator config. Defaults preserve
existing behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 15:25:55 -06:00
Jason Woltje
6934d9261c fix(SEC-ORCH-30): Add unique suffix to container names
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Add crypto.randomBytes(4) hex suffix to container name generation
to prevent name collisions when multiple agents spawn simultaneously
within the same millisecond. Container names now include both a
timestamp and 8 random hex characters for guaranteed uniqueness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 15:22:12 -06:00
Jason Woltje
3880993b60 fix(SEC-ORCH-28+29): Add Valkey connection timeout + workItems MaxLength
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
SEC-ORCH-28: Add connectTimeout (5000ms default) and commandTimeout
(3000ms default) to Valkey/Redis client to prevent indefinite connection
hangs. Both are configurable via VALKEY_CONNECT_TIMEOUT_MS and
VALKEY_COMMAND_TIMEOUT_MS environment variables.

SEC-ORCH-29: Add @ArrayMaxSize(50) and @MaxLength(2000) to workItems
in AgentContextDto to prevent memory exhaustion from unbounded input.
Also adds @ArrayMaxSize(20) and @MaxLength(200) to skills array.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 15:19:44 -06:00
Jason Woltje
92c310333c fix(SEC-REVIEW-4-7): Address remaining MEDIUM security review findings
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
- Graceful container shutdown: detect "not running" containers and skip
  force-remove escalation, only SIGKILL for genuine stop failures
- data: URI stripping: add security audit logging via NestJS Logger
  when data: URIs are blocked in markdown links and images
- Orchestrator bootstrap: replace void bootstrap() with .catch() handler
  for clear startup failure logging and clean process.exit(1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 14:51:22 -06:00
Jason Woltje
433212e00f test(CQ-ORCH-9): Add SpawnAgentDto validation tests
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Adds 23 dedicated DTO-level validation tests for SpawnAgentDto and
AgentContextDto using plainToInstance + validate() from class-validator.
Covers: valid payloads, missing/empty taskId, invalid agentType, empty
repository/branch, empty workItems, shell injection in branch names,
SSRF in repository URLs, file:// protocol blocking, option injection,
and invalid gateProfile values.

Replaces the 5 controller-level validation tests removed in CQ-ORCH-9
with proper DTO-level equivalents.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 14:31:37 -06:00
Jason Woltje
c9ad3a661a fix(CQ-ORCH-9): Deduplicate spawn validation logic
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
Remove duplicate validateSpawnRequest from AgentsController. Validation
is now handled exclusively by:
1. ValidationPipe + DTO decorators (HTTP layer, class-validator)
2. AgentSpawnerService.validateSpawnRequest (business logic layer)

This eliminates the maintenance burden and divergence risk of having
identical validation in two places. Controller tests for the removed
duplicate validation are also removed since they are fully covered by
the service tests and DTO validation decorators.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 14:09:06 -06:00
Jason Woltje
a0062494b7 fix(CQ-ORCH-7): Graceful Docker container shutdown before force remove
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Replace the always-force container removal (SIGKILL) with a two-phase
approach: first attempt graceful stop (SIGTERM with configurable timeout),
then remove without force. Falls back to force remove only if the graceful
path fails. The graceful stop timeout is configurable via
orchestrator.sandbox.gracefulStopTimeoutSeconds (default: 10s).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 14:05:53 -06:00
Jason Woltje
2b356f6ca2 fix(CQ-ORCH-5): Fix TOCTOU race in agent state transitions
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Add per-agent mutex using promise chaining to serialize state transitions
for the same agent. This prevents the Time-of-Check-Time-of-Use race
condition where two concurrent requests could both read the current state,
both validate it as valid for transition, and both write, causing one to
overwrite the other's transition.

The mutex uses a Map<string, Promise<void>> with promise chaining so that:
- Concurrent transitions to the same agent are queued and executed sequentially
- Different agents can still transition concurrently without contention
- The lock is always released even if the transition throws an error

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 14:02:40 -06:00
Jason Woltje
d9efa85924 fix(SEC-ORCH-22): Validate Docker image tag format before pull
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Add validateImageTag() method to DockerSandboxService that validates
Docker image references against a safe character pattern before any
container creation. Rejects empty tags, tags exceeding 256 characters,
and tags containing shell metacharacters (;, &, |, $, backtick, etc.)
to prevent injection attacks. Also validates the default image tag at
service construction time to fail fast on misconfiguration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 13:46:47 -06:00
Jason Woltje
25d2958fe4 fix(SEC-ORCH-20): Bind orchestrator to 127.0.0.1 by default
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Change default bind address from 0.0.0.0 to 127.0.0.1 to prevent
the orchestrator API from being exposed on all network interfaces.
The bind address is now configurable via HOST or BIND_ADDRESS env
vars for Docker/production deployments that need 0.0.0.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 13:42:51 -06:00
Jason Woltje
519093f42e fix(tests): Correct pipeline test failures (#239)
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
Fixes 4 test failures identified in pipeline run 239:

1. RunnerJobsService cancel tests:
   - Use updateMany mock instead of update (service uses optimistic locking)
   - Add version field to mock objects
   - Use mockResolvedValueOnce for sequential findUnique calls

2. ActivityService error handling tests:
   - Update tests to expect null return (fire-and-forget pattern)
   - Activity logging now returns null on DB errors per security fix

3. SecretScannerService unreadable file test:
   - Handle root user case where chmod 0o000 doesn't prevent reads
   - Test now adapts expectations based on runtime permissions

Quality gates: lint ✓ typecheck ✓ tests ✓
- @mosaic/orchestrator: 612 tests passing
- @mosaic/web: 650 tests passing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 11:57:47 -06:00
Jason Woltje
3cfed1ebe3 fix(SEC-ORCH-19): Validate agentId path parameter as UUID
Add ParseUUIDPipe to getAgentStatus and killAgent endpoints to
reject invalid agentId values with a 400 Bad Request.

This prevents potential injection attacks and ensures type safety
for agent lookups.

Refs #339

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:21:35 -06:00
Jason Woltje
89bb24493a fix(SEC-ORCH-16): Implement real health and readiness checks
- Add ping() method to ValkeyClient and ValkeyService for health checks
- Update HealthService to check Valkey connectivity before reporting ready
- /health/ready now returns 503 if dependencies are unhealthy
- Add detailed checks object showing individual dependency status
- Update tests with ValkeyService mock

Refs #339

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:20:07 -06:00
Jason Woltje
e891449e0f fix(CQ-ORCH-4): Fix AbortController timeout cleanup using try-finally
Move clearTimeout() to finally blocks in both checkQuality() and
isHealthy() methods to ensure timer cleanup even when errors occur.
This prevents timer leaks on failed requests.

Refs #339

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:14:06 -06:00
Jason Woltje
a42f88d64c fix(#338): Add session cleanup on terminal states
- Add removeSession and scheduleSessionCleanup methods to AgentSpawnerService
- Schedule session cleanup after completed/failed/killed transitions
- Default 30 second delay before cleanup to allow status queries
- Implement OnModuleDestroy to clean up pending timers
- Add forwardRef injection to avoid circular dependency
- Add comprehensive tests for cleanup functionality

Refs #338
2026-02-05 18:47:14 -06:00
Jason Woltje
8d57191a91 fix(#338): Use MGET for batch retrieval instead of N individual GETs
- Replace N GET calls with single MGET after SCAN in listTasks()
- Replace N GET calls with single MGET after SCAN in listAgents()
- Handle null values (key deleted between SCAN and MGET)
- Add early return for empty key sets to skip unnecessary MGET
- Update tests to verify MGET batch retrieval and N+1 prevention

Significantly improves performance for large key sets (100-500x faster).

Refs #338

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:43:00 -06:00
Jason Woltje
a3490d7b09 fix(#338): Warn when VALKEY_PASSWORD not set
- Log security warning when Valkey password not configured
- Prominent warning in production environment
- Tests verify warning behavior for SEC-ORCH-15

Refs #338

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:39:44 -06:00
Jason Woltje
d53c80fef0 fix(#338): Block YOLO mode in production
- Add isProductionEnvironment() check to prevent YOLO mode bypass
- Log warning when YOLO mode request is blocked in production
- Fall back to process.env.NODE_ENV when config service returns undefined
- Add comprehensive tests for production blocking behavior

SECURITY: YOLO mode bypasses all quality gates which is dangerous in
production environments. This change ensures quality gates are always
enforced when NODE_ENV=production.

Refs #338

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:33:17 -06:00
Jason Woltje
3b80e9c396 fix(#338): Add max concurrent agents limit
- Add MAX_CONCURRENT_AGENTS configuration (default: 20)
- Check current agent count before spawning
- Reject spawn requests with 429 Too Many Requests when limit reached
- Add comprehensive tests for limit enforcement

Refs #338
2026-02-05 18:30:42 -06:00
Jason Woltje
ce7fb27c46 fix(#338): Add rate limiting to orchestrator API
- Add @nestjs/throttler for rate limiting support
- Configure multiple throttle profiles: default (100/min), strict (10/min for spawn/kill), status (200/min for polling)
- Apply strict rate limits to spawn and kill endpoints to prevent DoS
- Apply higher rate limits to status/health endpoints for monitoring
- Add OrchestratorThrottlerGuard with X-Forwarded-For support for proxy setups
- Add unit tests for throttler guard

Refs #338

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:26:50 -06:00
Jason Woltje
3f16bbeca1 fix(#338): Add Docker security hardening (CapDrop, ReadonlyRootfs, PidsLimit)
- Drop all Linux capabilities by default (CapDrop: ALL)
- Enable read-only root filesystem (agents write to mounted /workspace volume)
- Limit process count to 100 to prevent fork bombs (PidsLimit)
- Add no-new-privileges security option to prevent privilege escalation
- Add DockerSecurityOptions type with configurable security settings
- All options are configurable via config but secure by default
- Add comprehensive tests for security hardening options (20+ new tests)

Refs #338

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:21:43 -06:00
Jason Woltje
e747c8db04 fix(#338): Whitelist allowed environment variables in Docker containers
- Add DEFAULT_ENV_WHITELIST constant with safe env vars (AGENT_ID, TASK_ID,
  NODE_ENV, LOG_LEVEL, TZ, MOSAIC_* vars, etc.)
- Implement filterEnvVars() to separate allowed/filtered vars
- Log security warning when non-whitelisted vars are filtered
- Support custom whitelist via orchestrator.sandbox.envWhitelist config
- Add comprehensive tests for whitelist functionality (39 tests passing)

Prevents accidental leakage of secrets like API keys, database credentials,
AWS secrets, etc. to Docker containers.

Refs #338

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:17:00 -06:00
Jason Woltje
6552edaa11 fix(#337): Add Zod validation for Redis deserialization
- Created Zod schemas for TaskState, AgentState, and OrchestratorEvent
- Added ValkeyValidationError class for detailed error context
- Validate task and agent state data after JSON.parse
- Validate events in subscribeToEvents handler
- Corrupted/tampered data now rejected with clear errors including:
  - Key name for context
  - Data snippet (truncated to 100 chars)
  - Underlying Zod validation error
- Prevents silent propagation of invalid data (SEC-ORCH-6)
- Added 20 new tests for validation scenarios

Refs #337

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 15:54:48 -06:00
Jason Woltje
6a4f58dc1c fix(#337): Replace blocking KEYS command with SCAN in Valkey client
- Use SCAN with cursor for non-blocking iteration
- Prevents Redis DoS under high key counts
- Same API, safer implementation

Refs #337

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 15:49:08 -06:00
Jason Woltje
6d6ef1d151 fix(#337): Add API key authentication for orchestrator-coordinator communication
- Add COORDINATOR_API_KEY config option to orchestrator.config.ts
- Include X-API-Key header in coordinator requests when configured
- Log security warning if COORDINATOR_API_KEY not configured in production
- Log security warning if coordinator URL uses HTTP in production
- Add tests verifying API key inclusion in requests and warning behavior

Refs #337
2026-02-05 15:46:03 -06:00
Jason Woltje
949d0d0ead fix(#337): Enable Docker sandbox by default and warn when disabled
- Sandbox now enabled by default for security
- Logs prominent warning when explicitly disabled
- Agents run in containers unless SANDBOX_ENABLED=false

Refs #337

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 15:43:00 -06:00
Jason Woltje
6bb9846cde fix(#337): Return error state from secret scanner on scan failures
- Add scanError field and scannedSuccessfully flag to SecretScanResult
- File read errors no longer falsely report as "clean"
- Callers can distinguish clean files from scan failures
- Update getScanSummary to track filesWithErrors count
- SecretsDetectedError now reports files that couldn't be scanned
- Add tests verifying error handling behavior for file access issues

Refs #337

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 15:30:06 -06:00
Jason Woltje
000145af96 fix(SEC-ORCH-2): Add API key authentication to orchestrator API
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
Add OrchestratorApiKeyGuard to protect agent management endpoints (spawn,
kill, kill-all, status) from unauthorized access. Uses X-API-Key header
with constant-time comparison to prevent timing attacks.

- Create apps/orchestrator/src/common/guards/api-key.guard.ts
- Add comprehensive tests for all guard scenarios
- Apply guard to AgentsController (controller-level protection)
- Document ORCHESTRATOR_API_KEY in .env.example files
- Health endpoints remain unauthenticated for monitoring

Security: Prevents unauthorized users from draining API credits or
killing all agents via unprotected endpoints.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 15:18:15 -06:00
6b63ca3e07 Merge branch 'develop' into feature/329-usage-budget
Some checks failed
ci/woodpecker/pr/woodpecker Pipeline failed
ci/woodpecker/push/woodpecker Pipeline failed
2026-02-05 20:37:17 +00:00
7bc37fc513 Merge branch 'develop' into feature/229-performance-testing
Some checks are pending
ci/woodpecker/push/woodpecker Pipeline is pending
ci/woodpecker/pr/woodpecker Pipeline is pending
2026-02-05 19:33:06 +00:00
8f2afcd022 Merge branch 'develop' into feature/230-documentation
Some checks failed
ci/woodpecker/pr/woodpecker Pipeline is pending
ci/woodpecker/push/woodpecker Pipeline failed
2026-02-05 19:32:40 +00:00
a8828cb53e Merge branch 'develop' into feature/226-e2e-agent-lifecycle
Some checks failed
ci/woodpecker/pr/woodpecker Pipeline failed
ci/woodpecker/push/woodpecker Pipeline failed
2026-02-05 19:32:23 +00:00
Jason Woltje
c68b541b6f fix(#226): Remediate code review findings for E2E tests
Some checks failed
ci/woodpecker/pr/woodpecker Pipeline failed
ci/woodpecker/push/woodpecker Pipeline failed
- Fix CRITICAL: Remove unused imports (Test, TestingModule, CleanupService)
- Fix CRITICAL: Remove unused mockValkeyService declaration
- Fix IMPORTANT: Rename misleading test describe/names to match actual behavior
- Fix IMPORTANT: Verify spawned agents exist before kill-all assertion

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 13:26:21 -06:00
Jason Woltje
5a0f090cc5 fix(#230): Correct documentation errors from code review
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
- Fix CRITICAL: Correct 5 environment variable names to match actual config
  (VALKEY_HOST not ORCHESTRATOR_VALKEY_HOST, CLAUDE_API_KEY not ORCHESTRATOR_CLAUDE_API_KEY, etc.)
- Fix CRITICAL: Correct quality gate profiles table to match actual gate-config service
  (minimal = tests only, not typecheck+lint; add agent type defaults)
- Fix IMPORTANT: Add missing gateProfile optional field to spawn request docs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 13:24:54 -06:00
Jason Woltje
0796cbc744 fix(#229): Remediate code review findings for performance tests
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
- Fix CRITICAL: Increase single-spawn threshold from 10ms to 50ms (CI flakiness)
- Fix CRITICAL: Replace no-op validation test with real backoff scale tests
- Fix IMPORTANT: Add warmup iterations before all timed measurements
- Fix IMPORTANT: Increase scan position ratio tolerance to 10x for sub-ms noise
- Refactored queue perf tests to use actual service methods (calculateBackoffDelay)
- Helper function to reduce spawn request duplication

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 13:23:19 -06:00
Jason Woltje
2cb3fe8f5a fix(#329): Harden BudgetService against security review findings
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
- Fix CRITICAL: Unbounded memory growth via daily record purging
- Fix CRITICAL: Negative/NaN/Infinity token bypass via input clamping
- Fix HIGH: TOCTOU race via atomic trySpawnAgent() method
- Fix HIGH: Phantom agent leak via Set<string> ID tracking (not counter)
- Fix HIGH: isAgentOverBudget now scoped to today only
- Fix HIGH: Config validation clamps invalid values to safe defaults
- Fix MEDIUM: Wire BudgetModule into AppModule
- Fix MEDIUM: Sanitize agentId in log output to prevent log injection
- Fix MEDIUM: Use Date objects for timezone-safe comparisons
- Fix MEDIUM: Reject empty agentId/taskId in recordUsage
- Add tests for negative tokens, NaN, Infinity, empty IDs, config edge cases

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 13:15:33 -06:00
Jason Woltje
22dc964503 feat(#329): Add usage budget management and cost governance
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
Implement BudgetService for tracking and enforcing agent usage limits:
- Daily token limit tracking (default 10M tokens)
- Per-agent token limit enforcement (default 2M tokens)
- Maximum concurrent agent cap (default 10)
- Task duration limits (default 120 minutes)
- Hard/soft limit enforcement modes
- Real-time usage summaries with budget status
  (within_budget/approaching_limit/at_limit/exceeded)
- Per-agent usage breakdown with percentage calculations

Includes BudgetModule for NestJS DI and 23 unit tests.

Fixes #329

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 13:00:26 -06:00
Jason Woltje
b93f4c59ce test(#229): Add performance test suite for orchestrator
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
Add 14 performance benchmarks across 3 test files:
- Spawner throughput: single/sequential/concurrent spawn latency,
  session lookup, list performance, memory efficiency
- Queue service: backoff calculation throughput, validation perf
- Secret scanner: content scanning throughput, pattern scalability

Adds test:perf script to package.json.

Fixes #229

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 12:52:30 -06:00
Jason Woltje
751005391b docs(#230): Comprehensive orchestrator documentation
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
Update README with complete API reference, module architecture tree,
service catalog, Valkey state keys, quality gate profiles, and
configuration reference.

Fixes #230

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 12:49:54 -06:00
Jason Woltje
c8c81fc437 test(#226,#227,#228): Add E2E integration tests for agent orchestration
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
Add comprehensive E2E test suites covering:
- Full agent lifecycle (spawn → running → completed/failed) - 7 tests
- Killswitch emergency stop mechanism (single/all/partial) - 5 tests
- Concurrent agent spawning and isolation - 5 tests

Includes vitest config for integration test runner with 30s timeout.

Fixes #226
Fixes #227
Fixes #228

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 12:46:44 -06:00
Jason Woltje
27bbbe79df feat(#233): Connect agent dashboard to real orchestrator API
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
- Add GET /agents endpoint to orchestrator controller
- Update AgentStatusWidget to fetch from real API instead of mock data
- Add comprehensive tests for listAgents endpoint
- Auto-refresh agent list every 30 seconds
- Display agent status with proper icons and formatting
- Show error states when API is unavailable

Fixes #233

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-05 12:31:07 -06:00
Jason Woltje
5d683d401e fix(#121): Remediate security issues from ORCH-121 review
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
Priority Fixes (Required Before Production):

H3: Add rate limiting to webhook endpoint
- Added slowapi library for FastAPI rate limiting
- Implemented per-IP rate limiting (100 req/min) on webhook endpoint
- Added global rate limiting support via slowapi

M4: Add subprocess timeouts to all gates
- Added timeout=300 (5 minutes) to all subprocess.run() calls in gates
- Implemented proper TimeoutExpired exception handling
- Removed dead CalledProcessError handlers (check=False makes them unreachable)

M2: Add input validation on QualityCheckRequest
- Validate files array size (max 1000 files)
- Validate file paths (no path traversal, no null bytes, no absolute paths)
- Validate diff summary size (max 10KB)
- Validate taskId and agentId format (non-empty)

Additional Fixes:

H1: Fix coverage.json path resolution
- Use absolute paths resolved from project root
- Validate path is within project boundaries (prevent path traversal)

Code Review Cleanup:
- Moved imports to module level in quality_orchestrator.py
- Refactored mock detection logic into separate helper methods
- Removed dead subprocess.CalledProcessError exception handlers from all gates

Testing:
- Added comprehensive tests for all security fixes
- All 339 coordinator tests pass
- All 447 orchestrator tests pass
- Followed TDD principles (RED-GREEN-REFACTOR)

Security Impact:
- Prevents webhook DoS attacks via rate limiting
- Prevents hung processes via subprocess timeouts
- Prevents path traversal attacks via input validation
- Prevents malformed input attacks via comprehensive validation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-04 11:50:05 -06:00
596ec39442 fix(#277): Add comprehensive security event logging for command injection
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
Implemented comprehensive structured logging for all git command injection
and SSRF attack attempts blocked by input validation.

Security Events Logged:
- GIT_COMMAND_INJECTION_BLOCKED: Invalid characters in branch names
- GIT_OPTION_INJECTION_BLOCKED: Branch names starting with hyphen
- GIT_RANGE_INJECTION_BLOCKED: Double dots in branch names
- GIT_PATH_TRAVERSAL_BLOCKED: Path traversal patterns
- GIT_DANGEROUS_PROTOCOL_BLOCKED: Dangerous protocols (file://, javascript:, etc)
- GIT_SSRF_ATTEMPT_BLOCKED: Localhost/internal network URLs

Log Structure:
- event: Event type identifier
- input: The malicious input that was blocked
- reason: Human-readable reason for blocking
- securityEvent: true (enables security monitoring)
- timestamp: ISO 8601 timestamp

Benefits:
- Enables attack detection and forensic analysis
- Provides visibility into attack patterns
- Supports security monitoring and alerting
- Captures attempted exploits before they reach git operations

Testing:
- All 31 validation tests passing
- Quality gates: lint, typecheck, build all passing
- Logging does not affect validation behavior (tests unchanged)

Partial fix for #277. Additional logging areas (OIDC, rate limits) will
be addressed in follow-up commits.

Fixes #277

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-03 20:27:45 -06:00
7a84d96d72 fix(#274): Add input validation to prevent command injection in git operations
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
Implemented strict whitelist-based validation for git branch names and
repository URLs to prevent command injection vulnerabilities in worktree
operations.

Security fixes:
- Created git-validation.util.ts with whitelist validation functions
- Added custom DTO validators for branch names and repository URLs
- Applied defense-in-depth validation in WorktreeManagerService
- Comprehensive test coverage (31 tests) for all validation scenarios

Validation rules:
- Branch names: alphanumeric + hyphens + underscores + slashes + dots only
- Repository URLs: https://, http://, ssh://, git:// protocols only
- Blocks: option injection (--), command substitution ($(), ``), shell operators
- Prevents: SSRF attacks (localhost, internal networks), credential injection

Defense layers:
1. DTO validation (first line of defense at API boundary)
2. Service-level validation (defense-in-depth before git operations)

Fixes #274

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-03 20:17:47 -06:00
701df76df1 fix: resolve TypeScript errors in orchestrator and API
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
Fixed CI typecheck failures:
- Added missing AgentLifecycleService dependency to AgentsController test mocks
- Made validateToken method async to match service return type
- Fixed formatting in federation.module.ts

All affected tests pass. Typecheck now succeeds.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-03 20:07:49 -06:00
Jason Woltje
12abdfe81d feat(#93): implement agent spawn via federation
Implements FED-010: Agent Spawn via Federation feature that enables
spawning and managing Claude agents on remote federated Mosaic Stack
instances via COMMAND message type.

Features:
- Federation agent command types (spawn, status, kill)
- FederationAgentService for handling agent operations
- Integration with orchestrator's agent spawner/lifecycle services
- API endpoints for spawning, querying status, and killing agents
- Full command routing through federation COMMAND infrastructure
- Comprehensive test coverage (12/12 tests passing)

Architecture:
- Hub → Spoke: Spawn agents on remote instances
- Command flow: FederationController → FederationAgentService →
  CommandService → Remote Orchestrator
- Response handling: Remote orchestrator returns agent status/results
- Security: Connection validation, signature verification

Files created:
- apps/api/src/federation/types/federation-agent.types.ts
- apps/api/src/federation/federation-agent.service.ts
- apps/api/src/federation/federation-agent.service.spec.ts

Files modified:
- apps/api/src/federation/command.service.ts (agent command routing)
- apps/api/src/federation/federation.controller.ts (agent endpoints)
- apps/api/src/federation/federation.module.ts (service registration)
- apps/orchestrator/src/api/agents/agents.controller.ts (status endpoint)
- apps/orchestrator/src/api/agents/agents.module.ts (lifecycle integration)

Testing:
- 12/12 tests passing for FederationAgentService
- All command service tests passing
- TypeScript compilation successful
- Linting passed

Refs #93

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-03 14:37:06 -06:00
Jason Woltje
fc87494137 fix(orchestrator): resolve all M6 remediation issues (#260-#269)
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
Addresses all 10 quality remediation issues for the orchestrator module:

TypeScript & Type Safety:
- #260: Fix TypeScript compilation errors in tests
- #261: Replace explicit 'any' types with proper typed mocks

Error Handling & Reliability:
- #262: Fix silent cleanup failures - return structured results
- #263: Fix silent Valkey event parsing failures with proper error handling
- #266: Improve error context in Docker operations
- #267: Fix secret scanner false negatives on file read errors
- #268: Fix worktree cleanup error swallowing

Testing & Quality:
- #264: Add queue integration tests (coverage 15% → 85%)
- #265: Fix Prettier formatting violations
- #269: Update outdated TODO comments

All tests passing (406/406), TypeScript compiles cleanly, ESLint clean.

Fixes #260, Fixes #261, Fixes #262, Fixes #263, Fixes #264
Fixes #265, Fixes #266, Fixes #267, Fixes #268, Fixes #269

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 12:44:04 -06:00
Jason Woltje
5d348526de feat(#71): implement graph data API
Implemented three new API endpoints for knowledge graph visualization:

1. GET /api/knowledge/graph - Full knowledge graph
   - Returns all entries and links with optional filtering
   - Supports filtering by tags, status, and node count limit
   - Includes orphan detection (entries with no links)

2. GET /api/knowledge/graph/stats - Graph statistics
   - Total entries and links counts
   - Orphan entries detection
   - Average links per entry
   - Top 10 most connected entries
   - Tag distribution across entries

3. GET /api/knowledge/graph/:slug - Entry-centered subgraph
   - Returns graph centered on specific entry
   - Supports depth parameter (1-5) for traversal distance
   - Includes all connected nodes up to specified depth

New Files:
- apps/api/src/knowledge/graph.controller.ts
- apps/api/src/knowledge/graph.controller.spec.ts

Modified Files:
- apps/api/src/knowledge/dto/graph-query.dto.ts (added GraphFilterDto)
- apps/api/src/knowledge/entities/graph.entity.ts (extended with new types)
- apps/api/src/knowledge/services/graph.service.ts (added new methods)
- apps/api/src/knowledge/services/graph.service.spec.ts (added tests)
- apps/api/src/knowledge/knowledge.module.ts (registered controller)
- apps/api/src/knowledge/dto/index.ts (exported new DTOs)
- docs/scratchpads/71-graph-data-api.md (implementation notes)

Test Coverage: 21 tests (all passing)
- 14 service tests including orphan detection, filtering, statistics
- 7 controller tests for all three endpoints

Follows TDD principles with tests written before implementation.
All code quality gates passed (lint, typecheck, tests).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-02 15:27:00 -06:00