# Quality-Rails Orchestration Architecture

**Version**: 1.0
**Date**: 2026-01-31
**Status**: Proposed - Proof of Concept Required
**Authors**: Jason Woltje + Claude

---

## Executive Summary

A **non-AI coordinator** pattern for autonomous agent swarm orchestration with mechanical quality enforcement and intelligent context management.

**Key Innovation:** Separate coordination logic (deterministic code) from execution (AI agents), enabling infinite runtime, cost optimization, and guaranteed quality through mechanical gates.

**Core Principles:**

1. **Non-AI coordinator** - No context limit, runs forever
2. **Mechanical quality gates** - Lint, typecheck, test (not AI-judged)
3. **Context monitoring** - Track and manage AI agent capacity
4. **Model flexibility** - Assign the right model to each task
5. **50% rule** - Issues never exceed 50% of the agent's context limit

---

## Problem Statement

### Current State: AI-Orchestrated Agents

```
AI Orchestrator (Opus/Sonnet)
├── Has context limit (200K tokens)
├── Context grows linearly during multi-issue work
├── At 95% usage: Pauses for confirmation (loses autonomy)
├── Manual intervention required (defeats automation)
└── Cannot work through large issue queues unattended

Result: Autonomous orchestration fails at scale
```

**Observed behavior (M4 milestone):**

- 11 issues completed in 97 minutes
- Agent paused at 95% context usage
- Asked "Should I continue?" (lost autonomy)
- 10 issues remained incomplete (32% of the milestone)
- No compaction occurred
- Manual restart required

### Root Causes

1. **Context accumulation** - No automatic compaction
2. **AI risk aversion** - Conservative pause at high context usage
3. **Monolithic design** - Coordinator has the same limits as its workers
4. **No capacity planning** - Issues not sized for agent limits
5. **Model inflexibility** - One model for all tasks (wasteful)

---

## Solution: Non-AI Coordinator Architecture

### System Architecture

```
┌─────────────────────────────────────────────────────────┐
│ Non-AI Coordinator (Python/Node.js)                     │
├─────────────────────────────────────────────────────────┤
│ • No context limit (it's just code)                     │
│ • Reads issue queue                                     │
│ • Assigns agents based on context + difficulty          │
│ • Monitors agent context usage                          │
│ • Enforces mechanical quality gates                     │
│ • Triggers compaction at threshold                      │
│ • Rotates agents when exhausted                         │
│ • Infinite runtime capability                           │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│ AI Swarm Controller (OpenClaw Session)                  │
├─────────────────────────────────────────────────────────┤
│ • Coordinates subagent work                             │
│ • Context monitored externally                          │
│ • Receives compaction commands                          │
│ • Replaceable/rotatable                                 │
│ • Just an executor (not decision-maker)                 │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│ Subagents (OpenClaw Workers)                            │
├─────────────────────────────────────────────────────────┤
│ • Execute individual issues                             │
│ • Report to swarm controller                            │
│ • Quality-gated by coordinator                          │
│ • Model-specific (Opus, Sonnet, Haiku, etc.)            │
└─────────────────────────────────────────────────────────┘
```

### Separation of Concerns

| Concern              | Non-AI Coordinator                     | AI Swarm Controller | Subagents      |
| -------------------- | -------------------------------------- | ------------------- | -------------- |
| **Context limit**    | None (immortal)                        | 200K tokens         | 200K tokens    |
| **Lifespan**         | Entire milestone                       | Rotatable           | Per-issue      |
| **Decision-making**  | Model assignment, quality enforcement  | Work coordination   | Task execution |
| **Quality gates**    | Enforces mechanically                  | N/A                 | N/A            |
| **State management** | Persistent                             | Can be rotated      | Ephemeral      |
| **Cost**             | Minimal (code execution)               | Per-token           | Per-token      |

---

## The 50% Rule

### Issue Size Constraint

**Rule:** Each issue must consume no more than **50% of the assigned agent's context limit.**

**Rationale:**

```
Agent context limit: 200,000 tokens

Overhead consumption:
├── System prompts:    10-20K tokens
├── Project context:   20-30K tokens
├── Code reading:      20-40K tokens
├── Execution buffer:  10-20K tokens
└── Total overhead:    60-110K tokens (30-55%)

Available for issue: 90-140K tokens
Safe limit (50%): 100K tokens

This allows:
- Room for overhead
- Iteration and debugging
- Unexpected complexity
- No mid-task exhaustion
```

**Enforcement:**

- Issue creation MUST include a context estimate
- Coordinator validates the estimate before assignment
- If the estimate exceeds 50% of the target agent's limit: reject or decompose

### Epic Decomposition

**Large epics must be split:**

```
Epic: Authentication System
Estimated context: 300K tokens total
Target agent: Sonnet (200K limit)
Issue size limit: 100K tokens (50% rule)

Decomposition required:
├── Issue 1: Auth middleware    [20K ctx | Medium]
├── Issue 2: JWT implementation [25K ctx | Medium]
├── Issue 3: User sessions      [30K ctx | Medium]
├── Issue 4: Login endpoints    [25K ctx | Low]
├── Issue 5: RBAC permissions   [20K ctx | Medium]
└── Total: 120K ctx across 5 issues

Each issue < 100K ✅
Epic fits within multiple agent sessions ✅
```

---
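Because the rule is purely mechanical, it can be checked before any agent session is started. A minimal sketch of that check, using the auth-system numbers above (function names are illustrative, not part of any existing API):

```python
# Sketch of 50% rule enforcement; names are illustrative, not an existing API.
AGENT_CONTEXT_LIMIT = 200_000  # tokens (Sonnet/Opus-class agents)

def max_issue_size(context_limit: int = AGENT_CONTEXT_LIMIT) -> int:
    """An issue may consume at most 50% of the agent's context limit."""
    return context_limit // 2

def fits_50_percent_rule(estimated_context: int,
                         context_limit: int = AGENT_CONTEXT_LIMIT) -> bool:
    """True if the issue can be assigned; False means reject or decompose."""
    return estimated_context <= max_issue_size(context_limit)

# Every sub-issue from the auth-system decomposition fits:
assert all(fits_50_percent_rule(ctx)
           for ctx in [20_000, 25_000, 30_000, 25_000, 20_000])

# The undecomposed 300K epic would be rejected outright:
assert not fits_50_percent_rule(300_000)
```

The check is cheap enough to run at issue-creation time as well as at assignment time, so oversized issues never reach the queue.

---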
## Agent Profiles

### Model Capabilities Matrix

```json
{
  "agents": {
    "opus": {
      "model": "claude-opus-4-5",
      "context_limit": 200000,
      "difficulty_levels": ["high", "medium", "low"],
      "cost_per_1k_input": 0.015,
      "cost_per_1k_output": 0.075,
      "speed": "slow",
      "use_cases": [
        "Complex refactoring",
        "Architecture design",
        "Difficult debugging",
        "Novel algorithms"
      ]
    },
    "sonnet": {
      "model": "claude-sonnet-4-5",
      "context_limit": 200000,
      "difficulty_levels": ["medium", "low"],
      "cost_per_1k_input": 0.003,
      "cost_per_1k_output": 0.015,
      "speed": "medium",
      "use_cases": ["API endpoints", "Business logic", "Standard features", "Test writing"]
    },
    "haiku": {
      "model": "claude-haiku-4",
      "context_limit": 200000,
      "difficulty_levels": ["low"],
      "cost_per_1k_input": 0.00025,
      "cost_per_1k_output": 0.00125,
      "speed": "fast",
      "use_cases": ["CRUD operations", "Config changes", "Documentation", "Simple fixes"]
    },
    "glm": {
      "model": "glm-4-plus",
      "context_limit": 128000,
      "difficulty_levels": ["medium", "low"],
      "cost_per_1k_input": 0.001,
      "cost_per_1k_output": 0.001,
      "speed": "fast",
      "use_cases": ["Standard features (lower cost)", "International projects", "High-volume tasks"]
    },
    "minimax": {
      "model": "minimax-01",
      "context_limit": 128000,
      "difficulty_levels": ["low"],
      "cost_per_1k_input": 0.0005,
      "cost_per_1k_output": 0.0005,
      "speed": "fast",
      "use_cases": ["Simple tasks (very low cost)", "Bulk operations", "Non-critical work"]
    }
  }
}
```

### Difficulty Levels Defined

**Low Difficulty:**

- CRUD operations (create, read, update, delete)
- Configuration changes
- Documentation updates
- Simple bug fixes
- UI text changes
- Adding logging/comments

**Criteria:**

- Well-established patterns
- No complex logic
- Minimal dependencies
- Low risk of regressions

**Medium Difficulty:**

- API endpoint implementation
- Business logic features
- Database schema changes
- Integration with external services
- Standard refactoring
- Test suite additions

**Criteria:**

- Moderate complexity
- Some novel logic required
- Multiple file changes
- Moderate risk of side effects

**High Difficulty:**

- Architecture changes
- Complex algorithms
- Performance optimization
- Security-critical features
- Large-scale refactoring
- Novel problem-solving

**Criteria:**

- High complexity
- Requires deep understanding
- Cross-cutting concerns
- High risk of regressions

---

## Issue Metadata Schema

### Required Fields

```json
{
  "issue": {
    "id": 123,
    "title": "Add JWT authentication [25K | Medium]",
    "description": "Implement JWT token-based authentication...",
    "metadata": {
      "estimated_context": 25000,
      "difficulty": "medium",
      "epic": "auth-system",
      "dependencies": [122],
      "quality_gates": ["lint", "typecheck", "test", "security-scan"],
      "assignment": {
        "suggested_models": ["sonnet", "opus"],
        "assigned_model": null,
        "assigned_agent_id": null
      },
      "tracking": {
        "created_at": "2026-01-31T10:00:00Z",
        "started_at": null,
        "completed_at": null,
        "actual_context_used": null,
        "duration_minutes": null
      }
    }
  }
}
```

### Issue Title Format

**Template:** `[Feature name] [Context estimate | Difficulty]`

**Examples:**

```
✅ "Add JWT authentication [25K | Medium]"
✅ "Fix typo in README [2K | Low]"
✅ "Refactor auth system [80K | High]"
✅ "Implement rate limiting [30K | Medium]"
✅ "Add OpenAPI docs [15K | Low]"

❌ "Add authentication"      (missing metadata)
❌ "Refactor auth [High]"    (missing context estimate)
❌ "Fix bug [20K]"           (missing difficulty)
```

### Issue Body Template

```markdown
## Context Estimate

**Estimated tokens:** 25,000 (12.5% of 200K limit)

## Difficulty

**Level:** Medium
**Rationale:**

- Requires understanding JWT spec
- Integration with existing auth middleware
- Security considerations (token signing, validation)
- Test coverage for auth flows

## Suggested Models

- Primary: Sonnet (cost-effective for medium difficulty)
- Fallback: Opus (if complexity increases)

## Dependencies

- #122 (Auth middleware must be complete first)

## Quality Gates

- [x] Lint (ESLint + Prettier)
- [x] Typecheck (TypeScript strict mode)
- [x] Tests (Unit + Integration, 80%+ coverage)
- [x] Security scan (No hardcoded secrets, safe crypto)

## Task Description

[Detailed description of work to be done...]

## Acceptance Criteria

- [ ] JWT tokens generated on login
- [ ] Tokens validated on protected routes
- [ ] Token refresh mechanism implemented
- [ ] Tests cover happy path + edge cases
- [ ] Documentation updated

## Context Breakdown

| Activity                           | Estimated Tokens |
| ---------------------------------- | ---------------- |
| Read existing auth code            | 5,000            |
| Implement JWT library integration  | 8,000            |
| Write middleware logic             | 6,000            |
| Add tests                          | 4,000            |
| Update documentation               | 2,000            |
| **Total**                          | **25,000**       |
```

---

## Context Estimation Guidelines

### Estimation Formula

```
Estimated Context = (
    Files to read              × 5-10K per file
  + Implementation complexity  × 10-30K
  + Test writing               × 5-15K
  + Documentation              × 2-5K
  + Buffer for iteration       × 20-50%
)
```

### Examples

**Simple (Low Difficulty):**

```
Task: Fix typo in README.md

Files to read:  1 × 5K = 5K
Implementation: Minimal = 1K
Tests:          None = 0K
Docs:           None = 0K
Buffer:         20% = 1.2K

Total: ~7K tokens
Rounded estimate: 10K tokens (conservative)
```

**Medium (Medium Difficulty):**

```
Task: Add API endpoint for user profile

Files to read:  3 × 8K = 24K
Implementation: Standard endpoint = 15K
Tests:          Unit + integration = 10K
Docs:           API spec update = 3K
Buffer:         30% = 15.6K

Total: ~67.6K tokens
Rounded estimate: 70K tokens
```

**Complex (High Difficulty):**

```
Task: Refactor authentication system

Files to read:  8 × 10K = 80K
Implementation: Complex refactor = 30K
Tests:          Extensive = 15K
Docs:           Architecture guide = 5K
Buffer:         50% = 65K

Total: ~195K tokens

⚠️ Exceeds 50% rule (100K limit)!
Action: Split into 2-3 smaller issues
```

### Estimation Accuracy Tracking

**After each issue, measure variance:**

```python
variance = actual_context - estimated_context
variance_pct = (variance / estimated_context) * 100

# Log for calibration
if abs(variance_pct) > 20:
    print(f"⚠️ Estimate off by {variance_pct:.1f}%")
    print(f"Estimated: {estimated_context}")
    print(f"Actual: {actual_context}")
    print("Review estimation guidelines")
```

**Over time, refine the estimation formula based on historical data.**

---

## Coordinator Implementation

### Core Algorithm

```python
import json
from datetime import datetime


class QualityRailsCoordinator:
    """Non-AI coordinator for agent swarm orchestration."""

    def __init__(self, issue_queue, agent_profiles, quality_gates):
        self.issues = issue_queue
        self.agents = agent_profiles
        self.gates = quality_gates
        self.current_controller = None

    def run(self):
        """Main orchestration loop."""
        # Validate all issues
        self.validate_issues()

        # Sort by dependencies and priority
        self.issues = self.topological_sort(self.issues)

        # Start AI swarm controller
        self.start_swarm_controller()

        # Process queue
        for issue in self.issues:
            print(f"\n{'=' * 60}")
            print(f"Starting issue #{issue['id']}: {issue['title']}")
            print(f"{'=' * 60}\n")

            # Assign optimal agent
            agent = self.assign_agent(issue)

            # Monitor and execute
            self.execute_issue(issue, agent)

            # Log metrics
            self.log_metrics(issue, agent)

        print("\n✅ All issues complete. Queue empty.")

    def validate_issues(self):
        """Ensure all issues have required metadata."""
        for issue in self.issues:
            if not issue.get("estimated_context"):
                raise ValueError(f"Issue {issue['id']} missing context estimate")
            if not issue.get("difficulty"):
                raise ValueError(f"Issue {issue['id']} missing difficulty rating")

            # Validate 50% rule against the largest available agent
            max_context = max(
                agent["context_limit"] for agent in self.agents.values()
            )
            if issue["estimated_context"] > (max_context * 0.5):
                raise ValueError(
                    f"Issue {issue['id']} exceeds 50% rule: "
                    f"{issue['estimated_context']} > {max_context * 0.5}"
                )

    def assign_agent(self, issue):
        """Assign optimal agent based on context + difficulty."""
        context_est = issue["estimated_context"]
        difficulty = issue["difficulty"]

        # Filter models that can handle this issue
        candidates = []
        for model_name, profile in self.agents.items():
            # Check context capacity (50% rule)
            if context_est <= (profile["context_limit"] * 0.5):
                # Check difficulty match
                if difficulty in profile["difficulty_levels"]:
                    # Estimate input cost
                    cost = context_est * profile["cost_per_1k_input"] / 1000
                    candidates.append(
                        {"model": model_name, "profile": profile, "cost": cost}
                    )

        if not candidates:
            raise ValueError(
                f"No model can handle issue {issue['id']}: "
                f"{context_est} tokens, {difficulty} difficulty"
            )

        # Optimize for cost (prefer cheaper models when capable)
        candidates.sort(key=lambda x: x["cost"])
        selected = candidates[0]

        print(f"📋 Assigned {selected['model']} to issue {issue['id']}")
        print(f"   Context: {context_est} tokens")
        print(f"   Difficulty: {difficulty}")
        print(f"   Estimated cost: ${selected['cost']:.4f}")

        return selected

    def execute_issue(self, issue, agent):
        """Execute issue with assigned agent."""
        # Start agent session
        session = self.start_agent_session(agent["profile"])

        # Track context
        session_context = 0
        context_limit = agent["profile"]["context_limit"]

        # Execution loop
        iteration = 0
        while not issue.get("complete"):
            iteration += 1

            # Check context health
            if session_context > (context_limit * 0.80):
                print(f"⚠️ Context at 80% ({session_context}/{context_limit})")
                print("   Triggering compaction...")
                session_context = self.compact_session(session)
                print(f"   ✓ Compacted to {session_context} tokens")

            if session_context > (context_limit * 0.95):
                print("🔄 Context at 95% - rotating agent session")
                state = session.save_state()
                session.terminate()
                session = self.start_agent_session(agent["profile"])
                session.load_state(state)
                session_context = session.current_context()

            # Agent executes step
            print(f"   Iteration {iteration}...")
            result = session.execute_step(issue)

            # Update context tracking
            session_context += result["context_used"]

            # Check if agent claims completion
            if result.get("claims_complete"):
                print("   Agent claims completion. Running quality gates...")

                # Enforce quality gates
                gate_results = self.gates.validate(result)

                if gate_results["passed"]:
                    print("   ✅ All quality gates passed")
                    issue["complete"] = True
                    issue["actual_context_used"] = session_context
                else:
                    print("   ❌ Quality gates failed:")
                    for gate, errors in gate_results["failures"].items():
                        print(f"      {gate}: {errors}")
                    # Send feedback to agent
                    session.send_feedback(gate_results["failures"])

        # Clean up
        session.terminate()

    def start_swarm_controller(self):
        """Start AI swarm controller (OpenClaw session)."""
        # Initialize OpenClaw swarm controller.
        # It coordinates subagents but is managed by this coordinator.
        pass

    def start_agent_session(self, agent_profile):
        """Start an individual agent session; return a session handle."""
        pass

    def compact_session(self, session):
        """Trigger compaction in agent session."""
        summary = session.send_message(
            "Summarize all completed work concisely. "
            "Keep only essential context for continuation."
        )
        session.reset_history_with_summary(summary)
        return session.current_context()

    def topological_sort(self, issues):
        """Sort issues by dependencies."""
        # Implement dependency graph sorting.
        # Ensures dependencies complete before dependents.
        pass

    def log_metrics(self, issue, agent):
        """Log issue completion metrics."""
        metrics = {
            "issue_id": issue["id"],
            "title": issue["title"],
            "estimated_context": issue["estimated_context"],
            "actual_context": issue.get("actual_context_used"),
            "variance": (
                issue.get("actual_context_used", 0) - issue["estimated_context"]
            ),
            "model": agent["model"],
            "difficulty": issue["difficulty"],
            "timestamp": datetime.now().isoformat(),
        }

        # Append to metrics file
        with open("orchestrator-metrics.jsonl", "a") as f:
            f.write(json.dumps(metrics) + "\n")
```

### Quality Gates Implementation

```python
class QualityGates:
    """Mechanical quality enforcement."""

    def validate(self, result):
        """Run all quality gates."""
        gates = {
            "lint": self.run_lint,
            "typecheck": self.run_typecheck,
            "test": self.run_tests,
            "security": self.run_security_scan,
        }

        failures = {}
        for gate_name, gate_fn in gates.items():
            gate_result = gate_fn(result)
            if not gate_result["passed"]:
                failures[gate_name] = gate_result["errors"]

        return {"passed": len(failures) == 0, "failures": failures}

    def run_lint(self, result):
        """Run linting (ESLint, Prettier, etc.)."""
        # Execute: pnpm turbo run lint
        # Parse output; return pass/fail + errors
        pass

    def run_typecheck(self, result):
        """Run TypeScript type checking."""
        # Execute: pnpm turbo run typecheck
        # Parse output; return pass/fail + errors
        pass

    def run_tests(self, result):
        """Run test suite."""
        # Execute: pnpm turbo run test
        # Check coverage threshold; return pass/fail + errors
        pass

    def run_security_scan(self, result):
        """Run security checks."""
        # Execute: detect-secrets scan
        # Check for vulnerabilities; return pass/fail + errors
        pass
```

---

## Issue Creation Process

### Workflow

```
1. Epic Planning Agent
   ├── Receives epic description
   ├── Estimates total context required
   ├── Checks against agent limits
   └── Decomposes into issues if needed

2. Issue Creation
   ├── For each sub-issue:
   │   ├── Estimate context (formula + buffer)
   │   ├── Assign difficulty level
   │   ├── Validate 50% rule
   │   └── Create issue with metadata

3. Validation
   ├── Coordinator validates all issues
   ├── Checks for missing metadata
   └── Rejects oversized issues

4. Execution
   ├── Coordinator assigns agents
   ├── Monitors context usage
   ├── Enforces quality gates
   └── Logs metrics for calibration
```

### Epic Planning Agent Prompt

````markdown
You are an Epic Planning Agent. Your job is to decompose epics into
properly-sized issues for autonomous execution.

## Guidelines

1. **Estimate total context:**
   - Read all related code files
   - Estimate implementation complexity
   - Account for tests and documentation
   - Add 30% buffer for iteration

2. **Apply 50% rule:**
   - Target agent context limit: 200K tokens
   - Maximum issue size: 100K tokens
   - If epic exceeds 100K: Split into multiple issues

3. **Assign difficulty:**
   - Low: CRUD, config, docs, simple fixes
   - Medium: APIs, business logic, integrations
   - High: Architecture, complex algorithms, refactors

4. **Create issues with metadata:**

   ```json
   {
     "title": "[Feature] [Context | Difficulty]",
     "estimated_context": 25000,
     "difficulty": "medium",
     "epic": "epic-name",
     "dependencies": [],
     "quality_gates": ["lint", "typecheck", "test"]
   }
   ```

5. **Validate:**
   - Each issue < 100K tokens ✓
   - Dependencies are explicit ✓
   - Difficulty matches complexity ✓
   - Quality gates defined ✓

## Output Format

Create a JSON array of issues:

```json
[
  {
    "id": 1,
    "title": "Add auth middleware [20K | Medium]",
    "estimated_context": 20000,
    "difficulty": "medium",
    ...
  },
  ...
]
```
````

---

## Proof of Concept Plan

### PoC Goals

1. **Validate non-AI coordinator pattern** - Prove it can manage agent lifecycle
2. **Test context monitoring** - Verify we can track and react to context usage
3. **Validate quality gates** - Ensure mechanical enforcement works
4. **Test agent assignment** - Confirm model selection logic
5. **Measure metrics** - Collect data on estimate accuracy

### PoC Scope

**Small test project:**

- 5-10 simple issues
- Mix of difficulty levels
- Use Haiku + Sonnet (cheap)
- Real quality gates (lint, typecheck, test)

**What we'll build:**

```
poc/
├── coordinator.py       # Non-AI coordinator
├── agent_profiles.json  # Model capabilities
├── issues.json          # Test issue queue
├── quality_gates.py     # Mechanical gates
└── metrics.jsonl        # Results log
```

**Test cases:**

1. Low difficulty issue → Haiku (cheap, fast)
2. Medium difficulty issue → Sonnet (balanced)
3. Oversized issue → Should reject (50% rule)
4. Issue with failed quality gate → Agent retries
5. High context issue → Triggers compaction

### PoC Success Criteria

- [ ] Coordinator completes all issues without human intervention
- [ ] Quality gates enforce standards (at least 1 failure caught + fixed)
- [ ] Context monitoring works (log shows tracking)
- [ ] Agent assignment is optimal (cheapest capable model chosen)
- [ ] Metrics collected for all issues
- [ ] No agent exhaustion (50% rule enforced)

### PoC Timeline

**Week 1: Foundation**

- [ ] Build coordinator skeleton
- [ ] Implement agent profiles
- [ ] Create test issue queue
- [ ] Set up quality gates

**Week 2: Integration**

- [ ] Connect to Claude API
- [ ] Implement context monitoring
- [ ] Test agent lifecycle
- [ ] Validate quality gates

**Week 3: Testing**

- [ ] Run full PoC
- [ ] Collect metrics
- [ ] Analyze results
- [ ] Document findings

**Week 4: Refinement**

- [ ] Fix issues discovered
- [ ] Optimize assignment logic
- [ ] Update documentation
- [ ] Prepare for production

---

## Production Deployment (Post-PoC)

### Integration with Mosaic Stack

**Phase 1: Core Implementation**

- Implement coordinator in Mosaic Stack codebase
- Add agent profiles to configuration
- Integrate with existing OpenClaw infrastructure
- Add quality gates to CI/CD

**Phase 2: Issue Management**

- Update issue templates with metadata fields
- Train team on estimation guidelines
- Build issue validation tools
- Create epic planning workflows

**Phase 3: Monitoring**

- Add coordinator metrics dashboard
- Track estimate accuracy over time
- Monitor cost optimization
- Alert on failures

**Phase 4: Scale**

- Expand to all milestones
- Add more agent types (GLM, MiniMax)
- Optimize for multi-epic orchestration
- Build self-learning estimation

---

## Open Questions (To Resolve in PoC)

1. **Compaction effectiveness:** How much context does summarization actually free?
2. **Estimation accuracy:** How close are initial estimates to reality?
3. **Model selection:** Is cost-optimized assignment actually optimal, or should we prioritize speed/quality?
4. **Quality gate timing:** Should gates run after each commit, or only at issue completion?
5. **Session rotation overhead:** What's the cost of rotating agents vs compaction?
6. **Dependency handling:** How to ensure dependencies are truly complete before starting dependent issues?
---

## Success Metrics

### PoC Metrics

- **Autonomy:** % of issues completed without human intervention
- **Quality:** % of commits passing all quality gates on first try
- **Cost:** Total cost vs baseline (all-Opus)
- **Accuracy:** Context estimate variance (target: <20%)
- **Efficiency:** Issues per hour

### Production Metrics

- **Throughput:** Issues completed per day
- **Quality rate:** % passing all gates first try
- **Context efficiency:** Avg context used vs estimated
- **Cost savings:** % saved vs all-Opus baseline
- **Agent utilization:** % of time agents are productive (not waiting)

---

## Appendix: Agent Skill Definitions

### Agent Skills Schema

```json
{
  "skills": {
    "backend-api": {
      "description": "Build RESTful APIs and endpoints",
      "difficulty": "medium",
      "typical_context": "20-40K",
      "quality_gates": ["lint", "typecheck", "test", "api-spec"]
    },
    "frontend-ui": {
      "description": "Build UI components and pages",
      "difficulty": "medium",
      "typical_context": "15-35K",
      "quality_gates": ["lint", "typecheck", "test", "a11y"]
    },
    "database-schema": {
      "description": "Design and migrate database schemas",
      "difficulty": "high",
      "typical_context": "30-50K",
      "quality_gates": ["typecheck", "test", "migration-validate"]
    },
    "documentation": {
      "description": "Write technical documentation",
      "difficulty": "low",
      "typical_context": "5-15K",
      "quality_gates": ["spelling", "markdown-lint"]
    },
    "refactoring": {
      "description": "Refactor existing code",
      "difficulty": "high",
      "typical_context": "40-80K",
      "quality_gates": ["lint", "typecheck", "test", "no-behavior-change"]
    },
    "bug-fix": {
      "description": "Fix reported bugs",
      "difficulty": "low-medium",
      "typical_context": "10-30K",
      "quality_gates": ["lint", "typecheck", "test", "regression-test"]
    }
  }
}
```

**Usage:**

- Issues can reference skills: `"skills": ["backend-api", "database-schema"]`
- Coordinator uses skill metadata to inform estimates
- Helps with consistent difficulty assignment

---

## Document Status

**Version:** 1.0 - Proposed Architecture
**Next Steps:** Build Proof of Concept
**Approval Required:** After successful PoC

---

**End of Architecture Document**