fix(#281): Fix broad exception catching hiding system errors

Replaced broad try-catch blocks with targeted error handling that only
catches expected business logic errors (CommandProcessingError subclasses).
System errors (OOM, DB failures, network issues) now propagate correctly
for proper debugging and monitoring.

Changes:
- Created CommandProcessingError hierarchy for business logic errors:
  - UnknownCommandTypeError for invalid command types
  - AgentCommandError for orchestrator communication failures
  - InvalidCommandPayloadError for payload validation
- Updated command.service.ts to only catch CommandProcessingError
- Updated federation-agent.service.ts to throw appropriate error types
- Added comprehensive tests for both business and system error scenarios
- System errors now include structured logging with context
- All 286 federation tests pass

Impact:
- System failures now surface as the original error instead of being reported as business logic failures, so they can be debugged
- System errors properly trigger monitoring/alerting
- Business logic errors handled gracefully with error responses
- No more masking of critical issues like OOM or DB failures

Fixes #281

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-03 20:57:51 -06:00
parent 9caaf91ecc
commit f53f310061
7 changed files with 388 additions and 54 deletions

@@ -0,0 +1,49 @@
# Issue #281: Fix broad exception catching hiding system errors
## Objective
Fix broad try-catch blocks in command.service.ts that catch ALL errors including system failures (OOM, DB failures, etc.), making debugging impossible.
## Location
apps/api/src/federation/command.service.ts:168-194
## Problem
The current implementation catches all errors in a broad try-catch block, which masks critical system errors as business logic failures. This makes debugging impossible and can hide serious issues like:
- Out of memory errors
- Database connection failures
- Network failures
- Module loading failures
## Approach
1. Define specific error types for expected business logic errors
2. Only catch expected errors (e.g., unknown command types, command validation failures)
3. Let system errors (OOM, DB failures, network issues) propagate naturally
4. Add structured logging for business logic errors
5. Add comprehensive tests for both business and system error scenarios
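The approach above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of command.service.ts: the class names come from the commit message, while `CommandResponse`, `toErrorResponse`, and the `run` callback are hypothetical names introduced only for this example.

```typescript
// Base class for expected business logic failures (name from the commit message).
class CommandProcessingError extends Error {
  constructor(message: string) {
    super(message);
    this.name = new.target.name;
    // Restore the prototype chain so instanceof still works when targeting ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class UnknownCommandTypeError extends CommandProcessingError {
  constructor(commandType: string) {
    super(`Unknown command type: ${commandType}`);
  }
}

class InvalidCommandPayloadError extends CommandProcessingError {}

// Hypothetical response shape; the real service may use a different one.
interface CommandResponse {
  success: boolean;
  data?: unknown;
  error?: string;
}

// Converts expected business errors into an error response and rethrows
// everything else, so system errors (OOM, DB failures, network issues)
// propagate to the caller and to monitoring.
function toErrorResponse(error: unknown): CommandResponse {
  if (error instanceof CommandProcessingError) {
    return { success: false, error: error.message };
  }
  throw error;
}

// Hypothetical handler: `run` stands in for the real command processing.
async function handleIncomingCommand(
  run: () => Promise<unknown>,
): Promise<CommandResponse> {
  try {
    return { success: true, data: await run() };
  } catch (error) {
    return toErrorResponse(error);
  }
}
```

Keeping the decision in one small function means the business-error and system-error cases can each be tested directly, without mocking the whole command pipeline.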
## Implementation Plan
- [ ] Create custom error classes for expected business errors
- [ ] Update handleIncomingCommand to only catch expected errors
- [ ] Add structured logging for business logic errors
- [ ] Write tests for business logic errors (should be caught)
- [ ] Write tests for system errors (should propagate)
- [ ] Verify all tests pass
- [ ] Run quality gates (lint, typecheck, build)
## Testing
- Test business logic errors are caught and handled gracefully
- Test system errors propagate correctly
- Test error logging includes appropriate context
- Maintain 85%+ coverage
## Notes
- This is a P0 security issue - proper error handling is critical for production debugging
- Follow patterns from other federation services
- Ensure backward compatibility with existing error handling flows

@@ -0,0 +1,50 @@
# Issue #282: Add HTTP request timeouts (DoS risk)
## Objective
Add a 10-second timeout to all outbound HTTP requests to prevent DoS via slowloris-style slow responses and resource exhaustion.
## Security Impact
- DoS via slowloris attack (attacker sends data very slowly)
- Resource exhaustion from hung connections
- API becomes unresponsive
- P0 security vulnerability
## Current Status
✅ HttpModule is already configured with a 10-second timeout in federation.module.ts:29
- All HTTP requests via HttpService automatically use this timeout
- No code changes needed in command.service.ts, query.service.ts, or event.service.ts
## Approach
1. Verify timeout is properly configured at module level
2. Add explicit test to verify timeout enforcement
3. Add tests for timeout scenarios
4. Document timeout configuration
5. Verify all federation HTTP requests use the configured HttpService
## Implementation Plan
- [ ] Review federation.module.ts timeout configuration
- [ ] Add test for HTTP timeout enforcement
- [ ] Add test for timeout error handling
- [ ] Verify query.service.ts uses timeout
- [ ] Verify event.service.ts uses timeout
- [ ] Verify command.service.ts uses timeout
- [ ] Run quality gates (lint, typecheck, build, tests)
## Testing
- Test HTTP request times out after 10 seconds
- Test timeout errors are handled gracefully
- Test all federation services respect timeout
- Maintain 85%+ coverage
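For the "handled gracefully" case, one possible shape is a small classifier that the timeout tests can exercise directly. This sketch assumes the axios-backed HttpService reports timeouts with the error code `ECONNABORTED` (or `ETIMEDOUT` when axios is configured to clarify timeout errors); that should be verified against the axios version in use, since `ECONNABORTED` can also indicate an aborted request. `isTimeoutError` and `describeHttpFailure` are hypothetical helpers, not existing federation code.

```typescript
// Hypothetical helper: reports whether an error thrown by an HTTP call looks
// like a request timeout, based on its `code` property.
// Assumption: timeouts carry code "ECONNABORTED" or "ETIMEDOUT".
function isTimeoutError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const code = (error as { code?: unknown }).code;
  return code === "ECONNABORTED" || code === "ETIMEDOUT";
}

// Hypothetical mapping from a failed request to a message for the caller.
function describeHttpFailure(error: unknown, timeoutMs: number): string {
  return isTimeoutError(error)
    ? `Remote instance did not respond within ${timeoutMs} ms`
    : "Remote instance request failed";
}
```

A test for timeout handling could then mock the HTTP call to reject with `{ code: "ECONNABORTED" }` and assert on the resulting message, rather than waiting out a real 10-second timeout.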
## Notes
- Timeout already configured via HttpModule.register({ timeout: 10000 })
- Need to add explicit tests to verify timeout works
- This is a verification and testing issue, not an implementation issue