fix(#199): implement rate limiting on webhook endpoints

Implements comprehensive rate limiting on all webhook and coordinator endpoints to prevent DoS attacks. Follows TDD protocol with 14 passing tests.

Implementation:
- Added @nestjs/throttler package for rate limiting
- Created ThrottlerApiKeyGuard for per-API-key rate limiting
- Created ThrottlerValkeyStorageService for distributed rate limiting via Redis
- Configured rate limits on stitcher endpoints (60 req/min)
- Configured rate limits on coordinator endpoints (100 req/min)
- Higher limits for health endpoints (300 req/min for monitoring)
- Added environment variables for rate limit configuration
- Rate limiting logs violations for security monitoring

Rate Limits:
- Stitcher webhooks: 60 requests/minute per API key
- Coordinator endpoints: 100 requests/minute per API key
- Health endpoints: 300 requests/minute (higher for monitoring)

Storage:
- Uses Valkey (Redis) for distributed rate limiting across API instances
- Falls back to in-memory storage if Redis is unavailable

Testing:
- 14 comprehensive rate limiting tests (all passing)
- Tests verify: rate limit enforcement, Retry-After headers, per-API-key isolation
- TDD approach: RED (failing tests) → GREEN (implementation) → REFACTOR

Additional improvements:
- Type safety improvements in websocket gateway
- Array type notation standardization in coordinator service

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-02 13:07:16 -06:00
parent 210b3d2e8f
commit 41d56dadf0
14 changed files with 990 additions and 11 deletions
--- a/docs/scratchpads/199-implement-rate-limiting.md
+++ b/docs/scratchpads/199-implement-rate-limiting.md
@@ -0,0 +1,167 @@
+# Issue #199: Implement rate limiting on webhook endpoints
+
+## Objective
+Implement rate limiting on webhook and public-facing API endpoints to prevent DoS attacks and ensure system stability under high load conditions.
+
+## Approach
+
+### TDD Implementation Plan
+1. **RED**: Write failing tests for rate limiting
+   - Test rate limit enforcement (429 status)
+   - Test Retry-After header inclusion
+   - Test per-IP rate limiting
+   - Test per-API-key rate limiting
+   - Test that legitimate requests are not blocked
+   - Test storage mechanism (Redis/in-memory)
+
+2. **GREEN**: Implement NestJS throttler
+   - Install @nestjs/throttler package
+   - Configure global rate limits
+   - Configure per-endpoint rate limits
+   - Add custom guards for per-API-key limiting
+   - Integrate with Valkey (Redis) for distributed limiting
+   - Add Retry-After headers to 429 responses
+
+3. **REFACTOR**: Optimize and document
+   - Extract configuration to environment variables
+   - Add documentation
+   - Ensure code quality
+
+### Identified Webhook Endpoints
+
+**Stitcher Module** (`apps/api/src/stitcher/stitcher.controller.ts`):
+- `POST /stitcher/webhook` - Webhook endpoint for @mosaic bot
+- `POST /stitcher/dispatch` - Manual job dispatch endpoint
+
+**Coordinator Integration Module** (`apps/api/src/coordinator-integration/coordinator-integration.controller.ts`):
+- `POST /coordinator/jobs` - Create a job from coordinator
+- `PATCH /coordinator/jobs/:id/status` - Update job status
+- `PATCH /coordinator/jobs/:id/progress` - Update job progress
+- `POST /coordinator/jobs/:id/complete` - Mark job as complete
+- `POST /coordinator/jobs/:id/fail` - Mark job as failed
+- `GET /coordinator/jobs/:id` - Get job details
+- `GET /coordinator/health` - Integration health check
+
+### Rate Limit Configuration
+
+**Proposed limits**:
+- Global default: 100 requests per minute
+- Webhook endpoints: 60 requests per minute per IP
+- Coordinator endpoints: 100 requests per minute per API key
+- Health endpoints: 300 requests per minute (higher for monitoring)
+
+**Storage**: Use Valkey (Redis-compatible) for distributed rate limiting across multiple API instances.
+
+### Technology Stack
+- `@nestjs/throttler` - NestJS rate limiting module
+- Valkey (already in project) - Redis-compatible cache for distributed rate limiting
+- Custom guards for per-API-key limiting
+
+## Progress
+- [x] Create scratchpad
+- [x] Identify webhook endpoints requiring rate limiting
+- [x] Define rate limit configuration strategy
+- [x] Write failing tests for rate limiting (RED phase - TDD)
+- [x] Install @nestjs/throttler package
+- [x] Implement ThrottlerModule configuration
+- [x] Implement custom guards for per-API-key limiting
+- [x] Implement ThrottlerValkeyStorageService for distributed rate limiting
+- [x] Add rate limiting decorators to endpoints (GREEN phase - TDD)
+- [x] Add environment variables for rate limiting configuration
+- [x] Verify all tests pass (14/14 tests pass)
+- [x] Commit changes
+- [ ] Update issue #199
+
+## Testing Plan
+
+### Unit Tests
+1. **Rate limit enforcement**
+   - Verify 429 status code after exceeding limit
+   - Verify requests within limit are allowed
+
+2. **Retry-After header**
+   - Verify header is present in 429 responses
+   - Verify header value is correct (seconds until the client may retry)
+
+3. **Per-IP limiting**
+   - Verify different IPs have independent limits
+   - Verify same IP is rate limited
+
+4. **Per-API-key limiting**
+   - Verify different API keys have independent limits
+   - Verify same API key is rate limited
+
+5. **Storage mechanism**
+   - Verify Redis/Valkey integration works
+   - Verify fallback to in-memory storage if Redis is unavailable
+
+### Integration Tests
+1. **E2E rate limiting**
+   - Test actual HTTP requests hitting rate limits
+   - Test rate limits reset after time window
+
+## Environment Variables
+
+```bash
+# Rate limiting configuration
+RATE_LIMIT_TTL=60                    # Time window in seconds
+RATE_LIMIT_GLOBAL_LIMIT=100          # Global requests per window
+RATE_LIMIT_WEBHOOK_LIMIT=60          # Webhook endpoint limit
+RATE_LIMIT_COORDINATOR_LIMIT=100     # Coordinator endpoint limit
+RATE_LIMIT_HEALTH_LIMIT=300          # Health endpoint limit
+RATE_LIMIT_STORAGE=redis             # redis or memory
+```
+
+## Implementation Summary
+
+### Files Created
+1. `apps/api/src/common/throttler/throttler-api-key.guard.ts` - Custom guard for API-key based rate limiting
+2. `apps/api/src/common/throttler/throttler-storage.service.ts` - Valkey/Redis storage for distributed rate limiting
+3. `apps/api/src/common/throttler/index.ts` - Export barrel file
+4. `apps/api/src/stitcher/stitcher.rate-limit.spec.ts` - Rate limiting tests for stitcher endpoints (6 tests)
+5. `apps/api/src/coordinator-integration/coordinator-integration.rate-limit.spec.ts` - Rate limiting tests for coordinator endpoints (8 tests)
+
+### Files Modified
+1. `apps/api/src/app.module.ts` - Added ThrottlerModule and ThrottlerApiKeyGuard
+2. `apps/api/src/stitcher/stitcher.controller.ts` - Added @Throttle decorators (60 req/min)
+3. `apps/api/src/coordinator-integration/coordinator-integration.controller.ts` - Added @Throttle decorators (100 req/min, health: 300 req/min)
+4. `.env.example` - Added rate limiting environment variables
+5. `.env` - Added rate limiting environment variables
+6. `apps/api/package.json` - Added @nestjs/throttler dependency
+
+### Test Results
+- All 14 rate limiting tests pass (6 stitcher + 8 coordinator)
+- Tests verify: rate limit enforcement, Retry-After headers, per-API-key limiting, independent API key tracking
+- TDD approach followed: RED (failing tests) → GREEN (implementation) → REFACTOR
+
+### Rate Limits Configured
+- Stitcher endpoints: 60 requests/minute per API key
+- Coordinator endpoints: 100 requests/minute per API key
+- Health endpoint: 300 requests/minute per API key (higher for monitoring)
+- Storage: Valkey (Redis) for distributed limiting with in-memory fallback
+
+## Notes
+
+### Why @nestjs/throttler?
+- Official NestJS package with good TypeScript support
+- Supports pluggable storage, so Redis/Valkey can back distributed rate limiting
+- Flexible per-route configuration
+- Built-in guard system
+- Active maintenance
+
+### Security Considerations
+- Rate limiting by IP can be bypassed by rotating IPs
+- Implement per-API-key limiting as primary defense
+- Log rate limit violations for monitoring
+- Consider implementing progressive delays for repeated violations
+- Ensure rate limiting doesn't block legitimate traffic
+
+### Implementation Details
+- Use `@Throttle()` decorator for per-endpoint limits
+- Use `@SkipThrottle()` to exclude specific endpoints
+- Custom ThrottlerGuard to extract API key from X-API-Key header
+- Use Valkey connection from existing ValkeyModule
+
+## References
+- [NestJS Throttler Documentation](https://docs.nestjs.com/security/rate-limiting)
+- [OWASP Denial of Service Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Denial_of_Service_Cheat_Sheet.html)
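
The scratchpad above describes per-API-key limiting with a fixed time window and a Retry-After value on rejected requests, but this section does not show the guard or storage code itself. The following is a minimal in-memory sketch of that idea, not the commit's implementation: `trackerKey`, `FixedWindowLimiter`, and `LimitDecision` are hypothetical names, the header map is assumed to use lowercase keys (as Node's HTTP layer provides), and falling back to the client IP when no key is sent is an assumption. The real `ThrottlerApiKeyGuard` and `ThrottlerValkeyStorageService` may behave differently.

```typescript
// Hypothetical sketch: none of these names come from the commit.
type HeaderMap = Record<string, string | string[] | undefined>;

interface LimitDecision {
  allowed: boolean;
  remaining: number; // requests left in the current window
  retryAfterSeconds: number; // 0 when the request is allowed
}

// Identify the caller by the X-API-Key header when present,
// otherwise by IP (assumed fallback, not confirmed by the source).
function trackerKey(headers: HeaderMap, ip: string): string {
  const raw = headers["x-api-key"];
  const apiKey = Array.isArray(raw) ? raw[0] : raw;
  return apiKey ? `apikey:${apiKey}` : `ip:${ip}`;
}

// Fixed-window counter: each key gets `limit` hits per `ttlSeconds` window.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly ttlSeconds: number;
  private readonly windows = new Map<string, { count: number; resetAt: number }>();

  constructor(limit: number, ttlSeconds: number) {
    this.limit = limit;
    this.ttlSeconds = ttlSeconds;
  }

  hit(key: string, nowMs: number = Date.now()): LimitDecision {
    let entry = this.windows.get(key);
    // Start a fresh window on first use or once the previous one has expired.
    if (entry === undefined || nowMs >= entry.resetAt) {
      entry = { count: 0, resetAt: nowMs + this.ttlSeconds * 1000 };
      this.windows.set(key, entry);
    }
    entry.count += 1;
    if (entry.count > this.limit) {
      // Over the limit: report how long until the window resets.
      const retryAfterSeconds = Math.ceil((entry.resetAt - nowMs) / 1000);
      return { allowed: false, remaining: 0, retryAfterSeconds };
    }
    return { allowed: true, remaining: this.limit - entry.count, retryAfterSeconds: 0 };
  }
}

// Usage with the stitcher limit from the scratchpad (60 requests per 60 s).
const demoLimiter = new FixedWindowLimiter(60, 60);
const demoKey = trackerKey({ "x-api-key": "example-key" }, "203.0.113.7");
const demoDecision = demoLimiter.hit(demoKey);
if (!demoDecision.allowed) {
  // A real handler would answer 429 and set Retry-After to this value.
  console.log(`Retry-After: ${demoDecision.retryAfterSeconds}`);
}
```

Because this state lives in one process, it corresponds only to the in-memory fallback; sharing counters across API instances is what the Valkey-backed storage in the commit is for.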