# Infrastructure & DevOps Guide

## Before Starting

1. Check assigned issue: `~/.config/mosaic/rails/git/issue-list.sh -a @me`
2. Create scratchpad: `docs/scratchpads/{issue-number}-{short-name}.md`
3. Review existing infrastructure configuration

## Vault Secrets Management

**CRITICAL**: Follow canonical Vault structure for ALL secrets.

### Structure

```
{mount}/{service}/{component}/{secret-name}

Examples:
- secret-prod/postgres/database/app
- secret-prod/redis/auth/default
- secret-prod/authentik/admin/token
```

### Environment Mounts

- `secret-dev/` - Development environment
- `secret-staging/` - Staging environment
- `secret-prod/` - Production environment

### Standard Field Names

- Credentials: `username`, `password`
- Tokens: `token`
- OAuth: `client_id`, `client_secret`
- Connection strings: `url`, `host`, `port`

See `docs/vault-secrets-structure.md` for complete reference.

## Container Standards

### Dockerfile Best Practices

```dockerfile
# Use specific version tags
FROM node:20-alpine

# Create non-root user
RUN addgroup -S app && adduser -S app -G app

# Set working directory
WORKDIR /app

# Copy dependency files first (layer caching)
COPY package*.json ./
# Install production dependencies only (--only=production is deprecated)
RUN npm ci --omit=dev

# Copy application code
COPY --chown=app:app . .

# Switch to non-root user
USER app

# Use exec form for CMD
CMD ["node", "server.js"]
```

### Container Security

- Use minimal base images (alpine, distroless)
- Run as non-root user
- Don't store secrets in images
- Scan images for vulnerabilities
- Pin dependency versions

## Kubernetes/Docker Compose

### Resource Limits

Always set resource limits to prevent runaway containers:

```yaml
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
```

### Health Checks

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 3
```

## CI/CD Pipelines

### Pipeline Stages

1. **Lint**: Code style and static analysis
2. **Test**: Unit and integration tests
3. **Build**: Compile and package
4. **Scan**: Security and vulnerability scanning
5. **Deploy**: Environment-specific deployment

### Pipeline Security

- Use secrets management (never hardcode secrets)
- Pin action/image versions
- Implement approval gates for production
- Audit pipeline access

## Monitoring & Logging

### Logging Standards

- Use structured logging (JSON)
- Include correlation IDs
- Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
- Never log sensitive data

### Metrics to Collect

- Request latency (p50, p95, p99)
- Error rates
- Resource utilization (CPU, memory)
- Business metrics

### Alerting

- Define SLOs (Service Level Objectives)
- Alert on symptoms, not causes
- Include runbook links in alerts
- Avoid alert fatigue

## Testing Infrastructure

### Test Categories

1. **Unit tests**: Terraform/Ansible logic
2. **Integration tests**: Deployed resources work together
3. **Smoke tests**: Critical paths after deployment
4. **Chaos tests**: Failure mode validation

### Infrastructure Testing Tools

- Terraform: `terraform validate`, `terraform plan`
- Ansible: `ansible-lint`, `molecule`
- Kubernetes: `kubectl apply --dry-run=client`, kubeval
- General: Terratest, ServerSpec

## Commit Format

```
chore(#67): Configure Redis cluster

- Add Redis StatefulSet with 3 replicas
- Configure persistence with PVC
- Add Vault secret for auth password

Refs #67
```

## Before Completing

1. Validate configuration syntax
2. Run infrastructure tests
3. Test in dev/staging first
4. Document any manual steps required
5. Update scratchpad and close issue
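
As part of validating configuration, the Vault path convention from the Vault Secrets Management section can be checked mechanically. The following is a minimal Python sketch, not part of the existing tooling. The names `validate_vault_path`, `ALLOWED_MOUNTS`, and `SEGMENT` are hypothetical, and the rule that each segment is lowercase alphanumeric with hyphens is an assumption, since this guide does not define a character set.

```python
import re

# Mounts listed under "Environment Mounts" in this guide.
ALLOWED_MOUNTS = {"secret-dev", "secret-staging", "secret-prod"}

# Assumption: each segment is lowercase alphanumeric with optional hyphens.
# The guide does not specify this; adjust to match your Vault policy.
SEGMENT = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def validate_vault_path(path: str) -> list[str]:
    """Return a list of problems with `path`; an empty list means it conforms."""
    parts = path.strip("/").split("/")
    if len(parts) != 4:
        return [
            "expected 4 segments (mount/service/component/secret-name), "
            f"got {len(parts)}"
        ]
    problems = []
    mount, *rest = parts
    if mount not in ALLOWED_MOUNTS:
        problems.append(f"unknown mount {mount!r}")
    for name, value in zip(("service", "component", "secret-name"), rest):
        if not SEGMENT.match(value):
            problems.append(f"invalid {name} segment {value!r}")
    return problems


if __name__ == "__main__":
    for candidate in (
        "secret-prod/postgres/database/app",  # example from this guide
        "secret-prod/redis/auth/default",     # example from this guide
        "secret-qa/redis/auth/default",       # hypothetical: mount not in the guide
        "secret-dev/postgres/app",            # hypothetical: missing a segment
    ):
        issues = validate_vault_path(candidate)
        print(candidate, "->", "ok" if not issues else "; ".join(issues))
```

A check like this could run in the **Lint** stage of the pipeline against any file that references Vault paths, so that a nonconforming path fails early instead of surfacing at deploy time.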