Files
bootstrap/guides/infrastructure.md

166 lines
3.8 KiB
Markdown

# Infrastructure & DevOps Guide
## Before Starting
1. Check assigned issue: `~/.mosaic/rails/git/issue-list.sh -a @me`
2. Create scratchpad: `docs/scratchpads/{issue-number}-{short-name}.md`
3. Review existing infrastructure configuration
## Vault Secrets Management
**CRITICAL**: Follow canonical Vault structure for ALL secrets.
### Structure
```
{mount}/{service}/{component}/{secret-name}
Examples:
- secret-prod/postgres/database/app
- secret-prod/redis/auth/default
- secret-prod/authentik/admin/token
```
### Environment Mounts
- `secret-dev/` - Development environment
- `secret-staging/` - Staging environment
- `secret-prod/` - Production environment
### Standard Field Names
- Credentials: `username`, `password`
- Tokens: `token`
- OAuth: `client_id`, `client_secret`
- Connection strings: `url`, `host`, `port`
See `docs/vault-secrets-structure.md` for complete reference.
## Container Standards
### Dockerfile Best Practices
```dockerfile
# Use specific version tags
FROM node:20-alpine
# Create non-root user
RUN addgroup -S app && adduser -S app -G app
# Set working directory
WORKDIR /app
# Copy dependency files first (layer caching)
COPY package*.json ./
RUN npm ci --only=production
# Copy application code
COPY --chown=app:app . .
# Switch to non-root user
USER app
# Use exec form for CMD
CMD ["node", "server.js"]
```
### Container Security
- Use minimal base images (alpine, distroless)
- Run as non-root user
- Don't store secrets in images
- Scan images for vulnerabilities
- Pin dependency versions
## Kubernetes/Docker Compose
### Resource Limits
Always set resource limits to prevent runaway containers:
```yaml
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "500m"
```
### Health Checks
```yaml
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 3
```
## CI/CD Pipelines
### Pipeline Stages
1. **Lint**: Code style and static analysis
2. **Test**: Unit and integration tests
3. **Build**: Compile and package
4. **Scan**: Security and vulnerability scanning
5. **Deploy**: Environment-specific deployment
### Pipeline Security
- Use secrets management (not hardcoded)
- Pin action/image versions
- Implement approval gates for production
- Audit pipeline access
## Monitoring & Logging
### Logging Standards
- Use structured logging (JSON)
- Include correlation IDs
- Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
- Never log sensitive data
### Metrics to Collect
- Request latency (p50, p95, p99)
- Error rates
- Resource utilization (CPU, memory)
- Business metrics
### Alerting
- Define SLOs (Service Level Objectives)
- Alert on symptoms, not causes
- Include runbook links in alerts
- Avoid alert fatigue
## Testing Infrastructure
### Test Categories
1. **Unit tests**: Terraform/Ansible logic
2. **Integration tests**: Deployed resources work together
3. **Smoke tests**: Critical paths after deployment
4. **Chaos tests**: Failure mode validation
### Infrastructure Testing Tools
- Terraform: `terraform validate`, `terraform plan`
- Ansible: `ansible-lint`, molecule
- Kubernetes: `kubectl dry-run`, kubeval
- General: Terratest, ServerSpec
## Commit Format
```
chore(#67): Configure Redis cluster
- Add Redis StatefulSet with 3 replicas
- Configure persistence with PVC
- Add Vault secret for auth password
Refs #67
```
## Before Completing
1. Validate configuration syntax
2. Run infrastructure tests
3. Test in dev/staging first
4. Document any manual steps required
5. Update scratchpad and close issue