# Infrastructure & DevOps Guide

## Before Starting

1. Check assigned issue: `~/.mosaic/rails/git/issue-list.sh -a @me`
2. Create scratchpad: `docs/scratchpads/{issue-number}-{short-name}.md`
3. Review existing infrastructure configuration

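As a small illustration of the scratchpad naming pattern above, the sketch below builds the path from an issue number and a free-form name. The function name `scratchpad_path` and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) are assumptions made for this example, not a documented convention.

```python
import re


def scratchpad_path(issue_number: int, short_name: str) -> str:
    """Build docs/scratchpads/{issue-number}-{short-name}.md."""
    # Assumed slug rule: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if issue_number <= 0 or not slug:
        raise ValueError("need a positive issue number and a non-empty name")
    return f"docs/scratchpads/{issue_number}-{slug}.md"
```

For example, `scratchpad_path(67, "Redis Cluster")` yields `docs/scratchpads/67-redis-cluster.md`.
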
## Vault Secrets Management

**CRITICAL**: Follow canonical Vault structure for ALL secrets.

### Structure

```
{mount}/{service}/{component}/{secret-name}

Examples:
- secret-prod/postgres/database/app
- secret-prod/redis/auth/default
- secret-prod/authentik/admin/token
```

### Environment Mounts

- `secret-dev/` - Development environment
- `secret-staging/` - Staging environment
- `secret-prod/` - Production environment

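The path structure and environment mounts above lend themselves to a quick check before a secret is written. The following is a minimal sketch, assuming segment names are limited to lowercase letters, digits, and hyphens (that rule is inferred from the examples, not stated in this guide). `parse_secret_path` is a hypothetical helper name.

```python
import re

MOUNTS = {"secret-dev", "secret-staging", "secret-prod"}
# Assumed naming rule, inferred from the examples in this guide.
SEGMENT = re.compile(r"^[a-z0-9][a-z0-9-]*$")


def parse_secret_path(path: str) -> dict:
    """Split {mount}/{service}/{component}/{secret-name} and validate each part."""
    parts = path.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 segments, got {len(parts)}: {path!r}")
    mount, service, component, name = parts
    if mount not in MOUNTS:
        raise ValueError(f"unknown mount {mount!r}")
    for segment in (service, component, name):
        if not SEGMENT.match(segment):
            raise ValueError(f"invalid segment {segment!r}")
    return {
        "mount": mount,
        "service": service,
        "component": component,
        "secret_name": name,
    }
```

A call such as `parse_secret_path("secret-prod/postgres/database/app")` returns the four parts as a dictionary, while a path with an unknown mount or the wrong depth raises `ValueError`.
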
### Standard Field Names

- Credentials: `username`, `password`
- Tokens: `token`
- OAuth: `client_id`, `client_secret`
- Connection strings: `url`, `host`, `port`

See `docs/vault-secrets-structure.md` for complete reference.

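The standard field names can be checked mechanically before a secret is stored. This is only a sketch: the category keys (`credentials`, `token`, `oauth`, `connection`) are labels invented for the example, and the complete reference may allow fields not listed here.

```python
STANDARD_FIELDS = {
    "credentials": {"username", "password"},
    "token": {"token"},
    "oauth": {"client_id", "client_secret"},
    "connection": {"url", "host", "port"},
}


def nonstandard_fields(kind: str, secret: dict) -> set:
    """Return the field names in `secret` that are not standard for `kind`."""
    return set(secret) - STANDARD_FIELDS[kind]
```

For instance, a credentials secret that uses `pass` instead of `password` would be reported as `{"pass"}`.
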
## Container Standards

### Dockerfile Best Practices

```dockerfile
# Use specific version tags
FROM node:20-alpine

# Create non-root user
RUN addgroup -S app && adduser -S app -G app

# Set working directory
WORKDIR /app

# Copy dependency files first (layer caching)
COPY package*.json ./
# --omit=dev replaces the deprecated --only=production flag
RUN npm ci --omit=dev

# Copy application code
COPY --chown=app:app . .

# Switch to non-root user
USER app

# Use exec form for CMD
CMD ["node", "server.js"]
```

### Container Security

- Use minimal base images (alpine, distroless)
- Run as non-root user
- Don't store secrets in images
- Scan images for vulnerabilities
- Pin dependency versions

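Two of the rules above (pinned base images, non-root user) can be checked with a few lines of text processing. The sketch below is a rough illustration and no substitute for a real linter or image scanner: it ignores build-stage aliases and `scratch`, and it assumes that a final stage without a `USER` instruction runs as root. `dockerfile_findings` is a hypothetical name.

```python
def dockerfile_findings(text: str) -> list:
    """Flag unpinned base images and a missing non-root USER in a Dockerfile."""
    findings = []
    user = None
    for raw in text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].startswith("#"):
            continue
        keyword = parts[0].upper()
        # Drop flags such as --platform=... so args[0] is the image or user.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if keyword == "FROM" and args:
            image = args[0]
            user = None  # every build stage starts from the base image's user
            tag = image.rsplit("/", 1)[-1].partition(":")[2]
            if "@" not in image and tag in ("", "latest"):
                findings.append(f"unpinned base image: {image}")
        elif keyword == "USER" and args:
            user = args[0]
    if user in (None, "root", "0"):
        findings.append("no non-root USER in final stage")
    return findings
```

Run against the Dockerfile shown earlier in this section, the function returns an empty list; a file that starts with `FROM node` and never sets `USER` yields both findings.
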
## Kubernetes/Docker Compose

### Resource Limits

Always set resource limits to prevent runaway containers:

```yaml
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
```

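A `resources` block like the one above can be sanity-checked before it reaches a cluster. The sketch below assumes the block has already been loaded into a dictionary and handles only a subset of Kubernetes quantity notation (binary memory suffixes `Ki` to `Ti`, CPU as millicores or a plain number). The helper names are made up for this example.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_bytes(quantity: str) -> int:
    """Convert a quantity such as '128Mi' to bytes (binary suffixes only)."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def cpu_millicores(quantity: str) -> int:
    """Convert '100m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def resource_problems(resources: dict) -> list:
    """Report missing limits and requests that exceed their limit."""
    problems = []
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for key, parse in (("memory", memory_bytes), ("cpu", cpu_millicores)):
        if key not in limits:
            problems.append(f"missing limits.{key}")
        elif key in requests and parse(requests[key]) > parse(limits[key]):
            problems.append(f"requests.{key} exceeds limits.{key}")
    return problems
```

The example values in this section pass the check. A block that requests `512Mi` against a `256Mi` limit and omits the CPU limit produces two findings.
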
### Health Checks

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 3
```

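The probes above expect the application to answer on `/health` and `/ready`. The following sketch shows one way a service could back those endpoints using only the Python standard library. The `READY` flag and the choice of 503 for "not ready yet" are assumptions for illustration, and a real service would usually mount these routes in its existing web framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = {"value": False}  # set to True once startup work has finished


def probe_status(path: str, ready: bool) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/health":
        return 200  # the process is alive and able to answer
    if path == "/ready":
        return 200 if ready else 503  # 503 keeps traffic away until ready
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, READY["value"]))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocking call; matches the port used in the probe definitions."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Keeping liveness and readiness separate means a slow start makes the pod unready without triggering a restart.
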
## CI/CD Pipelines

### Pipeline Stages

1. **Lint**: Code style and static analysis
2. **Test**: Unit and integration tests
3. **Build**: Compile and package
4. **Scan**: Security and vulnerability scanning
5. **Deploy**: Environment-specific deployment

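The ordering of the stages matters because a failure should stop everything after it, so a broken build is never scanned or deployed. The sketch below models that fail-fast behaviour. `run_pipeline` and the callable-per-stage shape are illustrative choices, not how any particular CI system is configured.

```python
from typing import Callable, List, Tuple


def run_pipeline(stages: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run stages in order, stop at the first failure, return the ones that passed."""
    passed = []
    for name, step in stages:
        if not step():
            break  # later stages (for example deploy) never run
        passed.append(name)
    return passed
```

With stages lint, test, build, scan, deploy and a failing build, the function returns `["lint", "test"]` and the scan and deploy steps are not invoked.
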
### Pipeline Security

- Use secrets management (not hardcoded)
- Pin action/image versions
- Implement approval gates for production
- Audit pipeline access

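Version pinning can be spot-checked by scanning pipeline files for floating references. The sketch below assumes a workflow syntax with `uses: owner/action@ref` and `image: name:tag` keys, which this guide does not specify, and it treats `main`, `master`, `latest`, and `HEAD` as floating. Local action paths and other syntaxes would need extra handling.

```python
import re

FLOATING = {"main", "master", "latest", "HEAD"}
# Assumed key names; adjust to the CI system actually in use.
REF_LINE = re.compile(r"^[ \t]*(?:-[ \t]*)?(uses|image):[ \t]*(\S+)", re.MULTILINE)


def unpinned_refs(workflow_text: str) -> list:
    """Return action/image references that have no version or a floating one."""
    found = []
    for key, raw in REF_LINE.findall(workflow_text):
        ref = raw.strip("\"'")
        if key == "uses":
            name, _, version = ref.rpartition("@")
        else:
            # Only look for the tag after the last '/', so registry ports are ignored.
            name, _, version = ref.rsplit("/", 1)[-1].rpartition(":")
        if not name or version in FLOATING:
            found.append(ref)
    return found
```

A reference such as `actions/checkout@v4` or `node:20-alpine` passes, while `redis` (no tag) or a hypothetical `example/deploy@main` is reported.
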
## Monitoring & Logging

### Logging Standards

- Use structured logging (JSON)
- Include correlation IDs
- Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
- Never log sensitive data

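The logging rules above can be combined in a single formatter. This is a minimal sketch using the standard `logging` module: the `correlation_id` and `fields` attributes are conventions invented for this example (passed through `extra=`), and the redaction list is an assumption based on the standard Vault field names.

```python
import json
import logging

SENSITIVE_KEYS = {"password", "token", "client_secret"}  # assumed redaction list


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, with a correlation ID and redaction."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key in SENSITIVE_KEYS else value
        return json.dumps(payload)
```

A handler configured with `handler.setFormatter(JsonFormatter())` would then accept calls such as `logger.info("login ok", extra={"correlation_id": "req-1", "fields": {"user": "alice"}})`.
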
### Metrics to Collect

- Request latency (p50, p95, p99)
- Error rates
- Resource utilization (CPU, memory)
- Business metrics

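To make the latency percentiles concrete, the sketch below computes them from raw samples with the nearest-rank method. This is one of several percentile definitions; monitoring systems often estimate percentiles from histograms instead, so their numbers can differ slightly.

```python
import math


def percentile(samples, p: int):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

For latencies `[10, 20, 30, 40]` (in milliseconds, say), p50 is 20 and p99 is 40, which shows how a single slow request dominates the high percentiles in a small sample.
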
### Alerting

- Define SLOs (Service Level Objectives)
- Alert on symptoms, not causes
- Include runbook links in alerts
- Avoid alert fatigue

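One common way to turn an SLO into a symptom-based alert is to track how much of the error budget is left. The sketch below shows the arithmetic only. The function name and the idea of alerting on remaining budget are illustrative assumptions, not a policy defined in this guide.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1 - slo_target) * total
    if allowed_failures <= 0:
        raise ValueError("SLO target must be below 1.0 and total must be positive")
    return 1 - failed / allowed_failures
```

With a 99% target, 10,000 requests allow about 100 failures, so 25 failed requests leave roughly three quarters of the budget. An alert could fire when the remaining budget drops below a chosen threshold instead of on every individual error.
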
## Testing Infrastructure

### Test Categories

1. **Unit tests**: Terraform/Ansible logic
2. **Integration tests**: Deployed resources work together
3. **Smoke tests**: Critical paths after deployment
4. **Chaos tests**: Failure mode validation

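As an illustration of the smoke-test category, the sketch below checks a list of critical paths and reports the ones that fail. The `fetch_status` callable is a stand-in supplied by the caller (for example a thin wrapper around an HTTP client), which keeps the example independent of any particular deployment or URL.

```python
from typing import Callable, Iterable, List


def failing_smoke_checks(
    paths: Iterable[str], fetch_status: Callable[[str], int]
) -> List[str]:
    """Return the critical paths that did not answer with HTTP 200."""
    failures = []
    for path in paths:
        try:
            ok = fetch_status(path) == 200
        except Exception:
            ok = False  # connection errors count as failures
        if not ok:
            failures.append(path)
    return failures
```

An empty result means every critical path responded. Anything else can fail the deploy stage or trigger a rollback.
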
### Infrastructure Testing Tools

- Terraform: `terraform validate`, `terraform plan`
- Ansible: `ansible-lint`, molecule
- Kubernetes: `kubectl apply --dry-run=client`, kubeval
- General: Terratest, ServerSpec

## Commit Format

```
chore(#67): Configure Redis cluster

- Add Redis StatefulSet with 3 replicas
- Configure persistence with PVC
- Add Vault secret for auth password

Refs #67
```

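The commit format above can be enforced with a small check, for example in a commit-msg hook. The sketch below infers the rules from the single example (a `type(#issue): subject` header, a blank line, and a `Refs #issue` footer matching the header), so treat the exact pattern as an assumption. Any lowercase word is accepted as the type.

```python
import re

HEADER = re.compile(r"^(?P<type>[a-z]+)\(#(?P<issue>\d+)\): (?P<subject>\S.*)$")


def check_commit_message(message: str) -> list:
    """Return a list of problems; an empty list means the message conforms."""
    lines = message.strip().splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if not match:
        return ["header must look like 'type(#issue): subject'"]
    problems = []
    if len(lines) > 1 and lines[1].strip():
        problems.append("blank line required after header")
    footer = f"Refs #{match.group('issue')}"
    if footer not in [line.strip() for line in lines[1:]]:
        problems.append(f"missing '{footer}' footer")
    return problems
```

The example message in this section passes with no problems, while a message without the `Refs #67` line is reported as missing its footer.
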
## Before Completing

1. Validate configuration syntax
2. Run infrastructure tests
3. Test in dev/staging first
4. Document any manual steps required
5. Update scratchpad and close issue