# Infrastructure & DevOps Guide

## Before Starting
1. Check assigned issue: `~/.config/mosaic/rails/git/issue-list.sh -a @me`
2. Create scratchpad: `docs/scratchpads/{issue-number}-{short-name}.md`
3. Review existing infrastructure configuration

## Vault Secrets Management

**CRITICAL**: Follow canonical Vault structure for ALL secrets.

### Structure
```
{mount}/{service}/{component}/{secret-name}

Examples:
- secret-prod/postgres/database/app
- secret-prod/redis/auth/default
- secret-prod/authentik/admin/token
```

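The path layout above can be checked mechanically. The following is a minimal sketch, not part of any existing tooling: the function names are hypothetical, the mount names are the ones this guide lists for its environments, and the rule that each segment is lowercase alphanumerics separated by hyphens is an assumption inferred from the examples.

```python
import re

# Mounts named in this guide; adjust if the canonical reference lists more.
ALLOWED_MOUNTS = {"secret-dev", "secret-staging", "secret-prod"}

# Assumption: a segment is lowercase alphanumerics with optional single hyphens.
_SEGMENT = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def build_secret_path(mount: str, service: str, component: str, secret_name: str) -> str:
    """Return '{mount}/{service}/{component}/{secret-name}' or raise ValueError."""
    if mount not in ALLOWED_MOUNTS:
        raise ValueError(f"unknown mount: {mount!r}")
    for segment in (service, component, secret_name):
        if not _SEGMENT.match(segment):
            raise ValueError(f"invalid path segment: {segment!r}")
    return "/".join((mount, service, component, secret_name))


def is_canonical_path(path: str) -> bool:
    """True when the path has exactly four valid segments under a known mount."""
    parts = path.split("/")
    if len(parts) != 4:
        return False
    try:
        build_secret_path(*parts)
    except ValueError:
        return False
    return True
```

A check like this could run in CI against any file that references Vault paths, so that malformed paths fail before they reach an environment.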
### Environment Mounts
- `secret-dev/` - Development environment
- `secret-staging/` - Staging environment
- `secret-prod/` - Production environment

### Standard Field Names
- Credentials: `username`, `password`
- Tokens: `token`
- OAuth: `client_id`, `client_secret`
- Connection strings: `url`, `host`, `port`

See `docs/vault-secrets-structure.md` for complete reference.

## Container Standards

### Dockerfile Best Practices
```dockerfile
# Use specific version tags
FROM node:20-alpine

# Create non-root user
RUN addgroup -S app && adduser -S app -G app

# Set working directory
WORKDIR /app

# Copy dependency files first (layer caching)
COPY package*.json ./
# Install production dependencies only (--omit=dev replaces the deprecated --only=production)
RUN npm ci --omit=dev

# Copy application code
COPY --chown=app:app . .

# Switch to non-root user
USER app

# Use exec form for CMD
CMD ["node", "server.js"]
```

### Container Security
- Use minimal base images (alpine, distroless)
- Run as non-root user
- Don't store secrets in images
- Scan images for vulnerabilities
- Pin dependency versions

## Kubernetes/Docker Compose

### Resource Limits
Always set resource limits to prevent runaway containers:
```yaml
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
```

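A quick way to enforce the rule is to lint the `resources` block before it is applied. The sketch below is illustrative only: the function names are hypothetical, and the quantity parser handles just a subset of Kubernetes suffixes (`m` for CPU; `Ki`, `Mi`, `Gi`, `k`, `M`, `G` for memory).

```python
# Multipliers for the subset of memory suffixes this sketch understands.
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "k": 10**3, "M": 10**6, "G": 10**9}


def parse_cpu(value: str) -> float:
    """Return CPU cores, e.g. '100m' is 0.1 cores and '2' is 2.0 cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


def parse_memory(value: str) -> int:
    """Return bytes, e.g. '128Mi' is 128 * 2**20 bytes."""
    for suffix in ("Ki", "Mi", "Gi", "k", "M", "G"):
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * _MEMORY_UNITS[suffix]
    return int(value)  # plain byte count


def check_resources(resources: dict) -> list[str]:
    """List problems: a missing limit, or a request larger than its limit."""
    problems = []
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for key, parse in (("cpu", parse_cpu), ("memory", parse_memory)):
        if key not in limits:
            problems.append(f"missing limits.{key}")
        elif key in requests and parse(requests[key]) > parse(limits[key]):
            problems.append(f"requests.{key} exceeds limits.{key}")
    return problems
```

An empty list means the block sets both limits and keeps each request at or below its limit.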
### Health Checks
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 3
```

## CI/CD Pipelines

### Pipeline Stages
1. **Lint**: Code style and static analysis
2. **Test**: Unit and integration tests
3. **Build**: Compile and package
4. **Scan**: Security and vulnerability scanning
5. **Deploy**: Environment-specific deployment

### Pipeline Security
- Use secrets management (not hardcoded)
- Pin action/image versions
- Implement approval gates for production
- Audit pipeline access

## Steered-Autonomous Deployment (Hard Rule)

In lights-out mode, the agent owns deployment end-to-end when deployment is in scope.
The human is escalation-only for missing access, hard policy conflicts, or irreversible risk.

### Deployment Target Selection

1. Use explicit target from `docs/PRD.md` / `docs/PRD.json` or `docs/DEPLOYMENT.md`.
2. If unspecified, infer from existing project config/integration.
3. If multiple targets exist, choose the target already wired in CI/CD and document rationale.

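The selection order above reads naturally as a precedence function. This is a sketch under stated assumptions: `select_target` and its parameters are hypothetical names, and raising an error when nothing can be determined is an assumption, since the guide does not say what happens in that case.

```python
from typing import Optional, Sequence, Tuple


def select_target(
    explicit: Optional[str],
    configured: Sequence[str],
    ci_wired: Optional[str],
) -> Tuple[str, str]:
    """Return (target, rationale) following the documented precedence."""
    # 1. An explicit target from the PRD or DEPLOYMENT docs always wins.
    if explicit:
        return explicit, "explicit target from project docs"
    # 2. A single configured target can be inferred directly.
    if len(configured) == 1:
        return configured[0], "inferred from existing project config"
    # 3. With several candidates, prefer the one already wired in CI/CD.
    if len(configured) > 1 and ci_wired in configured:
        return ci_wired, "multiple targets; chose the one wired in CI/CD"
    # Assumption: an undeterminable target is treated as an error.
    raise LookupError("no deployment target could be determined")
```

Returning the rationale alongside the target makes it easy to record the decision in the scratchpad.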
### Supported Targets

- **Portainer**: Deploy via configured stack webhook/API, then verify service health and container status.
- **Coolify**: Trigger deployment via Coolify API/webhook, then verify deployment status and endpoint health.
- **Vercel**: Deploy via `vercel` CLI or connected Git integration, then verify preview/production URL health.
- **Other SaaS providers**: Use provider CLI/API/runbook with the same validation and rollback gates.

### Image Tagging and Promotion (Hard Rule)

For containerized deployments:

1. Build immutable image tags: `sha-<shortsha>` and `v{base-version}-rc.{build}`.
2. Use mutable environment tags only as pointers: `testing`, optional `staging`, and `prod`.
3. Deploy by immutable digest, not by mutable tag alone.
4. Promote the exact tested digest between environments (no rebuild between testing and prod).
5. Do not use `latest` or `dev` as deployment references.

Blue-green is the default strategy for production promotion.
Canary is allowed only when automated SLO/error-rate gates and auto-rollback triggers are implemented.

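The tagging rules can be illustrated with a few helpers. This is a minimal sketch: the function names and the example registry are hypothetical, and the 7-character short SHA is an assumption, since the guide does not fix a length.

```python
import re

_DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def immutable_tags(commit_sha: str, base_version: str, build: int) -> list[str]:
    """Return the two immutable tags for a build: sha-<shortsha> and an RC tag."""
    short_sha = commit_sha[:7]  # assumption: 7-character short SHA
    return [f"sha-{short_sha}", f"v{base_version}-rc.{build}"]


def deploy_reference(repository: str, digest: str) -> str:
    """Return a digest-pinned reference such as 'repo@sha256:<64 hex chars>'."""
    if not _DIGEST.match(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{repository}@{digest}"


def is_deployable(reference: str) -> bool:
    """True only for digest-pinned references; tags like ':latest' are rejected."""
    _, separator, digest = reference.partition("@")
    return bool(separator) and bool(_DIGEST.match(digest))
```

Because promotion reuses the same digest, the reference built for testing is the one that later reaches production; only the mutable environment tag moves.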
### Post-Deploy Validation (REQUIRED)

1. Health endpoints return expected status.
2. Critical smoke tests pass in target environment.
3. Running version and digest match the promoted release candidate.
4. Observability signals (errors/latency) are within expected thresholds.

### Rollback Rule

If post-deploy validation fails:

1. Execute rollback/redeploy-safe path immediately.
2. Mark deployment as blocked in `docs/TASKS.md`.
3. Record failure evidence and next remediation step in scratchpad and release notes.

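The validation gates and the rollback rule combine into a single decision. The sketch below assumes the observations have already been collected. The `DeployObservation` type, its field names, and the threshold defaults are all hypothetical, and only the digest (not the version string) is compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployObservation:
    health_ok: bool            # health endpoints returned the expected status
    smoke_tests_passed: bool   # critical smoke tests passed in the target environment
    running_digest: str        # digest reported by the running service
    promoted_digest: str       # digest of the promoted release candidate
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float


def failed_checks(
    obs: DeployObservation,
    max_error_rate: float = 0.01,       # hypothetical threshold
    max_p95_latency_ms: float = 500.0,  # hypothetical threshold
) -> list[str]:
    """Return the names of the post-deploy gates that did not pass."""
    failures = []
    if not obs.health_ok:
        failures.append("health")
    if not obs.smoke_tests_passed:
        failures.append("smoke tests")
    if obs.running_digest != obs.promoted_digest:
        failures.append("digest mismatch")
    if obs.error_rate > max_error_rate or obs.p95_latency_ms > max_p95_latency_ms:
        failures.append("observability thresholds")
    return failures


def next_action(obs: DeployObservation) -> str:
    """Any failed gate triggers rollback; otherwise the deployment stands."""
    return "rollback" if failed_checks(obs) else "keep"
```

The list returned by `failed_checks` doubles as the failure evidence to record in the scratchpad and release notes.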
### Registry Retention and Cleanup

Cleanup MUST be automated.

- Keep all final release tags (`vX.Y.Z`) indefinitely.
- Keep active environment digests (`prod`, `testing`, and active blue/green slots).
- Keep recent RC tags (`vX.Y.Z-rc.N`) based on retention window.
- Remove stale `sha-*` and RC tags outside retention window if they are not actively deployed.

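The retention rules reduce to a filter over registry tags. This sketch assumes the tag listing is already available as a mapping of tag name to `(digest, pushed_at)`. The function name and the 30-day default window are hypothetical, because the guide does not state a window length.

```python
import re
from datetime import datetime, timedelta

_RELEASE = re.compile(r"^v\d+\.\d+\.\d+$")
_RC = re.compile(r"^v\d+\.\d+\.\d+-rc\.\d+$")


def tags_to_delete(
    tags: dict[str, tuple[str, datetime]],
    active_digests: set[str],
    now: datetime,
    retention: timedelta = timedelta(days=30),  # hypothetical window
) -> list[str]:
    """Return stale sha-* and RC tags that are safe to remove, sorted by name."""
    stale = []
    for name, (digest, pushed_at) in tags.items():
        if _RELEASE.match(name):
            continue  # final release tags are kept indefinitely
        if digest in active_digests:
            continue  # never remove a digest that is currently deployed
        is_candidate = name.startswith("sha-") or bool(_RC.match(name))
        if is_candidate and now - pushed_at > retention:
            stale.append(name)
    return sorted(stale)
```

Environment pointer tags such as `prod` never match the candidate patterns, so this filter leaves them untouched.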
## Monitoring & Logging

### Logging Standards
- Use structured logging (JSON)
- Include correlation IDs
- Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
- Never log sensitive data

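As one possible illustration of these standards, the sketch below emits JSON log lines with a correlation ID using only Python's standard `logging` module. The `JsonFormatter` class, the `make_logger` helper, and the field names in the payload are hypothetical choices, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Supplied per call via extra={"correlation_id": ...}; None if absent.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def make_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# Usage: StringIO stands in for stdout so the output can be inspected.
stream = io.StringIO()
logger = make_logger("deploy", stream)
logger.info("deploy started", extra={"correlation_id": "req-123"})
record = json.loads(stream.getvalue())
```

Passing the correlation ID through `extra` keeps it out of the message text, which makes it easier to filter on and harder to leak sensitive values by string formatting.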
### Metrics to Collect
- Request latency (p50, p95, p99)
- Error rates
- Resource utilization (CPU, memory)
- Business metrics

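To make the latency percentiles and error rate concrete, here is a small sketch that computes them from raw samples. It uses the nearest-rank percentile definition, which is one of several in common use. Real monitoring systems typically derive percentiles from histograms instead, and the function names are hypothetical.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Return the p50, p95, and p99 latencies in milliseconds."""
    return {f"p{pct}": percentile(samples_ms, pct) for pct in (50, 95, 99)}


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests
```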
### Alerting
- Define SLOs (Service Level Objectives)
- Alert on symptoms, not causes
- Include runbook links in alerts
- Avoid alert fatigue

## Testing Infrastructure

### Test Categories
1. **Unit tests**: Terraform/Ansible logic
2. **Integration tests**: Deployed resources work together
3. **Smoke tests**: Critical paths after deployment
4. **Chaos tests**: Failure mode validation

### Infrastructure Testing Tools
- Terraform: `terraform validate`, `terraform plan`
- Ansible: `ansible-lint`, molecule
- Kubernetes: `kubectl apply --dry-run=client`, kubeval
- General: Terratest, ServerSpec

## Commit Format
```
chore(#67): Configure Redis cluster

- Add Redis StatefulSet with 3 replicas
- Configure persistence with PVC
- Add Vault secret for auth password

Refs #67
```

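The commit format can be checked automatically, for example from a commit-msg hook. The sketch below infers its rules from the single example above: a `type(#issue): Summary` subject line and a `Refs #issue` footer that matches it. The function name is hypothetical, and accepting any lowercase word as the type is an assumption, since only `chore` is shown.

```python
import re

# Subject line: lowercase type, issue number in parentheses, then a summary.
_SUBJECT = re.compile(r"^(?P<type>[a-z]+)\(#(?P<issue>\d+)\): (?P<summary>\S.*)$")


def check_commit_message(message: str) -> list[str]:
    """Return a list of format problems; an empty list means the message passes."""
    lines = message.strip().splitlines()
    match = _SUBJECT.match(lines[0]) if lines else None
    if not match:
        return ["subject must look like 'type(#issue): Summary'"]
    problems = []
    expected_footer = f"Refs #{match['issue']}"
    if expected_footer not in lines[1:]:
        problems.append(f"footer must include '{expected_footer}'")
    return problems
```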
## Before Completing
1. Validate configuration syntax
2. Run infrastructure tests
3. Test in dev/staging first
4. Document any manual steps required
5. Update scratchpad and close issue