Infrastructure & DevOps Guide
Before Starting
- Check assigned issue: ~/.config/mosaic/tools/git/issue-list.sh -a @me
- Create scratchpad: docs/scratchpads/{issue-number}-{short-name}.md
- Review existing infrastructure configuration
Vault Secrets Management
CRITICAL: Follow canonical Vault structure for ALL secrets.
Structure
{mount}/{service}/{component}/{secret-name}
Examples:
- secret-prod/postgres/database/app
- secret-prod/redis/auth/default
- secret-prod/authentik/admin/token
Environment Mounts
- secret-dev/: Development environment
- secret-staging/: Staging environment
- secret-prod/: Production environment
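A path can be checked against the canonical structure before anything is written to Vault. The following is a minimal sketch, not part of the Mosaic tooling: the function name is hypothetical, and it only checks the mount name and the four-segment shape described above.

```shell
# Hypothetical helper: succeed only if the path looks like
# {mount}/{service}/{component}/{secret-name} under a known environment mount.
is_canonical_vault_path() {
  # Must start with one of the three environment mounts and have at least four segments.
  case $1 in
    secret-dev/*/*/* | secret-staging/*/*/* | secret-prod/*/*/*) ;;
    *) return 1 ;;
  esac
  # Reject extra segments, empty segments, and a trailing slash.
  case $1 in
    */*/*/*/* | *//* | */) return 1 ;;
  esac
  return 0
}
```

For example, secret-prod/postgres/database/app passes, while secret-prod/postgres/app (missing a segment) and prod/postgres/database/app (unknown mount) are rejected.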
Standard Field Names
- Credentials: username, password
- Tokens: token
- OAuth: client_id, client_secret
- Connection strings: url, host, port
See docs/vault-secrets-structure.md for complete reference.
Container Standards
Dockerfile Best Practices
# Use specific version tags
FROM node:20-alpine
# Create non-root user
RUN addgroup -S app && adduser -S app -G app
# Set working directory
WORKDIR /app
# Copy dependency files first (layer caching)
COPY package*.json ./
# --omit=dev replaces the deprecated --only=production flag on current npm
RUN npm ci --omit=dev
# Copy application code
COPY --chown=app:app . .
# Switch to non-root user
USER app
# Use exec form for CMD
CMD ["node", "server.js"]
Container Security
- Use minimal base images (alpine, distroless)
- Run as non-root user
- Don't store secrets in images
- Scan images for vulnerabilities
- Pin dependency versions
Kubernetes/Docker Compose
Resource Limits
Always set resource limits to prevent runaway containers:
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "500m"
Health Checks
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 3
CI/CD Pipelines
Pipeline Stages
- Lint: Code style and static analysis
- Test: Unit and integration tests
- Build: Compile and package
- Scan: Security and vulnerability scanning
- Deploy: Environment-specific deployment
Pipeline Security
- Use secrets management (not hardcoded)
- Pin action/image versions
- Implement approval gates for production
- Audit pipeline access
Steered-Autonomous Deployment (Hard Rule)
In lights-out mode, the agent owns deployment end-to-end when deployment is in scope. The human is escalation-only for missing access, hard policy conflicts, or irreversible risk.
Deployment Target Selection
- Use explicit target from docs/PRD.md / docs/PRD.json or docs/DEPLOYMENT.md.
- If unspecified, infer from existing project config/integration.
- If multiple targets exist, choose the target already wired in CI/CD and document rationale.
Supported Targets
- Portainer: Deploy via ~/.config/mosaic/tools/portainer/stack-redeploy.sh, then verify with stack-status.sh.
- Coolify: Deploy via ~/.config/mosaic/tools/coolify/deploy.sh -u <uuid>, then verify with service-status.sh.
- Vercel: Deploy via vercel CLI or connected Git integration, then verify preview/production URL health.
- Other SaaS providers: Use provider CLI/API/runbook with the same validation and rollback gates.
Coolify API Operations
# List projects and services
~/.config/mosaic/tools/coolify/project-list.sh
~/.config/mosaic/tools/coolify/service-list.sh
# Check service status
~/.config/mosaic/tools/coolify/service-status.sh -u <uuid>
# Set env vars (takes effect on next deploy)
~/.config/mosaic/tools/coolify/env-set.sh -u <uuid> -k KEY -v VALUE
# Deploy
~/.config/mosaic/tools/coolify/deploy.sh -u <uuid>
Known Coolify Limitations:
- FQDN updates on compose sub-apps not supported via API (DB workaround required)
- Compose files must be base64-encoded in the docker_compose_raw field
- Magic variables (SERVICE_FQDN_*) require list-style env syntax, not dict-style
- Rate limit: 200 requests per interval
Cloudflare DNS Operations
Use the Cloudflare tools for any DNS configuration: pointing domains at services, adding TXT verification records, managing MX records, etc.
Multi-instance support: Credentials support named instances (e.g. personal, work). A default key in credentials.json determines which instance is used when -a is omitted. Pass -a <instance> to target a specific account.
# List all zones (domains) in the account
~/.config/mosaic/tools/cloudflare/zone-list.sh [-a instance]
# List DNS records for a zone (accepts zone name or ID)
~/.config/mosaic/tools/cloudflare/record-list.sh -z <zone> [-t type] [-n name]
# Create a DNS record
~/.config/mosaic/tools/cloudflare/record-create.sh -z <zone> -t <type> -n <name> -c <content> [-p] [-l ttl] [-P priority]
# Update a DNS record (requires record ID from record-list)
~/.config/mosaic/tools/cloudflare/record-update.sh -z <zone> -r <record-id> -t <type> -n <name> -c <content> [-p]
# Delete a DNS record
~/.config/mosaic/tools/cloudflare/record-delete.sh -z <zone> -r <record-id>
Flag reference:
| Flag | Purpose |
|---|---|
| -z | Zone name (e.g. mosaicstack.dev) or 32-char zone ID |
| -a | Named Cloudflare instance (omit for default) |
| -t | Record type: A, AAAA, CNAME, MX, TXT, SRV, etc. |
| -n | Record name: short (app) or FQDN (app.example.com) |
| -c | Record content/value (IP, hostname, TXT string, etc.) |
| -r | Record ID (from record-list.sh output) |
| -p | Enable Cloudflare proxy (orange cloud); omit for DNS-only (grey cloud) |
| -l | TTL in seconds (default: 1 = auto) |
| -P | Priority for MX/SRV records |
| -f | Output format: table (default) or json |
Common workflows:
# Point a new subdomain at a server (proxied through Cloudflare)
~/.config/mosaic/tools/cloudflare/record-create.sh \
-z example.com -t A -n myapp -c 203.0.113.10 -p
# Add a TXT record for domain verification (never proxied)
~/.config/mosaic/tools/cloudflare/record-create.sh \
-z example.com -t TXT -n _verify -c "verification=abc123"
# Check what records exist before making changes
~/.config/mosaic/tools/cloudflare/record-list.sh -z example.com -t CNAME
# Update an existing record (get record ID from record-list first)
~/.config/mosaic/tools/cloudflare/record-update.sh \
-z example.com -r <record-id> -t A -n myapp -c 10.0.0.5 -p
DNS + Deployment integration: When deploying a new service via Coolify or Portainer that needs a public domain, the typical sequence is:
- Create the DNS record pointing at the host IP (with -p for Cloudflare proxy if desired)
- Deploy the service via Coolify/Portainer
- Verify the domain resolves and the service is reachable
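The sequence above can be sketched as one function. This is an illustration under assumptions, not an existing Mosaic tool: publish_service, run, and DRY_RUN are hypothetical names, the /health path is assumed to exist on the service, and the verify step relies on dig and curl being installed. With DRY_RUN=1 (the default here) each command is printed rather than executed.

```shell
TOOLS="$HOME/.config/mosaic/tools"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Usage: publish_service <zone> <record-name> <host-ip> <coolify-uuid>
publish_service() {
  zone=$1
  name=$2
  ip=$3
  uuid=$4
  # 1. Create the DNS record (append -p to proxy through Cloudflare).
  run "$TOOLS/cloudflare/record-create.sh" -z "$zone" -t A -n "$name" -c "$ip" || return 1
  # 2. Deploy the service via Coolify.
  run "$TOOLS/coolify/deploy.sh" -u "$uuid" || return 1
  # 3. Verify the name resolves and the service answers (assumed /health endpoint).
  run dig +short "$name.$zone" || return 1
  run curl -fsS -o /dev/null "https://$name.$zone/health"
}
```

Running publish_service example.com myapp 203.0.113.10 <uuid> in dry-run mode lists the four commands in order, which is a cheap way to review the plan before setting DRY_RUN=0.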
Proxy (-p) guidance:
- Use proxy (orange cloud) for web services — provides CDN, DDoS protection, and hides origin IP
- Skip proxy (grey cloud) for non-HTTP services (mail, SSH), wildcard records, or when the service handles its own TLS termination and needs direct client IP visibility
- Proxy is NOT compatible with non-standard ports outside Cloudflare's supported range
Stack Health Check
Verify all infrastructure services are reachable:
~/.config/mosaic/tools/health/stack-health.sh
Image Tagging and Promotion (Hard Rule)
For containerized deployments:
- Build immutable image tags: sha-<shortsha> and v{base-version}-rc.{build}.
- Use mutable environment tags only as pointers: testing, optional staging, and prod.
- Deploy by immutable digest, not by mutable tag alone.
- Promote the exact tested digest between environments (no rebuild between testing and prod).
- Do not use latest or dev as deployment references.
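The tagging rules above can be sketched as two small helpers. Both function names are hypothetical, the 7-character short SHA is an assumption (the guide does not fix a length), and the digest check only looks for the standard image@sha256:... reference form.

```shell
# Print the two immutable tags for a build, one per line.
# Usage: immutable_tags <full-commit-sha> <base-version> <build-number>
immutable_tags() {
  full_sha=$1
  base_version=$2
  build=$3
  short_sha=$(printf '%s' "$full_sha" | cut -c1-7)   # assumed 7-char abbreviation
  printf 'sha-%s\n' "$short_sha"
  printf 'v%s-rc.%s\n' "$base_version" "$build"
}

# Succeed only for digest-pinned references such as registry.example.com/app@sha256:<hex>.
is_digest_ref() {
  case $1 in
    *@sha256:?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A deploy script could call is_digest_ref on its image argument and refuse to continue when it is handed a bare tag such as prod or latest.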
Blue-green is the default strategy for production promotion. Canary is allowed only when automated SLO/error-rate gates and auto-rollback triggers are implemented.
Post-Deploy Validation (REQUIRED)
- Health endpoints return expected status.
- Critical smoke tests pass in target environment.
- Running version and digest match the promoted release candidate.
- Observability signals (errors/latency) are within expected thresholds.
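The health and digest checks lend themselves to scripting. A minimal sketch, assuming the service exposes an HTTP health endpoint and that the running digest can be read from the deployment target; retry and digest_matches are hypothetical helper names, not Mosaic tools.

```shell
# Retry a check until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay-seconds> <command...>
# Example: retry 10 3 curl -fsS -o /dev/null https://myapp.example.com/health
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Succeed only if the running digest is non-empty and equals the promoted one.
# Usage: digest_matches <running-digest> <promoted-digest>
digest_matches() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

A non-zero exit from either helper is the signal to follow the rollback rule rather than to keep retrying by hand.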
Rollback Rule
If post-deploy validation fails:
- Execute rollback/redeploy-safe path immediately.
- Mark deployment as blocked in docs/TASKS.md.
- Record failure evidence and next remediation step in scratchpad and release notes.
Registry Retention and Cleanup
Cleanup MUST be automated.
- Keep all final release tags (vX.Y.Z) indefinitely.
- Keep active environment digests (prod, testing, and active blue/green slots).
- Keep recent RC tags (vX.Y.Z-rc.N) based on retention window.
- Remove stale sha-* and RC tags outside retention window if they are not actively deployed.
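An automated cleanup job first has to sort tags into those that are always kept and those that may be removed. The sketch below shows only that classification step; retention_class and its output labels are hypothetical, and a tag marked candidate must still be checked against the retention window and the set of actively deployed digests before deletion.

```shell
# Print keep, candidate, or review for a single tag name.
retention_class() {
  case $1 in
    prod | testing | staging) echo keep ;;          # environment pointers (staging is optional)
    v[0-9]*-rc.[0-9]*)        echo candidate ;;     # RC tags: retention window applies
    v[0-9]*.[0-9]*.[0-9]*)    echo keep ;;          # final releases: kept indefinitely
    sha-*)                    echo candidate ;;     # per-commit tags: retention window applies
    *)                        echo review ;;        # unknown naming: leave for a human
  esac
}
```

The RC pattern is tested before the release pattern because a tag such as v1.2.3-rc.4 would otherwise also match the looser vX.Y.Z glob.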
Monitoring & Logging
Logging Standards
- Use structured logging (JSON)
- Include correlation IDs
- Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
- Never log sensitive data
Metrics to Collect
- Request latency (p50, p95, p99)
- Error rates
- Resource utilization (CPU, memory)
- Business metrics
Alerting
- Define SLOs (Service Level Objectives)
- Alert on symptoms, not causes
- Include runbook links in alerts
- Avoid alert fatigue
Testing Infrastructure
Test Categories
- Unit tests: Terraform/Ansible logic
- Integration tests: Deployed resources work together
- Smoke tests: Critical paths after deployment
- Chaos tests: Failure mode validation
Infrastructure Testing Tools
- Terraform: terraform validate, terraform plan
- Ansible: ansible-lint, molecule
- Kubernetes: kubectl apply --dry-run=client, kubeval
- General: Terratest, ServerSpec
Commit Format
chore(#67): Configure Redis cluster
- Add Redis StatefulSet with 3 replicas
- Configure persistence with PVC
- Add Vault secret for auth password
Refs #67
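The subject line can be checked mechanically, for instance from a commit-msg hook. The pattern below is an assumption inferred from the single example above (lowercase type, issue number in parentheses, colon, summary); the guide does not define the full set of allowed types, and commit_subject_ok is a hypothetical name.

```shell
# Succeed if the subject looks like: <type>(#<issue>): <summary>
commit_subject_ok() {
  printf '%s\n' "$1" | grep -Eq '^[a-z]+\(#[0-9]+\): .+'
}
```

With this check, chore(#67): Configure Redis cluster passes, while a subject without the issue reference does not.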
Before Completing
- Validate configuration syntax
- Run infrastructure tests
- Test in dev/staging first
- Document any manual steps required
- Update scratchpad and close issue