bootstrap/guides/INFRASTRUCTURE.md
Jason Woltje e5c4bf25b3 feat: add Cloudflare DNS tool suite with multi-instance support
- zone-list, record-list, record-create, record-update, record-delete
- Named instance support (-a flag) with configurable default
- Zone name-to-ID auto-resolution in shared _lib.sh
- Updated credentials loader with cloudflare/cloudflare-<name> services
- TOOLS.md and INFRASTRUCTURE.md guide documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 21:31:52 -06:00

Infrastructure & DevOps Guide

Before Starting

  1. Check assigned issue: ~/.config/mosaic/tools/git/issue-list.sh -a @me
  2. Create scratchpad: docs/scratchpads/{issue-number}-{short-name}.md
  3. Review existing infrastructure configuration

Vault Secrets Management

CRITICAL: Follow the canonical Vault structure for ALL secrets.

Structure

{mount}/{service}/{component}/{secret-name}

Examples:
- secret-prod/postgres/database/app
- secret-prod/redis/auth/default
- secret-prod/authentik/admin/token

Environment Mounts

  • secret-dev/ - Development environment
  • secret-staging/ - Staging environment
  • secret-prod/ - Production environment
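The path structure and environment mounts above can be checked mechanically. The following is a minimal sketch, not part of the Mosaic tool suite: `vault_path` is a hypothetical helper, and the allowed character set for segments (lowercase letters, digits, hyphens) is an assumption the guide does not state.

```shell
# Hypothetical helper: build a canonical Vault path from its four parts.
# Assumption: the mount is one of secret-dev, secret-staging, secret-prod,
# and every other segment uses only lowercase letters, digits, and hyphens.
vault_path() {
  if [ "$#" -ne 4 ]; then
    echo "usage: vault_path <mount> <service> <component> <secret-name>" >&2
    return 2
  fi
  case "$1" in
    secret-dev|secret-staging|secret-prod) ;;
    *) echo "unknown mount: $1" >&2; return 1 ;;
  esac
  for segment in "$2" "$3" "$4"; do
    case "$segment" in
      ''|*[!a-z0-9-]*) echo "invalid segment: '$segment'" >&2; return 1 ;;
    esac
  done
  printf '%s/%s/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Prints the first example path from the Structure section.
vault_path secret-prod postgres database app
```

Rejecting unknown mounts early keeps a typo such as secret-production from silently creating a parallel secret tree.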

Standard Field Names

  • Credentials: username, password
  • Tokens: token
  • OAuth: client_id, client_secret
  • Connection strings: url, host, port

See docs/vault-secrets-structure.md for complete reference.

Container Standards

Dockerfile Best Practices

# Use specific version tags
FROM node:20-alpine

# Create non-root user
RUN addgroup -S app && adduser -S app -G app

# Set working directory
WORKDIR /app

# Copy dependency files first (layer caching)
COPY package*.json ./
RUN npm ci --omit=dev

# Copy application code
COPY --chown=app:app . .

# Switch to non-root user
USER app

# Use exec form for CMD
CMD ["node", "server.js"]

Container Security

  • Use minimal base images (alpine, distroless)
  • Run as non-root user
  • Don't store secrets in images
  • Scan images for vulnerabilities
  • Pin dependency versions

Kubernetes/Docker Compose

Resource Limits

Always set resource limits to prevent runaway containers:

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"

Health Checks

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 3

CI/CD Pipelines

Pipeline Stages

  1. Lint: Code style and static analysis
  2. Test: Unit and integration tests
  3. Build: Compile and package
  4. Scan: Security and vulnerability scanning
  5. Deploy: Environment-specific deployment

Pipeline Security

  • Use secrets management; never hardcode credentials in pipeline files
  • Pin action/image versions
  • Implement approval gates for production
  • Audit pipeline access

Steered-Autonomous Deployment (Hard Rule)

In lights-out mode, the agent owns deployment end-to-end when deployment is in scope. Escalate to the human only for missing access, hard policy conflicts, or irreversible risk.

Deployment Target Selection

  1. Use the explicit target from docs/PRD.md, docs/PRD.json, or docs/DEPLOYMENT.md.
  2. If unspecified, infer from existing project config/integration.
  3. If multiple targets exist, choose the target already wired in CI/CD and document rationale.

Supported Targets

  • Portainer: Deploy via ~/.config/mosaic/tools/portainer/stack-redeploy.sh, then verify with stack-status.sh.
  • Coolify: Deploy via ~/.config/mosaic/tools/coolify/deploy.sh -u <uuid>, then verify with service-status.sh.
  • Vercel: Deploy via vercel CLI or connected Git integration, then verify preview/production URL health.
  • Other SaaS providers: Use provider CLI/API/runbook with the same validation and rollback gates.

Coolify API Operations

# List projects and services
~/.config/mosaic/tools/coolify/project-list.sh
~/.config/mosaic/tools/coolify/service-list.sh

# Check service status
~/.config/mosaic/tools/coolify/service-status.sh -u <uuid>

# Set env vars (takes effect on next deploy)
~/.config/mosaic/tools/coolify/env-set.sh -u <uuid> -k KEY -v VALUE

# Deploy
~/.config/mosaic/tools/coolify/deploy.sh -u <uuid>

Known Coolify Limitations:

  • FQDN updates on compose sub-apps not supported via API (DB workaround required)
  • Compose files must be base64-encoded in docker_compose_raw field
  • Magic variables (SERVICE_FQDN_*) require list-style env syntax, not dict-style
  • Rate limit: 200 requests per interval
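For the base64 requirement on docker_compose_raw, a minimal sketch of the encoding step follows. `encode_compose` is a hypothetical helper and the compose content is illustrative only; the sketch stops at building the JSON field and does not show the API request itself.

```shell
# Hypothetical helper: encode a compose file for the docker_compose_raw field.
# tr strips the line wrapping that some base64 implementations add.
encode_compose() {
  base64 < "$1" | tr -d '\n'
}

# Example with a throwaway compose file (contents are illustrative only).
compose_file=$(mktemp)
printf 'services:\n  web:\n    image: nginx:1.27-alpine\n' > "$compose_file"

# base64 output contains only A-Z, a-z, 0-9, +, / and =, so it can be
# embedded in a JSON string without further escaping.
printf '{"docker_compose_raw":"%s"}\n' "$(encode_compose "$compose_file")"
rm -f "$compose_file"
```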

Cloudflare DNS Operations

Use the Cloudflare tools for any DNS configuration: pointing domains at services, adding TXT verification records, managing MX records, etc.

Multi-instance support: Credentials support named instances (e.g. personal, work). A default key in credentials.json determines which instance is used when -a is omitted. Pass -a <instance> to target a specific account.

# List all zones (domains) in the account
~/.config/mosaic/tools/cloudflare/zone-list.sh [-a instance]

# List DNS records for a zone (accepts zone name or ID)
~/.config/mosaic/tools/cloudflare/record-list.sh -z <zone> [-t type] [-n name]

# Create a DNS record
~/.config/mosaic/tools/cloudflare/record-create.sh -z <zone> -t <type> -n <name> -c <content> [-p] [-l ttl] [-P priority]

# Update a DNS record (requires record ID from record-list)
~/.config/mosaic/tools/cloudflare/record-update.sh -z <zone> -r <record-id> -t <type> -n <name> -c <content> [-p]

# Delete a DNS record
~/.config/mosaic/tools/cloudflare/record-delete.sh -z <zone> -r <record-id>

Flag reference:

Flag  Purpose
-z    Zone name (e.g. mosaicstack.dev) or 32-char zone ID
-a    Named Cloudflare instance (omit for default)
-t    Record type: A, AAAA, CNAME, MX, TXT, SRV, etc.
-n    Record name: short (app) or FQDN (app.example.com)
-c    Record content/value (IP, hostname, TXT string, etc.)
-r    Record ID (from record-list.sh output)
-p    Enable Cloudflare proxy (orange cloud); omit for DNS-only (grey cloud)
-l    TTL in seconds (default: 1 = auto)
-P    Priority for MX/SRV records
-f    Output format: table (default) or json
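Because -z accepts either a zone name or a zone ID, a script has to tell the two apart before deciding whether a lookup is needed. The sketch below shows one way to do that; it is not taken from _lib.sh. `is_zone_id` is a hypothetical name, and the check assumes zone IDs are exactly 32 lowercase hexadecimal characters.

```shell
# Hypothetical sketch: decide whether a -z argument is already a zone ID.
# Assumption: zone IDs are exactly 32 lowercase hexadecimal characters;
# anything else is treated as a zone name that needs an API lookup.
is_zone_id() {
  [ "${#1}" -eq 32 ] || return 1
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}

for zone in mosaicstack.dev 023e105f4ecef8ad9ca31a8372d0c353; do
  if is_zone_id "$zone"; then
    echo "$zone: use as zone ID"
  else
    echo "$zone: resolve name to ID first"
  fi
done
```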

Common workflows:

# Point a new subdomain at a server (proxied through Cloudflare)
~/.config/mosaic/tools/cloudflare/record-create.sh \
  -z example.com -t A -n myapp -c 203.0.113.10 -p

# Add a TXT record for domain verification (never proxied)
~/.config/mosaic/tools/cloudflare/record-create.sh \
  -z example.com -t TXT -n _verify -c "verification=abc123"

# Check what records exist before making changes
~/.config/mosaic/tools/cloudflare/record-list.sh -z example.com -t CNAME

# Update an existing record (get record ID from record-list first)
~/.config/mosaic/tools/cloudflare/record-update.sh \
  -z example.com -r <record-id> -t A -n myapp -c 203.0.113.20 -p

DNS + Deployment integration: When deploying a new service via Coolify or Portainer that needs a public domain, the typical sequence is:

  1. Create the DNS record pointing at the host IP (with -p for Cloudflare proxy if desired)
  2. Deploy the service via Coolify/Portainer
  3. Verify the domain resolves and the service is reachable

Proxy (-p) guidance:

  • Use proxy (orange cloud) for web services: it provides CDN caching and DDoS protection, and hides the origin IP
  • Skip proxy (grey cloud) for non-HTTP services (mail, SSH), wildcard records, or when the service handles its own TLS termination and needs direct client IP visibility
  • Proxy only works on the ports Cloudflare supports; services listening on other ports must use DNS-only records

Stack Health Check

Verify all infrastructure services are reachable:

~/.config/mosaic/tools/health/stack-health.sh

Image Tagging and Promotion (Hard Rule)

For containerized deployments:

  1. Build immutable image tags: sha-<shortsha> and v{base-version}-rc.{build}.
  2. Use mutable environment tags only as pointers: testing, optional staging, and prod.
  3. Deploy by immutable digest, not by mutable tag alone.
  4. Promote the exact tested digest between environments (no rebuild between testing and prod).
  5. Do not use latest or dev as deployment references.
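The tagging rules above can be sketched as two small helpers. Both function names are hypothetical, the 7-character short SHA is an assumption, and the registry hostname is a placeholder.

```shell
# Hypothetical helper: derive the two immutable tags for one build.
# Arguments: commit SHA, base version (e.g. 1.4.0), CI build number.
# Assumption: the short SHA is the first 7 characters of the commit SHA.
immutable_tags() {
  short_sha=$(printf '%s' "$1" | cut -c1-7)
  printf 'sha-%s\n' "$short_sha"
  printf 'v%s-rc.%s\n' "$2" "$3"
}

# Accept only digest-pinned references (name@sha256:...); a bare tag such
# as :latest, :dev, or :prod is rejected as a deployment reference.
is_deployable_ref() {
  case "$1" in
    *@sha256:*) return 0 ;;
    *) return 1 ;;
  esac
}

immutable_tags e5c4bf25b3aa 1.4.0 12
is_deployable_ref "registry.example.com/app:latest" || echo "refusing mutable tag"
```

Checking the reference format at deploy time enforces rules 3 and 5 in one place instead of relying on each pipeline to remember them.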

Blue-green is the default strategy for production promotion. Canary is allowed only when automated SLO/error-rate gates and auto-rollback triggers are implemented.

Post-Deploy Validation (REQUIRED)

  1. Health endpoints return expected status.
  2. Critical smoke tests pass in target environment.
  3. Running version and digest match the promoted release candidate.
  4. Observability signals (errors/latency) are within expected thresholds.
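Checks 1 and 3 lend themselves to a simple scripted gate. The sketch below is an illustration only: `post_deploy_gate` is a hypothetical name, HTTP 200 as the expected health status is an assumption, and the caller is expected to gather the observed values with whatever tooling the target platform provides.

```shell
# Hypothetical gate covering checks 1 and 3. The caller passes in the
# observed values, which keeps the comparison logic easy to test.
# Arguments: observed HTTP status, running digest, promoted digest.
# Assumption: a healthy service answers its health endpoint with HTTP 200.
post_deploy_gate() {
  if [ "$1" != "200" ]; then
    echo "health check failed: HTTP $1" >&2
    return 1
  fi
  if [ "$2" != "$3" ]; then
    echo "digest mismatch: running $2, promoted $3" >&2
    return 1
  fi
  echo "validation passed"
}

post_deploy_gate 200 sha256:aaa sha256:aaa
```

The status code could be collected with, for example, `curl -s -o /dev/null -w '%{http_code}' https://app.example.com/health` (placeholder URL). A non-zero return from the gate is the trigger for the rollback rule.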

Rollback Rule

If post-deploy validation fails:

  1. Execute rollback/redeploy-safe path immediately.
  2. Mark deployment as blocked in docs/TASKS.md.
  3. Record failure evidence and next remediation step in scratchpad and release notes.

Registry Retention and Cleanup

Cleanup MUST be automated.

  • Keep all final release tags (vX.Y.Z) indefinitely.
  • Keep active environment digests (prod, testing, and active blue/green slots).
  • Keep recent RC tags (vX.Y.Z-rc.N) based on retention window.
  • Remove stale sha-* and RC tags outside retention window if they are not actively deployed.
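A minimal sketch of the per-tag retention decision follows. `retention_action` is a hypothetical name, and the inputs (tag age, retention window, deployment state) are assumed to come from the registry and deployment tooling in use.

```shell
# Hypothetical retention decision for one tag.
# Arguments: tag, age in days, retention window in days, and "yes" if the
# tag's digest is actively deployed (prod, testing, or a blue/green slot).
# Assumption: final release tags look like vX.Y.Z with numeric parts.
retention_action() {
  tag=$1 age=$2 window=$3 deployed=$4
  # Actively deployed digests are always kept.
  if [ "$deployed" = "yes" ]; then echo keep; return; fi
  # Final release tags are kept indefinitely.
  if printf '%s\n' "$tag" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo keep; return
  fi
  # Everything else (RC and sha-* tags) is kept only inside the window.
  if [ "$age" -le "$window" ]; then echo keep; else echo remove; fi
}

retention_action v1.2.3-rc.4 45 30 no
```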

Monitoring & Logging

Logging Standards

  • Use structured logging (JSON)
  • Include correlation IDs
  • Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
  • Never log sensitive data
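As an illustration of the first three points, the sketch below emits one JSON object per line with a level and a correlation ID. `log_json` and the `CORRELATION_ID` variable are hypothetical names, and the field names are assumptions rather than a prescribed schema.

```shell
# Hypothetical logger: one JSON object per line with a correlation ID.
# Assumption: CORRELATION_ID is set by the caller (per request or job).
# Only backslashes and double quotes are escaped here; a real logger should
# use a JSON library or jq so control characters are handled as well.
log_json() {
  level=$1
  message=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"ts":"%s","level":"%s","correlation_id":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "${CORRELATION_ID:-unknown}" "$message"
}

CORRELATION_ID=req-42
log_json INFO 'deploy started'
```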

Metrics to Collect

  • Request latency (p50, p95, p99)
  • Error rates
  • Resource utilization (CPU, memory)
  • Business metrics

Alerting

  • Define SLOs (Service Level Objectives)
  • Alert on symptoms, not causes
  • Include runbook links in alerts
  • Avoid alert fatigue

Testing Infrastructure

Test Categories

  1. Unit tests: Terraform/Ansible logic
  2. Integration tests: Deployed resources work together
  3. Smoke tests: Critical paths after deployment
  4. Chaos tests: Failure mode validation

Infrastructure Testing Tools

  • Terraform: terraform validate, terraform plan
  • Ansible: ansible-lint, molecule
  • Kubernetes: kubectl apply --dry-run=client (or --dry-run=server), kubeval
  • General: Terratest, ServerSpec

Commit Format

chore(#67): Configure Redis cluster

- Add Redis StatefulSet with 3 replicas
- Configure persistence with PVC
- Add Vault secret for auth password

Refs #67
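The subject line format above can be checked before committing. This is a minimal sketch: `valid_commit_subject` is a hypothetical name, and the pattern assumes a lowercase type followed by a scope of the form (#issue-number).

```shell
# Hypothetical check for the subject line format shown above.
# Assumption: the type is a lowercase word and the scope is "#<issue-number>".
valid_commit_subject() {
  printf '%s\n' "$1" | grep -Eq '^[a-z]+\(#[0-9]+\): .+'
}

if valid_commit_subject 'chore(#67): Configure Redis cluster'; then
  echo "subject ok"
fi
```

A check like this could run from a commit-msg hook against the first line of the message file.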

Before Completing

  1. Validate configuration syntax
  2. Run infrastructure tests
  3. Test in dev/staging first
  4. Document any manual steps required
  5. Update scratchpad and close issue