From 51490def5aeb59cb8a42da0ae73f84f92d7d93a2 Mon Sep 17 00:00:00 2001
From: Jason Woltje
Date: Sun, 22 Feb 2026 02:05:26 -0600
Subject: [PATCH] docs(coolify): update deployment docs with resolved issues and restart procedure

Documents the pre-pull requirement for Coolify restart safety, OTEL
configuration, and marks issues #441/#442/#443 as resolved with
verification evidence.

Co-Authored-By: Claude Opus 4.6
---
 docs/COOLIFY-DEPLOYMENT.md | 80 +++++++++++++++++++++++++++-----------
 1 file changed, 57 insertions(+), 23 deletions(-)

diff --git a/docs/COOLIFY-DEPLOYMENT.md b/docs/COOLIFY-DEPLOYMENT.md
index efa82f8..36c0956 100644
--- a/docs/COOLIFY-DEPLOYMENT.md
+++ b/docs/COOLIFY-DEPLOYMENT.md
@@ -93,37 +93,71 @@ Critical vars that were missing initially:
 
 - `BETTER_AUTH_URL` — **Required** in production. API won't start without it. Set to `https://api.mosaic.woltje.com`.
 
+## Operations
+
+### Restart Procedure (IMPORTANT)
+
+Coolify's `CleanupDocker` action periodically prunes unused images. During a restart (stop → start), images become "unused" when containers stop and may be pruned before the start phase runs. This causes "No such image" failures.
+
+**Always pre-pull images before any Coolify restart/start:**
+
+```bash
+ssh localadmin@10.1.1.44
+
+# 1. Pre-pull all images (run in parallel)
+docker pull git.mosaicstack.dev/mosaic/stack-postgres:latest &
+docker pull valkey/valkey:8-alpine &
+docker pull git.mosaicstack.dev/mosaic/stack-api:latest &
+docker pull git.mosaicstack.dev/mosaic/stack-web:latest &
+docker pull git.mosaicstack.dev/mosaic/stack-coordinator:latest &
+docker pull git.mosaicstack.dev/mosaic/stack-orchestrator:latest &
+wait
+
+# 2. Remove stale internal network (prevents "already exists" errors)
+docker network rm ug0ssok4g44wocok8kws8gg8_internal 2>/dev/null || true
+
+# 3. Start via Coolify API
+TOKEN=""
+curl -X POST "http://10.1.1.44:8000/api/v1/services/ug0ssok4g44wocok8kws8gg8/start" \
+  -H "Authorization: Bearer $TOKEN"
+
+# 4. Verify (wait ~30s for health checks)
+docker ps --filter 'name=ug0ssok4g44wocok8kws8gg8' --format 'table {{.Names}}\t{{.Status}}'
+```
+
+### OTEL Configuration
+
+The coordinator's Python OTLP exporter initializes at import time, before checking `MOSAIC_TELEMETRY_ENABLED`. To suppress OTLP connection noise, set the standard OpenTelemetry env var in the service `.env`:
+
+```
+OTEL_SDK_DISABLED=true
+```
+
 ## Current State (2026-02-22)
 
-### Working
+### Verified Working
 
 - All 6 containers running and healthy
-- API health endpoint responds at `https://api.mosaic.woltje.com/health`
-- Database migrations completed
-- Inter-service networking (api→postgres, api→valkey) confirmed via health checks
+- Web UI at `https://mosaic.woltje.com/login` — 200 OK
+- API health at `https://api.mosaic.woltje.com/health` — healthy, PostgreSQL connected
+- CORS: `access-control-allow-origin: https://mosaic.woltje.com`
+- Runtime env injection: `NEXT_PUBLIC_API_URL=https://api.mosaic.woltje.com`, `AUTH_MODE=real`
+- Valkey: PONG
+- Coordinator: healthy, no OTLP noise (`OTEL_SDK_DISABLED=true`)
+- Orchestrator: healthy
+- TLS: Let's Encrypt certs (web + api), valid until May 23 2026
+- Auth endpoint: `/auth/get-session` responds correctly
 
-### Issues
+### Resolved Issues
 
-1. **DNS: `mosaic.woltje.com` points to wrong server**
-   - Resolves to `10.1.1.45` (old Swarm node) instead of through Cloudflare (`174.137.97.162`)
-   - `api.mosaic.woltje.com` resolves correctly through Cloudflare
-   - Fix: Update Cloudflare DNS A record for `mosaic.woltje.com`
+- **#441**: Coordinator OTLP noise — fixed via `OTEL_SDK_DISABLED=true`
+- **#442**: Coolify managed lifecycle — root cause was image pruning during restart + CoolifyTask timeout on large pulls. Fix: pre-pull images before start.
+- **#443**: Full stack connectivity — all checks pass
 
-2. **Coordinator: OTLP exporter noise**
-   - Trying to export traces to `localhost:4318` which doesn't exist
-   - Container is healthy, errors are non-critical
-   - Fix: Set `MOSAIC_TELEMETRY_ENABLED=false` in Coolify env vars, or deploy an OTLP collector
+### Known Limitations
 
-3. **Coolify managed lifecycle**
-   - CoolifyTask was failing when starting the service via API/UI
-   - Containers were started manually via `docker compose up -d` from the service directory
-   - Coolify recognizes the containers (correct naming convention) but may not properly manage restarts/redeploys
-   - Needs investigation: check Coolify task logs, verify compose processing
-
-4. **Full connectivity verification needed**
-   - web→api communication untested (blocked by DNS issue)
-   - Orchestrator→valkey and orchestrator→api connectivity unverified
-   - Coordinator webhook endpoint untested
+- Coolify restart is NOT safe without pre-pulling images first (CleanupDocker prunes between stop/start)
+- CoolifyTask has ~40s timeout — large image pulls will fail if not cached
 
 ## SSH Access
 