docs(coolify): update deployment docs with operations guide #445

Merged
jason.woltje merged 1 commit from docs/coolify-operations into main 2026-02-22 08:05:47 +00:00


@@ -93,37 +93,71 @@ Critical vars that were missing initially:
- `BETTER_AUTH_URL`: **Required** in production. The API won't start without it. Set to `https://api.mosaic.woltje.com`.
## Operations
### Restart Procedure (IMPORTANT)
Coolify's `CleanupDocker` action periodically prunes unused images. During a restart (stop → start), images become "unused" when containers stop and may be pruned before the start phase runs. This causes "No such image" failures.
**Always pre-pull images before any Coolify restart/start:**
```bash
ssh localadmin@10.1.1.44
# 1. Pre-pull all images (run in parallel)
docker pull git.mosaicstack.dev/mosaic/stack-postgres:latest &
docker pull valkey/valkey:8-alpine &
docker pull git.mosaicstack.dev/mosaic/stack-api:latest &
docker pull git.mosaicstack.dev/mosaic/stack-web:latest &
docker pull git.mosaicstack.dev/mosaic/stack-coordinator:latest &
docker pull git.mosaicstack.dev/mosaic/stack-orchestrator:latest &
wait
# 2. Remove stale internal network (prevents "already exists" errors)
docker network rm ug0ssok4g44wocok8kws8gg8_internal 2>/dev/null || true
# 3. Start via Coolify API
TOKEN="<from credentials.json>"
curl -X POST "http://10.1.1.44:8000/api/v1/services/ug0ssok4g44wocok8kws8gg8/start" \
-H "Authorization: Bearer $TOKEN"
# 4. Verify (wait ~30s for health checks)
docker ps --filter 'name=ug0ssok4g44wocok8kws8gg8' --format 'table {{.Names}}\t{{.Status}}'
```
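One caveat with the parallel pre-pull: a bare `wait` returns 0 even when a background job failed, so a failed pull can go unnoticed before the start call. A minimal sketch of a stricter variant follows. It assumes a hypothetical helper named `pull_all` and a hypothetical `PULL_CMD` override for dry runs; neither is part of Coolify or Docker.

```shell
# pull_all is a hypothetical helper, not part of Coolify or Docker.
# PULL_CMD defaults to "docker pull"; override it (e.g. PULL_CMD=echo) for a dry run.
pull_all() {
  pull_cmd="${PULL_CMD:-docker pull}"
  pids=""
  failed=0

  # Start every pull in the background and remember its PID.
  for image in "$@"; do
    $pull_cmd "$image" &
    pids="$pids $!"
  done

  # Wait on each PID individually so a failed pull is reflected in the result.
  for pid in $pids; do
    wait "$pid" || failed=1
  done

  return "$failed"
}
```

It could be called with the same image list as above, for example `pull_all valkey/valkey:8-alpine git.mosaicstack.dev/mosaic/stack-api:latest || exit 1`, so the start request is only sent once every pull has succeeded.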
### OTEL Configuration
The coordinator's Python OTLP exporter initializes at import time, before checking `MOSAIC_TELEMETRY_ENABLED`. To suppress OTLP connection noise, set the standard OpenTelemetry env var in the service `.env`:
```
OTEL_SDK_DISABLED=true
```
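To confirm the setting is actually present before a restart, a quick check along these lines may help. It is a sketch only: `otel_disabled` is a hypothetical helper name, and the default `.env` path is an assumption about where the service env file lives.

```shell
# otel_disabled is a hypothetical helper; pass the path to the service .env file.
otel_disabled() {
  env_file="${1:-.env}"
  # -x matches the whole line, so commented-out or partial entries do not count.
  grep -qx 'OTEL_SDK_DISABLED=true' "$env_file"
}
```

For example, `otel_disabled ./.env || echo "OTEL_SDK_DISABLED is not set"` prints a warning when the line is missing.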
## Current State (2026-02-22)
### Working
### Verified Working
- All 6 containers running and healthy
- API health endpoint responds at `https://api.mosaic.woltje.com/health`
- Database migrations completed
- Inter-service networking (api→postgres, api→valkey) confirmed via health checks
- Web UI at `https://mosaic.woltje.com/login` — 200 OK
- API health at `https://api.mosaic.woltje.com/health` — healthy, PostgreSQL connected
- CORS: `access-control-allow-origin: https://mosaic.woltje.com`
- Runtime env injection: `NEXT_PUBLIC_API_URL=https://api.mosaic.woltje.com`, `AUTH_MODE=real`
- Valkey: PONG
- Coordinator: healthy, no OTLP noise (`OTEL_SDK_DISABLED=true`)
- Orchestrator: healthy
- TLS: Let's Encrypt certs (web + api), valid until May 23 2026
- Auth endpoint: `/auth/get-session` responds correctly
### Issues
### Resolved Issues
1. **DNS: `mosaic.woltje.com` points to wrong server**
- Resolves to `10.1.1.45` (old Swarm node) instead of through Cloudflare (`174.137.97.162`)
- `api.mosaic.woltje.com` resolves correctly through Cloudflare
- Fix: Update Cloudflare DNS A record for `mosaic.woltje.com`
- **#441**: Coordinator OTLP noise — fixed via `OTEL_SDK_DISABLED=true`
- **#442**: Coolify managed lifecycle — root cause was image pruning during restart + CoolifyTask timeout on large pulls. Fix: pre-pull images before start.
- **#443**: Full stack connectivity — all checks pass
2. **Coordinator: OTLP exporter noise**
- Trying to export traces to `localhost:4318` which doesn't exist
- Container is healthy, errors are non-critical
- Fix: Set `MOSAIC_TELEMETRY_ENABLED=false` in Coolify env vars, or deploy an OTLP collector
### Known Limitations
3. **Coolify managed lifecycle**
- CoolifyTask was failing when starting the service via API/UI
- Containers were started manually via `docker compose up -d` from the service directory
- Coolify recognizes the containers (correct naming convention) but may not properly manage restarts/redeploys
- Needs investigation: check Coolify task logs, verify compose processing
4. **Full connectivity verification needed**
- web→api communication untested (blocked by DNS issue)
- Orchestrator→valkey and orchestrator→api connectivity unverified
- Coordinator webhook endpoint untested
- Coolify restart is NOT safe without pre-pulling images first (CleanupDocker prunes between stop/start)
- CoolifyTask has a ~40s timeout, so large image pulls will fail if the images are not already cached
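Given both limitations, it can be useful to check that every image is already in the local cache before asking Coolify to start the service. The sketch below relies on `docker image inspect` exiting non-zero for an image that is not present locally. `images_cached` is a hypothetical helper name, and the `DOCKER` override exists only to make the function easy to dry-run.

```shell
# images_cached is a hypothetical helper, not part of Coolify or Docker.
# DOCKER defaults to "docker"; override it to test the logic without a daemon.
images_cached() {
  docker_bin="${DOCKER:-docker}"
  missing=0

  for image in "$@"; do
    # `docker image inspect` fails when the image is not in the local cache.
    if ! $docker_bin image inspect "$image" >/dev/null 2>&1; then
      echo "not cached: $image" >&2
      missing=1
    fi
  done

  return "$missing"
}
```

A non-zero return means at least one image would have to be pulled during the start task, which is the case the ~40s timeout makes unsafe.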
## SSH Access