docs(coolify): update deployment docs with operations guide (#445)
Co-authored-by: Jason Woltje <jason@diversecanvas.com>
Co-committed-by: Jason Woltje <jason@diversecanvas.com>
This commit was merged in pull request #445.
@@ -93,37 +93,71 @@ Critical vars that were missing initially:
- `BETTER_AUTH_URL`: **Required** in production. API won't start without it. Set to `https://api.mosaic.woltje.com`.

## Operations

### Restart Procedure (IMPORTANT)

Coolify's `CleanupDocker` action periodically prunes unused images. During a restart (stop → start), images become "unused" when containers stop and may be pruned before the start phase runs. This causes "No such image" failures.

**Always pre-pull images before any Coolify restart/start:**

```bash
ssh localadmin@10.1.1.44

# 1. Pre-pull all images (run in parallel)
docker pull git.mosaicstack.dev/mosaic/stack-postgres:latest &
docker pull valkey/valkey:8-alpine &
docker pull git.mosaicstack.dev/mosaic/stack-api:latest &
docker pull git.mosaicstack.dev/mosaic/stack-web:latest &
docker pull git.mosaicstack.dev/mosaic/stack-coordinator:latest &
docker pull git.mosaicstack.dev/mosaic/stack-orchestrator:latest &
wait

# 2. Remove stale internal network (prevents "already exists" errors)
docker network rm ug0ssok4g44wocok8kws8gg8_internal 2>/dev/null || true

# 3. Start via Coolify API
TOKEN="<from credentials.json>"
curl -X POST "http://10.1.1.44:8000/api/v1/services/ug0ssok4g44wocok8kws8gg8/start" \
  -H "Authorization: Bearer $TOKEN"

# 4. Verify (wait ~30s for health checks)
docker ps --filter 'name=ug0ssok4g44wocok8kws8gg8' --format 'table {{.Names}}\t{{.Status}}'
```
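
A bare `wait` returns zero even when one of the background pulls fails, so a failed pre-pull in step 1 can go unnoticed. The following is a minimal sketch of one way to surface that, assuming a hypothetical helper named `prepull_images` (it is not part of Docker or Coolify) that waits on each pull individually:

```bash
# Sketch only: prepull_images is a hypothetical helper, not a Docker or Coolify command.
# Pulls every image given as an argument in parallel and returns non-zero
# if at least one pull failed.
prepull_images() {
  local pids="" failed=0 image pid
  for image in "$@"; do
    docker pull "$image" &
    pids="$pids $!"          # remember the PID of each background pull
  done
  for pid in $pids; do
    # `wait <pid>` returns that job's exit status, unlike a bare `wait`.
    wait "$pid" || failed=1
  done
  return "$failed"
}
```

Calling it with the six image names from step 1 and chaining the remaining steps with `&&` would stop the restart early if any pull fails.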

### OTEL Configuration

The coordinator's Python OTLP exporter initializes at import time, before checking `MOSAIC_TELEMETRY_ENABLED`. To suppress OTLP connection noise, set the standard OpenTelemetry env var in the service `.env`:

```
OTEL_SDK_DISABLED=true
```
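
If the variable is set by script rather than by hand, it helps to make the change idempotent so repeated runs do not append duplicate lines. A minimal sketch, assuming a hypothetical helper named `ensure_env_var`; the path to the service `.env` is not given here, so it is passed in as an argument:

```bash
# Sketch only: ensure_env_var is a hypothetical helper; the .env path is an assumption
# supplied by the caller.
# Sets KEY=VALUE in the given file, replacing any existing assignment of KEY.
ensure_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp=$(mktemp) || return 1
  # Copy every line except an existing assignment of this key.
  if [ -f "$file" ]; then
    grep -v "^${key}=" "$file" > "$tmp"
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the file keeps its original permissions.
  cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example, `ensure_env_var ./.env OTEL_SDK_DISABLED true` (with `./.env` standing in for the real service file) leaves exactly one `OTEL_SDK_DISABLED` line regardless of how often it runs.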

## Current State (2026-02-22)

### Working
### Verified Working

- All 6 containers running and healthy
- API health endpoint responds at `https://api.mosaic.woltje.com/health`
- Database migrations completed
- Inter-service networking (api→postgres, api→valkey) confirmed via health checks
- Web UI at `https://mosaic.woltje.com/login`: 200 OK
- API health at `https://api.mosaic.woltje.com/health`: healthy, PostgreSQL connected
- CORS: `access-control-allow-origin: https://mosaic.woltje.com`
- Runtime env injection: `NEXT_PUBLIC_API_URL=https://api.mosaic.woltje.com`, `AUTH_MODE=real`
- Valkey: PONG
- Coordinator: healthy, no OTLP noise (`OTEL_SDK_DISABLED=true`)
- Orchestrator: healthy
- TLS: Let's Encrypt certs (web + api), valid until May 23 2026
- Auth endpoint: `/auth/get-session` responds correctly
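
The HTTP checks in the list above can be repeated after each restart. A minimal sketch, assuming a hypothetical helper named `check_http_status` and a 10-second timeout chosen only for illustration:

```bash
# Sketch only: check_http_status is a hypothetical helper; the timeout is an assumption.
# Prints OK/FAIL for a URL and returns non-zero when the status code differs
# from the expected one (default 200).
check_http_status() {
  local url="$1" expected="${2:-200}" actual
  # -w '%{http_code}' prints only the status code; the body is discarded.
  actual=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || actual="000"
  if [ "$actual" = "$expected" ]; then
    echo "OK   $url ($actual)"
  else
    echo "FAIL $url (got $actual, expected $expected)"
    return 1
  fi
}
```

Running it against `https://mosaic.woltje.com/login` and `https://api.mosaic.woltje.com/health` would cover the two URL checks listed; the Valkey, CORS, and TLS items need separate checks.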

### Issues
### Resolved Issues

1. **DNS: `mosaic.woltje.com` points to wrong server**
   - Resolves to `10.1.1.45` (old Swarm node) instead of through Cloudflare (`174.137.97.162`)
   - `api.mosaic.woltje.com` resolves correctly through Cloudflare
   - Fix: Update Cloudflare DNS A record for `mosaic.woltje.com`
- **#441**: Coordinator OTLP noise, fixed via `OTEL_SDK_DISABLED=true`
- **#442**: Coolify managed lifecycle: root cause was image pruning during restart + CoolifyTask timeout on large pulls. Fix: pre-pull images before start.
- **#443**: Full stack connectivity, all checks pass

2. **Coordinator: OTLP exporter noise**
   - Trying to export traces to `localhost:4318` which doesn't exist
   - Container is healthy, errors are non-critical
   - Fix: Set `MOSAIC_TELEMETRY_ENABLED=false` in Coolify env vars, or deploy an OTLP collector
### Known Limitations

3. **Coolify managed lifecycle**
   - CoolifyTask was failing when starting the service via API/UI
   - Containers were started manually via `docker compose up -d` from the service directory
   - Coolify recognizes the containers (correct naming convention) but may not properly manage restarts/redeploys
   - Needs investigation: check Coolify task logs, verify compose processing

4. **Full connectivity verification needed**
   - web→api communication untested (blocked by DNS issue)
   - Orchestrator→valkey and orchestrator→api connectivity unverified
   - Coordinator webhook endpoint untested
- Coolify restart is NOT safe without pre-pulling images first (CleanupDocker prunes between stop/start)
- CoolifyTask has ~40s timeout; large image pulls will fail if not cached
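
Because both limitations come down to whether the images are already in the local cache, a quick check before calling the Coolify start endpoint can catch the problem early. A minimal sketch, assuming a hypothetical helper named `missing_images` built on `docker image inspect`, which exits non-zero for an image that is not present locally:

```bash
# Sketch only: missing_images is a hypothetical helper, not a Docker or Coolify command.
# Prints each image that is not in the local cache and returns non-zero
# if any image is missing.
missing_images() {
  local image missing=0
  for image in "$@"; do
    if ! docker image inspect "$image" >/dev/null 2>&1; then
      echo "$image"
      missing=1
    fi
  done
  return "$missing"
}
```

If it prints anything for the stack's image list, the images were likely pruned and should be pulled again before starting the service.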

## SSH Access