stack/docs/COOLIFY-DEPLOYMENT.md
2026-02-22 08:05:47 +00:00


# Mosaic Stack — Coolify Deployment
## Overview
Coolify deployment on VM `10.1.1.44` (Proxmox). Replaces the Docker Swarm deployment on w-docker0 (`10.1.1.45`).
## Architecture
```
Internet → Cloudflare → Public IP (174.137.97.162)
→ Main Traefik (10.1.1.43) — TCP TLS passthrough for *.woltje.com
→ Coolify Traefik (10.1.1.44) — terminates TLS via Cloudflare DNS-01 wildcard certs
→ Service containers
```
## Services (Core Stack)
| Service | Image | Internal Port | External Domain |
| ------------ | ----------------------------------------------- | --------------- | ----------------------- |
| postgres | `git.mosaicstack.dev/mosaic/stack-postgres` | 5432 | — |
| valkey | `valkey/valkey:8-alpine` | 6379 | — |
| api | `git.mosaicstack.dev/mosaic/stack-api` | 3001 | `api.mosaic.woltje.com` |
| web | `git.mosaicstack.dev/mosaic/stack-web` | 3000 | `mosaic.woltje.com` |
| coordinator | `git.mosaicstack.dev/mosaic/stack-coordinator` | 8000 | — |
| orchestrator | `git.mosaicstack.dev/mosaic/stack-orchestrator` | 3001 (internal) | — |
Matrix (synapse, element-web) and speech services (speaches, kokoro-tts) are NOT included in the core stack. Deploy separately if needed.
## Compose File
`docker-compose.coolify.yml` in the repo root. This is the Coolify-compatible version of the deployment compose.
Key differences from the Swarm compose (`docker-compose.swarm.portainer.yml`):
- No `deploy:` blocks (Swarm-only)
- No Traefik labels (Coolify manages routing)
- Bridge network instead of overlay
- `restart: unless-stopped` instead of Swarm restart policies
- `SERVICE_FQDN_*` magic environment variables for Coolify domain assignment
- List-style environment syntax (required for Coolify magic vars)
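A minimal sketch of what the list-style syntax and magic vars look like in `docker-compose.coolify.yml` (the `api` service name, image, and port come from the services table above; the rest is illustrative):

```yaml
services:
  api:
    image: git.mosaicstack.dev/mosaic/stack-api
    restart: unless-stopped
    environment:
      # List-style syntax; dict-style (KEY: value) breaks magic-var detection
      - SERVICE_FQDN_API_3001
      - BETTER_AUTH_URL=https://api.mosaic.woltje.com
```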
## Coolify IDs
| Resource | UUID |
| ----------- | -------------------------- |
| Project | `rs04g008kgkkw4s0wgsk40w4` |
| Environment | `gko8csc804g8og0oosc8ccs8` |
| Service | `ug0ssok4g44wocok8kws8gg8` |
| Server | `as8kcogk08skskkcsok888g4` |
### Application UUIDs
| App | UUID |
| ------------ | --------------------------- |
| postgres | `jcw0ogskkw040os48ggkgkc8` |
| valkey | `skssgwcggc0c8owoogcso8og` |
| api | `mc40cgwwo8okwwoko84408k4k` |
| web | `c48gcwgc40ok44scscowc8cc` |
| coordinator | `s8gwog4c44w08c8sgkcg04k8` |
| orchestrator | `uo4wkg88co0ckc4c4k44sowc` |
## Coolify API
Base URL: `http://10.1.1.44:8000/api/v1`
Auth: Bearer token from the `coolify.app_token` key in `credentials.json`
### Patterns & Gotchas
- **Compose must be base64-encoded** when sending via `docker_compose_raw` field
- **`SERVICE_FQDN_*` magic vars**: Coolify reads these from the compose to auto-assign domains. Format: `SERVICE_FQDN_{NAME}_{PORT}` (e.g., `SERVICE_FQDN_API_3001`). Must use list-style env syntax (`- SERVICE_FQDN_API_3001`), NOT dict-style.
- **FQDN updates on sub-applications**: Coolify API doesn't support updating FQDNs on compose service sub-apps via REST. Workaround: update directly in Coolify's PostgreSQL DB (`coolify-db` container, `service_applications` table).
- **Environment variable management**: Use `PATCH /api/v1/services/{uuid}/envs` with `{ "key": "VAR_NAME", "value": "val", "is_preview": false }`
- **Service start**: `POST /api/v1/services/{uuid}/start`
- **Coolify uses PostgreSQL** (not SQLite) for its internal database — container `coolify-db`
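The base64 gotcha above can be sketched end-to-end. This is an assumption-laden sketch: only the `docker_compose_raw` field name and the service UUID come from this doc; the `PATCH /services/{uuid}` endpoint shape is assumed and should be checked against the Coolify API reference.

```bash
#!/bin/sh
# Sketch: push an updated compose to Coolify (endpoint path assumed).
TOKEN="<from credentials.json>"          # coolify.app_token
SERVICE_UUID="ug0ssok4g44wocok8kws8gg8"  # core stack service

# docker_compose_raw must be base64-encoded; -w0 disables line
# wrapping so the JSON payload stays on a single line
b64=$(base64 -w0 docker-compose.coolify.yml)

curl -X PATCH "http://10.1.1.44:8000/api/v1/services/$SERVICE_UUID" \
  --connect-timeout 5 \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"docker_compose_raw\": \"$b64\"}"
```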
### DB Access (for workarounds)
```bash
ssh localadmin@10.1.1.44
docker exec -it coolify-db psql -U coolify -d coolify
```
At the `psql` prompt:
```sql
-- Check service app FQDNs
SELECT name, fqdn FROM service_applications WHERE service_id = (
  SELECT id FROM services WHERE uuid = 'ug0ssok4g44wocok8kws8gg8'
);
```
## Environment Variables
All env vars are set via Coolify API and stored in `/data/coolify/services/{uuid}/.env` on the node.
Critical vars that were missing initially:
- `BETTER_AUTH_URL`: **Required** in production. The API won't start without it. Set to `https://api.mosaic.woltje.com`.
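As a sketch, the envs endpoint documented under Patterns & Gotchas can set this variable (token placeholder, UUID, and values are from this doc):

```bash
#!/bin/sh
# Set BETTER_AUTH_URL on the core stack via the documented envs endpoint.
TOKEN="<from credentials.json>"          # coolify.app_token
SERVICE_UUID="ug0ssok4g44wocok8kws8gg8"

# Payload shape from "Patterns & Gotchas" above
payload=$(printf '{"key":"%s","value":"%s","is_preview":false}' \
  "BETTER_AUTH_URL" "https://api.mosaic.woltje.com")

curl -X PATCH "http://10.1.1.44:8000/api/v1/services/$SERVICE_UUID/envs" \
  --connect-timeout 5 \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "$payload"
```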
## Operations
### Restart Procedure (IMPORTANT)
Coolify's `CleanupDocker` action periodically prunes unused images. During a restart (stop → start), images become "unused" when containers stop and may be pruned before the start phase runs. This causes "No such image" failures.
**Always pre-pull images before any Coolify restart/start:**
```bash
ssh localadmin@10.1.1.44
# 1. Pre-pull all images (run in parallel)
docker pull git.mosaicstack.dev/mosaic/stack-postgres:latest &
docker pull valkey/valkey:8-alpine &
docker pull git.mosaicstack.dev/mosaic/stack-api:latest &
docker pull git.mosaicstack.dev/mosaic/stack-web:latest &
docker pull git.mosaicstack.dev/mosaic/stack-coordinator:latest &
docker pull git.mosaicstack.dev/mosaic/stack-orchestrator:latest &
wait
# 2. Remove stale internal network (prevents "already exists" errors)
docker network rm ug0ssok4g44wocok8kws8gg8_internal 2>/dev/null || true
# 3. Start via Coolify API
TOKEN="<from credentials.json>"
curl -X POST "http://10.1.1.44:8000/api/v1/services/ug0ssok4g44wocok8kws8gg8/start" \
-H "Authorization: Bearer $TOKEN"
# 4. Verify (wait ~30s for health checks)
docker ps --filter 'name=ug0ssok4g44wocok8kws8gg8' --format 'table {{.Names}}\t{{.Status}}'
```
### OTEL Configuration
The coordinator's Python OTLP exporter initializes at import time, before checking `MOSAIC_TELEMETRY_ENABLED`. To suppress OTLP connection noise, set the standard OpenTelemetry env var in the service `.env`:
```
OTEL_SDK_DISABLED=true
```
## Current State (2026-02-22)
### Verified Working
- All 6 containers running and healthy
- Web UI at `https://mosaic.woltje.com/login` — 200 OK
- API health at `https://api.mosaic.woltje.com/health` — healthy, PostgreSQL connected
- CORS: `access-control-allow-origin: https://mosaic.woltje.com`
- Runtime env injection: `NEXT_PUBLIC_API_URL=https://api.mosaic.woltje.com`, `AUTH_MODE=real`
- Valkey: PONG
- Coordinator: healthy, no OTLP noise (`OTEL_SDK_DISABLED=true`)
- Orchestrator: healthy
- TLS: Let's Encrypt certs (web + api), valid until May 23 2026
- Auth endpoint: `/auth/get-session` responds correctly
### Resolved Issues
- **#441**: Coordinator OTLP noise — fixed via `OTEL_SDK_DISABLED=true`
- **#442**: Coolify managed lifecycle — root cause was image pruning during restart + CoolifyTask timeout on large pulls. Fix: pre-pull images before start.
- **#443**: Full stack connectivity — all checks pass
### Known Limitations
- Coolify restart is NOT safe without pre-pulling images first (CleanupDocker prunes between stop/start)
- CoolifyTask has ~40s timeout — large image pulls will fail if not cached
## SSH Access
```bash
ssh localadmin@10.1.1.44
# Note: localadmin cannot sudo without TTY/password
# Use docker to access files:
docker run --rm -v /data/coolify/services:/srv alpine cat /srv/{uuid}/docker-compose.yml
# Use docker exec for Coolify DB:
docker exec -it coolify-db psql -U coolify -d coolify
```