fix(federation): healthcheck + restart policy for federated-test stacks #492
Problem
Both `mos-test-1` and `mos-test-2` stacks (Portainer IDs 146/147) were stuck at `replicas=0/1`, with the gateway exiting cleanly (exit code 0) after exactly 84 seconds.

The 84s timing is not accidental. It lines up with the old healthcheck configuration:
`start_period` (20s), then 3 × `interval` (30s) of failing checks → first SIGTERM at ~84s

After 3 failed `wget` healthchecks, Docker Swarm sends SIGTERM. NestJS/Fastify shuts down cleanly → exit 0. Since `restart_policy: on-failure` only restarts on non-zero exits, the container stayed dead.

Root cause candidates for wget failures
`localhost` may resolve to `::1` on Alpine, but the gateway binds to `0.0.0.0`, which is the IPv4 wildcard only. Using `127.0.0.1` explicitly eliminates this mismatch.

Changes
1. Switch healthcheck from wget to `node -e http.get`

- Uses `127.0.0.1` (not `localhost`) to force IPv4 and avoid any IPv6 resolution race.
- `node` is guaranteed available in the runtime image (it IS the runtime).
- `retries` increased from 3 to 5: gives 150s of retry window after `start_period`.
- `start_period` increased from 20s to 60s: covers the NestJS GC cold-start window (~40-50s).

2. Change restart policy from `on-failure` to `any`

Prevents the SIGTERM/clean-exit edge case from permanently killing the container. `any` restarts on exit 0, SIGTERM, and non-zero exits alike.

Test plan
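For checking the probe from change 1 by hand, here is a minimal sketch of what a `node`-based check along those lines could look like. The port (3000) and path (`/health`) are assumptions and are not taken from this PR. `probe` and `isHealthyStatus` are hypothetical names, and the actual one-liner in the stack file may differ.

```javascript
// Hypothetical stand-in for the `node -e http.get` healthcheck in change 1.
const http = require('node:http');

// Assumptions (not stated in the PR): gateway port and health path.
const ASSUMED_PORT = 3000;
const ASSUMED_PATH = '/health';

// Treat any 2xx response as healthy; everything else counts as a failure.
function isHealthyStatus(statusCode) {
  return statusCode >= 200 && statusCode < 300;
}

// Resolves to true on a 2xx response, false on any other status,
// connection error, or timeout. The default host is the IPv4 literal
// 127.0.0.1, so no name resolution (and no ::1) is involved.
function probe({
  host = '127.0.0.1',
  port = ASSUMED_PORT,
  path = ASSUMED_PATH,
  timeoutMs = 5000,
} = {}) {
  return new Promise((resolve) => {
    const req = http.get(
      { host, port, path, timeout: timeoutMs, agent: false },
      (res) => {
        res.resume(); // drain the body so the socket can close
        resolve(isHealthyStatus(res.statusCode));
      }
    );
    // The 'timeout' event does not abort the request by itself.
    req.on('timeout', () => req.destroy(new Error('probe timed out')));
    req.on('error', () => resolve(false));
  });
}
```

To use it as a container healthcheck, the result would be mapped onto the exit code Docker expects, for example `probe().then((ok) => process.exit(ok ? 0 : 1))`. Running the same probe with `host: 'localhost'` inside the container is one way to see whether the `::1` resolution theory holds.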