fix(federation): healthcheck + restart policy for federated-test stacks #492

jason.woltje · 2026-04-22T02:52:10Z

Problem

Both mos-test-1 and mos-test-2 stacks (Portainer IDs 146/147) were stuck at replicas=0/1 with the gateway exiting cleanly (exit code 0) after exactly 84 seconds.

The 84s timing is not accidental: it falls inside the failure window of the old healthcheck configuration (start_period 20s, interval 30s, 3 retries):
start_period(20s) + 3 x interval(30s) = 110s; the first SIGTERM arrived at ~84s, inside that window

After 3 failed `wget` healthchecks, Docker Swarm marks the container unhealthy and sends SIGTERM. NestJS/Fastify shuts down cleanly → exit 0. Since `restart_policy: on-failure` only restarts on non-zero exits, the container stayed dead.

Root cause candidates for wget failures

IPv6 resolution: `localhost` may resolve to `::1` on Alpine, but the gateway binds to `0.0.0.0`, which accepts IPv4 connections only, so a probe sent to `::1` is refused. Using `127.0.0.1` explicitly eliminates this.
NestJS cold start: The GC service logs a non-fatal error during startup (~30-40s into the boot). `wget` may be running the check before the HTTP server is fully accepting connections.
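
The IPv6 candidate can be reproduced outside the stack. The sketch below is illustrative only and is not taken from the gateway: it binds a throwaway TCP listener to `0.0.0.0` on an ephemeral port and then dials it over both loopback addresses. The helper names `canConnect` and `compareLoopbacks` are hypothetical.

```javascript
const net = require('node:net');

// Try one TCP connection and report whether it succeeded.
function canConnect(host, port) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    socket.setTimeout(2000, () => {
      socket.destroy();
      resolve(false);
    });
    socket.once('connect', () => {
      socket.destroy();
      resolve(true);
    });
    socket.once('error', () => resolve(false));
  });
}

// Bind an IPv4-only listener (0.0.0.0), then dial it via 127.0.0.1 and ::1.
function compareLoopbacks() {
  return new Promise((resolve, reject) => {
    const server = net.createServer((socket) => socket.end());
    server.once('error', reject);
    server.listen(0, '0.0.0.0', async () => {
      const { port } = server.address();
      const ipv4 = await canConnect('127.0.0.1', port);
      const ipv6 = await canConnect('::1', port);
      server.close(() => resolve({ ipv4, ipv6 }));
    });
  });
}
```

Calling `compareLoopbacks().then(console.log)` on a typical Linux host is expected to report that only the IPv4 dial succeeds, which is the failure mode a `localhost` probe would hit if the name resolved to `::1` first.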

Changes

1. Switch healthcheck from wget to `node -e http.get`

Uses `127.0.0.1` (not `localhost`) to force IPv4 and avoid any IPv6 resolution race.
`node` is guaranteed to be available in the runtime image (it is the runtime).
`retries` increased from 3 to 5: at the 30s interval this gives a 150s retry window after `start_period`.
`start_period` increased from 20s to 60s: covers the NestJS GC cold-start window (~40-50s).
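
For readers who want to see what a `node`-based probe of this kind looks like, here is a minimal sketch. It is not the exact one-liner from the stack file, which this page does not show. The port `3000` and the `--probe` flag are assumptions made for illustration; the `/health` path and the 200 status come from the test plan.

```javascript
const http = require('node:http');

// Hypothetical endpoint: the port is an assumption, /health matches the test plan.
const HEALTH_URL = 'http://127.0.0.1:3000/health';

// Map an HTTP status to a healthcheck exit code: 0 = healthy, 1 = unhealthy.
function exitCodeFor(statusCode) {
  return statusCode === 200 ? 0 : 1;
}

// Issue one GET and resolve with the exit code. Never rejects:
// connection errors and timeouts both count as unhealthy.
function probe(url, timeoutMs = 5000) {
  return new Promise((resolve) => {
    const req = http.get(url, (res) => {
      res.resume(); // drain the body so the socket is released
      resolve(exitCodeFor(res.statusCode));
    });
    req.setTimeout(timeoutMs, () => req.destroy(new Error('timeout')));
    req.on('error', () => resolve(1));
  });
}

// Only probe when asked to (hypothetical flag), so the file can also be loaded in tests.
if (process.argv.includes('--probe')) {
  probe(HEALTH_URL).then((code) => process.exit(code));
}
```

Because the URL uses a literal IPv4 address, no name resolution happens at all, which is what removes the `localhost` ambiguity.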

2. Change restart policy from `on-failure` to `any`

Prevents the SIGTERM/clean-exit edge case from permanently killing the container. `any` restarts the task whether it exits 0 (including a clean shutdown after SIGTERM) or non-zero.
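
The difference between the two policies can be stated as a small decision function. This is a simplified model of the restart conditions as described in this PR, not Swarm's actual scheduler code, and `shouldRestart` is a hypothetical name.

```javascript
// Simplified model: would a task be restarted under the given
// restart_policy condition after exiting with exitCode?
function shouldRestart(condition, exitCode) {
  switch (condition) {
    case 'none':
      return false;
    case 'on-failure':
      return exitCode !== 0; // a clean exit is not treated as a failure
    case 'any':
      return true;
    default:
      throw new Error(`unknown restart condition: ${condition}`);
  }
}

// The gateway exits 0 after handling SIGTERM gracefully.
console.log(shouldRestart('on-failure', 0)); // false: the container stayed dead
console.log(shouldRestart('any', 0)); // true: the container comes back
```

Under this model, a graceful shutdown is exactly the case where `on-failure` and `any` disagree, which is why the clean exit was the one that left the stacks at 0/1.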

Test plan

CI passes on this branch
PR merged to main
Stacks 146 + 147 redeployed via Portainer API with updated stack file
https://mos-test-1.woltje.com/health returns 200 {"status":"ok"}
https://mos-test-2.woltje.com/health returns 200 {"status":"ok"}
Both gateway services show replicas=1/1 in Portainer

jason.woltje added 1 commit 2026-04-22 02:52:10 +00:00

fix(federation): use node http healthcheck and any-restart for gateway test stack

ci/woodpecker/push/ci Pipeline was successful

ci/woodpecker/pr/ci Pipeline was successful

acb0b50b0d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jason.woltje merged commit bb24292cf7 into main

2026-04-22 02:56:41 +00:00

jason.woltje referenced this issue from a commit

2026-04-22 02:56:42 +00:00

fix(federation): healthcheck + restart policy for federated-test stacks (#492)

Reference: mosaicstack/stack#492