stack/tools/federation-harness/README.md
Jarvis cb118a53d9 fix(federation): harness CRIT bugs — admin bootstrap auth + peer FK + boot deadline (review remediation)
CRIT-1: Replace nonexistent x-admin-key header with Authorization: Bearer <token>;
add bootstrapAdmin() to call POST /api/bootstrap/setup on each pristine gateway
before any admin-guarded endpoint is used.

CRIT-2: Fix cross-gateway peer FK violation — peer keypair is now created on
Server B first (so the grant FK resolves against B's own federation_peers table),
then Server A creates its own keypair and redeems the enrollment token at B.

HIGH-3: waitForStack() now polls both gateways in parallel via Promise.all, each
with an independent deadline, so a slow gateway-a cannot starve gateway-b's budget.

MED-4: seed() throws immediately with a clear error if scenario !== 'all';
per-variant narrowing deferred to M3-11 with explicit JSDoc note.

Also: remove ADMIN_API_KEY (no such path in AdminGuard) from compose, replace
with ADMIN_BOOTSTRAP_PASSWORD; add BETTER_AUTH_URL production-code limitation
as a TODO in the README.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 21:54:46 -05:00

# Federation Test Harness
Local two-gateway federation test infrastructure for Mosaic Stack M3+.
This harness boots two real gateway instances (`gateway-a`, `gateway-b`) on a
shared Docker bridge network, each backed by its own Postgres (pgvector) +
Valkey, sharing a single Step-CA. It is the test bed for all M3+ federation
E2E tests.
## Prerequisites
- Docker with Compose v2 (`docker compose version` ≥ 2.20)
- pnpm (for running via repo scripts)
- `infra/step-ca/dev-password` must exist (copy from `infra/step-ca/dev-password.example`)
## Network Topology
```
Host machine
├── localhost:14001 → gateway-a (Server A — home / requesting)
├── localhost:14002 → gateway-b (Server B — work / serving)
├── localhost:15432 → postgres-a
├── localhost:15433 → postgres-b
├── localhost:16379 → valkey-a
├── localhost:16380 → valkey-b
└── localhost:19000 → step-ca (shared CA)
Docker network: fed-test-net (bridge)
  gateway-a ←──── mTLS ────→ gateway-b
          ↘                 ↗
               step-ca
```
Ports are chosen to avoid collision with the base dev stack (5433, 6380, 14242, 9000).
## Starting the Harness
```bash
# From repo root
docker compose -f tools/federation-harness/docker-compose.two-gateways.yml up -d
# Wait for all services to be healthy (~60-90s on first boot due to NestJS cold start)
docker compose -f tools/federation-harness/docker-compose.two-gateways.yml ps
```
## Seeding Test Data
The seed script provisions three grant scope variants (A, B, C) and walks the
full enrollment flow so Server A ends up with active peers pointing at Server B.
```bash
# Assumes stack is already running
pnpm tsx tools/federation-harness/seed.ts
# Or boot + seed in one step
pnpm tsx tools/federation-harness/seed.ts --boot
```
### Scope Variants
| Variant | Resources | Filters | Excluded | Purpose |
| ------- | ------------------ | ---------------------------------- | ----------- | ------------------------------- |
| A | tasks, notes | include_personal: true | (none) | Personal data federation |
| B | tasks | include_teams: ['T1'], no personal | (none) | Team-scoped, no personal |
| C | tasks, credentials | include_personal: true | credentials | Sanity: excluded wins over list |
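
Variant C's rule ("excluded wins over list") can be read as a set difference. The sketch below is illustrative only: the `GrantScope` shape and the `effectiveResources` helper are hypothetical names that mirror the table columns, not the gateway's actual schema or code.

```ts
// Hypothetical scope shape; field names mirror the table columns above,
// not the real grant schema.
interface GrantScope {
  resources: string[];
  excluded?: string[];
}

// Resources a grant actually exposes: everything listed, minus anything excluded.
function effectiveResources(scope: GrantScope): string[] {
  const excluded = new Set(scope.excluded ?? []);
  return scope.resources.filter((r) => !excluded.has(r));
}

const variantC: GrantScope = {
  resources: ['tasks', 'credentials'],
  excluded: ['credentials'],
};

console.log(effectiveResources(variantC)); // [ 'tasks' ]
```

Under this reading, a federation test against Variant C would expect `tasks` to be reachable and `credentials` to be refused even though it appears in the resource list.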
## Using from Vitest
```ts
import { afterAll, beforeAll, expect, test } from 'vitest';
import {
  bootHarness,
  tearDownHarness,
  serverA,
  serverB,
  seed,
} from '../../tools/federation-harness/harness.js';
import type { HarnessHandle } from '../../tools/federation-harness/harness.js';

let handle: HarnessHandle;

beforeAll(async () => {
  handle = await bootHarness();
}, 180_000); // allow 3 min for Docker pull + NestJS cold start

afterAll(async () => {
  await tearDownHarness(handle);
});

test('variant A: list tasks returns personal tasks', async () => {
  // NOTE: Only 'all' is supported for now; per-variant narrowing is deferred to M3-11.
  const seedResult = await seed(handle, 'all');
  const a = serverA(handle);

  // Present the Variant A grant to Server A's federation endpoint.
  const res = await fetch(`${a.baseUrl}/api/federation/tasks`, {
    headers: { 'x-federation-grant': seedResult.grants.variantA.id },
  });
  expect(res.status).toBe(200);
});
```
> **Note:** `seed()` bootstraps a fresh admin user on each gateway via
> `POST /api/bootstrap/setup`. Both gateways must have zero users (pristine DB).
> If either gateway already has users, `seed()` throws with a clear error.
> Reset state with `docker compose down -v`.
The `bootHarness()` function is **idempotent**: if both gateways are already
healthy, it reuses the running stack and returns `ownedStack: false`. Tests
should not call `tearDownHarness` when `ownedStack` is false unless they
explicitly want to shut down a shared stack.
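
One way to honour the `ownedStack` rule is to guard the teardown call. This is a minimal sketch, not part of the harness: `OwnedHandle` and `tearDownIfOwned` are hypothetical names, and the only field assumed on the handle is the `ownedStack` flag described above.

```ts
// Minimal stand-in for the real HarnessHandle; only `ownedStack` is taken
// from the README, everything else about the handle is left open.
interface OwnedHandle {
  ownedStack: boolean;
}

// Runs `tearDown` only when this process booted the stack.
// Resolves to true if teardown ran, false if the shared stack was left alone.
async function tearDownIfOwned<H extends OwnedHandle>(
  handle: H,
  tearDown: (h: H) => Promise<void>,
): Promise<boolean> {
  if (!handle.ownedStack) return false;
  await tearDown(handle);
  return true;
}
```

In a test file this would replace the unconditional call, for example `afterAll(() => tearDownIfOwned(handle, tearDownHarness))`, so a stack started by another process or by hand survives the run.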
## Vitest Config (pnpm test:federation)
Create a dedicated config at repo root (or merge these settings into the existing `vitest.config.ts`):
```ts
// vitest.federation.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['**/*.federation.test.ts'],
    testTimeout: 60_000,
    hookTimeout: 180_000,
    reporters: ['verbose'],
  },
});
```
Then add to the `scripts` section of the root `package.json`:
```json
"test:federation": "vitest run --config vitest.federation.config.ts"
```
## Nuking State
```bash
# Remove containers AND volumes (ephemeral state — CA keys, DBs, everything)
docker compose -f tools/federation-harness/docker-compose.two-gateways.yml down -v
```
On next `up`, Step-CA re-initialises from scratch and generates new CA keys.
## Step-CA Root Certificate
The CA root lives in the `fed-harness-step-ca` Docker volume at
`/home/step/certs/root_ca.crt`. To extract it to the host:
```bash
docker run --rm \
  -v fed-harness-step-ca:/home/step \
  alpine cat /home/step/certs/root_ca.crt > /tmp/fed-harness-root-ca.crt
```
## Troubleshooting
### Port conflicts
Default host ports: 14001, 14002, 15432, 15433, 16379, 16380, 19000.
Override via environment variables before `docker compose up`:
```bash
GATEWAY_A_HOST_PORT=14101 GATEWAY_B_HOST_PORT=14102 \
  docker compose -f tools/federation-harness/docker-compose.two-gateways.yml up -d
```
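
If the host ports are overridden, test code that talks to the gateways from the host needs the same values. The sketch below shows one way to resolve them. It assumes the gateways are reached over plain `http://localhost` from the host, and the `hostPort` and `gatewayUrls` helpers are hypothetical; whether the harness itself reads these variables is not stated here.

```ts
type Env = Record<string, string | undefined>;

// Reads a port override from the environment, falling back to the documented default.
function hostPort(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`${name} must be a TCP port, got "${raw}"`);
  }
  return port;
}

// Host-side base URLs for both gateways (scheme and host are assumptions).
function gatewayUrls(env: Env): { a: string; b: string } {
  return {
    a: `http://localhost:${hostPort(env, 'GATEWAY_A_HOST_PORT', 14001)}`,
    b: `http://localhost:${hostPort(env, 'GATEWAY_B_HOST_PORT', 14002)}`,
  };
}
```

Passing `process.env` as `env` keeps the helper easy to exercise in unit tests with a plain object.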
### Image pull failures
The gateway image is digest-pinned to:
```
git.mosaicstack.dev/mosaicstack/stack/gateway@sha256:1069117740e00ccfeba357cae38c43f3729fe5ae702740ce474f6512414d7c02
```
(sha-9f1a081, post-#491 IMG-FIX)
If the registry is unreachable, Docker will use the locally cached image if
present. If no local image exists, `docker compose up` fails with a pull error.
In that case:
1. Ensure you can reach `git.mosaicstack.dev` (VPN, DNS, etc.).
2. Log in: `docker login git.mosaicstack.dev`
3. Pull manually: `docker pull git.mosaicstack.dev/mosaicstack/stack/gateway@sha256:1069117740e00ccfeba357cae38c43f3729fe5ae702740ce474f6512414d7c02`
### NestJS cold start
Gateway containers take 40-60 seconds to become healthy on first boot (Node.js
module resolution + NestJS DI bootstrap). The `start_period: 60s` in the
compose healthcheck covers this. `bootHarness()` polls for up to 3 minutes.
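
The polling behaviour can be sketched as follows. This is an illustration of deadline-bounded polling with one independent deadline per gateway, not the harness's actual `waitForStack()` code: the `Probe` type and both function names are hypothetical, and the health check itself is injected because the gateway's health endpoint is not documented here.

```ts
// A probe resolves to true once its target is healthy.
type Probe = () => Promise<boolean>;

// Polls one probe until it succeeds or its own deadline passes.
async function waitUntilHealthy(
  probe: Probe,
  timeoutMs: number,
  intervalMs = 1_000,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    // A rejected probe (e.g. connection refused) counts as "not healthy yet".
    if (await probe().catch(() => false)) return;
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`not healthy after ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Polls all probes in parallel; each gets its own deadline, so a slow
// target cannot consume another target's time budget.
async function waitForAll(
  probes: Probe[],
  timeoutMs: number,
  intervalMs = 1_000,
): Promise<void> {
  await Promise.all(probes.map((p) => waitUntilHealthy(p, timeoutMs, intervalMs)));
}
```

With a 3-minute budget, a call might look like `waitForAll([probeA, probeB], 180_000)`, where each probe fetches its gateway's health route and returns `res.ok`.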
### Step-CA startup
Step-CA initialises on first boot (generates CA keys). This takes ~5-10s.
The `start_period: 30s` in the healthcheck covers it. Both gateways wait for
Step-CA to be healthy before starting (`depends_on: step-ca: condition: service_healthy`).
### dev-password missing
The Step-CA container requires `infra/step-ca/dev-password` to be mounted.
Copy the example and set a local password:
```bash
cp infra/step-ca/dev-password.example infra/step-ca/dev-password
# Edit the file to set your preferred dev CA password
```
The file is `.gitignore`d — do not commit it.
## Image Digest Note
The gateway image is pinned to `sha256:1069117740e00ccfeba357cae38c43f3729fe5ae702740ce474f6512414d7c02`
(sha-9f1a081). This is the digest promoted by PR #491 (IMG-FIX). The `latest`
tag is forbidden per Mosaic image policy. When a new gateway build is promoted,
update the digest in `docker-compose.two-gateways.yml` and in this file.
## Known Limitations
### BETTER_AUTH_URL enrollment URL bug (production code — not fixed here)
`apps/gateway/src/federation/federation.controller.ts:145` constructs the
enrollment URL using `process.env['BETTER_AUTH_URL'] ?? 'http://localhost:14242'`.
In non-harness deployments (where `BETTER_AUTH_URL` is not set or points to the
web origin rather than the gateway's own base URL) this produces an incorrect
enrollment URL that points to the wrong host or port.
The harness works around this by explicitly setting
`BETTER_AUTH_URL: 'http://gateway-b:3000'` in the compose file so the enrollment
URL correctly references gateway-b's internal Docker hostname.
**TODO:** Fix `federation.controller.ts` to derive the enrollment URL from its own
listening address (e.g. `GATEWAY_BASE_URL` env var or a dedicated
`FEDERATION_ENROLLMENT_BASE_URL` env var) rather than reusing `BETTER_AUTH_URL`.
Tracked as a follow-up to PR #505 — do not bundle with harness changes.
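
A possible shape for that follow-up is sketched below. It is not the production fix: `resolveEnrollmentBaseUrl` is a hypothetical helper, and the precedence (dedicated variable first, then `GATEWAY_BASE_URL`, with no fallback to `BETTER_AUTH_URL`) is an assumption rather than a decided design.

```ts
type Env = Record<string, string | undefined>;

// Derives the enrollment base URL from gateway-specific variables only.
// Fails loudly instead of silently falling back to an unrelated origin.
function resolveEnrollmentBaseUrl(env: Env): string {
  const raw = env['FEDERATION_ENROLLMENT_BASE_URL'] ?? env['GATEWAY_BASE_URL'];
  if (!raw) {
    throw new Error('Set FEDERATION_ENROLLMENT_BASE_URL or GATEWAY_BASE_URL');
  }
  // `origin` drops any path or trailing slash; `new URL` throws on malformed input.
  return new URL(raw).origin;
}
```

Throwing on a missing value trades convenience for safety: a misconfigured gateway fails at enrollment time rather than handing out a URL that points at the wrong host or port.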
## Permanent Infrastructure
This harness is designed to outlive M3 and be reused by M4+ milestone tests.
It is not a throwaway scaffold — treat it as production test infrastructure:
- Keep it idempotent.
- Do not hardcode test assumptions in the harness layer (put them in tests).
- Update the seed script when new scope variants are needed.
- The README and harness should be kept in sync as the federation API evolves.