Replace the single fixed 300ms capture-pane delay in `agent send --verify` with a
bounded polling loop. After sending, the loop polls `capture-pane` every 400ms
(VERIFY_POLL_INTERVAL_MS) up to a configurable total timeout (default 6000ms,
VERIFY_DEFAULT_TIMEOUT_MS). classifySendResult is called on each poll: accepted/draft
return immediately; unverifiable keeps polling until timeout, then fails closed with
the existing "no pane change after send" message.
New `--verify-timeout <ms>` option on `agent send` (default 6000ms documented).
Injectable SleepFn added to FleetCommandDeps for test isolation — no real sleeps in
tests. Exports VERIFY_POLL_INTERVAL_MS and VERIFY_DEFAULT_TIMEOUT_MS as constants.
classifySendResult and all other pure functions remain unchanged.
Tests: multi-poll acceptance on 2nd/3rd poll => exit 0; pane unchanged until timeout
=> exit 1; draft detected on first poll => exit 1. All 386 tests pass.
docs/fleet/PRD.md Known-limitations updated: verify now polls up to bounded timeout
(default ~6s, --verify-timeout); definitive acceptance still deferred to Phase-3
heartbeat-ack.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
Blocker fix: send --verify now captures a BEFORE snapshot immediately
before the send and an AFTER snapshot after the delay, then uses
classifySendResult(before, after) to classify. A wedged pane showing
stale non-empty content is no longer falsely reported as 'accepted' —
BEFORE==AFTER maps to 'unverifiable' (exit 1, "no pane change after
send"). Blank AFTER still fails closed as 'unverifiable'. Only
AFTER != BEFORE without a draft suffix counts as 'accepted' (exit 0).
Should-fix: agent watch now uses a GROUPED VIEWER SESSION instead of a
bare 'tmux attach -r' against the agent session. A bare attach lets the
viewer terminal shrink the agent's window; a grouped session has
independent sizing so the agent's window is never affected.
Sequence: new-session -d -t '=<agent>' -s '<agent>-watch-<pid>' (runner),
attach -r to viewer session (interactiveRunner), kill-session on detach
(runner). New builder functions exported: buildAgentWatchCreateViewerCommand,
buildAgentWatchAttachCommand, buildAgentWatchKillViewerCommand,
buildViewerSessionName. buildAgentWatchCommand kept but deprecated.
New exports: classifySendResult(before, after) — the testable classifier.
Tests added:
- classifySendResult unit suite (6 cases): accepted/draft/unverifiable/
stale-pane/both-blank/before-blank-after-response
- send --verify regression: stale (before==after non-empty) => exit 1
- send --verify regression: blank AFTER => exit 1
- send --verify regression: draft after pane change => exit 1
- send --verify regression: changed non-draft => exit 0
- send --verify: 3-call sequence assertion (before-capture, send, after-capture)
- watch dispatch: grouped viewer session created/attached/killed; no bare
attach against agent session; viewer name matches <agent>-watch-<pid>
PRD Known-limitations updated: pane-change check rationale, Phase-3
heartbeat-ack requirement, grouped-session watch design.
All gates pass: pnpm typecheck, pnpm lint, pnpm --filter @mosaicstack/mosaic test
(382 tests, 74 fleet), prettier --check.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
FR-1 fleet ps: joins systemd show (ActiveState/SubState/UnitFileState),
tmux list-panes (pid/command/dead/activity), and file-based heartbeat
(~/.config/mosaic/fleet/run/<name>.hb) into one table per roster agent.
Flags DRIFT (roster runtime ≠ actual pane command) and BOOT-ENABLE
(active but UnitFileState=disabled). --json output includes tenant_id
and host on every record (FR-6 zero-foreclosure for multi-tenant/host).
FR-3 agent watch: read-only tmux attach (-r flag) so the operator can
observe any session without injecting keystrokes or resizing the window.
Registered as a new verb alongside tail/send/reset in registerFleetAgentCommands.
FR-5 agent send --verify: after keystroke injection, captures the last 5
pane lines and checks for draft heuristic (last non-empty line starts
with '> '). Exits non-zero and writes to stderr if the message appears
unsubmitted. Default send behavior is unchanged when --verify is omitted.
New pure exported helpers (all unit-testable without real tmux/systemd):
buildSystemdShowCommand, buildTmuxListPanesCommand, buildAgentWatchCommand,
buildAgentVerifyAcceptedCommand, parseHeartbeat, parseSystemdShow,
parseTmuxListPanes, detectDrift, getDefaultTenantAndHost, isSendAccepted,
heartbeatPath. Added 31 new spec cases (62 total) covering exact command
construction, JSON shape, heartbeat parsing, drift detection, and verify flow.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
Establish Fleet workstream doctrine under mvp-20260312: north star (incl.
fleet-as-means-of-production), Phase-2 observability PRD, workstream tasks,
and scratchpad. Collision-safe: scoped to docs/fleet/, touches none of the
MVP single-writer control-plane files.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
Fresh `mosaic gateway install` (npm) left the gateway DB schema empty —
sign-in 500'd with `relation "users" does not exist`, and every entry
point (auth, bootstrap setup) failed because they all query the users
table first. Five stacked bugs on the local (PGlite) tier:
1. `packages/db/package.json` `files: ["dist"]` excluded the `drizzle/`
SQL migrations from the published tarball.
2. `runMigrations()` only supports postgres-js — unusable for embedded
PGlite.
3. `apps/gateway/src/database/database.module.ts` never invoked
migrations at startup.
4. `createPgliteDb` didn't load pgvector, so migration 0001's
`CREATE EXTENSION vector` failed.
5. Drizzle's PG migrator wraps every migration in one outer
transaction, which trips Postgres' `check_safe_enum_use` on
migration 0009 (`ALTER TYPE ADD VALUE 'pending'` → `SET DEFAULT
'pending'` in the same tx).
Changes:
- Ship `drizzle/` in the published tarball.
- `createPgliteDb` loads `@electric-sql/pglite/vector`.
- New `runPgliteMigrations(handle)` walks the Drizzle journal and
runs each statement-breakpoint chunk through PGlite's `client.exec()`
(autocommit per statement). Records into `drizzle.__drizzle_migrations`
for interop with the postgres-js path. Per-statement try/catch
surfaces which statement of which migration failed.
- `DatabaseModule` runs migrations in `OnModuleInit` before
`app.listen()`. Local tier: explicit `runPgliteMigrations` then
`storageAdapter.migrate()`. Postgres tier: just `storageAdapter.migrate()`,
which already calls `runMigrations(url)` internally — no double-call.
- Removed `packages/storage/src/test-utils/pglite-with-vector.ts`. The
"intentionally not exported" rationale is moot now that migration
0001 forces pgvector load anyway. The integration test uses
`createPgliteDb` + `runPgliteMigrations` from `@mosaicstack/db`.
Tests: BetterAuth tables exist after migrate; idempotent (re-runs 0009);
partial-failure surfaces statement-level context and leaves no ledger row.
QA on a fresh PGlite install:
- `Applying PGlite schema migrations...` then `Initializing storage
adapter (pglite)...` in startup log.
- `GET /api/bootstrap/status` → `{"needsSetup":true}` HTTP 200 (was 500).
- `POST /api/bootstrap/setup` reaches Zod validator (was 500).
Scope: this PR fixes the local (PGlite) tier. Postgres-tier first
install still has the outer-transaction problem and a journal ordering
bug (0009's `when` < 0008's). Documented inline as TODO and in the
scratchpad — needs a separate change with real-Postgres validation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CRIT-1: Validate cert subjectUserId against grant.subjectUserId from DB;
use authoritative DB value in FederationContext
- CRIT-2: Add @Inject(GrantsService) decorator (tsx/esbuild requirement)
- HIGH-1: Validate UTF8String TLV tag, length, and bounds in OID parser
- HIGH-2: Collapse all 403 wire messages to a generic string to prevent
grant enumeration; keep internal logger detail
- HIGH-3: Assert federation wire envelope shape in all guard tests
- HIGH-4: Regression test for subjectUserId cert/DB mismatch
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds FederationAuthGuard that validates inbound mTLS client certs on
federation API routes. Extracts custom OIDs (grantId, subjectUserId),
loads the grant+peer from DB in one query, asserts active status, and
validates cert serial as defense-in-depth. Attaches FederationContext
to requests on success and uses federation wire-format error envelopes
(not raw NestJS exceptions) for 401/403 responses.
New files:
- apps/gateway/src/federation/oid.util.ts — shared OID extraction (no dupe ASN.1 logic)
- apps/gateway/src/federation/server/federation-auth.guard.ts — guard impl
- apps/gateway/src/federation/server/federation-context.ts — FederationContext type + module augment
- apps/gateway/src/federation/server/index.ts — barrel export
- apps/gateway/src/federation/server/__tests__/federation-auth.guard.spec.ts — 11 unit tests
Modified:
- apps/gateway/src/federation/grants.service.ts — adds getGrantWithPeer() with join
- apps/gateway/src/federation/federation.module.ts — registers FederationAuthGuard as provider
Closes#462
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- HIGH-A: resolveEntry now uses promise-cache pattern so concurrent
callers serialize on a single in-flight build, eliminating duplicate
key material in heap and duplicate DB round-trips
- HIGH-B: flushPeer destroys the evicted undici Agent so stale TLS
connections close on cert rotation
- MED-C: add regression test for PEER_MISCONFIGURED when
STEP_CA_ROOT_CERT_PATH is unset
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CRIT-1: regenerate pnpm-lock.yaml so apps/gateway resolves undici@7.24.6
(prior PR pushed package.json without lockfile update; CI failed with
ERR_PNPM_OUTDATED_LOCKFILE). Incidentally cleans 57 lines of stale
peer-dep entries.
CRIT-2: cache-hit test no longer swallows resolveEntry errors. Calls the
private method directly twice and asserts identity equality plus a
single DB select, removing the silent-failure path the prior assertion
allowed.
HIGH-1: mTLS Agent now pins Step-CA root via STEP_CA_ROOT_CERT_PATH.
Without the env var resolveEntry throws PEER_MISCONFIGURED, refusing to
dial peers against the public trust store. PEM is read once and cached
on the service instance.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements FederationClientService — a NestJS injectable that dials peer
gateways over mTLS (undici Agent with cert+sealed-key from federation_peers),
invokes list/get/capabilities verbs, validates responses via Zod, and surfaces
all failure modes as typed FederationClientError with a coherent error code
taxonomy (PEER_NOT_FOUND, PEER_INACTIVE, PEER_MISCONFIGURED, NETWORK,
FORBIDDEN, HTTP_{status}, INVALID_RESPONSE).
Per-peer Agent instances are cached in a Map for the service lifetime;
flushPeer(peerId) invalidates the cache for M5/M6 cert rotation and
revocation events.
Wired into FederationModule providers + exports so QuerySourceService
(M3-09) can inject it.
13 unit tests covering all required scenarios via undici MockAgent +
real sealClientKey/unsealClientKey round-trip.
Closes#462
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>