Remediation of coder3 independent-validation blocker on PR #547.
lane-brief.sh inspected only the open-PR index/title/head fields, never the PR
BODY or Gitea issue linkage. A body-only "Closes #546" was therefore invisible,
so issue #546 (open, with 'Closes #546' in the body of PR #547) was placed under
DISPATCH CANDIDATES with a work-underway count of 0. In-flight work looked
re-dispatchable, which is unacceptable for a dispatch-truth tool.
Fix:
- Fetch open PRs as JSON including `body`; resolve PR->issue links via Gitea's
closing-keyword set (close/closes/closed, fix/fixes/fixed, resolve/resolves/
resolved), matched case-insensitively, word-boundary anchored, with the `#N`
reference directly following the keyword. Any issue so linked from an OPEN PR
is classified WORK UNDERWAY.
- Preserve the prior title/head bare-ref heuristic and per-repo behavior; require
`#` immediately after the keyword so cross-repo `owner/repo#N` forms don't leak.
- Bare `#N` prose mentions in a body are intentionally NOT links (e.g. "#538 line
of work") to avoid marking live, dispatchable issues as in-flight.
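The matching rule above can be sketched as follows. This is an illustrative
TypeScript sketch, not the shell/jq code in lane-brief.sh; `CLOSING_REF` and
`linkedIssues` are hypothetical names, and allowing whitespace between the
keyword and `#` is an assumption based on the 'Closes #546' example.

```typescript
// Closing keywords from the list above; \b stops "hotfix" matching as "fix".
// Assumption: whitespace separates the keyword from "#", as in "Closes #546".
const CLOSING_REF = /\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)\b/;

// Hypothetical helper: issue numbers a PR body links via a closing keyword.
function linkedIssues(body: string): number[] {
  const re = new RegExp(CLOSING_REF.source, "gi"); // fresh regex, fresh lastIndex
  const found: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) {
    const n = Number(m[1]);
    if (found.indexOf(n) === -1) found.push(n);
  }
  return found;
}
```

Bare `#N` mentions and cross-repo `owner/repo#N` forms yield no link here,
because `#` must follow the keyword itself.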
Tests (committed, with RED-on-revert proof of non-vacuity):
- test-lane-brief-pr-linkage.sh: open-PR-with-'Closes #546'-in-body excludes #546
from candidates (and a reverted copy with the body-scan removed regresses #546
to a candidate — RED proof); bare #777 and substring 'hotfix #999' stay
candidates (word-boundary + closing-keyword-only guards).
- test-ci-wait-exit-matrix.sh: ci-wait.sh exit matrix 0 (all-success) / 1
(terminal-not-success: failure + error/killed) / 2 (usage) / 3 (timeout).
shellcheck -x + bash -n clean on all four files; no secret values. ci-wait.sh
unchanged (coder3 PASS preserved). Closed-issue exclusion unchanged.
Refs #546, PR #547
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
An independent survey of the tooling surface found two defects in the new
helpers; both are fixed pre-gate:
- lane-brief.sh: label filter used jq contains() (substring) — `-l security`
wrongly matched label `domain/6-security`. Now exact-token match against
tea's space-separated labels string. Verified live: `-l security` -> 0,
`-l domain/6-security` -> the real holders.
- ci-wait.sh: unknown owner silently defaulted to the `usc` Woodpecker
instance (wrong credentials, wrong pipelines). Now fails hard requiring
`-a <instance>`, matching lane-brief's FATAL-on-unresolved behavior.
Verified: usc owner still infers and exits 0; unknown owner exits 2 with
guidance.
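The label-filter defect above comes down to substring versus exact-token
matching. A minimal TypeScript sketch of the difference (the real fix is a jq
expression in lane-brief.sh; both function names are hypothetical):

```typescript
// Old behaviour: substring containment, so "security" hits "domain/6-security".
function hasLabelSubstring(labels: string, wanted: string): boolean {
  return labels.indexOf(wanted) !== -1;
}

// Fixed behaviour: split the space-separated labels string and compare
// whole tokens only.
function hasLabelExact(labels: string, wanted: string): boolean {
  const tokens = labels.split(/\s+/).filter((t) => t.length > 0);
  return tokens.indexOf(wanted) !== -1;
}
```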
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Kt2D8TsnDwhtzEAPijsNmR
Two additive orchestration tools distilled from forensic analysis of a live
U-Connect delivery session, both adopted by the live orchestrator before this
contribution.
lane-brief.sh (git/): one call returns the CURRENT open issue set for a repo
lane (milestone/label) from Gitea, classified for dispatch. Defeats stale
worker self-report (workers brief from static notes and report already-CLOSED
issues as "todo"). Closed excluded by definition; partitions by PR-linkage
(reliable) not assignee/dependency (empty in this fleet). Login resolution:
-L > $GITEA_LOGIN > owner inference > detect-platform.sh fallback.
ci-wait.sh (woodpecker/): blocks until pipeline(s) reach terminal state,
wrapping pipeline-status.sh (resolves repo->id, instance-aware). Replaces
hand-rolled `curl .../repos/1/pipelines/$n` loops that hardcode repo id 1.
Intended as a Monitor command + long (>=1500s) timed fallback, not a tight
poll. Exit 0=all success / 1=terminal non-success / 2=usage / 3=timeout.
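The exit matrix can be read as a small decision function. The TypeScript
sketch below is illustrative only (ci-wait.sh is a shell script);
`ciWaitExitCode` is a hypothetical name, treating any status other than
success/failure/error/killed as still in flight is an assumption, and so is
mapping an empty pipeline list to the usage code.

```typescript
const TERMINAL_NON_SUCCESS = ["failure", "error", "killed"];

// Returns the exit code, or null while the caller should keep waiting.
function ciWaitExitCode(
  statuses: string[],
  timedOut: boolean,
): 0 | 1 | 2 | 3 | null {
  // Assumption: nothing to wait on is treated as a usage error.
  if (statuses.length === 0) return 2;
  const isTerminal = (s: string) =>
    s === "success" || TERMINAL_NON_SUCCESS.indexOf(s) !== -1;
  if (statuses.every(isTerminal)) {
    return statuses.every((s) => s === "success") ? 0 : 1;
  }
  // Something is still pending or running.
  return timedOut ? 3 : null;
}
```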
Tested live vs usc/uconnect. README updated. No version bump (separate
release PR per convention).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Kt2D8TsnDwhtzEAPijsNmR
Fresh `mosaic gateway install` (npm) left the gateway DB schema empty —
sign-in 500'd with `relation "users" does not exist`, and every entry
point (auth, bootstrap setup) failed because they all query the users
table first. Five stacked bugs on the local (PGlite) tier:
1. `packages/db/package.json` `files: ["dist"]` excluded the `drizzle/`
SQL migrations from the published tarball.
2. `runMigrations()` only supports postgres-js — unusable for embedded
PGlite.
3. `apps/gateway/src/database/database.module.ts` never invoked
migrations at startup.
4. `createPgliteDb` didn't load pgvector, so migration 0001's
`CREATE EXTENSION vector` failed.
5. Drizzle's PG migrator wraps every migration in one outer
transaction, which trips Postgres' `check_safe_enum_use` on
migration 0009 (`ALTER TYPE ADD VALUE 'pending'` → `SET DEFAULT
'pending'` in the same tx).
Changes:
- Ship `drizzle/` in the published tarball.
- `createPgliteDb` loads `@electric-sql/pglite/vector`.
- New `runPgliteMigrations(handle)` walks the Drizzle journal and
runs each statement-breakpoint chunk through PGlite's `client.exec()`
(autocommit per statement). Records into `drizzle.__drizzle_migrations`
for interop with the postgres-js path. Per-statement try/catch
surfaces which statement of which migration failed.
- `DatabaseModule` runs migrations in `OnModuleInit` before
`app.listen()`. Local tier: explicit `runPgliteMigrations` then
`storageAdapter.migrate()`. Postgres tier: just `storageAdapter.migrate()`,
which already calls `runMigrations(url)` internally — no double-call.
- Removed `packages/storage/src/test-utils/pglite-with-vector.ts`. The
"intentionally not exported" rationale is moot now that migration
0001 forces pgvector load anyway. The integration test uses
`createPgliteDb` + `runPgliteMigrations` from `@mosaicstack/db`.
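The per-statement approach described for `runPgliteMigrations` can be sketched
roughly as below. This is not the shipped implementation: `SqlExecutor`,
`splitStatements`, and `applyMigration` are hypothetical names, the executor
interface stands in for PGlite's client, and the journal walk and ledger write
are omitted.

```typescript
// Marker Drizzle emits between statements in generated migration SQL.
const BREAKPOINT = "--> statement-breakpoint";

// Hypothetical stand-in for a client that runs one SQL string at a time.
interface SqlExecutor {
  exec(sql: string): Promise<unknown>;
}

function splitStatements(sql: string): string[] {
  return sql
    .split(BREAKPOINT)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0);
}

// Runs each chunk on its own (no outer transaction) and reports which
// statement of which migration failed. The caller would record the ledger
// row only after this resolves, so a partial failure leaves no row.
async function applyMigration(
  db: SqlExecutor,
  tag: string,
  sql: string,
): Promise<number> {
  const statements = splitStatements(sql);
  for (let i = 0; i < statements.length; i++) {
    try {
      await db.exec(statements[i]);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(
        `migration ${tag}, statement ${i + 1}/${statements.length} failed: ${reason}`,
      );
    }
  }
  return statements.length;
}
```

Running statements individually is what lets `ALTER TYPE ... ADD VALUE` commit
before a later statement uses the new enum value.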
Tests: BetterAuth tables exist after migrate; idempotent (re-runs 0009);
partial-failure surfaces statement-level context and leaves no ledger row.
QA on a fresh PGlite install:
- `Applying PGlite schema migrations...` then `Initializing storage
adapter (pglite)...` in startup log.
- `GET /api/bootstrap/status` → `{"needsSetup":true}` HTTP 200 (was 500).
- `POST /api/bootstrap/setup` reaches Zod validator (was 500).
Scope: this PR fixes the local (PGlite) tier. Postgres-tier first
install still has the outer-transaction problem and a journal ordering
bug (0009's `when` < 0008's). Documented inline as TODO and in the
scratchpad — needs a separate change with real-Postgres validation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CRIT-1: Validate cert subjectUserId against grant.subjectUserId from DB;
use authoritative DB value in FederationContext
- CRIT-2: Add @Inject(GrantsService) decorator (tsx/esbuild requirement)
- HIGH-1: Validate UTF8String TLV tag, length, and bounds in OID parser
- HIGH-2: Collapse all 403 wire messages to a generic string to prevent
grant enumeration; keep internal logger detail
- HIGH-3: Assert federation wire envelope shape in all guard tests
- HIGH-4: Regression test for subjectUserId cert/DB mismatch
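For HIGH-1, the kind of tag, length, and bounds check involved looks roughly
like the sketch below. It is not the code in oid.util.ts: `parseUtf8StringTlv`
is a hypothetical name, and the sketch assumes the input is exactly one DER
UTF8String TLV with a length under 256 bytes.

```typescript
const UTF8_STRING_TAG = 0x0c; // DER universal tag for UTF8String

function parseUtf8StringTlv(buf: Uint8Array): string {
  if (buf.length < 2) throw new Error("TLV too short");
  if (buf[0] !== UTF8_STRING_TAG) throw new Error("not a UTF8String");

  let length = buf[1];
  let offset = 2;
  if (length & 0x80) {
    // Long form: low 7 bits give the number of length bytes. Only one is
    // supported in this sketch.
    if ((length & 0x7f) !== 1 || buf.length < 3) {
      throw new Error("unsupported length encoding");
    }
    length = buf[2];
    if (length < 0x80) throw new Error("non-minimal length encoding");
    offset = 3;
  }

  // Bounds: the declared length must match the remaining bytes exactly.
  if (offset + length !== buf.length) throw new Error("length out of bounds");

  // fatal: true rejects malformed UTF-8 instead of substituting U+FFFD.
  return new TextDecoder("utf-8", { fatal: true }).decode(
    buf.subarray(offset, offset + length),
  );
}
```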
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds FederationAuthGuard that validates inbound mTLS client certs on
federation API routes. Extracts custom OIDs (grantId, subjectUserId),
loads the grant+peer from DB in one query, asserts active status, and
validates cert serial as defense-in-depth. Attaches FederationContext
to requests on success and uses federation wire-format error envelopes
(not raw NestJS exceptions) for 401/403 responses.
New files:
- apps/gateway/src/federation/oid.util.ts — shared OID extraction (no duplicated ASN.1 logic)
- apps/gateway/src/federation/server/federation-auth.guard.ts — guard impl
- apps/gateway/src/federation/server/federation-context.ts — FederationContext type + module augment
- apps/gateway/src/federation/server/index.ts — barrel export
- apps/gateway/src/federation/server/__tests__/federation-auth.guard.spec.ts — 11 unit tests
Modified:
- apps/gateway/src/federation/grants.service.ts — adds getGrantWithPeer() with join
- apps/gateway/src/federation/federation.module.ts — registers FederationAuthGuard as provider
Closes #462
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- HIGH-A: resolveEntry now uses promise-cache pattern so concurrent
callers serialize on a single in-flight build, eliminating duplicate
key material in heap and duplicate DB round-trips
- HIGH-B: flushPeer destroys the evicted undici Agent so stale TLS
connections close on cert rotation
- MED-C: add regression test for PEER_MISCONFIGURED when
STEP_CA_ROOT_CERT_PATH is unset
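The promise-cache pattern named in HIGH-A, in a minimal generic form.
`PromiseCache` is a hypothetical illustration, not the service's code; evicting
on rejection is an assumption about the desired retry behaviour.

```typescript
class PromiseCache<T> {
  private readonly entries = new Map<string, Promise<T>>();

  // Not async on purpose: the promise is stored before any await, so every
  // concurrent caller gets the same in-flight build.
  get(key: string, build: () => Promise<T>): Promise<T> {
    const existing = this.entries.get(key);
    if (existing !== undefined) return existing;

    const created = build();
    this.entries.set(key, created);
    // Assumption: a failed build is evicted so the next call can retry.
    created.catch(() => {
      if (this.entries.get(key) === created) this.entries.delete(key);
    });
    return created;
  }

  flush(key: string): void {
    this.entries.delete(key);
  }
}
```

Caching the promise rather than the resolved value is what closes the window
in which two callers both miss and both build.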
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CRIT-1: regenerate pnpm-lock.yaml so apps/gateway resolves undici@7.24.6
(prior PR pushed package.json without lockfile update; CI failed with
ERR_PNPM_OUTDATED_LOCKFILE). Incidentally cleans 57 lines of stale
peer-dep entries.
CRIT-2: cache-hit test no longer swallows resolveEntry errors. Calls the
private method directly twice and asserts identity equality plus a
single DB select, removing the silent-failure path the prior assertion
allowed.
HIGH-1: mTLS Agent now pins Step-CA root via STEP_CA_ROOT_CERT_PATH.
Without the env var resolveEntry throws PEER_MISCONFIGURED, refusing to
dial peers against the public trust store. PEM is read once and cached
on the service instance.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements FederationClientService — a NestJS injectable that dials peer
gateways over mTLS (undici Agent with cert+sealed-key from federation_peers),
invokes list/get/capabilities verbs, validates responses via Zod, and surfaces
all failure modes as typed FederationClientError with a coherent error code
taxonomy (PEER_NOT_FOUND, PEER_INACTIVE, PEER_MISCONFIGURED, NETWORK,
FORBIDDEN, HTTP_{status}, INVALID_RESPONSE).
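One way to type such a taxonomy in TypeScript is a union with a template
literal member for the `HTTP_{status}` family. The class shape and
`codeForStatus` below are illustrative assumptions rather than the service's
actual definitions, and mapping 403 to FORBIDDEN is a guess at the intended
rule.

```typescript
type FederationErrorCode =
  | "PEER_NOT_FOUND"
  | "PEER_INACTIVE"
  | "PEER_MISCONFIGURED"
  | "NETWORK"
  | "FORBIDDEN"
  | "INVALID_RESPONSE"
  | `HTTP_${number}`;

class FederationClientError extends Error {
  readonly code: FederationErrorCode;

  constructor(code: FederationErrorCode, message: string) {
    super(message);
    this.name = "FederationClientError";
    this.code = code;
  }
}

// Assumption: 403 gets its own code, every other non-2xx status becomes
// HTTP_<status>.
function codeForStatus(status: number): FederationErrorCode {
  if (status === 403) return "FORBIDDEN";
  return `HTTP_${status}`;
}
```

Callers can then switch on `err.code` instead of parsing message strings.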
Per-peer Agent instances are cached in a Map for the service lifetime;
flushPeer(peerId) invalidates the cache for M5/M6 cert rotation and
revocation events.
Wired into FederationModule providers + exports so QuerySourceService
(M3-09) can inject it.
13 unit tests covering all required scenarios via undici MockAgent +
real sealClientKey/unsealClientKey round-trip.
Closes #462
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>