fix(pr-ci-wait): CI-history primary tier — close webhook-lag false-green (#550)

F-06 follow-up per Mos ruling. The no-CI fast-exit was a pure empty-poll streak
(NO_CI_MAX×interval ≈ 45s), so a slow-to-register pipeline (webhook/queue lag)
looked like 'no CI' and could false-green a merge gate before the pipeline existed.

Two-tier no-CI determination:
- PRIMARY: probe the repo's DEFAULT BRANCH commit status once at startup. If it
  has CI history, the repo runs CI → an empty status on the PR head means the
  pipeline has not REGISTERED yet → never fast-green; poll until it registers or
  timeout (both safe). Closes the webhook-lag false-green.
- SECONDARY: the empty-poll streak fast-exit now applies ONLY to genuinely
  CI-less repos (default branch also has no CI history). Preserves the original
  no-CI win.
- Probe failure → conservative REPO_HAS_CI=1 (assume CI; wait-then-timeout beats
  false-green).

All early returns are explicit 'return 0' + guarded call so the probe can never
abort under set -e.

Verified: bash -n + shellcheck clean; behavioral harness covers established-repo
(stays 1), CI-less (→0), empty-branch/probe-fail (conservative 1), and the
no-status gate (has-CI never fast-greens, CI-less fast-exits).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Kt2D8TsnDwhtzEAPijsNmR
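The two-tier determination described in the commit message can be modelled as two small pure functions. This is only an illustrative sketch in TypeScript: the real tool is a bash script, and the names `classifyRepo`, `onEmptyPoll`, `RepoCi` and the "status count or null" probe result are assumptions made here, not identifiers from the change.

```typescript
// Hypothetical model of the two-tier no-CI decision; not the actual bash implementation.
type RepoCi = 'has-ci' | 'no-ci';
type EmptyPollVerdict = 'keep-polling' | 'fast-exit-no-ci';

// Primary tier: classify the repo once from the default-branch probe.
// `statusCount` is the number of CI statuses found on the default branch,
// or null when the probe itself failed (assumed representation).
function classifyRepo(statusCount: number | null): RepoCi {
  // Probe failure is handled conservatively: assume CI exists.
  if (statusCount === null) return 'has-ci';
  return statusCount > 0 ? 'has-ci' : 'no-ci';
}

// Secondary tier: the empty-poll streak may only fast-exit on CI-less repos.
function onEmptyPoll(repo: RepoCi, emptyStreak: number, noCiMax: number): EmptyPollVerdict {
  // A repo with CI history never fast-greens; it polls until the pipeline
  // registers or the overall timeout fires.
  if (repo === 'has-ci') return 'keep-polling';
  return emptyStreak >= noCiMax ? 'fast-exit-no-ci' : 'keep-polling';
}
```

With this shape, a failed probe and an established repo both take the safe path: the worst case is a wait that ends in a timeout, never a premature green.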
fix(framework/tools): wrapper hardening — TLS validation, cred-path fallback, no-CI fast-exit (#550)
2026-06-18 14:18:32 -05:00 · 2026-06-18 14:02:43 -05:00 · 2026-06-16 21:35:40 +00:00 · 2026-06-16 01:10:44 +00:00
30 changed files with 2217 additions and 12 deletions
--- a/apps/appservice/src/tests/server.test.ts
+++ b/apps/appservice/src/tests/server.test.ts
@@ -3,6 +3,8 @@ import { describe, expect, it, vi } from 'vitest';
 import { AppserviceDaemon } from '../server.js';
 import type { DaemonConfig, DaemonRequest } from '../server.js';
 const AGENTS_TYPE = 'org.uscllc.mosaic_as.agents';
 const cfg: DaemonConfig = {
  homeserverUrl: 'https://hs.example',
  domain: 'hs.example',
@@ -228,6 +230,149 @@ describe('AppserviceDaemon routing', () => {
    expect(bad.status).toBe(400);
  });
  // A daemon whose fetch mock backs account_data with a mutable in-test object,
  // so register/verify/revoke round-trip through the (faked) homeserver.
  const makeAgentDaemon = () => {
    const accountData: { value: Record<string, unknown> | null } = { value: null };
    const fetchMock = vi.fn(async (input: URL | string, init?: RequestInit) => {
      const url = new URL(String(input));
      const path = url.pathname;
      if (path.includes(`/account_data/${AGENTS_TYPE}`)) {
        if (init?.method === 'PUT') {
          accountData.value = JSON.parse(String(init.body)) as Record<string, unknown>;
          return jsonResponse(200, {});
        }
        if (accountData.value === null) {
          return jsonResponse(404, { errcode: 'M_NOT_FOUND', error: 'not found' });
        }
        return jsonResponse(200, accountData.value);
      }
      if (path.endsWith('/register')) return jsonResponse(200, { user_id: 'whatever' });
      if (path.includes('/send/m.room.message/')) return jsonResponse(200, { event_id: '$sent' });
      return jsonResponse(200, {});
    });
    const daemon = new AppserviceDaemon(cfg, fetchMock as unknown as typeof fetch, () => {});
    return { daemon, fetchMock };
  };
  const registerAgent = async (
    daemon: AppserviceDaemon,
    body: Record<string, unknown> = { alias: 'pi0', host: 'web1' },
  ) =>
    daemon.handle(
      request({
        method: 'POST',
        path: '/bridge/v1/agents',
        authorizationHeader: 'Bearer bridge-secret',
        body,
      }),
    );
  it('host token registers an agent and returns agent_user_id + bridge_token', async () => {
    const { daemon, fetchMock } = makeAgentDaemon();
    const res = await registerAgent(daemon, { alias: 'pi0', host: 'web1' });
    expect(res.status).toBe(200);
    expect(res.body.agent_user_id).toBe('@agent-pi0-web1:hs.example');
    expect(String(res.body.bridge_token).startsWith('magt_')).toBe(true);
    const registerCall = fetchMock.mock.calls
      .map((c) => new URL(String(c[0])))
      .find((u) => u.pathname.endsWith('/register'));
    expect(registerCall).toBeDefined();
  });
  it('register requires a HOST token (agent token and no token are 403)', async () => {
    const { daemon } = makeAgentDaemon();
    const minted = await registerAgent(daemon);
    const agentToken = String(minted.body.bridge_token);
    const asAgent = await daemon.handle(
      request({
        method: 'POST',
        path: '/bridge/v1/agents',
        authorizationHeader: `Bearer ${agentToken}`,
        body: { alias: 'pi1', host: 'web2' },
      }),
    );
    expect(asAgent.status).toBe(403);
    const noAuth = await daemon.handle(
      request({ method: 'POST', path: '/bridge/v1/agents', body: { alias: 'pi1', host: 'web2' } }),
    );
    expect(noAuth.status).toBe(403);
  });
  it('agent-scoped token may send as itself but not as another agent', async () => {
    const { daemon } = makeAgentDaemon();
    const minted = await registerAgent(daemon, { alias: 'pi0', host: 'web1' });
    const agentToken = String(minted.body.bridge_token);
    const self = await daemon.handle(
      request({
        method: 'POST',
        path: '/bridge/v1/messages',
        authorizationHeader: `Bearer ${agentToken}`,
        body: { room_id: '!r:hs.example', agent: 'pi0-web1', body: 'hi' },
      }),
    );
    expect(self.status).toBe(200);
    const other = await daemon.handle(
      request({
        method: 'POST',
        path: '/bridge/v1/messages',
        authorizationHeader: `Bearer ${agentToken}`,
        body: { room_id: '!r:hs.example', agent: 'pi9-web9', body: 'hi' },
      }),
    );
    expect(other.status).toBe(403);
    expect(other.body.error).toBe('token not scoped to this agent');
  });
  it('revoked agent token is rejected on messages', async () => {
    const { daemon } = makeAgentDaemon();
    const minted = await registerAgent(daemon, { alias: 'pi0', host: 'web1' });
    const agentToken = String(minted.body.bridge_token);
    const revoke = await daemon.handle(
      request({
        method: 'POST',
        path: '/bridge/v1/agents/revoke',
        authorizationHeader: 'Bearer bridge-secret',
        body: { agent_user_id: '@agent-pi0-web1:hs.example' },
      }),
    );
    expect(revoke.status).toBe(200);
    expect(revoke.body.revoked).toBe(1);
    const afterRevoke = await daemon.handle(
      request({
        method: 'POST',
        path: '/bridge/v1/messages',
        authorizationHeader: `Bearer ${agentToken}`,
        body: { room_id: '!r:hs.example', agent: 'pi0-web1', body: 'hi' },
      }),
    );
    expect(afterRevoke.status).toBe(403);
  });
  it('GET /bridge/v1/agents lists registered agents (host only)', async () => {
    const { daemon } = makeAgentDaemon();
    await registerAgent(daemon, { alias: 'pi0', host: 'web1', display_name: 'Pi Zero' });
    const res = await daemon.handle(
      request({
        method: 'GET',
        path: '/bridge/v1/agents',
        authorizationHeader: 'Bearer bridge-secret',
      }),
    );
    expect(res.status).toBe(200);
    const agents = res.body.agents as Array<Record<string, unknown>>;
    expect(agents).toHaveLength(1);
    expect(agents[0]?.agent_user_id).toBe('@agent-pi0-web1:hs.example');
    expect(agents[0]?.display_name).toBe('Pi Zero');
  });
  it('empty bridge token list denies everything', async () => {
    const daemon = new AppserviceDaemon({ ...cfg, bridgeTokens: [] }, undefined, () => {});
    const res = await daemon.handle(
--- a/apps/appservice/src/server.ts
+++ b/apps/appservice/src/server.ts
@@ -1,11 +1,14 @@
 import { createHmac, randomBytes, timingSafeEqual } from 'node:crypto';
 import {
  AgentTokenStore,
  AppserviceIntent,
  TransactionHandler,
  validateBridgeMessage,
  validateBridgeTyping,
  validateProvisionRoom,
  validateRegisterAgent,
  validateRevokeAgent,
 } from '@mosaicstack/appservice';
 import type { AppserviceConfig, MatrixEvent } from '@mosaicstack/appservice';
@@ -37,6 +40,13 @@ const safeEqual = (a: string, b: string): boolean => timingSafeEqual(digest(a),
 const TXN_PATH = /^\/_matrix\/app\/v1\/transactions\/([^/]+)$/;
 /**
 * Resolved identity for an authenticated /bridge/v1/* caller. Host principals
 * (the agent-comms host daemons) are unrestricted; agent principals are scoped
 * to a single virtual user and may only act as themselves.
 */
 export type BridgePrincipal = { kind: 'host' } | { kind: 'agent'; agentUserId: string } | null;
 /**
 * HTTP-framework-agnostic request router for the mosaic-as daemon: the
 * Application Service transactions endpoint (Synapse-facing) plus the
@@ -46,6 +56,7 @@ const TXN_PATH = /^\/_matrix\/app\/v1\/transactions\/([^/]+)$/;
 export class AppserviceDaemon {
  readonly intent: AppserviceIntent;
  private readonly transactions: TransactionHandler;
  private readonly agents: AgentTokenStore;
  constructor(
    private readonly cfg: DaemonConfig,
@@ -53,6 +64,7 @@ export class AppserviceDaemon {
    private readonly log: (line: string) => void = (line) => console.log(line),
  ) {
    this.intent = new AppserviceIntent(cfg, fetchImpl);
    this.agents = new AgentTokenStore(this.intent);
    this.transactions = new TransactionHandler({
      hsToken: cfg.hsToken,
      onEvent: (event) => this.onEvent(event),
@@ -69,10 +81,20 @@ export class AppserviceDaemon {
    }
  }
-  private bridgeAuthorized(authorizationHeader: string | undefined): boolean {
-    if (!authorizationHeader?.startsWith('Bearer ')) return false;
+  /** Resolve the calling principal, or null when unauthorized. Fail-closed:
+   * host tokens win (timing-safe compare); otherwise a magt_* bearer is looked
+   * up in the agent token store; anything else is rejected. */
+  private async bridgeAuthorized(
+    authorizationHeader: string | undefined,
+  ): Promise<BridgePrincipal> {
+    if (!authorizationHeader?.startsWith('Bearer ')) return null;
     const presented = authorizationHeader.slice('Bearer '.length);
-    return this.cfg.bridgeTokens.some((token) => safeEqual(presented, token));
+    if (this.cfg.bridgeTokens.some((token) => safeEqual(presented, token))) {
+      return { kind: 'host' };
+    }
+    const agentUserId = await this.agents.verifyToken(presented);
+    if (agentUserId) return { kind: 'agent', agentUserId };
+    return null;
   }
  async handle(req: DaemonRequest): Promise<DaemonResponse> {
@@ -89,12 +111,60 @@ export class AppserviceDaemon {
    }
    if (req.path.startsWith('/bridge/v1/')) {
-      if (!this.bridgeAuthorized(req.authorizationHeader)) {
+      const principal = await this.bridgeAuthorized(req.authorizationHeader);
+      if (!principal) {
        return { status: 403, body: { errcode: 'M_FORBIDDEN', error: 'bad bridge token' } };
      }
      try {
        if (req.method === 'POST' && req.path === '/bridge/v1/agents') {
          if (principal.kind !== 'host') {
            return {
              status: 403,
              body: { errcode: 'M_FORBIDDEN', error: 'agents cannot register agents' },
            };
          }
          validateRegisterAgent(req.body);
          const { agentUserId, token } = await this.agents.register({
            alias: req.body.alias,
            host: req.body.host,
            displayName: req.body.display_name,
          });
          this.log(`registered agent ${agentUserId}`);
          return { status: 200, body: { agent_user_id: agentUserId, bridge_token: token } };
        }
        if (req.method === 'POST' && req.path === '/bridge/v1/agents/revoke') {
          if (principal.kind !== 'host') {
            return {
              status: 403,
              body: { errcode: 'M_FORBIDDEN', error: 'agents cannot revoke agents' },
            };
          }
          validateRevokeAgent(req.body);
          const revoked = await this.agents.revoke(req.body.agent_user_id);
          this.log(`revoked ${revoked} token(s) for ${req.body.agent_user_id}`);
          return { status: 200, body: { revoked } };
        }
        if (req.method === 'GET' && req.path === '/bridge/v1/agents') {
          if (principal.kind !== 'host') {
            return {
              status: 403,
              body: { errcode: 'M_FORBIDDEN', error: 'agents cannot list agents' },
            };
          }
          const agents = await this.agents.list();
          return { status: 200, body: { agents } };
        }
        if (req.method === 'POST' && req.path === '/bridge/v1/messages') {
          validateBridgeMessage(req.body);
          if (
            principal.kind === 'agent' &&
            this.intent.agentUserId(req.body.agent) !== principal.agentUserId
          ) {
            return {
              status: 403,
              body: { errcode: 'M_FORBIDDEN', error: 'token not scoped to this agent' },
            };
          }
          const eventId = await this.intent.sendAsAgent({
            roomId: req.body.room_id,
            agent: req.body.agent,
@@ -107,6 +177,15 @@ export class AppserviceDaemon {
        }
        if (req.method === 'POST' && req.path === '/bridge/v1/typing') {
          validateBridgeTyping(req.body);
          if (
            principal.kind === 'agent' &&
            this.intent.agentUserId(req.body.agent) !== principal.agentUserId
          ) {
            return {
              status: 403,
              body: { errcode: 'M_FORBIDDEN', error: 'token not scoped to this agent' },
            };
          }
          await this.intent.setTyping(req.body.room_id, req.body.agent, req.body.typing);
          return { status: 200, body: {} };
        }
--- a/docs/plans/agent-reflection-loop-PRD.md
+++ b/docs/plans/agent-reflection-loop-PRD.md
@@ -0,0 +1,173 @@
 # PRD — Agent Reflection Loop (durable kernel)
 **Issue:** [#544](http://git.mosaicstack.dev/mosaicstack/stack/issues/544)
 **Source design:** jarvis-brain `docs/planning/AGENT-REFLECTION-LOOP.md` (commit df6576fc, debate-hardened v2)
 **Status:** in-progress
 **Scope rule:** Build the **durable kernel** only. The closed calibration/skill-synthesis loop
 (design §7–§8) is **gated** behind Phase-0 experiments P1/P2/P3 and is explicitly out of scope here.
 ---
 ## 1. Problem
 At end-of-run an agent holds context that never reaches the diff or the "done" message —
 assumptions, shortcuts, untested paths, the single most-likely way the work is wrong. That context
 is what a lead/human needs to judge trust, and it evaporates when the session ends. Capture it
 mechanically as **structured data** (`reflection.v1`), and derive a **review risk-floor** from the
 change surface so risky diffs are flagged for independent review.
 ## 2. Non-goals (gated on Phase-0)
 - No closed calibration loop (predicted-vs-actual scoring as a routing input).
 - No skill synthesis.
 - No automated reviewer routing/dispatch. The kernel **writes** the sidecar; pickup is future work.
 ## 3. Components & exact placement (main-branch truth)
 | #   | Component            | Path                                                                                             | Mirror                              |
 | --- | -------------------- | ------------------------------------------------------------------------------------------------ | ----------------------------------- |
 | a   | Stop hook (capture)  | `packages/mosaic/framework/tools/qa/reflect-stop-hook.sh`                                        | `tools/qa/prevent-memory-write.sh`  |
 | a   | Hook registration    | `packages/mosaic/framework/runtime/claude/settings.json` (`hooks.Stop`)                          | existing `PreToolUse`/`PostToolUse` |
 | b   | JSON Schema          | `packages/macp/src/schemas/reflection.v1.schema.json`                                            | `schemas/task.schema.json`          |
 | b   | TS types (zod) + DTO | `packages/types/src/reflection/{index.ts,reflection.dto.ts}` + re-export from `src/index.ts`     | `packages/types/src/federation/*`   |
 | c   | Diff risk-floor      | `packages/macp/src/risk-floor.ts` (+ `__tests__/risk-floor.test.ts`, export from `src/index.ts`) | `packages/macp/src/gate-runner.ts`  |
 | d   | Phase-0 scripts      | `scripts/analysis/reflect-{git-history,board-history,calibration}.sh`                            | `scripts/publish-npmjs.sh`          |
 **Activation note (deliberate deviation):** the `settings-overlays/` directory has **no merge
 mechanism** (referenced only in docs), so a hooks overlay there would be inert. The Stop hook is
 registered in the canonical `runtime/claude/settings.json` — the same file the `mosaic` launcher
 reflects into `~/.claude/settings.json` (verified byte-identical hooks live there). Still fully
 vendored in-repo.
 ## 4. `reflection.v1` schema (authoritative field list)
 ```jsonc
 {
  "schema": "reflection.v1", // literal
  "task_ref": "string", // canonical task ref; kernel derives from REFLECTION_TASK_REF or repo+branch
  "agent": "string", // persona/runtime id (REFLECTION_AGENT or "unknown")
  "session_id": "string", // from Stop payload session_id, else "unknown"
  "timestamp": "string", // ISO-8601 UTC
  "repo": "string", // repo root basename
  "confidence": 0.0, // FLOAT [0,1] — SELF-REPORTED (optional; null if not supplied)
  "most_likely_wrong": {
    // SELF-REPORTED (optional)
    "surface": "auth|data|infra|ui|build|test|docs|none",
    "description": "string",
  },
  "known_not_in_diff": "string|null", // SELF-REPORTED: "what I know that isn't visible in the diff"
  "risk": {
    // MECHANICAL — from risk-floor
    "needs_review": true,
    "score": 0.0, // [0,1]
    "surface": "auth|data|infra|ui|build|test|docs|none",
    "reason": "string",
  },
  "files_changed": ["string"], // MECHANICAL — git diff name-only
  "provenance": {
    "source": "stop-hook",
    "reflection_attempt": 1,
    "degraded": false, // true if self-report inputs missing/unreadable
    "reflection_mode": "off|solo|orchestrated",
  },
 }
 ```
 **Mechanical vs self-reported.** A bash Stop hook cannot author the agent's self-assessment. The
 hook populates the **mechanical** fields deterministically (risk, files_changed, provenance, ids).
 The **self-reported** fields are read from an optional agent-supplied input file
 (`$REFLECTION_INPUT`, default `<repo>/.mosaic/reflection-input.json`) and merged if present;
 absent/unreadable → those fields null and `provenance.degraded=true`. This realizes the design's
 "hook is a pre-seed, not the asker" (§4).
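The merge rule above can be sketched as a small pure function. This is a minimal illustration, not the hook's code: `mergeSelfReport`, `SelfReport` and `Merged` are hypothetical names, and only the field names are taken from the `reflection.v1` listing. It assumes the caller passes the file text, or `null` when the file is absent or unreadable.

```typescript
// Hypothetical types; field names follow the reflection.v1 listing above.
interface SelfReport {
  confidence: number | null;
  most_likely_wrong: { surface: string; description: string } | null;
  known_not_in_diff: string | null;
}
type Merged = SelfReport & { degraded: boolean };

// `rawInput` is the text of the optional self-report file, or null if it
// could not be read (assumed calling convention).
function mergeSelfReport(rawInput: string | null): Merged {
  const missing: Merged = {
    confidence: null,
    most_likely_wrong: null,
    known_not_in_diff: null,
    degraded: true,
  };
  if (rawInput === null) return missing;

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawInput);
  } catch {
    return missing; // unreadable JSON counts as absent
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) return missing;

  const p = parsed as Record<string, unknown>;
  const mlw = p.most_likely_wrong as Record<string, unknown> | null | undefined;
  const hasMlw =
    typeof mlw === 'object' &&
    mlw !== null &&
    typeof mlw.surface === 'string' &&
    typeof mlw.description === 'string';

  return {
    // Out-of-range or non-numeric confidence is dropped rather than trusted.
    confidence:
      typeof p.confidence === 'number' && p.confidence >= 0 && p.confidence <= 1
        ? p.confidence
        : null,
    most_likely_wrong: hasMlw
      ? { surface: String(mlw!.surface), description: String(mlw!.description) }
      : null,
    known_not_in_diff: typeof p.known_not_in_diff === 'string' ? p.known_not_in_diff : null,
    degraded: false,
  };
}
```

Individual fields that fail their shape check fall back to `null` without flipping `degraded`; whether a partially valid self-report should also count as degraded is a policy choice this sketch does not settle.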
 ## 5. Stop hook behavior (fail-closed, non-blocking)
 1. Read Stop payload JSON from stdin.
 2. **Fail-closed:** if `REFLECTION_MODE` is unset or `off` → `exit 0` immediately (strict no-op). This
   is the global-registration safety guarantee.
 3. **Sentinel guard:** if `<sidecar>.lock` exists → `exit 0` (prevents re-fire loops). Create it,
   `trap` cleanup.
 4. Determine output dir: `$REFLECTION_DIR` else `<repo>/.mosaic/reflections/`. `mkdir -p`.
 5. Compute mechanical fields: `git diff --name-only` (HEAD + staged + worktree, best-effort),
   call risk-floor logic (inline bash port OR `node -e` into `@mosaicstack/macp` — see §6), session
   ids from payload + env.
 6. Merge optional `$REFLECTION_INPUT` self-report if readable JSON.
 7. Write `reflection.v1` to a temp file, `mv` (atomic) to `<dir>/<session>-<ts>.reflection.json`.
 8. Always `exit 0`. **Never** emit a `decision` field (Stop hooks are observational).
 Hook must never fail the session: wrap risky steps, default to `degraded:true` on any error, exit 0.
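The gate, lock, and atomic-write steps above can be sketched in a few lines. The hook itself is bash; this TypeScript version is only an illustration of the control flow, and `captureReflection` with its parameters is a hypothetical helper. It uses Node's `node:fs` calls (`existsSync`, `mkdirSync`, `writeFileSync`, `renameSync`, `rmSync`) and, unlike the numbered steps, creates the output directory before checking the lock so the lock file has somewhere to live.

```typescript
import { existsSync, mkdirSync, renameSync, rmSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';

// Hypothetical helper: returns the sidecar path, or null when the hook no-ops.
function captureReflection(
  mode: string | undefined,
  dir: string,
  sessionId: string,
  timestamp: string,
  doc: Record<string, unknown>,
): string | null {
  // Fail-closed gate: unset or "off" writes nothing at all.
  if (mode === undefined || mode === '' || mode === 'off') return null;

  mkdirSync(dir, { recursive: true });
  const finalPath = join(dir, `${sessionId}-${timestamp}.reflection.json`);
  const lockPath = `${finalPath}.lock`;

  // Sentinel guard: an existing lock suppresses a re-fire.
  if (existsSync(lockPath)) return null;
  writeFileSync(lockPath, String(process.pid));
  try {
    // Write to a temp file first, then rename into place so readers never
    // observe a half-written sidecar (rename is atomic on one filesystem).
    const tmpPath = `${finalPath}.tmp`;
    writeFileSync(tmpPath, JSON.stringify(doc, null, 2));
    renameSync(tmpPath, finalPath);
    return finalPath;
  } finally {
    // Equivalent of the bash trap cleanup.
    rmSync(lockPath, { force: true });
  }
}
```

A caller that mirrors the "always exit 0" rule would wrap this in its own try/catch and swallow any error after marking the record degraded.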
 ## 6. Risk-floor (`packages/macp/src/risk-floor.ts`)
 Pure, deterministic, no IO. Single source of truth for the verdict; the hook calls it via
 `node --input-type=module -e` (importing the built package) **or**, to avoid a node dependency in the
 hook path, the hook ports the same surface table. **Decision:** implement the canonical logic in TS
 (tested), and have the hook shell out to node when available, else fall back to a minimal inline
 classifier flagged `degraded:true`. (Keep the TS the authority; the inline path is a safety net.)
 ```ts
 export type ReviewSurface = 'auth' | 'data' | 'infra' | 'ui' | 'build' | 'test' | 'docs' | 'none';
 export interface RiskFloorInput {
  filesChanged: string[];
  insertions?: number;
  deletions?: number;
 }
 export interface RiskFloorVerdict {
  needs_review: boolean;
  score: number;
  surface: ReviewSurface;
  reason: string;
 }
 export function evaluateRiskFloor(input: RiskFloorInput): RiskFloorVerdict;
 ```
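 Surface classification by path regex. Patterns are checked in the order below, which is descending
 weight, so the first match for a file is also its highest-risk surface; across the changed files,
 the highest-risk surface dominates: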
 - `auth` (weight 1.0): `auth`, `login`, `session`, `token`, `permission`, `rbac`, `credential`, `secret`
 - `data` (0.9): `migration`, `prisma`, `schema`, `\.sql`, `entity`, `repository`, `seed`
 - `infra` (0.85): `docker`, `\.woodpecker`, `compose`, `traefik`, `deploy`, `helm`, `k8s`, `terraform`
 - `build` (0.6): `package.json`, `tsconfig`, `turbo.json`, `pnpm-`, `\.config\.`, `eslint`, `vite`
 - `ui` (0.4): `\.tsx`, `\.css`, `components/`, `apps/web/`
 - `test` (0.2): `\.spec\.`, `\.test\.`, `__tests__/`
 - `docs` (0.1): `\.md`, `docs/`
 - `none` (0.0): anything else
 `needs_review = score >= THRESHOLD` (default `0.5`, overridable). `reason` names the files+surface
 that tripped it. **Subordinate to CI:** this is a _floor_ (minimum review requirement) only;
 consumers MUST treat CI/tests as authoritative above the floor (precedence: CI/tests > human merge >
 reviewer verdict > self-reflection). Documented in the module header.
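One way the surface table and threshold could fit together is sketched below. It is not the module's actual code: the regexes are transcribed from the list above, the score is assumed to be simply the highest matching weight (the PRD does not say how `insertions`/`deletions` feed in, so they are ignored here), and the `threshold` parameter and `reason` format are illustrative choices.

```typescript
type ReviewSurface = 'auth' | 'data' | 'infra' | 'ui' | 'build' | 'test' | 'docs' | 'none';

interface RiskFloorInput {
  filesChanged: string[];
  insertions?: number; // unused in this sketch
  deletions?: number; // unused in this sketch
}

interface RiskFloorVerdict {
  needs_review: boolean;
  score: number;
  surface: ReviewSurface;
  reason: string;
}

// Ordered by descending weight, so the first match for a path is its riskiest surface.
const SURFACES: Array<{ surface: ReviewSurface; weight: number; pattern: RegExp }> = [
  { surface: 'auth', weight: 1.0, pattern: /auth|login|session|token|permission|rbac|credential|secret/i },
  { surface: 'data', weight: 0.9, pattern: /migration|prisma|schema|\.sql|entity|repository|seed/i },
  { surface: 'infra', weight: 0.85, pattern: /docker|\.woodpecker|compose|traefik|deploy|helm|k8s|terraform/i },
  { surface: 'build', weight: 0.6, pattern: /package\.json|tsconfig|turbo\.json|pnpm-|\.config\.|eslint|vite/i },
  { surface: 'ui', weight: 0.4, pattern: /\.tsx|\.css|components\/|apps\/web\// },
  { surface: 'test', weight: 0.2, pattern: /\.spec\.|\.test\.|__tests__\// },
  { surface: 'docs', weight: 0.1, pattern: /\.md|docs\// },
];

function evaluateRiskFloor(input: RiskFloorInput, threshold = 0.5): RiskFloorVerdict {
  let best: { surface: ReviewSurface; weight: number } = { surface: 'none', weight: 0 };
  const tripped: string[] = [];

  for (const file of input.filesChanged) {
    const hit = SURFACES.find((s) => s.pattern.test(file));
    if (!hit) continue;
    if (hit.weight > best.weight) {
      // A riskier surface takes over; restart the list of files that justify it.
      best = { surface: hit.surface, weight: hit.weight };
      tripped.length = 0;
    }
    if (hit.surface === best.surface) tripped.push(file);
  }

  return {
    needs_review: best.weight >= threshold,
    score: best.weight,
    surface: best.surface,
    reason: tripped.length > 0 ? `${best.surface}: ${tripped.join(', ')}` : 'no classified files',
  };
}
```

Because the patterns are plain substring regexes, a path such as `Button.test.tsx` lands on `ui` rather than `test`; that follows from the first-match ordering and may or may not be the intended behavior.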
 ## 7. Phase-0 experiment scripts (`scripts/analysis/`)
 Offline, no-infra bash. Each script: `#!/usr/bin/env bash`, `set -euo pipefail`, header `Usage:` +
 `Requirements:`, flag parsing, **prints its pre-registered kill condition**, emits structured
 (JSON/markdown) output. They are harnesses + rubrics — real corpora are wired later.
 - `reflect-git-history.sh` (**P2** — only-self-reflection bucket): scan `git log` for failure signals
  (reverts, `fix:`/`hotfix` shortly after a feature merge) over a window; classify each by which gate
  would catch it (CI / human-review / only-self-reflection) via a pre-registered heuristic; tally.
  Kill: bucket-3 near-empty → no §7/§8.
 - `reflect-board-history.sh` (**P3** — outcome detectability): given a task/board export (or the
  git history of `data/` task files), measure the fraction of completed tasks with a
  machine-detectable correct/wrong signal within 30 days. Kill: base-rate < 20% → caveat-notes only.
 - `reflect-calibration.sh` (**P1** — confidence signal): consume a labeled corpus (JSONL of
  `{confidence, correct}`), compute discrimination (AUC/lift) on the self-rated-high subset, print
  the metric vs the pre-registered chance threshold. Kill: AUC ≈ chance on the high subset → no §7/§8.
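For the P1 harness, the discrimination metric can be computed directly from the `{confidence, correct}` rows. The scripts themselves are bash; the TypeScript below is only a sketch of the arithmetic, using the pairwise (Mann-Whitney) form of AUC. The names `parseJsonl`, `highSubset` and `auc`, and the 0.8 cutoff for "self-rated-high", are assumptions for illustration.

```typescript
interface Labeled {
  confidence: number;
  correct: boolean;
}

// Parse a JSONL corpus: one {confidence, correct} object per non-empty line.
function parseJsonl(text: string): Labeled[] {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as Labeled);
}

// Restrict to the self-rated-high subset (the cutoff is a hypothetical default).
function highSubset(rows: Labeled[], cutoff = 0.8): Labeled[] {
  return rows.filter((r) => r.confidence >= cutoff);
}

// AUC as the probability that a random correct item has higher confidence
// than a random incorrect one; ties count half. Returns null when either
// class is empty, since the metric is undefined there.
function auc(rows: Labeled[]): number | null {
  const pos = rows.filter((r) => r.correct).map((r) => r.confidence);
  const neg = rows.filter((r) => !r.correct).map((r) => r.confidence);
  if (pos.length === 0 || neg.length === 0) return null;
  let wins = 0;
  for (const p of pos) {
    for (const n of neg) {
      wins += p > n ? 1 : p === n ? 0.5 : 0;
    }
  }
  return wins / (pos.length * neg.length);
}
```

An AUC near 0.5 on `highSubset(rows)` corresponds to the "AUC ≈ chance" kill condition; how close counts as "near" is the pre-registered threshold and is not fixed by this sketch.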
 ## 8. CI / quality gates
 - TS packages: `pnpm typecheck` (tsc --noEmit), `pnpm lint` (eslint), `pnpm format:check`
  (prettier), `pnpm test` (vitest). ESM, NodeNext, `.js` import specifiers, `*.dto.ts` at boundaries.
 - New files in existing packages need no CI config change; add ≥1 vitest spec per new TS module.
 - Bash scripts/hook are dev/runtime tooling, not CI-built; keep them `shellcheck`-clean.
 ## 9. Acceptance criteria
 1. `REFLECTION_MODE` unset → hook is a strict no-op (`exit 0`, no file written). **(test)**
 2. With `REFLECTION_MODE=solo`, hook writes a schema-valid `reflection.v1` with correct mechanical
   fields; self-report merged when `$REFLECTION_INPUT` present, `degraded:true` when absent.
 3. `evaluateRiskFloor` deterministic across all surfaces; unit-tested incl. auth/data/infra → review,
   docs/test → no review, empty → `none`/no review.
 4. `reflection.v1` zod type + JSON Schema agree; sidecar validates against the schema.
 5. Phase-0 scripts run offline, print kill conditions, emit structured output, shellcheck-clean.
 6. `pnpm typecheck && pnpm lint && pnpm format:check && pnpm test` green; independent review passed.
--- a/docs/scratchpads/544-agent-reflection-loop.md
+++ b/docs/scratchpads/544-agent-reflection-loop.md
@@ -0,0 +1,55 @@
 # Scratchpad — #544 Agent Reflection Loop (durable kernel)
 **Started:** 2026-06-16 · **Branch:** `feat/agent-reflection-loop` · **Base:** `main` @ c461380
 ## Goal
 Bake the durable kernel of the agent reflection loop into the Mosaic Stack
 monorepo through full delivery gates. Kernel only; closed loop (§7–§8) gated on
 Phase-0. Authoritative spec: `docs/plans/agent-reflection-loop-PRD.md`. Task
 breakdown: `docs/tasks/544-agent-reflection-loop.md`.
 ## Timeline / decisions
 - Mapped house style against `main` truth (the earlier recon had mapped a dirty
  feature branch and returned non-existent paths; re-cloned `main` clean).
 - macp uses co-located `*.spec.ts`; types uses `src/<mod>/{*.ts, *.dto.ts, __tests__/*.spec.ts}`.
 - zod v4 + class-validator/class-transformer present in `@mosaicstack/types`;
  `packages/types/tsconfig.json` enables `experimentalDecorators`/`emitDecoratorMetadata`.
 - **Gotcha (fixed):** `class-transformer`'s `@Type` calls `Reflect.getMetadata`
  at module-load time; the types vitest env has no `reflect-metadata`, so any test
  importing the reflection barrel crashed on import. `chat.dto.ts` avoids this by
  using class-validator only. Fix: dropped `@Type`/`@ValidateNested` from the DTO;
  zod owns deep nested validation.
 - **Gotcha (fixed):** Stop hook `EXIT` trap referenced a `main`-local `lock` →
  `unbound variable` under `set -u` at exit. Promoted to a global `LOCKFILE`.
 - **Gotcha (fixed):** the hook's own lock + `.mosaic/` scratch leaked into
  `files_changed`. Excluded `^\.mosaic/` from the change-surface scan.
 ## Verification evidence
 - macp: typecheck OK, lint OK, **88 tests pass** (15 new risk-floor).
 - types: typecheck OK, lint OK, **64 tests pass** (10 new reflection).
 - Root: `pnpm typecheck` (41 tasks), `pnpm lint` (23), `pnpm format:check`, `pnpm build` (23) — all green.
 - Stop hook smoke (throwaway git repo): TEST1 no-op (mode unset, 0 files);
  TEST2 solo degraded, `.mosaic/` excluded, auth→needs_review; TEST3 self-report
  merged, degraded=false; TEST4 lock suppresses re-fire. All pass, always exit 0.
 - shellcheck clean: hook + `reflect-{git-history,board-history,calibration}.sh`.
 - Phase-0 smoke: P2 on this repo (142 failures classified), P1 AUC=0.875 on a
  synthetic fixture, P3 base-rate on a synthetic board. All emit structured output
  and kill conditions.
 ## Open risks / follow-ups
 - Full `pnpm test` (DB-bound packages) validated via CI's postgres service, not
  locally; affected packages (macp, types) are DB-independent and green here.
 - sequential-thinking MCP was registered mid-session (effective next session);
  this session compensated with the written PRD as the planning artifact.
 - Phase-0 corpora are not yet wired — scripts are harnesses + pre-registered
  rubrics (P1/P2/P3 tasks tracked in jarvis-brain `agent-reflection-loop` project).
 ## Gate status
 - [x] PRD authored · [x] issue #544 created + linked · [x] code + tests
 - [x] local gates green · [ ] independent code review · [ ] PR opened
 - [ ] CI terminal green · [ ] merged to main · [ ] issue closed
--- a/docs/tasks/544-agent-reflection-loop.md
+++ b/docs/tasks/544-agent-reflection-loop.md
@@ -0,0 +1,67 @@
 # 544: Agent Reflection Loop — durable kernel
 **Issue:** [#544](http://git.mosaicstack.dev/mosaicstack/stack/issues/544)
 **PRD:** [`docs/plans/agent-reflection-loop-PRD.md`](../plans/agent-reflection-loop-PRD.md)
 **Branch:** `feat/agent-reflection-loop`
 ## Context
 Build the **durable kernel** of the agent reflection loop: passive end-of-run
 capture of the doer's end-state as structured `reflection.v1` data, plus a
 deterministic diff **review risk-floor**. The closed calibration / skill-synthesis
 loop (design §7–§8) stays **gated** behind Phase-0 experiments P1/P2/P3 and is
 explicitly out of scope here. Source design: jarvis-brain
 `docs/planning/AGENT-REFLECTION-LOOP.md` (debate-hardened v2).
 Scope rule, non-goals, the full `reflection.v1` field list, and acceptance
 criteria live in the PRD. This file is the task breakdown + status.
 ## Work items
 | #   | Item                                                  | Path                                                      | Status |
 | --- | ----------------------------------------------------- | --------------------------------------------------------- | ------ |
 | 1   | Diff risk-floor (pure, deterministic) + unit tests    | `packages/macp/src/risk-floor.ts`, `risk-floor.spec.ts`   | done   |
 | 2   | `reflection.v1` JSON Schema (documented contract)     | `packages/macp/src/schemas/reflection.v1.schema.json`     | done   |
 | 3   | `reflection.v1` zod schemas + self-report DTO + tests | `packages/types/src/reflection/*`                         | done   |
 | 4   | Stop hook (fail-closed capture)                       | `packages/mosaic/framework/tools/qa/reflect-stop-hook.sh` | done   |
 | 5   | Hook registration (`hooks.Stop`)                      | `packages/mosaic/framework/runtime/claude/settings.json`  | done   |
 | 6   | Phase-0 experiment harnesses (P1/P2/P3)               | `scripts/analysis/reflect-*.sh`                           | done   |
 ## Design decisions (this implementation)
 - **Mechanical vs self-reported split.** A bash Stop hook cannot author the
  agent's self-assessment, so it writes the mechanical fields (risk-floor verdict,
  `files_changed`, ids, provenance) and merges an optional agent-supplied
  `$REFLECTION_INPUT` self-report; absent/unreadable ⇒ those fields `null` and
  `provenance.degraded = true`.
 - **Risk-floor authority.** `evaluateRiskFloor` (TS, tested) is the source of
  truth. The hook ports the same surface table inline to avoid a node/build
  dependency on the hook path; the two are documented as kept in sync.
 - **Hook registration deviation.** `settings-overlays/` has no merge mechanism
  (docs-only), so a hooks overlay there would be inert. The Stop hook is
  registered in the canonical `runtime/claude/settings.json` — the same file the
  `mosaic` launcher reflects into `~/.claude/settings.json`. Still vendored in-repo.
 - **DTO without class-transformer.** `reflection.dto.ts` uses class-validator only
  (no `@Type`), matching `chat.dto.ts`, so the module imports without a
  `reflect-metadata` shim in the types-package test env. Deep nested validation is
  owned by the zod `ReflectionSelfReportSchema` (the runtime authority the hook uses).
 - **`.mosaic/` excluded** from the change surface — it is agent scratch
  (reflections, locks, self-report input), not part of the diff under review.
 ## Verification
 - `pnpm --filter @mosaicstack/macp test` → 88 passed (15 new risk-floor).
 - `pnpm --filter @mosaicstack/types test` → 64 passed (10 new reflection).
 - Root `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, `pnpm build` → green.
 - Stop hook smoke: fail-closed no-op (mode unset), solo capture (degraded),
  self-report merge (degraded=false), re-fire lock guard — all pass.
 - All bash (hook + 3 Phase-0 scripts) shellcheck-clean; Phase-0 scripts emit
  structured JSON/markdown and print their pre-registered kill conditions.
 ## Activation (post-merge, deployment concern — not a blocker)
 The Stop hook only activates when a launcher/profile sets
 `REFLECTION_MODE=solo|orchestrated`; unset/`off` is a strict no-op, so global
 registration is safe. `framework/install.sh` rsyncs the hook into
 `~/.config/mosaic/tools/qa/`, and the `mosaic` launcher reflects the updated
 `settings.json` (`hooks.Stop`) into `~/.claude/settings.json`.
--- a/packages/appservice/src/tests/agent-store.test.ts
+++ b/packages/appservice/src/tests/agent-store.test.ts
@@ -0,0 +1,116 @@
 import { describe, expect, it } from 'vitest';
 import { AGENTS_ACCOUNT_DATA_TYPE, AgentTokenStore } from '../agent-store.js';
 import type { AppserviceIntent } from '../intent.js';
 /** Fake intent: in-memory account_data, no-op user provisioning. Only the
 * surface AgentTokenStore touches is implemented. */
 const makeFakeIntent = () => {
  const store: Record<string, Record<string, unknown>> = {};
  const fake = {
    domain: 'hs.example',
    getSenderAccountData: async (type: string): Promise<Record<string, unknown> | null> =>
      store[type] ?? null,
    setSenderAccountData: async (type: string, content: Record<string, unknown>): Promise<void> => {
      store[type] = structuredClone(content);
    },
    ensureRegistered: async (agent: string): Promise<string> => `@agent-${agent}:hs.example`,
    setDisplayName: async (): Promise<void> => {},
  };
  return { intent: fake as unknown as AppserviceIntent, store };
 };
 describe('AgentTokenStore', () => {
  it('mints a magt_ token and stores only its sha256 (never plaintext)', async () => {
    const { intent, store } = makeFakeIntent();
    const s = new AgentTokenStore(intent);
    const { agentUserId, token } = await s.register({ alias: 'pi0', host: 'web1' });
    expect(agentUserId).toBe('@agent-pi0-web1:hs.example');
    expect(token.startsWith('magt_')).toBe(true);
    const raw = JSON.stringify(store[AGENTS_ACCOUNT_DATA_TYPE]);
    expect(raw).not.toContain(token);
    // The stored hash is sha256hex(token), 64 hex chars.
    const { createHash } = await import('node:crypto');
    const hash = createHash('sha256').update(token).digest('hex');
    expect(raw).toContain(hash);
  });
  it('verifyToken returns the agentUserId for a fresh token, null otherwise', async () => {
    const { intent } = makeFakeIntent();
    const s = new AgentTokenStore(intent);
    const { agentUserId, token } = await s.register({ alias: 'pi0', host: 'web1' });
    expect(await s.verifyToken(token)).toBe(agentUserId);
    expect(await s.verifyToken('magt_garbage')).toBeNull();
    expect(await s.verifyToken('not-a-token')).toBeNull();
    expect(await s.verifyToken('')).toBeNull();
  });
  it('revoke invalidates tokens, returns count, and hides agent from list', async () => {
    const { intent } = makeFakeIntent();
    const s = new AgentTokenStore(intent);
    const { agentUserId, token } = await s.register({ alias: 'pi0', host: 'web1' });
    expect((await s.list()).map((a) => a.agent_user_id)).toContain(agentUserId);
    const count = await s.revoke(agentUserId);
    expect(count).toBe(1);
    expect(await s.verifyToken(token)).toBeNull();
    expect((await s.list()).map((a) => a.agent_user_id)).not.toContain(agentUserId);
    // Idempotent on unknown / already-revoked.
    expect(await s.revoke(agentUserId)).toBe(0);
    expect(await s.revoke('@agent-nope:hs.example')).toBe(0);
  });
  it('re-register after revoke yields a working token and the agent reappears', async () => {
    const { intent } = makeFakeIntent();
    const s = new AgentTokenStore(intent);
    const { agentUserId, token: t1 } = await s.register({ alias: 'pi0', host: 'web1' });
    await s.revoke(agentUserId);
    const { token: t2 } = await s.register({ alias: 'pi0', host: 'web1' });
    expect(await s.verifyToken(t1)).toBeNull();
    expect(await s.verifyToken(t2)).toBe(agentUserId);
    expect((await s.list()).map((a) => a.agent_user_id)).toContain(agentUserId);
  });
  it('agent A token never verifies as agent B', async () => {
    const { intent } = makeFakeIntent();
    const s = new AgentTokenStore(intent);
    const a = await s.register({ alias: 'pi0', host: 'web1' });
    const b = await s.register({ alias: 'pi1', host: 'web2' });
    expect(await s.verifyToken(a.token)).toBe(a.agentUserId);
    expect(await s.verifyToken(b.token)).toBe(b.agentUserId);
    expect(a.agentUserId).not.toBe(b.agentUserId);
  });
  it('rejects an ambiguous re-registration that collides on one Matrix id', async () => {
    const { intent } = makeFakeIntent();
    const s = new AgentTokenStore(intent);
    // alias="a-b",host="c" and alias="a",host="b-c" both -> @agent-a-b-c.
    const first = await s.register({ alias: 'a-b', host: 'c' });
    expect(first.agentUserId).toBe('@agent-a-b-c:hs.example');
    await expect(s.register({ alias: 'a', host: 'b-c' })).rejects.toThrow(/collision/);
    // The original registration is untouched: still one active token, correct pair.
    expect(await s.verifyToken(first.token)).toBe(first.agentUserId);
    const summary = (await s.list()).find((x) => x.agent_user_id === first.agentUserId);
    expect(summary?.alias).toBe('a-b');
    expect(summary?.host).toBe('c');
    expect(summary?.active_token_count).toBe(1);
  });
  it('display_name is stored and surfaced in list', async () => {
    const { intent } = makeFakeIntent();
    const s = new AgentTokenStore(intent);
    await s.register({ alias: 'pi0', host: 'web1', displayName: 'Pi Zero' });
    const summary = (await s.list())[0];
    expect(summary?.display_name).toBe('Pi Zero');
    expect(summary?.active_token_count).toBe(1);
  });
 });
--- a/packages/appservice/src/agent-registry.dto.ts
+++ b/packages/appservice/src/agent-registry.dto.ts
@@ -0,0 +1,63 @@
 /** DTOs for agent registration + scoped/revocable bridge tokens (US-007). */
 export interface RegisterAgentDto {
  /** Agent alias slug, e.g. "pi0". Combined with host into the agent slug. */
  alias: string;
  /** Host slug, e.g. "web1". Combined with alias into the agent slug. */
  host: string;
  display_name?: string;
 }
 export interface RevokeAgentDto {
  agent_user_id: string;
 }
 export interface RegisterAgentResponse {
  agent_user_id: string;
  bridge_token: string;
 }
 export interface AgentSummary {
  agent_user_id: string;
  alias: string;
  host: string;
  display_name?: string;
  created_at: string;
  active_token_count: number;
 }
 const SLUG_RE = /^[a-z0-9][a-z0-9_.-]*$/;
 /** Combined agent slug, e.g. alias="pi0", host="web1" -> "pi0-web1". */
 export function agentSlug(alias: string, host: string): string {
  return `${alias}-${host}`;
 }
 const assertSlug = (value: unknown, field: string): void => {
  if (typeof value !== 'string' || value.length === 0 || !SLUG_RE.test(value)) {
    throw new Error(`${field} must match [a-z0-9][a-z0-9_.-]* (lowercase, non-empty)`);
  }
 };
 export function validateRegisterAgent(input: unknown): asserts input is RegisterAgentDto {
  const o = input as Partial<RegisterAgentDto> | null | undefined;
  if (!o || typeof o !== 'object') throw new Error('payload must be an object');
  assertSlug(o.alias, 'alias');
  assertSlug(o.host, 'host');
  if (o.display_name !== undefined) {
    if (typeof o.display_name !== 'string' || o.display_name.length === 0) {
      throw new Error('display_name must be a non-empty string');
    }
    if (o.display_name.length > 100) {
      throw new Error('display_name must be at most 100 chars');
    }
  }
 }
 export function validateRevokeAgent(input: unknown): asserts input is RevokeAgentDto {
  const o = input as Partial<RevokeAgentDto> | null | undefined;
  if (!o || typeof o !== 'object') throw new Error('payload must be an object');
  if (typeof o.agent_user_id !== 'string' || !o.agent_user_id.startsWith('@')) {
    throw new Error('agent_user_id must be a Matrix user id');
  }
 }
--- a/packages/appservice/src/agent-store.ts
+++ b/packages/appservice/src/agent-store.ts
@@ -0,0 +1,160 @@
 import { createHash, randomBytes, timingSafeEqual } from 'node:crypto';
 import { agentSlug } from './agent-registry.dto.js';
 import type { AgentSummary } from './agent-registry.dto.js';
 import type { AppserviceIntent } from './intent.js';
 /** account_data type holding the agent registry on the AS sender user. */
 export const AGENTS_ACCOUNT_DATA_TYPE = 'org.uscllc.mosaic_as.agents';
 const TOKEN_PREFIX = 'magt_';
 interface StoredAgent {
  alias: string;
  host: string;
  display_name?: string;
  created_at: string;
  /** sha256hex of each active token. Plaintext tokens are NEVER stored. */
  token_hashes: string[];
  revoked_at?: string;
 }
 interface AgentRegistry {
  agents: Record<string, StoredAgent>;
 }
 const sha256hex = (value: string): string => createHash('sha256').update(value).digest('hex');
 const mintToken = (): string => `${TOKEN_PREFIX}${randomBytes(32).toString('base64url')}`;
 /**
 * Persists scoped/revocable bridge tokens for agent virtual users in Matrix
 * account_data on the AS sender user (no new infra; survives restart).
 *
 * Tokens are stored only as sha256 hashes (the high-entropy `magt_` token makes
 * plain sha256 safe — no salt/KDF needed since brute force is infeasible).
 *
 * KNOWN v1 LIMIT: Synapse caps a single account_data object (default
 * max_account_data_size, ~100KB). Each agent + hash entry is small, so this
 * supports thousands of agents, but a very large fleet would eventually need a
 * dedicated store. Revoked agents with no active tokens are pruned of hashes
 * (kept as tombstones) to bound growth.
 */
 export class AgentTokenStore {
  constructor(private readonly intent: AppserviceIntent) {}
  /** Read the registry fresh from account_data (low-frequency ops favor
   * correctness over caching; verifyToken/list also read fresh). */
  private async read(): Promise<AgentRegistry> {
    const data = await this.intent.getSenderAccountData(AGENTS_ACCOUNT_DATA_TYPE);
    const agents = data?.agents;
    if (agents && typeof agents === 'object') {
      return { agents: agents as Record<string, StoredAgent> };
    }
    return { agents: {} };
  }
  private async write(registry: AgentRegistry): Promise<void> {
    await this.intent.setSenderAccountData(AGENTS_ACCOUNT_DATA_TYPE, {
      agents: registry.agents,
    });
  }
  /** Ensure the virtual user exists, mint a fresh token, store its hash, and
   * return the plaintext token ONCE. Clears any prior revocation. */
  async register(opts: {
    alias: string;
    host: string;
    displayName?: string;
  }): Promise<{ agentUserId: string; token: string }> {
    const slug = agentSlug(opts.alias, opts.host);
    const agentUserId = await this.intent.ensureRegistered(slug);
    const token = mintToken();
    const hash = sha256hex(token);
    const registry = await this.read();
    const existing = registry.agents[agentUserId];
    if (existing) {
      // The agent slug `<alias>-<host>` joins with a `-`, which is also a legal
      // slug char, so distinct pairs can collide on one Matrix id (e.g.
      // a/b-c and a-b/c both -> @agent-a-b-c). They ARE the same Matrix user,
      // but silently overwriting the stored alias/host of a different pair
      // would conflate two logical agents into one token bucket. Reject the
      // ambiguous re-registration instead of overwriting.
      if (existing.alias !== opts.alias || existing.host !== opts.host) {
        throw new Error(
          `agent id collision: ${agentUserId} already registered as ` +
            `${existing.alias}/${existing.host}, refusing ${opts.alias}/${opts.host}`,
        );
      }
      if (opts.displayName !== undefined) existing.display_name = opts.displayName;
      existing.token_hashes = [...existing.token_hashes, hash];
      delete existing.revoked_at;
    } else {
      registry.agents[agentUserId] = {
        alias: opts.alias,
        host: opts.host,
        ...(opts.displayName !== undefined ? { display_name: opts.displayName } : {}),
        created_at: new Date().toISOString(),
        token_hashes: [hash],
      };
    }
    // Set the Matrix display name only after the collision check, so a rejected
    // registration can never rename the existing agent's virtual user.
    if (opts.displayName !== undefined) {
      await this.intent.setDisplayName(slug, opts.displayName);
    }
    await this.write(registry);
    return { agentUserId, token };
  }
  /** Return the agentUserId bound to an active (non-revoked) token, else null.
   * Constant-time hash comparison; no early-out on match. */
  async verifyToken(token: string): Promise<string | null> {
    if (!token.startsWith(TOKEN_PREFIX)) return null;
    const presented = Buffer.from(sha256hex(token), 'hex');
    const registry = await this.read();
    let matched: string | null = null;
    for (const [agentUserId, agent] of Object.entries(registry.agents)) {
      if (agent.revoked_at) continue;
      for (const stored of agent.token_hashes) {
        const candidate = Buffer.from(stored, 'hex');
        if (candidate.length === presented.length && timingSafeEqual(candidate, presented)) {
          // No early break: keep scanning so timing does not reveal match position.
          matched = agentUserId;
        }
      }
    }
    return matched;
  }
  /** Revoke all active tokens for an agent. Idempotent; returns count revoked. */
  async revoke(agentUserId: string): Promise<number> {
    const registry = await this.read();
    const agent = registry.agents[agentUserId];
    if (!agent) return 0;
    const count = agent.token_hashes.length;
    agent.token_hashes = [];
    agent.revoked_at = new Date().toISOString();
    await this.write(registry);
    return count;
  }
  /** List agents with at least one active token (never advertise revoked/phantom). */
  async list(): Promise<AgentSummary[]> {
    const registry = await this.read();
    const out: AgentSummary[] = [];
    for (const [agentUserId, agent] of Object.entries(registry.agents)) {
      if (agent.revoked_at || agent.token_hashes.length === 0) continue;
      out.push({
        agent_user_id: agentUserId,
        alias: agent.alias,
        host: agent.host,
        ...(agent.display_name !== undefined ? { display_name: agent.display_name } : {}),
        created_at: agent.created_at,
        active_token_count: agent.token_hashes.length,
      });
    }
    return out;
  }
 }
--- a/packages/appservice/src/index.ts
+++ b/packages/appservice/src/index.ts
@@ -10,6 +10,14 @@ export {
  validateProvisionRoom,
 } from './bridge.dto.js';
 export type { BridgeMessageDto, BridgeTypingDto, ProvisionRoomDto } from './bridge.dto.js';
 export { agentSlug, validateRegisterAgent, validateRevokeAgent } from './agent-registry.dto.js';
 export type {
  RegisterAgentDto,
  RevokeAgentDto,
  RegisterAgentResponse,
  AgentSummary,
 } from './agent-registry.dto.js';
 export { AgentTokenStore, AGENTS_ACCOUNT_DATA_TYPE } from './agent-store.js';
 export type {
  AppserviceConfig,
  EventHandler,
--- a/packages/appservice/src/intent.ts
+++ b/packages/appservice/src/intent.ts
@@ -233,4 +233,30 @@ export class AppserviceIntent {
      body: { displayname: displayName },
    });
  }
  /** Read an account_data object on the AS sender user. Returns null when the
   * key has never been written (M_NOT_FOUND), so callers can treat that as an
   * empty store; any other error propagates. */
  async getSenderAccountData(type: string): Promise<Record<string, unknown> | null> {
    const user = encodeURIComponent(this.senderUserId);
    const key = encodeURIComponent(type);
    try {
      return await this.request('GET', `/_matrix/client/v3/user/${user}/account_data/${key}`, {
        userId: this.senderUserId,
      });
    } catch (err) {
      if (err instanceof MatrixApiError && err.errcode === 'M_NOT_FOUND') return null;
      throw err;
    }
  }
  /** Write an account_data object on the AS sender user. */
  async setSenderAccountData(type: string, content: Record<string, unknown>): Promise<void> {
    const user = encodeURIComponent(this.senderUserId);
    const key = encodeURIComponent(type);
    await this.request('PUT', `/_matrix/client/v3/user/${user}/account_data/${key}`, {
      userId: this.senderUserId,
      body: content,
    });
  }
 }
--- a/packages/macp/src/index.ts
+++ b/packages/macp/src/index.ts
@@ -39,6 +39,11 @@ export { normalizeGate, runShell, countAIFindings, runGate, runGates } from './g
 export type { NormalizedGate } from './gate-runner.js';
 // Risk-floor (agent reflection loop — diff review classifier)
 export { evaluateRiskFloor, DEFAULT_RISK_THRESHOLD } from './risk-floor.js';
 export type { ReviewSurface, RiskFloorInput, RiskFloorVerdict } from './risk-floor.js';
 // Event emitter
 export { nowISO, appendEvent, emitEvent } from './event-emitter.js';
--- a/packages/macp/src/risk-floor.spec.ts
+++ b/packages/macp/src/risk-floor.spec.ts
@@ -0,0 +1,87 @@
 import { describe, expect, it } from 'vitest';
 import { DEFAULT_RISK_THRESHOLD, evaluateRiskFloor, type ReviewSurface } from './risk-floor.js';
 describe('evaluateRiskFloor', () => {
  it('returns a no-review "none" verdict for an empty diff', () => {
    const v = evaluateRiskFloor({ filesChanged: [] });
    expect(v).toEqual({
      needs_review: false,
      score: 0,
      surface: 'none',
      reason: 'no files changed',
    });
  });
  it('ignores empty/non-string entries', () => {
    const v = evaluateRiskFloor({ filesChanged: ['', 42 as unknown as string, '   '] });
    // '' and 42 are dropped by evaluateRiskFloor itself; only the whitespace-only
    // path survives, and it classifies to none
    expect(v.reason).toContain('1 changed file');
    expect(v.surface).toBe('none');
    expect(v.needs_review).toBe(false);
  });
  it.each<[string, string, ReviewSurface, boolean]>([
    ['auth', 'apps/api/src/auth/session.guard.ts', 'auth', true],
    ['data', 'packages/db/migrations/0007_add_users.sql', 'data', true],
    ['infra', '.woodpecker/deploy.yml', 'infra', true],
    ['build', 'packages/types/tsconfig.json', 'build', true],
    ['ui', 'apps/web/src/components/Button.tsx', 'ui', false],
    ['test', 'packages/macp/src/risk-floor.spec.ts', 'test', false],
    ['docs', 'docs/plans/agent-reflection-loop-PRD.md', 'docs', false],
    ['none', 'README', 'none', false],
  ])(
    'classifies a single %s file → surface=%s needs_review=%s',
    (_label, file, surface, needsReview) => {
      const v = evaluateRiskFloor({ filesChanged: [file] });
      expect(v.surface).toBe(surface);
      expect(v.needs_review).toBe(needsReview);
      expect(v.reason).toContain(surface === 'none' ? 'no sensitive surface' : surface);
    },
  );
  it('lets the highest-risk surface dominate a mixed diff', () => {
    const v = evaluateRiskFloor({
      filesChanged: [
        'docs/readme.md',
        'apps/web/src/components/Nav.tsx',
        'apps/api/src/auth/token.service.ts',
      ],
    });
    expect(v.surface).toBe('auth');
    expect(v.score).toBe(1.0);
    expect(v.needs_review).toBe(true);
    expect(v.reason).toContain('token.service.ts');
    expect(v.reason).not.toContain('readme.md');
  });
  it('names every file that ties at the dominant surface', () => {
    const v = evaluateRiskFloor({
      filesChanged: ['src/login.ts', 'src/permission-check.ts'],
    });
    expect(v.surface).toBe('auth');
    expect(v.reason).toContain('src/login.ts');
    expect(v.reason).toContain('src/permission-check.ts');
  });
  it('treats docs+test-only diffs as below the floor', () => {
    const v = evaluateRiskFloor({
      filesChanged: ['docs/guide.md', 'packages/x/src/x.test.ts'],
    });
    expect(v.needs_review).toBe(false);
    expect(v.surface).toBe('test'); // higher weight than docs
  });
  it('honors a custom threshold', () => {
    const docsOnly = { filesChanged: ['docs/guide.md'] };
    expect(evaluateRiskFloor(docsOnly, 0.05).needs_review).toBe(true);
    expect(evaluateRiskFloor(docsOnly, DEFAULT_RISK_THRESHOLD).needs_review).toBe(false);
  });
  it('is deterministic across call order', () => {
    const a = evaluateRiskFloor({ filesChanged: ['a.md', 'auth/x.ts', 'b.tsx'] });
    const b = evaluateRiskFloor({ filesChanged: ['b.tsx', 'a.md', 'auth/x.ts'] });
    expect(a).toEqual(b);
  });
 });
--- a/packages/macp/src/risk-floor.ts
+++ b/packages/macp/src/risk-floor.ts
@@ -0,0 +1,138 @@
 /**
 * Diff risk-floor — deterministic review-need classifier.
 *
 * Given the set of changed files in a diff, derive a *minimum* review
 * requirement ("floor") from the change surface. This is the mechanical half
 * of the agent reflection loop (design §6): risky surfaces (auth, data, infra)
 * trip a review requirement regardless of what the agent self-reports.
 *
 * Precedence (authoritative ordering, see design §5):
 *   CI/tests  >  human merge  >  reviewer verdict  >  self-reflection
 * This module sits at the *floor*. It NEVER overrides CI or a human; a
 * `needs_review: false` verdict means "no surface tripped the floor", not
 * "safe to merge". Consumers MUST keep CI/tests authoritative above it.
 *
 * Pure and deterministic: no IO, no clock, no randomness. Same input → same
 * verdict. Safe to call from a Stop hook via `node -e` or to port inline.
 */
 /** Review surfaces, ordered most- to least-sensitive. */
 export type ReviewSurface = 'auth' | 'data' | 'infra' | 'build' | 'ui' | 'test' | 'docs' | 'none';
 export interface RiskFloorInput {
  /** Paths of changed files, repo-relative. Order-insensitive. */
  filesChanged: string[];
  /** Optional diff size signals; reserved for future weighting. */
  insertions?: number;
  deletions?: number;
 }
 export interface RiskFloorVerdict {
  /** True when the change surface meets/exceeds the review threshold. */
  needs_review: boolean;
  /** Aggregate risk score in [0, 1] — the max surface weight across files. */
  score: number;
  /** The dominant (highest-weight) surface across all changed files. */
  surface: ReviewSurface;
  /** Human-readable explanation naming the surface and tripping files. */
  reason: string;
 }
 /** Default review threshold; `score >= THRESHOLD` ⇒ `needs_review`. */
 export const DEFAULT_RISK_THRESHOLD = 0.5;
 interface SurfaceRule {
  surface: ReviewSurface;
  weight: number;
  /** Regex matched against the file path; only the auth/data/infra/build rules are case-insensitive. */
  pattern: RegExp;
 }
 /**
 * Surface classification rules, evaluated highest-weight first. The first
 * rule whose pattern matches a path classifies that file; the file's surface
 * is the highest-risk surface it matches (rules are pre-sorted by weight).
 */
 const SURFACE_RULES: readonly SurfaceRule[] = [
  {
    surface: 'auth',
    weight: 1.0,
    pattern: /auth|login|session|token|permission|rbac|credential|secret/i,
  },
  {
    surface: 'data',
    weight: 0.9,
    pattern: /migration|prisma|schema|\.sql|entity|repository|seed/i,
  },
  {
    surface: 'infra',
    weight: 0.85,
    pattern: /docker|\.woodpecker|compose|traefik|deploy|helm|k8s|terraform/i,
  },
  {
    surface: 'build',
    weight: 0.6,
    pattern: /package\.json|tsconfig|turbo\.json|pnpm-|\.config\.|eslint|vite/i,
  },
  { surface: 'ui', weight: 0.4, pattern: /\.tsx|\.css|components\/|apps\/web\// },
  { surface: 'test', weight: 0.2, pattern: /\.spec\.|\.test\.|__tests__\// },
  { surface: 'docs', weight: 0.1, pattern: /\.md$|docs\// },
 ];
 const NONE_WEIGHT = 0.0;
 /** Classify a single path to its highest-risk surface and weight. */
 function classify(path: string): { surface: ReviewSurface; weight: number } {
  for (const rule of SURFACE_RULES) {
    if (rule.pattern.test(path)) {
      return { surface: rule.surface, weight: rule.weight };
    }
  }
  return { surface: 'none', weight: NONE_WEIGHT };
 }
 /**
 * Evaluate the review risk-floor for a diff.
 *
 * @param input         changed files (+ optional size signals)
 * @param threshold     review cutoff; defaults to {@link DEFAULT_RISK_THRESHOLD}
 */
 export function evaluateRiskFloor(
  input: RiskFloorInput,
  threshold: number = DEFAULT_RISK_THRESHOLD,
 ): RiskFloorVerdict {
  const files = (input.filesChanged ?? []).filter((f) => typeof f === 'string' && f.length > 0);
  if (files.length === 0) {
    return {
      needs_review: false,
      score: 0,
      surface: 'none',
      reason: 'no files changed',
    };
  }
  let topSurface: ReviewSurface = 'none';
  let topWeight = NONE_WEIGHT;
  const tripping: string[] = [];
  for (const file of files) {
    const { surface, weight } = classify(file);
    if (weight > topWeight) {
      topWeight = weight;
      topSurface = surface;
      tripping.length = 0;
      tripping.push(file);
    } else if (weight === topWeight && surface === topSurface && surface !== 'none') {
      tripping.push(file);
    }
  }
  const needs_review = topWeight >= threshold;
  const reason =
    topSurface === 'none'
      ? `no sensitive surface in ${files.length} changed file(s)`
      : `${topSurface} surface (weight ${topWeight}) in: ${tripping.join(', ')}`;
  return { needs_review, score: topWeight, surface: topSurface, reason };
 }
--- a/packages/macp/src/schemas/reflection.v1.schema.json
+++ b/packages/macp/src/schemas/reflection.v1.schema.json
@@ -0,0 +1,105 @@
 {
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://mosaicstack.dev/schemas/reflection/reflection.v1.schema.json",
  "title": "Agent Reflection (v1)",
  "description": "End-of-run reflection sidecar. Mechanical fields are written by the Stop hook; self-reported fields are merged from an optional agent-supplied input and are null when absent (provenance.degraded=true).",
  "type": "object",
  "required": [
    "schema",
    "task_ref",
    "agent",
    "session_id",
    "timestamp",
    "repo",
    "risk",
    "files_changed",
    "provenance"
  ],
  "properties": {
    "schema": {
      "const": "reflection.v1"
    },
    "task_ref": {
      "type": "string",
      "description": "Canonical task ref; derived from REFLECTION_TASK_REF or repo+branch."
    },
    "agent": {
      "type": "string",
      "description": "Persona/runtime id (REFLECTION_AGENT or 'unknown')."
    },
    "session_id": {
      "type": "string",
      "description": "From the Stop payload session_id, else 'unknown'."
    },
    "timestamp": {
      "type": "string",
      "format": "date-time",
      "description": "ISO-8601 UTC capture time."
    },
    "repo": {
      "type": "string",
      "description": "Repo root basename."
    },
    "confidence": {
      "type": ["number", "null"],
      "minimum": 0,
      "maximum": 1,
      "description": "SELF-REPORTED. Agent's overall confidence; null when not supplied."
    },
    "most_likely_wrong": {
      "type": ["object", "null"],
      "description": "SELF-REPORTED. The single most-likely way the work is wrong.",
      "required": ["surface", "description"],
      "properties": {
        "surface": { "$ref": "#/$defs/surface" },
        "description": { "type": "string" }
      },
      "additionalProperties": false
    },
    "known_not_in_diff": {
      "type": ["string", "null"],
      "description": "SELF-REPORTED. What the agent knows that isn't visible in the diff."
    },
    "risk": {
      "type": "object",
      "description": "MECHANICAL. Output of the diff risk-floor.",
      "required": ["needs_review", "score", "surface", "reason"],
      "properties": {
        "needs_review": { "type": "boolean" },
        "score": { "type": "number", "minimum": 0, "maximum": 1 },
        "surface": { "$ref": "#/$defs/surface" },
        "reason": { "type": "string" }
      },
      "additionalProperties": false
    },
    "files_changed": {
      "type": "array",
      "items": { "type": "string" },
      "description": "MECHANICAL. git diff name-only."
    },
    "provenance": {
      "type": "object",
      "required": ["source", "reflection_attempt", "degraded", "reflection_mode"],
      "properties": {
        "source": { "const": "stop-hook" },
        "reflection_attempt": { "type": "integer", "minimum": 1 },
        "degraded": {
          "type": "boolean",
          "description": "True when self-report inputs were missing/unreadable."
        },
        "reflection_mode": {
          "type": "string",
          "enum": ["off", "solo", "orchestrated"]
        }
      },
      "additionalProperties": false
    }
  },
  "additionalProperties": false,
  "$defs": {
    "surface": {
      "type": "string",
      "enum": ["auth", "data", "infra", "build", "ui", "test", "docs", "none"]
    }
  }
 }
--- a/packages/mosaic/framework/runtime/claude/settings.json
+++ b/packages/mosaic/framework/runtime/claude/settings.json
@@ -34,6 +34,17 @@
          }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "~/.config/mosaic/tools/qa/reflect-stop-hook.sh",
            "timeout": 15
          }
        ]
      }
    ]
  },
  "enabledPlugins": {
--- a/packages/mosaic/framework/tools/_lib/credentials.sh
+++ b/packages/mosaic/framework/tools/_lib/credentials.sh
@@ -16,7 +16,12 @@
 # After loading, service-specific env vars are exported.
 # Run `load_credentials --help` for details.
-MOSAIC_CREDENTIALS_FILE="${MOSAIC_CREDENTIALS_FILE:-$HOME/src/jarvis-brain/credentials.json}"
+if [[ -z "${MOSAIC_CREDENTIALS_FILE:-}" ]]; then
  for _cand in "$HOME/.config/mosaic/credentials.json" "$HOME/src/jarvis-brain/credentials.json"; do
    if [[ -f "$_cand" ]]; then MOSAIC_CREDENTIALS_FILE="$_cand"; break; fi
  done
  : "${MOSAIC_CREDENTIALS_FILE:=$HOME/src/jarvis-brain/credentials.json}"
 fi
 _mosaic_require_jq() {
  if ! command -v jq &>/dev/null; then
@@ -34,6 +39,19 @@ _mosaic_read_cred() {
  jq -r "$jq_path // empty" "$MOSAIC_CREDENTIALS_FILE"
 }
 # Decide curl TLS flag for a target URL: validate public hosts (MITM matters on
 # WAN); allow self-signed only for private-network IP literals (trusted LAN) or an
 # explicit $MOSAIC_INSECURE_TLS opt-in. Echoes "-k" or "" (empty).
 _mosaic_tls_opt() {
  local url="$1" host
  [[ -n "${MOSAIC_INSECURE_TLS:-}" ]] && { echo "-k"; return; }
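  # Strip the scheme and any userinfo, then keep the host up to the first ':' or '/'.
  host=$(printf '%s' "$url" | sed -E 's#^[a-zA-Z][a-zA-Z0-9+.-]*://([^/@]*@)?([^/:]+).*#\2#')
  # Match complete IPv4 literals only, so hostnames like "10.example.com" stay validated.
  if [[ "$host" =~ ^((10|127)(\.[0-9]{1,3}){3}|192\.168(\.[0-9]{1,3}){2}|172\.(1[6-9]|2[0-9]|3[01])(\.[0-9]{1,3}){2})$ ]]; then
    echo "-k"; return
  fi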
  echo ""
 }
 # Sync Woodpecker credentials to ~/.woodpecker/<instance>.env
 # Only writes when values differ to avoid unnecessary disk writes.
 _mosaic_sync_woodpecker_env() {
@@ -261,7 +279,8 @@ mosaic_http() {
  local base_url="${4:-}"
  local response
-  response=$(curl -sk -w "\n%{http_code}" -X "$method" \
+  local _tls; _tls=$(_mosaic_tls_opt "${base_url}${endpoint}")
  response=$(curl -sS $_tls -w "\n%{http_code}" -X "$method" \
    -H "$auth_header" \
    -H "Content-Type: application/json" \
    "${base_url}${endpoint}")
@@ -279,7 +298,8 @@ mosaic_http_post() {
  local base_url="${4:-}"
  local response
-  response=$(curl -sk -w "\n%{http_code}" -X POST \
+  local _tls; _tls=$(_mosaic_tls_opt "${base_url}${endpoint}")
  response=$(curl -sS $_tls -w "\n%{http_code}" -X POST \
    -H "$auth_header" \
    -H "Content-Type: application/json" \
    -d "$data" \
@@ -297,7 +317,8 @@ mosaic_http_patch() {
  local base_url="${4:-}"
  local response
-  response=$(curl -sk -w "\n%{http_code}" -X PATCH \
+  local _tls; _tls=$(_mosaic_tls_opt "${base_url}${endpoint}")
  response=$(curl -sS $_tls -w "\n%{http_code}" -X PATCH \
    -H "$auth_header" \
    -H "Content-Type: application/json" \
    -d "$data" \
--- a/packages/mosaic/framework/tools/git/pr-ci-wait.sh
+++ b/packages/mosaic/framework/tools/git/pr-ci-wait.sh
@@ -72,6 +72,11 @@ elif values and all(v == "success" for v in values):
    print("success")
 elif any(v in {"pending", "running", "queued", "waiting"} for v in values):
    print("pending")
 elif not values and not state:
    # No pipeline/status of any kind reported for this commit. Distinct from
    # "unknown" (an ambiguous/unrecognized status that should keep polling):
    # this signals a repo/commit that simply has no CI configured.
    print("no-status")
 else:
    print("unknown")
 PY
@@ -142,6 +147,21 @@ gitea_get_commit_status_json() {
    curl -fsSL -H "User-Agent: curl/8" -H "Authorization: token ${token}" "$url"
 }
 gitea_get_default_branch() {
    local host="$1"
    local repo="$2"
    local token="$3"
    local url="https://${host}/api/v1/repos/${repo}"
    curl -fsSL -H "User-Agent: curl/8" -H "Authorization: token ${token}" "$url" | python3 -c '
 import json, sys
 print((json.load(sys.stdin) or {}).get("default_branch", ""))
 '
 }
 github_get_default_branch() {
    gh api "repos/${OWNER}/${REPO}" --jq '.default_branch'
 }
 while [[ $# -gt 0 ]]; do
    case "$1" in
        -n|--number)
@@ -245,6 +265,51 @@ else
    exit 1
 fi
 # No-CI determination is TWO-TIER (primary: CI history; secondary: empty-poll streak).
 #
 # PRIMARY — "does this repo run CI at all?" Probed once, up front, from the DEFAULT
 # BRANCH's commit status. A repo whose default branch carries CI statuses
 # demonstrably runs CI, so an EMPTY status on the PR head means the pipeline simply
 # has not registered YET (webhook/queue lag) — NOT that the repo is CI-less. In that
 # case we must NEVER fast-green; we keep polling until the pipeline registers or the
 # timeout fires (both safe). This closes the webhook-lag false-green: a slow-to-
 # register pipeline feeding a merge gate can no longer be mistaken for "no CI".
 #
 # SECONDARY — the empty-poll streak below applies ONLY to genuinely CI-less repos
 # (default branch also has no CI history, e.g. device-imaging class), where burning
 # the full timeout would be pure waste. There, NO_CI_MAX empty polls => fast-exit 0.
 #
 # Probe failure is treated conservatively as REPO_HAS_CI=1 (assume CI present): we
 # would rather wait-then-timeout than risk a false-green, per the merge-gate priority.
 REPO_HAS_CI=1
 detect_repo_ci() {
    local def_branch def_status
    # Every early exit returns 0: a probe miss must leave the conservative
    # REPO_HAS_CI=1 default in place, never abort the caller under `set -e`.
    if [[ "$PLATFORM" == "github" ]]; then
        def_branch=$(github_get_default_branch 2>/dev/null) || {
            echo "[pr-ci-wait] WARN: default-branch probe failed; assuming CI-enabled (will not fast-green on empty status)."; return 0; }
        [[ -n "$def_branch" ]] || return 0
        def_status=$(github_get_commit_status_json "$OWNER" "$REPO" "$def_branch" 2>/dev/null | extract_state_from_status_json) || return 0
    else
        def_branch=$(gitea_get_default_branch "$HOST" "$OWNER/$REPO" "$TOKEN" 2>/dev/null) || {
            echo "[pr-ci-wait] WARN: default-branch probe failed; assuming CI-enabled (will not fast-green on empty status)."; return 0; }
        [[ -n "$def_branch" ]] || return 0
        def_status=$(gitea_get_commit_status_json "$HOST" "$OWNER/$REPO" "$TOKEN" "$def_branch" 2>/dev/null | extract_state_from_status_json) || return 0
    fi
    if [[ "$def_status" == "no-status" || -z "$def_status" ]]; then
        REPO_HAS_CI=0
        echo "[pr-ci-wait] default branch '${def_branch}' has no CI status history — treating repo as CI-less (empty-poll fast-exit enabled)."
    else
        REPO_HAS_CI=1
        echo "[pr-ci-wait] default branch '${def_branch}' has CI history (state=${def_status}) — repo runs CI; empty status on PR head => awaiting registration, will not fast-green."
    fi
 }
 detect_repo_ci || true
 NO_CI_STREAK=0
 NO_CI_MAX=3
 while true; do
    NOW_TS=$(date +%s)
    if (( NOW_TS > DEADLINE_TS )); then
@@ -272,11 +337,35 @@ while true; do
            echo "Error: CI reported ${STATE} for PR #$PR_NUMBER." >&2
            exit 1
            ;;
        no-status)
            if [[ "$REPO_HAS_CI" == "1" ]]; then
                # PRIMARY tier: repo demonstrably runs CI but this commit's pipeline
                # has not registered yet (webhook/queue lag). Do NOT fast-green — keep
                # polling until it registers or the timeout fires. Reset the streak so
                # a later genuine CI-less misread can't accumulate across this state.
                NO_CI_STREAK=0
                echo "[pr-ci-wait] empty status on PR head but repo runs CI — awaiting pipeline registration (webhook lag), not fast-greening."
            else
                # SECONDARY tier: genuinely CI-less repo (default branch has no CI
                # history either). Empty polls => fast-exit green after NO_CI_MAX.
                NO_CI_STREAK=$((NO_CI_STREAK + 1))
                if (( NO_CI_STREAK >= NO_CI_MAX )); then
                    echo "[INFO] no CI configured for this repo/commit (PR #$PR_NUMBER, ${NO_CI_STREAK} consecutive empty polls, default branch also CI-less); treating as green."
                    exit 0
                fi
            fi
            sleep "$INTERVAL_SEC"
            ;;
        pending|unknown)
            # A pipeline exists but hasn't reached a terminal state (or is
            # transiently ambiguous) — keep waiting, and reset the no-CI streak
            # since this commit is not in the "no CI at all" condition.
            NO_CI_STREAK=0
            sleep "$INTERVAL_SEC"
            ;;
        *)
            echo "[pr-ci-wait] Unrecognized state '${STATE}', continuing to poll..."
            NO_CI_STREAK=0
            sleep "$INTERVAL_SEC"
            ;;
    esac
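The two-tier handling of an empty status can be read as a small pure function. The sketch below is a simplified model of the `no-status` branch above, not code from `pr-ci-wait.sh`; the function name and its output format are assumptions made for illustration.

```bash
# Hypothetical model of the no-status branch.
# Args: repo_has_ci (0|1), current empty-poll streak, NO_CI_MAX.
# Prints "<action> <new_streak>", where action is "wait" or "green".
# The body runs in a subshell "( ... )" so its variables stay local.
no_status_decision() (
    repo_has_ci="$1"
    streak="$2"
    max="$3"
    if [ "$repo_has_ci" = "1" ]; then
        # Primary tier: the repo runs CI, so an empty status means the
        # pipeline has not registered yet. Keep waiting, reset the streak.
        echo "wait 0"
    else
        # Secondary tier: CI-less repo, count consecutive empty polls.
        streak=$((streak + 1))
        if [ "$streak" -ge "$max" ]; then
            echo "green $streak"
        else
            echo "wait $streak"
        fi
    fi
)

no_status_decision 1 2 3   # repo with CI history: never fast-greens
no_status_decision 0 2 3   # CI-less repo, third empty poll: fast-exit
```

With `repo_has_ci=1` the function can only ever answer `wait`, which is the property that closes the webhook-lag false-green: the only exits left for such a repo are a registered pipeline or the timeout.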
--- a/packages/mosaic/framework/tools/qa/reflect-stop-hook.sh
+++ b/packages/mosaic/framework/tools/qa/reflect-stop-hook.sh
@@ -0,0 +1,197 @@
 #!/usr/bin/env bash
 # reflect-stop-hook.sh — Stop hook (agent reflection loop, durable kernel)
 #
 # At end-of-run, capture the doer's end-state as a structured `reflection.v1`
 # sidecar: the mechanical diff risk-floor plus any self-report the agent left
 # behind. This is the passive capture half of the design (§10 step 1). It does
 # NOT route, score, or gate — it only writes the sidecar; pickup is future work.
 #
 # FAIL-CLOSED: if REFLECTION_MODE is unset or "off", this is a strict no-op.
 # Global registration is therefore safe; the feature only activates when a
 # launcher/profile explicitly sets REFLECTION_MODE=solo|orchestrated.
 #
 # NON-BLOCKING: Stop hooks are observational. This script NEVER emits a
 # `decision` field and ALWAYS exits 0 — it can never fail or stall a session.
 #
 # Environment contract:
 #   REFLECTION_MODE            off|solo|orchestrated   (default: off → no-op)
 #   REFLECTION_DIR             output dir              (default: <repo>/.mosaic/reflections)
 #   REFLECTION_INPUT           self-report JSON        (default: <repo>/.mosaic/reflection-input.json)
 #   REFLECTION_TASK_REF        canonical task ref      (default: <repo>#<branch>)
 #   REFLECTION_AGENT           persona/runtime id      (default: unknown)
 #   REFLECTION_RISK_THRESHOLD  review cutoff [0,1]     (default: 0.5)
 #
 # Risk-floor surface table is kept in sync with the authoritative TS
 # implementation at packages/macp/src/risk-floor.ts (evaluateRiskFloor).
 #
 # Exit codes: always 0 (observational hook).
 set -euo pipefail
 # ---- fail-closed gate -------------------------------------------------------
 MODE="${REFLECTION_MODE:-off}"
 if [[ "$MODE" != "solo" && "$MODE" != "orchestrated" ]]; then
  exit 0
 fi
 # Read the Stop payload (best-effort; never required).
 INPUT="$(cat || true)"
 # Sentinel lock path (global so the EXIT trap can clean it after main returns).
 LOCKFILE=""
 trap 'rm -f "${LOCKFILE:-}" 2>/dev/null || true' EXIT
 main() {
  command -v jq >/dev/null 2>&1 || return 0   # no jq → silently no-op
  local session_id payload_cwd repo_dir repo_name branch task_ref agent
  session_id="$(printf '%s' "$INPUT" | jq -r '.session_id // "unknown"' 2>/dev/null || echo unknown)"
  # Sanitize: session_id is interpolated into file/lock paths — allow safe
  # filename chars only (defends against ../ or / in the payload).
  session_id="${session_id//[^a-zA-Z0-9_-]/}"
  session_id="${session_id:-unknown}"
  payload_cwd="$(printf '%s' "$INPUT" | jq -r '.cwd // empty' 2>/dev/null || true)"
  # Resolve repo root: prefer git toplevel from the payload cwd, else PWD.
  local start_dir="${payload_cwd:-$PWD}"
  repo_dir="$(git -C "$start_dir" rev-parse --show-toplevel 2>/dev/null || echo "$start_dir")"
  repo_name="$(basename "$repo_dir")"
  branch="$(git -C "$repo_dir" rev-parse --abbrev-ref HEAD 2>/dev/null || echo detached)"
  task_ref="${REFLECTION_TASK_REF:-${repo_name}#${branch}}"
  agent="${REFLECTION_AGENT:-unknown}"
  # ---- sentinel guard: avoid re-fire loops --------------------------------
  local out_dir lock
  out_dir="${REFLECTION_DIR:-${repo_dir}/.mosaic/reflections}"
  mkdir -p "$out_dir" 2>/dev/null || return 0
  lock="${out_dir}/.${session_id}.lock"
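  # Create the lock atomically (noclobber) so the existence check and the
  # creation cannot race between two hooks firing for the same session.
  if ! ( set -o noclobber; : > "$lock" ) 2>/dev/null; then
    return 0
  fi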
  LOCKFILE="$lock"
  # ---- mechanical: changed files ------------------------------------------
  # Changes already committed (HEAD~..HEAD) are out of scope; capture only the
  # working surface: staged + unstaged + untracked, best-effort.
  # Exclude .mosaic/ (agent scratch: reflections, locks, self-report input) —
  # it is tooling state, not part of the diff under review.
  local files
  files="$(
    {
      git -C "$repo_dir" diff --name-only HEAD 2>/dev/null || true
      git -C "$repo_dir" diff --name-only --staged 2>/dev/null || true
      git -C "$repo_dir" ls-files --others --exclude-standard 2>/dev/null || true
    } | sed '/^$/d' | grep -v '^\.mosaic/' | sort -u || true
  )"
  # ---- mechanical: risk-floor (inline port of evaluateRiskFloor) ----------
  local threshold="${REFLECTION_RISK_THRESHOLD:-0.5}"
  local top_surface="none" top_weight="0.0" tripping=""
  local f surface weight
  while IFS= read -r f; do
    [[ -z "$f" ]] && continue
    surface="$(classify_surface "$f")"
    weight="$(surface_weight "$surface")"
    if awk "BEGIN{exit !($weight > $top_weight)}"; then
      top_weight="$weight"; top_surface="$surface"; tripping="$f"
    elif [[ "$surface" == "$top_surface" && "$surface" != "none" ]] && awk "BEGIN{exit !($weight == $top_weight)}"; then
      tripping="${tripping:+$tripping, }$f"
    fi
  done <<< "$files"
  local needs_review reason file_count
  file_count="$(printf '%s\n' "$files" | sed '/^$/d' | wc -l | tr -d ' ')"
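  # Pass values with -v so a malformed REFLECTION_RISK_THRESHOLD is compared
  # as data and never parsed as awk program text.
  if awk -v w="$top_weight" -v t="$threshold" 'BEGIN{exit !((w + 0) >= (t + 0))}'; then needs_review=true; else needs_review=false; fi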
  if [[ "$top_surface" == "none" ]]; then
    if [[ "$file_count" -eq 0 ]]; then reason="no files changed"; else reason="no sensitive surface in ${file_count} changed file(s)"; fi
  else
    reason="${top_surface} surface (weight ${top_weight}) in: ${tripping}"
  fi
  # ---- self-report merge (optional) ---------------------------------------
  local input_file degraded self_json
  input_file="${REFLECTION_INPUT:-${repo_dir}/.mosaic/reflection-input.json}"
  degraded=true
  self_json='{"confidence":null,"most_likely_wrong":null,"known_not_in_diff":null}'
  if [[ -r "$input_file" ]] && jq -e . "$input_file" >/dev/null 2>&1; then
    self_json="$(jq '{
      confidence: (.confidence // null),
      most_likely_wrong: (.most_likely_wrong // null),
      known_not_in_diff: (.known_not_in_diff // null)
    }' "$input_file" 2>/dev/null || echo "$self_json")"
    degraded=false
  fi
  # ---- assemble + atomic write --------------------------------------------
  local ts files_json record tmp final
  ts="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
  files_json="$(printf '%s\n' "$files" | jq -R . | jq -s 'map(select(length>0))')"
  record="$(jq -n \
    --arg task_ref "$task_ref" \
    --arg agent "$agent" \
    --arg session_id "$session_id" \
    --arg ts "$ts" \
    --arg repo "$repo_name" \
    --argjson needs_review "$needs_review" \
    --argjson score "$top_weight" \
    --arg surface "$top_surface" \
    --arg reason "$reason" \
    --argjson files "$files_json" \
    --argjson self "$self_json" \
    --argjson degraded "$degraded" \
    --arg mode "$MODE" \
    '{
      schema: "reflection.v1",
      task_ref: $task_ref,
      agent: $agent,
      session_id: $session_id,
      timestamp: $ts,
      repo: $repo,
      confidence: $self.confidence,
      most_likely_wrong: $self.most_likely_wrong,
      known_not_in_diff: $self.known_not_in_diff,
      risk: { needs_review: $needs_review, score: $score, surface: $surface, reason: $reason },
      files_changed: $files,
      provenance: { source: "stop-hook", reflection_attempt: 1, degraded: $degraded, reflection_mode: $mode }
    }' 2>/dev/null || true)"
  [[ -z "$record" ]] && return 0
  final="${out_dir}/${session_id}-${ts//[:]/}.reflection.json"
  tmp="${final}.tmp"
  printf '%s\n' "$record" > "$tmp" 2>/dev/null || return 0
  mv -f "$tmp" "$final" 2>/dev/null || true
 }
 # classify_surface PATH → surface name (highest-risk match wins, mirrors TS)
 classify_surface() {
  local p="$1"
  if printf '%s' "$p" | grep -qiE 'auth|login|session|token|permission|rbac|credential|secret'; then echo auth; return; fi
  if printf '%s' "$p" | grep -qiE 'migration|prisma|schema|\.sql|entity|repository|seed'; then echo data; return; fi
  if printf '%s' "$p" | grep -qiE 'docker|\.woodpecker|compose|traefik|deploy|helm|k8s|terraform'; then echo infra; return; fi
  if printf '%s' "$p" | grep -qiE 'package\.json|tsconfig|turbo\.json|pnpm-|\.config\.|eslint|vite'; then echo build; return; fi
  if printf '%s' "$p" | grep -qE '\.tsx|\.css|components/|apps/web/'; then echo ui; return; fi
  if printf '%s' "$p" | grep -qE '\.spec\.|\.test\.|__tests__/'; then echo test; return; fi
  if printf '%s' "$p" | grep -qE '\.md$|docs/'; then echo docs; return; fi
  echo none
 }
 # surface_weight SURFACE → numeric weight (mirrors TS SURFACE_RULES)
 surface_weight() {
  case "$1" in
    auth) echo 1.0 ;;
    data) echo 0.9 ;;
    infra) echo 0.85 ;;
    build) echo 0.6 ;;
    ui) echo 0.4 ;;
    test) echo 0.2 ;;
    docs) echo 0.1 ;;
    *) echo 0.0 ;;
  esac
 }
 main || true
 exit 0
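The hook compares decimal weights through `awk` because shell arithmetic is integer-only. A minimal sketch of that comparison in isolation follows; the helper names `float_ge` and `needs_review` are hypothetical and do not appear in the hook. Values are passed with `-v` and coerced with `+ 0`, so a non-numeric threshold is treated as 0 rather than being spliced into the awk program.

```bash
# Hypothetical helper: succeed (exit 0) when A >= B, compared numerically.
float_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !((a + 0) >= (b + 0)) }'
}

# Map a top surface weight and a threshold to the needs_review flag.
needs_review() {
    if float_ge "$1" "$2"; then
        echo true
    else
        echo false
    fi
}

needs_review 0.85 0.5   # infra-weight change against the default threshold
needs_review 0.4 0.5    # ui-weight change against the default threshold
```

One consequence of the `+ 0` coercion in this sketch is that a garbage threshold errs toward review, since every weight is at least 0.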
--- a/packages/mosaic/framework/tools/woodpecker/_lib.sh
+++ b/packages/mosaic/framework/tools/woodpecker/_lib.sh
@@ -12,7 +12,7 @@ wp_resolve_repo_id() {
  local full_name="$1"
  local response http_code body repo_id
-  response=$(curl -sk -w "\n%{http_code}" \
+  response=$(curl -sS -w "\n%{http_code}" \
    -H "Authorization: Bearer $WOODPECKER_TOKEN" \
    "${WOODPECKER_URL}/api/repos/lookup/${full_name}")
--- a/packages/mosaic/framework/tools/woodpecker/pipeline-list.sh
+++ b/packages/mosaic/framework/tools/woodpecker/pipeline-list.sh
@@ -48,7 +48,7 @@ fi
 # Resolve owner/repo to numeric ID (Woodpecker v3 API)
 REPO_ID=$(wp_resolve_repo_id "$REPO") || exit 1
-response=$(curl -sk -w "\n%{http_code}" \
+response=$(curl -sS -w "\n%{http_code}" \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "${WOODPECKER_URL}/api/repos/${REPO_ID}/pipelines?perPage=${LIMIT}")
--- a/packages/mosaic/framework/tools/woodpecker/pipeline-status.sh
+++ b/packages/mosaic/framework/tools/woodpecker/pipeline-status.sh
@@ -50,7 +50,7 @@ REPO_ID=$(wp_resolve_repo_id "$REPO") || exit 1
 _wp_fetch() {
  local ep="$1"
  local resp http_code body
-  resp=$(curl -sk -w "\n%{http_code}" \
+  resp=$(curl -sS -w "\n%{http_code}" \
    -H "Authorization: Bearer $WOODPECKER_TOKEN" \
    "$ep")
  http_code=$(echo "$resp" | tail -n1)
--- a/packages/mosaic/framework/tools/woodpecker/pipeline-trigger.sh
+++ b/packages/mosaic/framework/tools/woodpecker/pipeline-trigger.sh
@@ -46,7 +46,7 @@ REPO_ID=$(wp_resolve_repo_id "$REPO") || exit 1
 echo "Triggering pipeline for $REPO on branch $BRANCH..."
-response=$(curl -sk -w "\n%{http_code}" -X POST \
+response=$(curl -sS -w "\n%{http_code}" -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg b "$BRANCH" '{branch: $b}')" \
--- a/packages/types/src/index.ts
+++ b/packages/types/src/index.ts
@@ -6,3 +6,4 @@ export * from './provider/index.js';
 export * from './routing/index.js';
 export * from './commands/index.js';
 export * from './federation/index.js';
 export * from './reflection/index.js';
--- a/packages/types/src/reflection/tests/reflection.spec.ts
+++ b/packages/types/src/reflection/tests/reflection.spec.ts
@@ -0,0 +1,146 @@
 /**
 * Unit tests for the reflection.v1 schema + self-report boundary.
 *
 * The runtime source of truth is the zod schema set in `reflection.ts`. The
 * class-validator `ReflectionSelfReportDto` is the NestJS-side boundary type
 * (exercised under the gateway app's reflect-metadata runtime, mirroring how
 * `chat.dto.ts` is tested in apps/gateway); here we validate the self-report
 * input with its zod counterpart, which is what the Stop hook actually uses.
 *
 * Coverage:
 *  - REVIEW_SURFACES canonical ordering (the enum both zod + JSON Schema mirror)
 *  - ReflectionV1Schema accepts a fully-populated record
 *  - ReflectionV1Schema accepts a degraded record (self-report fields null)
 *  - ReflectionV1Schema rejects bad schema literal / out-of-range confidence / bad surface
 *  - ReflectionSelfReportSchema accepts valid + empty, rejects bad input
 */
 import { describe, expect, it } from 'vitest';
 import {
  REVIEW_SURFACES,
  ReflectionV1Schema,
  ReflectionSelfReportSchema,
  type ReflectionV1,
 } from '../index.js';
 const baseMechanical = {
  schema: 'reflection.v1' as const,
  task_ref: 'stack#544',
  agent: 'claude',
  session_id: 'sess-abc',
  timestamp: '2026-06-16T00:00:00.000Z',
  repo: 'stack',
  risk: {
    needs_review: true,
    score: 1.0,
    surface: 'auth' as const,
    reason: 'auth surface (weight 1) in: src/auth.ts',
  },
  files_changed: ['src/auth.ts'],
  provenance: {
    source: 'stop-hook' as const,
    reflection_attempt: 1,
    degraded: false,
    reflection_mode: 'solo' as const,
  },
 };
 describe('REVIEW_SURFACES', () => {
  it('keeps the canonical most→least-sensitive ordering', () => {
    expect(REVIEW_SURFACES).toEqual([
      'auth',
      'data',
      'infra',
      'build',
      'ui',
      'test',
      'docs',
      'none',
    ]);
  });
 });
 describe('ReflectionV1Schema', () => {
  it('accepts a fully-populated record', () => {
    const rec: ReflectionV1 = {
      ...baseMechanical,
      confidence: 0.7,
      most_likely_wrong: { surface: 'auth', description: 'token refresh untested' },
      known_not_in_diff: 'manual QA only on the happy path',
    };
    expect(() => ReflectionV1Schema.parse(rec)).not.toThrow();
  });
  it('accepts a degraded record with null self-report fields', () => {
    const rec: ReflectionV1 = {
      ...baseMechanical,
      confidence: null,
      most_likely_wrong: null,
      known_not_in_diff: null,
      provenance: { ...baseMechanical.provenance, degraded: true },
    };
    expect(() => ReflectionV1Schema.parse(rec)).not.toThrow();
  });
  it('rejects a wrong schema literal', () => {
    const bad = {
      ...baseMechanical,
      schema: 'reflection.v2',
      confidence: null,
      most_likely_wrong: null,
      known_not_in_diff: null,
    };
    expect(() => ReflectionV1Schema.parse(bad)).toThrow();
  });
  it('rejects out-of-range confidence', () => {
    const bad = {
      ...baseMechanical,
      confidence: 1.5,
      most_likely_wrong: null,
      known_not_in_diff: null,
    };
    expect(() => ReflectionV1Schema.parse(bad)).toThrow();
  });
  it('rejects an unknown surface', () => {
    const bad = {
      ...baseMechanical,
      risk: { ...baseMechanical.risk, surface: 'network' },
      confidence: null,
      most_likely_wrong: null,
      known_not_in_diff: null,
    };
    expect(() => ReflectionV1Schema.parse(bad)).toThrow();
  });
 });
 describe('ReflectionSelfReportSchema', () => {
  it('accepts a valid self-report', () => {
    const ok = ReflectionSelfReportSchema.safeParse({
      confidence: 0.8,
      most_likely_wrong: {
        surface: 'data',
        description: 'migration not run against prod-sized data',
      },
      known_not_in_diff: 'rollback path untested',
    });
    expect(ok.success).toBe(true);
  });
  it('accepts an empty self-report (all optional)', () => {
    expect(ReflectionSelfReportSchema.safeParse({}).success).toBe(true);
  });
  it('rejects confidence above 1', () => {
    expect(ReflectionSelfReportSchema.safeParse({ confidence: 2 }).success).toBe(false);
  });
  it('rejects an unknown most_likely_wrong.surface', () => {
    const res = ReflectionSelfReportSchema.safeParse({
      most_likely_wrong: { surface: 'network', description: 'x' },
    });
    expect(res.success).toBe(false);
  });
 });
--- a/packages/types/src/reflection/index.ts
+++ b/packages/types/src/reflection/index.ts
@@ -0,0 +1,30 @@
 /**
 * Agent reflection (v1) — public barrel.
 *
 * reflection.ts      — zod schemas (runtime source of truth) + inferred types
 * reflection.dto.ts  — class-validator DTO for the agent self-report input
 */
 export {
  REVIEW_SURFACES,
  ReviewSurfaceSchema,
  MostLikelyWrongSchema,
  ReflectionRiskSchema,
  ReflectionModeSchema,
  ReflectionProvenanceSchema,
  ReflectionSelfReportSchema,
  ReflectionV1Schema,
  REFLECTION_SCHEMA_ID,
 } from './reflection.js';
 export type {
  ReviewSurface,
  MostLikelyWrong,
  ReflectionRisk,
  ReflectionMode,
  ReflectionProvenance,
  ReflectionSelfReport,
  ReflectionV1,
 } from './reflection.js';
 export { MostLikelyWrongDto, ReflectionSelfReportDto } from './reflection.dto.js';
--- a/packages/types/src/reflection/reflection.dto.ts
+++ b/packages/types/src/reflection/reflection.dto.ts
@@ -0,0 +1,55 @@
 /**
 * Reflection self-report DTO — class-validator boundary.
 *
 * Validates the agent-supplied self-report input (the optional
 * `$REFLECTION_INPUT` file, default `<repo>/.mosaic/reflection-input.json`)
 * before it is merged into a `reflection.v1` record. This is the only
 * externally-authored input on the reflection path, so it gets a DTO per the
 * Mosaic module-boundary rule.
 *
 * Class-validator only (no class-transformer `@Type`) — matching `chat.dto.ts`
 * — so the module is safe to import without a `reflect-metadata` shim. Deep
 * nested validation of `most_likely_wrong` is owned by the zod
 * `ReflectionSelfReportSchema` in `reflection.ts`, which is what the Stop hook
 * actually enforces at runtime.
 */
 import {
  IsIn,
  IsNumber,
  IsObject,
  IsOptional,
  IsString,
  Max,
  Min,
  MaxLength,
 } from 'class-validator';
 import { REVIEW_SURFACES } from './reflection.js';
 /** Shape of `most_likely_wrong`; validated structurally by zod at runtime. */
 export class MostLikelyWrongDto {
  @IsIn(REVIEW_SURFACES as unknown as string[])
  surface!: string;
  @IsString()
  @MaxLength(4_000)
  description!: string;
 }
 export class ReflectionSelfReportDto {
  @IsOptional()
  @IsNumber()
  @Min(0)
  @Max(1)
  confidence?: number;
  @IsOptional()
  @IsObject()
  most_likely_wrong?: MostLikelyWrongDto;
  @IsOptional()
  @IsString()
  @MaxLength(8_000)
  known_not_in_diff?: string;
 }
--- a/packages/types/src/reflection/reflection.ts
+++ b/packages/types/src/reflection/reflection.ts
@@ -0,0 +1,90 @@
 /**
 * Agent reflection (v1) — wire schema.
 *
 * Runtime source of truth for the `reflection.v1` sidecar emitted at end-of-run
 * by the Stop hook (design §10 step 1). The JSON Schema artifact at
 * `@mosaicstack/macp` `src/schemas/reflection.v1.schema.json` is the documented
 * contract; this zod schema is the executable one and MUST agree with it.
 *
 * Field provenance:
 *   - MECHANICAL  (risk, files_changed, ids, provenance): written by the hook.
 *   - SELF-REPORTED (confidence, most_likely_wrong, known_not_in_diff): merged
 *     from an optional agent-supplied input; null when absent.
 *
 * Pure — no NestJS, no DB, no Node-only APIs. Safe for browser/edge.
 */
 import { z } from 'zod';
 /** Review surfaces, ordered most- to least-sensitive. Mirrors macp risk-floor. */
 export const REVIEW_SURFACES = [
  'auth',
  'data',
  'infra',
  'build',
  'ui',
  'test',
  'docs',
  'none',
 ] as const;
 export const ReviewSurfaceSchema = z.enum(REVIEW_SURFACES);
 export type ReviewSurface = z.infer<typeof ReviewSurfaceSchema>;
 /** SELF-REPORTED: the single most-likely way the work is wrong. */
 export const MostLikelyWrongSchema = z.object({
  surface: ReviewSurfaceSchema,
  description: z.string(),
 });
 export type MostLikelyWrong = z.infer<typeof MostLikelyWrongSchema>;
 /** MECHANICAL: output of the diff risk-floor (see `@mosaicstack/macp`). */
 export const ReflectionRiskSchema = z.object({
  needs_review: z.boolean(),
  score: z.number().min(0).max(1),
  surface: ReviewSurfaceSchema,
  reason: z.string(),
 });
 export type ReflectionRisk = z.infer<typeof ReflectionRiskSchema>;
 export const ReflectionModeSchema = z.enum(['off', 'solo', 'orchestrated']);
 export type ReflectionMode = z.infer<typeof ReflectionModeSchema>;
 export const ReflectionProvenanceSchema = z.object({
  source: z.literal('stop-hook'),
  reflection_attempt: z.number().int().min(1),
  degraded: z.boolean(),
  reflection_mode: ReflectionModeSchema,
 });
 export type ReflectionProvenance = z.infer<typeof ReflectionProvenanceSchema>;
 /**
 * The self-reported half of a reflection. Supplied by the agent out-of-band
 * (e.g. `<repo>/.mosaic/reflection-input.json`) and merged by the hook. All
 * fields optional; missing fields become `null` in the assembled record.
 */
 export const ReflectionSelfReportSchema = z.object({
  confidence: z.number().min(0).max(1).nullable().optional(),
  most_likely_wrong: MostLikelyWrongSchema.nullable().optional(),
  known_not_in_diff: z.string().nullable().optional(),
 });
 export type ReflectionSelfReport = z.infer<typeof ReflectionSelfReportSchema>;
 /** The full assembled `reflection.v1` sidecar. */
 export const ReflectionV1Schema = z.object({
  schema: z.literal('reflection.v1'),
  task_ref: z.string(),
  agent: z.string(),
  session_id: z.string(),
  timestamp: z.string(),
  repo: z.string(),
  confidence: z.number().min(0).max(1).nullable(),
  most_likely_wrong: MostLikelyWrongSchema.nullable(),
  known_not_in_diff: z.string().nullable(),
  risk: ReflectionRiskSchema,
  files_changed: z.array(z.string()),
  provenance: ReflectionProvenanceSchema,
 });
 export type ReflectionV1 = z.infer<typeof ReflectionV1Schema>;
 export const REFLECTION_SCHEMA_ID = 'reflection.v1' as const;
--- a/scripts/analysis/reflect-board-history.sh
+++ b/scripts/analysis/reflect-board-history.sh
@@ -0,0 +1,111 @@
 #!/usr/bin/env bash
 # reflect-board-history.sh — Phase-0 experiment P3 (outcome detectability)
 #
 # Question: for completed tasks, how often does a machine-detectable
 # correct/wrong outcome signal appear within a follow-up window (default 30d)?
 # If the base rate is too low, predicted-vs-actual calibration (design §7) has
 # nothing to score against, so the kernel should capture caveat-notes only.
 #
 # Method: consume a board/task export (JSONL, one task object per line) OR fall
 # back to scanning the git history of a `data/` task directory. For each task
 # that reached a "done"-like state, decide whether a later signal marks it
 # correct or wrong (reopen, revert, follow-up "fix"/"regression", explicit
 # outcome field). Emit the detectable-outcome base rate. HARNESS + RUBRIC.
 #
 # Usage:
 #   scripts/analysis/reflect-board-history.sh --jsonl FILE [--window-days N] [--json|--md]
 #   scripts/analysis/reflect-board-history.sh --data-dir DIR [--window-days N] [--json|--md]
 #
 # JSONL fields used (best-effort): .id .status .completed_at .outcome
 #   .reopened_at .followups[] (free-form). Missing fields are tolerated.
 #
 # Requirements: jq (for --jsonl), git (for --data-dir), awk.
 #
 # PRE-REGISTERED KILL CONDITION:
 #   detectable-outcome base rate < 20% ⇒ do NOT build §7 calibration loop;
 #   capture caveat-notes only.
 set -euo pipefail
 JSONL=""
 DATA_DIR=""
 WINDOW_DAYS=30
 FORMAT="json"
 while [[ $# -gt 0 ]]; do
  case "$1" in
    --jsonl) JSONL="$2"; shift 2 ;;
    --data-dir) DATA_DIR="$2"; shift 2 ;;
    --window-days) WINDOW_DAYS="$2"; shift 2 ;;
    --json) FORMAT="json"; shift ;;
    --md) FORMAT="md"; shift ;;
    -h|--help) sed -n '2,32p' "$0"; exit 0 ;;
    *) echo "unknown arg: $1" >&2; exit 2 ;;
  esac
 done
 KILL_CONDITION='detectable-outcome base rate < 20% ⇒ do NOT build §7 calibration loop'
 echo "# pre-registered kill condition: ${KILL_CONDITION}" >&2
 done_total=0
 detectable=0
 if [[ -n "$JSONL" ]]; then
  command -v jq >/dev/null 2>&1 || { echo "jq required for --jsonl" >&2; exit 3; }
  [[ -r "$JSONL" ]] || { echo "cannot read $JSONL" >&2; exit 3; }
  # Count done tasks and those with a machine-detectable outcome signal.
  done_total="$(jq -rs '[.[] | select((.status // "") | test("done|complete|closed"; "i"))] | length' "$JSONL" 2>/dev/null || echo 0)"
  detectable="$(jq -rs '
    [ .[]
      | select((.status // "") | test("done|complete|closed"; "i"))
      | select(
          (.outcome // null) != null
          or (.reopened_at // null) != null
          or ((.followups // []) | length) > 0
        )
    ] | length' "$JSONL" 2>/dev/null || echo 0)"
 elif [[ -n "$DATA_DIR" ]]; then
  command -v git >/dev/null 2>&1 || { echo "git required for --data-dir" >&2; exit 3; }
  [[ -d "$DATA_DIR" ]] || { echo "no such dir: $DATA_DIR" >&2; exit 3; }
  # Proxy: a task file later touched by a commit whose subject signals a
  # correction is a "detectable outcome".
  while IFS= read -r file; do
    [[ -z "$file" ]] && continue
    done_total=$((done_total + 1))
    if git -C "$DATA_DIR" log --since="${WINDOW_DAYS} days ago" --pretty='%s' -- "$file" 2>/dev/null \
         | grep -qiE 'reopen|revert|fix|regression|wrong|incorrect|redo'; then
      detectable=$((detectable + 1))
    fi
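  # List paths relative to DATA_DIR: `git -C "$DATA_DIR"` resolves pathspecs
  # from inside that directory, so a DATA_DIR-prefixed relative path would miss.
  done < <(cd "$DATA_DIR" && find . -type f -name '*.json' 2>/dev/null)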
 else
  echo "provide --jsonl FILE or --data-dir DIR" >&2
  exit 2
 fi
 rate="$(awk "BEGIN{ if ($done_total==0) print \"0.0\"; else printf \"%.1f\", 100*$detectable/$done_total }")"
 verdict="$(awk "BEGIN{print ($rate < 20.0) ? \"KILL §7 — caveat-notes only\" : \"signal present — proceed\"}")"
 if [[ "$FORMAT" == "md" ]]; then
  cat <<EOF
 ## P3 — outcome detectability
 - done-like tasks: **${done_total}**
 - with machine-detectable outcome (window ${WINDOW_DAYS}d): **${detectable}**
 - base rate: **${rate}%**
 - kill condition: ${KILL_CONDITION}
 - verdict: **${verdict}**
 EOF
 else
  awk -v dt="$done_total" -v d="$detectable" -v r="$rate" -v w="$WINDOW_DAYS" \
      -v v="$verdict" -v kc="$KILL_CONDITION" 'BEGIN{
    printf "{\n"
    printf "  \"experiment\": \"P3-board-history\",\n"
    printf "  \"window_days\": %d,\n", w
    printf "  \"done_tasks\": %d,\n", dt
    printf "  \"detectable_outcomes\": %d,\n", d
    printf "  \"base_rate_pct\": %s,\n", r
    printf "  \"kill_condition\": \"%s\",\n", kc
    printf "  \"verdict\": \"%s\"\n", v
    printf "}\n"
  }'
 fi
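The base-rate arithmetic and the pre-registered 20% kill condition can be checked in isolation. This is a sketch under simplified assumptions: the helper names are hypothetical, and the verdict strings are shortened to `kill` and `proceed` rather than the script's full wording.

```bash
# Hypothetical helpers mirroring the P3 arithmetic.
# base_rate_pct DETECTABLE DONE_TOTAL -> percentage with one decimal place.
base_rate_pct() {
    awk -v d="$1" -v t="$2" 'BEGIN {
        if (t == 0) print "0.0"
        else printf "%.1f\n", 100 * d / t
    }'
}

# p3_verdict RATE -> "kill" when the base rate is below 20%, else "proceed".
p3_verdict() {
    awk -v r="$1" 'BEGIN { if (r + 0 < 20.0) print "kill"; else print "proceed" }'
}

rate="$(base_rate_pct 3 20)"   # 3 detectable outcomes out of 20 done tasks
echo "$rate"
p3_verdict "$rate"
```

An empty corpus yields a rate of `0.0` and therefore a `kill` verdict, which matches the script's behavior of not building the calibration loop without evidence.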
--- a/scripts/analysis/reflect-calibration.sh
+++ b/scripts/analysis/reflect-calibration.sh
@@ -0,0 +1,117 @@
 #!/usr/bin/env bash
 # reflect-calibration.sh — Phase-0 experiment P1 (confidence signal)
 #
 # Question: does an agent's self-reported confidence discriminate correct from
 # incorrect work — especially on the self-rated-HIGH subset, where a closed
 # loop would actually trust it? If confidence ≈ chance on the high subset, the
 # signal is useless and design §7–§8 should not be built.
 #
 # Method: consume a labelled corpus — JSONL of {confidence: 0..1, correct:
 # true|false}. Compute discrimination as ROC AUC over all rows, plus the
 # correct-rate (lift) on the high-confidence subset (>= threshold), and compare
 # to the pre-registered chance baseline (the overall correct-rate). HARNESS +
 # RUBRIC; the labelled corpus is supplied later.
 #
 # Usage:
 #   scripts/analysis/reflect-calibration.sh --jsonl FILE [--high 0.8] [--json|--md]
 #
 # Requirements: jq, awk.
 #
 # PRE-REGISTERED KILL CONDITION:
 #   AUC <= 0.60 OR high-subset lift <= +5pp over base rate
 #   ⇒ confidence is not a usable routing signal; do NOT build §7–§8.
 set -euo pipefail
 JSONL=""
 HIGH=0.8
 FORMAT="json"
 while [[ $# -gt 0 ]]; do
  case "$1" in
    --jsonl) JSONL="$2"; shift 2 ;;
    --high) HIGH="$2"; shift 2 ;;
    --json) FORMAT="json"; shift ;;
    --md) FORMAT="md"; shift ;;
    -h|--help) sed -n '2,27p' "$0"; exit 0 ;;
    *) echo "unknown arg: $1" >&2; exit 2 ;;
  esac
 done
 KILL_CONDITION='AUC <= 0.60 OR high-subset lift <= +5pp ⇒ do NOT build §7–§8'
 echo "# pre-registered kill condition: ${KILL_CONDITION}" >&2
 command -v jq >/dev/null 2>&1 || { echo "jq required" >&2; exit 3; }
 [[ -r "$JSONL" ]] || { echo "provide a readable --jsonl FILE" >&2; exit 2; }
 # Normalise to "<confidence> <0|1>" rows; tolerate bad lines.
 ROWS="$(jq -rs '
  [ .[] | select((.confidence|type)=="number") |
    "\(.confidence) \((.correct==true) | if . then 1 else 0 end)" ]
  | .[]' "$JSONL" 2>/dev/null || true)"
 if [[ -z "$ROWS" ]]; then
  echo '{ "experiment": "P1-calibration", "error": "no usable rows" }'
  exit 0
 fi
 # AUC via the Mann–Whitney U relation (rank-based); base rate; high-subset lift.
 read -r N POS BASE AUC HIGH_N HIGH_CORRECT HIGH_RATE LIFT <<EOF
 $(printf '%s\n' "$ROWS" | awk -v high="$HIGH" '
  { c=$1; y=$2; conf[NR]=c; lab[NR]=y; n++;
    if (y==1) pos++; else neg++;
    if (c>=high) { hn++; if (y==1) hc++ } }
  END{
    base = (n>0)? pos/n : 0;
    # Rank-sum AUC: average ranks (ties → average rank).
    # sort indices by confidence
    for (i=1;i<=n;i++) idx[i]=i;
    for (i=1;i<=n;i++) for (j=i+1;j<=n;j++) if (conf[idx[i]]>conf[idx[j]]) { t=idx[i]; idx[i]=idx[j]; idx[j]=t }
    i=1;
    while (i<=n) {
      j=i; while (j<n && conf[idx[j+1]]==conf[idx[i]]) j++;
      avg=(i+j)/2.0;
      for (k=i;k<=j;k++) rank[idx[k]]=avg;
      i=j+1;
    }
    rsum=0; for (i=1;i<=n;i++) if (lab[i]==1) rsum+=rank[i];
    if (pos>0 && neg>0) auc=(rsum - pos*(pos+1)/2.0)/(pos*neg); else auc=0.5;
    hrate=(hn>0)? hc/hn : 0;
    lift=hrate-base;
    printf "%d %d %.4f %.4f %d %d %.4f %.4f", n, pos, base, auc, hn, hc, hrate, lift
  }')
 EOF
 verdict="$(awk -v auc="$AUC" -v lift="$LIFT" 'BEGIN{
  print (auc <= 0.60 || lift <= 0.05) ? "KILL §7–§8 — confidence not usable" : "signal present — proceed"
 }')"
 if [[ "$FORMAT" == "md" ]]; then
  cat <<EOF
 ## P1 — confidence calibration
 - rows: **${N}** (positives ${POS}) · base correct-rate **$(awk "BEGIN{printf \"%.1f\", 100*${BASE}}")%**
 - ROC AUC: **${AUC}**
 - high-confidence subset (>= ${HIGH}): n=${HIGH_N}, correct=${HIGH_CORRECT}, rate=$(awk "BEGIN{printf \"%.1f\", 100*${HIGH_RATE}}")%
 - lift over base: **$(awk "BEGIN{printf \"%+.1f\", 100*${LIFT}}")pp**
 - kill condition: ${KILL_CONDITION}
 - verdict: **${verdict}**
 EOF
 else
  awk -v n="$N" -v pos="$POS" -v base="$BASE" -v auc="$AUC" -v hn="$HIGH_N" \
      -v hc="$HIGH_CORRECT" -v hr="$HIGH_RATE" -v lift="$LIFT" -v high="$HIGH" \
      -v v="$verdict" -v kc="$KILL_CONDITION" 'BEGIN{
    printf "{\n"
    printf "  \"experiment\": \"P1-calibration\",\n"
    printf "  \"rows\": %d,\n", n
    printf "  \"positives\": %d,\n", pos
    printf "  \"base_rate\": %.4f,\n", base
    printf "  \"auc\": %.4f,\n", auc
    printf "  \"high_threshold\": %s,\n", high
    printf "  \"high_subset\": { \"n\": %d, \"correct\": %d, \"rate\": %.4f },\n", hn, hc, hr
    printf "  \"lift_over_base\": %.4f,\n", lift
    printf "  \"kill_condition\": \"%s\",\n", kc
    printf "  \"verdict\": \"%s\"\n", v
    printf "}\n"
  }'
 fi
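
The rank-sum AUC in the script above can be cross-checked against the direct pairwise definition: the fraction of (correct, incorrect) pairs in which the correct row has the higher confidence, with ties counted as half. The sketch below is not part of the PR; `pairwise_auc` is a hypothetical helper that reads the same `<confidence> <0|1>` rows the script builds in `ROWS`, and the five-row corpus is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical cross-check (not part of the PR): AUC by direct pair counting.
# Input on stdin: "<confidence> <0|1>" rows, the same shape as ROWS above.
pairwise_auc() {
  awk '
    # Split rows into positives (label 1) and negatives (anything else).
    { if ($2 == 1) p[++np] = $1 + 0; else q[++nq] = $1 + 0 }
    END {
      # Same convention as the script: AUC is 0.5 when one class is missing.
      if (np == 0 || nq == 0) { printf "%.4f\n", 0.5; exit }
      for (i = 1; i <= np; i++)
        for (j = 1; j <= nq; j++) {
          if (p[i] > q[j]) wins += 1          # positive ranked above negative
          else if (p[i] == q[j]) wins += 0.5  # tie counts as half a win
        }
      printf "%.4f\n", wins / (np * nq)
    }'
}

# Toy corpus: 3 correct rows, 2 incorrect rows.
printf '%s\n' '0.9 1' '0.8 1' '0.7 0' '0.6 1' '0.5 0' | pairwise_auc   # prints 0.8333
```

On this toy corpus 5 of the 6 pairs are ordered correctly. The rank-sum formula gives the same value: the positives hold ascending ranks 2, 4 and 5, so (11 - 3*4/2) / (3*2) = 5/6. With `--high 0.8` the base rate is 0.6 and the high subset is 2 of 2 correct, a lift of +40pp, so neither kill clause would fire.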
--- a/scripts/analysis/reflect-git-history.sh
+++ b/scripts/analysis/reflect-git-history.sh
@@ -0,0 +1,110 @@
 #!/usr/bin/env bash
 # reflect-git-history.sh — Phase-0 experiment P2 ("only-self-reflection" bucket)
 #
 # Question: of the failures visible in git history, what fraction would ONLY
 # have been caught by end-of-run self-reflection — i.e. NOT by CI and NOT by
 # independent human review? If that bucket is near-empty, the closed
 # calibration / skill-synthesis loop (design §7–§8) is not worth building.
 #
 # Method: scan `git log` over a window for failure signals (reverts, and
 # fix:/hotfix commits landing shortly after a feature merge). Classify each by
 # the gate most likely to have caught it, using a pre-registered heuristic.
 # This is a HARNESS + RUBRIC; the classifier is deliberately simple and the
 # real corpus/labelling is wired later. It emits a structured tally.
 #
 # Usage:
 #   scripts/analysis/reflect-git-history.sh [--repo PATH] [--since SINCE] [--json|--md]
 #
 # Options:
 #   --repo PATH   repo to analyse (default: current repo)
 #   --since SINCE git log --since value (default: "6 months ago")
 #   --json        emit JSON (default)
 #   --md          emit markdown
 #
 # Requirements: git, awk.
 #
 # PRE-REGISTERED KILL CONDITION:
 #   bucket "only_self_reflection" is near-empty (< 10% of classified failures)
 #   ⇒ do NOT build design §7–§8 (closed loop). Caveat-notes capture only.
 set -euo pipefail
 REPO="."
 SINCE="6 months ago"
 FORMAT="json"
 while [[ $# -gt 0 ]]; do
  case "$1" in
    --repo) REPO="${2:?--repo requires a PATH argument}"; shift 2 ;;
    --since) SINCE="${2:?--since requires a value}"; shift 2 ;;
    --json) FORMAT="json"; shift ;;
    --md) FORMAT="md"; shift ;;
    -h|--help) sed -n '2,30p' "$0"; exit 0 ;;
    *) echo "unknown arg: $1" >&2; exit 2 ;;
  esac
 done
 KILL_CONDITION='bucket only_self_reflection < 10% of classified failures ⇒ do NOT build §7–§8'
 echo "# pre-registered kill condition: ${KILL_CONDITION}" >&2
 command -v git >/dev/null 2>&1 || { echo "git required" >&2; exit 3; }
 # Collect candidate failure commits: reverts + fix/hotfix subjects.
 mapfile -t LINES < <(
  git -C "$REPO" log --since="$SINCE" --pretty='%H%x09%s' 2>/dev/null \
    | grep -iE 'revert|hotfix|hot-fix|regression|fix(\(|:|!| )' || true
 )
 total=0; ci=0; human=0; selfonly=0
 # The +-expansion keeps an empty LINES array from tripping `set -u` on bash < 4.4.
 for line in ${LINES[@]+"${LINES[@]}"}; do
  [[ -z "$line" ]] && continue
  subj="${line#*$'\t'}"
  total=$((total + 1))
  # Pre-registered classification heuristic (gate most likely to have caught it):
  #   - build/test/lint/type/ci signals → CI would have caught it
  #   - security/auth/permission/data/migration → human review would flag it
  #   - everything else (logic/UX/assumption/edge) → only-self-reflection bucket
  if printf '%s' "$subj" | grep -qiE 'test|lint|type|build|ci|compile|typo'; then
    ci=$((ci + 1))
  elif printf '%s' "$subj" | grep -qiE 'security|auth|permission|rbac|secret|migration|data|sql|injection'; then
    human=$((human + 1))
  else
    selfonly=$((selfonly + 1))
  fi
 done
 pct() { awk "BEGIN{ if ($2==0) print \"0.0\"; else printf \"%.1f\", 100*$1/$2 }"; }
 self_pct="$(pct "$selfonly" "$total")"
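 # With zero classified failures the share is undefined, so report that instead of a KILL.
 verdict="$(awk -v sp="$self_pct" -v t="$total" 'BEGIN{
  if (t == 0) { print "no classified failures; kill condition not evaluated" }
  else { print (sp < 10.0) ? "KILL §7–§8" : "signal present — proceed to deeper labelling" }
 }')"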
 if [[ "$FORMAT" == "md" ]]; then
  cat <<EOF
 ## P2 — git-history failure-gate attribution
 - window: \`${SINCE}\` · repo: \`${REPO}\`
 - classified failures: **${total}**
 | gate | count | share |
 |---|---:|---:|
 | CI would catch | ${ci} | $(pct "$ci" "$total")% |
 | human review would catch | ${human} | $(pct "$human" "$total")% |
 | only-self-reflection | ${selfonly} | ${self_pct}% |
 - kill condition: ${KILL_CONDITION}
 - verdict: **${verdict}**
 EOF
 else
  awk -v t="$total" -v c="$ci" -v h="$human" -v s="$selfonly" -v sp="$self_pct" \
      -v v="$verdict" -v since="$SINCE" -v repo="$REPO" -v kc="$KILL_CONDITION" 'BEGIN{
    printf "{\n"
    printf "  \"experiment\": \"P2-git-history\",\n"
    printf "  \"repo\": \"%s\",\n", repo
    printf "  \"since\": \"%s\",\n", since
    printf "  \"classified_failures\": %d,\n", t
    printf "  \"buckets\": { \"ci\": %d, \"human_review\": %d, \"only_self_reflection\": %d },\n", c, h, s
    printf "  \"only_self_reflection_pct\": %s,\n", sp
    printf "  \"kill_condition\": \"%s\",\n", kc
    printf "  \"verdict\": \"%s\"\n", v
    printf "}\n"
  }'
 fi
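
To spot-check how the three-way heuristic above buckets individual subjects, it can be isolated as a function. This is a sketch, not part of the PR: `classify_subject` is a hypothetical name, the two patterns are copied from the script, and the sample subjects are invented.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): the script's classifier as a function.
# Prints the bucket for one commit subject: ci, human_review, or only_self_reflection.
classify_subject() {
  local subj="$1"
  if printf '%s' "$subj" | grep -qiE 'test|lint|type|build|ci|compile|typo'; then
    echo ci
  elif printf '%s' "$subj" | grep -qiE 'security|auth|permission|rbac|secret|migration|data|sql|injection'; then
    echo human_review
  else
    echo only_self_reflection
  fi
}

classify_subject 'fix(build): restore lockfile'       # ci
classify_subject 'fix(auth): reject expired tokens'   # human_review
classify_subject 'fix: wrong default page size'       # only_self_reflection
classify_subject 'fix: narrow overly specific retry'  # ci ("ci" matches inside "specific")
```

The last call shows a property of the patterns worth keeping in mind when reading the tally: they match substrings, so short tokens such as `ci` or `data` also fire inside longer words. That moves subjects out of the only-self-reflection bucket and therefore biases the result toward the kill condition.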
Author	SHA1	Message	Date
Hermes Agent	9e8a9cfa8d	fix(pr-ci-wait): CI-history primary tier — close webhook-lag false-green (#550) F-06 follow-up per Mos ruling. The no-CI fast-exit was a pure empty-poll streak (NO_CI_MAX×interval ≈ 45s), so a slow-to-register pipeline (webhook/queue lag) looked like 'no CI' and could false-green a merge gate before the pipeline existed. Two-tier no-CI determination: - PRIMARY: probe the repo's DEFAULT BRANCH commit status once at startup. If it has CI history, the repo runs CI → an empty status on the PR head means the pipeline has not REGISTERED yet → never fast-green; poll until it registers or timeout (both safe). Closes the webhook-lag false-green. - SECONDARY: the empty-poll streak fast-exit now applies ONLY to genuinely CI-less repos (default branch also has no CI history). Preserves the original no-CI win. - Probe failure → conservative REPO_HAS_CI=1 (assume CI; wait-then-timeout beats false-green). All early returns are explicit 'return 0' + guarded call so the probe can never abort under set -e. Verified: bash -n + shellcheck clean; behavioral harness covers established-repo (stays 1), CI-less (→0), empty-branch/probe-fail (conservative 1), and the no-status gate (has-CI never fast-greens, CI-less fast-exits). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Kt2D8TsnDwhtzEAPijsNmR	2026-06-18 14:18:32 -05:00
Hermes Agent	b90aec2024	fix(framework/tools): wrapper hardening — TLS validation, cred-path fallback, no-CI fast-exit (#550) F-03: validate TLS by default. New _mosaic_tls_opt helper in _lib/credentials.sh returns -k only for private-network IP literals (trusted LAN) or an explicit MOSAIC_INSECURE_TLS opt-in; generic mosaic_http/_post/_patch helpers now use `curl -sS $_tls` instead of `curl -sk`. Woodpecker scripts (_lib.sh, pipeline-status/list/trigger.sh) talk only to the two public/valid CI hosts, so `-sk` is changed to `-sS` (straight -k removal, no helper). F-02: credentials.sh resolves MOSAIC_CREDENTIALS_FILE via a fallback chain — env first, then ~/.config/mosaic/credentials.json, then the legacy ~/src/jarvis-brain/credentials.json retained as final fallback so the running fleet keeps working. F-06: pr-ci-wait.sh distinguishes a genuine no-CI condition (empty state AND no statuses) as a new `no-status` state and fast-exits 0 after 3 consecutive empty polls with a clear "no CI configured" message. Repos that DO have pipelines are unaffected — any pipeline signal resets the streak and pending still waits. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Kt2D8TsnDwhtzEAPijsNmR	2026-06-18 14:02:43 -05:00
jason.woltje	b8807e60df	feat(agent-reflection): durable kernel — reflection.v1 capture + risk-floor + Phase-0 (#545)	2026-06-16 21:35:40 +00:00
jason.woltje	c461380a4a	feat(mosaic-as): agent registration + scoped/revocable tokens (US-007) (#541)	2026-06-16 01:10:44 +00:00
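
The two-tier no-CI determination described in commit 9e8a9cfa8d can be sketched as a pure decision function. `pr-ci-wait.sh` itself is not shown on this page, so everything below is an assumption made for illustration: the function name `no_ci_action`, its argument layout, the returned labels, and the `NO_CI_MAX` default of 3 (taken from the "3 consecutive empty polls" wording in b90aec2024). Only `REPO_HAS_CI`, `NO_CI_MAX` and the `no-status` state come from the commit messages.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the two-tier rule; not the code from pr-ci-wait.sh.
#   $1  REPO_HAS_CI: 1 if the default branch has CI history (or the probe failed), else 0
#   $2  PR-head state for this poll; "no-status" means nothing has registered
#   $3  number of consecutive no-status polls, including this one
#   $4  NO_CI_MAX (assumed default: 3)
no_ci_action() {
  local repo_has_ci="$1" state="$2" streak="$3" max="${4:-3}"
  if [[ "$state" != "no-status" ]]; then
    echo "evaluate"    # a pipeline signal exists; normal pending/success/failure handling
  elif [[ "$repo_has_ci" == "1" ]]; then
    echo "wait"        # primary tier: repo runs CI, so the pipeline has not registered yet
  elif (( streak >= max )); then
    echo "fast-exit"   # secondary tier: genuinely CI-less repo, streak exhausted
  else
    echo "wait"        # CI-less repo, streak still building
  fi
}

no_ci_action 1 no-status 10   # wait
no_ci_action 0 no-status 3    # fast-exit
no_ci_action 0 no-status 2    # wait
no_ci_action 1 pending 0      # evaluate
```

Treating a failed probe as `REPO_HAS_CI=1` keeps the function on the "wait" branch, which matches the stated preference for wait-then-timeout over a false green.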