From b76666166e9d73345835e0fb68934b30003475e0 Mon Sep 17 00:00:00 2001
From: Hermes Agent <hermes@web1.uscllc.com>
Date: Tue, 16 Jun 2026 15:55:15 -0500
Subject: [PATCH] =?UTF-8?q?feat(agent-reflection):=20durable=20kernel=20?=
 =?UTF-8?q?=E2=80=94=20reflection.v1=20capture=20+=20risk-floor=20+=20Phas?=
 =?UTF-8?q?e-0=20(#544)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Build the durable kernel of the agent reflection loop. Passive end-of-run
capture of the doer's end-state as structured `reflection.v1` data, plus a
deterministic diff review risk-floor. The closed calibration/skill-synthesis
loop (design §7–§8) stays gated behind Phase-0 experiments P1/P2/P3.

- packages/macp: evaluateRiskFloor (pure, deterministic surface classifier)
  + reflection.v1 JSON Schema; 15 unit tests.
- packages/types: reflection.v1 zod schemas + self-report DTO; 10 unit tests.
- framework: fail-closed Stop hook (reflect-stop-hook.sh) writing the sidecar,
  registered as hooks.Stop in runtime/claude/settings.json. Strict no-op unless
  REFLECTION_MODE=solo|orchestrated; never blocks or fails a session.
- scripts/analysis: P1/P2/P3 experiment harnesses with pre-registered kill
  conditions and structured output.

Mechanical fields (risk, files_changed, ids, provenance) are written by the
hook; self-report fields (confidence, most_likely_wrong, known_not_in_diff) are
merged from an optional $REFLECTION_INPUT, else null + provenance.degraded=true.
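
A minimal TypeScript sketch of that merge rule, for illustration only. The
hook itself is bash; `mergeSelfReport`, `SelfReport`, and `rawInput` are
hypothetical names, and treating any readable JSON object as a supplied
self-report (degraded=false) is an assumption:

```ts
// Illustrative only: hypothetical names, not the hook's real (bash) implementation.
interface SelfReport {
  confidence: number | null;
  most_likely_wrong: { surface: string; description: string } | null;
  known_not_in_diff: string | null;
}

interface MergedSelfReport extends SelfReport {
  degraded: boolean; // mirrors provenance.degraded
}

const EMPTY_REPORT: SelfReport = {
  confidence: null,
  most_likely_wrong: null,
  known_not_in_diff: null,
};

// rawInput: contents of $REFLECTION_INPUT, or undefined if absent/unreadable.
function mergeSelfReport(rawInput: string | undefined): MergedSelfReport {
  if (rawInput === undefined) return { ...EMPTY_REPORT, degraded: true };
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawInput);
  } catch {
    return { ...EMPTY_REPORT, degraded: true }; // unparseable counts as unreadable
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return { ...EMPTY_REPORT, degraded: true };
  }
  const p = parsed as Partial<SelfReport>;
  return {
    confidence: typeof p.confidence === 'number' ? p.confidence : null,
    most_likely_wrong: p.most_likely_wrong ?? null,
    known_not_in_diff: typeof p.known_not_in_diff === 'string' ? p.known_not_in_diff : null,
    // Assumption: any readable JSON object counts as a supplied self-report.
    degraded: false,
  };
}
```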

Independent review remediations: empty/all-.mosaic diff still writes a sidecar
(grep no-match no longer aborts); session_id sanitized before path use.

Refs #544

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/plans/agent-reflection-loop-PRD.md       | 173 +++++++++++++++
 docs/scratchpads/544-agent-reflection-loop.md |  55 +++++
 docs/tasks/544-agent-reflection-loop.md       |  67 ++++++
 packages/macp/src/index.ts                    |   5 +
 packages/macp/src/risk-floor.spec.ts          |  87 ++++++++
 packages/macp/src/risk-floor.ts               | 138 ++++++++++++
 .../src/schemas/reflection.v1.schema.json     | 105 ++++++++++
 .../framework/runtime/claude/settings.json    |  11 +
 .../framework/tools/qa/reflect-stop-hook.sh   | 197 ++++++++++++++++++
 packages/types/src/index.ts                   |   1 +
 .../reflection/__tests__/reflection.spec.ts   | 146 +++++++++++++
 packages/types/src/reflection/index.ts        |  30 +++
 .../types/src/reflection/reflection.dto.ts    |  55 +++++
 packages/types/src/reflection/reflection.ts   |  90 ++++++++
 scripts/analysis/reflect-board-history.sh     | 111 ++++++++++
 scripts/analysis/reflect-calibration.sh       | 117 +++++++++++
 scripts/analysis/reflect-git-history.sh       | 110 ++++++++++
 17 files changed, 1498 insertions(+)
 create mode 100644 docs/plans/agent-reflection-loop-PRD.md
 create mode 100644 docs/scratchpads/544-agent-reflection-loop.md
 create mode 100644 docs/tasks/544-agent-reflection-loop.md
 create mode 100644 packages/macp/src/risk-floor.spec.ts
 create mode 100644 packages/macp/src/risk-floor.ts
 create mode 100644 packages/macp/src/schemas/reflection.v1.schema.json
 create mode 100755 packages/mosaic/framework/tools/qa/reflect-stop-hook.sh
 create mode 100644 packages/types/src/reflection/__tests__/reflection.spec.ts
 create mode 100644 packages/types/src/reflection/index.ts
 create mode 100644 packages/types/src/reflection/reflection.dto.ts
 create mode 100644 packages/types/src/reflection/reflection.ts
 create mode 100755 scripts/analysis/reflect-board-history.sh
 create mode 100755 scripts/analysis/reflect-calibration.sh
 create mode 100755 scripts/analysis/reflect-git-history.sh

diff --git a/docs/plans/agent-reflection-loop-PRD.md b/docs/plans/agent-reflection-loop-PRD.md
new file mode 100644
index 0000000..114b2b0
--- /dev/null
+++ b/docs/plans/agent-reflection-loop-PRD.md
@@ -0,0 +1,173 @@
+# PRD — Agent Reflection Loop (durable kernel)
+
+**Issue:** [#544](http://git.mosaicstack.dev/mosaicstack/stack/issues/544)
+**Source design:** jarvis-brain `docs/planning/AGENT-REFLECTION-LOOP.md` (commit df6576fc, debate-hardened v2)
+**Status:** in-progress
+**Scope rule:** Build the **durable kernel** only. The closed calibration/skill-synthesis loop
+(design §7–§8) is **gated** behind Phase-0 experiments P1/P2/P3 and is explicitly out of scope here.
+
+---
+
+## 1. Problem
+
+At end-of-run an agent holds context that never reaches the diff or the "done" message —
+assumptions, shortcuts, untested paths, the single most-likely way the work is wrong. That context
+is what a lead/human needs to judge trust, and it evaporates when the session ends. Capture it
+mechanically as **structured data** (`reflection.v1`), and derive a **review risk-floor** from the
+change surface so risky diffs are flagged for independent review.
+
+## 2. Non-goals (gated on Phase-0)
+
+- No closed calibration loop (predicted-vs-actual scoring as a routing input).
+- No skill synthesis.
+- No automated reviewer routing/dispatch. The kernel **writes** the sidecar; pickup is future work.
+
+## 3. Components & exact placement (main-branch truth)
+
+| #   | Component            | Path                                                                                             | Mirror                              |
+| --- | -------------------- | ------------------------------------------------------------------------------------------------ | ----------------------------------- |
+| a   | Stop hook (capture)  | `packages/mosaic/framework/tools/qa/reflect-stop-hook.sh`                                        | `tools/qa/prevent-memory-write.sh`  |
+| a   | Hook registration    | `packages/mosaic/framework/runtime/claude/settings.json` (`hooks.Stop`)                          | existing `PreToolUse`/`PostToolUse` |
+| b   | JSON Schema          | `packages/macp/src/schemas/reflection.v1.schema.json`                                            | `schemas/task.schema.json`          |
+| b   | TS types (zod) + DTO | `packages/types/src/reflection/{index.ts,reflection.dto.ts}` + re-export from `src/index.ts`     | `packages/types/src/federation/*`   |
+| c   | Diff risk-floor      | `packages/macp/src/risk-floor.ts` (+ `risk-floor.spec.ts`, export from `src/index.ts`)           | `packages/macp/src/gate-runner.ts`  |
+| d   | Phase-0 scripts      | `scripts/analysis/reflect-{git-history,board-history,calibration}.sh`                            | `scripts/publish-npmjs.sh`          |
+
+**Activation note (deliberate deviation):** the `settings-overlays/` directory has **no merge
+mechanism** (referenced only in docs), so a hooks overlay there would be inert. The Stop hook is
+registered in the canonical `runtime/claude/settings.json` — the same file the `mosaic` launcher
+reflects into `~/.claude/settings.json` (verified byte-identical hooks live there). Still fully
+vendored in-repo.
+
+## 4. `reflection.v1` schema (authoritative field list)
+
+```jsonc
+{
+  "schema": "reflection.v1", // literal
+  "task_ref": "string", // canonical task ref; kernel derives from REFLECTION_TASK_REF or repo+branch
+  "agent": "string", // persona/runtime id (REFLECTION_AGENT or "unknown")
+  "session_id": "string", // from Stop payload session_id, else "unknown"
+  "timestamp": "string", // ISO-8601 UTC
+  "repo": "string", // repo root basename
+  "confidence": 0.0, // FLOAT [0,1] — SELF-REPORTED (optional; null if not supplied)
+  "most_likely_wrong": {
+    // SELF-REPORTED (optional)
+    "surface": "auth|data|infra|ui|build|test|docs|none",
+    "description": "string",
+  },
+  "known_not_in_diff": "string|null", // SELF-REPORTED: "what I know that isn't visible in the diff"
+  "risk": {
+    // MECHANICAL — from risk-floor
+    "needs_review": true,
+    "score": 0.0, // [0,1]
+    "surface": "auth|data|infra|ui|build|test|docs|none",
+    "reason": "string",
+  },
+  "files_changed": ["string"], // MECHANICAL — git diff name-only
+  "provenance": {
+    "source": "stop-hook",
+    "reflection_attempt": 1,
+    "degraded": false, // true if self-report inputs missing/unreadable
+    "reflection_mode": "off|solo|orchestrated",
+  },
+}
+```
+
+**Mechanical vs self-reported.** A bash Stop hook cannot author the agent's self-assessment. The
+hook populates the **mechanical** fields deterministically (risk, files_changed, provenance, ids).
+The **self-reported** fields are read from an optional agent-supplied input file
+(`$REFLECTION_INPUT`, default `<repo>/.mosaic/reflection-input.json`) and merged if present;
+absent/unreadable → those fields null and `provenance.degraded=true`. This realizes the design's
+"hook is a pre-seed, not the asker" (§4).
+
+## 5. Stop hook behavior (fail-closed, non-blocking)
+
+1. Read Stop payload JSON from stdin.
+2. **Fail-closed:** if `REFLECTION_MODE` is unset or `off` → `exit 0` immediately (strict no-op). This
+   is the global-registration safety guarantee.
+3. **Sentinel guard:** if `<sidecar>.lock` exists → `exit 0` (prevents re-fire loops). Create it,
+   `trap` cleanup.
+4. Determine output dir: `$REFLECTION_DIR` else `<repo>/.mosaic/reflections/`. `mkdir -p`.
+5. Compute mechanical fields: `git diff --name-only` (HEAD + staged + worktree, best-effort),
+   call risk-floor logic (inline bash port OR `node -e` into `@mosaicstack/macp` — see §6), session
+   ids from payload + env.
+6. Merge optional `$REFLECTION_INPUT` self-report if readable JSON.
+7. Write `reflection.v1` to a temp file, `mv` (atomic) to `<dir>/<session>-<ts>.reflection.json`.
+8. Always `exit 0`. **Never** emit a `decision` field (Stop hooks are observational).
+
+Hook must never fail the session: wrap risky steps, default to `degraded:true` on any error, exit 0.
+
+## 6. Risk-floor (`packages/macp/src/risk-floor.ts`)
+
+Pure, deterministic, no IO. Single source of truth for the verdict; the hook calls it via
+`node --input-type=module -e` (importing the built package) **or**, to avoid a node dependency in the
+hook path, the hook ports the same surface table. **Decision:** implement the canonical logic in TS
+(tested), and have the hook shell out to node when available, else fall back to a minimal inline
+classifier flagged `degraded:true`. (Keep the TS the authority; the inline path is a safety net.)
+
+```ts
+export type ReviewSurface = 'auth' | 'data' | 'infra' | 'ui' | 'build' | 'test' | 'docs' | 'none';
+export interface RiskFloorInput {
+  filesChanged: string[];
+  insertions?: number;
+  deletions?: number;
+}
+export interface RiskFloorVerdict {
+  needs_review: boolean;
+  score: number;
+  surface: ReviewSurface;
+  reason: string;
+}
+export function evaluateRiskFloor(input: RiskFloorInput): RiskFloorVerdict;
+```
+
+Surface classification by path regex (first match wins, highest-risk surface dominates):
+
+- `auth` (weight 1.0): `auth`, `login`, `session`, `token`, `permission`, `rbac`, `credential`, `secret`
+- `data` (0.9): `migration`, `prisma`, `schema`, `\.sql`, `entity`, `repository`, `seed`
+- `infra` (0.85): `docker`, `\.woodpecker`, `compose`, `traefik`, `deploy`, `helm`, `k8s`, `terraform`
+- `build` (0.6): `package.json`, `tsconfig`, `turbo.json`, `pnpm-`, `\.config\.`, `eslint`, `vite`
+- `ui` (0.4): `\.tsx`, `\.css`, `components/`, `apps/web/`
+- `test` (0.2): `\.spec\.`, `\.test\.`, `__tests__/`
+- `docs` (0.1): `\.md$`, `docs/`
+- `none` (0.0): anything else
+
+`needs_review = score >= THRESHOLD` (default `0.5`, overridable). `reason` names the files+surface
+that tripped it. **Subordinate to CI:** this is a _floor_ (minimum review requirement) only;
+consumers MUST treat CI/tests as authoritative above the floor (precedence: CI/tests > human merge >
+reviewer verdict > self-reflection). Documented in the module header.
+
+## 7. Phase-0 experiment scripts (`scripts/analysis/`)
+
+Offline, no-infra bash. Each script: `#!/usr/bin/env bash`, `set -euo pipefail`, header `Usage:` +
+`Requirements:`, flag parsing, **prints its pre-registered kill condition**, emits structured
+(JSON/markdown) output. They are harnesses + rubrics — real corpora are wired later.
+
+- `reflect-git-history.sh` (**P2** — only-self-reflection bucket): scan `git log` for failure signals
+  (reverts, `fix:`/`hotfix` shortly after a feature merge) over a window; classify each by which gate
+  would catch it (CI / human-review / only-self-reflection) via a pre-registered heuristic; tally.
+  Kill: bucket-3 near-empty → no §7/§8.
+- `reflect-board-history.sh` (**P3** — outcome detectability): given a task/board export (or the
+  git history of `data/` task files), measure the fraction of completed tasks with a
+  machine-detectable correct/wrong signal within 30 days. Kill: base-rate < 20% → caveat-notes only.
+- `reflect-calibration.sh` (**P1** — confidence signal): consume a labeled corpus (JSONL of
+  `{confidence, correct}`), compute discrimination (AUC/lift) on the self-rated-high subset, print
+  the metric vs the pre-registered chance threshold. Kill: AUC ≈ chance on the high subset → no §7/§8.
+
+## 8. CI / quality gates
+
+- TS packages: `pnpm typecheck` (tsc --noEmit), `pnpm lint` (eslint), `pnpm format:check`
+  (prettier), `pnpm test` (vitest). ESM, NodeNext, `.js` import specifiers, `*.dto.ts` at boundaries.
+- New files in existing packages need no CI config change; add ≥1 vitest spec per new TS module.
+- Bash scripts/hook are dev/runtime tooling, not CI-built; keep them `shellcheck`-clean.
+
+## 9. Acceptance criteria
+
+1. `REFLECTION_MODE` unset → hook is a strict no-op (`exit 0`, no file written). **(test)**
+2. With `REFLECTION_MODE=solo`, hook writes a schema-valid `reflection.v1` with correct mechanical
+   fields; self-report merged when `$REFLECTION_INPUT` present, `degraded:true` when absent.
+3. `evaluateRiskFloor` deterministic across all surfaces; unit-tested incl. auth/data/infra → review,
+   docs/test → no review, empty → `none`/no review.
+4. `reflection.v1` zod type + JSON Schema agree; sidecar validates against the schema.
+5. Phase-0 scripts run offline, print kill conditions, emit structured output, shellcheck-clean.
+6. `pnpm typecheck && pnpm lint && pnpm format:check && pnpm test` green; independent review passed.
diff --git a/docs/scratchpads/544-agent-reflection-loop.md b/docs/scratchpads/544-agent-reflection-loop.md
new file mode 100644
index 0000000..fd7f569
--- /dev/null
+++ b/docs/scratchpads/544-agent-reflection-loop.md
@@ -0,0 +1,55 @@
+# Scratchpad — #544 Agent Reflection Loop (durable kernel)
+
+**Started:** 2026-06-16 · **Branch:** `feat/agent-reflection-loop` · **Base:** `main` @ c461380
+
+## Goal
+
+Bake the durable kernel of the agent reflection loop into the Mosaic Stack
+monorepo through full delivery gates. Kernel only; closed loop (§7–§8) gated on
+Phase-0. Authoritative spec: `docs/plans/agent-reflection-loop-PRD.md`. Task
+breakdown: `docs/tasks/544-agent-reflection-loop.md`.
+
+## Timeline / decisions
+
+- Mapped house style against `main` truth (the earlier recon had mapped a dirty
+  feature branch and returned non-existent paths; re-cloned `main` clean).
+- macp uses co-located `*.spec.ts`; types uses `src/<mod>/{*.ts, *.dto.ts, __tests__/*.spec.ts}`.
+- zod v4 + class-validator/class-transformer present in `@mosaicstack/types`;
+  `packages/types/tsconfig.json` enables `experimentalDecorators`/`emitDecoratorMetadata`.
+- **Gotcha (fixed):** `class-transformer`'s `@Type` calls `Reflect.getMetadata`
+  at module-load time; the types vitest env has no `reflect-metadata`, so any test
+  importing the reflection barrel crashed on import. `chat.dto.ts` avoids this by
+  using class-validator only. Fix: dropped `@Type`/`@ValidateNested` from the DTO;
+  zod owns deep nested validation.
+- **Gotcha (fixed):** Stop hook `EXIT` trap referenced a `main`-local `lock` →
+  `unbound variable` under `set -u` at exit. Promoted to a global `LOCKFILE`.
+- **Gotcha (fixed):** the hook's own lock + `.mosaic/` scratch leaked into
+  `files_changed`. Excluded `^\.mosaic/` from the change-surface scan.
+
+## Verification evidence
+
+- macp: typecheck OK, lint OK, **88 tests pass** (15 new risk-floor).
+- types: typecheck OK, lint OK, **64 tests pass** (10 new reflection).
+- Root: `pnpm typecheck` (41 tasks), `pnpm lint` (23), `pnpm format:check`, `pnpm build` (23) — all green.
+- Stop hook smoke (throwaway git repo): TEST1 no-op (mode unset, 0 files);
+  TEST2 solo degraded, `.mosaic/` excluded, auth→needs_review; TEST3 self-report
+  merged, degraded=false; TEST4 lock suppresses re-fire. All pass, always exit 0.
+- shellcheck clean: hook + `reflect-{git-history,board-history,calibration}.sh`.
+- Phase-0 smoke: P2 on this repo (142 failures classified), P1 AUC=0.875 on a
+  synthetic fixture, P3 base-rate on a synthetic board — all emit structured output
+  and print their kill conditions.
+
+## Open risks / follow-ups
+
+- Full `pnpm test` (DB-bound packages) validated via CI's postgres service, not
+  locally; affected packages (macp, types) are DB-independent and green here.
+- sequential-thinking MCP was registered mid-session (effective next session);
+  this session compensated with the written PRD as the planning artifact.
+- Phase-0 corpora are not yet wired — scripts are harnesses + pre-registered
+  rubrics (P1/P2/P3 tasks tracked in jarvis-brain `agent-reflection-loop` project).
+
+## Gate status
+
+- [x] PRD authored · [x] issue #544 created + linked · [x] code + tests
+- [x] local gates green · [ ] independent code review · [ ] PR opened
+- [ ] CI terminal green · [ ] merged to main · [ ] issue closed
diff --git a/docs/tasks/544-agent-reflection-loop.md b/docs/tasks/544-agent-reflection-loop.md
new file mode 100644
index 0000000..4c07553
--- /dev/null
+++ b/docs/tasks/544-agent-reflection-loop.md
@@ -0,0 +1,67 @@
+# 544: Agent Reflection Loop — durable kernel
+
+**Issue:** [#544](http://git.mosaicstack.dev/mosaicstack/stack/issues/544)
+**PRD:** [`docs/plans/agent-reflection-loop-PRD.md`](../plans/agent-reflection-loop-PRD.md)
+**Branch:** `feat/agent-reflection-loop`
+
+## Context
+
+Build the **durable kernel** of the agent reflection loop: passive end-of-run
+capture of the doer's end-state as structured `reflection.v1` data, plus a
+deterministic diff **review risk-floor**. The closed calibration / skill-synthesis
+loop (design §7–§8) stays **gated** behind Phase-0 experiments P1/P2/P3 and is
+explicitly out of scope here. Source design: jarvis-brain
+`docs/planning/AGENT-REFLECTION-LOOP.md` (debate-hardened v2).
+
+Scope rule, non-goals, the full `reflection.v1` field list, and acceptance
+criteria live in the PRD. This file is the task breakdown + status.
+
+## Work items
+
+| #   | Item                                                  | Path                                                      | Status |
+| --- | ----------------------------------------------------- | --------------------------------------------------------- | ------ |
+| 1   | Diff risk-floor (pure, deterministic) + unit tests    | `packages/macp/src/risk-floor.ts`, `risk-floor.spec.ts`   | done   |
+| 2   | `reflection.v1` JSON Schema (documented contract)     | `packages/macp/src/schemas/reflection.v1.schema.json`     | done   |
+| 3   | `reflection.v1` zod schemas + self-report DTO + tests | `packages/types/src/reflection/*`                         | done   |
+| 4   | Stop hook (fail-closed capture)                       | `packages/mosaic/framework/tools/qa/reflect-stop-hook.sh` | done   |
+| 5   | Hook registration (`hooks.Stop`)                      | `packages/mosaic/framework/runtime/claude/settings.json`  | done   |
+| 6   | Phase-0 experiment harnesses (P1/P2/P3)               | `scripts/analysis/reflect-*.sh`                           | done   |
+
+## Design decisions (this implementation)
+
+- **Mechanical vs self-reported split.** A bash Stop hook cannot author the
+  agent's self-assessment, so it writes the mechanical fields (risk-floor verdict,
+  `files_changed`, ids, provenance) and merges an optional agent-supplied
+  `$REFLECTION_INPUT` self-report; absent/unreadable ⇒ those fields `null` and
+  `provenance.degraded = true`.
+- **Risk-floor authority.** `evaluateRiskFloor` (TS, tested) is the source of
+  truth. The hook ports the same surface table inline to avoid a node/build
+  dependency on the hook path; the two are documented as kept in sync.
+- **Hook registration deviation.** `settings-overlays/` has no merge mechanism
+  (docs-only), so a hooks overlay there would be inert. The Stop hook is
+  registered in the canonical `runtime/claude/settings.json` — the same file the
+  `mosaic` launcher reflects into `~/.claude/settings.json`. Still vendored in-repo.
+- **DTO without class-transformer.** `reflection.dto.ts` uses class-validator only
+  (no `@Type`), matching `chat.dto.ts`, so the module imports without a
+  `reflect-metadata` shim in the types-package test env. Deep nested validation is
+  owned by the zod `ReflectionSelfReportSchema` (the runtime authority the hook uses).
+- **`.mosaic/` excluded** from the change surface — it is agent scratch
+  (reflections, locks, self-report input), not part of the diff under review.
+
+## Verification
+
+- `pnpm --filter @mosaicstack/macp test` → 88 passed (15 new risk-floor).
+- `pnpm --filter @mosaicstack/types test` → 64 passed (10 new reflection).
+- Root `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, `pnpm build` → green.
+- Stop hook smoke: fail-closed no-op (mode unset), solo capture (degraded),
+  self-report merge (degraded=false), re-fire lock guard — all pass.
+- All bash (hook + 3 Phase-0 scripts) shellcheck-clean; Phase-0 scripts emit
+  structured JSON/markdown and print their pre-registered kill conditions.
+
+## Activation (post-merge, deployment concern — not a blocker)
+
+The Stop hook only activates when a launcher/profile sets
+`REFLECTION_MODE=solo|orchestrated`; unset/`off` is a strict no-op, so global
+registration is safe. `framework/install.sh` rsyncs the hook into
+`~/.config/mosaic/tools/qa/`, and the `mosaic` launcher reflects the updated
+`settings.json` (`hooks.Stop`) into `~/.claude/settings.json`.
diff --git a/packages/macp/src/index.ts b/packages/macp/src/index.ts
index 073c886..a510b9a 100644
--- a/packages/macp/src/index.ts
+++ b/packages/macp/src/index.ts
@@ -39,6 +39,11 @@ export { normalizeGate, runShell, countAIFindings, runGate, runGates } from './g
 
 export type { NormalizedGate } from './gate-runner.js';
 
+// Risk-floor (agent reflection loop — diff review classifier)
+export { evaluateRiskFloor, DEFAULT_RISK_THRESHOLD } from './risk-floor.js';
+
+export type { ReviewSurface, RiskFloorInput, RiskFloorVerdict } from './risk-floor.js';
+
 // Event emitter
 export { nowISO, appendEvent, emitEvent } from './event-emitter.js';
 
diff --git a/packages/macp/src/risk-floor.spec.ts b/packages/macp/src/risk-floor.spec.ts
new file mode 100644
index 0000000..32e3ac8
--- /dev/null
+++ b/packages/macp/src/risk-floor.spec.ts
@@ -0,0 +1,87 @@
+import { describe, expect, it } from 'vitest';
+
+import { DEFAULT_RISK_THRESHOLD, evaluateRiskFloor, type ReviewSurface } from './risk-floor.js';
+
+describe('evaluateRiskFloor', () => {
+  it('returns a no-review "none" verdict for an empty diff', () => {
+    const v = evaluateRiskFloor({ filesChanged: [] });
+    expect(v).toEqual({
+      needs_review: false,
+      score: 0,
+      surface: 'none',
+      reason: 'no files changed',
+    });
+  });
+
+  it('ignores empty/non-string entries', () => {
+    const v = evaluateRiskFloor({ filesChanged: ['', 42 as unknown as string, '   '] });
+    // '' and 42 are dropped by the internal filter; the whitespace path classifies to none
+    expect(v.surface).toBe('none');
+    expect(v.needs_review).toBe(false);
+  });
+
+  it.each<[string, string, ReviewSurface, boolean]>([
+    ['auth', 'apps/api/src/auth/session.guard.ts', 'auth', true],
+    ['data', 'packages/db/migrations/0007_add_users.sql', 'data', true],
+    ['infra', '.woodpecker/deploy.yml', 'infra', true],
+    ['build', 'packages/types/tsconfig.json', 'build', true],
+    ['ui', 'apps/web/src/components/Button.tsx', 'ui', false],
+    ['test', 'packages/macp/src/risk-floor.spec.ts', 'test', false],
+    ['docs', 'docs/plans/agent-reflection-loop-PRD.md', 'docs', false],
+    ['none', 'README', 'none', false],
+  ])(
+    'classifies a single %s file (%s) → surface=%s needs_review=%s',
+    (_label, file, surface, needsReview) => {
+      const v = evaluateRiskFloor({ filesChanged: [file] });
+      expect(v.surface).toBe(surface);
+      expect(v.needs_review).toBe(needsReview);
+      expect(v.reason).toContain(
+        file === 'README' ? 'no sensitive surface' : surface === 'none' ? '' : surface,
+      );
+    },
+  );
+
+  it('lets the highest-risk surface dominate a mixed diff', () => {
+    const v = evaluateRiskFloor({
+      filesChanged: [
+        'docs/readme.md',
+        'apps/web/src/components/Nav.tsx',
+        'apps/api/src/auth/token.service.ts',
+      ],
+    });
+    expect(v.surface).toBe('auth');
+    expect(v.score).toBe(1.0);
+    expect(v.needs_review).toBe(true);
+    expect(v.reason).toContain('token.service.ts');
+    expect(v.reason).not.toContain('readme.md');
+  });
+
+  it('names every file that ties at the dominant surface', () => {
+    const v = evaluateRiskFloor({
+      filesChanged: ['src/login.ts', 'src/permission-check.ts'],
+    });
+    expect(v.surface).toBe('auth');
+    expect(v.reason).toContain('src/login.ts');
+    expect(v.reason).toContain('src/permission-check.ts');
+  });
+
+  it('treats docs+test-only diffs as below the floor', () => {
+    const v = evaluateRiskFloor({
+      filesChanged: ['docs/guide.md', 'packages/x/src/x.test.ts'],
+    });
+    expect(v.needs_review).toBe(false);
+    expect(v.surface).toBe('test'); // higher weight than docs
+  });
+
+  it('honors a custom threshold', () => {
+    const docsOnly = { filesChanged: ['docs/guide.md'] };
+    expect(evaluateRiskFloor(docsOnly, 0.05).needs_review).toBe(true);
+    expect(evaluateRiskFloor(docsOnly, DEFAULT_RISK_THRESHOLD).needs_review).toBe(false);
+  });
+
+  it('is deterministic across call order', () => {
+    const a = evaluateRiskFloor({ filesChanged: ['a.md', 'auth/x.ts', 'b.tsx'] });
+    const b = evaluateRiskFloor({ filesChanged: ['b.tsx', 'a.md', 'auth/x.ts'] });
+    expect(a).toEqual(b);
+  });
+});
diff --git a/packages/macp/src/risk-floor.ts b/packages/macp/src/risk-floor.ts
new file mode 100644
index 0000000..5a87d5f
--- /dev/null
+++ b/packages/macp/src/risk-floor.ts
@@ -0,0 +1,138 @@
+/**
+ * Diff risk-floor — deterministic review-need classifier.
+ *
+ * Given the set of changed files in a diff, derive a *minimum* review
+ * requirement ("floor") from the change surface. This is the mechanical half
+ * of the agent reflection loop (design §6): risky surfaces (auth, data, infra)
+ * trip a review requirement regardless of what the agent self-reports.
+ *
+ * Precedence (authoritative ordering, see design §5):
+ *   CI/tests  >  human merge  >  reviewer verdict  >  self-reflection
+ * This module sits at the *floor*. It NEVER overrides CI or a human; a
+ * `needs_review: false` verdict means "no surface tripped the floor", not
+ * "safe to merge". Consumers MUST keep CI/tests authoritative above it.
+ *
+ * Pure and deterministic: no IO, no clock, no randomness. Same input → same
+ * verdict. Safe to call from a Stop hook via `node -e` or to port inline.
+ */
+
+/** Review surfaces, ordered most- to least-sensitive. */
+export type ReviewSurface = 'auth' | 'data' | 'infra' | 'build' | 'ui' | 'test' | 'docs' | 'none';
+
+export interface RiskFloorInput {
+  /** Paths of changed files, repo-relative. Order does not affect score or surface. */
+  filesChanged: string[];
+  /** Optional diff size signals; reserved for future weighting. */
+  insertions?: number;
+  deletions?: number;
+}
+
+export interface RiskFloorVerdict {
+  /** True when the change surface meets/exceeds the review threshold. */
+  needs_review: boolean;
+  /** Aggregate risk score in [0, 1] — the max surface weight across files. */
+  score: number;
+  /** The dominant (highest-weight) surface across all changed files. */
+  surface: ReviewSurface;
+  /** Human-readable explanation naming the surface and tripping files. */
+  reason: string;
+}
+
+/** Default review threshold; `score >= THRESHOLD` ⇒ `needs_review`. */
+export const DEFAULT_RISK_THRESHOLD = 0.5;
+
+interface SurfaceRule {
+  surface: ReviewSurface;
+  weight: number;
+  /** Case-insensitive regex matched against the file path. */
+  pattern: RegExp;
+}
+
+/**
+ * Surface classification rules, evaluated highest-weight first. The first
+ * rule whose pattern matches a path classifies that file; the file's surface
+ * is the highest-risk surface it matches (rules are pre-sorted by weight).
+ */
+const SURFACE_RULES: readonly SurfaceRule[] = [
+  {
+    surface: 'auth',
+    weight: 1.0,
+    pattern: /auth|login|session|token|permission|rbac|credential|secret/i,
+  },
+  {
+    surface: 'data',
+    weight: 0.9,
+    pattern: /migration|prisma|schema|\.sql|entity|repository|seed/i,
+  },
+  {
+    surface: 'infra',
+    weight: 0.85,
+    pattern: /docker|\.woodpecker|compose|traefik|deploy|helm|k8s|terraform/i,
+  },
+  {
+    surface: 'build',
+    weight: 0.6,
+    pattern: /package\.json|tsconfig|turbo\.json|pnpm-|\.config\.|eslint|vite/i,
+  },
+  { surface: 'ui', weight: 0.4, pattern: /\.tsx|\.css|components\/|apps\/web\//i },
+  { surface: 'test', weight: 0.2, pattern: /\.spec\.|\.test\.|__tests__\//i },
+  { surface: 'docs', weight: 0.1, pattern: /\.md$|docs\//i },
+];
+
+const NONE_WEIGHT = 0.0;
+
+/** Classify a single path to its highest-risk surface and weight. */
+function classify(path: string): { surface: ReviewSurface; weight: number } {
+  for (const rule of SURFACE_RULES) {
+    if (rule.pattern.test(path)) {
+      return { surface: rule.surface, weight: rule.weight };
+    }
+  }
+  return { surface: 'none', weight: NONE_WEIGHT };
+}
+
+/**
+ * Evaluate the review risk-floor for a diff.
+ *
+ * @param input         changed files (+ optional size signals)
+ * @param threshold     review cutoff; defaults to {@link DEFAULT_RISK_THRESHOLD}
+ */
+export function evaluateRiskFloor(
+  input: RiskFloorInput,
+  threshold: number = DEFAULT_RISK_THRESHOLD,
+): RiskFloorVerdict {
+  const files = (input.filesChanged ?? []).filter((f) => typeof f === 'string' && f.length > 0);
+
+  if (files.length === 0) {
+    return {
+      needs_review: false,
+      score: 0,
+      surface: 'none',
+      reason: 'no files changed',
+    };
+  }
+
+  let topSurface: ReviewSurface = 'none';
+  let topWeight = NONE_WEIGHT;
+  const tripping: string[] = [];
+
+  for (const file of files) {
+    const { surface, weight } = classify(file);
+    if (weight > topWeight) {
+      topWeight = weight;
+      topSurface = surface;
+      tripping.length = 0;
+      tripping.push(file);
+    } else if (weight === topWeight && surface === topSurface && surface !== 'none') {
+      tripping.push(file);
+    }
+  }
+
+  const needs_review = topWeight >= threshold;
+  const reason =
+    topSurface === 'none'
+      ? `no sensitive surface in ${files.length} changed file(s)`
+      : `${topSurface} surface (weight ${topWeight}) in: ${tripping.join(', ')}`;
+
+  return { needs_review, score: topWeight, surface: topSurface, reason };
+}
diff --git a/packages/macp/src/schemas/reflection.v1.schema.json b/packages/macp/src/schemas/reflection.v1.schema.json
new file mode 100644
index 0000000..a320411
--- /dev/null
+++ b/packages/macp/src/schemas/reflection.v1.schema.json
@@ -0,0 +1,105 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://mosaicstack.dev/schemas/reflection/reflection.v1.schema.json",
+  "title": "Agent Reflection (v1)",
+  "description": "End-of-run reflection sidecar. Mechanical fields are written by the Stop hook; self-reported fields are merged from an optional agent-supplied input and are null when absent (provenance.degraded=true).",
+  "type": "object",
+  "required": [
+    "schema",
+    "task_ref",
+    "agent",
+    "session_id",
+    "timestamp",
+    "repo",
+    "risk",
+    "files_changed",
+    "provenance"
+  ],
+  "properties": {
+    "schema": {
+      "const": "reflection.v1"
+    },
+    "task_ref": {
+      "type": "string",
+      "description": "Canonical task ref; derived from REFLECTION_TASK_REF or repo+branch."
+    },
+    "agent": {
+      "type": "string",
+      "description": "Persona/runtime id (REFLECTION_AGENT or 'unknown')."
+    },
+    "session_id": {
+      "type": "string",
+      "description": "From the Stop payload session_id, else 'unknown'."
+    },
+    "timestamp": {
+      "type": "string",
+      "format": "date-time",
+      "description": "ISO-8601 UTC capture time."
+    },
+    "repo": {
+      "type": "string",
+      "description": "Repo root basename."
+    },
+    "confidence": {
+      "type": ["number", "null"],
+      "minimum": 0,
+      "maximum": 1,
+      "description": "SELF-REPORTED. Agent's overall confidence; null when not supplied."
+    },
+    "most_likely_wrong": {
+      "type": ["object", "null"],
+      "description": "SELF-REPORTED. The single most-likely way the work is wrong.",
+      "required": ["surface", "description"],
+      "properties": {
+        "surface": { "$ref": "#/$defs/surface" },
+        "description": { "type": "string" }
+      },
+      "additionalProperties": false
+    },
+    "known_not_in_diff": {
+      "type": ["string", "null"],
+      "description": "SELF-REPORTED. What the agent knows that isn't visible in the diff."
+    },
+    "risk": {
+      "type": "object",
+      "description": "MECHANICAL. Output of the diff risk-floor.",
+      "required": ["needs_review", "score", "surface", "reason"],
+      "properties": {
+        "needs_review": { "type": "boolean" },
+        "score": { "type": "number", "minimum": 0, "maximum": 1 },
+        "surface": { "$ref": "#/$defs/surface" },
+        "reason": { "type": "string" }
+      },
+      "additionalProperties": false
+    },
+    "files_changed": {
+      "type": "array",
+      "items": { "type": "string" },
+      "description": "MECHANICAL. git diff name-only."
+    },
+    "provenance": {
+      "type": "object",
+      "required": ["source", "reflection_attempt", "degraded", "reflection_mode"],
+      "properties": {
+        "source": { "const": "stop-hook" },
+        "reflection_attempt": { "type": "integer", "minimum": 1 },
+        "degraded": {
+          "type": "boolean",
+          "description": "True when self-report inputs were missing/unreadable."
+        },
+        "reflection_mode": {
+          "type": "string",
+          "enum": ["off", "solo", "orchestrated"]
+        }
+      },
+      "additionalProperties": false
+    }
+  },
+  "additionalProperties": false,
+  "$defs": {
+    "surface": {
+      "type": "string",
+      "enum": ["auth", "data", "infra", "build", "ui", "test", "docs", "none"]
+    }
+  }
+}
diff --git a/packages/mosaic/framework/runtime/claude/settings.json b/packages/mosaic/framework/runtime/claude/settings.json
index 557fcbc..0318d9e 100644
--- a/packages/mosaic/framework/runtime/claude/settings.json
+++ b/packages/mosaic/framework/runtime/claude/settings.json
@@ -34,6 +34,17 @@
           }
         ]
       }
+    ],
+    "Stop": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "~/.config/mosaic/tools/qa/reflect-stop-hook.sh",
+            "timeout": 15
+          }
+        ]
+      }
     ]
   },
   "enabledPlugins": {
diff --git a/packages/mosaic/framework/tools/qa/reflect-stop-hook.sh b/packages/mosaic/framework/tools/qa/reflect-stop-hook.sh
new file mode 100755
index 0000000..41fbd2d
--- /dev/null
+++ b/packages/mosaic/framework/tools/qa/reflect-stop-hook.sh
@@ -0,0 +1,197 @@
+#!/usr/bin/env bash
+# reflect-stop-hook.sh — Stop hook (agent reflection loop, durable kernel)
+#
+# At end-of-run, capture the doer's end-state as a structured `reflection.v1`
+# sidecar: the mechanical diff risk-floor plus any self-report the agent left
+# behind. This is the passive capture half of the design (§10 step 1). It does
+# NOT route, score, or gate — it only writes the sidecar; pickup is future work.
+#
+# FAIL-CLOSED: unless REFLECTION_MODE is exactly "solo" or "orchestrated"
+# (unset, "off", or any other value), this is a strict no-op. Global
+# registration is therefore safe; only a launcher/profile can opt in.
+#
+# NON-BLOCKING: this hook is purely observational. It NEVER emits a
+# `decision` field and ALWAYS exits 0, so it cannot block or fail a session.
+#
+# Environment contract:
+#   REFLECTION_MODE            off|solo|orchestrated   (default: off → no-op)
+#   REFLECTION_DIR             output dir              (default: <repo>/.mosaic/reflections)
+#   REFLECTION_INPUT           self-report JSON        (default: <repo>/.mosaic/reflection-input.json)
+#   REFLECTION_TASK_REF        canonical task ref      (default: <repo>#<branch>)
+#   REFLECTION_AGENT           persona/runtime id      (default: unknown)
+#   REFLECTION_RISK_THRESHOLD  review cutoff [0,1]     (default: 0.5)
+#
+# Risk-floor surface table is kept in sync with the authoritative TS
+# implementation at packages/macp/src/risk-floor.ts (evaluateRiskFloor).
+#
+# Exit codes: always 0 (observational hook).
+
+set -euo pipefail
+
+# ---- fail-closed gate -------------------------------------------------------
+MODE="${REFLECTION_MODE:-off}"
+if [[ "$MODE" != "solo" && "$MODE" != "orchestrated" ]]; then
+  exit 0
+fi
+
+# Read the Stop payload (best-effort; never required).
+INPUT="$(cat || true)"
+
+# Sentinel lock path (global so the EXIT trap can clean it after main returns).
+LOCKFILE=""
+trap 'rm -f "${LOCKFILE:-}" 2>/dev/null || true' EXIT
+
+main() {
+  command -v jq >/dev/null 2>&1 || return 0   # no jq → silently no-op
+
+  local session_id payload_cwd repo_dir repo_name branch task_ref agent
+  session_id="$(printf '%s' "$INPUT" | jq -r '.session_id // "unknown"' 2>/dev/null || echo unknown)"
+  # Sanitize: session_id is interpolated into file/lock paths — allow safe
+  # filename chars only (defends against ../ or / in the payload).
+  session_id="${session_id//[^a-zA-Z0-9_-]/}"
+  session_id="${session_id:-unknown}"
+  payload_cwd="$(printf '%s' "$INPUT" | jq -r '.cwd // empty' 2>/dev/null || true)"
+
+  # Resolve repo root: prefer git toplevel from the payload cwd, else PWD.
+  local start_dir="${payload_cwd:-$PWD}"
+  repo_dir="$(git -C "$start_dir" rev-parse --show-toplevel 2>/dev/null || echo "$start_dir")"
+  repo_name="$(basename "$repo_dir")"
+  branch="$(git -C "$repo_dir" rev-parse --abbrev-ref HEAD 2>/dev/null || echo detached)"
+
+  task_ref="${REFLECTION_TASK_REF:-${repo_name}#${branch}}"
+  agent="${REFLECTION_AGENT:-unknown}"
+
+  # ---- sentinel guard: avoid re-fire loops --------------------------------
+  local out_dir lock
+  out_dir="${REFLECTION_DIR:-${repo_dir}/.mosaic/reflections}"
+  mkdir -p "$out_dir" 2>/dev/null || return 0
+  lock="${out_dir}/.${session_id}.lock"
+  if [[ -e "$lock" ]]; then
+    return 0
+  fi
+  : > "$lock" 2>/dev/null || true
+  LOCKFILE="$lock"
+
+  # ---- mechanical: changed files ------------------------------------------
+  # Already-committed changes are out of scope; capture the working surface
+  # relative to HEAD: staged + unstaged + untracked, best-effort.
+  # Exclude .mosaic/ (agent scratch: reflections, locks, self-report input) —
+  # it is tooling state, not part of the diff under review.
+  local files
+  files="$(
+    {
+      git -C "$repo_dir" diff --name-only HEAD 2>/dev/null || true
+      git -C "$repo_dir" diff --name-only --staged 2>/dev/null || true
+      git -C "$repo_dir" ls-files --others --exclude-standard 2>/dev/null || true
+    } | sed '/^$/d' | grep -v '^\.mosaic/' | sort -u || true
+  )"
+
+  # ---- mechanical: risk-floor (inline port of evaluateRiskFloor) ----------
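+  local threshold="${REFLECTION_RISK_THRESHOLD:-0.5}"; [[ "$threshold" =~ ^[0-9]*\.?[0-9]+$ ]] || threshold="0.5"  # spliced into awk below, so numeric only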
+  local top_surface="none" top_weight="0.0" tripping=""
+  local f surface weight
+  while IFS= read -r f; do
+    [[ -z "$f" ]] && continue
+    surface="$(classify_surface "$f")"
+    weight="$(surface_weight "$surface")"
+    if awk "BEGIN{exit !($weight > $top_weight)}"; then
+      top_weight="$weight"; top_surface="$surface"; tripping="$f"
+    elif [[ "$surface" == "$top_surface" && "$surface" != "none" ]] && awk "BEGIN{exit !($weight == $top_weight)}"; then
+      tripping="${tripping:+$tripping, }$f"
+    fi
+  done <<< "$files"
+
+  local needs_review reason file_count
+  file_count="$(printf '%s\n' "$files" | sed '/^$/d' | wc -l | tr -d ' ')"
+  if awk "BEGIN{exit !($top_weight >= $threshold)}"; then needs_review=true; else needs_review=false; fi
+  if [[ "$top_surface" == "none" ]]; then
+    if [[ "$file_count" -eq 0 ]]; then reason="no files changed"; else reason="no sensitive surface in ${file_count} changed file(s)"; fi
+  else
+    reason="${top_surface} surface (weight ${top_weight}) in: ${tripping}"
+  fi
+
+  # ---- self-report merge (optional) ---------------------------------------
+  local input_file degraded self_json
+  input_file="${REFLECTION_INPUT:-${repo_dir}/.mosaic/reflection-input.json}"
+  degraded=true
+  self_json='{"confidence":null,"most_likely_wrong":null,"known_not_in_diff":null}'
+  if [[ -r "$input_file" ]] && jq -e 'type == "object"' "$input_file" >/dev/null 2>&1; then
+    self_json="$(jq '{
+      confidence: (.confidence // null),
+      most_likely_wrong: (.most_likely_wrong // null),
+      known_not_in_diff: (.known_not_in_diff // null)
+    }' "$input_file" 2>/dev/null || echo "$self_json")"
+    degraded=false
+  fi
+
+  # ---- assemble + atomic write --------------------------------------------
+  local ts files_json record tmp final
+  ts="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
+  files_json="$(printf '%s\n' "$files" | jq -R . | jq -s 'map(select(length>0))')"
+
+  record="$(jq -n \
+    --arg task_ref "$task_ref" \
+    --arg agent "$agent" \
+    --arg session_id "$session_id" \
+    --arg ts "$ts" \
+    --arg repo "$repo_name" \
+    --argjson needs_review "$needs_review" \
+    --argjson score "$top_weight" \
+    --arg surface "$top_surface" \
+    --arg reason "$reason" \
+    --argjson files "$files_json" \
+    --argjson self "$self_json" \
+    --argjson degraded "$degraded" \
+    --arg mode "$MODE" \
+    '{
+      schema: "reflection.v1",
+      task_ref: $task_ref,
+      agent: $agent,
+      session_id: $session_id,
+      timestamp: $ts,
+      repo: $repo,
+      confidence: $self.confidence,
+      most_likely_wrong: $self.most_likely_wrong,
+      known_not_in_diff: $self.known_not_in_diff,
+      risk: { needs_review: $needs_review, score: $score, surface: $surface, reason: $reason },
+      files_changed: $files,
+      provenance: { source: "stop-hook", reflection_attempt: 1, degraded: $degraded, reflection_mode: $mode }
+    }' 2>/dev/null || true)"
+
+  [[ -z "$record" ]] && return 0
+
+  final="${out_dir}/${session_id}-${ts//[:]/}.reflection.json"
+  tmp="${final}.tmp"
+  printf '%s\n' "$record" > "$tmp" 2>/dev/null || return 0
+  mv -f "$tmp" "$final" 2>/dev/null || true
+}
+
+# classify_surface PATH → surface name (highest-risk match wins, mirrors TS)
+classify_surface() {
+  local p="$1"
+  if printf '%s' "$p" | grep -qiE 'auth|login|session|token|permission|rbac|credential|secret'; then echo auth; return; fi
+  if printf '%s' "$p" | grep -qiE 'migration|prisma|schema|\.sql|entity|repository|seed'; then echo data; return; fi
+  if printf '%s' "$p" | grep -qiE 'docker|\.woodpecker|compose|traefik|deploy|helm|k8s|terraform'; then echo infra; return; fi
+  if printf '%s' "$p" | grep -qiE 'package\.json|tsconfig|turbo\.json|pnpm-|\.config\.|eslint|vite'; then echo build; return; fi
+  if printf '%s' "$p" | grep -qE '\.tsx|\.css|components/|apps/web/'; then echo ui; return; fi
+  if printf '%s' "$p" | grep -qE '\.spec\.|\.test\.|__tests__/'; then echo test; return; fi
+  if printf '%s' "$p" | grep -qE '\.md$|docs/'; then echo docs; return; fi
+  echo none
+}
+
+# surface_weight SURFACE → numeric weight (mirrors TS SURFACE_RULES)
+surface_weight() {
+  case "$1" in
+    auth) echo 1.0 ;;
+    data) echo 0.9 ;;
+    infra) echo 0.85 ;;
+    build) echo 0.6 ;;
+    ui) echo 0.4 ;;
+    test) echo 0.2 ;;
+    docs) echo 0.1 ;;
+    *) echo 0.0 ;;
+  esac
+}
+
+main || true
+exit 0
diff --git a/packages/types/src/index.ts b/packages/types/src/index.ts
index 49ae520..d35b52c 100644
--- a/packages/types/src/index.ts
+++ b/packages/types/src/index.ts
@@ -6,3 +6,4 @@ export * from './provider/index.js';
 export * from './routing/index.js';
 export * from './commands/index.js';
 export * from './federation/index.js';
+export * from './reflection/index.js';
diff --git a/packages/types/src/reflection/__tests__/reflection.spec.ts b/packages/types/src/reflection/__tests__/reflection.spec.ts
new file mode 100644
index 0000000..6d6ff54
--- /dev/null
+++ b/packages/types/src/reflection/__tests__/reflection.spec.ts
@@ -0,0 +1,146 @@
+/**
+ * Unit tests for the reflection.v1 schema + self-report boundary.
+ *
+ * The runtime source of truth is the zod schema set in `reflection.ts`. The
+ * class-validator `ReflectionSelfReportDto` is the NestJS-side boundary type
+ * (exercised under the gateway app's reflect-metadata runtime, mirroring how
+ * `chat.dto.ts` is tested in apps/gateway); here we validate the self-report
+ * input with its zod counterpart; the shell Stop hook itself runs neither.
+ *
+ * Coverage:
+ *  - REVIEW_SURFACES canonical ordering (the enum both zod + JSON Schema mirror)
+ *  - ReflectionV1Schema accepts a fully-populated record
+ *  - ReflectionV1Schema accepts a degraded record (self-report fields null)
+ *  - ReflectionV1Schema rejects bad schema literal / out-of-range confidence / bad surface
+ *  - ReflectionSelfReportSchema accepts valid + empty, rejects bad input
+ */
+
+import { describe, expect, it } from 'vitest';
+
+import {
+  REVIEW_SURFACES,
+  ReflectionV1Schema,
+  ReflectionSelfReportSchema,
+  type ReflectionV1,
+} from '../index.js';
+
+const baseMechanical = {
+  schema: 'reflection.v1' as const,
+  task_ref: 'stack#544',
+  agent: 'claude',
+  session_id: 'sess-abc',
+  timestamp: '2026-06-16T00:00:00.000Z',
+  repo: 'stack',
+  risk: {
+    needs_review: true,
+    score: 1.0,
+    surface: 'auth' as const,
+    reason: 'auth surface (weight 1) in: src/auth.ts',
+  },
+  files_changed: ['src/auth.ts'],
+  provenance: {
+    source: 'stop-hook' as const,
+    reflection_attempt: 1,
+    degraded: false,
+    reflection_mode: 'solo' as const,
+  },
+};
+
+describe('REVIEW_SURFACES', () => {
+  it('keeps the canonical most→least-sensitive ordering', () => {
+    expect(REVIEW_SURFACES).toEqual([
+      'auth',
+      'data',
+      'infra',
+      'build',
+      'ui',
+      'test',
+      'docs',
+      'none',
+    ]);
+  });
+});
+
+describe('ReflectionV1Schema', () => {
+  it('accepts a fully-populated record', () => {
+    const rec: ReflectionV1 = {
+      ...baseMechanical,
+      confidence: 0.7,
+      most_likely_wrong: { surface: 'auth', description: 'token refresh untested' },
+      known_not_in_diff: 'manual QA only on the happy path',
+    };
+    expect(() => ReflectionV1Schema.parse(rec)).not.toThrow();
+  });
+
+  it('accepts a degraded record with null self-report fields', () => {
+    const rec: ReflectionV1 = {
+      ...baseMechanical,
+      confidence: null,
+      most_likely_wrong: null,
+      known_not_in_diff: null,
+      provenance: { ...baseMechanical.provenance, degraded: true },
+    };
+    expect(() => ReflectionV1Schema.parse(rec)).not.toThrow();
+  });
+
+  it('rejects a wrong schema literal', () => {
+    const bad = {
+      ...baseMechanical,
+      schema: 'reflection.v2',
+      confidence: null,
+      most_likely_wrong: null,
+      known_not_in_diff: null,
+    };
+    expect(() => ReflectionV1Schema.parse(bad)).toThrow();
+  });
+
+  it('rejects out-of-range confidence', () => {
+    const bad = {
+      ...baseMechanical,
+      confidence: 1.5,
+      most_likely_wrong: null,
+      known_not_in_diff: null,
+    };
+    expect(() => ReflectionV1Schema.parse(bad)).toThrow();
+  });
+
+  it('rejects an unknown surface', () => {
+    const bad = {
+      ...baseMechanical,
+      risk: { ...baseMechanical.risk, surface: 'network' },
+      confidence: null,
+      most_likely_wrong: null,
+      known_not_in_diff: null,
+    };
+    expect(() => ReflectionV1Schema.parse(bad)).toThrow();
+  });
+});
+
+describe('ReflectionSelfReportSchema', () => {
+  it('accepts a valid self-report', () => {
+    const ok = ReflectionSelfReportSchema.safeParse({
+      confidence: 0.8,
+      most_likely_wrong: {
+        surface: 'data',
+        description: 'migration not run against prod-sized data',
+      },
+      known_not_in_diff: 'rollback path untested',
+    });
+    expect(ok.success).toBe(true);
+  });
+
+  it('accepts an empty self-report (all optional)', () => {
+    expect(ReflectionSelfReportSchema.safeParse({}).success).toBe(true);
+  });
+
+  it('rejects confidence above 1', () => {
+    expect(ReflectionSelfReportSchema.safeParse({ confidence: 2 }).success).toBe(false);
+  });
+
+  it('rejects an unknown most_likely_wrong.surface', () => {
+    const res = ReflectionSelfReportSchema.safeParse({
+      most_likely_wrong: { surface: 'network', description: 'x' },
+    });
+    expect(res.success).toBe(false);
+  });
+});
diff --git a/packages/types/src/reflection/index.ts b/packages/types/src/reflection/index.ts
new file mode 100644
index 0000000..67f9f6e
--- /dev/null
+++ b/packages/types/src/reflection/index.ts
@@ -0,0 +1,30 @@
+/**
+ * Agent reflection (v1) — public barrel.
+ *
+ * reflection.ts      — zod schemas (runtime source of truth) + inferred types
+ * reflection.dto.ts  — class-validator DTO for the agent self-report input
+ */
+
+export {
+  REVIEW_SURFACES,
+  ReviewSurfaceSchema,
+  MostLikelyWrongSchema,
+  ReflectionRiskSchema,
+  ReflectionModeSchema,
+  ReflectionProvenanceSchema,
+  ReflectionSelfReportSchema,
+  ReflectionV1Schema,
+  REFLECTION_SCHEMA_ID,
+} from './reflection.js';
+
+export type {
+  ReviewSurface,
+  MostLikelyWrong,
+  ReflectionRisk,
+  ReflectionMode,
+  ReflectionProvenance,
+  ReflectionSelfReport,
+  ReflectionV1,
+} from './reflection.js';
+
+export { MostLikelyWrongDto, ReflectionSelfReportDto } from './reflection.dto.js';
diff --git a/packages/types/src/reflection/reflection.dto.ts b/packages/types/src/reflection/reflection.dto.ts
new file mode 100644
index 0000000..9f63bbf
--- /dev/null
+++ b/packages/types/src/reflection/reflection.dto.ts
@@ -0,0 +1,55 @@
+/**
+ * Reflection self-report DTO — class-validator boundary.
+ *
+ * Validates the agent-supplied self-report input (the optional
+ * `$REFLECTION_INPUT` file, default `<repo>/.mosaic/reflection-input.json`)
+ * before it is merged into a `reflection.v1` record. This is the only
+ * externally-authored input on the reflection path, so it gets a DTO per the
+ * Mosaic module-boundary rule.
+ *
+ * Class-validator only (no class-transformer `@Type`) — matching `chat.dto.ts`
+ * — so the module is safe to import without a `reflect-metadata` shim. Deep
+ * nested validation of `most_likely_wrong` is owned by the zod
+ * `ReflectionSelfReportSchema` in `reflection.ts`. The shell Stop hook runs
+ * neither validator; it copies the three fields through with `jq` as-is.
+ */
+
+import {
+  IsIn,
+  IsNumber,
+  IsObject,
+  IsOptional,
+  IsString,
+  Max,
+  Min,
+  MaxLength,
+} from 'class-validator';
+
+import { REVIEW_SURFACES } from './reflection.js';
+
+/** Shape of `most_likely_wrong`; validated structurally by zod at runtime. */
+export class MostLikelyWrongDto {
+  @IsIn(REVIEW_SURFACES as unknown as string[])
+  surface!: string;
+
+  @IsString()
+  @MaxLength(4_000)
+  description!: string;
+}
+
+export class ReflectionSelfReportDto {
+  @IsOptional()
+  @IsNumber()
+  @Min(0)
+  @Max(1)
+  confidence?: number;
+
+  @IsOptional()
+  @IsObject()
+  most_likely_wrong?: MostLikelyWrongDto;
+
+  @IsOptional()
+  @IsString()
+  @MaxLength(8_000)
+  known_not_in_diff?: string;
+}
diff --git a/packages/types/src/reflection/reflection.ts b/packages/types/src/reflection/reflection.ts
new file mode 100644
index 0000000..0d4bdae
--- /dev/null
+++ b/packages/types/src/reflection/reflection.ts
@@ -0,0 +1,90 @@
+/**
+ * Agent reflection (v1) — wire schema.
+ *
+ * Runtime source of truth for the `reflection.v1` sidecar emitted at end-of-run
+ * by the Stop hook (design §10 step 1). The JSON Schema artifact at
+ * `@mosaicstack/macp` `src/schemas/reflection.v1.schema.json` is the documented
+ * contract; this zod schema is the executable one and MUST agree with it.
+ *
+ * Field provenance:
+ *   - MECHANICAL  (risk, files_changed, ids, provenance): written by the hook.
+ *   - SELF-REPORTED (confidence, most_likely_wrong, known_not_in_diff): merged
+ *     from an optional agent-supplied input; null when absent.
+ *
+ * Pure — no NestJS, no DB, no Node-only APIs. Safe for browser/edge.
+ */
+
+import { z } from 'zod';
+
+/** Review surfaces, ordered most- to least-sensitive. Mirrors macp risk-floor. */
+export const REVIEW_SURFACES = [
+  'auth',
+  'data',
+  'infra',
+  'build',
+  'ui',
+  'test',
+  'docs',
+  'none',
+] as const;
+
+export const ReviewSurfaceSchema = z.enum(REVIEW_SURFACES);
+export type ReviewSurface = z.infer<typeof ReviewSurfaceSchema>;
+
+/** SELF-REPORTED: the single most-likely way the work is wrong. */
+export const MostLikelyWrongSchema = z.object({
+  surface: ReviewSurfaceSchema,
+  description: z.string(),
+});
+export type MostLikelyWrong = z.infer<typeof MostLikelyWrongSchema>;
+
+/** MECHANICAL: output of the diff risk-floor (see `@mosaicstack/macp`). */
+export const ReflectionRiskSchema = z.object({
+  needs_review: z.boolean(),
+  score: z.number().min(0).max(1),
+  surface: ReviewSurfaceSchema,
+  reason: z.string(),
+});
+export type ReflectionRisk = z.infer<typeof ReflectionRiskSchema>;
+
+export const ReflectionModeSchema = z.enum(['off', 'solo', 'orchestrated']);
+export type ReflectionMode = z.infer<typeof ReflectionModeSchema>;
+
+export const ReflectionProvenanceSchema = z.object({
+  source: z.literal('stop-hook'),
+  reflection_attempt: z.number().int().min(1),
+  degraded: z.boolean(),
+  reflection_mode: ReflectionModeSchema,
+});
+export type ReflectionProvenance = z.infer<typeof ReflectionProvenanceSchema>;
+
+/**
+ * The self-reported half of a reflection. Supplied by the agent out-of-band
+ * (e.g. `<repo>/.mosaic/reflection-input.json`) and merged by the hook. All
+ * fields optional; missing fields become `null` in the assembled record.
+ */
+export const ReflectionSelfReportSchema = z.object({
+  confidence: z.number().min(0).max(1).nullable().optional(),
+  most_likely_wrong: MostLikelyWrongSchema.nullable().optional(),
+  known_not_in_diff: z.string().nullable().optional(),
+});
+export type ReflectionSelfReport = z.infer<typeof ReflectionSelfReportSchema>;
+
+/** The full assembled `reflection.v1` sidecar. */
+export const ReflectionV1Schema = z.object({
+  schema: z.literal('reflection.v1'),
+  task_ref: z.string(),
+  agent: z.string(),
+  session_id: z.string(),
+  timestamp: z.string(),
+  repo: z.string(),
+  confidence: z.number().min(0).max(1).nullable(),
+  most_likely_wrong: MostLikelyWrongSchema.nullable(),
+  known_not_in_diff: z.string().nullable(),
+  risk: ReflectionRiskSchema,
+  files_changed: z.array(z.string()),
+  provenance: ReflectionProvenanceSchema,
+});
+export type ReflectionV1 = z.infer<typeof ReflectionV1Schema>;
+
+export const REFLECTION_SCHEMA_ID = 'reflection.v1' as const;
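The merge rule described for the self-report half (each field is copied when present, otherwise it becomes `null`, and a missing or unreadable input marks the record degraded) can be sketched without any dependency. This is a minimal illustration under stated assumptions: `mergeSelfReport`, `orNull`, `Json`, and `MergedSelfReport` are hypothetical names, the input is assumed to be already JSON-parsed (or `undefined` when the file is missing), treating a non-object value as unreadable is a choice of this sketch, and no range or surface validation is performed here.

```typescript
// Hypothetical types and helpers; illustration only, not part of the package.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface MergedSelfReport {
  confidence: Json;
  most_likely_wrong: Json;
  known_not_in_diff: Json;
  degraded: boolean;
}

// Mirrors jq's `x // null`: absent, null, and false all collapse to null.
function orNull(value: Json | undefined): Json {
  return value === undefined || value === null || value === false ? null : value;
}

function mergeSelfReport(raw: Json | undefined): MergedSelfReport {
  const isObject = typeof raw === 'object' && raw !== null && !Array.isArray(raw);
  if (!isObject) {
    // Missing or unusable input: every self-report field is null and the record is degraded.
    return { confidence: null, most_likely_wrong: null, known_not_in_diff: null, degraded: true };
  }
  const input = raw as { [key: string]: Json };
  return {
    confidence: orNull(input['confidence']),
    most_likely_wrong: orNull(input['most_likely_wrong']),
    known_not_in_diff: orNull(input['known_not_in_diff']),
    degraded: false,
  };
}

const partial = mergeSelfReport({ confidence: 0.8 });
const missing = mergeSelfReport(undefined);
console.log(partial, missing);
```

A consumer that needs the range and enum guarantees would still run the assembled record through `ReflectionV1Schema`, since this merge only normalises absence to `null`.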
diff --git a/scripts/analysis/reflect-board-history.sh b/scripts/analysis/reflect-board-history.sh
new file mode 100755
index 0000000..d982dc5
--- /dev/null
+++ b/scripts/analysis/reflect-board-history.sh
@@ -0,0 +1,111 @@
+#!/usr/bin/env bash
+# reflect-board-history.sh — Phase-0 experiment P3 (outcome detectability)
+#
+# Question: for completed tasks, how often does a machine-detectable
+# correct/wrong outcome signal appear within a follow-up window (default 30d)?
+# If the base rate is too low, predicted-vs-actual calibration (design §7) has
+# nothing to score against, so the kernel should capture caveat-notes only.
+#
+# Method: consume a board/task export (JSONL, one task object per line) OR fall
+# back to scanning the git history of a `data/` task directory. For each task
+# that reached a "done"-like state, decide whether a later signal marks it
+# correct or wrong (reopen, revert, follow-up "fix"/"regression", explicit
+# outcome field). Emit the detectable-outcome base rate. HARNESS + RUBRIC.
+#
+# Usage:
+#   scripts/analysis/reflect-board-history.sh --jsonl FILE [--window-days N] [--json|--md]
+#   scripts/analysis/reflect-board-history.sh --data-dir DIR [--window-days N] [--json|--md]
+#
+# JSONL fields used (best-effort): .id .status .completed_at .outcome
+#   .reopened_at .followups[] (free-form). Missing fields are tolerated.
+#
+# Requirements: jq (for --jsonl), git (for --data-dir), awk.
+#
+# PRE-REGISTERED KILL CONDITION:
+#   detectable-outcome base rate < 20% ⇒ do NOT build §7 calibration loop;
+#   capture caveat-notes only.
+
+set -euo pipefail
+
+JSONL=""
+DATA_DIR=""
+WINDOW_DAYS=30
+FORMAT="json"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --jsonl) JSONL="$2"; shift 2 ;;
+    --data-dir) DATA_DIR="$2"; shift 2 ;;
+    --window-days) WINDOW_DAYS="$2"; shift 2 ;;
+    --json) FORMAT="json"; shift ;;
+    --md) FORMAT="md"; shift ;;
+    -h|--help) sed -n '2,26p' "$0"; exit 0 ;;
+    *) echo "unknown arg: $1" >&2; exit 2 ;;
+  esac
+done
+
+KILL_CONDITION='detectable-outcome base rate < 20% ⇒ do NOT build §7 calibration loop'
+echo "# pre-registered kill condition: ${KILL_CONDITION}" >&2
+
+done_total=0
+detectable=0
+
+if [[ -n "$JSONL" ]]; then
+  command -v jq >/dev/null 2>&1 || { echo "jq required for --jsonl" >&2; exit 3; }
+  [[ -r "$JSONL" ]] || { echo "cannot read $JSONL" >&2; exit 3; }
+  # Count done tasks and those with a machine-detectable outcome signal.
+  done_total="$(jq -rs '[.[] | select((.status // "") | test("done|complete|closed"; "i"))] | length' "$JSONL" 2>/dev/null || echo 0)"
+  detectable="$(jq -rs '
+    [ .[]
+      | select((.status // "") | test("done|complete|closed"; "i"))
+      | select(
+          (.outcome // null) != null
+          or (.reopened_at // null) != null
+          or ((.followups // []) | length) > 0
+        )
+    ] | length' "$JSONL" 2>/dev/null || echo 0)"
+elif [[ -n "$DATA_DIR" ]]; then
+  command -v git >/dev/null 2>&1 || { echo "git required for --data-dir" >&2; exit 3; }
+  [[ -d "$DATA_DIR" ]] || { echo "no such dir: $DATA_DIR" >&2; exit 3; }
+  # Proxy: a task file later touched by a commit whose subject signals a
+  # correction is a "detectable outcome".
+  while IFS= read -r file; do
+    [[ -z "$file" ]] && continue
+    done_total=$((done_total + 1))
+    if git -C "$DATA_DIR" log --since="${WINDOW_DAYS} days ago" --pretty='%s' -- "$file" 2>/dev/null \
+         | grep -qiE 'reopen|revert|fix|regression|wrong|incorrect|redo'; then
+      detectable=$((detectable + 1))
+    fi
+  done < <(cd "$DATA_DIR" && find . -type f -name '*.json' 2>/dev/null)
+else
+  echo "provide --jsonl FILE or --data-dir DIR" >&2
+  exit 2
+fi
+
+rate="$(awk "BEGIN{ if ($done_total==0) print \"0.0\"; else printf \"%.1f\", 100*$detectable/$done_total }")"
+verdict="$(awk "BEGIN{print ($rate < 20.0) ? \"KILL §7 — caveat-notes only\" : \"signal present — proceed\"}")"
+
+if [[ "$FORMAT" == "md" ]]; then
+  cat <<EOF
+## P3 — outcome detectability
+
+- done-like tasks: **${done_total}**
+- with machine-detectable outcome (window ${WINDOW_DAYS}d): **${detectable}**
+- base rate: **${rate}%**
+- kill condition: ${KILL_CONDITION}
+- verdict: **${verdict}**
+EOF
+else
+  awk -v dt="$done_total" -v d="$detectable" -v r="$rate" -v w="$WINDOW_DAYS" \
+      -v v="$verdict" -v kc="$KILL_CONDITION" 'BEGIN{
+    printf "{\n"
+    printf "  \"experiment\": \"P3-board-history\",\n"
+    printf "  \"window_days\": %d,\n", w
+    printf "  \"done_tasks\": %d,\n", dt
+    printf "  \"detectable_outcomes\": %d,\n", d
+    printf "  \"base_rate_pct\": %s,\n", r
+    printf "  \"kill_condition\": \"%s\",\n", kc
+    printf "  \"verdict\": \"%s\"\n", v
+    printf "}\n"
+  }'
+fi
diff --git a/scripts/analysis/reflect-calibration.sh b/scripts/analysis/reflect-calibration.sh
new file mode 100755
index 0000000..3738761
--- /dev/null
+++ b/scripts/analysis/reflect-calibration.sh
@@ -0,0 +1,117 @@
+#!/usr/bin/env bash
+# reflect-calibration.sh — Phase-0 experiment P1 (confidence signal)
+#
+# Question: does an agent's self-reported confidence discriminate correct from
+# incorrect work — especially on the self-rated-HIGH subset, where a closed
+# loop would actually trust it? If confidence ≈ chance on the high subset, the
+# signal is useless and design §7–§8 should not be built.
+#
+# Method: consume a labelled corpus — JSONL of {confidence: 0..1, correct:
+# true|false}. Compute discrimination as ROC AUC over all rows, plus the
+# correct-rate (lift) on the high-confidence subset (>= threshold), and compare
+# to the pre-registered chance baseline (the overall correct-rate). HARNESS +
+# RUBRIC; the labelled corpus is supplied later.
+#
+# Usage:
+#   scripts/analysis/reflect-calibration.sh --jsonl FILE [--high 0.8] [--json|--md]
+#
+# Requirements: jq, awk.
+#
+# PRE-REGISTERED KILL CONDITION:
+#   AUC <= 0.60 OR high-subset lift <= +5pp over base rate
+#   ⇒ confidence is not a usable routing signal; do NOT build §7–§8.
+
+set -euo pipefail
+
+JSONL=""
+HIGH=0.8
+FORMAT="json"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --jsonl) JSONL="$2"; shift 2 ;;
+    --high) HIGH="$2"; shift 2 ;;
+    --json) FORMAT="json"; shift ;;
+    --md) FORMAT="md"; shift ;;
+    -h|--help) sed -n '2,22p' "$0"; exit 0 ;;
+    *) echo "unknown arg: $1" >&2; exit 2 ;;
+  esac
+done
+
+KILL_CONDITION='AUC <= 0.60 OR high-subset lift <= +5pp ⇒ do NOT build §7–§8'
+echo "# pre-registered kill condition: ${KILL_CONDITION}" >&2
+
+command -v jq >/dev/null 2>&1 || { echo "jq required" >&2; exit 3; }
+[[ -r "$JSONL" ]] || { echo "provide a readable --jsonl FILE" >&2; exit 2; }
+
+# Normalise to "<confidence> <0|1>" rows; skip rows without a numeric confidence.
+ROWS="$(jq -rs '
+  [ .[] | select((.confidence|type)=="number") |
+    "\(.confidence) \((.correct==true) | if . then 1 else 0 end)" ]
+  | .[]' "$JSONL" 2>/dev/null || true)"
+
+if [[ -z "$ROWS" ]]; then
+  echo '{ "experiment": "P1-calibration", "error": "no usable rows" }'
+  exit 0
+fi
+
+# AUC via the Mann–Whitney U relation (rank-based); base rate; high-subset lift.
+read -r N POS BASE AUC HIGH_N HIGH_CORRECT HIGH_RATE LIFT <<EOF
+$(printf '%s\n' "$ROWS" | awk -v high="$HIGH" '
+  { c=$1; y=$2; conf[NR]=c; lab[NR]=y; n++;
+    if (y==1) pos++; else neg++;
+    if (c>=high) { hn++; if (y==1) hc++ } }
+  END{
+    base = (n>0)? pos/n : 0;
+    # Rank-sum AUC: average ranks (ties → average rank).
+    # sort indices by confidence
+    for (i=1;i<=n;i++) idx[i]=i;
+    for (i=1;i<=n;i++) for (j=i+1;j<=n;j++) if (conf[idx[i]]>conf[idx[j]]) { t=idx[i]; idx[i]=idx[j]; idx[j]=t }
+    i=1;
+    while (i<=n) {
+      j=i; while (j<n && conf[idx[j+1]]==conf[idx[i]]) j++;
+      avg=(i+j)/2.0;
+      for (k=i;k<=j;k++) rank[idx[k]]=avg;
+      i=j+1;
+    }
+    rsum=0; for (i=1;i<=n;i++) if (lab[i]==1) rsum+=rank[i];
+    if (pos>0 && neg>0) auc=(rsum - pos*(pos+1)/2.0)/(pos*neg); else auc=0.5;
+    hrate=(hn>0)? hc/hn : 0;
+    lift=hrate-base;
+    printf "%d %d %.4f %.4f %d %d %.4f %.4f", n, pos, base, auc, hn, hc, hrate, lift
+  }')
+EOF
+
+verdict="$(awk -v auc="$AUC" -v lift="$LIFT" 'BEGIN{
+  print (auc <= 0.60 || lift <= 0.05) ? "KILL §7–§8 — confidence not usable" : "signal present — proceed"
+}')"
+
+if [[ "$FORMAT" == "md" ]]; then
+  cat <<EOF
+## P1 — confidence calibration
+
+- rows: **${N}** (positives ${POS}) · base correct-rate **$(awk "BEGIN{printf \"%.1f\", 100*${BASE}}")%**
+- ROC AUC: **${AUC}**
+- high-confidence subset (>= ${HIGH}): n=${HIGH_N}, correct=${HIGH_CORRECT}, rate=$(awk "BEGIN{printf \"%.1f\", 100*${HIGH_RATE}}")%
+- lift over base: **$(awk "BEGIN{printf \"%+.1f\", 100*${LIFT}}")pp**
+- kill condition: ${KILL_CONDITION}
+- verdict: **${verdict}**
+EOF
+else
+  awk -v n="$N" -v pos="$POS" -v base="$BASE" -v auc="$AUC" -v hn="$HIGH_N" \
+      -v hc="$HIGH_CORRECT" -v hr="$HIGH_RATE" -v lift="$LIFT" -v high="$HIGH" \
+      -v v="$verdict" -v kc="$KILL_CONDITION" 'BEGIN{
+    printf "{\n"
+    printf "  \"experiment\": \"P1-calibration\",\n"
+    printf "  \"rows\": %d,\n", n
+    printf "  \"positives\": %d,\n", pos
+    printf "  \"base_rate\": %.4f,\n", base
+    printf "  \"auc\": %.4f,\n", auc
+    printf "  \"high_threshold\": %s,\n", high
+    printf "  \"high_subset\": { \"n\": %d, \"correct\": %d, \"rate\": %.4f },\n", hn, hc, hr
+    printf "  \"lift_over_base\": %.4f,\n", lift
+    printf "  \"kill_condition\": \"%s\",\n", kc
+    printf "  \"verdict\": \"%s\"\n", v
+    printf "}\n"
+  }'
+fi
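
The AUC step in the calibration script above relies on the Mann-Whitney U relation: rank all rows by confidence (ties share the average rank), sum the ranks of the correct rows, and normalise. A minimal sketch of that calculation on a four-row example follows; `rank_auc` is a hypothetical helper name used only for illustration, and it pre-sorts with `sort -n` rather than sorting inside awk as the script does.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the patch): rank-sum AUC over
# "<confidence> <0|1>" rows read from stdin.
rank_auc() {
  LC_ALL=C sort -n -k1,1 | awk '
    { conf[NR] = $1; lab[NR] = $2; if ($2 == 1) pos++; else neg++ }
    END {
      n = NR
      i = 1
      while (i <= n) {
        # Rows i..j share one confidence value and get the average rank.
        j = i
        while (j < n && conf[j + 1] == conf[i]) j++
        avg = (i + j) / 2.0
        for (k = i; k <= j; k++) if (lab[k] == 1) rsum += avg
        i = j + 1
      }
      # U = rsum - pos*(pos+1)/2 ; AUC = U / (pos*neg). One class only: 0.5.
      if (pos > 0 && neg > 0)
        printf "%.4f\n", (rsum - pos * (pos + 1) / 2.0) / (pos * neg)
      else
        printf "%.4f\n", 0.5
    }'
}

# Ascending ranks: 0.4→1, 0.7→2, 0.8→3, 0.9→4; positives hold ranks 2 and 4.
# U = (2 + 4) - 2*3/2 = 3, AUC = 3 / (2*2).
printf '%s\n' '0.9 1' '0.8 0' '0.7 1' '0.4 0' | rank_auc   # prints 0.7500
```

The same result follows from counting pairs directly: of the four (correct, incorrect) pairs, three have the correct row at higher confidence, which gives 3/4.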
diff --git a/scripts/analysis/reflect-git-history.sh b/scripts/analysis/reflect-git-history.sh
new file mode 100755
index 0000000..129a2bd
--- /dev/null
+++ b/scripts/analysis/reflect-git-history.sh
@@ -0,0 +1,110 @@
+#!/usr/bin/env bash
+# reflect-git-history.sh — Phase-0 experiment P2 ("only-self-reflection" bucket)
+#
+# Question: of the failures visible in git history, what fraction would ONLY
+# have been caught by end-of-run self-reflection — i.e. NOT by CI and NOT by
+# independent human review? If that bucket is near-empty, the closed
+# calibration / skill-synthesis loop (design §7–§8) is not worth building.
+#
+# Method: scan `git log` over a window for failure signals in commit subjects
+# (reverts, regressions, and fix/hotfix commits). Classify each by the gate
+# most likely to have caught it, using a pre-registered keyword heuristic.
+# This is a HARNESS + RUBRIC; the classifier is deliberately simple and the
+# real corpus/labelling is wired later. It emits a structured tally.
+#
+# Usage:
+#   scripts/analysis/reflect-git-history.sh [--repo PATH] [--since SINCE] [--json|--md]
+#
+# Options:
+#   --repo PATH   repo to analyse (default: current repo)
+#   --since SINCE git log --since value (default: "6 months ago")
+#   --json        emit JSON (default)
+#   --md          emit markdown
+#
+# Requirements: bash 4.4+ (mapfile, empty arrays under set -u), git, grep, awk.
+#
+# PRE-REGISTERED KILL CONDITION:
+#   bucket "only_self_reflection" is near-empty (< 10% of classified failures)
+#   ⇒ do NOT build design §7–§8 (closed loop). Caveat-notes capture only.
+
+set -euo pipefail
+
+REPO="."
+SINCE="6 months ago"
+FORMAT="json"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --repo) REPO="$2"; shift 2 ;;
+    --since) SINCE="$2"; shift 2 ;;
+    --json) FORMAT="json"; shift ;;
+    --md) FORMAT="md"; shift ;;
+    -h|--help) sed -n '2,28p' "$0"; exit 0 ;;
+    *) echo "unknown arg: $1" >&2; exit 2 ;;
+  esac
+done
+
+KILL_CONDITION='bucket only_self_reflection < 10% of classified failures ⇒ do NOT build §7–§8'
+echo "# pre-registered kill condition: ${KILL_CONDITION}" >&2
+
+command -v git >/dev/null 2>&1 || { echo "git required" >&2; exit 3; }
+
+# Collect candidate failure commits: reverts + fix/hotfix subjects.
+mapfile -t LINES < <(
+  git -C "$REPO" log --since="$SINCE" --pretty='%H%x09%s' 2>/dev/null \
+    | grep -iE 'revert|hotfix|hot-fix|regression|fix(\(|:|!| )' || true
+)
+
+total=0; ci=0; human=0; selfonly=0
+for line in "${LINES[@]}"; do
+  [[ -z "$line" ]] && continue
+  subj="${line#*$'\t'}"
+  total=$((total + 1))
+  # Pre-registered classification heuristic (gate most likely to have caught it):
+  #   - build/test/lint/type/ci signals → CI would have caught it
+  #   - security/auth/permission/data/migration → human review would flag it
+  #   - everything else (logic/UX/assumption/edge) → only-self-reflection bucket
+  if printf '%s' "$subj" | grep -qiE 'test|lint|type|build|compile|typo|(^|[^[:alnum:]])ci([^[:alnum:]]|$)'; then
+    ci=$((ci + 1))
+  elif printf '%s' "$subj" | grep -qiE 'security|auth|permission|rbac|secret|migration|data|sql|injection'; then
+    human=$((human + 1))
+  else
+    selfonly=$((selfonly + 1))
+  fi
+done
+
+pct() { awk "BEGIN{ if ($2==0) print \"0.0\"; else printf \"%.1f\", 100*$1/$2 }"; }
+self_pct="$(pct "$selfonly" "$total")"
+verdict="$(awk -v t="$total" -v sp="$self_pct" 'BEGIN{print ((t == 0) ? "no classified failures (kill condition not evaluable)" : ((sp < 10.0) ? "KILL §7–§8" : "signal present — proceed to deeper labelling"))}')"
+
+if [[ "$FORMAT" == "md" ]]; then
+  cat <<EOF
+## P2 — git-history failure-gate attribution
+
+- window: \`${SINCE}\` · repo: \`${REPO}\`
+- classified failures: **${total}**
+
+| gate | count | share |
+|---|---:|---:|
+| CI would catch | ${ci} | $(pct "$ci" "$total")% |
+| human review would catch | ${human} | $(pct "$human" "$total")% |
+| only-self-reflection | ${selfonly} | ${self_pct}% |
+
+- kill condition: ${KILL_CONDITION}
+- verdict: **${verdict}**
+EOF
+else
+  awk -v t="$total" -v c="$ci" -v h="$human" -v s="$selfonly" -v sp="$self_pct" \
+      -v v="$verdict" -v since="$SINCE" -v repo="$REPO" -v kc="$KILL_CONDITION" 'BEGIN{
+    printf "{\n"
+    printf "  \"experiment\": \"P2-git-history\",\n"
+    printf "  \"repo\": \"%s\",\n", repo
+    printf "  \"since\": \"%s\",\n", since
+    printf "  \"classified_failures\": %d,\n", t
+    printf "  \"buckets\": { \"ci\": %d, \"human_review\": %d, \"only_self_reflection\": %d },\n", c, h, s
+    printf "  \"only_self_reflection_pct\": %s,\n", sp
+    printf "  \"kill_condition\": \"%s\",\n", kc
+    printf "  \"verdict\": \"%s\"\n", v
+    printf "}\n"
+  }'
+fi
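
The classification loop in reflect-git-history.sh tests the CI keywords first, so a subject that matches more than one list is counted in the first matching bucket. A minimal sketch of that first-match-wins order follows; `classify_subject` is a hypothetical helper, and its keyword lists are a reduced, illustrative subset rather than the script's full lists.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the patch). Keyword lists are a
# reduced, illustrative subset of the ones in reflect-git-history.sh.
classify_subject() {
  local subj="$1"
  if printf '%s' "$subj" | grep -qiE 'test|lint|build'; then
    echo ci                      # checked first, so it wins on overlap
  elif printf '%s' "$subj" | grep -qiE 'security|auth|migration'; then
    echo human_review
  else
    echo only_self_reflection    # fallback bucket for everything else
  fi
}

classify_subject 'fix(auth): flaky test on token refresh'   # prints ci
classify_subject 'fix(auth): reject expired tokens'         # prints human_review
classify_subject 'fix: off-by-one in pagination'            # prints only_self_reflection
```

Because the fallback bucket absorbs every subject that matches no keyword, the size of only_self_reflection depends directly on how broad the two keyword lists are; that is worth keeping in mind when reading the share against the 10% kill condition.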
-- 
2.49.1