From b76666166e9d73345835e0fb68934b30003475e0 Mon Sep 17 00:00:00 2001 From: Hermes Agent Date: Tue, 16 Jun 2026 15:55:15 -0500 Subject: [PATCH] =?UTF-8?q?feat(agent-reflection):=20durable=20kernel=20?= =?UTF-8?q?=E2=80=94=20reflection.v1=20capture=20+=20risk-floor=20+=20Phas?= =?UTF-8?q?e-0=20(#544)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Build the durable kernel of the agent reflection loop. Passive end-of-run capture of the doer's end-state as structured `reflection.v1` data, plus a deterministic diff review risk-floor. The closed calibration/skill-synthesis loop (design §7–§8) stays gated behind Phase-0 experiments P1/P2/P3. - packages/macp: evaluateRiskFloor (pure, deterministic surface classifier) + reflection.v1 JSON Schema; 15 unit tests. - packages/types: reflection.v1 zod schemas + self-report DTO; 10 unit tests. - framework: fail-closed Stop hook (reflect-stop-hook.sh) writing the sidecar, registered as hooks.Stop in runtime/claude/settings.json. Strict no-op unless REFLECTION_MODE=solo|orchestrated; never blocks or fails a session. - scripts/analysis: P1/P2/P3 experiment harnesses with pre-registered kill conditions and structured output. Mechanical fields (risk, files_changed, ids, provenance) are written by the hook; self-report fields (confidence, most_likely_wrong, known_not_in_diff) are merged from an optional $REFLECTION_INPUT, else null + provenance.degraded=true. Independent review remediations: empty/all-.mosaic diff still writes a sidecar (grep no-match no longer aborts); session_id sanitized before path use. 
Refs #544 Co-Authored-By: Claude Opus 4.8 --- docs/plans/agent-reflection-loop-PRD.md | 173 +++++++++++++++ docs/scratchpads/544-agent-reflection-loop.md | 55 +++++ docs/tasks/544-agent-reflection-loop.md | 67 ++++++ packages/macp/src/index.ts | 5 + packages/macp/src/risk-floor.spec.ts | 87 ++++++++ packages/macp/src/risk-floor.ts | 138 ++++++++++++ .../src/schemas/reflection.v1.schema.json | 105 ++++++++++ .../framework/runtime/claude/settings.json | 11 + .../framework/tools/qa/reflect-stop-hook.sh | 197 ++++++++++++++++++ packages/types/src/index.ts | 1 + .../reflection/__tests__/reflection.spec.ts | 146 +++++++++++++ packages/types/src/reflection/index.ts | 30 +++ .../types/src/reflection/reflection.dto.ts | 55 +++++ packages/types/src/reflection/reflection.ts | 90 ++++++++ scripts/analysis/reflect-board-history.sh | 111 ++++++++++ scripts/analysis/reflect-calibration.sh | 117 +++++++++++ scripts/analysis/reflect-git-history.sh | 110 ++++++++++ 17 files changed, 1498 insertions(+) create mode 100644 docs/plans/agent-reflection-loop-PRD.md create mode 100644 docs/scratchpads/544-agent-reflection-loop.md create mode 100644 docs/tasks/544-agent-reflection-loop.md create mode 100644 packages/macp/src/risk-floor.spec.ts create mode 100644 packages/macp/src/risk-floor.ts create mode 100644 packages/macp/src/schemas/reflection.v1.schema.json create mode 100755 packages/mosaic/framework/tools/qa/reflect-stop-hook.sh create mode 100644 packages/types/src/reflection/__tests__/reflection.spec.ts create mode 100644 packages/types/src/reflection/index.ts create mode 100644 packages/types/src/reflection/reflection.dto.ts create mode 100644 packages/types/src/reflection/reflection.ts create mode 100755 scripts/analysis/reflect-board-history.sh create mode 100755 scripts/analysis/reflect-calibration.sh create mode 100755 scripts/analysis/reflect-git-history.sh diff --git a/docs/plans/agent-reflection-loop-PRD.md b/docs/plans/agent-reflection-loop-PRD.md new file mode 100644 
index 0000000..114b2b0 --- /dev/null +++ b/docs/plans/agent-reflection-loop-PRD.md @@ -0,0 +1,173 @@ +# PRD — Agent Reflection Loop (durable kernel) + +**Issue:** [#544](http://git.mosaicstack.dev/mosaicstack/stack/issues/544) +**Source design:** jarvis-brain `docs/planning/AGENT-REFLECTION-LOOP.md` (commit df6576fc, debate-hardened v2) +**Status:** in-progress +**Scope rule:** Build the **durable kernel** only. The closed calibration/skill-synthesis loop +(design §7–§8) is **gated** behind Phase-0 experiments P1/P2/P3 and is explicitly out of scope here. + +--- + +## 1. Problem + +At end-of-run an agent holds context that never reaches the diff or the "done" message — +assumptions, shortcuts, untested paths, the single most-likely way the work is wrong. That context +is what a lead/human needs to judge trust, and it evaporates when the session ends. Capture it +mechanically as **structured data** (`reflection.v1`), and derive a **review risk-floor** from the +change surface so risky diffs are flagged for independent review. + +## 2. Non-goals (gated on Phase-0) + +- No closed calibration loop (predicted-vs-actual scoring as a routing input). +- No skill synthesis. +- No automated reviewer routing/dispatch. The kernel **writes** the sidecar; pickup is future work. + +## 3. 
Components & exact placement (main-branch truth) + +| # | Component | Path | Mirror | +| --- | -------------------- | ------------------------------------------------------------------------------------------------ | ----------------------------------- | +| a | Stop hook (capture) | `packages/mosaic/framework/tools/qa/reflect-stop-hook.sh` | `tools/qa/prevent-memory-write.sh` | +| a | Hook registration | `packages/mosaic/framework/runtime/claude/settings.json` (`hooks.Stop`) | existing `PreToolUse`/`PostToolUse` | +| b | JSON Schema | `packages/macp/src/schemas/reflection.v1.schema.json` | `schemas/task.schema.json` | +| b | TS types (zod) + DTO | `packages/types/src/reflection/{index.ts,reflection.dto.ts}` + re-export from `src/index.ts` | `packages/types/src/federation/*` | +| c | Diff risk-floor | `packages/macp/src/risk-floor.ts` (+ `risk-floor.spec.ts`, export from `src/index.ts`)           | `packages/macp/src/gate-runner.ts` | +| d | Phase-0 scripts | `scripts/analysis/reflect-{git-history,board-history,calibration}.sh` | `scripts/publish-npmjs.sh` | + +**Activation note (deliberate deviation):** the `settings-overlays/` directory has **no merge +mechanism** (referenced only in docs), so a hooks overlay there would be inert. The Stop hook is +registered in the canonical `runtime/claude/settings.json` — the same file the `mosaic` launcher +reflects into `~/.claude/settings.json` (verified byte-identical hooks live there). Still fully +vendored in-repo. + +## 4. 
`reflection.v1` schema (authoritative field list) + +```jsonc +{ + "schema": "reflection.v1", // literal + "task_ref": "string", // canonical task ref; kernel derives from REFLECTION_TASK_REF or repo+branch + "agent": "string", // persona/runtime id (REFLECTION_AGENT or "unknown") + "session_id": "string", // from Stop payload session_id, else "unknown" + "timestamp": "string", // ISO-8601 UTC + "repo": "string", // repo root basename + "confidence": 0.0, // FLOAT [0,1] — SELF-REPORTED (optional; null if not supplied) + "most_likely_wrong": { + // SELF-REPORTED (optional) + "surface": "auth|data|infra|ui|build|test|docs|none", + "description": "string", + }, + "known_not_in_diff": "string|null", // SELF-REPORTED: "what I know that isn't visible in the diff" + "risk": { + // MECHANICAL — from risk-floor + "needs_review": true, + "score": 0.0, // [0,1] + "surface": "auth|data|infra|ui|build|test|docs|none", + "reason": "string", + }, + "files_changed": ["string"], // MECHANICAL — git diff name-only + "provenance": { + "source": "stop-hook", + "reflection_attempt": 1, + "degraded": false, // true if self-report inputs missing/unreadable + "reflection_mode": "off|solo|orchestrated", + }, +} +``` + +**Mechanical vs self-reported.** A bash Stop hook cannot author the agent's self-assessment. The +hook populates the **mechanical** fields deterministically (risk, files_changed, provenance, ids). +The **self-reported** fields are read from an optional agent-supplied input file +(`$REFLECTION_INPUT`, default `/.mosaic/reflection-input.json`) and merged if present; +absent/unreadable → those fields null and `provenance.degraded=true`. This realizes the design's +"hook is a pre-seed, not the asker" (§4). + +## 5. Stop hook behavior (fail-closed, non-blocking) + +1. Read Stop payload JSON from stdin. +2. **Fail-closed:** if `REFLECTION_MODE` is unset or `off` → `exit 0` immediately (strict no-op). This + is the global-registration safety guarantee. +3. 
**Sentinel guard:** if `.lock` exists → `exit 0` (prevents re-fire loops). Create it, + `trap` cleanup. +4. Determine output dir: `$REFLECTION_DIR` else `/.mosaic/reflections/`. `mkdir -p`. +5. Compute mechanical fields: `git diff --name-only` (HEAD + staged + worktree, best-effort), + call risk-floor logic (inline bash port OR `node -e` into `@mosaicstack/macp` — see §6), session + ids from payload + env. +6. Merge optional `$REFLECTION_INPUT` self-report if readable JSON. +7. Write `reflection.v1` to a temp file, `mv` (atomic) to `/-.reflection.json`. +8. Always `exit 0`. **Never** emit a `decision` field (Stop hooks are observational). + +Hook must never fail the session: wrap risky steps, default to `degraded:true` on any error, exit 0. + +## 6. Risk-floor (`packages/macp/src/risk-floor.ts`) + +Pure, deterministic, no IO. Single source of truth for the verdict; the hook calls it via +`node --input-type=module -e` (importing the built package) **or**, to avoid a node dependency in the +hook path, the hook ports the same surface table. **Decision:** implement the canonical logic in TS +(tested), and have the hook shell out to node when available, else fall back to a minimal inline +classifier flagged `degraded:true`. (Keep the TS the authority; the inline path is a safety net.) 
+ +```ts +export type ReviewSurface = 'auth' | 'data' | 'infra' | 'ui' | 'build' | 'test' | 'docs' | 'none'; +export interface RiskFloorInput { + filesChanged: string[]; + insertions?: number; + deletions?: number; +} +export interface RiskFloorVerdict { + needs_review: boolean; + score: number; + surface: ReviewSurface; + reason: string; +} +export function evaluateRiskFloor(input: RiskFloorInput): RiskFloorVerdict; +``` + +Surface classification by path regex (first match wins, highest-risk surface dominates): + +- `auth` (weight 1.0): `auth`, `login`, `session`, `token`, `permission`, `rbac`, `credential`, `secret` +- `data` (0.9): `migration`, `prisma`, `schema`, `\.sql`, `entity`, `repository`, `seed` +- `infra` (0.85): `docker`, `\.woodpecker`, `compose`, `traefik`, `deploy`, `helm`, `k8s`, `terraform` +- `build` (0.6): `package.json`, `tsconfig`, `turbo.json`, `pnpm-`, `\.config\.`, `eslint`, `vite` +- `ui` (0.4): `\.tsx`, `\.css`, `components/`, `apps/web/` +- `test` (0.2): `\.spec\.`, `\.test\.`, `__tests__/` +- `docs` (0.1): `\.md`, `docs/` +- `none` (0.0): anything else + +`needs_review = score >= THRESHOLD` (default `0.5`, overridable). `reason` names the files+surface +that tripped it. **Subordinate to CI:** this is a _floor_ (minimum review requirement) only; +consumers MUST treat CI/tests as authoritative above the floor (precedence: CI/tests > human merge > +reviewer verdict > self-reflection). Documented in the module header. + +## 7. Phase-0 experiment scripts (`scripts/analysis/`) + +Offline, no-infra bash. Each script: `#!/usr/bin/env bash`, `set -euo pipefail`, header `Usage:` + +`Requirements:`, flag parsing, **prints its pre-registered kill condition**, emits structured +(JSON/markdown) output. They are harnesses + rubrics — real corpora are wired later. 
+ +- `reflect-git-history.sh` (**P2** — only-self-reflection bucket): scan `git log` for failure signals + (reverts, `fix:`/`hotfix` shortly after a feature merge) over a window; classify each by which gate + would catch it (CI / human-review / only-self-reflection) via a pre-registered heuristic; tally. + Kill: bucket-3 near-empty → no §7/§8. +- `reflect-board-history.sh` (**P3** — outcome detectability): given a task/board export (or the + git history of `data/` task files), measure the fraction of completed tasks with a + machine-detectable correct/wrong signal within 30 days. Kill: base-rate < 20% → caveat-notes only. +- `reflect-calibration.sh` (**P1** — confidence signal): consume a labeled corpus (JSONL of + `{confidence, correct}`), compute discrimination (AUC/lift) on the self-rated-high subset, print + the metric vs the pre-registered chance threshold. Kill: AUC ≈ chance on the high subset → no §7/§8. + +## 8. CI / quality gates + +- TS packages: `pnpm typecheck` (tsc --noEmit), `pnpm lint` (eslint), `pnpm format:check` + (prettier), `pnpm test` (vitest). ESM, NodeNext, `.js` import specifiers, `*.dto.ts` at boundaries. +- New files in existing packages need no CI config change; add ≥1 vitest spec per new TS module. +- Bash scripts/hook are dev/runtime tooling, not CI-built; keep them `shellcheck`-clean. + +## 9. Acceptance criteria + +1. `REFLECTION_MODE` unset → hook is a strict no-op (`exit 0`, no file written). **(test)** +2. With `REFLECTION_MODE=solo`, hook writes a schema-valid `reflection.v1` with correct mechanical + fields; self-report merged when `$REFLECTION_INPUT` present, `degraded:true` when absent. +3. `evaluateRiskFloor` deterministic across all surfaces; unit-tested incl. auth/data/infra → review, + docs/test → no review, empty → `none`/no review. +4. `reflection.v1` zod type + JSON Schema agree; sidecar validates against the schema. +5. Phase-0 scripts run offline, print kill conditions, emit structured output, shellcheck-clean. +6. 
`pnpm typecheck && pnpm lint && pnpm format:check && pnpm test` green; independent review passed. diff --git a/docs/scratchpads/544-agent-reflection-loop.md b/docs/scratchpads/544-agent-reflection-loop.md new file mode 100644 index 0000000..fd7f569 --- /dev/null +++ b/docs/scratchpads/544-agent-reflection-loop.md @@ -0,0 +1,55 @@ +# Scratchpad — #544 Agent Reflection Loop (durable kernel) + +**Started:** 2026-06-16 · **Branch:** `feat/agent-reflection-loop` · **Base:** `main` @ c461380 + +## Goal + +Bake the durable kernel of the agent reflection loop into the Mosaic Stack +monorepo through full delivery gates. Kernel only; closed loop (§7–§8) gated on +Phase-0. Authoritative spec: `docs/plans/agent-reflection-loop-PRD.md`. Task +breakdown: `docs/tasks/544-agent-reflection-loop.md`. + +## Timeline / decisions + +- Mapped house style against `main` truth (the earlier recon had mapped a dirty + feature branch and returned non-existent paths; re-cloned `main` clean). +- macp uses co-located `*.spec.ts`; types uses `src//{*.ts, *.dto.ts, __tests__/*.spec.ts}`. +- zod v4 + class-validator/class-transformer present in `@mosaicstack/types`; + `packages/types/tsconfig.json` enables `experimentalDecorators`/`emitDecoratorMetadata`. +- **Gotcha (fixed):** `class-transformer`'s `@Type` calls `Reflect.getMetadata` + at module-load time; the types vitest env has no `reflect-metadata`, so any test + importing the reflection barrel crashed on import. `chat.dto.ts` avoids this by + using class-validator only. Fix: dropped `@Type`/`@ValidateNested` from the DTO; + zod owns deep nested validation. +- **Gotcha (fixed):** Stop hook `EXIT` trap referenced a `main`-local `lock` → + `unbound variable` under `set -u` at exit. Promoted to a global `LOCKFILE`. +- **Gotcha (fixed):** the hook's own lock + `.mosaic/` scratch leaked into + `files_changed`. Excluded `^\.mosaic/` from the change-surface scan. 
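The `.mosaic/` exclusion noted in the last gotcha can be sketched in TypeScript as follows. This is an illustration only: the real exclusion lives in the bash hook, `changeSurface` is a hypothetical name, and the de-duplication and sorting are assumptions added here so the result does not depend on the order in which the diff sources are combined.

```typescript
// Hypothetical helper, not part of the patch: mirrors the hook's `^\.mosaic/` exclusion.
const SCRATCH_PREFIX = /^\.mosaic\//;

function changeSurface(paths: string[]): string[] {
  // Drop blank entries and anything under .mosaic/ (agent scratch: reflections, locks, self-report input).
  const kept = paths.filter((p) => p.length > 0 && !SCRATCH_PREFIX.test(p));
  // Assumption: de-duplicate and sort so the list is stable across runs.
  return [...new Set(kept)].sort();
}
```

The filter is anchored at the start of the path, so a file such as `docs/.mosaic/notes.md` would still count as part of the change surface.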
+ +## Verification evidence + +- macp: typecheck OK, lint OK, **88 tests pass** (15 new risk-floor). +- types: typecheck OK, lint OK, **64 tests pass** (10 new reflection). +- Root: `pnpm typecheck` (41 tasks), `pnpm lint` (23), `pnpm format:check`, `pnpm build` (23) — all green. +- Stop hook smoke (throwaway git repo): TEST1 no-op (mode unset, 0 files); + TEST2 solo degraded, `.mosaic/` excluded, auth→needs_review; TEST3 self-report + merged, degraded=false; TEST4 lock suppresses re-fire. All pass, always exit 0. +- shellcheck clean: hook + `reflect-{git-history,board-history,calibration}.sh`. +- Phase-0 smoke: P2 on this repo (142 failures classified), P1 AUC=0.875 on a + synthetic fixture, P3 base-rate on a synthetic board — all emit structured output + - kill conditions. + +## Open risks / follow-ups + +- Full `pnpm test` (DB-bound packages) validated via CI's postgres service, not + locally; affected packages (macp, types) are DB-independent and green here. +- sequential-thinking MCP was registered mid-session (effective next session); + this session compensated with the written PRD as the planning artifact. +- Phase-0 corpora are not yet wired — scripts are harnesses + pre-registered + rubrics (P1/P2/P3 tasks tracked in jarvis-brain `agent-reflection-loop` project). 
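For readers unfamiliar with the AUC figure quoted for P1, the metric can be sketched as a pairwise comparison over `{confidence, correct}` records. This is a minimal TypeScript illustration under stated assumptions, not the harness itself (which is a bash script): the `Labeled` type and the `auc` function are hypothetical names, and restricting the input to the self-rated-high subset is left to the caller.

```typescript
// Hypothetical record shape, matching the `{confidence, correct}` JSONL rows described in the PRD.
interface Labeled {
  confidence: number;
  correct: boolean;
}

// AUC as the probability that a randomly chosen correct item carries a higher
// confidence than a randomly chosen incorrect one; ties count as half a win.
// Returns null when either class is empty, because the metric is undefined there.
function auc(rows: Labeled[]): number | null {
  const pos = rows.filter((r) => r.correct).map((r) => r.confidence);
  const neg = rows.filter((r) => !r.correct).map((r) => r.confidence);
  if (pos.length === 0 || neg.length === 0) return null;

  let wins = 0;
  for (const p of pos) {
    for (const n of neg) {
      if (p > n) wins += 1;
      else if (p === n) wins += 0.5;
    }
  }
  return wins / (pos.length * neg.length);
}
```

A value near 0.5 means confidence carries no discriminating signal, which is the situation the P1 kill condition is meant to detect. The quadratic loop is fine for small labeled corpora; a rank-based formulation would be the usual choice for large ones.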
+ +## Gate status + +- [x] PRD authored · [x] issue #544 created + linked · [x] code + tests +- [x] local gates green · [ ] independent code review · [ ] PR opened +- [ ] CI terminal green · [ ] merged to main · [ ] issue closed diff --git a/docs/tasks/544-agent-reflection-loop.md b/docs/tasks/544-agent-reflection-loop.md new file mode 100644 index 0000000..4c07553 --- /dev/null +++ b/docs/tasks/544-agent-reflection-loop.md @@ -0,0 +1,67 @@ +# 544: Agent Reflection Loop — durable kernel + +**Issue:** [#544](http://git.mosaicstack.dev/mosaicstack/stack/issues/544) +**PRD:** [`docs/plans/agent-reflection-loop-PRD.md`](../plans/agent-reflection-loop-PRD.md) +**Branch:** `feat/agent-reflection-loop` + +## Context + +Build the **durable kernel** of the agent reflection loop: passive end-of-run +capture of the doer's end-state as structured `reflection.v1` data, plus a +deterministic diff **review risk-floor**. The closed calibration / skill-synthesis +loop (design §7–§8) stays **gated** behind Phase-0 experiments P1/P2/P3 and is +explicitly out of scope here. Source design: jarvis-brain +`docs/planning/AGENT-REFLECTION-LOOP.md` (debate-hardened v2). + +Scope rule, non-goals, the full `reflection.v1` field list, and acceptance +criteria live in the PRD. This file is the task breakdown + status. 
+ +## Work items + +| # | Item | Path | Status | +| --- | ----------------------------------------------------- | --------------------------------------------------------- | ------ | +| 1 | Diff risk-floor (pure, deterministic) + unit tests | `packages/macp/src/risk-floor.ts`, `risk-floor.spec.ts` | done | +| 2 | `reflection.v1` JSON Schema (documented contract) | `packages/macp/src/schemas/reflection.v1.schema.json` | done | +| 3 | `reflection.v1` zod schemas + self-report DTO + tests | `packages/types/src/reflection/*` | done | +| 4 | Stop hook (fail-closed capture) | `packages/mosaic/framework/tools/qa/reflect-stop-hook.sh` | done | +| 5 | Hook registration (`hooks.Stop`) | `packages/mosaic/framework/runtime/claude/settings.json` | done | +| 6 | Phase-0 experiment harnesses (P1/P2/P3) | `scripts/analysis/reflect-*.sh` | done | + +## Design decisions (this implementation) + +- **Mechanical vs self-reported split.** A bash Stop hook cannot author the + agent's self-assessment, so it writes the mechanical fields (risk-floor verdict, + `files_changed`, ids, provenance) and merges an optional agent-supplied + `$REFLECTION_INPUT` self-report; absent/unreadable ⇒ those fields `null` and + `provenance.degraded = true`. +- **Risk-floor authority.** `evaluateRiskFloor` (TS, tested) is the source of + truth. The hook ports the same surface table inline to avoid a node/build + dependency on the hook path; the two are documented as kept in sync. +- **Hook registration deviation.** `settings-overlays/` has no merge mechanism + (docs-only), so a hooks overlay there would be inert. The Stop hook is + registered in the canonical `runtime/claude/settings.json` — the same file the + `mosaic` launcher reflects into `~/.claude/settings.json`. Still vendored in-repo. +- **DTO without class-transformer.** `reflection.dto.ts` uses class-validator only + (no `@Type`), matching `chat.dto.ts`, so the module imports without a + `reflect-metadata` shim in the types-package test env. 
Deep nested validation is + owned by the zod `ReflectionSelfReportSchema` (the runtime authority the hook uses). +- **`.mosaic/` excluded** from the change surface — it is agent scratch + (reflections, locks, self-report input), not part of the diff under review. + +## Verification + +- `pnpm --filter @mosaicstack/macp test` → 88 passed (15 new risk-floor). +- `pnpm --filter @mosaicstack/types test` → 64 passed (10 new reflection). +- Root `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, `pnpm build` → green. +- Stop hook smoke: fail-closed no-op (mode unset), solo capture (degraded), + self-report merge (degraded=false), re-fire lock guard — all pass. +- All bash (hook + 3 Phase-0 scripts) shellcheck-clean; Phase-0 scripts emit + structured JSON/markdown and print their pre-registered kill conditions. + +## Activation (post-merge, deployment concern — not a blocker) + +The Stop hook only activates when a launcher/profile sets +`REFLECTION_MODE=solo|orchestrated`; unset/`off` is a strict no-op, so global +registration is safe. `framework/install.sh` rsyncs the hook into +`~/.config/mosaic/tools/qa/`, and the `mosaic` launcher reflects the updated +`settings.json` (`hooks.Stop`) into `~/.claude/settings.json`. 
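The mechanical vs self-reported split described under the design decisions can be sketched as a small merge step. This is a hedged TypeScript illustration, not the hook's code: `mergeSelfReport` and `SelfReport` are hypothetical names, the caller is assumed to pass the raw contents of `$REFLECTION_INPUT` (or `null` when the file is absent or unreadable), and treating an individually invalid field as `null` without setting `degraded` is an assumption the PRD does not spell out.

```typescript
const SURFACES = ['auth', 'data', 'infra', 'build', 'ui', 'test', 'docs', 'none'] as const;
type Surface = (typeof SURFACES)[number];

// Hypothetical shape for the three self-reported fields of reflection.v1.
interface SelfReport {
  confidence: number | null;
  most_likely_wrong: { surface: Surface; description: string } | null;
  known_not_in_diff: string | null;
}

const EMPTY: SelfReport = { confidence: null, most_likely_wrong: null, known_not_in_diff: null };

function isSurface(v: unknown): v is Surface {
  return typeof v === 'string' && (SURFACES as readonly string[]).includes(v);
}

// raw: file contents of the self-report input, or null when absent/unreadable.
function mergeSelfReport(raw: string | null): { report: SelfReport; degraded: boolean } {
  if (raw === null) return { report: { ...EMPTY }, degraded: true };

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { report: { ...EMPTY }, degraded: true };
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return { report: { ...EMPTY }, degraded: true };
  }
  const o = parsed as Record<string, unknown>;

  // Assumption: a field that fails its check becomes null rather than degrading the whole record.
  const c = o['confidence'];
  const confidence = typeof c === 'number' && c >= 0 && c <= 1 ? c : null;

  let most_likely_wrong: SelfReport['most_likely_wrong'] = null;
  const m = o['most_likely_wrong'];
  if (typeof m === 'object' && m !== null) {
    const { surface, description } = m as Record<string, unknown>;
    if (isSurface(surface) && typeof description === 'string') {
      most_likely_wrong = { surface, description };
    }
  }

  const k = o['known_not_in_diff'];
  const known_not_in_diff = typeof k === 'string' ? k : null;

  return { report: { confidence, most_likely_wrong, known_not_in_diff }, degraded: false };
}
```

In this sketch the mechanical fields never pass through the merge, so a malformed self-report cannot alter the risk-floor verdict or `files_changed`.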
diff --git a/packages/macp/src/index.ts b/packages/macp/src/index.ts index 073c886..a510b9a 100644 --- a/packages/macp/src/index.ts +++ b/packages/macp/src/index.ts @@ -39,6 +39,11 @@ export { normalizeGate, runShell, countAIFindings, runGate, runGates } from './g export type { NormalizedGate } from './gate-runner.js'; +// Risk-floor (agent reflection loop — diff review classifier) +export { evaluateRiskFloor, DEFAULT_RISK_THRESHOLD } from './risk-floor.js'; + +export type { ReviewSurface, RiskFloorInput, RiskFloorVerdict } from './risk-floor.js'; + // Event emitter export { nowISO, appendEvent, emitEvent } from './event-emitter.js'; diff --git a/packages/macp/src/risk-floor.spec.ts b/packages/macp/src/risk-floor.spec.ts new file mode 100644 index 0000000..32e3ac8 --- /dev/null +++ b/packages/macp/src/risk-floor.spec.ts @@ -0,0 +1,87 @@ +import { describe, expect, it } from 'vitest'; + +import { DEFAULT_RISK_THRESHOLD, evaluateRiskFloor, type ReviewSurface } from './risk-floor.js'; + +describe('evaluateRiskFloor', () => { + it('returns a no-review "none" verdict for an empty diff', () => { + const v = evaluateRiskFloor({ filesChanged: [] }); + expect(v).toEqual({ + needs_review: false, + score: 0, + surface: 'none', + reason: 'no files changed', + }); + }); + + it('ignores empty/non-string entries', () => { + const v = evaluateRiskFloor({ filesChanged: ['', ' ' as unknown as string].filter(Boolean) }); + // only the whitespace string survives the Boolean filter; it classifies to none + expect(v.surface).toBe('none'); + expect(v.needs_review).toBe(false); + }); + + it.each<[string, string, ReviewSurface, boolean]>([ + ['auth', 'apps/api/src/auth/session.guard.ts', 'auth', true], + ['data', 'packages/db/migrations/0007_add_users.sql', 'data', true], + ['infra', '.woodpecker/deploy.yml', 'infra', true], + ['build', 'packages/types/tsconfig.json', 'build', true], + ['ui', 'apps/web/src/components/Button.tsx', 'ui', false], + ['test', 
'packages/macp/src/risk-floor.spec.ts', 'test', false], + ['docs', 'docs/plans/agent-reflection-loop-PRD.md', 'docs', false], + ['none', 'README', 'none', false], + ])( + 'classifies a single %s file → surface=%s needs_review=%s', + (_label, file, surface, needsReview) => { + const v = evaluateRiskFloor({ filesChanged: [file] }); + expect(v.surface).toBe(surface); + expect(v.needs_review).toBe(needsReview); + expect(v.reason).toContain( + file === 'README' ? 'no sensitive surface' : surface === 'none' ? '' : surface, + ); + }, + ); + + it('lets the highest-risk surface dominate a mixed diff', () => { + const v = evaluateRiskFloor({ + filesChanged: [ + 'docs/readme.md', + 'apps/web/src/components/Nav.tsx', + 'apps/api/src/auth/token.service.ts', + ], + }); + expect(v.surface).toBe('auth'); + expect(v.score).toBe(1.0); + expect(v.needs_review).toBe(true); + expect(v.reason).toContain('token.service.ts'); + expect(v.reason).not.toContain('readme.md'); + }); + + it('names every file that ties at the dominant surface', () => { + const v = evaluateRiskFloor({ + filesChanged: ['src/login.ts', 'src/permission-check.ts'], + }); + expect(v.surface).toBe('auth'); + expect(v.reason).toContain('src/login.ts'); + expect(v.reason).toContain('src/permission-check.ts'); + }); + + it('treats docs+test-only diffs as below the floor', () => { + const v = evaluateRiskFloor({ + filesChanged: ['docs/guide.md', 'packages/x/src/x.test.ts'], + }); + expect(v.needs_review).toBe(false); + expect(v.surface).toBe('test'); // higher weight than docs + }); + + it('honors a custom threshold', () => { + const docsOnly = { filesChanged: ['docs/guide.md'] }; + expect(evaluateRiskFloor(docsOnly, 0.05).needs_review).toBe(true); + expect(evaluateRiskFloor(docsOnly, DEFAULT_RISK_THRESHOLD).needs_review).toBe(false); + }); + + it('is deterministic across call order', () => { + const a = evaluateRiskFloor({ filesChanged: ['a.md', 'auth/x.ts', 'b.tsx'] }); + const b = evaluateRiskFloor({ filesChanged: 
['b.tsx', 'a.md', 'auth/x.ts'] }); + expect(a).toEqual(b); + }); +}); diff --git a/packages/macp/src/risk-floor.ts b/packages/macp/src/risk-floor.ts new file mode 100644 index 0000000..5a87d5f --- /dev/null +++ b/packages/macp/src/risk-floor.ts @@ -0,0 +1,138 @@ +/** + * Diff risk-floor — deterministic review-need classifier. + * + * Given the set of changed files in a diff, derive a *minimum* review + * requirement ("floor") from the change surface. This is the mechanical half + * of the agent reflection loop (design §6): risky surfaces (auth, data, infra) + * trip a review requirement regardless of what the agent self-reports. + * + * Precedence (authoritative ordering, see design §5): + * CI/tests > human merge > reviewer verdict > self-reflection + * This module sits at the *floor*. It NEVER overrides CI or a human; a + * `needs_review: false` verdict means "no surface tripped the floor", not + * "safe to merge". Consumers MUST keep CI/tests authoritative above it. + * + * Pure and deterministic: no IO, no clock, no randomness. Same input → same + * verdict. Safe to call from a Stop hook via `node -e` or to port inline. + */ + +/** Review surfaces, ordered most- to least-sensitive. */ +export type ReviewSurface = 'auth' | 'data' | 'infra' | 'build' | 'ui' | 'test' | 'docs' | 'none'; + +export interface RiskFloorInput { + /** Paths of changed files, repo-relative. Order-insensitive. */ + filesChanged: string[]; + /** Optional diff size signals; reserved for future weighting. */ + insertions?: number; + deletions?: number; +} + +export interface RiskFloorVerdict { + /** True when the change surface meets/exceeds the review threshold. */ + needs_review: boolean; + /** Aggregate risk score in [0, 1] — the max surface weight across files. */ + score: number; + /** The dominant (highest-weight) surface across all changed files. */ + surface: ReviewSurface; + /** Human-readable explanation naming the surface and tripping files. 
*/ + reason: string; +} + +/** Default review threshold; `score >= THRESHOLD` ⇒ `needs_review`. */ +export const DEFAULT_RISK_THRESHOLD = 0.5; + +interface SurfaceRule { + surface: ReviewSurface; + weight: number; + /** Regex matched against the file path; the auth/data/infra/build patterns carry the `i` flag, the ui/test/docs patterns are case-sensitive. */ + pattern: RegExp; +} + +/** + * Surface classification rules, evaluated highest-weight first. The first + * rule whose pattern matches a path classifies that file; the file's surface + * is the highest-risk surface it matches (rules are pre-sorted by weight). + */ +const SURFACE_RULES: readonly SurfaceRule[] = [ + { + surface: 'auth', + weight: 1.0, + pattern: /auth|login|session|token|permission|rbac|credential|secret/i, + }, + { + surface: 'data', + weight: 0.9, + pattern: /migration|prisma|schema|\.sql|entity|repository|seed/i, + }, + { + surface: 'infra', + weight: 0.85, + pattern: /docker|\.woodpecker|compose|traefik|deploy|helm|k8s|terraform/i, + }, + { + surface: 'build', + weight: 0.6, + pattern: /package\.json|tsconfig|turbo\.json|pnpm-|\.config\.|eslint|vite/i, + }, + { surface: 'ui', weight: 0.4, pattern: /\.tsx|\.css|components\/|apps\/web\// }, + { surface: 'test', weight: 0.2, pattern: /\.spec\.|\.test\.|__tests__\// }, + { surface: 'docs', weight: 0.1, pattern: /\.md$|docs\// }, +]; + +const NONE_WEIGHT = 0.0; + +/** Classify a single path to its highest-risk surface and weight. */ +function classify(path: string): { surface: ReviewSurface; weight: number } { + for (const rule of SURFACE_RULES) { + if (rule.pattern.test(path)) { + return { surface: rule.surface, weight: rule.weight }; + } + } + return { surface: 'none', weight: NONE_WEIGHT }; +} + +/** + * Evaluate the review risk-floor for a diff. 
+ * + * @param input changed files (+ optional size signals) + * @param threshold review cutoff; defaults to {@link DEFAULT_RISK_THRESHOLD} + */ +export function evaluateRiskFloor( + input: RiskFloorInput, + threshold: number = DEFAULT_RISK_THRESHOLD, +): RiskFloorVerdict { + const files = (input.filesChanged ?? []).filter((f) => typeof f === 'string' && f.length > 0); + + if (files.length === 0) { + return { + needs_review: false, + score: 0, + surface: 'none', + reason: 'no files changed', + }; + } + + let topSurface: ReviewSurface = 'none'; + let topWeight = NONE_WEIGHT; + const tripping: string[] = []; + + for (const file of files) { + const { surface, weight } = classify(file); + if (weight > topWeight) { + topWeight = weight; + topSurface = surface; + tripping.length = 0; + tripping.push(file); + } else if (weight === topWeight && surface === topSurface && surface !== 'none') { + tripping.push(file); + } + } + + const needs_review = topWeight >= threshold; + const reason = + topSurface === 'none' + ? `no sensitive surface in ${files.length} changed file(s)` + : `${topSurface} surface (weight ${topWeight}) in: ${tripping.join(', ')}`; + + return { needs_review, score: topWeight, surface: topSurface, reason }; +} diff --git a/packages/macp/src/schemas/reflection.v1.schema.json b/packages/macp/src/schemas/reflection.v1.schema.json new file mode 100644 index 0000000..a320411 --- /dev/null +++ b/packages/macp/src/schemas/reflection.v1.schema.json @@ -0,0 +1,105 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": "https://mosaicstack.dev/schemas/reflection/reflection.v1.schema.json", + "title": "Agent Reflection (v1)", + "description": "End-of-run reflection sidecar. 
Mechanical fields are written by the Stop hook; self-reported fields are merged from an optional agent-supplied input and are null when absent (provenance.degraded=true).", + "type": "object", + "required": [ + "schema", + "task_ref", + "agent", + "session_id", + "timestamp", + "repo", + "risk", + "files_changed", + "provenance" + ], + "properties": { + "schema": { + "const": "reflection.v1" + }, + "task_ref": { + "type": "string", + "description": "Canonical task ref; derived from REFLECTION_TASK_REF or repo+branch." + }, + "agent": { + "type": "string", + "description": "Persona/runtime id (REFLECTION_AGENT or 'unknown')." + }, + "session_id": { + "type": "string", + "description": "From the Stop payload session_id, else 'unknown'." + }, + "timestamp": { + "type": "string", + "format": "date-time", + "description": "ISO-8601 UTC capture time." + }, + "repo": { + "type": "string", + "description": "Repo root basename." + }, + "confidence": { + "type": ["number", "null"], + "minimum": 0, + "maximum": 1, + "description": "SELF-REPORTED. Agent's overall confidence; null when not supplied." + }, + "most_likely_wrong": { + "type": ["object", "null"], + "description": "SELF-REPORTED. The single most-likely way the work is wrong.", + "required": ["surface", "description"], + "properties": { + "surface": { "$ref": "#/$defs/surface" }, + "description": { "type": "string" } + }, + "additionalProperties": false + }, + "known_not_in_diff": { + "type": ["string", "null"], + "description": "SELF-REPORTED. What the agent knows that isn't visible in the diff." + }, + "risk": { + "type": "object", + "description": "MECHANICAL. 
Output of the diff risk-floor.", + "required": ["needs_review", "score", "surface", "reason"], + "properties": { + "needs_review": { "type": "boolean" }, + "score": { "type": "number", "minimum": 0, "maximum": 1 }, + "surface": { "$ref": "#/$defs/surface" }, + "reason": { "type": "string" } + }, + "additionalProperties": false + }, + "files_changed": { + "type": "array", + "items": { "type": "string" }, + "description": "MECHANICAL. git diff name-only." + }, + "provenance": { + "type": "object", + "required": ["source", "reflection_attempt", "degraded", "reflection_mode"], + "properties": { + "source": { "const": "stop-hook" }, + "reflection_attempt": { "type": "integer", "minimum": 1 }, + "degraded": { + "type": "boolean", + "description": "True when self-report inputs were missing/unreadable." + }, + "reflection_mode": { + "type": "string", + "enum": ["off", "solo", "orchestrated"] + } + }, + "additionalProperties": false + } + }, + "additionalProperties": false, + "$defs": { + "surface": { + "type": "string", + "enum": ["auth", "data", "infra", "build", "ui", "test", "docs", "none"] + } + } +} diff --git a/packages/mosaic/framework/runtime/claude/settings.json b/packages/mosaic/framework/runtime/claude/settings.json index 557fcbc..0318d9e 100644 --- a/packages/mosaic/framework/runtime/claude/settings.json +++ b/packages/mosaic/framework/runtime/claude/settings.json @@ -34,6 +34,17 @@ } ] } + ], + "Stop": [ + { + "hooks": [ + { + "type": "command", + "command": "~/.config/mosaic/tools/qa/reflect-stop-hook.sh", + "timeout": 15 + } + ] + } ] }, "enabledPlugins": { diff --git a/packages/mosaic/framework/tools/qa/reflect-stop-hook.sh b/packages/mosaic/framework/tools/qa/reflect-stop-hook.sh new file mode 100755 index 0000000..41fbd2d --- /dev/null +++ b/packages/mosaic/framework/tools/qa/reflect-stop-hook.sh @@ -0,0 +1,197 @@ +#!/usr/bin/env bash +# reflect-stop-hook.sh — Stop hook (agent reflection loop, durable kernel) +# +# At end-of-run, capture the doer's 
end-state as a structured `reflection.v1`
+# sidecar: the mechanical diff risk-floor plus any self-report the agent left
+# behind. This is the passive capture half of the design (§10 step 1). It does
+# NOT route, score, or gate — it only writes the sidecar; pickup is future work.
+#
+# FAIL-CLOSED: if REFLECTION_MODE is unset or "off", this is a strict no-op.
+# Global registration is therefore safe; the feature only activates when a
+# launcher/profile explicitly sets REFLECTION_MODE=solo|orchestrated.
+#
+# NON-BLOCKING: Stop hooks are observational. This script NEVER emits a
+# `decision` field and ALWAYS exits 0 — it can never fail or stall a session.
+#
+# Environment contract:
+#   REFLECTION_MODE            off|solo|orchestrated  (default: off → no-op)
+#   REFLECTION_DIR             output dir             (default: <repo>/.mosaic/reflections)
+#   REFLECTION_INPUT           self-report JSON       (default: <repo>/.mosaic/reflection-input.json)
+#   REFLECTION_TASK_REF        canonical task ref     (default: <repo>#<branch>)
+#   REFLECTION_AGENT           persona/runtime id     (default: unknown)
+#   REFLECTION_RISK_THRESHOLD  review cutoff [0,1]    (default: 0.5)
+#
+# Risk-floor surface table is kept in sync with the authoritative TS
+# implementation at packages/macp/src/risk-floor.ts (evaluateRiskFloor).
+#
+# Exit codes: always 0 (observational hook).
+
+set -euo pipefail
+
+# ---- fail-closed gate -------------------------------------------------------
+MODE="${REFLECTION_MODE:-off}"
+if [[ "$MODE" != "solo" && "$MODE" != "orchestrated" ]]; then
+  exit 0
+fi
+
+# Read the Stop payload (best-effort; never required).
+INPUT="$(cat || true)"
+
+# Sentinel lock path (global so the EXIT trap can clean it after main returns).
+LOCKFILE="" +trap 'rm -f "${LOCKFILE:-}" 2>/dev/null || true' EXIT + +main() { + command -v jq >/dev/null 2>&1 || return 0 # no jq → silently no-op + + local session_id payload_cwd repo_dir repo_name branch task_ref agent + session_id="$(printf '%s' "$INPUT" | jq -r '.session_id // "unknown"' 2>/dev/null || echo unknown)" + # Sanitize: session_id is interpolated into file/lock paths — allow safe + # filename chars only (defends against ../ or / in the payload). + session_id="${session_id//[^a-zA-Z0-9_-]/}" + session_id="${session_id:-unknown}" + payload_cwd="$(printf '%s' "$INPUT" | jq -r '.cwd // empty' 2>/dev/null || true)" + + # Resolve repo root: prefer git toplevel from the payload cwd, else PWD. + local start_dir="${payload_cwd:-$PWD}" + repo_dir="$(git -C "$start_dir" rev-parse --show-toplevel 2>/dev/null || echo "$start_dir")" + repo_name="$(basename "$repo_dir")" + branch="$(git -C "$repo_dir" rev-parse --abbrev-ref HEAD 2>/dev/null || echo detached)" + + task_ref="${REFLECTION_TASK_REF:-${repo_name}#${branch}}" + agent="${REFLECTION_AGENT:-unknown}" + + # ---- sentinel guard: avoid re-fire loops -------------------------------- + local out_dir lock + out_dir="${REFLECTION_DIR:-${repo_dir}/.mosaic/reflections}" + mkdir -p "$out_dir" 2>/dev/null || return 0 + lock="${out_dir}/.${session_id}.lock" + if [[ -e "$lock" ]]; then + return 0 + fi + : > "$lock" 2>/dev/null || true + LOCKFILE="$lock" + + # ---- mechanical: changed files ------------------------------------------ + # Union of committed-vs-HEAD~ is out of scope; capture the working surface: + # staged + unstaged + untracked, best-effort. + # Exclude .mosaic/ (agent scratch: reflections, locks, self-report input) — + # it is tooling state, not part of the diff under review. 
+ local files + files="$( + { + git -C "$repo_dir" diff --name-only HEAD 2>/dev/null || true + git -C "$repo_dir" diff --name-only --staged 2>/dev/null || true + git -C "$repo_dir" ls-files --others --exclude-standard 2>/dev/null || true + } | sed '/^$/d' | grep -v '^\.mosaic/' | sort -u || true + )" + + # ---- mechanical: risk-floor (inline port of evaluateRiskFloor) ---------- + local threshold="${REFLECTION_RISK_THRESHOLD:-0.5}" + local top_surface="none" top_weight="0.0" tripping="" + local f surface weight + while IFS= read -r f; do + [[ -z "$f" ]] && continue + surface="$(classify_surface "$f")" + weight="$(surface_weight "$surface")" + if awk "BEGIN{exit !($weight > $top_weight)}"; then + top_weight="$weight"; top_surface="$surface"; tripping="$f" + elif [[ "$surface" == "$top_surface" && "$surface" != "none" ]] && awk "BEGIN{exit !($weight == $top_weight)}"; then + tripping="${tripping:+$tripping, }$f" + fi + done <<< "$files" + + local needs_review reason file_count + file_count="$(printf '%s\n' "$files" | sed '/^$/d' | wc -l | tr -d ' ')" + if awk "BEGIN{exit !($top_weight >= $threshold)}"; then needs_review=true; else needs_review=false; fi + if [[ "$top_surface" == "none" ]]; then + if [[ "$file_count" -eq 0 ]]; then reason="no files changed"; else reason="no sensitive surface in ${file_count} changed file(s)"; fi + else + reason="${top_surface} surface (weight ${top_weight}) in: ${tripping}" + fi + + # ---- self-report merge (optional) --------------------------------------- + local input_file degraded self_json + input_file="${REFLECTION_INPUT:-${repo_dir}/.mosaic/reflection-input.json}" + degraded=true + self_json='{"confidence":null,"most_likely_wrong":null,"known_not_in_diff":null}' + if [[ -r "$input_file" ]] && jq -e . 
"$input_file" >/dev/null 2>&1; then + self_json="$(jq '{ + confidence: (.confidence // null), + most_likely_wrong: (.most_likely_wrong // null), + known_not_in_diff: (.known_not_in_diff // null) + }' "$input_file" 2>/dev/null || echo "$self_json")" + degraded=false + fi + + # ---- assemble + atomic write -------------------------------------------- + local ts files_json record tmp final + ts="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)" + files_json="$(printf '%s\n' "$files" | jq -R . | jq -s 'map(select(length>0))')" + + record="$(jq -n \ + --arg task_ref "$task_ref" \ + --arg agent "$agent" \ + --arg session_id "$session_id" \ + --arg ts "$ts" \ + --arg repo "$repo_name" \ + --argjson needs_review "$needs_review" \ + --argjson score "$top_weight" \ + --arg surface "$top_surface" \ + --arg reason "$reason" \ + --argjson files "$files_json" \ + --argjson self "$self_json" \ + --argjson degraded "$degraded" \ + --arg mode "$MODE" \ + '{ + schema: "reflection.v1", + task_ref: $task_ref, + agent: $agent, + session_id: $session_id, + timestamp: $ts, + repo: $repo, + confidence: $self.confidence, + most_likely_wrong: $self.most_likely_wrong, + known_not_in_diff: $self.known_not_in_diff, + risk: { needs_review: $needs_review, score: $score, surface: $surface, reason: $reason }, + files_changed: $files, + provenance: { source: "stop-hook", reflection_attempt: 1, degraded: $degraded, reflection_mode: $mode } + }' 2>/dev/null || true)" + + [[ -z "$record" ]] && return 0 + + final="${out_dir}/${session_id}-${ts//[:]/}.reflection.json" + tmp="${final}.tmp" + printf '%s\n' "$record" > "$tmp" 2>/dev/null || return 0 + mv -f "$tmp" "$final" 2>/dev/null || true +} + +# classify_surface PATH → surface name (highest-risk match wins, mirrors TS) +classify_surface() { + local p="$1" + if printf '%s' "$p" | grep -qiE 'auth|login|session|token|permission|rbac|credential|secret'; then echo auth; return; fi + if printf '%s' "$p" | grep -qiE 'migration|prisma|schema|\.sql|entity|repository|seed'; 
then echo data; return; fi + if printf '%s' "$p" | grep -qiE 'docker|\.woodpecker|compose|traefik|deploy|helm|k8s|terraform'; then echo infra; return; fi + if printf '%s' "$p" | grep -qiE 'package\.json|tsconfig|turbo\.json|pnpm-|\.config\.|eslint|vite'; then echo build; return; fi + if printf '%s' "$p" | grep -qE '\.tsx|\.css|components/|apps/web/'; then echo ui; return; fi + if printf '%s' "$p" | grep -qE '\.spec\.|\.test\.|__tests__/'; then echo test; return; fi + if printf '%s' "$p" | grep -qE '\.md$|docs/'; then echo docs; return; fi + echo none +} + +# surface_weight SURFACE → numeric weight (mirrors TS SURFACE_RULES) +surface_weight() { + case "$1" in + auth) echo 1.0 ;; + data) echo 0.9 ;; + infra) echo 0.85 ;; + build) echo 0.6 ;; + ui) echo 0.4 ;; + test) echo 0.2 ;; + docs) echo 0.1 ;; + *) echo 0.0 ;; + esac +} + +main || true +exit 0 diff --git a/packages/types/src/index.ts b/packages/types/src/index.ts index 49ae520..d35b52c 100644 --- a/packages/types/src/index.ts +++ b/packages/types/src/index.ts @@ -6,3 +6,4 @@ export * from './provider/index.js'; export * from './routing/index.js'; export * from './commands/index.js'; export * from './federation/index.js'; +export * from './reflection/index.js'; diff --git a/packages/types/src/reflection/__tests__/reflection.spec.ts b/packages/types/src/reflection/__tests__/reflection.spec.ts new file mode 100644 index 0000000..6d6ff54 --- /dev/null +++ b/packages/types/src/reflection/__tests__/reflection.spec.ts @@ -0,0 +1,146 @@ +/** + * Unit tests for the reflection.v1 schema + self-report boundary. + * + * The runtime source of truth is the zod schema set in `reflection.ts`. The + * class-validator `ReflectionSelfReportDto` is the NestJS-side boundary type + * (exercised under the gateway app's reflect-metadata runtime, mirroring how + * `chat.dto.ts` is tested in apps/gateway); here we validate the self-report + * input with its zod counterpart, which is what the Stop hook actually uses. 
+ * + * Coverage: + * - REVIEW_SURFACES canonical ordering (the enum both zod + JSON Schema mirror) + * - ReflectionV1Schema accepts a fully-populated record + * - ReflectionV1Schema accepts a degraded record (self-report fields null) + * - ReflectionV1Schema rejects bad schema literal / out-of-range confidence / bad surface + * - ReflectionSelfReportSchema accepts valid + empty, rejects bad input + */ + +import { describe, expect, it } from 'vitest'; + +import { + REVIEW_SURFACES, + ReflectionV1Schema, + ReflectionSelfReportSchema, + type ReflectionV1, +} from '../index.js'; + +const baseMechanical = { + schema: 'reflection.v1' as const, + task_ref: 'stack#544', + agent: 'claude', + session_id: 'sess-abc', + timestamp: '2026-06-16T00:00:00.000Z', + repo: 'stack', + risk: { + needs_review: true, + score: 1.0, + surface: 'auth' as const, + reason: 'auth surface (weight 1) in: src/auth.ts', + }, + files_changed: ['src/auth.ts'], + provenance: { + source: 'stop-hook' as const, + reflection_attempt: 1, + degraded: false, + reflection_mode: 'solo' as const, + }, +}; + +describe('REVIEW_SURFACES', () => { + it('keeps the canonical most→least-sensitive ordering', () => { + expect(REVIEW_SURFACES).toEqual([ + 'auth', + 'data', + 'infra', + 'build', + 'ui', + 'test', + 'docs', + 'none', + ]); + }); +}); + +describe('ReflectionV1Schema', () => { + it('accepts a fully-populated record', () => { + const rec: ReflectionV1 = { + ...baseMechanical, + confidence: 0.7, + most_likely_wrong: { surface: 'auth', description: 'token refresh untested' }, + known_not_in_diff: 'manual QA only on the happy path', + }; + expect(() => ReflectionV1Schema.parse(rec)).not.toThrow(); + }); + + it('accepts a degraded record with null self-report fields', () => { + const rec: ReflectionV1 = { + ...baseMechanical, + confidence: null, + most_likely_wrong: null, + known_not_in_diff: null, + provenance: { ...baseMechanical.provenance, degraded: true }, + }; + expect(() => 
ReflectionV1Schema.parse(rec)).not.toThrow(); + }); + + it('rejects a wrong schema literal', () => { + const bad = { + ...baseMechanical, + schema: 'reflection.v2', + confidence: null, + most_likely_wrong: null, + known_not_in_diff: null, + }; + expect(() => ReflectionV1Schema.parse(bad)).toThrow(); + }); + + it('rejects out-of-range confidence', () => { + const bad = { + ...baseMechanical, + confidence: 1.5, + most_likely_wrong: null, + known_not_in_diff: null, + }; + expect(() => ReflectionV1Schema.parse(bad)).toThrow(); + }); + + it('rejects an unknown surface', () => { + const bad = { + ...baseMechanical, + risk: { ...baseMechanical.risk, surface: 'network' }, + confidence: null, + most_likely_wrong: null, + known_not_in_diff: null, + }; + expect(() => ReflectionV1Schema.parse(bad)).toThrow(); + }); +}); + +describe('ReflectionSelfReportSchema', () => { + it('accepts a valid self-report', () => { + const ok = ReflectionSelfReportSchema.safeParse({ + confidence: 0.8, + most_likely_wrong: { + surface: 'data', + description: 'migration not run against prod-sized data', + }, + known_not_in_diff: 'rollback path untested', + }); + expect(ok.success).toBe(true); + }); + + it('accepts an empty self-report (all optional)', () => { + expect(ReflectionSelfReportSchema.safeParse({}).success).toBe(true); + }); + + it('rejects confidence above 1', () => { + expect(ReflectionSelfReportSchema.safeParse({ confidence: 2 }).success).toBe(false); + }); + + it('rejects an unknown most_likely_wrong.surface', () => { + const res = ReflectionSelfReportSchema.safeParse({ + most_likely_wrong: { surface: 'network', description: 'x' }, + }); + expect(res.success).toBe(false); + }); +}); diff --git a/packages/types/src/reflection/index.ts b/packages/types/src/reflection/index.ts new file mode 100644 index 0000000..67f9f6e --- /dev/null +++ b/packages/types/src/reflection/index.ts @@ -0,0 +1,30 @@ +/** + * Agent reflection (v1) — public barrel. 
+ *
+ *   reflection.ts      — zod schemas (runtime source of truth) + inferred types
+ *   reflection.dto.ts  — class-validator DTO for the agent self-report input
+ */
+
+export {
+  REVIEW_SURFACES,
+  ReviewSurfaceSchema,
+  MostLikelyWrongSchema,
+  ReflectionRiskSchema,
+  ReflectionModeSchema,
+  ReflectionProvenanceSchema,
+  ReflectionSelfReportSchema,
+  ReflectionV1Schema,
+  REFLECTION_SCHEMA_ID,
+} from './reflection.js';
+
+export type {
+  ReviewSurface,
+  MostLikelyWrong,
+  ReflectionRisk,
+  ReflectionMode,
+  ReflectionProvenance,
+  ReflectionSelfReport,
+  ReflectionV1,
+} from './reflection.js';
+
+export { MostLikelyWrongDto, ReflectionSelfReportDto } from './reflection.dto.js';
diff --git a/packages/types/src/reflection/reflection.dto.ts b/packages/types/src/reflection/reflection.dto.ts
new file mode 100644
index 0000000..9f63bbf
--- /dev/null
+++ b/packages/types/src/reflection/reflection.dto.ts
@@ -0,0 +1,55 @@
+/**
+ * Reflection self-report DTO — class-validator boundary.
+ *
+ * Validates the agent-supplied self-report input (the optional
+ * `$REFLECTION_INPUT` file, default `<repo>/.mosaic/reflection-input.json`)
+ * before it is merged into a `reflection.v1` record. This is the only
+ * externally-authored input on the reflection path, so it gets a DTO per the
+ * Mosaic module-boundary rule.
+ *
+ * Class-validator only (no class-transformer `@Type`) — matching `chat.dto.ts`
+ * — so the module is safe to import without a `reflect-metadata` shim. Deep
+ * nested validation of `most_likely_wrong` is owned by the zod
+ * `ReflectionSelfReportSchema` in `reflection.ts`, which is what the Stop hook
+ * actually enforces at runtime.
+ */
+
+import {
+  IsIn,
+  IsNumber,
+  IsObject,
+  IsOptional,
+  IsString,
+  Max,
+  Min,
+  MaxLength,
+} from 'class-validator';
+
+import { REVIEW_SURFACES } from './reflection.js';
+
+/** Shape of `most_likely_wrong`; validated structurally by zod at runtime.
*/
+export class MostLikelyWrongDto {
+  @IsIn(REVIEW_SURFACES as unknown as string[])
+  surface!: string;
+
+  @IsString()
+  @MaxLength(4_000)
+  description!: string;
+}
+
+export class ReflectionSelfReportDto {
+  @IsOptional()
+  @IsNumber()
+  @Min(0)
+  @Max(1)
+  confidence?: number;
+
+  @IsOptional()
+  @IsObject()
+  most_likely_wrong?: MostLikelyWrongDto;
+
+  @IsOptional()
+  @IsString()
+  @MaxLength(8_000)
+  known_not_in_diff?: string;
+}
diff --git a/packages/types/src/reflection/reflection.ts b/packages/types/src/reflection/reflection.ts
new file mode 100644
index 0000000..0d4bdae
--- /dev/null
+++ b/packages/types/src/reflection/reflection.ts
@@ -0,0 +1,90 @@
+/**
+ * Agent reflection (v1) — wire schema.
+ *
+ * Runtime source of truth for the `reflection.v1` sidecar emitted at end-of-run
+ * by the Stop hook (design §10 step 1). The JSON Schema artifact at
+ * `@mosaicstack/macp` `src/schemas/reflection.v1.schema.json` is the documented
+ * contract; this zod schema is the executable one and MUST agree with it.
+ *
+ * Field provenance:
+ * - MECHANICAL (risk, files_changed, ids, provenance): written by the hook.
+ * - SELF-REPORTED (confidence, most_likely_wrong, known_not_in_diff): merged
+ *   from an optional agent-supplied input; null when absent.
+ *
+ * Pure — no NestJS, no DB, no Node-only APIs. Safe for browser/edge.
+ */
+
+import { z } from 'zod';
+
+/** Review surfaces, ordered most- to least-sensitive. Mirrors macp risk-floor. */
+export const REVIEW_SURFACES = [
+  'auth',
+  'data',
+  'infra',
+  'build',
+  'ui',
+  'test',
+  'docs',
+  'none',
+] as const;
+
+export const ReviewSurfaceSchema = z.enum(REVIEW_SURFACES);
+export type ReviewSurface = z.infer<typeof ReviewSurfaceSchema>;
+
+/** SELF-REPORTED: the single most-likely way the work is wrong. */
+export const MostLikelyWrongSchema = z.object({
+  surface: ReviewSurfaceSchema,
+  description: z.string(),
+});
+export type MostLikelyWrong = z.infer<typeof MostLikelyWrongSchema>;
+
+/** MECHANICAL: output of the diff risk-floor (see `@mosaicstack/macp`).
*/
+export const ReflectionRiskSchema = z.object({
+  needs_review: z.boolean(),
+  score: z.number().min(0).max(1),
+  surface: ReviewSurfaceSchema,
+  reason: z.string(),
+});
+export type ReflectionRisk = z.infer<typeof ReflectionRiskSchema>;
+
+export const ReflectionModeSchema = z.enum(['off', 'solo', 'orchestrated']);
+export type ReflectionMode = z.infer<typeof ReflectionModeSchema>;
+
+export const ReflectionProvenanceSchema = z.object({
+  source: z.literal('stop-hook'),
+  reflection_attempt: z.number().int().min(1),
+  degraded: z.boolean(),
+  reflection_mode: ReflectionModeSchema,
+});
+export type ReflectionProvenance = z.infer<typeof ReflectionProvenanceSchema>;
+
+/**
+ * The self-reported half of a reflection. Supplied by the agent out-of-band
+ * (e.g. `<repo>/.mosaic/reflection-input.json`) and merged by the hook. All
+ * fields optional; missing fields become `null` in the assembled record.
+ */
+export const ReflectionSelfReportSchema = z.object({
+  confidence: z.number().min(0).max(1).nullable().optional(),
+  most_likely_wrong: MostLikelyWrongSchema.nullable().optional(),
+  known_not_in_diff: z.string().nullable().optional(),
+});
+export type ReflectionSelfReport = z.infer<typeof ReflectionSelfReportSchema>;
+
+/** The full assembled `reflection.v1` sidecar.
*/
+export const ReflectionV1Schema = z.object({
+  schema: z.literal('reflection.v1'),
+  task_ref: z.string(),
+  agent: z.string(),
+  session_id: z.string(),
+  timestamp: z.string(),
+  repo: z.string(),
+  confidence: z.number().min(0).max(1).nullable(),
+  most_likely_wrong: MostLikelyWrongSchema.nullable(),
+  known_not_in_diff: z.string().nullable(),
+  risk: ReflectionRiskSchema,
+  files_changed: z.array(z.string()),
+  provenance: ReflectionProvenanceSchema,
+});
+export type ReflectionV1 = z.infer<typeof ReflectionV1Schema>;
+
+export const REFLECTION_SCHEMA_ID = 'reflection.v1' as const;
diff --git a/scripts/analysis/reflect-board-history.sh b/scripts/analysis/reflect-board-history.sh
new file mode 100755
index 0000000..d982dc5
--- /dev/null
+++ b/scripts/analysis/reflect-board-history.sh
@@ -0,0 +1,111 @@
+#!/usr/bin/env bash
+# reflect-board-history.sh — Phase-0 experiment P3 (outcome detectability)
+#
+# Question: for completed tasks, how often does a machine-detectable
+# correct/wrong outcome signal appear within a follow-up window (default 30d)?
+# If the base rate is too low, predicted-vs-actual calibration (design §7) has
+# nothing to score against, so the kernel should capture caveat-notes only.
+#
+# Method: consume a board/task export (JSONL, one task object per line) OR fall
+# back to scanning the git history of a `data/` task directory. For each task
+# that reached a "done"-like state, decide whether a later signal marks it
+# correct or wrong (reopen, revert, follow-up "fix"/"regression", explicit
+# outcome field). Emit the detectable-outcome base rate. HARNESS + RUBRIC.
+#
+# Usage:
+#   scripts/analysis/reflect-board-history.sh --jsonl FILE [--window-days N] [--json|--md]
+#   scripts/analysis/reflect-board-history.sh --data-dir DIR [--window-days N] [--json|--md]
+#
+# JSONL fields used (best-effort): .id .status .completed_at .outcome
+#   .reopened_at .followups[] (free-form). Missing fields are tolerated.
+#
+# Requirements: jq (for --jsonl), git (for --data-dir), awk.
+# +# PRE-REGISTERED KILL CONDITION: +# detectable-outcome base rate < 20% ⇒ do NOT build §7 calibration loop; +# capture caveat-notes only. + +set -euo pipefail + +JSONL="" +DATA_DIR="" +WINDOW_DAYS=30 +FORMAT="json" + +while [[ $# -gt 0 ]]; do + case "$1" in + --jsonl) JSONL="$2"; shift 2 ;; + --data-dir) DATA_DIR="$2"; shift 2 ;; + --window-days) WINDOW_DAYS="$2"; shift 2 ;; + --json) FORMAT="json"; shift ;; + --md) FORMAT="md"; shift ;; + -h|--help) sed -n '2,32p' "$0"; exit 0 ;; + *) echo "unknown arg: $1" >&2; exit 2 ;; + esac +done + +KILL_CONDITION='detectable-outcome base rate < 20% ⇒ do NOT build §7 calibration loop' +echo "# pre-registered kill condition: ${KILL_CONDITION}" >&2 + +done_total=0 +detectable=0 + +if [[ -n "$JSONL" ]]; then + command -v jq >/dev/null 2>&1 || { echo "jq required for --jsonl" >&2; exit 3; } + [[ -r "$JSONL" ]] || { echo "cannot read $JSONL" >&2; exit 3; } + # Count done tasks and those with a machine-detectable outcome signal. + done_total="$(jq -rs '[.[] | select((.status // "") | test("done|complete|closed"; "i"))] | length' "$JSONL" 2>/dev/null || echo 0)" + detectable="$(jq -rs ' + [ .[] + | select((.status // "") | test("done|complete|closed"; "i")) + | select( + (.outcome // null) != null + or (.reopened_at // null) != null + or ((.followups // []) | length) > 0 + ) + ] | length' "$JSONL" 2>/dev/null || echo 0)" +elif [[ -n "$DATA_DIR" ]]; then + command -v git >/dev/null 2>&1 || { echo "git required for --data-dir" >&2; exit 3; } + [[ -d "$DATA_DIR" ]] || { echo "no such dir: $DATA_DIR" >&2; exit 3; } + # Proxy: a task file later touched by a commit whose subject signals a + # correction is a "detectable outcome". 
+ while IFS= read -r file; do + [[ -z "$file" ]] && continue + done_total=$((done_total + 1)) + if git -C "$DATA_DIR" log --since="${WINDOW_DAYS} days ago" --pretty='%s' -- "$file" 2>/dev/null \ + | grep -qiE 'reopen|revert|fix|regression|wrong|incorrect|redo'; then + detectable=$((detectable + 1)) + fi + done < <(find "$DATA_DIR" -type f -name '*.json' 2>/dev/null) +else + echo "provide --jsonl FILE or --data-dir DIR" >&2 + exit 2 +fi + +rate="$(awk "BEGIN{ if ($done_total==0) print \"0.0\"; else printf \"%.1f\", 100*$detectable/$done_total }")" +verdict="$(awk "BEGIN{print ($rate < 20.0) ? \"KILL §7 — caveat-notes only\" : \"signal present — proceed\"}")" + +if [[ "$FORMAT" == "md" ]]; then + cat <= threshold), and compare +# to the pre-registered chance baseline (the overall correct-rate). HARNESS + +# RUBRIC; the labelled corpus is supplied later. +# +# Usage: +# scripts/analysis/reflect-calibration.sh --jsonl FILE [--high 0.8] [--json|--md] +# +# Requirements: jq, awk. +# +# PRE-REGISTERED KILL CONDITION: +# AUC <= 0.60 OR high-subset lift <= +5pp over base rate +# ⇒ confidence is not a usable routing signal; do NOT build §7–§8. + +set -euo pipefail + +JSONL="" +HIGH=0.8 +FORMAT="json" + +while [[ $# -gt 0 ]]; do + case "$1" in + --jsonl) JSONL="$2"; shift 2 ;; + --high) HIGH="$2"; shift 2 ;; + --json) FORMAT="json"; shift ;; + --md) FORMAT="md"; shift ;; + -h|--help) sed -n '2,27p' "$0"; exit 0 ;; + *) echo "unknown arg: $1" >&2; exit 2 ;; + esac +done + +KILL_CONDITION='AUC <= 0.60 OR high-subset lift <= +5pp ⇒ do NOT build §7–§8' +echo "# pre-registered kill condition: ${KILL_CONDITION}" >&2 + +command -v jq >/dev/null 2>&1 || { echo "jq required" >&2; exit 3; } +[[ -r "$JSONL" ]] || { echo "provide a readable --jsonl FILE" >&2; exit 2; } + +# Normalise to " <0|1>" rows; tolerate bad lines. +ROWS="$(jq -rs ' + [ .[] | select((.confidence|type)=="number") | + "\(.confidence) \((.correct==true) | if . 
then 1 else 0 end)" ] + | .[]' "$JSONL" 2>/dev/null || true)" + +if [[ -z "$ROWS" ]]; then + echo '{ "experiment": "P1-calibration", "error": "no usable rows" }' + exit 0 +fi + +# AUC via the Mann–Whitney U relation (rank-based); base rate; high-subset lift. +read -r N POS BASE AUC HIGH_N HIGH_CORRECT HIGH_RATE LIFT <=high) { hn++; if (y==1) hc++ } } + END{ + base = (n>0)? pos/n : 0; + # Rank-sum AUC: average ranks (ties → average rank). + # sort indices by confidence + for (i=1;i<=n;i++) idx[i]=i; + for (i=1;i<=n;i++) for (j=i+1;j<=n;j++) if (conf[idx[i]]>conf[idx[j]]) { t=idx[i]; idx[i]=idx[j]; idx[j]=t } + i=1; + while (i<=n) { + j=i; while (j0 && neg>0) auc=(rsum - pos*(pos+1)/2.0)/(pos*neg); else auc=0.5; + hrate=(hn>0)? hc/hn : 0; + lift=hrate-base; + printf "%d %d %.4f %.4f %d %d %.4f %.4f", n, pos, base, auc, hn, hc, hrate, lift + }') +EOF + +verdict="$(awk -v auc="$AUC" -v lift="$LIFT" 'BEGIN{ + print (auc <= 0.60 || lift <= 0.05) ? "KILL §7–§8 — confidence not usable" : "signal present — proceed" +}')" + +if [[ "$FORMAT" == "md" ]]; then + cat <= ${HIGH}): n=${HIGH_N}, correct=${HIGH_CORRECT}, rate=$(awk "BEGIN{printf \"%.1f\", 100*${HIGH_RATE}}")% +- lift over base: **$(awk "BEGIN{printf \"%+.1f\", 100*${LIFT}}")pp** +- kill condition: ${KILL_CONDITION} +- verdict: **${verdict}** +EOF +else + awk -v n="$N" -v pos="$POS" -v base="$BASE" -v auc="$AUC" -v hn="$HIGH_N" \ + -v hc="$HIGH_CORRECT" -v hr="$HIGH_RATE" -v lift="$LIFT" -v high="$HIGH" \ + -v v="$verdict" -v kc="$KILL_CONDITION" 'BEGIN{ + printf "{\n" + printf " \"experiment\": \"P1-calibration\",\n" + printf " \"rows\": %d,\n", n + printf " \"positives\": %d,\n", pos + printf " \"base_rate\": %.4f,\n", base + printf " \"auc\": %.4f,\n", auc + printf " \"high_threshold\": %s,\n", high + printf " \"high_subset\": { \"n\": %d, \"correct\": %d, \"rate\": %.4f },\n", hn, hc, hr + printf " \"lift_over_base\": %.4f,\n", lift + printf " \"kill_condition\": \"%s\",\n", kc + printf " \"verdict\": \"%s\"\n", 
v + printf "}\n" + }' +fi diff --git a/scripts/analysis/reflect-git-history.sh b/scripts/analysis/reflect-git-history.sh new file mode 100755 index 0000000..129a2bd --- /dev/null +++ b/scripts/analysis/reflect-git-history.sh @@ -0,0 +1,110 @@ +#!/usr/bin/env bash +# reflect-git-history.sh — Phase-0 experiment P2 ("only-self-reflection" bucket) +# +# Question: of the failures visible in git history, what fraction would ONLY +# have been caught by end-of-run self-reflection — i.e. NOT by CI and NOT by +# independent human review? If that bucket is near-empty, the closed +# calibration / skill-synthesis loop (design §7–§8) is not worth building. +# +# Method: scan `git log` over a window for failure signals (reverts, and +# fix:/hotfix commits landing shortly after a feature merge). Classify each by +# the gate most likely to have caught it, using a pre-registered heuristic. +# This is a HARNESS + RUBRIC; the classifier is deliberately simple and the +# real corpus/labelling is wired later. It emits a structured tally. +# +# Usage: +# scripts/analysis/reflect-git-history.sh [--repo PATH] [--since SINCE] [--json|--md] +# +# Options: +# --repo PATH repo to analyse (default: current repo) +# --since SINCE git log --since value (default: "6 months ago") +# --json emit JSON (default) +# --md emit markdown +# +# Requirements: git, awk. +# +# PRE-REGISTERED KILL CONDITION: +# bucket "only_self_reflection" is near-empty (< 10% of classified failures) +# ⇒ do NOT build design §7–§8 (closed loop). Caveat-notes capture only. + +set -euo pipefail + +REPO="." 
+SINCE="6 months ago" +FORMAT="json" + +while [[ $# -gt 0 ]]; do + case "$1" in + --repo) REPO="$2"; shift 2 ;; + --since) SINCE="$2"; shift 2 ;; + --json) FORMAT="json"; shift ;; + --md) FORMAT="md"; shift ;; + -h|--help) sed -n '2,30p' "$0"; exit 0 ;; + *) echo "unknown arg: $1" >&2; exit 2 ;; + esac +done + +KILL_CONDITION='bucket only_self_reflection < 10% of classified failures ⇒ do NOT build §7–§8' +echo "# pre-registered kill condition: ${KILL_CONDITION}" >&2 + +command -v git >/dev/null 2>&1 || { echo "git required" >&2; exit 3; } + +# Collect candidate failure commits: reverts + fix/hotfix subjects. +mapfile -t LINES < <( + git -C "$REPO" log --since="$SINCE" --pretty='%H%x09%s' 2>/dev/null \ + | grep -iE 'revert|hotfix|hot-fix|regression|fix(\(|:|!| )' || true +) + +total=0; ci=0; human=0; selfonly=0 +for line in "${LINES[@]}"; do + [[ -z "$line" ]] && continue + subj="${line#*$'\t'}" + total=$((total + 1)) + # Pre-registered classification heuristic (gate most likely to have caught it): + # - build/test/lint/type/ci signals → CI would have caught it + # - security/auth/permission/data/migration → human review would flag it + # - everything else (logic/UX/assumption/edge) → only-self-reflection bucket + if printf '%s' "$subj" | grep -qiE 'test|lint|type|build|ci|compile|typo'; then + ci=$((ci + 1)) + elif printf '%s' "$subj" | grep -qiE 'security|auth|permission|rbac|secret|migration|data|sql|injection'; then + human=$((human + 1)) + else + selfonly=$((selfonly + 1)) + fi +done + +pct() { awk "BEGIN{ if ($2==0) print \"0.0\"; else printf \"%.1f\", 100*$1/$2 }"; } +self_pct="$(pct "$selfonly" "$total")" +verdict="$(awk "BEGIN{print ($self_pct < 10.0) ? \"KILL §7–§8\" : \"signal present — proceed to deeper labelling\"}")" + +if [[ "$FORMAT" == "md" ]]; then + cat <