# 362 - Auth Session Chain Debug (Authentik -> BetterAuth -> API Guard)
## Context
- Date (UTC): 2026-02-19
- Environment under test: production domains
  - Web: `https://app.mosaicstack.dev/login`
  - API: `https://api.mosaicstack.dev`
  - IdP: `https://auth.diversecanvas.com`
- Tooling: Playwright MCP + Chromium
## Problem Statement
Users can complete Authentik login and consent, but the Mosaic web app then returns to the login page and the user remains unauthenticated.
## Timeline and Evidence
1. Initial reproduction from web login:
   - `POST /auth/sign-in/oauth2` returned `200` with Authentik authorize URL.
   - Authentik login flow and consent screen loaded correctly.
2. First callback failure mode (before `jarvis` email fix):
   - Callback ended at API error redirect with `error=email_is_missing`.
   - Result URL: `https://api.mosaicstack.dev/?error=email_is_missing`.
3. User updated Authentik account:
   - `jarvis` account email set to `jarvis@mosaic.local`.
   - `email_is_missing` failure no longer occurs.
4. Current callback behavior (after email fix):
   - `GET /auth/oauth2/callback/authentik?code=...&state=...` returns `302` to `https://app.mosaicstack.dev/`.
   - Callback sets BetterAuth cookies:
     - `__Secure-better-auth.state=...; Max-Age=0; ...`
     - `__Secure-better-auth.session_token=...; Max-Age=604800; Path=/; HttpOnly; Secure; SameSite=Lax`
   - Browser cookie jar confirms session cookie present for `api.mosaicstack.dev`.
5. Session validation mismatch (critical):
   - BetterAuth direct session endpoint succeeds:
     - `GET /auth/get-session` -> `200` with session payload.
   - Guarded API session endpoint fails:
     - `GET /auth/session` -> `401` with `{"message":"Invalid or expired session", ...}`
   - Reproduced repeatedly in same browser context immediately after callback.
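
For reading the cookie evidence in step 4, here is a minimal sketch of parsing a `Set-Cookie` value. `parseSetCookie` is a hypothetical helper written for this note, not code from the repository, and `PLACEHOLDER` stands in for the elided token value.

```typescript
// Hypothetical helper: splits a Set-Cookie header value into its
// name/value pair and a map of lower-cased attributes.
interface ParsedSetCookie {
  name: string;
  value: string;
  attributes: Record<string, string | true>;
}

function parseSetCookie(header: string): ParsedSetCookie {
  const [pair, ...attrs] = header.split(";").map((part) => part.trim());
  const eq = pair.indexOf("=");
  const attributes: Record<string, string | true> = {};
  for (const attr of attrs) {
    if (attr === "") continue;
    const i = attr.indexOf("=");
    // Flag attributes (HttpOnly, Secure) carry no value.
    if (i === -1) attributes[attr.toLowerCase()] = true;
    else attributes[attr.slice(0, i).toLowerCase()] = attr.slice(i + 1);
  }
  return { name: pair.slice(0, eq), value: pair.slice(eq + 1), attributes };
}

// Attribute string copied from the callback evidence; the token is a placeholder.
const sessionCookie = parseSetCookie(
  "__Secure-better-auth.session_token=PLACEHOLDER; Max-Age=604800; Path=/; HttpOnly; Secure; SameSite=Lax",
);
const sessionMaxAgeDays = Number(sessionCookie.attributes["max-age"]) / 86400;
console.log(sessionMaxAgeDays); // 7
```

Read this way, the session cookie is valid for seven days, while `Max-Age=0` on the state cookie tells the browser to delete it once the callback completes.
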
## Config Sync Notes
User synced local files with deployed Portainer stack:
- `.env` updated with deployed values.
- `docker-compose.swarm.portainer.yml` changed:
  - Removed `BETTER_AUTH_URL` env mapping from API service.
Observed auth behavior after sync:
- Improvement: the `email_is_missing` callback error no longer occurs.
- Remaining failure: `/auth/session` still returns 401 despite valid BetterAuth cookie and successful `/auth/get-session`.
## Root Cause Hypothesis (Strong)
`AuthGuard` extracts the BetterAuth session cookie token correctly, but `AuthService.verifySession()` then validates it by sending `Authorization: Bearer <token>` instead of passing the token in a BetterAuth-compatible cookie header.
Relevant code paths:
- `apps/api/src/auth/guards/auth.guard.ts`
  - extracts `__Secure-better-auth.session_token` / `better-auth.session_token`
- `apps/api/src/auth/auth.service.ts`
  - `verifySession()` calls `auth.api.getSession({ headers: { authorization: "Bearer ..." } })`
Why this matches evidence:
- `/auth/get-session` (native BetterAuth endpoint reading request cookie) succeeds.
- `/auth/session` (custom guard + verify path) fails for same browser session.
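
The guard's extraction step can be sketched roughly as follows. This is an illustration, not the code in `auth.guard.ts`: `extractSessionToken` and `SESSION_COOKIE_NAMES` are hypothetical names, and the lookup order is an assumption.

```typescript
// Cookie names are taken from this scratchpad; the lookup order is assumed.
const SESSION_COOKIE_NAMES = [
  "__Secure-better-auth.session_token",
  "better-auth.session_token",
];

// Hypothetical helper: returns the first session cookie found in a raw
// Cookie request header, keeping the value exactly as received.
function extractSessionToken(
  cookieHeader: string | undefined,
): { cookieName: string; token: string } | undefined {
  if (!cookieHeader) return undefined;
  const jar = new Map<string, string>();
  for (const part of cookieHeader.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue;
    // Split on the first "=" only, so "=" inside the value is preserved.
    jar.set(part.slice(0, eq).trim(), part.slice(eq + 1).trim());
  }
  for (const cookieName of SESSION_COOKIE_NAMES) {
    const token = jar.get(cookieName);
    if (token) return { cookieName, token };
  }
  return undefined;
}

console.log(extractSessionToken("theme=dark; better-auth.session_token=abc.def="));
```

Extraction of this kind only yields the token. The hypothesis above concerns the next step, where the token is forwarded under the wrong header.
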
## Next Actions
1. Fix `verifySession()` to validate using BetterAuth-compatible cookie header candidates first, with bearer fallback for API clients.
2. Add/update unit tests in `auth.service.spec.ts` to cover cookie-first validation and bearer fallback.
3. Re-run targeted API auth tests.
4. Re-run Playwright auth chain to confirm:
   - callback sets cookie
   - `/auth/session` returns `200`
   - web app transitions out of `/login`.
## Implementation Update (2026-02-19)
Completed items:
1. Updated backend session verification logic:
   - File: `apps/api/src/auth/auth.service.ts`
   - `verifySession()` now tries session resolution in this order:
     - `cookie: __Secure-better-auth.session_token=<token>`
     - `cookie: better-auth.session_token=<token>`
     - `cookie: __Host-better-auth.session_token=<token>`
     - `authorization: Bearer <token>` (fallback)
   - Added helper methods:
     - `buildSessionHeaderCandidates()`
     - `isExpectedAuthError()`
2. Added/updated tests:
   - File: `apps/api/src/auth/auth.service.spec.ts`
   - Added RED->GREEN test:
     - `should validate session token using secure BetterAuth cookie header`
   - Updated fallback coverage test:
     - `should fall back to Authorization header when cookie-based lookups miss`
3. Verification:
   - Command: `pnpm --filter @mosaic/api test -- src/auth/auth.service.spec.ts`
   - Result: pass (all tests green).
   - Command: `pnpm --filter @mosaic/api lint`
   - Result: pass.
Remaining step (requires deploy):
- Redeploy API with this patch and rerun live Playwright flow on `app.mosaicstack.dev` to confirm `/auth/session` returns `200` after callback.
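
The candidate ordering described in this update can be sketched as below. The header order comes from the list above; the function bodies, the `HeaderCandidate` and `SessionLookup` types, and the injected `lookup` callback are assumptions made for illustration and may differ from the real `auth.service.ts`. `isExpectedAuthError()` is not sketched because its behavior is not described here.

```typescript
type HeaderCandidate = Record<string, string>;

// Order mirrors the list in this scratchpad; the body is an illustrative guess.
function buildSessionHeaderCandidates(token: string): HeaderCandidate[] {
  return [
    { cookie: `__Secure-better-auth.session_token=${token}` },
    { cookie: `better-auth.session_token=${token}` },
    { cookie: `__Host-better-auth.session_token=${token}` },
    { authorization: `Bearer ${token}` }, // fallback for API clients
  ];
}

// Hypothetical stand-in for the real session resolver.
type SessionLookup = (headers: HeaderCandidate) => Promise<{ userId: string } | null>;

// Tries each candidate in order and returns the first session that resolves.
async function verifySession(
  token: string,
  lookup: SessionLookup,
): Promise<{ userId: string } | null> {
  for (const headers of buildSessionHeaderCandidates(token)) {
    const resolved = await lookup(headers);
    if (resolved) return resolved;
  }
  return null;
}

// Usage with a fake resolver that only accepts the secure cookie form.
const fakeLookup: SessionLookup = async (headers) =>
  headers.cookie === "__Secure-better-auth.session_token=tok" ? { userId: "u1" } : null;
verifySession("tok", fakeLookup).then((result) => console.log(result));
```

Injecting the resolver keeps the ordering logic testable without a running BetterAuth instance, which is roughly what the two spec cases above exercise.
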
## Playwright Re-Check (2026-02-19, later run)
Live flow evidence after previous deploy attempt:
1. OAuth callback succeeds:
   - `GET https://api.mosaicstack.dev/auth/oauth2/callback/authentik?code=...&state=...` -> `302`
   - Redirect target observed: `https://app.mosaicstack.dev/`
   - Browser cookie jar includes:
     - `__Secure-better-auth.session_token` on `api.mosaicstack.dev` (HttpOnly, Secure, SameSite=Lax)
2. Session bootstrap still fails immediately:
   - `GET https://api.mosaicstack.dev/auth/session` -> `500`
   - Response body shape:
     - `{"success":false,"message":"An unexpected error occurred","errorId":"...","path":"/auth/session","statusCode":500}`
   - Web app returns to login because session fetch fails.
3. Frontend version mismatch observed:
   - Live `POST /auth/sign-in/oauth2` response from login flow still shows callback URL pointing to `/dashboard`.
   - Current repository login page uses callback URL `/`.
   - This indicates deployed web image is older than current `develop` code (or stale image tag in runtime).
## Additional Code Fix Applied Locally (pending push/deploy)
Refined cookie candidate construction in API session verification:
- File: `apps/api/src/auth/auth.service.ts`
- Removed URL-encoding of session token when constructing cookie headers.
- Cookie candidates now pass raw token value exactly as extracted from incoming cookie.
Why:
- BetterAuth cookie tokens can contain characters like `/`, `+`, and `=`.
- Re-encoding these values can mutate token bytes and cause lookup/parse failures.
Regression test added:
- File: `apps/api/src/auth/auth.service.spec.ts`
- `should preserve raw cookie token value without URL re-encoding`
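
A short illustration of why re-encoding is harmful. The token below is a made-up value chosen to contain `/`, `+`, and `=`; it is not a real session token.

```typescript
// Made-up token containing the characters called out above.
const rawToken = "abc/def+ghi==";

// encodeURIComponent percent-encodes "/", "+", and "=".
const encodedToken = encodeURIComponent(rawToken);
console.log(encodedToken); // abc%2Fdef%2Bghi%3D%3D

// The two cookie headers carry different values, so a lookup keyed on the
// raw token would not match the encoded form.
const rawCookieHeader = `__Secure-better-auth.session_token=${rawToken}`;
const encodedCookieHeader = `__Secure-better-auth.session_token=${encodedToken}`;
console.log(rawCookieHeader === encodedCookieHeader); // false
```
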
## Deploy + Live Repro (after auth cookie fix deploy)
Deployment actions executed:
1. Pushed auth cookie fix commit to `develop`.
2. Waited for Woodpecker pipeline success (`mosaic/stack`, build `#514`).
3. On `10.1.1.90`:
   - Ran `/home/localadmin/mosaic/pull_all.sh`.
   - Updated swarm services to `:dev` images:
     - `stack_api`
     - `stack_web`
     - `stack_coordinator`
     - `stack_orchestrator`
   - Verified service convergence.
Post-deploy behavior:
- Initial `/auth/session` without cookies now returns `401` (expected).
- OAuth callback succeeds and sets BetterAuth session cookie.
- `/auth/session` still fails after callback, now due to a new backend `500`.
## New Root Cause Discovered (RLS interceptor SQL)
Live `stack_api` logs showed:
- Auth guard successfully finds session cookie:
  - `Session cookie found: __Secure-better-auth.session_token`
- Then failure inside RLS setup:
  - PostgreSQL `42601` syntax error at or near `$1`
  - Source: `RlsContextInterceptor` raw SQL while setting context vars
- Request ends as `500 Request processing failed` on `/auth/session`
Cause:
- `SET LOCAL app.current_user_id = ${userId}` became `SET LOCAL ... = $1` under parameterization.
- PostgreSQL does not accept bind placeholders in `SET` assignment syntax.
## RLS Fix Applied Locally (pending commit/deploy)
Files updated:
- `apps/api/src/common/interceptors/rls-context.interceptor.ts`
  - Replaced `SET LOCAL` statements with parameter-safe, transaction-local calls:
    - `SELECT set_config('app.current_user_id', ${userId}, true)`
    - `SELECT set_config('app.current_workspace_id', ${workspaceId}, true)`
  - Keeps transaction scoping (`true` => local to transaction).
- `apps/api/src/common/interceptors/rls-context.interceptor.spec.ts`
  - Updated expected SQL template fragments to `set_config(...)`.
- `apps/api/src/common/interceptors/rls-context.integration.spec.ts`
  - Updated integration expectations to `set_config(...)`.
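
To make the difference between the two statement forms concrete, here is a minimal sketch of how a parameterizing SQL template tag rewrites interpolated values. The `sql` function is a hypothetical stand-in written for this note; the interceptor's actual database client is not shown here and may behave differently in detail.

```typescript
// Hypothetical stand-in for a parameterizing SQL tag: every interpolated
// value becomes a numbered placeholder and travels separately from the text.
function sql(
  strings: TemplateStringsArray,
  ...values: unknown[]
): { text: string; values: unknown[] } {
  const text = strings.reduce(
    (acc, chunk, i) => acc + (i > 0 ? `$${i}` : "") + chunk,
    "",
  );
  return { text, values };
}

const userId = "user-123"; // illustrative value

// Old form: the placeholder lands in a SET assignment, which PostgreSQL
// rejects with a syntax error.
const oldStatement = sql`SET LOCAL app.current_user_id = ${userId}`;
console.log(oldStatement.text); // SET LOCAL app.current_user_id = $1

// New form: the placeholder is an ordinary function argument, which is allowed.
const newStatement = sql`SELECT set_config('app.current_user_id', ${userId}, true)`;
console.log(newStatement.text); // SELECT set_config('app.current_user_id', $1, true)
```

In both cases the user ID stays out of the SQL text, so the fix keeps the injection safety of parameterization while moving the placeholder to a position the server can bind.
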
## Deploy + Verify (RLS fix commit `8424a28`)
Pipeline and deploy sequence:
1. Commit `8424a28` pushed to `develop`.
2. Woodpecker pipeline `mosaic/stack#515` completed successfully.
3. Host deploy actions on `10.1.1.90`:
   - Ran `/home/localadmin/mosaic/pull_all.sh`
   - Updated swarm services (`stack_api`, `stack_web`, `stack_coordinator`, `stack_orchestrator`) to `:dev`
Observed issue after first restart:
- Playwright still reproduced `/auth/session` `500` after Authentik callback.
- `stack_api` logs still showed old RLS SQL failure (`SET LOCAL ... $1`), indicating runtime image drift/stale task.
Resolution:
1. Checked host image digest for API:
   - `git.mosaicstack.dev/mosaic/stack-api:dev` -> `sha256:fd0cbfe053ed27945577553d67da5cbda0bf71610006e5ccc197d5761e29a220`
2. Forced swarm API service to exact digest:
   - `docker service update --with-registry-auth --image git.mosaicstack.dev/mosaic/stack-api@sha256:fd0cbfe053ed27945577553d67da5cbda0bf71610006e5ccc197d5761e29a220 stack_api`
3. Verified new running task uses digest-pinned image.
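
The tag-to-digest rewrite used in step 2 can be expressed as a small helper. `pinToDigest` is a hypothetical function written for this note, not part of the deploy tooling. It only builds the image reference string and does not talk to Docker.

```typescript
// Hypothetical helper: replaces a mutable tag (for example ":dev") with an
// immutable digest so a service update cannot resolve to a stale image.
function pinToDigest(imageRef: string, digest: string): string {
  if (!/^sha256:[0-9a-f]{64}$/.test(digest)) {
    throw new Error(`unexpected digest format: ${digest}`);
  }
  // Only a colon after the last "/" is a tag separator; an earlier colon
  // would belong to a registry port.
  const lastSlash = imageRef.lastIndexOf("/");
  const tagColon = imageRef.indexOf(":", lastSlash + 1);
  const repository = tagColon === -1 ? imageRef : imageRef.slice(0, tagColon);
  return `${repository}@${digest}`;
}

const pinnedApiImage = pinToDigest(
  "git.mosaicstack.dev/mosaic/stack-api:dev",
  "sha256:fd0cbfe053ed27945577553d67da5cbda0bf71610006e5ccc197d5761e29a220",
);
console.log(pinnedApiImage);
```

The result is the same reference passed to `--image` in step 2.
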
Final verification (Playwright MCP):
- Login flow: `https://app.mosaicstack.dev/login` -> Authentik (`jarvis` / `jarvis`) -> redirect back to app.
- Session endpoint: `GET https://api.mosaicstack.dev/auth/session` -> `200`.
- App landed authenticated on `https://app.mosaicstack.dev/tasks` (not bounced to login).
Status:
- Auth chain is functioning end-to-end after digest-forced API rollout.
- Remaining console noise observed: missing `favicon.ico` (`404`) on app domain (non-blocking for auth).