docs(#406): add speech services documentation

Comprehensive documentation for the speech services module:
- docs/SPEECH.md: Architecture, API reference, WebSocket protocol,
  environment variables, provider configuration, Docker setup,
  GPU VRAM budget, and frontend integration examples
- apps/api/src/speech/AGENTS.md: Module structure, provider pattern,
  how to add new providers, gotchas, and test patterns
- README.md: Speech capabilities section with quick start

Fixes #406

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 03:23:22 -06:00
parent bc86947d01
commit 24065aa199
3 changed files with 1213 additions and 13 deletions

# speech — Agent Context
> Part of the `apps/api/src` layer. Speech-to-text (STT) and text-to-speech (TTS) services.
## Module Structure
```
speech/
├── speech.module.ts                  # NestJS module (conditional provider registration)
├── speech.config.ts                  # Environment validation + typed config (registerAs)
├── speech.config.spec.ts             # 51 config validation tests
├── speech.constants.ts               # NestJS injection tokens (STT_PROVIDER, TTS_PROVIDERS)
├── speech.controller.ts              # REST endpoints (transcribe, synthesize, voices, health)
├── speech.controller.spec.ts         # Controller tests
├── speech.service.ts                 # High-level service with fallback orchestration
├── speech.service.spec.ts            # Service tests
├── speech.gateway.ts                 # WebSocket gateway (/speech namespace)
├── speech.gateway.spec.ts            # Gateway tests
├── dto/
│   ├── transcribe.dto.ts             # Transcription request DTO (class-validator)
│   ├── synthesize.dto.ts             # Synthesis request DTO (class-validator)
│   └── index.ts                      # Barrel export
├── interfaces/
│   ├── speech-types.ts               # Shared types (SpeechTier, AudioFormat, options, results)
│   ├── stt-provider.interface.ts     # ISTTProvider contract
│   ├── tts-provider.interface.ts     # ITTSProvider contract
│   └── index.ts                      # Barrel export
├── pipes/
│   ├── audio-validation.pipe.ts      # Validates uploaded audio (MIME type, size)
│   ├── audio-validation.pipe.spec.ts
│   ├── text-validation.pipe.ts       # Validates TTS text input (non-empty, max length)
│   ├── text-validation.pipe.spec.ts
│   └── index.ts                      # Barrel export
└── providers/
    ├── base-tts.provider.ts          # Abstract base class (OpenAI SDK + common logic)
    ├── base-tts.provider.spec.ts
    ├── kokoro-tts.provider.ts        # Default tier (CPU, 54 voices, 8 languages)
    ├── kokoro-tts.provider.spec.ts
    ├── chatterbox-tts.provider.ts    # Premium tier (GPU, voice cloning, emotion control)
    ├── chatterbox-tts.provider.spec.ts
    ├── piper-tts.provider.ts         # Fallback tier (CPU, lightweight, Raspberry Pi)
    ├── piper-tts.provider.spec.ts
    ├── speaches-stt.provider.ts      # STT provider (Whisper via Speaches)
    ├── speaches-stt.provider.spec.ts
    ├── tts-provider.factory.ts       # Factory: creates providers from config
    └── tts-provider.factory.spec.ts
```
## Codebase Patterns
### Provider Pattern (BaseTTSProvider + Factory)
All TTS providers extend `BaseTTSProvider`:
```typescript
export class MyNewProvider extends BaseTTSProvider {
  readonly name = "my-provider";
  readonly tier: SpeechTier = "default"; // or "premium" or "fallback"

  constructor(baseURL: string) {
    super(baseURL, "default-voice-id", "mp3");
  }

  // Override listVoices() for a custom voice catalog
  override listVoices(): Promise<VoiceInfo[]> { ... }

  // Override synthesize() only if non-standard API behavior is needed
  // (see ChatterboxTTSProvider for an example with extra body params)
}
```
The base class handles:
- OpenAI SDK client creation with custom `baseURL` and `apiKey: "not-needed"`
- Standard `synthesize()` via `client.audio.speech.create()`
- Default `listVoices()` returning just the default voice
- `isHealthy()` via GET to the `/v1/models` endpoint
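The health check can be sketched as a small standalone function. This is illustrative only (the real logic lives inside `BaseTTSProvider`); `fetchFn` is injected here purely so the sketch can be exercised without a live service, and it assumes the base URL already ends in `/v1` as noted below:

```typescript
// Sketch of the documented health check: GET {baseURL}/models, healthy iff
// the response is OK; any network error counts as unhealthy.
async function isHealthySketch(
  baseURL: string,
  fetchFn: (url: string) => Promise<{ ok: boolean }>,
): Promise<boolean> {
  try {
    const res = await fetchFn(`${baseURL}/models`);
    return res.ok;
  } catch {
    return false;
  }
}
```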
### Config Pattern
Config follows the existing pattern (`auth.config.ts`, `federation.config.ts`):
- Export `isSttEnabled()`, `isTtsEnabled()`, etc. (boolean checks from env)
- Export `validateSpeechConfig()` (called at module init, throws on missing required vars)
- Export `getSpeechConfig()` (typed config object with defaults)
- Export `speechConfig = registerAs("speech", ...)` for NestJS ConfigModule
Boolean env parsing: `value === "true" || value === "1"`. No default-true.
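A minimal sketch of that parsing rule (the helper name is illustrative; the real check lives inline in `speech.config.ts`):

```typescript
// Only the exact strings "true" and "1" enable a flag; anything else,
// including an unset variable, is false. There is no default-true.
function parseBoolEnv(value: string | undefined): boolean {
  return value === "true" || value === "1";
}
```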
### Conditional Provider Registration
In `speech.module.ts`:
- STT provider uses `isSttEnabled()` at module definition time to decide whether to register
- TTS providers use a factory function injected with `ConfigService`
- `@Optional()` decorator on `SpeechService`'s `sttProvider` handles the case where STT is disabled
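The `@Optional()` behavior can be sketched in plain TypeScript, without the NestJS decorators. The names and shapes below are illustrative, not the module's actual API:

```typescript
// When STT is disabled, no provider is registered and the constructor
// argument is simply undefined; the service fails fast at call time.
interface SttLike {
  transcribe(audio: Buffer): Promise<string>;
}

class SpeechServiceSketch {
  // In the real service this parameter is `@Optional() @Inject(STT_PROVIDER)`.
  constructor(private readonly sttProvider?: SttLike) {}

  async transcribe(audio: Buffer): Promise<string> {
    if (!this.sttProvider) {
      throw new Error("STT is not enabled");
    }
    return this.sttProvider.transcribe(audio);
  }
}
```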
### Injection Tokens
```typescript
// speech.constants.ts
export const STT_PROVIDER = Symbol("STT_PROVIDER"); // ISTTProvider
export const TTS_PROVIDERS = Symbol("TTS_PROVIDERS"); // Map<SpeechTier, ITTSProvider>
```
### Fallback Chain
TTS fallback order: `premium` -> `default` -> `fallback`
- Chain starts at the requested tier and goes downward
- Only tiers that are both enabled AND have a registered provider are attempted
- `ServiceUnavailableException` if all providers fail
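The chain selection can be sketched as a pure function (names are illustrative, not the service's actual API):

```typescript
type SpeechTier = "premium" | "default" | "fallback";

const FALLBACK_ORDER: SpeechTier[] = ["premium", "default", "fallback"];

// Start at the requested tier, move downward, and keep only tiers that are
// enabled and have a registered provider (represented by the `available` set).
function fallbackChain(requested: SpeechTier, available: Set<SpeechTier>): SpeechTier[] {
  return FALLBACK_ORDER
    .slice(FALLBACK_ORDER.indexOf(requested))
    .filter((tier) => available.has(tier));
}
```

An empty result corresponds to the `ServiceUnavailableException` case above.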
### WebSocket Gateway
- Separate `/speech` namespace (not on the main gateway)
- Authentication mirrors the main WS gateway pattern (token extraction from handshake)
- One session per client, accumulates audio chunks in memory
- Chunks concatenated and transcribed on `stop-transcription`
- Session cleanup on disconnect
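The per-client session bookkeeping can be sketched as follows (a plain class with illustrative names; the real gateway holds this state internally):

```typescript
// One chunk list per connected client; chunks accumulate in memory until
// "stop-transcription" concatenates them, or disconnect drops them.
class TranscriptionSessions {
  private chunks = new Map<string, Buffer[]>();

  addChunk(clientId: string, chunk: Buffer): void {
    const list = this.chunks.get(clientId) ?? [];
    list.push(chunk);
    this.chunks.set(clientId, list);
  }

  // On "stop-transcription": concatenate everything received so far.
  finish(clientId: string): Buffer {
    const audio = Buffer.concat(this.chunks.get(clientId) ?? []);
    this.chunks.delete(clientId);
    return audio;
  }

  // On disconnect: drop any buffered audio without transcribing.
  cleanup(clientId: string): void {
    this.chunks.delete(clientId);
  }
}
```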
## How to Add a New TTS Provider
1. **Create the provider class** in `providers/`:
```typescript
// providers/my-tts.provider.ts
import { BaseTTSProvider } from "./base-tts.provider";
import type { SpeechTier, VoiceInfo } from "../interfaces/speech-types";

export class MyTtsProvider extends BaseTTSProvider {
  readonly name = "my-provider";
  readonly tier: SpeechTier = "default"; // Choose tier

  constructor(baseURL: string) {
    super(baseURL, "default-voice", "mp3");
  }

  override listVoices(): Promise<VoiceInfo[]> {
    // Return your voice catalog
  }
}
```
2. **Add env vars** to `speech.config.ts`:
- Add enabled check function
- Add URL to validation in `validateSpeechConfig()`
- Add config section in `getSpeechConfig()`
3. **Register in factory** (`tts-provider.factory.ts`):
```typescript
if (config.tts.myTier.enabled) {
  const provider = new MyTtsProvider(config.tts.myTier.url);
  providers.set("myTier", provider);
}
```
4. **Add env vars** to `.env.example`
5. **Write tests** following existing patterns (mock OpenAI SDK, test synthesis + listVoices + isHealthy)
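How steps 2 and 3 fit together in the factory can be sketched as one enabled-check per tier, one instance per enabled tier. The types below are simplified stand-ins for the real config and `ITTSProvider` shapes:

```typescript
type SpeechTier = "default" | "premium" | "fallback";

interface TierConfigSketch {
  enabled: boolean;
  url: string;
}

interface ProviderSketch {
  name: string;
  tier: SpeechTier;
}

// Build the tier -> provider map consumed via the TTS_PROVIDERS token.
function buildTtsProviders(
  tts: Record<SpeechTier, TierConfigSketch>,
  make: (tier: SpeechTier, url: string) => ProviderSketch,
): Map<SpeechTier, ProviderSketch> {
  const providers = new Map<SpeechTier, ProviderSketch>();
  for (const tier of ["default", "premium", "fallback"] as SpeechTier[]) {
    if (tts[tier].enabled) {
      providers.set(tier, make(tier, tts[tier].url));
    }
  }
  return providers;
}
```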
## How to Add a New STT Provider
1. **Implement `ISTTProvider`** (does not use a base class -- STT has only one implementation currently)
2. **Add config section** similar to `stt` in `speech.config.ts`
3. **Register** in `speech.module.ts` providers array with `STT_PROVIDER` token
4. **Write tests** following `speaches-stt.provider.spec.ts` pattern
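A skeleton for step 1 might look like the following. This is illustrative only: the real contract is `ISTTProvider` in `interfaces/stt-provider.interface.ts`, and the method and field names here are assumptions:

```typescript
// Assumed result shape, mirroring what the Speaches provider returns.
interface TranscriptionResultSketch {
  text: string;
  language?: string;
  duration?: number;
}

class MySttProvider {
  readonly name = "my-stt";

  async transcribe(audio: Buffer): Promise<TranscriptionResultSketch> {
    // Real implementation: POST `audio` to your STT backend and map the response.
    return { text: `(${audio.byteLength} bytes transcribed)` };
  }

  async isHealthy(): Promise<boolean> {
    // Real implementation: ping the backend's health endpoint.
    return true;
  }
}
```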
## Common Gotchas
- **OpenAI SDK `apiKey`**: Self-hosted services do not require an API key. Use `apiKey: "not-needed"` when creating the OpenAI client.
- **`toFile()` import**: The `toFile` helper is imported from `"openai"` (not from a subpath). Used in the STT provider to convert Buffer to a File-like object for multipart upload.
- **Health check URL**: `BaseTTSProvider.isHealthy()` calls `GET /v1/models`. The base URL is expected to end with `/v1`.
- **Voice ID prefix parsing**: Kokoro voice IDs encode language + gender in first two characters. See `parseVoicePrefix()` in `kokoro-tts.provider.ts`.
- **Chatterbox extra body params**: The `reference_audio` (base64) and `exaggeration` fields are passed via the OpenAI SDK by casting the request body. This works because the SDK passes through unknown fields.
- **WebSocket auth**: The gateway checks `auth.token`, then `query.token`, then `Authorization` header (in that order). Match this in test setup.
- **Config validation timing**: `validateSpeechConfig()` runs at module init (`onModuleInit`), not at provider construction. This means a misconfigured provider will fail at startup, not at first request.
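The WebSocket auth order above can be sketched as a small extraction function. The handshake shape and the `Bearer `-prefix stripping are assumptions for illustration; check `speech.gateway.ts` for the actual logic:

```typescript
interface HandshakeLike {
  auth?: { token?: string };
  query?: { token?: string };
  headers?: { authorization?: string };
}

// Documented precedence: auth.token, then query.token, then Authorization header.
function extractToken(handshake: HandshakeLike): string | undefined {
  if (handshake.auth?.token) return handshake.auth.token;
  if (handshake.query?.token) return handshake.query.token;
  const header = handshake.headers?.authorization;
  if (header?.startsWith("Bearer ")) return header.slice("Bearer ".length);
  return header || undefined;
}
```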
## Test Patterns
### Mocking OpenAI SDK
All provider tests mock the OpenAI SDK. Pattern:
```typescript
vi.mock("openai", () => ({
  default: vi.fn().mockImplementation(() => ({
    audio: {
      speech: {
        create: vi.fn().mockResolvedValue({
          arrayBuffer: () => Promise.resolve(new ArrayBuffer(10)),
        }),
      },
      transcriptions: {
        create: vi.fn().mockResolvedValue({
          text: "transcribed text",
          language: "en",
          duration: 3.5,
        }),
      },
    },
    models: { list: vi.fn().mockResolvedValue({ data: [] }) },
  })),
}));
```
### Mocking Config Injection
```typescript
const mockConfig: SpeechConfig = {
  stt: { enabled: true, baseUrl: "http://test:8000/v1", model: "test-model", language: "en" },
  tts: {
    default: { enabled: true, url: "http://test:8880/v1", voice: "af_heart", format: "mp3" },
    premium: { enabled: false, url: "" },
    fallback: { enabled: false, url: "" },
  },
  limits: { maxUploadSize: 25000000, maxDurationSeconds: 600, maxTextLength: 4096 },
};
```
### Config Test Pattern
`speech.config.spec.ts` saves and restores `process.env` around each test:
```typescript
let savedEnv: NodeJS.ProcessEnv;

beforeEach(() => {
  savedEnv = { ...process.env };
});

afterEach(() => {
  process.env = savedEnv;
});
```
## Key Files
| File | Purpose |
| ----------------------------------- | ------------------------------------------------------------------------ |
| `speech.module.ts` | Module registration with conditional providers |
| `speech.config.ts` | All speech env vars + validation (51 tests) |
| `speech.service.ts` | Core service: transcribe, synthesize (with fallback), listVoices |
| `speech.controller.ts` | REST endpoints: POST transcribe, POST synthesize, GET voices, GET health |
| `speech.gateway.ts` | WebSocket streaming transcription (/speech namespace) |
| `providers/base-tts.provider.ts` | Abstract base for all TTS providers (OpenAI SDK wrapper) |
| `providers/tts-provider.factory.ts` | Creates provider instances from config |
| `interfaces/speech-types.ts` | All shared types: SpeechTier, AudioFormat, options, results |