docs(#406): add speech services documentation

Comprehensive documentation for the speech services module:
- docs/SPEECH.md: Architecture, API reference, WebSocket protocol,
  environment variables, provider configuration, Docker setup,
  GPU VRAM budget, and frontend integration examples
- apps/api/src/speech/AGENTS.md: Module structure, provider pattern,
  how to add new providers, gotchas, and test patterns
- README.md: Speech capabilities section with quick start

Fixes #406

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 03:23:22 -06:00
parent bc86947d01
commit 24065aa199
3 changed files with 1213 additions and 13 deletions

# speech — Agent Context
> Part of the `apps/api/src` layer. Speech-to-text (STT) and text-to-speech (TTS) services.
## Module Structure
```
speech/
├── speech.module.ts                  # NestJS module (conditional provider registration)
├── speech.config.ts                  # Environment validation + typed config (registerAs)
├── speech.config.spec.ts             # 51 config validation tests
├── speech.constants.ts               # NestJS injection tokens (STT_PROVIDER, TTS_PROVIDERS)
├── speech.controller.ts              # REST endpoints (transcribe, synthesize, voices, health)
├── speech.controller.spec.ts         # Controller tests
├── speech.service.ts                 # High-level service with fallback orchestration
├── speech.service.spec.ts            # Service tests
├── speech.gateway.ts                 # WebSocket gateway (/speech namespace)
├── speech.gateway.spec.ts            # Gateway tests
├── dto/
│   ├── transcribe.dto.ts             # Transcription request DTO (class-validator)
│   ├── synthesize.dto.ts             # Synthesis request DTO (class-validator)
│   └── index.ts                      # Barrel export
├── interfaces/
│   ├── speech-types.ts               # Shared types (SpeechTier, AudioFormat, options, results)
│   ├── stt-provider.interface.ts     # ISTTProvider contract
│   ├── tts-provider.interface.ts     # ITTSProvider contract
│   └── index.ts                      # Barrel export
├── pipes/
│   ├── audio-validation.pipe.ts      # Validates uploaded audio (MIME type, size)
│   ├── audio-validation.pipe.spec.ts
│   ├── text-validation.pipe.ts       # Validates TTS text input (non-empty, max length)
│   ├── text-validation.pipe.spec.ts
│   └── index.ts                      # Barrel export
└── providers/
    ├── base-tts.provider.ts          # Abstract base class (OpenAI SDK + common logic)
    ├── base-tts.provider.spec.ts
    ├── kokoro-tts.provider.ts        # Default tier (CPU, 54 voices, 8 languages)
    ├── kokoro-tts.provider.spec.ts
    ├── chatterbox-tts.provider.ts    # Premium tier (GPU, voice cloning, emotion control)
    ├── chatterbox-tts.provider.spec.ts
    ├── piper-tts.provider.ts         # Fallback tier (CPU, lightweight, Raspberry Pi)
    ├── piper-tts.provider.spec.ts
    ├── speaches-stt.provider.ts      # STT provider (Whisper via Speaches)
    ├── speaches-stt.provider.spec.ts
    ├── tts-provider.factory.ts       # Factory: creates providers from config
    └── tts-provider.factory.spec.ts
```
## Codebase Patterns
### Provider Pattern (BaseTTSProvider + Factory)
All TTS providers extend `BaseTTSProvider`:
```typescript
export class MyNewProvider extends BaseTTSProvider {
  readonly name = "my-provider";
  readonly tier: SpeechTier = "default"; // or "premium" or "fallback"

  constructor(baseURL: string) {
    super(baseURL, "default-voice-id", "mp3");
  }

  // Override listVoices() for a custom voice catalog
  override listVoices(): Promise<VoiceInfo[]> { ... }

  // Override synthesize() only if non-standard API behavior is needed
  // (see ChatterboxTTSProvider for an example with extra body params)
}
```
The base class handles:
- OpenAI SDK client creation with custom `baseURL` and `apiKey: "not-needed"`
- Standard `synthesize()` via `client.audio.speech.create()`
- Default `listVoices()` returning just the default voice
- `isHealthy()` via GET to the `/v1/models` endpoint
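The health check can be sketched as a small standalone function. This is illustrative only (the real logic lives inside `BaseTTSProvider`); `fetchFn` is injected here purely so the sketch can be exercised without a live service, and it assumes the base URL already ends in `/v1` as noted below:

```typescript
// Sketch of the documented health check: GET {baseURL}/models, healthy iff
// the response is OK; any network error counts as unhealthy.
async function isHealthySketch(
  baseURL: string,
  fetchFn: (url: string) => Promise<{ ok: boolean }>,
): Promise<boolean> {
  try {
    const res = await fetchFn(`${baseURL}/models`);
    return res.ok;
  } catch {
    return false;
  }
}
```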
### Config Pattern
Config follows the existing pattern (`auth.config.ts`, `federation.config.ts`):
- Export `isSttEnabled()`, `isTtsEnabled()`, etc. (boolean checks from env)
- Export `validateSpeechConfig()` (called at module init, throws on missing required vars)
- Export `getSpeechConfig()` (typed config object with defaults)
- Export `speechConfig = registerAs("speech", ...)` for NestJS ConfigModule
Boolean env parsing: `value === "true" || value === "1"`. No default-true.
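A minimal sketch of that parsing rule (the helper name is illustrative; the real check lives inline in `speech.config.ts`):

```typescript
// Only the exact strings "true" and "1" enable a flag; anything else,
// including an unset variable, is false. There is no default-true.
function parseBoolEnv(value: string | undefined): boolean {
  return value === "true" || value === "1";
}
```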
### Conditional Provider Registration
In `speech.module.ts`:
- STT provider uses `isSttEnabled()` at module definition time to decide whether to register
- TTS providers use a factory function injected with `ConfigService`
- `@Optional()` decorator on `SpeechService`'s `sttProvider` handles the case where STT is disabled
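The `@Optional()` behavior can be sketched in plain TypeScript, without the NestJS decorators. The names and shapes below are illustrative, not the module's actual API:

```typescript
// When STT is disabled, no provider is registered and the constructor
// argument is simply undefined; the service fails fast at call time.
interface SttLike {
  transcribe(audio: Buffer): Promise<string>;
}

class SpeechServiceSketch {
  // In the real service this parameter is `@Optional() @Inject(STT_PROVIDER)`.
  constructor(private readonly sttProvider?: SttLike) {}

  async transcribe(audio: Buffer): Promise<string> {
    if (!this.sttProvider) {
      throw new Error("STT is not enabled");
    }
    return this.sttProvider.transcribe(audio);
  }
}
```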
### Injection Tokens
```typescript
// speech.constants.ts
export const STT_PROVIDER = Symbol("STT_PROVIDER"); // ISTTProvider
export const TTS_PROVIDERS = Symbol("TTS_PROVIDERS"); // Map<SpeechTier, ITTSProvider>
```
### Fallback Chain
TTS fallback order: `premium` -> `default` -> `fallback`
- Chain starts at the requested tier and goes downward
- Only tiers that are both enabled AND have a registered provider are attempted
- `ServiceUnavailableException` if all providers fail
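The chain selection can be sketched as a pure function (names are illustrative, not the service's actual API):

```typescript
type SpeechTier = "premium" | "default" | "fallback";

const FALLBACK_ORDER: SpeechTier[] = ["premium", "default", "fallback"];

// Start at the requested tier, move downward, and keep only tiers that are
// enabled and have a registered provider (represented by the `available` set).
function fallbackChain(requested: SpeechTier, available: Set<SpeechTier>): SpeechTier[] {
  return FALLBACK_ORDER
    .slice(FALLBACK_ORDER.indexOf(requested))
    .filter((tier) => available.has(tier));
}
```

An empty result corresponds to the `ServiceUnavailableException` case above.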
### WebSocket Gateway
- Separate `/speech` namespace (not on the main gateway)
- Authentication mirrors the main WS gateway pattern (token extraction from handshake)
- One session per client, accumulates audio chunks in memory
- Chunks concatenated and transcribed on `stop-transcription`
- Session cleanup on disconnect
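The per-client session bookkeeping can be sketched as follows (a plain class with illustrative names; the real gateway holds this state internally):

```typescript
// One chunk list per connected client; chunks accumulate in memory until
// "stop-transcription" concatenates them, or disconnect drops them.
class TranscriptionSessions {
  private chunks = new Map<string, Buffer[]>();

  addChunk(clientId: string, chunk: Buffer): void {
    const list = this.chunks.get(clientId) ?? [];
    list.push(chunk);
    this.chunks.set(clientId, list);
  }

  // On "stop-transcription": concatenate everything received so far.
  finish(clientId: string): Buffer {
    const audio = Buffer.concat(this.chunks.get(clientId) ?? []);
    this.chunks.delete(clientId);
    return audio;
  }

  // On disconnect: drop any buffered audio without transcribing.
  cleanup(clientId: string): void {
    this.chunks.delete(clientId);
  }
}
```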
## How to Add a New TTS Provider
1. **Create the provider class** in `providers/`:
```typescript
// providers/my-tts.provider.ts
import { BaseTTSProvider } from "./base-tts.provider";
import type { SpeechTier, VoiceInfo } from "../interfaces/speech-types";

export class MyTtsProvider extends BaseTTSProvider {
  readonly name = "my-provider";
  readonly tier: SpeechTier = "default"; // Choose tier

  constructor(baseURL: string) {
    super(baseURL, "default-voice", "mp3");
  }

  override listVoices(): Promise<VoiceInfo[]> {
    // Return your voice catalog
  }
}
```
2. **Add env vars** to `speech.config.ts`:
- Add enabled check function
- Add URL to validation in `validateSpeechConfig()`
- Add config section in `getSpeechConfig()`
3. **Register in factory** (`tts-provider.factory.ts`):
```typescript
if (config.tts.myTier.enabled) {
  const provider = new MyTtsProvider(config.tts.myTier.url);
  providers.set("myTier", provider);
}
```
4. **Add env vars** to `.env.example`
5. **Write tests** following existing patterns (mock OpenAI SDK, test synthesis + listVoices + isHealthy)
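How steps 2 and 3 fit together in the factory can be sketched as one enabled-check per tier, one instance per enabled tier. The types below are simplified stand-ins for the real config and `ITTSProvider` shapes:

```typescript
type SpeechTier = "default" | "premium" | "fallback";

interface TierConfigSketch {
  enabled: boolean;
  url: string;
}

interface ProviderSketch {
  name: string;
  tier: SpeechTier;
}

// Build the tier -> provider map consumed via the TTS_PROVIDERS token.
function buildTtsProviders(
  tts: Record<SpeechTier, TierConfigSketch>,
  make: (tier: SpeechTier, url: string) => ProviderSketch,
): Map<SpeechTier, ProviderSketch> {
  const providers = new Map<SpeechTier, ProviderSketch>();
  for (const tier of ["default", "premium", "fallback"] as SpeechTier[]) {
    if (tts[tier].enabled) {
      providers.set(tier, make(tier, tts[tier].url));
    }
  }
  return providers;
}
```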
## How to Add a New STT Provider
1. **Implement `ISTTProvider`** (does not use a base class -- STT has only one implementation currently)
2. **Add config section** similar to `stt` in `speech.config.ts`
3. **Register** in `speech.module.ts` providers array with `STT_PROVIDER` token
4. **Write tests** following `speaches-stt.provider.spec.ts` pattern
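A skeleton for step 1 might look like the following. This is illustrative only: the real contract is `ISTTProvider` in `interfaces/stt-provider.interface.ts`, and the method and field names here are assumptions:

```typescript
// Assumed result shape, mirroring what the Speaches provider returns.
interface TranscriptionResultSketch {
  text: string;
  language?: string;
  duration?: number;
}

class MySttProvider {
  readonly name = "my-stt";

  async transcribe(audio: Buffer): Promise<TranscriptionResultSketch> {
    // Real implementation: POST `audio` to your STT backend and map the response.
    return { text: `(${audio.byteLength} bytes transcribed)` };
  }

  async isHealthy(): Promise<boolean> {
    // Real implementation: ping the backend's health endpoint.
    return true;
  }
}
```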
## Common Gotchas
- **OpenAI SDK `apiKey`**: Self-hosted services do not require an API key. Use `apiKey: "not-needed"` when creating the OpenAI client.
- **`toFile()` import**: The `toFile` helper is imported from `"openai"` (not from a subpath). Used in the STT provider to convert Buffer to a File-like object for multipart upload.
- **Health check URL**: `BaseTTSProvider.isHealthy()` calls `GET /v1/models`. The base URL is expected to end with `/v1`.
- **Voice ID prefix parsing**: Kokoro voice IDs encode language + gender in first two characters. See `parseVoicePrefix()` in `kokoro-tts.provider.ts`.
- **Chatterbox extra body params**: The `reference_audio` (base64) and `exaggeration` fields are passed via the OpenAI SDK by casting the request body. This works because the SDK passes through unknown fields.
- **WebSocket auth**: The gateway checks `auth.token`, then `query.token`, then `Authorization` header (in that order). Match this in test setup.
- **Config validation timing**: `validateSpeechConfig()` runs at module init (`onModuleInit`), not at provider construction. This means a misconfigured provider will fail at startup, not at first request.
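The WebSocket auth order above can be sketched as a small extraction function. The handshake shape and the `Bearer `-prefix stripping are assumptions for illustration; check `speech.gateway.ts` for the actual logic:

```typescript
interface HandshakeLike {
  auth?: { token?: string };
  query?: { token?: string };
  headers?: { authorization?: string };
}

// Documented precedence: auth.token, then query.token, then Authorization header.
function extractToken(handshake: HandshakeLike): string | undefined {
  if (handshake.auth?.token) return handshake.auth.token;
  if (handshake.query?.token) return handshake.query.token;
  const header = handshake.headers?.authorization;
  if (header?.startsWith("Bearer ")) return header.slice("Bearer ".length);
  return header || undefined;
}
```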
## Test Patterns
### Mocking OpenAI SDK
All provider tests mock the OpenAI SDK. Pattern:
```typescript
vi.mock("openai", () => ({
  default: vi.fn().mockImplementation(() => ({
    audio: {
      speech: {
        create: vi.fn().mockResolvedValue({
          arrayBuffer: () => Promise.resolve(new ArrayBuffer(10)),
        }),
      },
      transcriptions: {
        create: vi.fn().mockResolvedValue({
          text: "transcribed text",
          language: "en",
          duration: 3.5,
        }),
      },
    },
    models: { list: vi.fn().mockResolvedValue({ data: [] }) },
  })),
}));
```
### Mocking Config Injection
```typescript
const mockConfig: SpeechConfig = {
  stt: { enabled: true, baseUrl: "http://test:8000/v1", model: "test-model", language: "en" },
  tts: {
    default: { enabled: true, url: "http://test:8880/v1", voice: "af_heart", format: "mp3" },
    premium: { enabled: false, url: "" },
    fallback: { enabled: false, url: "" },
  },
  limits: { maxUploadSize: 25000000, maxDurationSeconds: 600, maxTextLength: 4096 },
};
```
### Config Test Pattern
`speech.config.spec.ts` saves and restores `process.env` around each test:
```typescript
let savedEnv: NodeJS.ProcessEnv;

beforeEach(() => {
  savedEnv = { ...process.env };
});

afterEach(() => {
  process.env = savedEnv;
});
```
## Key Files
| File | Purpose |
| ----------------------------------- | ------------------------------------------------------------------------ |
| `speech.module.ts` | Module registration with conditional providers |
| `speech.config.ts` | All speech env vars + validation (51 tests) |
| `speech.service.ts` | Core service: transcribe, synthesize (with fallback), listVoices |
| `speech.controller.ts` | REST endpoints: POST transcribe, POST synthesize, GET voices, GET health |
| `speech.gateway.ts` | WebSocket streaming transcription (/speech namespace) |
| `providers/base-tts.provider.ts` | Abstract base for all TTS providers (OpenAI SDK wrapper) |
| `providers/tts-provider.factory.ts` | Creates provider instances from config |
| `interfaces/speech-types.ts` | All shared types: SpeechTier, AudioFormat, options, results |