docs(#406): add speech services documentation
All checks were successful
ci/woodpecker/push/api Pipeline was successful
Comprehensive documentation for the speech services module:

- docs/SPEECH.md: Architecture, API reference, WebSocket protocol,
  environment variables, provider configuration, Docker setup, GPU VRAM
  budget, and frontend integration examples
- apps/api/src/speech/AGENTS.md: Module structure, provider pattern,
  how to add new providers, gotchas, and test patterns
- README.md: Speech capabilities section with quick start

Fixes #406

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
247
apps/api/src/speech/AGENTS.md
Normal file
@@ -0,0 +1,247 @@
# speech — Agent Context

> Part of the `apps/api/src` layer. Speech-to-text (STT) and text-to-speech (TTS) services.

## Module Structure

```
speech/
├── speech.module.ts                # NestJS module (conditional provider registration)
├── speech.config.ts                # Environment validation + typed config (registerAs)
├── speech.config.spec.ts           # 51 config validation tests
├── speech.constants.ts             # NestJS injection tokens (STT_PROVIDER, TTS_PROVIDERS)
├── speech.controller.ts            # REST endpoints (transcribe, synthesize, voices, health)
├── speech.controller.spec.ts       # Controller tests
├── speech.service.ts               # High-level service with fallback orchestration
├── speech.service.spec.ts          # Service tests
├── speech.gateway.ts               # WebSocket gateway (/speech namespace)
├── speech.gateway.spec.ts          # Gateway tests
├── dto/
│   ├── transcribe.dto.ts           # Transcription request DTO (class-validator)
│   ├── synthesize.dto.ts           # Synthesis request DTO (class-validator)
│   └── index.ts                    # Barrel export
├── interfaces/
│   ├── speech-types.ts             # Shared types (SpeechTier, AudioFormat, options, results)
│   ├── stt-provider.interface.ts   # ISTTProvider contract
│   ├── tts-provider.interface.ts   # ITTSProvider contract
│   └── index.ts                    # Barrel export
├── pipes/
│   ├── audio-validation.pipe.ts    # Validates uploaded audio (MIME type, size)
│   ├── audio-validation.pipe.spec.ts
│   ├── text-validation.pipe.ts     # Validates TTS text input (non-empty, max length)
│   ├── text-validation.pipe.spec.ts
│   └── index.ts                    # Barrel export
└── providers/
    ├── base-tts.provider.ts        # Abstract base class (OpenAI SDK + common logic)
    ├── base-tts.provider.spec.ts
    ├── kokoro-tts.provider.ts      # Default tier (CPU, 54 voices, 8 languages)
    ├── kokoro-tts.provider.spec.ts
    ├── chatterbox-tts.provider.ts  # Premium tier (GPU, voice cloning, emotion control)
    ├── chatterbox-tts.provider.spec.ts
    ├── piper-tts.provider.ts       # Fallback tier (CPU, lightweight, Raspberry Pi)
    ├── piper-tts.provider.spec.ts
    ├── speaches-stt.provider.ts    # STT provider (Whisper via Speaches)
    ├── speaches-stt.provider.spec.ts
    ├── tts-provider.factory.ts     # Factory: creates providers from config
    └── tts-provider.factory.spec.ts
```

## Codebase Patterns

### Provider Pattern (BaseTTSProvider + Factory)

All TTS providers extend `BaseTTSProvider`:

```typescript
export class MyNewProvider extends BaseTTSProvider {
  readonly name = "my-provider";
  readonly tier: SpeechTier = "default"; // or "premium" or "fallback"

  constructor(baseURL: string) {
    super(baseURL, "default-voice-id", "mp3");
  }

  // Override listVoices() for custom voice catalog
  override listVoices(): Promise<VoiceInfo[]> { ... }

  // Override synthesize() only if non-standard API behavior is needed
  // (see ChatterboxTTSProvider for example with extra body params)
}
```

The base class handles:

- OpenAI SDK client creation with custom `baseURL` and `apiKey: "not-needed"`
- Standard `synthesize()` via `client.audio.speech.create()`
- Default `listVoices()` returning just the default voice
- `isHealthy()` via GET to the `/v1/models` endpoint

### Config Pattern

Config follows the existing pattern (`auth.config.ts`, `federation.config.ts`):

- Export `isSttEnabled()`, `isTtsEnabled()`, etc. (boolean checks from env)
- Export `validateSpeechConfig()` (called at module init, throws on missing required vars)
- Export `getSpeechConfig()` (typed config object with defaults)
- Export `speechConfig = registerAs("speech", ...)` for NestJS ConfigModule

Boolean env parsing is strict: `value === "true" || value === "1"`. Nothing defaults to true, so an unset variable means disabled.
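
The strict boolean parsing and fail-fast validation described above could look roughly like the sketch below. The variable names (`STT_ENABLED`, `STT_BASE_URL`), the `parseBooleanEnv` helper, and the explicit `env` parameter are assumptions made for illustration; the real `speech.config.ts` may read `process.env` directly and use different names.

```typescript
// Hypothetical stand-in for process.env so the sketch stays self-contained.
type Env = Record<string, string | undefined>;

// Only the literal strings "true" and "1" enable a feature.
// Unset, empty, "TRUE", "yes", and anything else evaluate to false.
function parseBooleanEnv(value: string | undefined): boolean {
  return value === "true" || value === "1";
}

// Assumed variable name, for illustration only.
function isSttEnabled(env: Env): boolean {
  return parseBooleanEnv(env["STT_ENABLED"]);
}

// Throws when a feature is enabled but its required URL is missing,
// so misconfiguration surfaces at startup rather than at first request.
function validateSpeechConfig(env: Env): void {
  if (isSttEnabled(env) && !env["STT_BASE_URL"]) {
    throw new Error("STT is enabled but STT_BASE_URL is not set");
  }
}
```

Passing the environment in as a parameter is a choice made for this sketch; it keeps the functions easy to test without mutating `process.env`.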

### Conditional Provider Registration

In `speech.module.ts`:

- STT provider uses `isSttEnabled()` at module definition time to decide whether to register
- TTS providers use a factory function injected with `ConfigService`
- `@Optional()` decorator on `SpeechService`'s `sttProvider` handles the case where STT is disabled
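
To make the registration logic concrete, here is a dependency-free sketch of how such a providers array might be assembled. `ProviderDef`, `SpeachesSttProviderStub`, and `buildSpeechProviders` are hypothetical stand-ins, not the module's actual code; in the real module the entries would be NestJS provider definitions, and the TTS factory would list `ConfigService` under `inject`.

```typescript
const STT_PROVIDER = Symbol("STT_PROVIDER");
const TTS_PROVIDERS = Symbol("TTS_PROVIDERS");

// Local stand-in for the NestJS provider shapes
// ({ provide, useClass } and { provide, useFactory }).
interface ProviderDef {
  provide: symbol;
  useClass?: new () => object;
  useFactory?: () => unknown;
}

// Placeholder for the real STT provider class.
class SpeachesSttProviderStub {}

function buildSpeechProviders(sttEnabled: boolean): ProviderDef[] {
  const providers: ProviderDef[] = [
    // Assumption: the TTS token is always registered and the factory
    // decides which tiers end up in the map based on config.
    { provide: TTS_PROVIDERS, useFactory: () => new Map<string, unknown>() },
  ];
  if (sttEnabled) {
    // Registered only when STT is enabled; consumers must therefore
    // treat this dependency as optional.
    providers.push({ provide: STT_PROVIDER, useClass: SpeachesSttProviderStub });
  }
  return providers;
}
```

Because the STT token may be absent from the array, the consumer has to tolerate a missing dependency, which is the role `@Optional()` plays on `SpeechService`.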

### Injection Tokens

```typescript
// speech.constants.ts
export const STT_PROVIDER = Symbol("STT_PROVIDER"); // ISTTProvider
export const TTS_PROVIDERS = Symbol("TTS_PROVIDERS"); // Map<SpeechTier, ITTSProvider>
```

### Fallback Chain

TTS fallback order: `premium` -> `default` -> `fallback`

- Chain starts at the requested tier and goes downward
- Only tiers that are both enabled AND have a registered provider are attempted
- `ServiceUnavailableException` if all providers fail
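
The chain rules above can be sketched as follows. `TtsProviderLike`, `buildFallbackChain`, and `synthesizeWithFallback` are hypothetical names, and the sketch assumes that a tier's presence in the map already implies it is enabled; the actual `SpeechService` may track those two conditions separately.

```typescript
type SpeechTier = "premium" | "default" | "fallback";

// Hypothetical minimal provider shape for this sketch.
interface TtsProviderLike {
  synthesize(text: string): Promise<Uint8Array>;
}

const TIER_ORDER: readonly SpeechTier[] = ["premium", "default", "fallback"];

// Tiers to try, starting at the requested tier and moving downward,
// skipping tiers that have no registered provider.
function buildFallbackChain(
  requested: SpeechTier,
  providers: ReadonlyMap<SpeechTier, TtsProviderLike>,
): SpeechTier[] {
  const start = TIER_ORDER.indexOf(requested);
  return TIER_ORDER.slice(start).filter((tier) => providers.has(tier));
}

async function synthesizeWithFallback(
  text: string,
  requested: SpeechTier,
  providers: ReadonlyMap<SpeechTier, TtsProviderLike>,
): Promise<{ tier: SpeechTier; audio: Uint8Array }> {
  for (const tier of buildFallbackChain(requested, providers)) {
    const provider = providers.get(tier);
    if (!provider) continue;
    try {
      return { tier, audio: await provider.synthesize(text) };
    } catch {
      // This provider failed; move on to the next tier in the chain.
    }
  }
  // The real service throws ServiceUnavailableException at this point.
  throw new Error("All TTS providers failed");
}
```

Note that the chain never moves upward: a request for `default` can fall back to `fallback` but is never served by `premium`.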

### WebSocket Gateway

- Separate `/speech` namespace (not on the main gateway)
- Authentication mirrors the main WS gateway pattern (token extraction from handshake)
- One session per client, accumulates audio chunks in memory
- Chunks concatenated and transcribed on `stop-transcription`
- Session cleanup on disconnect
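
A rough sketch of the per-client session lifecycle described above. The `TranscriptionSessions` class and its method names are hypothetical, and `Uint8Array` is used in place of Node's `Buffer` to keep the example free of runtime-specific types; the gateway's actual bookkeeping may differ.

```typescript
// Hypothetical in-memory session store keyed by client ID.
class TranscriptionSessions {
  private readonly chunks = new Map<string, Uint8Array[]>();

  // One session per client: starting again replaces any existing session.
  start(clientId: string): void {
    this.chunks.set(clientId, []);
  }

  append(clientId: string, chunk: Uint8Array): void {
    const session = this.chunks.get(clientId);
    if (!session) throw new Error(`No active session for ${clientId}`);
    session.push(chunk);
  }

  // Concatenate all chunks into one buffer and end the session.
  // The result is what would be handed to the STT provider.
  stop(clientId: string): Uint8Array {
    const session = this.chunks.get(clientId) ?? [];
    this.chunks.delete(clientId);
    const total = session.reduce((sum, c) => sum + c.byteLength, 0);
    const audio = new Uint8Array(total);
    let offset = 0;
    for (const c of session) {
      audio.set(c, offset);
      offset += c.byteLength;
    }
    return audio;
  }

  // Call on disconnect so abandoned sessions do not hold memory.
  cleanup(clientId: string): void {
    this.chunks.delete(clientId);
  }

  has(clientId: string): boolean {
    return this.chunks.has(clientId);
  }
}
```

Since chunks are held in memory until `stop-transcription`, a long recording grows the session without bound unless the upload limits are enforced while chunks arrive.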

## How to Add a New TTS Provider

1. **Create the provider class** in `providers/`:

   ```typescript
   // providers/my-tts.provider.ts
   import { BaseTTSProvider } from "./base-tts.provider";
   // VoiceInfo is assumed to be exported alongside SpeechTier; adjust the path if it lives elsewhere.
   import type { SpeechTier, VoiceInfo } from "../interfaces/speech-types";

   export class MyTtsProvider extends BaseTTSProvider {
     readonly name = "my-provider";
     readonly tier: SpeechTier = "default"; // Choose tier

     constructor(baseURL: string) {
       super(baseURL, "default-voice", "mp3");
     }

     override listVoices(): Promise<VoiceInfo[]> {
       // Return your voice catalog; the empty list is only a placeholder
       return Promise.resolve([]);
     }
   }
   ```

2. **Add env var handling** to `speech.config.ts`:
   - Add enabled check function
   - Add URL to validation in `validateSpeechConfig()`
   - Add config section in `getSpeechConfig()`

3. **Register in factory** (`tts-provider.factory.ts`):

   ```typescript
   // "myTier" is a placeholder for the SpeechTier value your provider serves
   if (config.tts.myTier.enabled) {
     const provider = new MyTtsProvider(config.tts.myTier.url);
     providers.set("myTier", provider);
   }
   ```

4. **Add env vars** to `.env.example`

5. **Write tests** following existing patterns (mock OpenAI SDK, test synthesis + listVoices + isHealthy)

## How to Add a New STT Provider

1. **Implement `ISTTProvider`** (does not use a base class -- STT has only one implementation currently)
2. **Add config section** similar to `stt` in `speech.config.ts`
3. **Register** in `speech.module.ts` providers array with `STT_PROVIDER` token
4. **Write tests** following `speaches-stt.provider.spec.ts` pattern

## Common Gotchas

- **OpenAI SDK `apiKey`**: Self-hosted services do not require an API key. Use `apiKey: "not-needed"` when creating the OpenAI client.
- **`toFile()` import**: The `toFile` helper is imported from `"openai"` (not from a subpath). Used in the STT provider to convert Buffer to a File-like object for multipart upload.
- **Health check URL**: `BaseTTSProvider.isHealthy()` calls `GET /v1/models`. The base URL is expected to end with `/v1`.
- **Voice ID prefix parsing**: Kokoro voice IDs encode language + gender in first two characters. See `parseVoicePrefix()` in `kokoro-tts.provider.ts`.
- **Chatterbox extra body params**: The `reference_audio` (base64) and `exaggeration` fields are passed via the OpenAI SDK by casting the request body. This works because the SDK passes through unknown fields.
- **WebSocket auth**: The gateway checks `auth.token`, then `query.token`, then `Authorization` header (in that order). Match this in test setup.
- **Config validation timing**: `validateSpeechConfig()` runs at module init (`onModuleInit`), not at provider construction. This means a misconfigured provider will fail at startup, not at first request.
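
For the WebSocket auth gotcha, the lookup order can be illustrated with the sketch below. `HandshakeLike` and `extractToken` are hypothetical, and the `Bearer <token>` header format is an assumption; the gateway's own extraction code is the source of truth.

```typescript
// Hypothetical subset of a Socket.IO-style handshake object.
interface HandshakeLike {
  auth?: { token?: unknown };
  query?: { token?: unknown };
  headers?: { authorization?: string };
}

// Returns the first token found, checking auth.token, then query.token,
// then the Authorization header, in that order.
function extractToken(handshake: HandshakeLike): string | undefined {
  const fromAuth = handshake.auth?.token;
  if (typeof fromAuth === "string" && fromAuth.length > 0) return fromAuth;

  const fromQuery = handshake.query?.token;
  if (typeof fromQuery === "string" && fromQuery.length > 0) return fromQuery;

  const header = handshake.headers?.authorization;
  if (header) {
    // Assumes a "Bearer <token>" scheme; the gateway may parse this differently.
    const [scheme, value] = header.split(" ");
    if (scheme === "Bearer" && value) return value;
  }
  return undefined;
}
```

In tests, supplying the token through `auth.token` is the least ambiguous option, because that source takes precedence over the other two.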

## Test Patterns

### Mocking OpenAI SDK

All provider tests mock the OpenAI SDK. Pattern:

```typescript
vi.mock("openai", () => ({
  default: vi.fn().mockImplementation(() => ({
    audio: {
      speech: {
        create: vi.fn().mockResolvedValue({
          arrayBuffer: () => Promise.resolve(new ArrayBuffer(10)),
        }),
      },
      transcriptions: {
        create: vi.fn().mockResolvedValue({
          text: "transcribed text",
          language: "en",
          duration: 3.5,
        }),
      },
    },
    models: { list: vi.fn().mockResolvedValue({ data: [] }) },
  })),
}));
```

### Mocking Config Injection

```typescript
const mockConfig: SpeechConfig = {
  stt: { enabled: true, baseUrl: "http://test:8000/v1", model: "test-model", language: "en" },
  tts: {
    default: { enabled: true, url: "http://test:8880/v1", voice: "af_heart", format: "mp3" },
    premium: { enabled: false, url: "" },
    fallback: { enabled: false, url: "" },
  },
  limits: { maxUploadSize: 25000000, maxDurationSeconds: 600, maxTextLength: 4096 },
};
```

### Config Test Pattern

`speech.config.spec.ts` saves and restores `process.env` around each test:

```typescript
let savedEnv: NodeJS.ProcessEnv;

beforeEach(() => {
  savedEnv = { ...process.env };
});

afterEach(() => {
  process.env = savedEnv;
});
```

## Key Files

| File                                | Purpose                                                                  |
| ----------------------------------- | ------------------------------------------------------------------------ |
| `speech.module.ts`                  | Module registration with conditional providers                           |
| `speech.config.ts`                  | All speech env vars + validation (51 tests)                              |
| `speech.service.ts`                 | Core service: transcribe, synthesize (with fallback), listVoices         |
| `speech.controller.ts`              | REST endpoints: POST transcribe, POST synthesize, GET voices, GET health |
| `speech.gateway.ts`                 | WebSocket streaming transcription (/speech namespace)                    |
| `providers/base-tts.provider.ts`    | Abstract base for all TTS providers (OpenAI SDK wrapper)                 |
| `providers/tts-provider.factory.ts` | Creates provider instances from config                                   |
| `interfaces/speech-types.ts`        | All shared types: SpeechTier, AudioFormat, options, results              |