docs(#406): add speech services documentation

Comprehensive documentation for the speech services module:
- docs/SPEECH.md: Architecture, API reference, WebSocket protocol,
  environment variables, provider configuration, Docker setup,
  GPU VRAM budget, and frontend integration examples
- apps/api/src/speech/AGENTS.md: Module structure, provider pattern,
  how to add new providers, gotchas, and test patterns
- README.md: Speech capabilities section with quick start

Fixes #406

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Committed 2026-02-15 03:23:22 -06:00 · parent bc86947d01 · commit 24065aa199 · 3 changed files with 1213 additions and 13 deletions

docs/SPEECH.md (new file, 929 lines)
# Speech Services
Mosaic Stack provides integrated speech-to-text (STT) and text-to-speech (TTS) services through a provider abstraction layer. Speech services are optional and modular -- each component can be independently enabled, disabled, or pointed at external infrastructure.
## Table of Contents
- [Architecture Overview](#architecture-overview)
- [Provider Abstraction](#provider-abstraction)
- [TTS Tier System and Fallback Chain](#tts-tier-system-and-fallback-chain)
- [API Endpoint Reference](#api-endpoint-reference)
- [WebSocket Streaming Protocol](#websocket-streaming-protocol)
- [Environment Variable Reference](#environment-variable-reference)
- [Provider Configuration](#provider-configuration)
- [Voice Cloning Setup (Chatterbox)](#voice-cloning-setup-chatterbox)
- [Docker Compose Setup](#docker-compose-setup)
- [GPU VRAM Budget](#gpu-vram-budget)
- [Frontend Integration](#frontend-integration)
---
## Architecture Overview
```
+------------------+          +---------------------+
| SpeechController |          |    SpeechGateway    |
| (REST endpoints) |          | (WebSocket /speech) |
+--------+---------+          +----------+----------+
         |                               |
         +---------------+---------------+
                         |
                         v
          +-----------------------------+
          |        SpeechService        |
          |    (provider selection,     |
          |   fallback orchestration)   |
          +-------+-------------+-------+
                  |             |
                  v             v
        +---------------+  +----------------------+
        | STT Provider  |  |     TTS Providers    |
        | (Speaches)    |  |  Map<Tier, Provider> |
        +-------+-------+  +----+------+------+---+
                |               |      |      |
                v               v      v      v
        +---------------+ +---------+ +----------+ +----------+
        |   Speaches    | | Kokoro  | |Chatterbox| |  Piper   |
        |   (Whisper)   | |(default)| |(premium) | |(fallback)|
        +---------------+ +---------+ +----------+ +----------+

(The SpeechGateway authenticates WebSocket clients and calls
SpeechService.transcribe() with the accumulated audio.)
```
The speech module (`apps/api/src/speech/`) is a self-contained NestJS module consisting of:
| Component | File | Purpose |
| ---------- | ---------------------- | ------------------------------------------ |
| Module | `speech.module.ts` | Registers providers, controllers, gateway |
| Config | `speech.config.ts` | Environment validation and typed config |
| Service | `speech.service.ts` | High-level speech operations with fallback |
| Controller | `speech.controller.ts` | REST API endpoints |
| Gateway | `speech.gateway.ts` | WebSocket streaming transcription |
| Constants | `speech.constants.ts` | NestJS injection tokens |
### Key Design Decisions
1. **OpenAI-compatible APIs**: All providers (Speaches, Kokoro, Chatterbox, Piper/OpenedAI) expose OpenAI-compatible endpoints. The official OpenAI SDK is used as the HTTP client with a custom `baseURL`.
2. **Provider abstraction**: STT and TTS providers implement well-defined interfaces (`ISTTProvider`, `ITTSProvider`). New providers can be added without modifying the service layer.
3. **Conditional registration**: Providers are only instantiated when their corresponding `*_ENABLED` flag is `true`. The STT provider uses NestJS `@Optional()` injection.
4. **Fail-fast validation**: Configuration is validated at module initialization. If a service is enabled but its URL is missing, the application fails on startup with a descriptive error.
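The fail-fast rule in decision 4 can be sketched as a small validator. The function and parameter names here are illustrative, not the module's actual helpers; only the error wording follows the documented message.

```typescript
// Illustrative sketch of the documented fail-fast startup check.
// The real validation lives in speech.config.ts.
function requireUrlWhenEnabled(
  service: string,
  enabledFlag: string,
  urlVar: string,
  env: Record<string, string | undefined>,
): void {
  const enabled = env[enabledFlag] === "true" || env[enabledFlag] === "1";
  if (enabled && !env[urlVar]) {
    throw new Error(
      `${service} is enabled (${enabledFlag}=true) but required environment ` +
        `variables are missing or empty: ${urlVar}. Either set these variables ` +
        `or disable by setting ${enabledFlag}=false.`,
    );
  }
}
```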
---
## Provider Abstraction
### STT Provider Interface
```typescript
interface ISTTProvider {
  readonly name: string;
  transcribe(audio: Buffer, options?: TranscribeOptions): Promise<TranscriptionResult>;
  isHealthy(): Promise<boolean>;
}
```
Currently implemented by `SpeachesSttProvider` which connects to a Speaches (faster-whisper) server.
### TTS Provider Interface
```typescript
interface ITTSProvider {
  readonly name: string;
  readonly tier: SpeechTier;
  synthesize(text: string, options?: SynthesizeOptions): Promise<SynthesisResult>;
  listVoices(): Promise<VoiceInfo[]>;
  isHealthy(): Promise<boolean>;
}
```
All TTS providers extend `BaseTTSProvider`, an abstract class that implements common OpenAI-compatible synthesis logic. Concrete providers only need to set `name` and `tier` and optionally override `listVoices()` or `synthesize()`.
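A minimal sketch of this pattern (the real `BaseTTSProvider` also carries the shared OpenAI-compatible synthesis logic; the identifiers in the class bodies below are illustrative stubs, not the module's actual values):

```typescript
// Illustrative sketch of the provider inheritance pattern.
type SpeechTier = "default" | "premium" | "fallback";

abstract class BaseTTSProvider {
  abstract readonly name: string;
  abstract readonly tier: SpeechTier;
  // Shared OpenAI-compatible synthesis logic would live here.
}

// A concrete provider only declares its identity; synthesis is inherited.
class KokoroTtsProvider extends BaseTTSProvider {
  readonly name = "kokoro";
  readonly tier: SpeechTier = "default";
}
```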
### Provider Registration
Providers are created by the `TTS Provider Factory` (`providers/tts-provider.factory.ts`) based on configuration:
| Tier | Provider Class | Engine | Requirements |
| ---------- | ----------------------- | ------------------------- | ------------ |
| `default` | `KokoroTtsProvider` | Kokoro-FastAPI | CPU only |
| `premium` | `ChatterboxTTSProvider` | Chatterbox TTS Server | NVIDIA GPU |
| `fallback` | `PiperTtsProvider` | Piper via OpenedAI Speech | CPU only |
---
## TTS Tier System and Fallback Chain
TTS uses a tiered architecture with automatic fallback:
```
Request with tier="premium"
        |
        v
[premium]  Chatterbox available? --yes--> synthesize with Chatterbox
        |                                   (on failure, fall through)
        | no / failed
        v
[default]  Kokoro available? -----yes--> synthesize with Kokoro
        |                                   (on failure, fall through)
        | no / failed
        v
[fallback] Piper available? ------yes--> synthesize with Piper
        |                                   (on failure, fall through)
        | no / failed
        v
ServiceUnavailableException
```
**Fallback order:** `premium` -> `default` -> `fallback`
The fallback chain starts from the requested tier and proceeds downward. A tier is only attempted if:
1. It is enabled in configuration (`TTS_ENABLED`, `TTS_PREMIUM_ENABLED`, `TTS_FALLBACK_ENABLED`)
2. A provider is registered for that tier
If no tier is specified in the request, `default` is used as the starting point.
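The tier-selection rules above can be sketched as follows. The function and constant names are illustrative, not the actual service internals:

```typescript
// Illustrative sketch of the documented fallback order.
type SpeechTier = "default" | "premium" | "fallback";

const FALLBACK_ORDER: SpeechTier[] = ["premium", "default", "fallback"];

// Starting from the requested tier (default if unspecified), keep only
// tiers that are enabled and have a registered provider.
function tiersToTry(
  requested: SpeechTier | undefined,
  enabled: Set<SpeechTier>,
): SpeechTier[] {
  const start = FALLBACK_ORDER.indexOf(requested ?? "default");
  return FALLBACK_ORDER.slice(start).filter((tier) => enabled.has(tier));
}
```

A request for `premium` with only `default` and `fallback` enabled would therefore skip straight to Kokoro, then Piper.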
---
## API Endpoint Reference
All speech endpoints are under `/api/speech/` and require authentication (Bearer token) plus workspace context (`x-workspace-id` header).
### POST /api/speech/transcribe
Transcribe an uploaded audio file to text.
**Authentication:** Bearer token + workspace membership
**Content-Type:** `multipart/form-data`
**Form Fields:**
| Field | Type | Required | Description |
| ------------- | ------ | -------- | ------------------------------------------------------ |
| `file` | File | Yes | Audio file (max 25 MB) |
| `language` | string | No | Language code (e.g., "en", "fr"). Default: from config |
| `model` | string | No | Whisper model override. Default: from config |
| `prompt` | string | No | Prompt to guide transcription (max 1000 chars) |
| `temperature` | number | No | Temperature 0.0-1.0. Lower = more deterministic |
**Accepted Audio Formats:**
`audio/wav`, `audio/mp3`, `audio/mpeg`, `audio/webm`, `audio/ogg`, `audio/flac`, `audio/x-m4a`
**Response:**
```json
{
  "data": {
    "text": "Hello, this is a transcription test.",
    "language": "en",
    "durationSeconds": 3.5,
    "confidence": 0.95,
    "segments": [
      {
        "text": "Hello, this is a transcription test.",
        "start": 0.0,
        "end": 3.5,
        "confidence": 0.95
      }
    ]
  }
}
```
**Example:**
```bash
curl -X POST http://localhost:3001/api/speech/transcribe \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -F "file=@recording.wav" \
  -F "language=en"
```
### POST /api/speech/synthesize
Synthesize text to audio using TTS providers.
**Authentication:** Bearer token + workspace membership
**Content-Type:** `application/json`
**Request Body:**
| Field | Type | Required | Description |
| -------- | ------ | -------- | ----------------------------------------------------------- |
| `text` | string | Yes | Text to synthesize (max 4096 chars) |
| `voice` | string | No | Voice ID. Default: from config (e.g., "af_heart") |
| `speed` | number | No | Speed multiplier 0.5-2.0. Default: 1.0 |
| `format` | string | No | Output format: mp3, wav, opus, flac, aac, pcm. Default: mp3 |
| `tier` | string | No | Provider tier: default, premium, fallback. Default: default |
**Response:** Binary audio data with appropriate `Content-Type` header.
| Format | Content-Type |
| ------ | ------------ |
| mp3 | `audio/mpeg` |
| wav | `audio/wav` |
| opus | `audio/opus` |
| flac | `audio/flac` |
| aac | `audio/aac` |
| pcm | `audio/pcm` |
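For reference, the same mapping as a lookup table (a sketch mirroring the table above; the controller's actual implementation may differ):

```typescript
// Format-to-MIME mapping as documented for the synthesize endpoint.
const AUDIO_CONTENT_TYPES: Record<string, string> = {
  mp3: "audio/mpeg",
  wav: "audio/wav",
  opus: "audio/opus",
  flac: "audio/flac",
  aac: "audio/aac",
  pcm: "audio/pcm",
};
```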
**Example:**
```bash
curl -X POST http://localhost:3001/api/speech/synthesize \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "af_heart", "format": "mp3"}' \
  --output speech.mp3
```
### GET /api/speech/voices
List available TTS voices across all tiers.
**Authentication:** Bearer token + workspace access
**Query Parameters:**
| Parameter | Type | Required | Description |
| --------- | ------ | -------- | ------------------------------------------ |
| `tier` | string | No | Filter by tier: default, premium, fallback |
**Response:**
```json
{
  "data": [
    {
      "id": "af_heart",
      "name": "Heart (American Female)",
      "language": "en-US",
      "tier": "default",
      "isDefault": true
    },
    {
      "id": "am_adam",
      "name": "Adam (American Male)",
      "language": "en-US",
      "tier": "default",
      "isDefault": false
    }
  ]
}
```
**Example:**
```bash
curl -X GET 'http://localhost:3001/api/speech/voices?tier=default' \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID"
```
### GET /api/speech/health
Check availability of STT and TTS providers.
**Authentication:** Bearer token + workspace access
**Response:**
```json
{
  "data": {
    "stt": { "available": true },
    "tts": { "available": true }
  }
}
```
---
## WebSocket Streaming Protocol
The speech module provides a WebSocket gateway at namespace `/speech` for real-time streaming transcription. Audio chunks are accumulated on the server and transcribed when the session is stopped.
### Connection
Connect to the `/speech` namespace with authentication:
```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: "YOUR_SESSION_TOKEN" },
});
```
**Authentication methods** (checked in order):
1. `auth.token` in handshake
2. `query.token` in handshake URL
3. `Authorization: Bearer <token>` header
Connection is rejected if:
- No valid token is provided
- Session verification fails
- User has no workspace membership
**Connection timeout:** 5 seconds for authentication.
### Protocol Flow
```
Client                                   Server
  |                                       |
  |--- connect (with token) ------------->|
  |                                       | (authenticate, check workspace)
  |<--- connected ------------------------|
  |                                       |
  |--- start-transcription -------------->| { language?: "en" }
  |<--- transcription-started ------------| { sessionId, language }
  |                                       |
  |--- audio-chunk ---------------------->| (Buffer/Uint8Array)
  |--- audio-chunk ---------------------->| (Buffer/Uint8Array)
  |--- audio-chunk ---------------------->| (Buffer/Uint8Array)
  |                                       |
  |--- stop-transcription --------------->|
  |                                       | (concatenate chunks, transcribe)
  |<--- transcription-final --------------| { text, language, durationSeconds, ... }
  |                                       |
```
### Client Events (emit)
| Event | Payload | Description |
| --------------------- | ------------------------ | ---------------------------------------- |
| `start-transcription` | `{ language?: string }` | Begin a new transcription session |
| `audio-chunk` | `Buffer` or `Uint8Array` | Send audio data chunk |
| `stop-transcription` | (none) | Stop recording and trigger transcription |
### Server Events (listen)
| Event | Payload | Description |
| ----------------------- | ----------------------------------------------------------- | -------------------------- |
| `transcription-started` | `{ sessionId, language }` | Session created |
| `transcription-final` | `{ text, language, durationSeconds, confidence, segments }` | Transcription result |
| `transcription-error` | `{ message }` | Error during transcription |
### Session Management
- One active transcription session per client connection
- Starting a new session replaces any existing session
- Sessions are cleaned up on client disconnect
- Audio chunks are accumulated in memory
- Total accumulated size is capped by `SPEECH_MAX_UPLOAD_SIZE` (default: 25 MB)
### Example Client Usage
```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: sessionToken },
});

// Start recording
socket.emit("start-transcription", { language: "en" });
socket.on("transcription-started", ({ sessionId }) => {
  console.log("Session started:", sessionId);
});

// Stream audio chunks from MediaRecorder
mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    event.data.arrayBuffer().then((buffer) => {
      socket.emit("audio-chunk", new Uint8Array(buffer));
    });
  }
};

// Stop and get result
socket.emit("stop-transcription");
socket.on("transcription-final", (result) => {
  console.log("Transcription:", result.text);
  console.log("Duration:", result.durationSeconds, "seconds");
});
socket.on("transcription-error", ({ message }) => {
  console.error("Transcription error:", message);
});
```
---
## Environment Variable Reference
### Speech-to-Text (STT)
| Variable | Default | Description |
| -------------- | --------------------------------------- | ---------------------------------------------------- |
| `STT_ENABLED` | `false` | Enable speech-to-text transcription |
| `STT_BASE_URL` | `http://speaches:8000/v1` | Speaches server URL (required when STT_ENABLED=true) |
| `STT_MODEL` | `Systran/faster-whisper-large-v3-turbo` | Whisper model for transcription |
| `STT_LANGUAGE` | `en` | Default language code |
### Text-to-Speech (TTS) - Default Engine (Kokoro)
| Variable | Default | Description |
| -------------------- | --------------------------- | --------------------------------------------------- |
| `TTS_ENABLED` | `false` | Enable default TTS engine |
| `TTS_DEFAULT_URL` | `http://kokoro-tts:8880/v1` | Kokoro-FastAPI URL (required when TTS_ENABLED=true) |
| `TTS_DEFAULT_VOICE` | `af_heart` | Default Kokoro voice ID |
| `TTS_DEFAULT_FORMAT` | `mp3` | Default audio output format |
### Text-to-Speech (TTS) - Premium Engine (Chatterbox)
| Variable | Default | Description |
| --------------------- | ------------------------------- | ----------------------------------------------------------- |
| `TTS_PREMIUM_ENABLED` | `false` | Enable premium TTS engine |
| `TTS_PREMIUM_URL` | `http://chatterbox-tts:8881/v1` | Chatterbox TTS URL (required when TTS_PREMIUM_ENABLED=true) |
### Text-to-Speech (TTS) - Fallback Engine (Piper/OpenedAI)
| Variable | Default | Description |
| ---------------------- | -------------------------------- | ------------------------------------------------------------- |
| `TTS_FALLBACK_ENABLED` | `false` | Enable fallback TTS engine |
| `TTS_FALLBACK_URL` | `http://openedai-speech:8000/v1` | OpenedAI Speech URL (required when TTS_FALLBACK_ENABLED=true) |
### Service Limits
| Variable | Default | Description |
| ----------------------------- | ---------- | ---------------------------------------------- |
| `SPEECH_MAX_UPLOAD_SIZE` | `25000000` | Maximum upload file size in bytes (25 MB) |
| `SPEECH_MAX_DURATION_SECONDS` | `600` | Maximum audio duration in seconds (10 minutes) |
| `SPEECH_MAX_TEXT_LENGTH` | `4096` | Maximum text length for TTS in characters |
### Conditional Validation
When a service is enabled, its URL variable is required. If missing, the application fails at startup with a message like:
```
STT is enabled (STT_ENABLED=true) but required environment variables are missing or empty: STT_BASE_URL.
Either set these variables or disable by setting STT_ENABLED=false.
```
Boolean parsing: `value === "true"` or `value === "1"`. Unset or empty values default to `false`.
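The documented parsing rule as a one-liner (illustrative; the actual config code may be structured differently):

```typescript
// Matches the documented rule: only the exact strings "true" or "1"
// enable a flag. The comparison is case-sensitive, so "TRUE" is false,
// as are unset and empty values.
function parseBooleanFlag(value: string | undefined): boolean {
  return value === "true" || value === "1";
}
```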
---
## Provider Configuration
### Kokoro (Default Tier)
**Engine:** [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)
**License:** Apache 2.0
**Requirements:** CPU only
**Docker Image:** `ghcr.io/remsky/kokoro-fastapi:latest-cpu`
**Capabilities:**
- 54 built-in voices across 8 languages
- Speed control: 0.25x to 4.0x
- Output formats: mp3, wav, opus, flac
- Voice metadata derived from ID prefix (language, gender, accent)
**Voice ID Format:** `{lang}{gender}_{name}`
- First character: language/accent (a=American, b=British, e=Spanish, f=French, h=Hindi, j=Japanese, p=Portuguese, z=Chinese)
- Second character: gender (f=Female, m=Male)
**Example voices:**
| Voice ID | Name | Language | Gender |
|----------|------|----------|--------|
| `af_heart` | Heart | en-US | Female |
| `am_adam` | Adam | en-US | Male |
| `bf_alice` | Alice | en-GB | Female |
| `bm_daniel` | Daniel | en-GB | Male |
| `ef_dora` | Dora | es | Female |
| `ff_camille` | Camille | fr | Female |
| `jf_alpha` | Alpha | ja | Female |
| `zf_xiaobei` | Xiaobei | zh | Female |
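The prefix scheme above can be decoded mechanically. This helper is illustrative (not part of the module); the language mapping follows the prefix table, with the `a`/`b` prefixes resolved to `en-US`/`en-GB` as in the example voices:

```typescript
// Illustrative decoder for the documented Kokoro voice ID scheme:
// {lang}{gender}_{name}, e.g. "af_heart".
const KOKORO_LANGS: Record<string, string> = {
  a: "en-US", // American English
  b: "en-GB", // British English
  e: "es",
  f: "fr",
  h: "hi",
  j: "ja",
  p: "pt",
  z: "zh",
};

function parseKokoroVoiceId(id: string) {
  const [prefix, name] = id.split("_");
  return {
    language: KOKORO_LANGS[prefix[0]] ?? "unknown",
    gender: prefix[1] === "f" ? "Female" : "Male",
    name,
  };
}
```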
### Chatterbox (Premium Tier)
**Engine:** [Chatterbox TTS Server](https://github.com/devnen/chatterbox-tts-server)
**License:** Proprietary
**Requirements:** NVIDIA GPU with CUDA
**Docker Image:** `devnen/chatterbox-tts-server:latest`
**Capabilities:**
- Voice cloning via reference audio sample
- Emotion exaggeration control (0.0 - 1.0)
- Cross-language voice transfer (23 languages)
- Higher quality synthesis than default tier
**Supported Languages:**
en, fr, de, es, it, pt, nl, pl, ru, uk, ja, zh, ko, ar, hi, tr, sv, da, fi, no, cs, el, ro
**Extended Options (Chatterbox-specific):**
| Option | Type | Description |
| --------------------- | ------ | --------------------------------------------------------- |
| `referenceAudio` | Buffer | Audio sample for voice cloning (5-30 seconds recommended) |
| `emotionExaggeration` | number | Emotion intensity 0.0-1.0 (clamped) |
These are passed as extra body parameters to the OpenAI-compatible endpoint. Reference audio is base64-encoded before sending.
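An illustrative sketch of how those extras might be prepared before the request. The outgoing field names (`reference_audio`, `exaggeration`) are assumptions here, not confirmed wire-format names:

```typescript
// Sketch only: base64-encode reference audio and clamp emotion
// exaggeration to the documented 0.0-1.0 range. Field names are assumed.
function buildChatterboxExtras(opts: {
  referenceAudio?: Buffer;
  emotionExaggeration?: number;
}): Record<string, unknown> {
  const extras: Record<string, unknown> = {};
  if (opts.referenceAudio) {
    extras.reference_audio = opts.referenceAudio.toString("base64");
  }
  if (opts.emotionExaggeration !== undefined) {
    extras.exaggeration = Math.min(1, Math.max(0, opts.emotionExaggeration));
  }
  return extras;
}
```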
### Piper (Fallback Tier)
**Engine:** [Piper](https://github.com/rhasspy/piper) via [OpenedAI Speech](https://github.com/matatonic/openedai-speech)
**License:** GPL (OpenedAI Speech)
**Requirements:** CPU only (runs on Raspberry Pi)
**Docker Image:** the upstream OpenedAI Speech image (see the OpenedAI Speech repository for current tags)
**Capabilities:**
- 100+ voices across 40+ languages
- 6 standard OpenAI voice names (mapped to Piper voices)
- Output formats: mp3, wav, opus, flac
- Ultra-lightweight, designed for low-resource environments
**Standard Voice Mapping:**
| OpenAI Voice | Piper Voice | Gender | Description |
| ------------ | -------------------- | ------ | --------------------- |
| `alloy` | en_US-amy-medium | Female | Warm, balanced |
| `echo` | en_US-ryan-medium | Male | Clear, articulate |
| `fable` | en_GB-alan-medium | Male | British narrator |
| `onyx` | en_US-danny-low | Male | Deep, resonant |
| `nova` | en_US-lessac-medium | Female | Expressive, versatile |
| `shimmer` | en_US-kristin-medium | Female | Bright, energetic |
### Speaches (STT)
**Engine:** [Speaches](https://github.com/speaches-ai/speaches) (faster-whisper backend)
**License:** MIT
**Requirements:** CPU (GPU optional for faster inference)
**Docker Image:** `ghcr.io/speaches-ai/speaches:latest`
**Capabilities:**
- OpenAI-compatible `/v1/audio/transcriptions` endpoint
- Whisper models via faster-whisper
- Verbose JSON response with segments and timestamps
- Language detection
**Default model:** `Systran/faster-whisper-large-v3-turbo`
---
## Voice Cloning Setup (Chatterbox)
Voice cloning is available through the Chatterbox premium TTS provider.
### Prerequisites
1. NVIDIA GPU with CUDA support
2. `nvidia-container-toolkit` installed on the Docker host
3. Docker runtime configured for GPU access
4. TTS premium tier enabled (`TTS_PREMIUM_ENABLED=true`)
### Basic Voice Cloning
Provide a reference audio sample (WAV or MP3, 5-30 seconds) when calling synthesize:
```typescript
import { SpeechService } from "./speech.service";
import type { ChatterboxSynthesizeOptions } from "./interfaces/speech-types";

const options: ChatterboxSynthesizeOptions = {
  tier: "premium",
  referenceAudio: myAudioBuffer, // 5-30 second audio sample
  emotionExaggeration: 0.5, // 0.0 = neutral, 1.0 = maximum emotion
};

const result = await speechService.synthesize("Hello, this is my cloned voice!", options);
```
### Voice Cloning Tips
- **Audio quality:** Use clean recordings without background noise
- **Duration:** 5-30 seconds works best; shorter clips may produce lower quality
- **Format:** WAV provides the best quality; MP3 is also accepted
- **Emotion:** Start with 0.5 (moderate) and adjust from there
- **Cross-language:** You can clone a voice in one language and synthesize in another
---
## Docker Compose Setup
### Development (Local)
Speech services are defined in a separate overlay file `docker-compose.speech.yml`. This keeps them optional and separate from core services.
**Start basic speech services (STT + default TTS):**
```bash
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml up -d
# Using Makefile
make speech-up
```
**Start with premium TTS (requires NVIDIA GPU):**
```bash
docker compose -f docker-compose.yml -f docker-compose.speech.yml --profile premium-tts up -d
```
**Stop speech services:**
```bash
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml down --remove-orphans
# Using Makefile
make speech-down
```
**View logs:**
```bash
make speech-logs
```
### Development Services
| Service | Container | Port | Image |
| -------------- | --------------------- | ------------------------------- | ------------------------------------------ |
| Speaches (STT) | mosaic-speaches | 8090 (host) -> 8000 (container) | `ghcr.io/speaches-ai/speaches:latest` |
| Kokoro TTS | mosaic-kokoro-tts | 8880 (host) -> 8880 (container) | `ghcr.io/remsky/kokoro-fastapi:latest-cpu` |
| Chatterbox TTS | mosaic-chatterbox-tts | 8881 (host) -> 8000 (container) | `devnen/chatterbox-tts-server:latest` |
### Production (Docker Swarm)
For production deployments, use `docker/docker-compose.sample.speech.yml`. This file is designed for Docker Swarm with Traefik integration.
**Required environment variables:**
```bash
STT_DOMAIN=stt.example.com
TTS_DOMAIN=tts.example.com
```
**Optional environment variables:**
```bash
WHISPER_MODEL=Systran/faster-whisper-large-v3-turbo
CHATTERBOX_TTS_DOMAIN=tts-premium.example.com
TRAEFIK_ENTRYPOINT=websecure
TRAEFIK_CERTRESOLVER=letsencrypt
TRAEFIK_DOCKER_NETWORK=traefik-public
TRAEFIK_TLS_ENABLED=true
```
**Deploy:**
```bash
docker stack deploy -c docker/docker-compose.sample.speech.yml speech
```
**Connecting to Mosaic Stack:** Set the speech URLs in your Mosaic Stack `.env`:
```bash
# Same Docker network
STT_BASE_URL=http://speaches:8000/v1
TTS_DEFAULT_URL=http://kokoro-tts:8880/v1
# External / different network
STT_BASE_URL=https://stt.example.com/v1
TTS_DEFAULT_URL=https://tts.example.com/v1
```
### Health Checks
All speech containers include health checks:
| Service | Endpoint | Interval | Start Period |
| -------------- | ------------------------------ | -------- | ------------ |
| Speaches | `http://localhost:8000/health` | 30s | 120s |
| Kokoro TTS | `http://localhost:8880/health` | 30s | 120s |
| Chatterbox TTS | `http://localhost:8000/health` | 30s | 180s |
Chatterbox has a longer start period (180s) because GPU model loading takes additional time.
---
## GPU VRAM Budget
Only Chatterbox requires GPU resources. The other providers (Speaches, Kokoro, Piper) are CPU-only.
### Chatterbox VRAM Requirements
| Component | Approximate VRAM |
| ----------------------- | ------------------ |
| Chatterbox TTS model | ~2-4 GB |
| Voice cloning inference | ~1-2 GB additional |
| **Total recommended** | **4-6 GB** |
### Shared GPU Considerations
If running multiple GPU services (e.g., Ollama for LLM + Chatterbox for TTS):
| Service | VRAM Usage | Notes |
| -------------------- | ----------- | --------------------------------- |
| Ollama (7B model) | ~4-6 GB | Depends on model size |
| Ollama (13B model) | ~8-10 GB | Larger models need more |
| Chatterbox TTS | ~4-6 GB | Voice cloning is memory-intensive |
| **Combined minimum** | **8-12 GB** | For 7B LLM + Chatterbox |
**Recommendations:**
- 8 GB VRAM: Adequate for small LLM + Chatterbox (may need to alternate)
- 12 GB VRAM: Comfortable for 7B LLM + Chatterbox simultaneously
- 24 GB VRAM: Supports larger LLMs + Chatterbox with headroom
If VRAM is limited, consider:
1. Disabling Chatterbox (`TTS_PREMIUM_ENABLED=false`) and using Kokoro (CPU) as default
2. Using the fallback chain so Kokoro handles requests when Chatterbox is busy
3. Running Chatterbox on a separate GPU host
### Docker Swarm GPU Scheduling
For Docker Swarm deployments with GPU, configure generic resources on the node:
```json
// /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime"
    }
  },
  "node-generic-resources": ["NVIDIA-GPU=0"]
}
```
See the [Docker GPU Swarm documentation](https://docs.docker.com/engine/daemon/nvidia-gpu/#configure-gpus-for-docker-swarm) for details.
---
## Frontend Integration
Speech services are consumed from the frontend through the REST API and WebSocket gateway.
### REST API Usage
**Transcribe audio:**
```typescript
async function transcribeAudio(file: File, token: string, workspaceId: string) {
  const formData = new FormData();
  formData.append("file", file);
  formData.append("language", "en");

  const response = await fetch("/api/speech/transcribe", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
    body: formData,
  });

  const { data } = await response.json();
  return data.text;
}
```
**Synthesize speech:**
```typescript
async function synthesizeSpeech(
  text: string,
  token: string,
  workspaceId: string,
  voice = "af_heart"
) {
  const response = await fetch("/api/speech/synthesize", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text, voice, format: "mp3" }),
  });

  const audioBlob = await response.blob();
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  audio.play();
}
```
**List voices:**
```typescript
async function listVoices(token: string, workspaceId: string, tier?: string) {
  const url = tier ? `/api/speech/voices?tier=${tier}` : "/api/speech/voices";
  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });
  const { data } = await response.json();
  return data; // VoiceInfo[]
}
```
### WebSocket Streaming Usage
For real-time transcription using the browser's MediaRecorder API:
```typescript
import { io } from "socket.io-client";

function createSpeechSocket(token: string) {
  const socket = io("/speech", {
    auth: { token },
  });

  let mediaRecorder: MediaRecorder | null = null;

  async function startRecording() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    mediaRecorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
    });

    socket.emit("start-transcription", { language: "en" });

    mediaRecorder.ondataavailable = (event) => {
      if (event.data.size > 0) {
        event.data.arrayBuffer().then((buffer) => {
          socket.emit("audio-chunk", new Uint8Array(buffer));
        });
      }
    };

    mediaRecorder.start(250); // Send chunks every 250ms
  }

  async function stopRecording(): Promise<string> {
    return new Promise((resolve, reject) => {
      socket.once("transcription-final", (result) => {
        resolve(result.text);
      });
      socket.once("transcription-error", ({ message }) => {
        reject(new Error(message));
      });

      if (mediaRecorder) {
        mediaRecorder.stop();
        mediaRecorder.stream.getTracks().forEach((track) => track.stop());
        mediaRecorder = null;
      }
      socket.emit("stop-transcription");
    });
  }

  return { socket, startRecording, stopRecording };
}
```
### Check Speech Availability
Before showing speech UI elements, check provider availability:
```typescript
async function checkSpeechHealth(token: string, workspaceId: string) {
  const response = await fetch("/api/speech/health", {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });
  const { data } = await response.json();
  return {
    canTranscribe: data.stt.available,
    canSynthesize: data.tts.available,
  };
}
```