docs(#406): add speech services documentation

Comprehensive documentation for the speech services module:
- docs/SPEECH.md: Architecture, API reference, WebSocket protocol,
  environment variables, provider configuration, Docker setup,
  GPU VRAM budget, and frontend integration examples
- apps/api/src/speech/AGENTS.md: Module structure, provider pattern,
  how to add new providers, gotchas, and test patterns
- README.md: Speech capabilities section with quick start

Fixes #406

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Committed 2026-02-15 03:23:22 -06:00 · parent bc86947d01 · commit 24065aa199 · 3 changed files with 1213 additions and 13 deletions

docs/SPEECH.md (new file, 929 lines)
# Speech Services
Mosaic Stack provides integrated speech-to-text (STT) and text-to-speech (TTS) services through a provider abstraction layer. Speech services are optional and modular -- each component can be independently enabled, disabled, or pointed at external infrastructure.
## Table of Contents
- [Architecture Overview](#architecture-overview)
- [Provider Abstraction](#provider-abstraction)
- [TTS Tier System and Fallback Chain](#tts-tier-system-and-fallback-chain)
- [API Endpoint Reference](#api-endpoint-reference)
- [WebSocket Streaming Protocol](#websocket-streaming-protocol)
- [Environment Variable Reference](#environment-variable-reference)
- [Provider Configuration](#provider-configuration)
- [Voice Cloning Setup (Chatterbox)](#voice-cloning-setup-chatterbox)
- [Docker Compose Setup](#docker-compose-setup)
- [GPU VRAM Budget](#gpu-vram-budget)
- [Frontend Integration](#frontend-integration)
---
## Architecture Overview
```
+------------------+          +---------------------+
| SpeechController |          |    SpeechGateway    |
| (REST endpoints) |          | (WebSocket /speech) |
+--------+---------+          +----------+----------+
         |                               |
         +---------------+---------------+
                         |
                         v
          +-----------------------------+
          |        SpeechService        |
          |    (provider selection,     |
          |   fallback orchestration)   |
          +-------+-------------+-------+
                  |             |
                  v             v
        +---------------+  +----------------------+
        | STT Provider  |  |     TTS Providers    |
        | (Speaches)    |  |  Map<Tier, Provider> |
        +-------+-------+  +----+------+------+---+
                |               |      |      |
                v               v      v      v
        +---------------+ +---------+ +----------+ +----------+
        |   Speaches    | | Kokoro  | |Chatterbox| |  Piper   |
        |   (Whisper)   | |(default)| |(premium) | |(fallback)|
        +---------------+ +---------+ +----------+ +----------+

(The SpeechGateway authenticates WebSocket clients and calls
SpeechService.transcribe() with the accumulated audio.)
```
The speech module (`apps/api/src/speech/`) is a self-contained NestJS module consisting of:
| Component | File | Purpose |
| ---------- | ---------------------- | ------------------------------------------ |
| Module | `speech.module.ts` | Registers providers, controllers, gateway |
| Config | `speech.config.ts` | Environment validation and typed config |
| Service | `speech.service.ts` | High-level speech operations with fallback |
| Controller | `speech.controller.ts` | REST API endpoints |
| Gateway | `speech.gateway.ts` | WebSocket streaming transcription |
| Constants | `speech.constants.ts` | NestJS injection tokens |
### Key Design Decisions
1. **OpenAI-compatible APIs**: All providers (Speaches, Kokoro, Chatterbox, Piper/OpenedAI) expose OpenAI-compatible endpoints. The official OpenAI SDK is used as the HTTP client with a custom `baseURL`.
2. **Provider abstraction**: STT and TTS providers implement well-defined interfaces (`ISTTProvider`, `ITTSProvider`). New providers can be added without modifying the service layer.
3. **Conditional registration**: Providers are only instantiated when their corresponding `*_ENABLED` flag is `true`. The STT provider uses NestJS `@Optional()` injection.
4. **Fail-fast validation**: Configuration is validated at module initialization. If a service is enabled but its URL is missing, the application fails on startup with a descriptive error.
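The fail-fast rule in decision 4 can be sketched as a small validator. The function and parameter names here are illustrative, not the module's actual helpers; only the error wording follows the documented message.

```typescript
// Illustrative sketch of the documented fail-fast startup check.
// The real validation lives in speech.config.ts.
function requireUrlWhenEnabled(
  service: string,
  enabledFlag: string,
  urlVar: string,
  env: Record<string, string | undefined>,
): void {
  const enabled = env[enabledFlag] === "true" || env[enabledFlag] === "1";
  if (enabled && !env[urlVar]) {
    throw new Error(
      `${service} is enabled (${enabledFlag}=true) but required environment ` +
        `variables are missing or empty: ${urlVar}. Either set these variables ` +
        `or disable by setting ${enabledFlag}=false.`,
    );
  }
}
```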
---
## Provider Abstraction
### STT Provider Interface
```typescript
interface ISTTProvider {
  readonly name: string;
  transcribe(audio: Buffer, options?: TranscribeOptions): Promise<TranscriptionResult>;
  isHealthy(): Promise<boolean>;
}
```
Currently implemented by `SpeachesSttProvider` which connects to a Speaches (faster-whisper) server.
### TTS Provider Interface
```typescript
interface ITTSProvider {
  readonly name: string;
  readonly tier: SpeechTier;
  synthesize(text: string, options?: SynthesizeOptions): Promise<SynthesisResult>;
  listVoices(): Promise<VoiceInfo[]>;
  isHealthy(): Promise<boolean>;
}
```
All TTS providers extend `BaseTTSProvider`, an abstract class that implements common OpenAI-compatible synthesis logic. Concrete providers only need to set `name` and `tier` and optionally override `listVoices()` or `synthesize()`.
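A minimal sketch of this pattern (the real `BaseTTSProvider` also carries the shared OpenAI-compatible synthesis logic; the identifiers in the class bodies below are illustrative stubs, not the module's actual values):

```typescript
// Illustrative sketch of the provider inheritance pattern.
type SpeechTier = "default" | "premium" | "fallback";

abstract class BaseTTSProvider {
  abstract readonly name: string;
  abstract readonly tier: SpeechTier;
  // Shared OpenAI-compatible synthesis logic would live here.
}

// A concrete provider only declares its identity; synthesis is inherited.
class KokoroTtsProvider extends BaseTTSProvider {
  readonly name = "kokoro";
  readonly tier: SpeechTier = "default";
}
```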
### Provider Registration
Providers are created by the `TTS Provider Factory` (`providers/tts-provider.factory.ts`) based on configuration:
| Tier | Provider Class | Engine | Requirements |
| ---------- | ----------------------- | ------------------------- | ------------ |
| `default` | `KokoroTtsProvider` | Kokoro-FastAPI | CPU only |
| `premium` | `ChatterboxTTSProvider` | Chatterbox TTS Server | NVIDIA GPU |
| `fallback` | `PiperTtsProvider` | Piper via OpenedAI Speech | CPU only |
---
## TTS Tier System and Fallback Chain
TTS uses a tiered architecture with automatic fallback:
```
Request with tier="premium"
        |
        v
[premium]  Chatterbox available? --yes--> synthesize with Chatterbox
        |                                   (on failure, fall through)
        | no / failed
        v
[default]  Kokoro available? -----yes--> synthesize with Kokoro
        |                                   (on failure, fall through)
        | no / failed
        v
[fallback] Piper available? ------yes--> synthesize with Piper
        |                                   (on failure, fall through)
        | no / failed
        v
ServiceUnavailableException
```
**Fallback order:** `premium` -> `default` -> `fallback`
The fallback chain starts from the requested tier and proceeds downward. A tier is only attempted if:
1. It is enabled in configuration (`TTS_ENABLED`, `TTS_PREMIUM_ENABLED`, `TTS_FALLBACK_ENABLED`)
2. A provider is registered for that tier
If no tier is specified in the request, `default` is used as the starting point.
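The tier-selection rules above can be sketched as follows. The function and constant names are illustrative, not the actual service internals:

```typescript
// Illustrative sketch of the documented fallback order.
type SpeechTier = "default" | "premium" | "fallback";

const FALLBACK_ORDER: SpeechTier[] = ["premium", "default", "fallback"];

// Starting from the requested tier (default if unspecified), keep only
// tiers that are enabled and have a registered provider.
function tiersToTry(
  requested: SpeechTier | undefined,
  enabled: Set<SpeechTier>,
): SpeechTier[] {
  const start = FALLBACK_ORDER.indexOf(requested ?? "default");
  return FALLBACK_ORDER.slice(start).filter((tier) => enabled.has(tier));
}
```

A request for `premium` with only `default` and `fallback` enabled would therefore skip straight to Kokoro, then Piper.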
---
## API Endpoint Reference
All speech endpoints are under `/api/speech/` and require authentication (Bearer token) plus workspace context (`x-workspace-id` header).
### POST /api/speech/transcribe
Transcribe an uploaded audio file to text.
**Authentication:** Bearer token + workspace membership
**Content-Type:** `multipart/form-data`
**Form Fields:**
| Field | Type | Required | Description |
| ------------- | ------ | -------- | ------------------------------------------------------ |
| `file` | File | Yes | Audio file (max 25 MB) |
| `language` | string | No | Language code (e.g., "en", "fr"). Default: from config |
| `model` | string | No | Whisper model override. Default: from config |
| `prompt` | string | No | Prompt to guide transcription (max 1000 chars) |
| `temperature` | number | No | Temperature 0.0-1.0. Lower = more deterministic |
**Accepted Audio Formats:**
`audio/wav`, `audio/mp3`, `audio/mpeg`, `audio/webm`, `audio/ogg`, `audio/flac`, `audio/x-m4a`
**Response:**
```json
{
  "data": {
    "text": "Hello, this is a transcription test.",
    "language": "en",
    "durationSeconds": 3.5,
    "confidence": 0.95,
    "segments": [
      {
        "text": "Hello, this is a transcription test.",
        "start": 0.0,
        "end": 3.5,
        "confidence": 0.95
      }
    ]
  }
}
```
**Example:**
```bash
curl -X POST http://localhost:3001/api/speech/transcribe \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -F "file=@recording.wav" \
  -F "language=en"
```
### POST /api/speech/synthesize
Synthesize text to audio using TTS providers.
**Authentication:** Bearer token + workspace membership
**Content-Type:** `application/json`
**Request Body:**
| Field | Type | Required | Description |
| -------- | ------ | -------- | ----------------------------------------------------------- |
| `text` | string | Yes | Text to synthesize (max 4096 chars) |
| `voice` | string | No | Voice ID. Default: from config (e.g., "af_heart") |
| `speed` | number | No | Speed multiplier 0.5-2.0. Default: 1.0 |
| `format` | string | No | Output format: mp3, wav, opus, flac, aac, pcm. Default: mp3 |
| `tier` | string | No | Provider tier: default, premium, fallback. Default: default |
**Response:** Binary audio data with appropriate `Content-Type` header.
| Format | Content-Type |
| ------ | ------------ |
| mp3 | `audio/mpeg` |
| wav | `audio/wav` |
| opus | `audio/opus` |
| flac | `audio/flac` |
| aac | `audio/aac` |
| pcm | `audio/pcm` |
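For reference, the same mapping as a lookup table (a sketch mirroring the table above; the controller's actual implementation may differ):

```typescript
// Format-to-MIME mapping as documented for the synthesize endpoint.
const AUDIO_CONTENT_TYPES: Record<string, string> = {
  mp3: "audio/mpeg",
  wav: "audio/wav",
  opus: "audio/opus",
  flac: "audio/flac",
  aac: "audio/aac",
  pcm: "audio/pcm",
};
```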
**Example:**
```bash
curl -X POST http://localhost:3001/api/speech/synthesize \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "af_heart", "format": "mp3"}' \
  --output speech.mp3
```
### GET /api/speech/voices
List available TTS voices across all tiers.
**Authentication:** Bearer token + workspace access
**Query Parameters:**
| Parameter | Type | Required | Description |
| --------- | ------ | -------- | ------------------------------------------ |
| `tier` | string | No | Filter by tier: default, premium, fallback |
**Response:**
```json
{
  "data": [
    {
      "id": "af_heart",
      "name": "Heart (American Female)",
      "language": "en-US",
      "tier": "default",
      "isDefault": true
    },
    {
      "id": "am_adam",
      "name": "Adam (American Male)",
      "language": "en-US",
      "tier": "default",
      "isDefault": false
    }
  ]
}
```
**Example:**
```bash
curl -X GET 'http://localhost:3001/api/speech/voices?tier=default' \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID"
```
### GET /api/speech/health
Check availability of STT and TTS providers.
**Authentication:** Bearer token + workspace access
**Response:**
```json
{
  "data": {
    "stt": { "available": true },
    "tts": { "available": true }
  }
}
```
---
## WebSocket Streaming Protocol
The speech module provides a WebSocket gateway at namespace `/speech` for real-time streaming transcription. Audio chunks are accumulated on the server and transcribed when the session is stopped.
### Connection
Connect to the `/speech` namespace with authentication:
```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: "YOUR_SESSION_TOKEN" },
});
```
**Authentication methods** (checked in order):
1. `auth.token` in handshake
2. `query.token` in handshake URL
3. `Authorization: Bearer <token>` header
Connection is rejected if:
- No valid token is provided
- Session verification fails
- User has no workspace membership
**Connection timeout:** 5 seconds for authentication.
### Protocol Flow
```
Client                                   Server
  |                                       |
  |--- connect (with token) ------------->|
  |                                       | (authenticate, check workspace)
  |<--- connected ------------------------|
  |                                       |
  |--- start-transcription -------------->| { language?: "en" }
  |<--- transcription-started ------------| { sessionId, language }
  |                                       |
  |--- audio-chunk ---------------------->| (Buffer/Uint8Array)
  |--- audio-chunk ---------------------->| (Buffer/Uint8Array)
  |--- audio-chunk ---------------------->| (Buffer/Uint8Array)
  |                                       |
  |--- stop-transcription --------------->|
  |                                       | (concatenate chunks, transcribe)
  |<--- transcription-final --------------| { text, language, durationSeconds, ... }
  |                                       |
```
### Client Events (emit)
| Event | Payload | Description |
| --------------------- | ------------------------ | ---------------------------------------- |
| `start-transcription` | `{ language?: string }` | Begin a new transcription session |
| `audio-chunk` | `Buffer` or `Uint8Array` | Send audio data chunk |
| `stop-transcription` | (none) | Stop recording and trigger transcription |
### Server Events (listen)
| Event | Payload | Description |
| ----------------------- | ----------------------------------------------------------- | -------------------------- |
| `transcription-started` | `{ sessionId, language }` | Session created |
| `transcription-final` | `{ text, language, durationSeconds, confidence, segments }` | Transcription result |
| `transcription-error` | `{ message }` | Error during transcription |
### Session Management
- One active transcription session per client connection
- Starting a new session replaces any existing session
- Sessions are cleaned up on client disconnect
- Audio chunks are accumulated in memory
- Total accumulated size is capped by `SPEECH_MAX_UPLOAD_SIZE` (default: 25 MB)
### Example Client Usage
```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: sessionToken },
});

// Start recording
socket.emit("start-transcription", { language: "en" });
socket.on("transcription-started", ({ sessionId }) => {
  console.log("Session started:", sessionId);
});

// Stream audio chunks from MediaRecorder
mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    event.data.arrayBuffer().then((buffer) => {
      socket.emit("audio-chunk", new Uint8Array(buffer));
    });
  }
};

// Stop and get result
socket.emit("stop-transcription");
socket.on("transcription-final", (result) => {
  console.log("Transcription:", result.text);
  console.log("Duration:", result.durationSeconds, "seconds");
});
socket.on("transcription-error", ({ message }) => {
  console.error("Transcription error:", message);
});
```
---
## Environment Variable Reference
### Speech-to-Text (STT)
| Variable | Default | Description |
| -------------- | --------------------------------------- | ---------------------------------------------------- |
| `STT_ENABLED` | `false` | Enable speech-to-text transcription |
| `STT_BASE_URL` | `http://speaches:8000/v1` | Speaches server URL (required when STT_ENABLED=true) |
| `STT_MODEL` | `Systran/faster-whisper-large-v3-turbo` | Whisper model for transcription |
| `STT_LANGUAGE` | `en` | Default language code |
### Text-to-Speech (TTS) - Default Engine (Kokoro)
| Variable | Default | Description |
| -------------------- | --------------------------- | --------------------------------------------------- |
| `TTS_ENABLED` | `false` | Enable default TTS engine |
| `TTS_DEFAULT_URL` | `http://kokoro-tts:8880/v1` | Kokoro-FastAPI URL (required when TTS_ENABLED=true) |
| `TTS_DEFAULT_VOICE` | `af_heart` | Default Kokoro voice ID |
| `TTS_DEFAULT_FORMAT` | `mp3` | Default audio output format |
### Text-to-Speech (TTS) - Premium Engine (Chatterbox)
| Variable | Default | Description |
| --------------------- | ------------------------------- | ----------------------------------------------------------- |
| `TTS_PREMIUM_ENABLED` | `false` | Enable premium TTS engine |
| `TTS_PREMIUM_URL` | `http://chatterbox-tts:8881/v1` | Chatterbox TTS URL (required when TTS_PREMIUM_ENABLED=true) |
### Text-to-Speech (TTS) - Fallback Engine (Piper/OpenedAI)
| Variable | Default | Description |
| ---------------------- | -------------------------------- | ------------------------------------------------------------- |
| `TTS_FALLBACK_ENABLED` | `false` | Enable fallback TTS engine |
| `TTS_FALLBACK_URL` | `http://openedai-speech:8000/v1` | OpenedAI Speech URL (required when TTS_FALLBACK_ENABLED=true) |
### Service Limits
| Variable | Default | Description |
| ----------------------------- | ---------- | ---------------------------------------------- |
| `SPEECH_MAX_UPLOAD_SIZE` | `25000000` | Maximum upload file size in bytes (25 MB) |
| `SPEECH_MAX_DURATION_SECONDS` | `600` | Maximum audio duration in seconds (10 minutes) |
| `SPEECH_MAX_TEXT_LENGTH` | `4096` | Maximum text length for TTS in characters |
### Conditional Validation
When a service is enabled, its URL variable is required. If missing, the application fails at startup with a message like:
```
STT is enabled (STT_ENABLED=true) but required environment variables are missing or empty: STT_BASE_URL.
Either set these variables or disable by setting STT_ENABLED=false.
```
Boolean parsing: `value === "true"` or `value === "1"`. Unset or empty values default to `false`.
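The documented parsing rule as a one-liner (illustrative; the actual config code may be structured differently):

```typescript
// Matches the documented rule: only the exact strings "true" or "1"
// enable a flag. The comparison is case-sensitive, so "TRUE" is false,
// as are unset and empty values.
function parseBooleanFlag(value: string | undefined): boolean {
  return value === "true" || value === "1";
}
```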
---
## Provider Configuration
### Kokoro (Default Tier)
**Engine:** [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)
**License:** Apache 2.0
**Requirements:** CPU only
**Docker Image:** `ghcr.io/remsky/kokoro-fastapi:latest-cpu`
**Capabilities:**
- 54 built-in voices across 8 languages
- Speed control: 0.25x to 4.0x
- Output formats: mp3, wav, opus, flac
- Voice metadata derived from ID prefix (language, gender, accent)
**Voice ID Format:** `{lang}{gender}_{name}`
- First character: language/accent (a=American, b=British, e=Spanish, f=French, h=Hindi, j=Japanese, p=Portuguese, z=Chinese)
- Second character: gender (f=Female, m=Male)
**Example voices:**
| Voice ID | Name | Language | Gender |
|----------|------|----------|--------|
| `af_heart` | Heart | en-US | Female |
| `am_adam` | Adam | en-US | Male |
| `bf_alice` | Alice | en-GB | Female |
| `bm_daniel` | Daniel | en-GB | Male |
| `ef_dora` | Dora | es | Female |
| `ff_camille` | Camille | fr | Female |
| `jf_alpha` | Alpha | ja | Female |
| `zf_xiaobei` | Xiaobei | zh | Female |
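The prefix scheme above can be decoded mechanically. This helper is illustrative (not part of the module); the language mapping follows the prefix table, with the `a`/`b` prefixes resolved to `en-US`/`en-GB` as in the example voices:

```typescript
// Illustrative decoder for the documented Kokoro voice ID scheme:
// {lang}{gender}_{name}, e.g. "af_heart".
const KOKORO_LANGS: Record<string, string> = {
  a: "en-US", // American English
  b: "en-GB", // British English
  e: "es",
  f: "fr",
  h: "hi",
  j: "ja",
  p: "pt",
  z: "zh",
};

function parseKokoroVoiceId(id: string) {
  const [prefix, name] = id.split("_");
  return {
    language: KOKORO_LANGS[prefix[0]] ?? "unknown",
    gender: prefix[1] === "f" ? "Female" : "Male",
    name,
  };
}
```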
### Chatterbox (Premium Tier)
**Engine:** [Chatterbox TTS Server](https://github.com/devnen/chatterbox-tts-server)
**License:** Proprietary
**Requirements:** NVIDIA GPU with CUDA
**Docker Image:** `devnen/chatterbox-tts-server:latest`
**Capabilities:**
- Voice cloning via reference audio sample
- Emotion exaggeration control (0.0 - 1.0)
- Cross-language voice transfer (23 languages)
- Higher quality synthesis than default tier
**Supported Languages:**
en, fr, de, es, it, pt, nl, pl, ru, uk, ja, zh, ko, ar, hi, tr, sv, da, fi, no, cs, el, ro
**Extended Options (Chatterbox-specific):**
| Option | Type | Description |
| --------------------- | ------ | --------------------------------------------------------- |
| `referenceAudio` | Buffer | Audio sample for voice cloning (5-30 seconds recommended) |
| `emotionExaggeration` | number | Emotion intensity 0.0-1.0 (clamped) |
These are passed as extra body parameters to the OpenAI-compatible endpoint. Reference audio is base64-encoded before sending.
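An illustrative sketch of how those extras might be prepared before the request. The outgoing field names (`reference_audio`, `exaggeration`) are assumptions here, not confirmed wire-format names:

```typescript
// Sketch only: base64-encode reference audio and clamp emotion
// exaggeration to the documented 0.0-1.0 range. Field names are assumed.
function buildChatterboxExtras(opts: {
  referenceAudio?: Buffer;
  emotionExaggeration?: number;
}): Record<string, unknown> {
  const extras: Record<string, unknown> = {};
  if (opts.referenceAudio) {
    extras.reference_audio = opts.referenceAudio.toString("base64");
  }
  if (opts.emotionExaggeration !== undefined) {
    extras.exaggeration = Math.min(1, Math.max(0, opts.emotionExaggeration));
  }
  return extras;
}
```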
### Piper (Fallback Tier)
**Engine:** [Piper](https://github.com/rhasspy/piper) via [OpenedAI Speech](https://github.com/matatonic/openedai-speech)
**License:** GPL (OpenedAI Speech)
**Requirements:** CPU only (runs on Raspberry Pi)
**Docker Image:** the upstream OpenedAI Speech image (see the OpenedAI Speech repository for current tags)
**Capabilities:**
- 100+ voices across 40+ languages
- 6 standard OpenAI voice names (mapped to Piper voices)
- Output formats: mp3, wav, opus, flac
- Ultra-lightweight, designed for low-resource environments
**Standard Voice Mapping:**
| OpenAI Voice | Piper Voice | Gender | Description |
| ------------ | -------------------- | ------ | --------------------- |
| `alloy` | en_US-amy-medium | Female | Warm, balanced |
| `echo` | en_US-ryan-medium | Male | Clear, articulate |
| `fable` | en_GB-alan-medium | Male | British narrator |
| `onyx` | en_US-danny-low | Male | Deep, resonant |
| `nova` | en_US-lessac-medium | Female | Expressive, versatile |
| `shimmer` | en_US-kristin-medium | Female | Bright, energetic |
### Speaches (STT)
**Engine:** [Speaches](https://github.com/speaches-ai/speaches) (faster-whisper backend)
**License:** MIT
**Requirements:** CPU (GPU optional for faster inference)
**Docker Image:** `ghcr.io/speaches-ai/speaches:latest`
**Capabilities:**
- OpenAI-compatible `/v1/audio/transcriptions` endpoint
- Whisper models via faster-whisper
- Verbose JSON response with segments and timestamps
- Language detection
**Default model:** `Systran/faster-whisper-large-v3-turbo`
---
## Voice Cloning Setup (Chatterbox)
Voice cloning is available through the Chatterbox premium TTS provider.
### Prerequisites
1. NVIDIA GPU with CUDA support
2. `nvidia-container-toolkit` installed on the Docker host
3. Docker runtime configured for GPU access
4. TTS premium tier enabled (`TTS_PREMIUM_ENABLED=true`)
### Basic Voice Cloning
Provide a reference audio sample (WAV or MP3, 5-30 seconds) when calling synthesize:
```typescript
import { SpeechService } from "./speech.service";
import type { ChatterboxSynthesizeOptions } from "./interfaces/speech-types";

const options: ChatterboxSynthesizeOptions = {
  tier: "premium",
  referenceAudio: myAudioBuffer, // 5-30 second audio sample
  emotionExaggeration: 0.5, // 0.0 = neutral, 1.0 = maximum emotion
};

const result = await speechService.synthesize("Hello, this is my cloned voice!", options);
```
### Voice Cloning Tips
- **Audio quality:** Use clean recordings without background noise
- **Duration:** 5-30 seconds works best; shorter clips may produce lower quality
- **Format:** WAV provides the best quality; MP3 is also accepted
- **Emotion:** Start with 0.5 (moderate) and adjust from there
- **Cross-language:** You can clone a voice in one language and synthesize in another
---
## Docker Compose Setup
### Development (Local)
Speech services are defined in a separate overlay file `docker-compose.speech.yml`. This keeps them optional and separate from core services.
**Start basic speech services (STT + default TTS):**
```bash
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml up -d
# Using Makefile
make speech-up
```
**Start with premium TTS (requires NVIDIA GPU):**
```bash
docker compose -f docker-compose.yml -f docker-compose.speech.yml --profile premium-tts up -d
```
**Stop speech services:**
```bash
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml down --remove-orphans
# Using Makefile
make speech-down
```
**View logs:**
```bash
make speech-logs
```
### Development Services
| Service | Container | Port | Image |
| -------------- | --------------------- | ------------------------------- | ------------------------------------------ |
| Speaches (STT) | mosaic-speaches | 8090 (host) -> 8000 (container) | `ghcr.io/speaches-ai/speaches:latest` |
| Kokoro TTS | mosaic-kokoro-tts | 8880 (host) -> 8880 (container) | `ghcr.io/remsky/kokoro-fastapi:latest-cpu` |
| Chatterbox TTS | mosaic-chatterbox-tts | 8881 (host) -> 8000 (container) | `devnen/chatterbox-tts-server:latest` |
### Production (Docker Swarm)
For production deployments, use `docker/docker-compose.sample.speech.yml`. This file is designed for Docker Swarm with Traefik integration.
**Required environment variables:**
```bash
STT_DOMAIN=stt.example.com
TTS_DOMAIN=tts.example.com
```
**Optional environment variables:**
```bash
WHISPER_MODEL=Systran/faster-whisper-large-v3-turbo
CHATTERBOX_TTS_DOMAIN=tts-premium.example.com
TRAEFIK_ENTRYPOINT=websecure
TRAEFIK_CERTRESOLVER=letsencrypt
TRAEFIK_DOCKER_NETWORK=traefik-public
TRAEFIK_TLS_ENABLED=true
```
**Deploy:**
```bash
docker stack deploy -c docker/docker-compose.sample.speech.yml speech
```
**Connecting to Mosaic Stack:** Set the speech URLs in your Mosaic Stack `.env`:
```bash
# Same Docker network
STT_BASE_URL=http://speaches:8000/v1
TTS_DEFAULT_URL=http://kokoro-tts:8880/v1
# External / different network
STT_BASE_URL=https://stt.example.com/v1
TTS_DEFAULT_URL=https://tts.example.com/v1
```
### Health Checks
All speech containers include health checks:
| Service | Endpoint | Interval | Start Period |
| -------------- | ------------------------------ | -------- | ------------ |
| Speaches | `http://localhost:8000/health` | 30s | 120s |
| Kokoro TTS | `http://localhost:8880/health` | 30s | 120s |
| Chatterbox TTS | `http://localhost:8000/health` | 30s | 180s |
Chatterbox has a longer start period (180s) because GPU model loading takes additional time.
---
## GPU VRAM Budget
Only Chatterbox requires GPU resources. The other providers (Speaches, Kokoro, Piper) are CPU-only.
### Chatterbox VRAM Requirements
| Component | Approximate VRAM |
| ----------------------- | ------------------ |
| Chatterbox TTS model | ~2-4 GB |
| Voice cloning inference | ~1-2 GB additional |
| **Total recommended** | **4-6 GB** |
### Shared GPU Considerations
If running multiple GPU services (e.g., Ollama for LLM + Chatterbox for TTS):
| Service | VRAM Usage | Notes |
| -------------------- | ----------- | --------------------------------- |
| Ollama (7B model) | ~4-6 GB | Depends on model size |
| Ollama (13B model) | ~8-10 GB | Larger models need more |
| Chatterbox TTS | ~4-6 GB | Voice cloning is memory-intensive |
| **Combined minimum** | **8-12 GB** | For 7B LLM + Chatterbox |
**Recommendations:**
- 8 GB VRAM: Adequate for small LLM + Chatterbox (may need to alternate)
- 12 GB VRAM: Comfortable for 7B LLM + Chatterbox simultaneously
- 24 GB VRAM: Supports larger LLMs + Chatterbox with headroom
If VRAM is limited, consider:
1. Disabling Chatterbox (`TTS_PREMIUM_ENABLED=false`) and using Kokoro (CPU) as default
2. Using the fallback chain so Kokoro handles requests when Chatterbox is busy
3. Running Chatterbox on a separate GPU host
### Docker Swarm GPU Scheduling
For Docker Swarm deployments with GPU, configure generic resources on the node:
```json
// /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime"
    }
  },
  "node-generic-resources": ["NVIDIA-GPU=0"]
}
```
See the [Docker GPU Swarm documentation](https://docs.docker.com/engine/daemon/nvidia-gpu/#configure-gpus-for-docker-swarm) for details.
---
## Frontend Integration
Speech services are consumed from the frontend through the REST API and WebSocket gateway.
### REST API Usage
**Transcribe audio:**
```typescript
async function transcribeAudio(file: File, token: string, workspaceId: string) {
  const formData = new FormData();
  formData.append("file", file);
  formData.append("language", "en");

  const response = await fetch("/api/speech/transcribe", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
    body: formData,
  });

  const { data } = await response.json();
  return data.text;
}
```
**Synthesize speech:**
```typescript
async function synthesizeSpeech(
  text: string,
  token: string,
  workspaceId: string,
  voice = "af_heart"
) {
  const response = await fetch("/api/speech/synthesize", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text, voice, format: "mp3" }),
  });

  const audioBlob = await response.blob();
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  audio.play();
}
```
**List voices:**
```typescript
async function listVoices(token: string, workspaceId: string, tier?: string) {
  const url = tier ? `/api/speech/voices?tier=${tier}` : "/api/speech/voices";
  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });
  const { data } = await response.json();
  return data; // VoiceInfo[]
}
```
### WebSocket Streaming Usage
For real-time transcription using the browser's MediaRecorder API:
```typescript
import { io } from "socket.io-client";

function createSpeechSocket(token: string) {
  const socket = io("/speech", {
    auth: { token },
  });

  let mediaRecorder: MediaRecorder | null = null;

  async function startRecording() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    mediaRecorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
    });

    socket.emit("start-transcription", { language: "en" });

    mediaRecorder.ondataavailable = (event) => {
      if (event.data.size > 0) {
        event.data.arrayBuffer().then((buffer) => {
          socket.emit("audio-chunk", new Uint8Array(buffer));
        });
      }
    };

    mediaRecorder.start(250); // Send chunks every 250ms
  }

  async function stopRecording(): Promise<string> {
    return new Promise((resolve, reject) => {
      socket.once("transcription-final", (result) => {
        resolve(result.text);
      });
      socket.once("transcription-error", ({ message }) => {
        reject(new Error(message));
      });

      if (mediaRecorder) {
        mediaRecorder.stop();
        mediaRecorder.stream.getTracks().forEach((track) => track.stop());
        mediaRecorder = null;
      }
      socket.emit("stop-transcription");
    });
  }

  return { socket, startRecording, stopRecording };
}
```
### Check Speech Availability
Before showing speech UI elements, check provider availability:
```typescript
async function checkSpeechHealth(token: string, workspaceId: string) {
  const response = await fetch("/api/speech/health", {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });
  const { data } = await response.json();
  return {
    canTranscribe: data.stt.available,
    canSynthesize: data.tts.available,
  };
}
```