# Speech Services

Mosaic Stack provides integrated speech-to-text (STT) and text-to-speech (TTS) services through a provider abstraction layer. Speech services are optional and modular -- each component can be independently enabled, disabled, or pointed at external infrastructure.

## Table of Contents

- [Architecture Overview](#architecture-overview)
- [Provider Abstraction](#provider-abstraction)
- [TTS Tier System and Fallback Chain](#tts-tier-system-and-fallback-chain)
- [API Endpoint Reference](#api-endpoint-reference)
- [WebSocket Streaming Protocol](#websocket-streaming-protocol)
- [Environment Variable Reference](#environment-variable-reference)
- [Provider Configuration](#provider-configuration)
- [Voice Cloning Setup (Chatterbox)](#voice-cloning-setup-chatterbox)
- [Docker Compose Setup](#docker-compose-setup)
- [GPU VRAM Budget](#gpu-vram-budget)
- [Frontend Integration](#frontend-integration)

---
## Architecture Overview

```
+-------------------+
| SpeechController  |
| (REST endpoints)  |
+---------+---------+
          |
          v
+---------+-------------------+
|        SpeechService        |
|    (provider selection,     |
|   fallback orchestration)   |
+------+---------------+------+
       |               |
       v               v
+------+------+   +----+---------------+
| STT Provider|   | TTS Providers      |
| (Speaches)  |   | Map<Tier,Provider> |
+------+------+   +--+------+------+---+
       |             |      |      |
       v             v      v      v
+------+------+ +--------+ +----------+ +----------+
|  Speaches   | | Kokoro | |Chatterbox| |  Piper   |
|  (Whisper)  | |(default)| |(premium)| |(fallback)|
+-------------+ +--------+ +----------+ +----------+

+---------------------+
|    SpeechGateway    |
| (WebSocket /speech) |
+----------+----------+
           |
           v
Uses SpeechService.transcribe()
```

The speech module (`apps/api/src/speech/`) is a self-contained NestJS module consisting of:

| Component  | File                   | Purpose                                    |
| ---------- | ---------------------- | ------------------------------------------ |
| Module     | `speech.module.ts`     | Registers providers, controllers, gateway  |
| Config     | `speech.config.ts`     | Environment validation and typed config    |
| Service    | `speech.service.ts`    | High-level speech operations with fallback |
| Controller | `speech.controller.ts` | REST API endpoints                         |
| Gateway    | `speech.gateway.ts`    | WebSocket streaming transcription          |
| Constants  | `speech.constants.ts`  | NestJS injection tokens                    |

### Key Design Decisions

1. **OpenAI-compatible APIs**: All providers (Speaches, Kokoro, Chatterbox, Piper/OpenedAI) expose OpenAI-compatible endpoints. The official OpenAI SDK is used as the HTTP client with a custom `baseURL`.

2. **Provider abstraction**: STT and TTS providers implement well-defined interfaces (`ISTTProvider`, `ITTSProvider`). New providers can be added without modifying the service layer.

3. **Conditional registration**: Providers are only instantiated when their corresponding `*_ENABLED` flag is `true`. The STT provider uses NestJS `@Optional()` injection.

4. **Fail-fast validation**: Configuration is validated at module initialization. If a service is enabled but its URL is missing, the application fails on startup with a descriptive error.
---

## Provider Abstraction

### STT Provider Interface

```typescript
interface ISTTProvider {
  readonly name: string;
  transcribe(audio: Buffer, options?: TranscribeOptions): Promise<TranscriptionResult>;
  isHealthy(): Promise<boolean>;
}
```

Currently implemented by `SpeachesSttProvider`, which connects to a Speaches (faster-whisper) server.

### TTS Provider Interface

```typescript
interface ITTSProvider {
  readonly name: string;
  readonly tier: SpeechTier;
  synthesize(text: string, options?: SynthesizeOptions): Promise<SynthesisResult>;
  listVoices(): Promise<VoiceInfo[]>;
  isHealthy(): Promise<boolean>;
}
```

All TTS providers extend `BaseTTSProvider`, an abstract class that implements the common OpenAI-compatible synthesis logic. Concrete providers only need to set `name` and `tier`, and may optionally override `listVoices()` or `synthesize()`.
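As an illustration of this pattern, here is a simplified, self-contained sketch -- not the actual `BaseTTSProvider` source, which also performs the OpenAI-compatible HTTP calls (stubbed out here):

```typescript
// Simplified sketch of the provider pattern described above. The real base
// class lives in apps/api/src/speech/providers/; HTTP synthesis is stubbed.
type SpeechTier = "default" | "premium" | "fallback";

abstract class SketchBaseTTSProvider {
  abstract readonly name: string;
  abstract readonly tier: SpeechTier;

  // Shared synthesis logic would live here; stubbed for the sketch.
  async synthesize(text: string): Promise<{ provider: string; bytes: number }> {
    return { provider: this.name, bytes: text.length };
  }

  async isHealthy(): Promise<boolean> {
    return true;
  }
}

// A concrete provider only needs to supply its identity.
class SketchKokoroProvider extends SketchBaseTTSProvider {
  readonly name = "kokoro";
  readonly tier: SpeechTier = "default";
}
```

The key design point is that tier routing and fallback live entirely in the service layer, so a new provider is just a new subclass plus a registration entry.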
### Provider Registration

Providers are created by the TTS provider factory (`providers/tts-provider.factory.ts`) based on configuration:

| Tier       | Provider Class          | Engine                    | Requirements |
| ---------- | ----------------------- | ------------------------- | ------------ |
| `default`  | `KokoroTtsProvider`     | Kokoro-FastAPI            | CPU only     |
| `premium`  | `ChatterboxTTSProvider` | Chatterbox TTS Server     | NVIDIA GPU   |
| `fallback` | `PiperTtsProvider`      | Piper via OpenedAI Speech | CPU only     |
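The factory's tier-to-provider wiring can be pictured with this hedged sketch -- provider classes are reduced to their names, and the real factory also injects config and an HTTP client:

```typescript
type SpeechTier = "default" | "premium" | "fallback";

interface TtsFlags {
  ttsEnabled: boolean; // TTS_ENABLED
  premiumEnabled: boolean; // TTS_PREMIUM_ENABLED
  fallbackEnabled: boolean; // TTS_FALLBACK_ENABLED
}

// Build the tier map only for enabled tiers, mirroring the conditional
// registration described in Key Design Decisions above.
function buildTtsProviders(flags: TtsFlags): Map<SpeechTier, string> {
  const providers = new Map<SpeechTier, string>();
  if (flags.ttsEnabled) providers.set("default", "KokoroTtsProvider");
  if (flags.premiumEnabled) providers.set("premium", "ChatterboxTTSProvider");
  if (flags.fallbackEnabled) providers.set("fallback", "PiperTtsProvider");
  return providers;
}
```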
---
## TTS Tier System and Fallback Chain

TTS uses a tiered architecture with automatic fallback:

```
Request with tier="premium"
        |
        v
[premium]  Chatterbox available? --yes--> Use Chatterbox
        | no (or synthesis failed)
        v
[default]  Kokoro available? ------yes--> Use Kokoro
        | no (or synthesis failed)
        v
[fallback] Piper available? -------yes--> Use Piper
        | no (or synthesis failed)
        v
ServiceUnavailableException
```

**Fallback order:** `premium` -> `default` -> `fallback`

The fallback chain starts from the requested tier and proceeds downward. A tier is only attempted if:

1. It is enabled in configuration (`TTS_ENABLED`, `TTS_PREMIUM_ENABLED`, `TTS_FALLBACK_ENABLED`)
2. A provider is registered for that tier

If no tier is specified in the request, `default` is used as the starting point.
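The chain above can be sketched as a simple loop -- an illustrative sketch, not the actual `SpeechService` code:

```typescript
type SpeechTier = "default" | "premium" | "fallback";

// Tiers in fallback order, as documented above.
const FALLBACK_ORDER: SpeechTier[] = ["premium", "default", "fallback"];

// Start at the requested tier and walk down, skipping tiers with no
// registered provider. A provider throwing means "try the next tier".
async function synthesizeWithFallback(
  requested: SpeechTier,
  providers: Map<SpeechTier, (text: string) => Promise<string>>,
  text: string,
): Promise<string> {
  const start = FALLBACK_ORDER.indexOf(requested);
  for (const tier of FALLBACK_ORDER.slice(start)) {
    const provider = providers.get(tier);
    if (!provider) continue; // tier disabled or unregistered
    try {
      return await provider(text);
    } catch {
      // fall through to the next tier
    }
  }
  throw new Error("ServiceUnavailableException: no TTS tier available");
}
```

Note that requesting `default` never escalates to `premium`: the walk only moves downward from the requested tier.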
---
## API Endpoint Reference

All speech endpoints are under `/api/speech/` and require authentication (Bearer token) plus workspace context (`x-workspace-id` header).

### POST /api/speech/transcribe

Transcribe an uploaded audio file to text.

**Authentication:** Bearer token + workspace membership
**Content-Type:** `multipart/form-data`

**Form Fields:**

| Field         | Type   | Required | Description                                            |
| ------------- | ------ | -------- | ------------------------------------------------------ |
| `file`        | File   | Yes      | Audio file (max 25 MB)                                 |
| `language`    | string | No       | Language code (e.g., "en", "fr"). Default: from config |
| `model`       | string | No       | Whisper model override. Default: from config           |
| `prompt`      | string | No       | Prompt to guide transcription (max 1000 chars)         |
| `temperature` | number | No       | Temperature 0.0-1.0. Lower = more deterministic        |

**Accepted Audio Formats:**
`audio/wav`, `audio/mp3`, `audio/mpeg`, `audio/webm`, `audio/ogg`, `audio/flac`, `audio/x-m4a`

**Response:**

```json
{
  "data": {
    "text": "Hello, this is a transcription test.",
    "language": "en",
    "durationSeconds": 3.5,
    "confidence": 0.95,
    "segments": [
      {
        "text": "Hello, this is a transcription test.",
        "start": 0.0,
        "end": 3.5,
        "confidence": 0.95
      }
    ]
  }
}
```

**Example:**

```bash
curl -X POST http://localhost:3001/api/speech/transcribe \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -F "file=@recording.wav" \
  -F "language=en"
```
### POST /api/speech/synthesize

Synthesize text to audio using TTS providers.

**Authentication:** Bearer token + workspace membership
**Content-Type:** `application/json`

**Request Body:**

| Field    | Type   | Required | Description                                                 |
| -------- | ------ | -------- | ----------------------------------------------------------- |
| `text`   | string | Yes      | Text to synthesize (max 4096 chars)                         |
| `voice`  | string | No       | Voice ID. Default: from config (e.g., "af_heart")           |
| `speed`  | number | No       | Speed multiplier 0.5-2.0. Default: 1.0                      |
| `format` | string | No       | Output format: mp3, wav, opus, flac, aac, pcm. Default: mp3 |
| `tier`   | string | No       | Provider tier: default, premium, fallback. Default: default |

**Response:** Binary audio data with the appropriate `Content-Type` header.

| Format | Content-Type |
| ------ | ------------ |
| mp3    | `audio/mpeg` |
| wav    | `audio/wav`  |
| opus   | `audio/opus` |
| flac   | `audio/flac` |
| aac    | `audio/aac`  |
| pcm    | `audio/pcm`  |

**Example:**

```bash
curl -X POST http://localhost:3001/api/speech/synthesize \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "af_heart", "format": "mp3"}' \
  --output speech.mp3
```
### GET /api/speech/voices

List available TTS voices across all tiers.

**Authentication:** Bearer token + workspace membership
**Query Parameters:**

| Parameter | Type   | Required | Description                                |
| --------- | ------ | -------- | ------------------------------------------ |
| `tier`    | string | No       | Filter by tier: default, premium, fallback |

**Response:**

```json
{
  "data": [
    {
      "id": "af_heart",
      "name": "Heart (American Female)",
      "language": "en-US",
      "tier": "default",
      "isDefault": true
    },
    {
      "id": "am_adam",
      "name": "Adam (American Male)",
      "language": "en-US",
      "tier": "default",
      "isDefault": false
    }
  ]
}
```

**Example:**

```bash
curl -X GET 'http://localhost:3001/api/speech/voices?tier=default' \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID"
```
### GET /api/speech/health

Check availability of STT and TTS providers.

**Authentication:** Bearer token + workspace membership

**Response:**

```json
{
  "data": {
    "stt": { "available": true },
    "tts": { "available": true }
  }
}
```

---
## WebSocket Streaming Protocol

The speech module provides a WebSocket gateway at namespace `/speech` for real-time streaming transcription. Audio chunks are accumulated on the server and transcribed when the session is stopped.

### Connection

Connect to the `/speech` namespace with authentication:

```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: "YOUR_SESSION_TOKEN" },
});
```

**Authentication methods** (checked in order):

1. `auth.token` in the handshake
2. `query.token` in the handshake URL
3. `Authorization: Bearer <token>` header

The connection is rejected if:

- No valid token is provided
- Session verification fails
- The user has no workspace membership

**Connection timeout:** 5 seconds for authentication.

### Protocol Flow

```
Client                                  Server
  |                                        |
  |--- connect (with token) -------------->|
  |                                        | (authenticate, check workspace)
  |<-- connected --------------------------|
  |                                        |
  |--- start-transcription --------------->| { language?: "en" }
  |<-- transcription-started --------------| { sessionId, language }
  |                                        |
  |--- audio-chunk ----------------------->| (Buffer/Uint8Array)
  |--- audio-chunk ----------------------->| (Buffer/Uint8Array)
  |--- audio-chunk ----------------------->| (Buffer/Uint8Array)
  |                                        |
  |--- stop-transcription ---------------->|
  |                                        | (concatenate chunks, transcribe)
  |<-- transcription-final ----------------| { text, language, durationSeconds, ... }
  |                                        |
```

### Client Events (emit)

| Event                 | Payload                  | Description                              |
| --------------------- | ------------------------ | ---------------------------------------- |
| `start-transcription` | `{ language?: string }`  | Begin a new transcription session        |
| `audio-chunk`         | `Buffer` or `Uint8Array` | Send audio data chunk                    |
| `stop-transcription`  | (none)                   | Stop recording and trigger transcription |

### Server Events (listen)

| Event                   | Payload                                                     | Description                |
| ----------------------- | ----------------------------------------------------------- | -------------------------- |
| `transcription-started` | `{ sessionId, language }`                                   | Session created            |
| `transcription-final`   | `{ text, language, durationSeconds, confidence, segments }` | Transcription result       |
| `transcription-error`   | `{ message }`                                               | Error during transcription |

### Session Management

- One active transcription session per client connection
- Starting a new session replaces any existing session
- Sessions are cleaned up on client disconnect
- Audio chunks are accumulated in memory
- Total accumulated size is capped by `SPEECH_MAX_UPLOAD_SIZE` (default: 25 MB)
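The size cap can be pictured with the following sketch -- hypothetical server-side bookkeeping, not the gateway source:

```typescript
// Hypothetical session bookkeeping illustrating the SPEECH_MAX_UPLOAD_SIZE cap.
const MAX_UPLOAD_SIZE = 25_000_000; // bytes, mirrors the documented default

interface SketchSession {
  chunks: Uint8Array[];
  totalBytes: number;
}

// Returns false (and drops the chunk) once the cap would be exceeded;
// the real gateway would surface this as a transcription-error event.
function addChunk(session: SketchSession, chunk: Uint8Array): boolean {
  if (session.totalBytes + chunk.byteLength > MAX_UPLOAD_SIZE) return false;
  session.chunks.push(chunk);
  session.totalBytes += chunk.byteLength;
  return true;
}
```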
### Example Client Usage

```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: sessionToken },
});

// Start recording
socket.emit("start-transcription", { language: "en" });

socket.on("transcription-started", ({ sessionId }) => {
  console.log("Session started:", sessionId);
});

// Stream audio chunks from MediaRecorder
mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    event.data.arrayBuffer().then((buffer) => {
      socket.emit("audio-chunk", new Uint8Array(buffer));
    });
  }
};

// Stop and get the result
socket.emit("stop-transcription");

socket.on("transcription-final", (result) => {
  console.log("Transcription:", result.text);
  console.log("Duration:", result.durationSeconds, "seconds");
});

socket.on("transcription-error", ({ message }) => {
  console.error("Transcription error:", message);
});
```

---
## Environment Variable Reference

### Speech-to-Text (STT)

| Variable       | Default                                 | Description                                          |
| -------------- | --------------------------------------- | ---------------------------------------------------- |
| `STT_ENABLED`  | `false`                                 | Enable speech-to-text transcription                  |
| `STT_BASE_URL` | `http://speaches:8000/v1`               | Speaches server URL (required when STT_ENABLED=true) |
| `STT_MODEL`    | `Systran/faster-whisper-large-v3-turbo` | Whisper model for transcription                      |
| `STT_LANGUAGE` | `en`                                    | Default language code                                |

### Text-to-Speech (TTS) - Default Engine (Kokoro)

| Variable             | Default                     | Description                                         |
| -------------------- | --------------------------- | --------------------------------------------------- |
| `TTS_ENABLED`        | `false`                     | Enable default TTS engine                           |
| `TTS_DEFAULT_URL`    | `http://kokoro-tts:8880/v1` | Kokoro-FastAPI URL (required when TTS_ENABLED=true) |
| `TTS_DEFAULT_VOICE`  | `af_heart`                  | Default Kokoro voice ID                             |
| `TTS_DEFAULT_FORMAT` | `mp3`                       | Default audio output format                         |

### Text-to-Speech (TTS) - Premium Engine (Chatterbox)

| Variable              | Default                         | Description                                                 |
| --------------------- | ------------------------------- | ----------------------------------------------------------- |
| `TTS_PREMIUM_ENABLED` | `false`                         | Enable premium TTS engine                                   |
| `TTS_PREMIUM_URL`     | `http://chatterbox-tts:8881/v1` | Chatterbox TTS URL (required when TTS_PREMIUM_ENABLED=true) |

### Text-to-Speech (TTS) - Fallback Engine (Piper/OpenedAI)

| Variable               | Default                          | Description                                                   |
| ---------------------- | -------------------------------- | ------------------------------------------------------------- |
| `TTS_FALLBACK_ENABLED` | `false`                          | Enable fallback TTS engine                                    |
| `TTS_FALLBACK_URL`     | `http://openedai-speech:8000/v1` | OpenedAI Speech URL (required when TTS_FALLBACK_ENABLED=true) |

### Service Limits

| Variable                      | Default    | Description                                    |
| ----------------------------- | ---------- | ---------------------------------------------- |
| `SPEECH_MAX_UPLOAD_SIZE`      | `25000000` | Maximum upload file size in bytes (25 MB)      |
| `SPEECH_MAX_DURATION_SECONDS` | `600`      | Maximum audio duration in seconds (10 minutes) |
| `SPEECH_MAX_TEXT_LENGTH`      | `4096`     | Maximum text length for TTS in characters      |

### Conditional Validation

When a service is enabled, its URL variable is required. If it is missing, the application fails at startup with a message like:

```
STT is enabled (STT_ENABLED=true) but required environment variables are missing or empty: STT_BASE_URL.
Either set these variables or disable by setting STT_ENABLED=false.
```

Boolean parsing: `value === "true"` or `value === "1"`. Unset or empty values default to `false`.
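The parsing rule amounts to a one-line predicate (a sketch of the documented behavior, not the config source):

```typescript
// Only the exact strings "true" and "1" count as truthy; anything else
// (including "TRUE", unset, or empty) defaults to false, as documented.
function parseBooleanFlag(value: string | undefined): boolean {
  return value === "true" || value === "1";
}
```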
---
## Provider Configuration

### Kokoro (Default Tier)

**Engine:** [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)
**License:** Apache 2.0
**Requirements:** CPU only
**Docker Image:** `ghcr.io/remsky/kokoro-fastapi:latest-cpu`

**Capabilities:**

- 54 built-in voices across 8 languages
- Speed control: 0.25x to 4.0x
- Output formats: mp3, wav, opus, flac
- Voice metadata derived from the ID prefix (language, gender, accent)

**Voice ID Format:** `{lang}{gender}_{name}`

- First character: language/accent (a=American, b=British, e=Spanish, f=French, h=Hindi, j=Japanese, p=Portuguese, z=Chinese)
- Second character: gender (f=Female, m=Male)

**Example voices:**

| Voice ID     | Name    | Language | Gender |
| ------------ | ------- | -------- | ------ |
| `af_heart`   | Heart   | en-US    | Female |
| `am_adam`    | Adam    | en-US    | Male   |
| `bf_alice`   | Alice   | en-GB    | Female |
| `bm_daniel`  | Daniel  | en-GB    | Male   |
| `ef_dora`    | Dora    | es       | Female |
| `ff_camille` | Camille | fr       | Female |
| `jf_alpha`   | Alpha   | ja       | Female |
| `zf_xiaobei` | Xiaobei | zh       | Female |
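Deriving voice metadata from the ID prefix can be sketched as follows -- an illustrative parser based on the documented format; the provider's actual implementation may differ:

```typescript
// Language prefixes per the documented voice ID format. The en-US/en-GB
// expansions for "a"/"b" follow the example voices table above.
const KOKORO_LANGS: Record<string, string> = {
  a: "en-US", b: "en-GB", e: "es", f: "fr",
  h: "hi", j: "ja", p: "pt", z: "zh",
};

function parseKokoroVoiceId(
  id: string,
): { language: string; gender: string; name: string } | null {
  const match = /^([abefhjpz])([fm])_([a-z]+)$/.exec(id);
  if (!match) return null;
  const [, lang, gender, name] = match;
  return {
    language: KOKORO_LANGS[lang],
    gender: gender === "f" ? "Female" : "Male",
    name: name.charAt(0).toUpperCase() + name.slice(1),
  };
}
```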
### Chatterbox (Premium Tier)

**Engine:** [Chatterbox TTS Server](https://github.com/devnen/chatterbox-tts-server)
**License:** Proprietary
**Requirements:** NVIDIA GPU with CUDA
**Docker Image:** `devnen/chatterbox-tts-server:latest`

**Capabilities:**

- Voice cloning via a reference audio sample
- Emotion exaggeration control (0.0-1.0)
- Cross-language voice transfer (23 languages)
- Higher-quality synthesis than the default tier

**Supported Languages:**
en, fr, de, es, it, pt, nl, pl, ru, uk, ja, zh, ko, ar, hi, tr, sv, da, fi, no, cs, el, ro

**Extended Options (Chatterbox-specific):**

| Option                | Type   | Description                                               |
| --------------------- | ------ | --------------------------------------------------------- |
| `referenceAudio`      | Buffer | Audio sample for voice cloning (5-30 seconds recommended) |
| `emotionExaggeration` | number | Emotion intensity 0.0-1.0 (clamped)                       |

These options are passed as extra body parameters to the OpenAI-compatible endpoint. Reference audio is base64-encoded before sending.
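A sketch of assembling those extra parameters -- the wire field names `reference_audio_b64` and `exaggeration` are assumptions for illustration only; consult the Chatterbox server docs for the actual names:

```typescript
// Illustrative assembly of Chatterbox-specific extras: base64-encode the
// reference audio and clamp emotion to the documented 0.0-1.0 range.
// Wire field names here are hypothetical.
function buildChatterboxExtras(opts: {
  referenceAudio?: Buffer;
  emotionExaggeration?: number;
}): Record<string, unknown> {
  const extras: Record<string, unknown> = {};
  if (opts.referenceAudio) {
    extras.reference_audio_b64 = opts.referenceAudio.toString("base64");
  }
  if (opts.emotionExaggeration !== undefined) {
    extras.exaggeration = Math.min(1, Math.max(0, opts.emotionExaggeration));
  }
  return extras;
}
```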
### Piper (Fallback Tier)

**Engine:** [Piper](https://github.com/rhasspy/piper) via [OpenedAI Speech](https://github.com/matatonic/openedai-speech)
**License:** GPL (OpenedAI Speech)
**Requirements:** CPU only (runs on a Raspberry Pi)
**Docker Image:** Use the OpenedAI Speech image

**Capabilities:**

- 100+ voices across 40+ languages
- 6 standard OpenAI voice names (mapped to Piper voices)
- Output formats: mp3, wav, opus, flac
- Ultra-lightweight, designed for low-resource environments

**Standard Voice Mapping:**

| OpenAI Voice | Piper Voice          | Gender | Description           |
| ------------ | -------------------- | ------ | --------------------- |
| `alloy`      | en_US-amy-medium     | Female | Warm, balanced        |
| `echo`       | en_US-ryan-medium    | Male   | Clear, articulate     |
| `fable`      | en_GB-alan-medium    | Male   | British narrator      |
| `onyx`       | en_US-danny-low      | Male   | Deep, resonant        |
| `nova`       | en_US-lessac-medium  | Female | Expressive, versatile |
| `shimmer`    | en_US-kristin-medium | Female | Bright, energetic     |
### Speaches (STT)

**Engine:** [Speaches](https://github.com/speaches-ai/speaches) (faster-whisper backend)
**License:** MIT
**Requirements:** CPU (GPU optional for faster inference)
**Docker Image:** `ghcr.io/speaches-ai/speaches:latest`

**Capabilities:**

- OpenAI-compatible `/v1/audio/transcriptions` endpoint
- Whisper models via faster-whisper
- Verbose JSON response with segments and timestamps
- Language detection

**Default model:** `Systran/faster-whisper-large-v3-turbo`

---
## Voice Cloning Setup (Chatterbox)

Voice cloning is available through the Chatterbox premium TTS provider.

### Prerequisites

1. NVIDIA GPU with CUDA support
2. `nvidia-container-toolkit` installed on the Docker host
3. Docker runtime configured for GPU access
4. TTS premium tier enabled (`TTS_PREMIUM_ENABLED=true`)

### Basic Voice Cloning

Provide a reference audio sample (WAV or MP3, 5-30 seconds) when calling synthesize:

```typescript
import { SpeechService } from "./speech.service";
import type { ChatterboxSynthesizeOptions } from "./interfaces/speech-types";

const options: ChatterboxSynthesizeOptions = {
  tier: "premium",
  referenceAudio: myAudioBuffer, // 5-30 second audio sample
  emotionExaggeration: 0.5, // 0.0 = neutral, 1.0 = maximum emotion
};

const result = await speechService.synthesize("Hello, this is my cloned voice!", options);
```

### Voice Cloning Tips

- **Audio quality:** Use clean recordings without background noise
- **Duration:** 5-30 seconds works best; shorter clips may produce lower quality
- **Format:** WAV provides the best quality; MP3 is also accepted
- **Emotion:** Start with 0.5 (moderate) and adjust from there
- **Cross-language:** You can clone a voice in one language and synthesize in another

---
## Docker Compose Setup

### Development (Local)

Speech services are defined in a separate overlay file, `docker-compose.speech.yml`. This keeps them optional and separate from the core services.

**Start basic speech services (STT + default TTS):**

```bash
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml up -d

# Using the Makefile
make speech-up
```

**Start with premium TTS (requires an NVIDIA GPU):**

```bash
docker compose -f docker-compose.yml -f docker-compose.speech.yml --profile premium-tts up -d
```

**Stop speech services:**

```bash
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml down --remove-orphans

# Using the Makefile
make speech-down
```

**View logs:**

```bash
make speech-logs
```

### Development Services

| Service        | Container             | Port                            | Image                                      |
| -------------- | --------------------- | ------------------------------- | ------------------------------------------ |
| Speaches (STT) | mosaic-speaches       | 8090 (host) -> 8000 (container) | `ghcr.io/speaches-ai/speaches:latest`      |
| Kokoro TTS     | mosaic-kokoro-tts     | 8880 (host) -> 8880 (container) | `ghcr.io/remsky/kokoro-fastapi:latest-cpu` |
| Chatterbox TTS | mosaic-chatterbox-tts | 8881 (host) -> 8000 (container) | `devnen/chatterbox-tts-server:latest`      |
### Production (Docker Swarm)

For production deployments, use `docker/docker-compose.sample.speech.yml`. This file is designed for Docker Swarm with Traefik integration.

**Required environment variables:**

```bash
STT_DOMAIN=stt.example.com
TTS_DOMAIN=tts.example.com
```

**Optional environment variables:**

```bash
WHISPER_MODEL=Systran/faster-whisper-large-v3-turbo
CHATTERBOX_TTS_DOMAIN=tts-premium.example.com
TRAEFIK_ENTRYPOINT=websecure
TRAEFIK_CERTRESOLVER=letsencrypt
TRAEFIK_DOCKER_NETWORK=traefik-public
TRAEFIK_TLS_ENABLED=true
```

**Deploy:**

```bash
docker stack deploy -c docker/docker-compose.sample.speech.yml speech
```

**Connecting to Mosaic Stack:** Set the speech URLs in your Mosaic Stack `.env`:

```bash
# Same Docker network
STT_BASE_URL=http://speaches:8000/v1
TTS_DEFAULT_URL=http://kokoro-tts:8880/v1

# External / different network
STT_BASE_URL=https://stt.example.com/v1
TTS_DEFAULT_URL=https://tts.example.com/v1
```

### Health Checks

All speech containers include health checks:

| Service        | Endpoint                       | Interval | Start Period |
| -------------- | ------------------------------ | -------- | ------------ |
| Speaches       | `http://localhost:8000/health` | 30s      | 120s         |
| Kokoro TTS     | `http://localhost:8880/health` | 30s      | 120s         |
| Chatterbox TTS | `http://localhost:8000/health` | 30s      | 180s         |

Chatterbox has a longer start period (180s) because GPU model loading takes additional time.

---
## GPU VRAM Budget

Only Chatterbox requires GPU resources. The other providers (Speaches, Kokoro, Piper) are CPU-only.

### Chatterbox VRAM Requirements

| Component               | Approximate VRAM   |
| ----------------------- | ------------------ |
| Chatterbox TTS model    | ~2-4 GB            |
| Voice cloning inference | ~1-2 GB additional |
| **Total recommended**   | **4-6 GB**         |

### Shared GPU Considerations

If running multiple GPU services (e.g., Ollama for the LLM + Chatterbox for TTS):

| Service              | VRAM Usage  | Notes                             |
| -------------------- | ----------- | --------------------------------- |
| Ollama (7B model)    | ~4-6 GB     | Depends on model size             |
| Ollama (13B model)   | ~8-10 GB    | Larger models need more           |
| Chatterbox TTS       | ~4-6 GB     | Voice cloning is memory-intensive |
| **Combined minimum** | **8-12 GB** | For 7B LLM + Chatterbox           |

**Recommendations:**

- 8 GB VRAM: Adequate for a small LLM + Chatterbox (may need to alternate)
- 12 GB VRAM: Comfortable for a 7B LLM + Chatterbox simultaneously
- 24 GB VRAM: Supports larger LLMs + Chatterbox with headroom

If VRAM is limited, consider:

1. Disabling Chatterbox (`TTS_PREMIUM_ENABLED=false`) and using Kokoro (CPU) as the default
2. Using the fallback chain so Kokoro handles requests when Chatterbox is busy
3. Running Chatterbox on a separate GPU host

### Docker Swarm GPU Scheduling

For Docker Swarm deployments with GPU, configure generic resources on the node in `/etc/docker/daemon.json`:

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime"
    }
  },
  "node-generic-resources": ["NVIDIA-GPU=0"]
}
```

See the [Docker GPU Swarm documentation](https://docs.docker.com/engine/daemon/nvidia-gpu/#configure-gpus-for-docker-swarm) for details.

---
## Frontend Integration

Speech services are consumed from the frontend through the REST API and the WebSocket gateway.

### REST API Usage

**Transcribe audio:**

```typescript
async function transcribeAudio(file: File, token: string, workspaceId: string) {
  const formData = new FormData();
  formData.append("file", file);
  formData.append("language", "en");

  const response = await fetch("/api/speech/transcribe", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
    body: formData,
  });

  const { data } = await response.json();
  return data.text;
}
```
**Synthesize speech:**

```typescript
async function synthesizeSpeech(
  text: string,
  token: string,
  workspaceId: string,
  voice = "af_heart"
) {
  const response = await fetch("/api/speech/synthesize", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text, voice, format: "mp3" }),
  });

  const audioBlob = await response.blob();
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  audio.play();
}
```
**List voices:**

```typescript
async function listVoices(token: string, workspaceId: string, tier?: string) {
  const url = tier ? `/api/speech/voices?tier=${tier}` : "/api/speech/voices";

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });

  const { data } = await response.json();
  return data; // VoiceInfo[]
}
```
### WebSocket Streaming Usage

For real-time transcription using the browser's MediaRecorder API:

```typescript
import { io } from "socket.io-client";

function createSpeechSocket(token: string) {
  const socket = io("/speech", {
    auth: { token },
  });

  let mediaRecorder: MediaRecorder | null = null;

  async function startRecording() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    mediaRecorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
    });

    socket.emit("start-transcription", { language: "en" });

    mediaRecorder.ondataavailable = (event) => {
      if (event.data.size > 0) {
        event.data.arrayBuffer().then((buffer) => {
          socket.emit("audio-chunk", new Uint8Array(buffer));
        });
      }
    };

    mediaRecorder.start(250); // Send chunks every 250 ms
  }

  async function stopRecording(): Promise<string> {
    return new Promise((resolve, reject) => {
      socket.once("transcription-final", (result) => {
        resolve(result.text);
      });

      socket.once("transcription-error", ({ message }) => {
        reject(new Error(message));
      });

      if (mediaRecorder) {
        mediaRecorder.stop();
        mediaRecorder.stream.getTracks().forEach((track) => track.stop());
        mediaRecorder = null;
      }

      socket.emit("stop-transcription");
    });
  }

  return { socket, startRecording, stopRecording };
}
```
### Check Speech Availability

Before showing speech UI elements, check provider availability:

```typescript
async function checkSpeechHealth(token: string, workspaceId: string) {
  const response = await fetch("/api/speech/health", {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });

  const { data } = await response.json();
  return {
    canTranscribe: data.stt.available,
    canSynthesize: data.tts.available,
  };
}
```