# Speech Services

Mosaic Stack provides integrated speech-to-text (STT) and text-to-speech (TTS) services through a provider abstraction layer. Speech services are optional and modular -- each component can be independently enabled, disabled, or pointed at external infrastructure.

## Table of Contents

- [Architecture Overview](#architecture-overview)
- [Provider Abstraction](#provider-abstraction)
- [TTS Tier System and Fallback Chain](#tts-tier-system-and-fallback-chain)
- [API Endpoint Reference](#api-endpoint-reference)
- [WebSocket Streaming Protocol](#websocket-streaming-protocol)
- [Environment Variable Reference](#environment-variable-reference)
- [Provider Configuration](#provider-configuration)
- [Voice Cloning Setup (Chatterbox)](#voice-cloning-setup-chatterbox)
- [Docker Compose Setup](#docker-compose-setup)
- [GPU VRAM Budget](#gpu-vram-budget)
- [Frontend Integration](#frontend-integration)

---
## Architecture Overview

```
+-------------------+
| SpeechController  |
| (REST endpoints)  |
+---------+---------+
          |
          v
+---------+-------------------+
|        SpeechService        |
|    (provider selection,     |
|   fallback orchestration)   |
+------+---------------+------+
       |               |
       v               v
+------+------+   +----+---------------+
| STT Provider|   | TTS Providers      |
| (Speaches)  |   | Map<Tier,Provider> |
+------+------+   +--+------+------+---+
       |             |      |      |
       v             v      v      v
+------+------+ +--------+ +----------+ +----------+
|  Speaches   | | Kokoro | |Chatterbox| |  Piper   |
|  (Whisper)  | |(default)| |(premium)| |(fallback)|
+-------------+ +--------+ +----------+ +----------+

+---------------------+
|    SpeechGateway    |
| (WebSocket /speech) |
+----------+----------+
           |
           v
Uses SpeechService.transcribe()
```

The speech module (`apps/api/src/speech/`) is a self-contained NestJS module consisting of:

| Component  | File                   | Purpose                                    |
| ---------- | ---------------------- | ------------------------------------------ |
| Module     | `speech.module.ts`     | Registers providers, controllers, gateway  |
| Config     | `speech.config.ts`     | Environment validation and typed config    |
| Service    | `speech.service.ts`    | High-level speech operations with fallback |
| Controller | `speech.controller.ts` | REST API endpoints                         |
| Gateway    | `speech.gateway.ts`    | WebSocket streaming transcription          |
| Constants  | `speech.constants.ts`  | NestJS injection tokens                    |

### Key Design Decisions

1. **OpenAI-compatible APIs**: All providers (Speaches, Kokoro, Chatterbox, Piper/OpenedAI) expose OpenAI-compatible endpoints. The official OpenAI SDK is used as the HTTP client with a custom `baseURL`.

2. **Provider abstraction**: STT and TTS providers implement well-defined interfaces (`ISTTProvider`, `ITTSProvider`). New providers can be added without modifying the service layer.

3. **Conditional registration**: Providers are only instantiated when their corresponding `*_ENABLED` flag is `true`. The STT provider uses NestJS `@Optional()` injection.

4. **Fail-fast validation**: Configuration is validated at module initialization. If a service is enabled but its URL is missing, the application fails on startup with a descriptive error.
---

## Provider Abstraction

### STT Provider Interface

```typescript
interface ISTTProvider {
  readonly name: string;
  transcribe(audio: Buffer, options?: TranscribeOptions): Promise<TranscriptionResult>;
  isHealthy(): Promise<boolean>;
}
```

Currently implemented by `SpeachesSttProvider`, which connects to a Speaches (faster-whisper) server.

### TTS Provider Interface

```typescript
interface ITTSProvider {
  readonly name: string;
  readonly tier: SpeechTier;
  synthesize(text: string, options?: SynthesizeOptions): Promise<SynthesisResult>;
  listVoices(): Promise<VoiceInfo[]>;
  isHealthy(): Promise<boolean>;
}
```

All TTS providers extend `BaseTTSProvider`, an abstract class that implements the common OpenAI-compatible synthesis logic. Concrete providers only need to set `name` and `tier`, and may optionally override `listVoices()` or `synthesize()`.
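As an illustration of this pattern, here is a simplified, self-contained sketch -- not the actual `BaseTTSProvider` source, which also performs the OpenAI-compatible HTTP calls (stubbed out here):

```typescript
// Simplified sketch of the provider pattern described above. The real base
// class lives in apps/api/src/speech/providers/; HTTP synthesis is stubbed.
type SpeechTier = "default" | "premium" | "fallback";

abstract class SketchBaseTTSProvider {
  abstract readonly name: string;
  abstract readonly tier: SpeechTier;

  // Shared synthesis logic would live here; stubbed for the sketch.
  async synthesize(text: string): Promise<{ provider: string; bytes: number }> {
    return { provider: this.name, bytes: text.length };
  }

  async isHealthy(): Promise<boolean> {
    return true;
  }
}

// A concrete provider only needs to supply its identity.
class SketchKokoroProvider extends SketchBaseTTSProvider {
  readonly name = "kokoro";
  readonly tier: SpeechTier = "default";
}
```

The key design point is that tier routing and fallback live entirely in the service layer, so a new provider is just a new subclass plus a registration entry.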
### Provider Registration

Providers are created by the TTS provider factory (`providers/tts-provider.factory.ts`) based on configuration:

| Tier       | Provider Class          | Engine                    | Requirements |
| ---------- | ----------------------- | ------------------------- | ------------ |
| `default`  | `KokoroTtsProvider`     | Kokoro-FastAPI            | CPU only     |
| `premium`  | `ChatterboxTTSProvider` | Chatterbox TTS Server     | NVIDIA GPU   |
| `fallback` | `PiperTtsProvider`      | Piper via OpenedAI Speech | CPU only     |
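The factory's tier-to-provider wiring can be pictured with this hedged sketch -- provider classes are reduced to their names, and the real factory also injects config and an HTTP client:

```typescript
type SpeechTier = "default" | "premium" | "fallback";

interface TtsFlags {
  ttsEnabled: boolean; // TTS_ENABLED
  premiumEnabled: boolean; // TTS_PREMIUM_ENABLED
  fallbackEnabled: boolean; // TTS_FALLBACK_ENABLED
}

// Build the tier map only for enabled tiers, mirroring the conditional
// registration described in Key Design Decisions above.
function buildTtsProviders(flags: TtsFlags): Map<SpeechTier, string> {
  const providers = new Map<SpeechTier, string>();
  if (flags.ttsEnabled) providers.set("default", "KokoroTtsProvider");
  if (flags.premiumEnabled) providers.set("premium", "ChatterboxTTSProvider");
  if (flags.fallbackEnabled) providers.set("fallback", "PiperTtsProvider");
  return providers;
}
```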
---
## TTS Tier System and Fallback Chain

TTS uses a tiered architecture with automatic fallback:

```
Request with tier="premium"
        |
        v
[premium]  Chatterbox available? --yes--> Use Chatterbox
        | no (or synthesis failed)
        v
[default]  Kokoro available? ------yes--> Use Kokoro
        | no (or synthesis failed)
        v
[fallback] Piper available? -------yes--> Use Piper
        | no (or synthesis failed)
        v
ServiceUnavailableException
```

**Fallback order:** `premium` -> `default` -> `fallback`

The fallback chain starts from the requested tier and proceeds downward. A tier is only attempted if:

1. It is enabled in configuration (`TTS_ENABLED`, `TTS_PREMIUM_ENABLED`, `TTS_FALLBACK_ENABLED`)
2. A provider is registered for that tier

If no tier is specified in the request, `default` is used as the starting point.
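The chain above can be sketched as a simple loop -- an illustrative sketch, not the actual `SpeechService` code:

```typescript
type SpeechTier = "default" | "premium" | "fallback";

// Tiers in fallback order, as documented above.
const FALLBACK_ORDER: SpeechTier[] = ["premium", "default", "fallback"];

// Start at the requested tier and walk down, skipping tiers with no
// registered provider. A provider throwing means "try the next tier".
async function synthesizeWithFallback(
  requested: SpeechTier,
  providers: Map<SpeechTier, (text: string) => Promise<string>>,
  text: string,
): Promise<string> {
  const start = FALLBACK_ORDER.indexOf(requested);
  for (const tier of FALLBACK_ORDER.slice(start)) {
    const provider = providers.get(tier);
    if (!provider) continue; // tier disabled or unregistered
    try {
      return await provider(text);
    } catch {
      // fall through to the next tier
    }
  }
  throw new Error("ServiceUnavailableException: no TTS tier available");
}
```

Note that requesting `default` never escalates to `premium`: the walk only moves downward from the requested tier.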
---
## API Endpoint Reference

All speech endpoints are under `/api/speech/` and require authentication (Bearer token) plus workspace context (`x-workspace-id` header).

### POST /api/speech/transcribe

Transcribe an uploaded audio file to text.

**Authentication:** Bearer token + workspace membership
**Content-Type:** `multipart/form-data`

**Form Fields:**

| Field         | Type   | Required | Description                                            |
| ------------- | ------ | -------- | ------------------------------------------------------ |
| `file`        | File   | Yes      | Audio file (max 25 MB)                                 |
| `language`    | string | No       | Language code (e.g., "en", "fr"). Default: from config |
| `model`       | string | No       | Whisper model override. Default: from config           |
| `prompt`      | string | No       | Prompt to guide transcription (max 1000 chars)         |
| `temperature` | number | No       | Temperature 0.0-1.0. Lower = more deterministic        |

**Accepted Audio Formats:**
`audio/wav`, `audio/mp3`, `audio/mpeg`, `audio/webm`, `audio/ogg`, `audio/flac`, `audio/x-m4a`

**Response:**

```json
{
  "data": {
    "text": "Hello, this is a transcription test.",
    "language": "en",
    "durationSeconds": 3.5,
    "confidence": 0.95,
    "segments": [
      {
        "text": "Hello, this is a transcription test.",
        "start": 0.0,
        "end": 3.5,
        "confidence": 0.95
      }
    ]
  }
}
```

**Example:**

```bash
curl -X POST http://localhost:3001/api/speech/transcribe \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -F "file=@recording.wav" \
  -F "language=en"
```
### POST /api/speech/synthesize

Synthesize text to audio using TTS providers.

**Authentication:** Bearer token + workspace membership
**Content-Type:** `application/json`

**Request Body:**

| Field    | Type   | Required | Description                                                 |
| -------- | ------ | -------- | ----------------------------------------------------------- |
| `text`   | string | Yes      | Text to synthesize (max 4096 chars)                         |
| `voice`  | string | No       | Voice ID. Default: from config (e.g., "af_heart")           |
| `speed`  | number | No       | Speed multiplier 0.5-2.0. Default: 1.0                      |
| `format` | string | No       | Output format: mp3, wav, opus, flac, aac, pcm. Default: mp3 |
| `tier`   | string | No       | Provider tier: default, premium, fallback. Default: default |

**Response:** Binary audio data with the appropriate `Content-Type` header.

| Format | Content-Type |
| ------ | ------------ |
| mp3    | `audio/mpeg` |
| wav    | `audio/wav`  |
| opus   | `audio/opus` |
| flac   | `audio/flac` |
| aac    | `audio/aac`  |
| pcm    | `audio/pcm`  |

**Example:**

```bash
curl -X POST http://localhost:3001/api/speech/synthesize \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "af_heart", "format": "mp3"}' \
  --output speech.mp3
```
### GET /api/speech/voices

List available TTS voices across all tiers.

**Authentication:** Bearer token + workspace membership
**Query Parameters:**

| Parameter | Type   | Required | Description                                |
| --------- | ------ | -------- | ------------------------------------------ |
| `tier`    | string | No       | Filter by tier: default, premium, fallback |

**Response:**

```json
{
  "data": [
    {
      "id": "af_heart",
      "name": "Heart (American Female)",
      "language": "en-US",
      "tier": "default",
      "isDefault": true
    },
    {
      "id": "am_adam",
      "name": "Adam (American Male)",
      "language": "en-US",
      "tier": "default",
      "isDefault": false
    }
  ]
}
```

**Example:**

```bash
curl -X GET 'http://localhost:3001/api/speech/voices?tier=default' \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID"
```
### GET /api/speech/health

Check availability of STT and TTS providers.

**Authentication:** Bearer token + workspace membership

**Response:**

```json
{
  "data": {
    "stt": { "available": true },
    "tts": { "available": true }
  }
}
```

---
## WebSocket Streaming Protocol

The speech module provides a WebSocket gateway at namespace `/speech` for real-time streaming transcription. Audio chunks are accumulated on the server and transcribed when the session is stopped.

### Connection

Connect to the `/speech` namespace with authentication:

```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: "YOUR_SESSION_TOKEN" },
});
```

**Authentication methods** (checked in order):

1. `auth.token` in the handshake
2. `query.token` in the handshake URL
3. `Authorization: Bearer <token>` header

The connection is rejected if:

- No valid token is provided
- Session verification fails
- The user has no workspace membership

**Connection timeout:** 5 seconds for authentication.

### Protocol Flow

```
Client                                  Server
  |                                        |
  |--- connect (with token) -------------->|
  |                                        | (authenticate, check workspace)
  |<-- connected --------------------------|
  |                                        |
  |--- start-transcription --------------->| { language?: "en" }
  |<-- transcription-started --------------| { sessionId, language }
  |                                        |
  |--- audio-chunk ----------------------->| (Buffer/Uint8Array)
  |--- audio-chunk ----------------------->| (Buffer/Uint8Array)
  |--- audio-chunk ----------------------->| (Buffer/Uint8Array)
  |                                        |
  |--- stop-transcription ---------------->|
  |                                        | (concatenate chunks, transcribe)
  |<-- transcription-final ----------------| { text, language, durationSeconds, ... }
  |                                        |
```

### Client Events (emit)

| Event                 | Payload                  | Description                              |
| --------------------- | ------------------------ | ---------------------------------------- |
| `start-transcription` | `{ language?: string }`  | Begin a new transcription session        |
| `audio-chunk`         | `Buffer` or `Uint8Array` | Send audio data chunk                    |
| `stop-transcription`  | (none)                   | Stop recording and trigger transcription |

### Server Events (listen)

| Event                   | Payload                                                     | Description                |
| ----------------------- | ----------------------------------------------------------- | -------------------------- |
| `transcription-started` | `{ sessionId, language }`                                   | Session created            |
| `transcription-final`   | `{ text, language, durationSeconds, confidence, segments }` | Transcription result       |
| `transcription-error`   | `{ message }`                                               | Error during transcription |

### Session Management

- One active transcription session per client connection
- Starting a new session replaces any existing session
- Sessions are cleaned up on client disconnect
- Audio chunks are accumulated in memory
- Total accumulated size is capped by `SPEECH_MAX_UPLOAD_SIZE` (default: 25 MB)
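The size cap can be pictured with the following sketch -- hypothetical server-side bookkeeping, not the gateway source:

```typescript
// Hypothetical session bookkeeping illustrating the SPEECH_MAX_UPLOAD_SIZE cap.
const MAX_UPLOAD_SIZE = 25_000_000; // bytes, mirrors the documented default

interface SketchSession {
  chunks: Uint8Array[];
  totalBytes: number;
}

// Returns false (and drops the chunk) once the cap would be exceeded;
// the real gateway would surface this as a transcription-error event.
function addChunk(session: SketchSession, chunk: Uint8Array): boolean {
  if (session.totalBytes + chunk.byteLength > MAX_UPLOAD_SIZE) return false;
  session.chunks.push(chunk);
  session.totalBytes += chunk.byteLength;
  return true;
}
```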
### Example Client Usage

```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: sessionToken },
});

// Start recording
socket.emit("start-transcription", { language: "en" });

socket.on("transcription-started", ({ sessionId }) => {
  console.log("Session started:", sessionId);
});

// Stream audio chunks from MediaRecorder
mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    event.data.arrayBuffer().then((buffer) => {
      socket.emit("audio-chunk", new Uint8Array(buffer));
    });
  }
};

// Stop and get the result
socket.emit("stop-transcription");

socket.on("transcription-final", (result) => {
  console.log("Transcription:", result.text);
  console.log("Duration:", result.durationSeconds, "seconds");
});

socket.on("transcription-error", ({ message }) => {
  console.error("Transcription error:", message);
});
```

---
## Environment Variable Reference

### Speech-to-Text (STT)

| Variable       | Default                                 | Description                                          |
| -------------- | --------------------------------------- | ---------------------------------------------------- |
| `STT_ENABLED`  | `false`                                 | Enable speech-to-text transcription                  |
| `STT_BASE_URL` | `http://speaches:8000/v1`               | Speaches server URL (required when STT_ENABLED=true) |
| `STT_MODEL`    | `Systran/faster-whisper-large-v3-turbo` | Whisper model for transcription                      |
| `STT_LANGUAGE` | `en`                                    | Default language code                                |

### Text-to-Speech (TTS) - Default Engine (Kokoro)

| Variable             | Default                     | Description                                         |
| -------------------- | --------------------------- | --------------------------------------------------- |
| `TTS_ENABLED`        | `false`                     | Enable default TTS engine                           |
| `TTS_DEFAULT_URL`    | `http://kokoro-tts:8880/v1` | Kokoro-FastAPI URL (required when TTS_ENABLED=true) |
| `TTS_DEFAULT_VOICE`  | `af_heart`                  | Default Kokoro voice ID                             |
| `TTS_DEFAULT_FORMAT` | `mp3`                       | Default audio output format                         |

### Text-to-Speech (TTS) - Premium Engine (Chatterbox)

| Variable              | Default                         | Description                                                 |
| --------------------- | ------------------------------- | ----------------------------------------------------------- |
| `TTS_PREMIUM_ENABLED` | `false`                         | Enable premium TTS engine                                   |
| `TTS_PREMIUM_URL`     | `http://chatterbox-tts:8881/v1` | Chatterbox TTS URL (required when TTS_PREMIUM_ENABLED=true) |

### Text-to-Speech (TTS) - Fallback Engine (Piper/OpenedAI)

| Variable               | Default                          | Description                                                   |
| ---------------------- | -------------------------------- | ------------------------------------------------------------- |
| `TTS_FALLBACK_ENABLED` | `false`                          | Enable fallback TTS engine                                    |
| `TTS_FALLBACK_URL`     | `http://openedai-speech:8000/v1` | OpenedAI Speech URL (required when TTS_FALLBACK_ENABLED=true) |

### Service Limits

| Variable                      | Default    | Description                                    |
| ----------------------------- | ---------- | ---------------------------------------------- |
| `SPEECH_MAX_UPLOAD_SIZE`      | `25000000` | Maximum upload file size in bytes (25 MB)      |
| `SPEECH_MAX_DURATION_SECONDS` | `600`      | Maximum audio duration in seconds (10 minutes) |
| `SPEECH_MAX_TEXT_LENGTH`      | `4096`     | Maximum text length for TTS in characters      |

### Conditional Validation

When a service is enabled, its URL variable is required. If it is missing, the application fails at startup with a message like:

```
STT is enabled (STT_ENABLED=true) but required environment variables are missing or empty: STT_BASE_URL.
Either set these variables or disable by setting STT_ENABLED=false.
```

Boolean parsing: `value === "true"` or `value === "1"`. Unset or empty values default to `false`.
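The parsing rule amounts to a one-line predicate (a sketch of the documented behavior, not the config source):

```typescript
// Only the exact strings "true" and "1" count as truthy; anything else
// (including "TRUE", unset, or empty) defaults to false, as documented.
function parseBooleanFlag(value: string | undefined): boolean {
  return value === "true" || value === "1";
}
```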
---
## Provider Configuration

### Kokoro (Default Tier)

**Engine:** [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)
**License:** Apache 2.0
**Requirements:** CPU only
**Docker Image:** `ghcr.io/remsky/kokoro-fastapi:latest-cpu`

**Capabilities:**

- 54 built-in voices across 8 languages
- Speed control: 0.25x to 4.0x
- Output formats: mp3, wav, opus, flac
- Voice metadata derived from the ID prefix (language, gender, accent)

**Voice ID Format:** `{lang}{gender}_{name}`

- First character: language/accent (a=American, b=British, e=Spanish, f=French, h=Hindi, j=Japanese, p=Portuguese, z=Chinese)
- Second character: gender (f=Female, m=Male)

**Example voices:**

| Voice ID     | Name    | Language | Gender |
| ------------ | ------- | -------- | ------ |
| `af_heart`   | Heart   | en-US    | Female |
| `am_adam`    | Adam    | en-US    | Male   |
| `bf_alice`   | Alice   | en-GB    | Female |
| `bm_daniel`  | Daniel  | en-GB    | Male   |
| `ef_dora`    | Dora    | es       | Female |
| `ff_camille` | Camille | fr       | Female |
| `jf_alpha`   | Alpha   | ja       | Female |
| `zf_xiaobei` | Xiaobei | zh       | Female |
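Deriving voice metadata from the ID prefix can be sketched as follows -- an illustrative parser based on the documented format; the provider's actual implementation may differ:

```typescript
// Language prefixes per the documented voice ID format. The en-US/en-GB
// expansions for "a"/"b" follow the example voices table above.
const KOKORO_LANGS: Record<string, string> = {
  a: "en-US", b: "en-GB", e: "es", f: "fr",
  h: "hi", j: "ja", p: "pt", z: "zh",
};

function parseKokoroVoiceId(
  id: string,
): { language: string; gender: string; name: string } | null {
  const match = /^([abefhjpz])([fm])_([a-z]+)$/.exec(id);
  if (!match) return null;
  const [, lang, gender, name] = match;
  return {
    language: KOKORO_LANGS[lang],
    gender: gender === "f" ? "Female" : "Male",
    name: name.charAt(0).toUpperCase() + name.slice(1),
  };
}
```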
### Chatterbox (Premium Tier)

**Engine:** [Chatterbox TTS Server](https://github.com/devnen/chatterbox-tts-server)
**License:** Proprietary
**Requirements:** NVIDIA GPU with CUDA
**Docker Image:** `devnen/chatterbox-tts-server:latest`

**Capabilities:**

- Voice cloning via a reference audio sample
- Emotion exaggeration control (0.0-1.0)
- Cross-language voice transfer (23 languages)
- Higher-quality synthesis than the default tier

**Supported Languages:**
en, fr, de, es, it, pt, nl, pl, ru, uk, ja, zh, ko, ar, hi, tr, sv, da, fi, no, cs, el, ro

**Extended Options (Chatterbox-specific):**

| Option                | Type   | Description                                               |
| --------------------- | ------ | --------------------------------------------------------- |
| `referenceAudio`      | Buffer | Audio sample for voice cloning (5-30 seconds recommended) |
| `emotionExaggeration` | number | Emotion intensity 0.0-1.0 (clamped)                       |

These options are passed as extra body parameters to the OpenAI-compatible endpoint. Reference audio is base64-encoded before sending.
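A sketch of assembling those extra parameters -- the wire field names `reference_audio_b64` and `exaggeration` are assumptions for illustration only; consult the Chatterbox server docs for the actual names:

```typescript
// Illustrative assembly of Chatterbox-specific extras: base64-encode the
// reference audio and clamp emotion to the documented 0.0-1.0 range.
// Wire field names here are hypothetical.
function buildChatterboxExtras(opts: {
  referenceAudio?: Buffer;
  emotionExaggeration?: number;
}): Record<string, unknown> {
  const extras: Record<string, unknown> = {};
  if (opts.referenceAudio) {
    extras.reference_audio_b64 = opts.referenceAudio.toString("base64");
  }
  if (opts.emotionExaggeration !== undefined) {
    extras.exaggeration = Math.min(1, Math.max(0, opts.emotionExaggeration));
  }
  return extras;
}
```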
### Piper (Fallback Tier)

**Engine:** [Piper](https://github.com/rhasspy/piper) via [OpenedAI Speech](https://github.com/matatonic/openedai-speech)
**License:** GPL (OpenedAI Speech)
**Requirements:** CPU only (runs on a Raspberry Pi)
**Docker Image:** Use the OpenedAI Speech image

**Capabilities:**

- 100+ voices across 40+ languages
- 6 standard OpenAI voice names (mapped to Piper voices)
- Output formats: mp3, wav, opus, flac
- Ultra-lightweight, designed for low-resource environments

**Standard Voice Mapping:**

| OpenAI Voice | Piper Voice          | Gender | Description           |
| ------------ | -------------------- | ------ | --------------------- |
| `alloy`      | en_US-amy-medium     | Female | Warm, balanced        |
| `echo`       | en_US-ryan-medium    | Male   | Clear, articulate     |
| `fable`      | en_GB-alan-medium    | Male   | British narrator      |
| `onyx`       | en_US-danny-low      | Male   | Deep, resonant        |
| `nova`       | en_US-lessac-medium  | Female | Expressive, versatile |
| `shimmer`    | en_US-kristin-medium | Female | Bright, energetic     |
### Speaches (STT)

**Engine:** [Speaches](https://github.com/speaches-ai/speaches) (faster-whisper backend)
**License:** MIT
**Requirements:** CPU (GPU optional for faster inference)
**Docker Image:** `ghcr.io/speaches-ai/speaches:latest`

**Capabilities:**

- OpenAI-compatible `/v1/audio/transcriptions` endpoint
- Whisper models via faster-whisper
- Verbose JSON response with segments and timestamps
- Language detection

**Default model:** `Systran/faster-whisper-large-v3-turbo`

---
## Voice Cloning Setup (Chatterbox)

Voice cloning is available through the Chatterbox premium TTS provider.

### Prerequisites

1. NVIDIA GPU with CUDA support
2. `nvidia-container-toolkit` installed on the Docker host
3. Docker runtime configured for GPU access
4. TTS premium tier enabled (`TTS_PREMIUM_ENABLED=true`)

### Basic Voice Cloning

Provide a reference audio sample (WAV or MP3, 5-30 seconds) when calling synthesize:

```typescript
import { SpeechService } from "./speech.service";
import type { ChatterboxSynthesizeOptions } from "./interfaces/speech-types";

const options: ChatterboxSynthesizeOptions = {
  tier: "premium",
  referenceAudio: myAudioBuffer, // 5-30 second audio sample
  emotionExaggeration: 0.5, // 0.0 = neutral, 1.0 = maximum emotion
};

const result = await speechService.synthesize("Hello, this is my cloned voice!", options);
```

### Voice Cloning Tips

- **Audio quality:** Use clean recordings without background noise
- **Duration:** 5-30 seconds works best; shorter clips may produce lower quality
- **Format:** WAV provides the best quality; MP3 is also accepted
- **Emotion:** Start with 0.5 (moderate) and adjust from there
- **Cross-language:** You can clone a voice in one language and synthesize in another

---
## Docker Compose Setup

### Development (Local)

Speech services are defined in a separate overlay file, `docker-compose.speech.yml`. This keeps them optional and separate from the core services.

**Start basic speech services (STT + default TTS):**

```bash
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml up -d

# Using the Makefile
make speech-up
```

**Start with premium TTS (requires an NVIDIA GPU):**

```bash
docker compose -f docker-compose.yml -f docker-compose.speech.yml --profile premium-tts up -d
```

**Stop speech services:**

```bash
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml down --remove-orphans

# Using the Makefile
make speech-down
```

**View logs:**

```bash
make speech-logs
```

### Development Services

| Service        | Container             | Port                            | Image                                      |
| -------------- | --------------------- | ------------------------------- | ------------------------------------------ |
| Speaches (STT) | mosaic-speaches       | 8090 (host) -> 8000 (container) | `ghcr.io/speaches-ai/speaches:latest`      |
| Kokoro TTS     | mosaic-kokoro-tts     | 8880 (host) -> 8880 (container) | `ghcr.io/remsky/kokoro-fastapi:latest-cpu` |
| Chatterbox TTS | mosaic-chatterbox-tts | 8881 (host) -> 8000 (container) | `devnen/chatterbox-tts-server:latest`      |
### Production (Docker Swarm)

For production deployments, use `docker/docker-compose.sample.speech.yml`. This file is designed for Docker Swarm with Traefik integration.

**Required environment variables:**

```bash
STT_DOMAIN=stt.example.com
TTS_DOMAIN=tts.example.com
```

**Optional environment variables:**

```bash
WHISPER_MODEL=Systran/faster-whisper-large-v3-turbo
CHATTERBOX_TTS_DOMAIN=tts-premium.example.com
TRAEFIK_ENTRYPOINT=websecure
TRAEFIK_CERTRESOLVER=letsencrypt
TRAEFIK_DOCKER_NETWORK=traefik-public
TRAEFIK_TLS_ENABLED=true
```

**Deploy:**

```bash
docker stack deploy -c docker/docker-compose.sample.speech.yml speech
```

**Connecting to Mosaic Stack:** Set the speech URLs in your Mosaic Stack `.env`:

```bash
# Same Docker network
STT_BASE_URL=http://speaches:8000/v1
TTS_DEFAULT_URL=http://kokoro-tts:8880/v1

# External / different network
STT_BASE_URL=https://stt.example.com/v1
TTS_DEFAULT_URL=https://tts.example.com/v1
```

### Health Checks

All speech containers include health checks:

| Service        | Endpoint                       | Interval | Start Period |
| -------------- | ------------------------------ | -------- | ------------ |
| Speaches       | `http://localhost:8000/health` | 30s      | 120s         |
| Kokoro TTS     | `http://localhost:8880/health` | 30s      | 120s         |
| Chatterbox TTS | `http://localhost:8000/health` | 30s      | 180s         |

Chatterbox has a longer start period (180s) because GPU model loading takes additional time.

---
## GPU VRAM Budget

Only Chatterbox requires GPU resources. The other providers (Speaches, Kokoro, Piper) are CPU-only.

### Chatterbox VRAM Requirements

| Component               | Approximate VRAM   |
| ----------------------- | ------------------ |
| Chatterbox TTS model    | ~2-4 GB            |
| Voice cloning inference | ~1-2 GB additional |
| **Total recommended**   | **4-6 GB**         |

### Shared GPU Considerations

If running multiple GPU services (e.g., Ollama for the LLM + Chatterbox for TTS):

| Service              | VRAM Usage  | Notes                             |
| -------------------- | ----------- | --------------------------------- |
| Ollama (7B model)    | ~4-6 GB     | Depends on model size             |
| Ollama (13B model)   | ~8-10 GB    | Larger models need more           |
| Chatterbox TTS       | ~4-6 GB     | Voice cloning is memory-intensive |
| **Combined minimum** | **8-12 GB** | For 7B LLM + Chatterbox           |

**Recommendations:**

- 8 GB VRAM: Adequate for a small LLM + Chatterbox (may need to alternate)
- 12 GB VRAM: Comfortable for a 7B LLM + Chatterbox simultaneously
- 24 GB VRAM: Supports larger LLMs + Chatterbox with headroom

If VRAM is limited, consider:

1. Disabling Chatterbox (`TTS_PREMIUM_ENABLED=false`) and using Kokoro (CPU) as the default
2. Using the fallback chain so Kokoro handles requests when Chatterbox is busy
3. Running Chatterbox on a separate GPU host

### Docker Swarm GPU Scheduling

For Docker Swarm deployments with GPU, configure generic resources on the node in `/etc/docker/daemon.json`:

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime"
    }
  },
  "node-generic-resources": ["NVIDIA-GPU=0"]
}
```

See the [Docker GPU Swarm documentation](https://docs.docker.com/engine/daemon/nvidia-gpu/#configure-gpus-for-docker-swarm) for details.

---
## Frontend Integration

Speech services are consumed from the frontend through the REST API and the WebSocket gateway.

### REST API Usage

**Transcribe audio:**

```typescript
async function transcribeAudio(file: File, token: string, workspaceId: string) {
  const formData = new FormData();
  formData.append("file", file);
  formData.append("language", "en");

  const response = await fetch("/api/speech/transcribe", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
    body: formData,
  });

  const { data } = await response.json();
  return data.text;
}
```
**Synthesize speech:**

```typescript
async function synthesizeSpeech(
  text: string,
  token: string,
  workspaceId: string,
  voice = "af_heart"
) {
  const response = await fetch("/api/speech/synthesize", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text, voice, format: "mp3" }),
  });

  const audioBlob = await response.blob();
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  audio.play();
}
```
**List voices:**

```typescript
async function listVoices(token: string, workspaceId: string, tier?: string) {
  const url = tier ? `/api/speech/voices?tier=${tier}` : "/api/speech/voices";

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });

  const { data } = await response.json();
  return data; // VoiceInfo[]
}
```
### WebSocket Streaming Usage

For real-time transcription using the browser's MediaRecorder API:

```typescript
import { io } from "socket.io-client";

function createSpeechSocket(token: string) {
  const socket = io("/speech", {
    auth: { token },
  });

  let mediaRecorder: MediaRecorder | null = null;

  async function startRecording() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    mediaRecorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
    });

    socket.emit("start-transcription", { language: "en" });

    mediaRecorder.ondataavailable = (event) => {
      if (event.data.size > 0) {
        event.data.arrayBuffer().then((buffer) => {
          socket.emit("audio-chunk", new Uint8Array(buffer));
        });
      }
    };

    mediaRecorder.start(250); // Send chunks every 250 ms
  }

  async function stopRecording(): Promise<string> {
    return new Promise((resolve, reject) => {
      socket.once("transcription-final", (result) => {
        resolve(result.text);
      });

      socket.once("transcription-error", ({ message }) => {
        reject(new Error(message));
      });

      if (mediaRecorder) {
        mediaRecorder.stop();
        mediaRecorder.stream.getTracks().forEach((track) => track.stop());
        mediaRecorder = null;
      }

      socket.emit("stop-transcription");
    });
  }

  return { socket, startRecording, stopRecording };
}
```
### Check Speech Availability

Before showing speech UI elements, check provider availability:

```typescript
async function checkSpeechHealth(token: string, workspaceId: string) {
  const response = await fetch("/api/speech/health", {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });

  const { data } = await response.json();
  return {
    canTranscribe: data.stt.available,
    canSynthesize: data.tts.available,
  };
}
```