# Speech Services

Mosaic Stack provides integrated speech-to-text (STT) and text-to-speech (TTS) services through a provider abstraction layer. Speech services are optional and modular -- each component can be independently enabled, disabled, or pointed at external infrastructure.

## Table of Contents

- [Architecture Overview](#architecture-overview)
- [Provider Abstraction](#provider-abstraction)
- [TTS Tier System and Fallback Chain](#tts-tier-system-and-fallback-chain)
- [API Endpoint Reference](#api-endpoint-reference)
- [WebSocket Streaming Protocol](#websocket-streaming-protocol)
- [Environment Variable Reference](#environment-variable-reference)
- [Provider Configuration](#provider-configuration)
- [Voice Cloning Setup (Chatterbox)](#voice-cloning-setup-chatterbox)
- [Docker Compose Setup](#docker-compose-setup)
- [GPU VRAM Budget](#gpu-vram-budget)
- [Frontend Integration](#frontend-integration)

---

## Architecture Overview

```
            +-------------------+
            | SpeechController  |
            | (REST endpoints)  |
            +--------+----------+
                     |
      +--------------+--------------+
      |        SpeechService        |
      |    (provider selection,     |
      |   fallback orchestration)   |
      +------+---------------+------+
             |               |
      +------+-------+  +----+-------------+
      | STT Provider |  | TTS Provider Map |
      |  (Speaches)  |  +----+--------+----+
      +------+-------+       |        |
             |               |        |
      +------+-----+  +------+----+  ++-----------+
      |  Speaches  |  |  Kokoro   |  | Chatterbox |
      |  (Whisper) |  | (default) |  | (premium)  |
      +------------+  +-----------+  +-----+------+
                                           |
                                     +-----+-----+
                                     |   Piper   |
                                     | (fallback)|
                                     +-----------+

            +---------------------+
            |    SpeechGateway    |
            | (WebSocket /speech) |
            +----------+----------+
                       |
         Uses SpeechService.transcribe()
```

The speech module (`apps/api/src/speech/`) is a self-contained NestJS module consisting of:

| Component  | File                   | Purpose                                    |
| ---------- | ---------------------- | ------------------------------------------ |
| Module     | `speech.module.ts`     | Registers providers, controllers, gateway  |
| Config     | `speech.config.ts`     | Environment validation and typed config    |
| Service    | `speech.service.ts`    | High-level speech operations with fallback |
| Controller | `speech.controller.ts` | REST API endpoints                         |
| Gateway    | `speech.gateway.ts`    | WebSocket streaming transcription          |
| Constants  | `speech.constants.ts`  | NestJS injection tokens                    |

### Key Design Decisions

1. **OpenAI-compatible APIs**: All providers (Speaches, Kokoro, Chatterbox, Piper/OpenedAI) expose OpenAI-compatible endpoints. The official OpenAI SDK is used as the HTTP client with a custom `baseURL`.
2. **Provider abstraction**: STT and TTS providers implement well-defined interfaces (`ISTTProvider`, `ITTSProvider`). New providers can be added without modifying the service layer.
3. **Conditional registration**: Providers are only instantiated when their corresponding `*_ENABLED` flag is `true`. The STT provider uses NestJS `@Optional()` injection.
4. **Fail-fast validation**: Configuration is validated at module initialization. If a service is enabled but its URL is missing, the application fails on startup with a descriptive error.

---

## Provider Abstraction

### STT Provider Interface

```typescript
interface ISTTProvider {
  readonly name: string;
  transcribe(audio: Buffer, options?: TranscribeOptions): Promise<TranscriptionResult>;
  isHealthy(): Promise<boolean>;
}
```

Currently implemented by `SpeachesSttProvider`, which connects to a Speaches (faster-whisper) server.

### TTS Provider Interface

```typescript
interface ITTSProvider {
  readonly name: string;
  readonly tier: SpeechTier;
  synthesize(text: string, options?: SynthesizeOptions): Promise<SynthesisResult>;
  listVoices(): Promise<VoiceInfo[]>;
  isHealthy(): Promise<boolean>;
}
```

All TTS providers extend `BaseTTSProvider`, an abstract class that implements the common OpenAI-compatible synthesis logic. Concrete providers only need to set `name` and `tier`, and can optionally override `listVoices()` or `synthesize()`.
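The `BaseTTSProvider` pattern described above can be sketched as follows. This is a simplified, illustrative version: the real interfaces and classes live under `apps/api/src/speech/` and carry more fields and actual HTTP calls, and the request-body defaults shown here are assumptions.

```typescript
// Simplified sketch of the abstract-base pattern: shared synthesis logic
// lives in the base class; a concrete provider only declares name and tier.
type SpeechTier = "default" | "premium" | "fallback";

interface SynthesizeOptions {
  voice?: string;
  speed?: number;
  format?: string;
}

// Shape of the OpenAI-compatible request body the base class would send.
interface OpenAISpeechRequest {
  model: string;
  input: string;
  voice: string;
  speed: number;
  response_format: string;
}

abstract class BaseTTSProvider {
  abstract readonly name: string;
  abstract readonly tier: SpeechTier;

  // The real implementation would POST this body to the provider's
  // OpenAI-compatible endpoint; this sketch just builds it. Defaults
  // ("tts-1", "af_heart") are illustrative.
  async synthesize(text: string, options: SynthesizeOptions = {}): Promise<OpenAISpeechRequest> {
    return {
      model: "tts-1",
      input: text,
      voice: options.voice ?? "af_heart",
      speed: options.speed ?? 1.0,
      response_format: options.format ?? "mp3",
    };
  }
}

// A concrete provider only declares its identity; synthesis is inherited.
class KokoroTtsProvider extends BaseTTSProvider {
  readonly name = "kokoro";
  readonly tier: SpeechTier = "default";
}
```

A provider that needs special behavior (e.g., Chatterbox's voice-cloning parameters) overrides `synthesize()` instead of reimplementing the HTTP plumbing.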
### Provider Registration

Providers are created by the TTS provider factory (`providers/tts-provider.factory.ts`) based on configuration:

| Tier       | Provider Class          | Engine                    | Requirements |
| ---------- | ----------------------- | ------------------------- | ------------ |
| `default`  | `KokoroTtsProvider`     | Kokoro-FastAPI            | CPU only     |
| `premium`  | `ChatterboxTTSProvider` | Chatterbox TTS Server     | NVIDIA GPU   |
| `fallback` | `PiperTtsProvider`      | Piper via OpenedAI Speech | CPU only     |

---

## TTS Tier System and Fallback Chain

TTS uses a tiered architecture with automatic fallback:

```
Request with tier="premium"
        |
        v
[premium]  Chatterbox available? --yes--> Use Chatterbox
        | no (or synthesis fails)
        v
[default]  Kokoro available? -----yes---> Use Kokoro
        | no (or synthesis fails)
        v
[fallback] Piper available? ------yes---> Use Piper
        | no (or synthesis fails)
        v
ServiceUnavailableException
```

**Fallback order:** `premium` -> `default` -> `fallback`

The fallback chain starts from the requested tier and proceeds downward. A tier is only attempted if:

1. It is enabled in configuration (`TTS_ENABLED`, `TTS_PREMIUM_ENABLED`, `TTS_FALLBACK_ENABLED`)
2. A provider is registered for that tier

If no tier is specified in the request, `default` is used as the starting point.

---

## API Endpoint Reference

All speech endpoints are under `/api/speech/` and require authentication (Bearer token) plus workspace context (`x-workspace-id` header).

### POST /api/speech/transcribe

Transcribe an uploaded audio file to text.

**Authentication:** Bearer token + workspace membership
**Content-Type:** `multipart/form-data`

**Form Fields:**

| Field         | Type   | Required | Description                                            |
| ------------- | ------ | -------- | ------------------------------------------------------ |
| `file`        | File   | Yes      | Audio file (max 25 MB)                                 |
| `language`    | string | No       | Language code (e.g., "en", "fr"). Default: from config |
| `model`       | string | No       | Whisper model override. Default: from config           |
| `prompt`      | string | No       | Prompt to guide transcription (max 1000 chars)         |
| `temperature` | number | No       | Temperature 0.0-1.0. Lower = more deterministic        |

**Accepted Audio Formats:** `audio/wav`, `audio/mp3`, `audio/mpeg`, `audio/webm`, `audio/ogg`, `audio/flac`, `audio/x-m4a`

**Response:**

```json
{
  "data": {
    "text": "Hello, this is a transcription test.",
    "language": "en",
    "durationSeconds": 3.5,
    "confidence": 0.95,
    "segments": [
      {
        "text": "Hello, this is a transcription test.",
        "start": 0.0,
        "end": 3.5,
        "confidence": 0.95
      }
    ]
  }
}
```

**Example:**

```bash
curl -X POST http://localhost:3001/api/speech/transcribe \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -F "file=@recording.wav" \
  -F "language=en"
```

### POST /api/speech/synthesize

Synthesize text to audio using TTS providers.

**Authentication:** Bearer token + workspace membership
**Content-Type:** `application/json`

**Request Body:**

| Field    | Type   | Required | Description                                                 |
| -------- | ------ | -------- | ----------------------------------------------------------- |
| `text`   | string | Yes      | Text to synthesize (max 4096 chars)                         |
| `voice`  | string | No       | Voice ID. Default: from config (e.g., "af_heart")           |
| `speed`  | number | No       | Speed multiplier 0.5-2.0. Default: 1.0                      |
| `format` | string | No       | Output format: mp3, wav, opus, flac, aac, pcm. Default: mp3 |
| `tier`   | string | No       | Provider tier: default, premium, fallback. Default: default |

**Response:** Binary audio data with the appropriate `Content-Type` header.
| Format | Content-Type |
| ------ | ------------ |
| mp3    | `audio/mpeg` |
| wav    | `audio/wav`  |
| opus   | `audio/opus` |
| flac   | `audio/flac` |
| aac    | `audio/aac`  |
| pcm    | `audio/pcm`  |

**Example:**

```bash
curl -X POST http://localhost:3001/api/speech/synthesize \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "af_heart", "format": "mp3"}' \
  --output speech.mp3
```

### GET /api/speech/voices

List available TTS voices across all tiers.

**Authentication:** Bearer token + workspace access

**Query Parameters:**

| Parameter | Type   | Required | Description                                |
| --------- | ------ | -------- | ------------------------------------------ |
| `tier`    | string | No       | Filter by tier: default, premium, fallback |

**Response:**

```json
{
  "data": [
    {
      "id": "af_heart",
      "name": "Heart (American Female)",
      "language": "en-US",
      "tier": "default",
      "isDefault": true
    },
    {
      "id": "am_adam",
      "name": "Adam (American Male)",
      "language": "en-US",
      "tier": "default",
      "isDefault": false
    }
  ]
}
```

**Example:**

```bash
curl -X GET 'http://localhost:3001/api/speech/voices?tier=default' \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID"
```

### GET /api/speech/health

Check availability of STT and TTS providers.

**Authentication:** Bearer token + workspace access

**Response:**

```json
{
  "data": {
    "stt": { "available": true },
    "tts": { "available": true }
  }
}
```

---

## WebSocket Streaming Protocol

The speech module provides a WebSocket gateway at namespace `/speech` for real-time streaming transcription. Audio chunks are accumulated on the server and transcribed when the session is stopped.

### Connection

Connect to the `/speech` namespace with authentication:

```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: "YOUR_SESSION_TOKEN" },
});
```

**Authentication methods** (checked in order):

1. `auth.token` in the handshake
2. `query.token` in the handshake URL
3. `Authorization: Bearer <token>` header

Connection is rejected if:

- No valid token is provided
- Session verification fails
- User has no workspace membership

**Connection timeout:** 5 seconds for authentication.

### Protocol Flow

```
Client                                 Server
  |                                      |
  |--- connect (with token) ------------>|
  |                                      | (authenticate, check workspace)
  |<--- connected -----------------------|
  |                                      |
  |--- start-transcription ------------->|  { language?: "en" }
  |<--- transcription-started -----------|  { sessionId, language }
  |                                      |
  |--- audio-chunk --------------------->|  (Buffer/Uint8Array)
  |--- audio-chunk --------------------->|  (Buffer/Uint8Array)
  |--- audio-chunk --------------------->|  (Buffer/Uint8Array)
  |                                      |
  |--- stop-transcription -------------->|
  |                                      | (concatenate chunks, transcribe)
  |<--- transcription-final -------------|  { text, language, durationSeconds, ... }
  |                                      |
```

### Client Events (emit)

| Event                 | Payload                  | Description                              |
| --------------------- | ------------------------ | ---------------------------------------- |
| `start-transcription` | `{ language?: string }`  | Begin a new transcription session        |
| `audio-chunk`         | `Buffer` or `Uint8Array` | Send audio data chunk                    |
| `stop-transcription`  | (none)                   | Stop recording and trigger transcription |

### Server Events (listen)

| Event                   | Payload                                                     | Description                |
| ----------------------- | ----------------------------------------------------------- | -------------------------- |
| `transcription-started` | `{ sessionId, language }`                                   | Session created            |
| `transcription-final`   | `{ text, language, durationSeconds, confidence, segments }` | Transcription result       |
| `transcription-error`   | `{ message }`                                               | Error during transcription |

### Session Management

- One active transcription session per client connection
- Starting a new session replaces any existing session
- Sessions are cleaned up on client disconnect
- Audio chunks are accumulated in memory
- Total accumulated size is capped by `SPEECH_MAX_UPLOAD_SIZE` (default: 25 MB)

### Example Client Usage

```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: sessionToken },
});

// Start recording
socket.emit("start-transcription", { language: "en" });

socket.on("transcription-started", ({ sessionId }) => {
  console.log("Session started:", sessionId);
});

// Stream audio chunks from MediaRecorder
mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    event.data.arrayBuffer().then((buffer) => {
      socket.emit("audio-chunk", new Uint8Array(buffer));
    });
  }
};

// Stop and get result
socket.emit("stop-transcription");

socket.on("transcription-final", (result) => {
  console.log("Transcription:", result.text);
  console.log("Duration:", result.durationSeconds, "seconds");
});

socket.on("transcription-error", ({ message }) => {
  console.error("Transcription error:", message);
});
```

---

## Environment Variable Reference

### Speech-to-Text (STT)

| Variable       | Default                                 | Description                                          |
| -------------- | --------------------------------------- | ---------------------------------------------------- |
| `STT_ENABLED`  | `false`                                 | Enable speech-to-text transcription                  |
| `STT_BASE_URL` | `http://speaches:8000/v1`               | Speaches server URL (required when STT_ENABLED=true) |
| `STT_MODEL`    | `Systran/faster-whisper-large-v3-turbo` | Whisper model for transcription                      |
| `STT_LANGUAGE` | `en`                                    | Default language code                                |

### Text-to-Speech (TTS) - Default Engine (Kokoro)

| Variable             | Default                     | Description                                         |
| -------------------- | --------------------------- | --------------------------------------------------- |
| `TTS_ENABLED`        | `false`                     | Enable default TTS engine                           |
| `TTS_DEFAULT_URL`    | `http://kokoro-tts:8880/v1` | Kokoro-FastAPI URL (required when TTS_ENABLED=true) |
| `TTS_DEFAULT_VOICE`  | `af_heart`                  | Default Kokoro voice ID                             |
| `TTS_DEFAULT_FORMAT` | `mp3`                       | Default audio output format                         |

### Text-to-Speech (TTS) - Premium Engine (Chatterbox)

| Variable              | Default                         | Description                                                 |
| --------------------- | ------------------------------- | ----------------------------------------------------------- |
| `TTS_PREMIUM_ENABLED` | `false`                         | Enable premium TTS engine                                   |
| `TTS_PREMIUM_URL`     | `http://chatterbox-tts:8881/v1` | Chatterbox TTS URL (required when TTS_PREMIUM_ENABLED=true) |

### Text-to-Speech (TTS) - Fallback Engine (Piper/OpenedAI)

| Variable               | Default                          | Description                                                   |
| ---------------------- | -------------------------------- | ------------------------------------------------------------- |
| `TTS_FALLBACK_ENABLED` | `false`                          | Enable fallback TTS engine                                    |
| `TTS_FALLBACK_URL`     | `http://openedai-speech:8000/v1` | OpenedAI Speech URL (required when TTS_FALLBACK_ENABLED=true) |

### Service Limits

| Variable                      | Default    | Description                                    |
| ----------------------------- | ---------- | ---------------------------------------------- |
| `SPEECH_MAX_UPLOAD_SIZE`      | `25000000` | Maximum upload file size in bytes (25 MB)      |
| `SPEECH_MAX_DURATION_SECONDS` | `600`      | Maximum audio duration in seconds (10 minutes) |
| `SPEECH_MAX_TEXT_LENGTH`      | `4096`     | Maximum text length for TTS in characters      |

### Conditional Validation

When a service is enabled, its URL variable is required. If it is missing, the application fails at startup with a message like:

```
STT is enabled (STT_ENABLED=true) but required environment variables are missing or empty: STT_BASE_URL. Either set these variables or disable by setting STT_ENABLED=false.
```

Boolean parsing: a value counts as `true` only when it equals `"true"` or `"1"`. Unset or empty values default to `false`.
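The parsing and validation rules above can be sketched as follows. The function names here are illustrative, not the module's actual API; only the behavior (strict boolean parsing, fail-fast URL check, and the error-message shape) follows the documented rules.

```typescript
// Strict boolean parsing: only "true" or "1" enable a feature.
// Unset or empty values (and anything else, e.g. "TRUE") are false.
function parseBool(value: string | undefined): boolean {
  return value === "true" || value === "1";
}

// Fail-fast check: if a feature's *_ENABLED flag is set, its URL
// variable must be present, otherwise startup aborts with a
// descriptive error like the one shown above.
function requireUrlWhenEnabled(
  feature: string,
  enabledVar: string,
  urlVar: string,
  env: Record<string, string | undefined>
): void {
  if (!parseBool(env[enabledVar])) return; // disabled: URL not required
  if (!env[urlVar]) {
    throw new Error(
      `${feature} is enabled (${enabledVar}=true) but required environment variables ` +
        `are missing or empty: ${urlVar}. Either set these variables or disable by ` +
        `setting ${enabledVar}=false.`
    );
  }
}
```

For example, `requireUrlWhenEnabled("STT", "STT_ENABLED", "STT_BASE_URL", process.env)` passes silently when STT is disabled and throws at startup when it is enabled without a URL.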
---

## Provider Configuration

### Kokoro (Default Tier)

**Engine:** [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)
**License:** Apache 2.0
**Requirements:** CPU only
**Docker Image:** `ghcr.io/remsky/kokoro-fastapi:latest-cpu`

**Capabilities:**

- 53 built-in voices across 8 languages
- Speed control: 0.25x to 4.0x
- Output formats: mp3, wav, opus, flac
- Voice metadata derived from ID prefix (language, gender, accent)

**Voice ID Format:** `{lang}{gender}_{name}`

- First character: language/accent (a=American, b=British, e=Spanish, f=French, h=Hindi, j=Japanese, p=Portuguese, z=Chinese)
- Second character: gender (f=Female, m=Male)

**Example voices:**

| Voice ID     | Name    | Language | Gender |
| ------------ | ------- | -------- | ------ |
| `af_heart`   | Heart   | en-US    | Female |
| `am_adam`    | Adam    | en-US    | Male   |
| `bf_alice`   | Alice   | en-GB    | Female |
| `bm_daniel`  | Daniel  | en-GB    | Male   |
| `ef_dora`    | Dora    | es       | Female |
| `ff_camille` | Camille | fr       | Female |
| `jf_alpha`   | Alpha   | ja       | Female |
| `zf_xiaobei` | Xiaobei | zh       | Female |

### Chatterbox (Premium Tier)

**Engine:** [Chatterbox TTS Server](https://github.com/devnen/chatterbox-tts-server)
**License:** Proprietary
**Requirements:** NVIDIA GPU with CUDA
**Docker Image:** `devnen/chatterbox-tts-server:latest`

**Capabilities:**

- Voice cloning via reference audio sample
- Emotion exaggeration control (0.0 - 1.0)
- Cross-language voice transfer (23 languages)
- Higher quality synthesis than default tier

**Supported Languages:** en, fr, de, es, it, pt, nl, pl, ru, uk, ja, zh, ko, ar, hi, tr, sv, da, fi, no, cs, el, ro

**Extended Options (Chatterbox-specific):**

| Option                | Type   | Description                                               |
| --------------------- | ------ | --------------------------------------------------------- |
| `referenceAudio`      | Buffer | Audio sample for voice cloning (5-30 seconds recommended) |
| `emotionExaggeration` | number | Emotion intensity 0.0-1.0 (clamped)                       |

These are passed as extra body parameters to the OpenAI-compatible endpoint. Reference audio is base64-encoded before sending.

### Piper (Fallback Tier)

**Engine:** [Piper](https://github.com/rhasspy/piper) via [OpenedAI Speech](https://github.com/matatonic/openedai-speech)
**License:** GPL (OpenedAI Speech)
**Requirements:** CPU only (runs on Raspberry Pi)
**Docker Image:** Use OpenedAI Speech image

**Capabilities:**

- 100+ voices across 40+ languages
- 6 standard OpenAI voice names (mapped to Piper voices)
- Output formats: mp3, wav, opus, flac
- Ultra-lightweight, designed for low-resource environments

**Standard Voice Mapping:**

| OpenAI Voice | Piper Voice          | Gender | Description           |
| ------------ | -------------------- | ------ | --------------------- |
| `alloy`      | en_US-amy-medium     | Female | Warm, balanced        |
| `echo`       | en_US-ryan-medium    | Male   | Clear, articulate     |
| `fable`      | en_GB-alan-medium    | Male   | British narrator      |
| `onyx`       | en_US-danny-low      | Male   | Deep, resonant        |
| `nova`       | en_US-lessac-medium  | Female | Expressive, versatile |
| `shimmer`    | en_US-kristin-medium | Female | Bright, energetic     |

### Speaches (STT)

**Engine:** [Speaches](https://github.com/speaches-ai/speaches) (faster-whisper backend)
**License:** MIT
**Requirements:** CPU (GPU optional for faster inference)
**Docker Image:** `ghcr.io/speaches-ai/speaches:latest`

**Capabilities:**

- OpenAI-compatible `/v1/audio/transcriptions` endpoint
- Whisper models via faster-whisper
- Verbose JSON response with segments and timestamps
- Language detection

**Default model:** `Systran/faster-whisper-large-v3-turbo`

---

## Voice Cloning Setup (Chatterbox)

Voice cloning is available through the Chatterbox premium TTS provider.

### Prerequisites

1. NVIDIA GPU with CUDA support
2. `nvidia-container-toolkit` installed on the Docker host
3. Docker runtime configured for GPU access
4. TTS premium tier enabled (`TTS_PREMIUM_ENABLED=true`)

### Basic Voice Cloning

Provide a reference audio sample (WAV or MP3, 5-30 seconds) when calling synthesize:

```typescript
import { SpeechService } from "./speech.service";
import type { ChatterboxSynthesizeOptions } from "./interfaces/speech-types";

const options: ChatterboxSynthesizeOptions = {
  tier: "premium",
  referenceAudio: myAudioBuffer, // 5-30 second audio sample
  emotionExaggeration: 0.5, // 0.0 = neutral, 1.0 = maximum emotion
};

const result = await speechService.synthesize("Hello, this is my cloned voice!", options);
```

### Voice Cloning Tips

- **Audio quality:** Use clean recordings without background noise
- **Duration:** 5-30 seconds works best; shorter clips may produce lower quality
- **Format:** WAV provides the best quality; MP3 is also accepted
- **Emotion:** Start with 0.5 (moderate) and adjust from there
- **Cross-language:** You can clone a voice in one language and synthesize in another

---

## Docker Compose Setup

### Development (Local)

Speech services are defined in a separate overlay file, `docker-compose.speech.yml`. This keeps them optional and separate from core services.
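A trimmed, illustrative sketch of what such an overlay can contain is shown below. Service names, images, and port mappings match the defaults documented in this guide; the healthcheck and GPU reservation details are assumptions, and the real `docker-compose.speech.yml` may differ.

```yaml
# docker-compose.speech.yml (illustrative excerpt, not the actual file)
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest
    ports:
      - "8090:8000"

  kokoro-tts:
    image: ghcr.io/remsky/kokoro-fastapi:latest-cpu
    ports:
      - "8880:8880"

  chatterbox-tts:
    image: devnen/chatterbox-tts-server:latest
    profiles: ["premium-tts"] # only started with --profile premium-tts
    ports:
      - "8881:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

Because Chatterbox sits behind a Compose profile, plain `docker compose up` starts only the CPU services; the GPU service is opt-in.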
**Start basic speech services (STT + default TTS):**

```bash
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml up -d

# Using Makefile
make speech-up
```

**Start with premium TTS (requires NVIDIA GPU):**

```bash
docker compose -f docker-compose.yml -f docker-compose.speech.yml --profile premium-tts up -d
```

**Stop speech services:**

```bash
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml down --remove-orphans

# Using Makefile
make speech-down
```

**View logs:**

```bash
make speech-logs
```

### Development Services

| Service        | Container             | Port                            | Image                                      |
| -------------- | --------------------- | ------------------------------- | ------------------------------------------ |
| Speaches (STT) | mosaic-speaches       | 8090 (host) -> 8000 (container) | `ghcr.io/speaches-ai/speaches:latest`      |
| Kokoro TTS     | mosaic-kokoro-tts     | 8880 (host) -> 8880 (container) | `ghcr.io/remsky/kokoro-fastapi:latest-cpu` |
| Chatterbox TTS | mosaic-chatterbox-tts | 8881 (host) -> 8000 (container) | `devnen/chatterbox-tts-server:latest`      |

### Production (Docker Swarm)

For production deployments, use `docker/docker-compose.sample.speech.yml`. This file is designed for Docker Swarm with Traefik integration.
**Required environment variables:**

```bash
STT_DOMAIN=stt.example.com
TTS_DOMAIN=tts.example.com
```

**Optional environment variables:**

```bash
WHISPER_MODEL=Systran/faster-whisper-large-v3-turbo
CHATTERBOX_TTS_DOMAIN=tts-premium.example.com
TRAEFIK_ENTRYPOINT=websecure
TRAEFIK_CERTRESOLVER=letsencrypt
TRAEFIK_DOCKER_NETWORK=traefik-public
TRAEFIK_TLS_ENABLED=true
```

**Deploy:**

```bash
docker stack deploy -c docker/docker-compose.sample.speech.yml speech
```

**Connecting to Mosaic Stack:**

Set the speech URLs in your Mosaic Stack `.env`:

```bash
# Same Docker network
STT_BASE_URL=http://speaches:8000/v1
TTS_DEFAULT_URL=http://kokoro-tts:8880/v1

# External / different network
STT_BASE_URL=https://stt.example.com/v1
TTS_DEFAULT_URL=https://tts.example.com/v1
```

### Health Checks

All speech containers include health checks:

| Service        | Endpoint                       | Interval | Start Period |
| -------------- | ------------------------------ | -------- | ------------ |
| Speaches       | `http://localhost:8000/health` | 30s      | 120s         |
| Kokoro TTS     | `http://localhost:8880/health` | 30s      | 120s         |
| Chatterbox TTS | `http://localhost:8000/health` | 30s      | 180s         |

Chatterbox has a longer start period (180s) because GPU model loading takes additional time.

---

## GPU VRAM Budget

Only Chatterbox requires GPU resources. The other providers (Speaches, Kokoro, Piper) are CPU-only.
### Chatterbox VRAM Requirements

| Component               | Approximate VRAM   |
| ----------------------- | ------------------ |
| Chatterbox TTS model    | ~2-4 GB            |
| Voice cloning inference | ~1-2 GB additional |
| **Total recommended**   | **4-6 GB**         |

### Shared GPU Considerations

If running multiple GPU services (e.g., Ollama for LLM + Chatterbox for TTS):

| Service              | VRAM Usage  | Notes                             |
| -------------------- | ----------- | --------------------------------- |
| Ollama (7B model)    | ~4-6 GB     | Depends on model size             |
| Ollama (13B model)   | ~8-10 GB    | Larger models need more           |
| Chatterbox TTS       | ~4-6 GB     | Voice cloning is memory-intensive |
| **Combined minimum** | **8-12 GB** | For 7B LLM + Chatterbox           |

**Recommendations:**

- 8 GB VRAM: Adequate for small LLM + Chatterbox (may need to alternate)
- 12 GB VRAM: Comfortable for 7B LLM + Chatterbox simultaneously
- 24 GB VRAM: Supports larger LLMs + Chatterbox with headroom

If VRAM is limited, consider:

1. Disabling Chatterbox (`TTS_PREMIUM_ENABLED=false`) and using Kokoro (CPU) as default
2. Using the fallback chain so Kokoro handles requests when Chatterbox is busy
3. Running Chatterbox on a separate GPU host

### Docker Swarm GPU Scheduling

For Docker Swarm deployments with GPU, configure generic resources on the node:

```json
// /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": { "path": "nvidia-container-runtime" }
  },
  "node-generic-resources": ["NVIDIA-GPU=0"]
}
```

See the [Docker GPU Swarm documentation](https://docs.docker.com/engine/daemon/nvidia-gpu/#configure-gpus-for-docker-swarm) for details.

---

## Frontend Integration

Speech services are consumed from the frontend through the REST API and WebSocket gateway.
### REST API Usage

**Transcribe audio:**

```typescript
async function transcribeAudio(file: File, token: string, workspaceId: string) {
  const formData = new FormData();
  formData.append("file", file);
  formData.append("language", "en");

  const response = await fetch("/api/speech/transcribe", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
    body: formData,
  });

  const { data } = await response.json();
  return data.text;
}
```

**Synthesize speech:**

```typescript
async function synthesizeSpeech(
  text: string,
  token: string,
  workspaceId: string,
  voice = "af_heart"
) {
  const response = await fetch("/api/speech/synthesize", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text, voice, format: "mp3" }),
  });

  const audioBlob = await response.blob();
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  audio.play();
}
```

**List voices:**

```typescript
async function listVoices(token: string, workspaceId: string, tier?: string) {
  const url = tier ? `/api/speech/voices?tier=${tier}` : "/api/speech/voices";

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });

  const { data } = await response.json();
  return data; // VoiceInfo[]
}
```

### WebSocket Streaming Usage

For real-time transcription using the browser's MediaRecorder API:

```typescript
import { io } from "socket.io-client";

function createSpeechSocket(token: string) {
  const socket = io("/speech", {
    auth: { token },
  });

  let mediaRecorder: MediaRecorder | null = null;

  async function startRecording() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    mediaRecorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
    });

    socket.emit("start-transcription", { language: "en" });

    mediaRecorder.ondataavailable = (event) => {
      if (event.data.size > 0) {
        event.data.arrayBuffer().then((buffer) => {
          socket.emit("audio-chunk", new Uint8Array(buffer));
        });
      }
    };

    mediaRecorder.start(250); // Send chunks every 250ms
  }

  async function stopRecording(): Promise<string> {
    return new Promise((resolve, reject) => {
      socket.once("transcription-final", (result) => {
        resolve(result.text);
      });
      socket.once("transcription-error", ({ message }) => {
        reject(new Error(message));
      });

      if (mediaRecorder) {
        mediaRecorder.stop();
        mediaRecorder.stream.getTracks().forEach((track) => track.stop());
        mediaRecorder = null;
      }

      socket.emit("stop-transcription");
    });
  }

  return { socket, startRecording, stopRecording };
}
```

### Check Speech Availability

Before showing speech UI elements, check provider availability:

```typescript
async function checkSpeechHealth(token: string, workspaceId: string) {
  const response = await fetch("/api/speech/health", {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });

  const { data } = await response.json();
  return {
    canTranscribe: data.stt.available,
    canSynthesize: data.tts.available,
  };
}
```