Speech Services
Mosaic Stack provides integrated speech-to-text (STT) and text-to-speech (TTS) services through a provider abstraction layer. Speech services are optional and modular -- each component can be independently enabled, disabled, or pointed at external infrastructure.
Table of Contents
- Architecture Overview
- Provider Abstraction
- TTS Tier System and Fallback Chain
- API Endpoint Reference
- WebSocket Streaming Protocol
- Environment Variable Reference
- Provider Configuration
- Voice Cloning Setup (Chatterbox)
- Docker Compose Setup
- GPU VRAM Budget
- Frontend Integration
Architecture Overview
+-------------------+
| SpeechController |
| (REST endpoints) |
+--------+----------+
|
+--------------+--------------+
| SpeechService |
| (provider selection, |
| fallback orchestration) |
+---------+----------+---------+
| |
+------------+ +-----+-------+
| | |
+------+------+ +-----+-----+ +-----+-----+
| STT Provider| |TTS Provider| |TTS Provider|
| (Speaches) | |Map<Tier,P> | |Map<Tier,P> |
+------+------+ +-----+-----+ +-----+-----+
| | |
+------+------+ +-----+-----+ +-----+-----+
| Speaches | | Kokoro | | Chatterbox |
| (Whisper) | | (default) | | (premium) |
+-------------+ +-----------+ +-----+------+
|
+-----+-----+
| Piper |
| (fallback)|
+-----------+
+-------------------+
| SpeechGateway |
| (WebSocket /speech)
+--------+----------+
|
Uses SpeechService.transcribe()
The speech module (apps/api/src/speech/) is a self-contained NestJS module consisting of:
| Component | File | Purpose |
|---|---|---|
| Module | speech.module.ts | Registers providers, controllers, gateway |
| Config | speech.config.ts | Environment validation and typed config |
| Service | speech.service.ts | High-level speech operations with fallback |
| Controller | speech.controller.ts | REST API endpoints |
| Gateway | speech.gateway.ts | WebSocket streaming transcription |
| Constants | speech.constants.ts | NestJS injection tokens |
Key Design Decisions
- OpenAI-compatible APIs: All providers (Speaches, Kokoro, Chatterbox, Piper/OpenedAI) expose OpenAI-compatible endpoints. The official OpenAI SDK is used as the HTTP client with a custom baseURL.
- Provider abstraction: STT and TTS providers implement well-defined interfaces (ISTTProvider, ITTSProvider). New providers can be added without modifying the service layer.
- Conditional registration: Providers are only instantiated when their corresponding *_ENABLED flag is true. The STT provider uses NestJS @Optional() injection.
- Fail-fast validation: Configuration is validated at module initialization. If a service is enabled but its URL is missing, the application fails on startup with a descriptive error.
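The conditional-registration and fail-fast decisions above can be sketched together. This is an illustrative simplification, not the actual factory code; the types and function names here (`TtsTierConfig`, `buildTtsProviders`, `ProviderStub`) are assumptions for the example:

```typescript
// Sketch of conditional provider registration with fail-fast validation.
// Names are illustrative, not the real factory implementation.
type SpeechTier = "default" | "premium" | "fallback";

interface TtsTierConfig {
  enabled: boolean;
  url?: string;
}

// Hypothetical provider stand-in: just records its tier and base URL.
interface ProviderStub {
  tier: SpeechTier;
  baseURL: string;
}

function buildTtsProviders(
  config: Record<SpeechTier, TtsTierConfig>
): Map<SpeechTier, ProviderStub> {
  const providers = new Map<SpeechTier, ProviderStub>();
  for (const tier of ["default", "premium", "fallback"] as SpeechTier[]) {
    const tierConfig = config[tier];
    if (!tierConfig.enabled) continue; // disabled tiers are never instantiated
    if (!tierConfig.url) {
      // Fail-fast: enabled but misconfigured should abort startup.
      throw new Error(`TTS tier "${tier}" is enabled but its URL is missing`);
    }
    providers.set(tier, { tier, baseURL: tierConfig.url });
  }
  return providers;
}
```

The point of the pattern is that downstream code only ever sees the `Map` of registered tiers, so it never has to re-check the `*_ENABLED` flags.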
Provider Abstraction
STT Provider Interface
interface ISTTProvider {
readonly name: string;
transcribe(audio: Buffer, options?: TranscribeOptions): Promise<TranscriptionResult>;
isHealthy(): Promise<boolean>;
}
Currently implemented by SpeachesSttProvider which connects to a Speaches (faster-whisper) server.
TTS Provider Interface
interface ITTSProvider {
readonly name: string;
readonly tier: SpeechTier;
synthesize(text: string, options?: SynthesizeOptions): Promise<SynthesisResult>;
listVoices(): Promise<VoiceInfo[]>;
isHealthy(): Promise<boolean>;
}
All TTS providers extend BaseTTSProvider, an abstract class that implements common OpenAI-compatible synthesis logic. Concrete providers only need to set name and tier and optionally override listVoices() or synthesize().
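The pattern described above can be sketched as follows. This is a simplified stand-in, not the real base class: the actual BaseTTSProvider performs OpenAI-compatible HTTP synthesis, which is stubbed out here, and the sketch class names are illustrative:

```typescript
// Simplified sketch of the provider pattern; HTTP logic is stubbed.
type SpeechTier = "default" | "premium" | "fallback";

interface VoiceInfo { id: string; name: string; }
interface SynthesisResult { audio: Uint8Array; format: string; }

abstract class BaseTTSProviderSketch {
  abstract readonly name: string;
  abstract readonly tier: SpeechTier;

  // Shared OpenAI-compatible synthesis logic lives in the base class
  // (replaced here by a stub that returns empty audio).
  async synthesize(text: string): Promise<SynthesisResult> {
    return { audio: new Uint8Array(0), format: "mp3" };
  }

  async listVoices(): Promise<VoiceInfo[]> {
    return []; // concrete providers may override with a real voice list
  }

  async isHealthy(): Promise<boolean> {
    return true;
  }
}

// A concrete provider only needs to set name and tier,
// optionally overriding listVoices() or synthesize().
class KokoroTtsProviderSketch extends BaseTTSProviderSketch {
  readonly name = "kokoro";
  readonly tier: SpeechTier = "default";

  override async listVoices(): Promise<VoiceInfo[]> {
    return [{ id: "af_heart", name: "Heart (American Female)" }];
  }
}
```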
Provider Registration
Providers are created by the TTS Provider Factory (providers/tts-provider.factory.ts) based on configuration:
| Tier | Provider Class | Engine | Requirements |
|---|---|---|---|
| default | KokoroTtsProvider | Kokoro-FastAPI | CPU only |
| premium | ChatterboxTTSProvider | Chatterbox TTS Server | NVIDIA GPU |
| fallback | PiperTtsProvider | Piper via OpenedAI Speech | CPU only |
TTS Tier System and Fallback Chain
TTS uses a tiered architecture with automatic fallback:
Request with tier="premium"
|
v
[premium] Chatterbox available? --yes--> Use Chatterbox
| |
no (success/fail)
|
v
[default] Kokoro available? ------yes--> Use Kokoro
| |
no (success/fail)
|
v
[fallback] Piper available? -----yes--> Use Piper
| |
no (success/fail)
|
v
ServiceUnavailableException
Fallback order: premium -> default -> fallback
The fallback chain starts from the requested tier and proceeds downward. A tier is only attempted if:
- It is enabled in configuration (TTS_ENABLED, TTS_PREMIUM_ENABLED, TTS_FALLBACK_ENABLED)
- A provider is registered for that tier
If no tier is specified in the request, default is used as the starting point.
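The selection rule above can be expressed as a small pure function. This is a sketch of the documented behavior, not the actual service code; `tiersToAttempt` is a hypothetical helper name:

```typescript
// Sketch of the tier fallback order: start at the requested tier and
// proceed downward, skipping tiers with no registered provider.
type SpeechTier = "default" | "premium" | "fallback";

const FALLBACK_ORDER: SpeechTier[] = ["premium", "default", "fallback"];

function tiersToAttempt(
  requested: SpeechTier | undefined,
  registered: Set<SpeechTier>
): SpeechTier[] {
  // No tier in the request means "default" is the starting point.
  const start = FALLBACK_ORDER.indexOf(requested ?? "default");
  return FALLBACK_ORDER.slice(start).filter((tier) => registered.has(tier));
}
```

Note that the chain never moves upward: a request for `default` will never be served by the premium provider, even if it is the only one registered.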
API Endpoint Reference
All speech endpoints are under /api/speech/ and require authentication (Bearer token) plus workspace context (x-workspace-id header).
POST /api/speech/transcribe
Transcribe an uploaded audio file to text.
Authentication: Bearer token + workspace membership
Content-Type: multipart/form-data
Form Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| file | File | Yes | Audio file (max 25 MB) |
| language | string | No | Language code (e.g., "en", "fr"). Default: from config |
| model | string | No | Whisper model override. Default: from config |
| prompt | string | No | Prompt to guide transcription (max 1000 chars) |
| temperature | number | No | Temperature 0.0-1.0. Lower = more deterministic |
Accepted Audio Formats:
audio/wav, audio/mp3, audio/mpeg, audio/webm, audio/ogg, audio/flac, audio/x-m4a
Response:
{
"data": {
"text": "Hello, this is a transcription test.",
"language": "en",
"durationSeconds": 3.5,
"confidence": 0.95,
"segments": [
{
"text": "Hello, this is a transcription test.",
"start": 0.0,
"end": 3.5,
"confidence": 0.95
}
]
}
}
Example:
curl -X POST http://localhost:3001/api/speech/transcribe \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "x-workspace-id: WORKSPACE_ID" \
-F "file=@recording.wav" \
-F "language=en"
POST /api/speech/synthesize
Synthesize text to audio using TTS providers.
Authentication: Bearer token + workspace membership
Content-Type: application/json
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize (max 4096 chars) |
| voice | string | No | Voice ID. Default: from config (e.g., "af_heart") |
| speed | number | No | Speed multiplier 0.5-2.0. Default: 1.0 |
| format | string | No | Output format: mp3, wav, opus, flac, aac, pcm. Default: mp3 |
| tier | string | No | Provider tier: default, premium, fallback. Default: default |
Response: Binary audio data with appropriate Content-Type header.
| Format | Content-Type |
|---|---|
| mp3 | audio/mpeg |
| wav | audio/wav |
| opus | audio/opus |
| flac | audio/flac |
| aac | audio/aac |
| pcm | audio/pcm |
Example:
curl -X POST http://localhost:3001/api/speech/synthesize \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "x-workspace-id: WORKSPACE_ID" \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "voice": "af_heart", "format": "mp3"}' \
--output speech.mp3
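For implementers, the format-to-Content-Type mapping in the table above can be expressed as a small lookup. The helper name and the mp3 fallback are illustrative assumptions:

```typescript
// Format-to-Content-Type lookup, matching the response header table above.
const AUDIO_CONTENT_TYPES: Record<string, string> = {
  mp3: "audio/mpeg",
  wav: "audio/wav",
  opus: "audio/opus",
  flac: "audio/flac",
  aac: "audio/aac",
  pcm: "audio/pcm",
};

function contentTypeFor(format: string): string {
  // Assume mp3 (the documented default format) for unknown values.
  return AUDIO_CONTENT_TYPES[format] ?? "audio/mpeg";
}
```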
GET /api/speech/voices
List available TTS voices across all tiers.
Authentication: Bearer token + workspace access

Query Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| tier | string | No | Filter by tier: default, premium, fallback |
Response:
{
"data": [
{
"id": "af_heart",
"name": "Heart (American Female)",
"language": "en-US",
"tier": "default",
"isDefault": true
},
{
"id": "am_adam",
"name": "Adam (American Male)",
"language": "en-US",
"tier": "default",
"isDefault": false
}
]
}
Example:
curl -X GET 'http://localhost:3001/api/speech/voices?tier=default' \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "x-workspace-id: WORKSPACE_ID"
GET /api/speech/health
Check availability of STT and TTS providers.
Authentication: Bearer token + workspace access
Response:
{
"data": {
"stt": { "available": true },
"tts": { "available": true }
}
}
WebSocket Streaming Protocol
The speech module provides a WebSocket gateway at namespace /speech for real-time streaming transcription. Audio chunks are accumulated on the server and transcribed when the session is stopped.
Connection
Connect to the /speech namespace with authentication:
import { io } from "socket.io-client";
const socket = io("http://localhost:3001/speech", {
auth: { token: "YOUR_SESSION_TOKEN" },
});
Authentication methods (checked in order):
1. auth.token in the handshake
2. query.token in the handshake URL
3. Authorization: Bearer <token> header
Connection is rejected if:
- No valid token is provided
- Session verification fails
- User has no workspace membership
Connection timeout: 5 seconds for authentication.
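The token-resolution order can be sketched as a small helper. The handshake shape mirrors Socket.IO's, but this function is illustrative, not the gateway's actual code:

```typescript
// Sketch of resolving the auth token in the documented priority order:
// auth.token, then query.token, then the Authorization header.
interface HandshakeLike {
  auth?: { token?: string };
  query?: { token?: string };
  headers?: { authorization?: string };
}

function resolveToken(handshake: HandshakeLike): string | null {
  if (handshake.auth?.token) return handshake.auth.token;
  if (handshake.query?.token) return handshake.query.token;
  const header = handshake.headers?.authorization;
  if (header?.startsWith("Bearer ")) return header.slice("Bearer ".length);
  return null; // connection will be rejected
}
```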
Protocol Flow
Client Server
| |
|--- connect (with token) ----->|
| | (authenticate, check workspace)
|<--- connected ----------------|
| |
|--- start-transcription ------>| { language?: "en" }
|<--- transcription-started ----| { sessionId, language }
| |
|--- audio-chunk -------------->| (Buffer/Uint8Array)
|--- audio-chunk -------------->| (Buffer/Uint8Array)
|--- audio-chunk -------------->| (Buffer/Uint8Array)
| |
|--- stop-transcription ------->|
| | (concatenate chunks, transcribe)
|<--- transcription-final ------| { text, language, durationSeconds, ... }
| |
Client Events (emit)
| Event | Payload | Description |
|---|---|---|
| start-transcription | { language?: string } | Begin a new transcription session |
| audio-chunk | Buffer or Uint8Array | Send audio data chunk |
| stop-transcription | (none) | Stop recording and trigger transcription |
Server Events (listen)
| Event | Payload | Description |
|---|---|---|
| transcription-started | { sessionId, language } | Session created |
| transcription-final | { text, language, durationSeconds, confidence, segments } | Transcription result |
| transcription-error | { message } | Error during transcription |
Session Management
- One active transcription session per client connection
- Starting a new session replaces any existing session
- Sessions are cleaned up on client disconnect
- Audio chunks are accumulated in memory
- Total accumulated size is capped by SPEECH_MAX_UPLOAD_SIZE (default: 25 MB)
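The accumulate-then-transcribe behavior with a size cap can be sketched like this. The class name and error message are illustrative, not the gateway's actual implementation:

```typescript
// Sketch of in-memory chunk accumulation with a byte cap, mirroring
// the SPEECH_MAX_UPLOAD_SIZE limit (default 25 MB).
class ChunkAccumulatorSketch {
  private chunks: Uint8Array[] = [];
  private totalBytes = 0;

  constructor(private readonly maxBytes = 25_000_000) {}

  add(chunk: Uint8Array): void {
    if (this.totalBytes + chunk.length > this.maxBytes) {
      throw new Error("accumulated audio exceeds SPEECH_MAX_UPLOAD_SIZE");
    }
    this.chunks.push(chunk);
    this.totalBytes += chunk.length;
  }

  // On stop-transcription, chunks are concatenated into one buffer
  // and handed to the STT provider.
  concat(): Uint8Array {
    const out = new Uint8Array(this.totalBytes);
    let offset = 0;
    for (const chunk of this.chunks) {
      out.set(chunk, offset);
      offset += chunk.length;
    }
    return out;
  }
}
```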
Example Client Usage
import { io } from "socket.io-client";
const socket = io("http://localhost:3001/speech", {
auth: { token: sessionToken },
});
// Start recording
socket.emit("start-transcription", { language: "en" });
socket.on("transcription-started", ({ sessionId }) => {
console.log("Session started:", sessionId);
});
// Stream audio chunks from MediaRecorder
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0) {
event.data.arrayBuffer().then((buffer) => {
socket.emit("audio-chunk", new Uint8Array(buffer));
});
}
};
// Stop and get result
socket.emit("stop-transcription");
socket.on("transcription-final", (result) => {
console.log("Transcription:", result.text);
console.log("Duration:", result.durationSeconds, "seconds");
});
socket.on("transcription-error", ({ message }) => {
console.error("Transcription error:", message);
});
Environment Variable Reference
Speech-to-Text (STT)
| Variable | Default | Description |
|---|---|---|
| STT_ENABLED | false | Enable speech-to-text transcription |
| STT_BASE_URL | http://speaches:8000/v1 | Speaches server URL (required when STT_ENABLED=true) |
| STT_MODEL | Systran/faster-whisper-large-v3-turbo | Whisper model for transcription |
| STT_LANGUAGE | en | Default language code |
Text-to-Speech (TTS) - Default Engine (Kokoro)
| Variable | Default | Description |
|---|---|---|
| TTS_ENABLED | false | Enable default TTS engine |
| TTS_DEFAULT_URL | http://kokoro-tts:8880/v1 | Kokoro-FastAPI URL (required when TTS_ENABLED=true) |
| TTS_DEFAULT_VOICE | af_heart | Default Kokoro voice ID |
| TTS_DEFAULT_FORMAT | mp3 | Default audio output format |
Text-to-Speech (TTS) - Premium Engine (Chatterbox)
| Variable | Default | Description |
|---|---|---|
TTS_PREMIUM_ENABLED |
false |
Enable premium TTS engine |
TTS_PREMIUM_URL |
http://chatterbox-tts:8881/v1 |
Chatterbox TTS URL (required when TTS_PREMIUM_ENABLED=true) |
Text-to-Speech (TTS) - Fallback Engine (Piper/OpenedAI)
| Variable | Default | Description |
|---|---|---|
| TTS_FALLBACK_ENABLED | false | Enable fallback TTS engine |
| TTS_FALLBACK_URL | http://openedai-speech:8000/v1 | OpenedAI Speech URL (required when TTS_FALLBACK_ENABLED=true) |
Service Limits
| Variable | Default | Description |
|---|---|---|
| SPEECH_MAX_UPLOAD_SIZE | 25000000 | Maximum upload file size in bytes (25 MB) |
| SPEECH_MAX_DURATION_SECONDS | 600 | Maximum audio duration in seconds (10 minutes) |
| SPEECH_MAX_TEXT_LENGTH | 4096 | Maximum text length for TTS in characters |
Conditional Validation
When a service is enabled, its URL variable is required. If missing, the application fails at startup with a message like:
STT is enabled (STT_ENABLED=true) but required environment variables are missing or empty: STT_BASE_URL.
Either set these variables or disable by setting STT_ENABLED=false.
Boolean parsing: value === "true" or value === "1". Unset or empty values default to false.
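The boolean-parsing rule above is strict equality, so case variants and other truthy strings are treated as false. A sketch (function name illustrative):

```typescript
// Sketch of the documented boolean parsing rule: only the exact strings
// "true" and "1" enable a flag; everything else (including unset,
// empty, or "TRUE") defaults to false.
function parseBooleanEnv(value: string | undefined): boolean {
  return value === "true" || value === "1";
}
```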
Provider Configuration
Kokoro (Default Tier)
Engine: Kokoro-FastAPI
License: Apache 2.0
Requirements: CPU only
Docker Image: ghcr.io/remsky/kokoro-fastapi:latest-cpu
Capabilities:
- 54 built-in voices across 8 languages
- Speed control: 0.25x to 4.0x
- Output formats: mp3, wav, opus, flac
- Voice metadata derived from ID prefix (language, gender, accent)
Voice ID Format: {lang}{gender}_{name}
- First character: language/accent (a=American, b=British, e=Spanish, f=French, h=Hindi, j=Japanese, p=Portuguese, z=Chinese)
- Second character: gender (f=Female, m=Male)
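The prefix convention can be decoded with a small helper. This is a sketch built from the list above; the function name and the fallback behavior for unknown prefixes are assumptions:

```typescript
// Sketch: derive language and gender metadata from a Kokoro voice ID
// following the {lang}{gender}_{name} convention.
const LANG_PREFIXES: Record<string, string> = {
  a: "en-US", b: "en-GB", e: "es", f: "fr",
  h: "hi", j: "ja", p: "pt", z: "zh",
};

function parseKokoroVoiceId(id: string) {
  const [prefix, name] = id.split("_");
  return {
    language: LANG_PREFIXES[prefix[0]] ?? "unknown",
    gender: prefix[1] === "f" ? "Female" : "Male",
    name,
  };
}
```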
Example voices:
| Voice ID | Name | Language | Gender |
|---|---|---|---|
| af_heart | Heart | en-US | Female |
| am_adam | Adam | en-US | Male |
| bf_alice | Alice | en-GB | Female |
| bm_daniel | Daniel | en-GB | Male |
| ef_dora | Dora | es | Female |
| ff_camille | Camille | fr | Female |
| jf_alpha | Alpha | ja | Female |
| zf_xiaobei | Xiaobei | zh | Female |
Chatterbox (Premium Tier)
Engine: Chatterbox TTS Server
License: Proprietary
Requirements: NVIDIA GPU with CUDA
Docker Image: devnen/chatterbox-tts-server:latest
Capabilities:
- Voice cloning via reference audio sample
- Emotion exaggeration control (0.0 - 1.0)
- Cross-language voice transfer (23 languages)
- Higher quality synthesis than default tier
Supported Languages: en, fr, de, es, it, pt, nl, pl, ru, uk, ja, zh, ko, ar, hi, tr, sv, da, fi, no, cs, el, ro
Extended Options (Chatterbox-specific):
| Option | Type | Description |
|---|---|---|
| referenceAudio | Buffer | Audio sample for voice cloning (5-30 seconds recommended) |
| emotionExaggeration | number | Emotion intensity 0.0-1.0 (clamped) |
These are passed as extra body parameters to the OpenAI-compatible endpoint. Reference audio is base64-encoded before sending.
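Building those extra body parameters might look like the sketch below. The wire-format field names (`reference_audio`, `emotion_exaggeration`) are assumptions for illustration; the actual parameter names the Chatterbox server expects may differ:

```typescript
// Sketch: assemble Chatterbox-specific extra body parameters.
// Field names on the wire are assumed, not confirmed by this document.
interface ChatterboxExtras {
  referenceAudio?: Buffer;
  emotionExaggeration?: number;
}

function buildChatterboxBody(extras: ChatterboxExtras): Record<string, unknown> {
  const body: Record<string, unknown> = {};
  if (extras.referenceAudio) {
    // Reference audio is base64-encoded before sending.
    body.reference_audio = extras.referenceAudio.toString("base64");
  }
  if (extras.emotionExaggeration !== undefined) {
    // Clamp to the documented 0.0-1.0 range.
    body.emotion_exaggeration = Math.min(1, Math.max(0, extras.emotionExaggeration));
  }
  return body;
}
```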
Piper (Fallback Tier)
Engine: Piper via OpenedAI Speech
License: GPL (OpenedAI Speech)
Requirements: CPU only (runs on Raspberry Pi)
Docker Image: Use OpenedAI Speech image
Capabilities:
- 100+ voices across 40+ languages
- 6 standard OpenAI voice names (mapped to Piper voices)
- Output formats: mp3, wav, opus, flac
- Ultra-lightweight, designed for low-resource environments
Standard Voice Mapping:
| OpenAI Voice | Piper Voice | Gender | Description |
|---|---|---|---|
| alloy | en_US-amy-medium | Female | Warm, balanced |
| echo | en_US-ryan-medium | Male | Clear, articulate |
| fable | en_GB-alan-medium | Male | British narrator |
| onyx | en_US-danny-low | Male | Deep, resonant |
| nova | en_US-lessac-medium | Female | Expressive, versatile |
| shimmer | en_US-kristin-medium | Female | Bright, energetic |
Speaches (STT)
Engine: Speaches (faster-whisper backend)
License: MIT
Requirements: CPU (GPU optional for faster inference)
Docker Image: ghcr.io/speaches-ai/speaches:latest
Capabilities:
- OpenAI-compatible /v1/audio/transcriptions endpoint
- Whisper models via faster-whisper
- Verbose JSON response with segments and timestamps
- Language detection
Default model: Systran/faster-whisper-large-v3-turbo
Voice Cloning Setup (Chatterbox)
Voice cloning is available through the Chatterbox premium TTS provider.
Prerequisites
- NVIDIA GPU with CUDA support
- nvidia-container-toolkit installed on the Docker host
- Docker runtime configured for GPU access
- TTS premium tier enabled (TTS_PREMIUM_ENABLED=true)
Basic Voice Cloning
Provide a reference audio sample (WAV or MP3, 5-30 seconds) when calling synthesize:
import { SpeechService } from "./speech.service";
import type { ChatterboxSynthesizeOptions } from "./interfaces/speech-types";
const options: ChatterboxSynthesizeOptions = {
tier: "premium",
referenceAudio: myAudioBuffer, // 5-30 second audio sample
emotionExaggeration: 0.5, // 0.0 = neutral, 1.0 = maximum emotion
};
const result = await speechService.synthesize("Hello, this is my cloned voice!", options);
Voice Cloning Tips
- Audio quality: Use clean recordings without background noise
- Duration: 5-30 seconds works best; shorter clips may produce lower quality
- Format: WAV provides the best quality; MP3 is also accepted
- Emotion: Start with 0.5 (moderate) and adjust from there
- Cross-language: You can clone a voice in one language and synthesize in another
Docker Compose Setup
Development (Local)
Speech services are defined in a separate overlay file docker-compose.speech.yml. This keeps them optional and separate from core services.
Start basic speech services (STT + default TTS):
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml up -d
# Using Makefile
make speech-up
Start with premium TTS (requires NVIDIA GPU):
docker compose -f docker-compose.yml -f docker-compose.speech.yml --profile premium-tts up -d
Stop speech services:
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml down --remove-orphans
# Using Makefile
make speech-down
View logs:
make speech-logs
Development Services
| Service | Container | Port | Image |
|---|---|---|---|
| Speaches (STT) | mosaic-speaches | 8090 (host) -> 8000 (container) | ghcr.io/speaches-ai/speaches:latest |
| Kokoro TTS | mosaic-kokoro-tts | 8880 (host) -> 8880 (container) | ghcr.io/remsky/kokoro-fastapi:latest-cpu |
| Chatterbox TTS | mosaic-chatterbox-tts | 8881 (host) -> 8000 (container) | devnen/chatterbox-tts-server:latest |
Production (Docker Swarm)
For production deployments, use docker/docker-compose.sample.speech.yml. This file is designed for Docker Swarm with Traefik integration.
Required environment variables:
STT_DOMAIN=stt.example.com
TTS_DOMAIN=tts.example.com
Optional environment variables:
WHISPER_MODEL=Systran/faster-whisper-large-v3-turbo
CHATTERBOX_TTS_DOMAIN=tts-premium.example.com
TRAEFIK_ENTRYPOINT=websecure
TRAEFIK_CERTRESOLVER=letsencrypt
TRAEFIK_DOCKER_NETWORK=traefik-public
TRAEFIK_TLS_ENABLED=true
Deploy:
docker stack deploy -c docker/docker-compose.sample.speech.yml speech
Connecting to Mosaic Stack: Set the speech URLs in your Mosaic Stack .env:
# Same Docker network
STT_BASE_URL=http://speaches:8000/v1
TTS_DEFAULT_URL=http://kokoro-tts:8880/v1
# External / different network
STT_BASE_URL=https://stt.example.com/v1
TTS_DEFAULT_URL=https://tts.example.com/v1
Health Checks
All speech containers include health checks:
| Service | Endpoint | Interval | Start Period |
|---|---|---|---|
| Speaches | http://localhost:8000/health | 30s | 120s |
| Kokoro TTS | http://localhost:8880/health | 30s | 120s |
| Chatterbox TTS | http://localhost:8000/health | 30s | 180s |
Chatterbox has a longer start period (180s) because GPU model loading takes additional time.
GPU VRAM Budget
Only Chatterbox requires GPU resources. The other providers (Speaches, Kokoro, Piper) are CPU-only.
Chatterbox VRAM Requirements
| Component | Approximate VRAM |
|---|---|
| Chatterbox TTS model | ~2-4 GB |
| Voice cloning inference | ~1-2 GB additional |
| Total recommended | 4-6 GB |
Shared GPU Considerations
If running multiple GPU services (e.g., Ollama for LLM + Chatterbox for TTS):
| Service | VRAM Usage | Notes |
|---|---|---|
| Ollama (7B model) | ~4-6 GB | Depends on model size |
| Ollama (13B model) | ~8-10 GB | Larger models need more |
| Chatterbox TTS | ~4-6 GB | Voice cloning is memory-intensive |
| Combined minimum | 8-12 GB | For 7B LLM + Chatterbox |
Recommendations:
- 8 GB VRAM: Adequate for small LLM + Chatterbox (may need to alternate)
- 12 GB VRAM: Comfortable for 7B LLM + Chatterbox simultaneously
- 24 GB VRAM: Supports larger LLMs + Chatterbox with headroom
If VRAM is limited, consider:
- Disabling Chatterbox (TTS_PREMIUM_ENABLED=false) and using Kokoro (CPU) as default
- Using the fallback chain so Kokoro handles requests when Chatterbox is busy
- Running Chatterbox on a separate GPU host
Docker Swarm GPU Scheduling
For Docker Swarm deployments with GPU, configure generic resources on the node:
// /etc/docker/daemon.json
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime"
}
},
"node-generic-resources": ["NVIDIA-GPU=0"]
}
See the Docker GPU Swarm documentation for details.
Frontend Integration
Speech services are consumed from the frontend through the REST API and WebSocket gateway.
REST API Usage
Transcribe audio:
async function transcribeAudio(file: File, token: string, workspaceId: string) {
const formData = new FormData();
formData.append("file", file);
formData.append("language", "en");
const response = await fetch("/api/speech/transcribe", {
method: "POST",
headers: {
Authorization: `Bearer ${token}`,
"x-workspace-id": workspaceId,
},
body: formData,
});
const { data } = await response.json();
return data.text;
}
Synthesize speech:
async function synthesizeSpeech(
text: string,
token: string,
workspaceId: string,
voice = "af_heart"
) {
const response = await fetch("/api/speech/synthesize", {
method: "POST",
headers: {
Authorization: `Bearer ${token}`,
"x-workspace-id": workspaceId,
"Content-Type": "application/json",
},
body: JSON.stringify({ text, voice, format: "mp3" }),
});
const audioBlob = await response.blob();
const audioUrl = URL.createObjectURL(audioBlob);
const audio = new Audio(audioUrl);
audio.play();
}
List voices:
async function listVoices(token: string, workspaceId: string, tier?: string) {
const url = tier ? `/api/speech/voices?tier=${tier}` : "/api/speech/voices";
const response = await fetch(url, {
headers: {
Authorization: `Bearer ${token}`,
"x-workspace-id": workspaceId,
},
});
const { data } = await response.json();
return data; // VoiceInfo[]
}
WebSocket Streaming Usage
For real-time transcription using the browser's MediaRecorder API:
import { io } from "socket.io-client";
function createSpeechSocket(token: string) {
const socket = io("/speech", {
auth: { token },
});
let mediaRecorder: MediaRecorder | null = null;
async function startRecording() {
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
mediaRecorder = new MediaRecorder(stream, {
mimeType: "audio/webm;codecs=opus",
});
socket.emit("start-transcription", { language: "en" });
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0) {
event.data.arrayBuffer().then((buffer) => {
socket.emit("audio-chunk", new Uint8Array(buffer));
});
}
};
mediaRecorder.start(250); // Send chunks every 250ms
}
async function stopRecording(): Promise<string> {
return new Promise((resolve, reject) => {
socket.once("transcription-final", (result) => {
resolve(result.text);
});
socket.once("transcription-error", ({ message }) => {
reject(new Error(message));
});
if (mediaRecorder) {
mediaRecorder.stop();
mediaRecorder.stream.getTracks().forEach((track) => track.stop());
mediaRecorder = null;
}
socket.emit("stop-transcription");
});
}
return { socket, startRecording, stopRecording };
}
Check Speech Availability
Before showing speech UI elements, check provider availability:
async function checkSpeechHealth(token: string, workspaceId: string) {
const response = await fetch("/api/speech/health", {
headers: {
Authorization: `Bearer ${token}`,
"x-workspace-id": workspaceId,
},
});
const { data } = await response.json();
return {
canTranscribe: data.stt.available,
canSynthesize: data.tts.available,
};
}