Speech Services
Mosaic Stack provides integrated speech-to-text (STT) and text-to-speech (TTS) services through a provider abstraction layer. Speech services are optional and modular -- each component can be independently enabled, disabled, or pointed at external infrastructure.
Table of Contents
- Architecture Overview
- Provider Abstraction
- TTS Tier System and Fallback Chain
- API Endpoint Reference
- WebSocket Streaming Protocol
- Environment Variable Reference
- Provider Configuration
- Voice Cloning Setup (Chatterbox)
- Docker Compose Setup
- GPU VRAM Budget
- Frontend Integration
Architecture Overview
+-------------------+
| SpeechController |
| (REST endpoints) |
+--------+----------+
|
+--------------+--------------+
| SpeechService |
| (provider selection, |
| fallback orchestration) |
+---------+----------+---------+
| |
+------------+ +-----+-------+
| | |
+------+------+ +-----+-----+ +-----+-----+
| STT Provider| |TTS Provider| |TTS Provider|
| (Speaches) | |Map<Tier,P> | |Map<Tier,P> |
+------+------+ +-----+-----+ +-----+-----+
| | |
+------+------+ +-----+-----+ +-----+-----+
| Speaches | | Kokoro | | Chatterbox |
| (Whisper) | | (default) | | (premium) |
+-------------+ +-----------+ +-----+------+
|
+-----+-----+
| Piper |
| (fallback)|
+-----------+
+-------------------+
| SpeechGateway |
| (WebSocket /speech)
+--------+----------+
|
Uses SpeechService.transcribe()
The speech module (apps/api/src/speech/) is a self-contained NestJS module consisting of:
| Component | File | Purpose |
|---|---|---|
| Module | speech.module.ts | Registers providers, controllers, gateway |
| Config | speech.config.ts | Environment validation and typed config |
| Service | speech.service.ts | High-level speech operations with fallback |
| Controller | speech.controller.ts | REST API endpoints |
| Gateway | speech.gateway.ts | WebSocket streaming transcription |
| Constants | speech.constants.ts | NestJS injection tokens |
Key Design Decisions
- OpenAI-compatible APIs: All providers (Speaches, Kokoro, Chatterbox, Piper/OpenedAI) expose OpenAI-compatible endpoints. The official OpenAI SDK is used as the HTTP client with a custom baseURL.
- Provider abstraction: STT and TTS providers implement well-defined interfaces (ISTTProvider, ITTSProvider). New providers can be added without modifying the service layer.
- Conditional registration: Providers are only instantiated when their corresponding *_ENABLED flag is true. The STT provider uses NestJS @Optional() injection.
- Fail-fast validation: Configuration is validated at module initialization. If a service is enabled but its URL is missing, the application fails on startup with a descriptive error.
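The conditional-registration and fail-fast decisions above can be sketched together. This is an illustrative simplification, not the actual factory code; the types and function names here (`TtsTierConfig`, `buildTtsProviders`, `ProviderStub`) are assumptions for the example:

```typescript
// Sketch of conditional provider registration with fail-fast validation.
// Names are illustrative, not the real factory implementation.
type SpeechTier = "default" | "premium" | "fallback";

interface TtsTierConfig {
  enabled: boolean;
  url?: string;
}

// Hypothetical provider stand-in: just records its tier and base URL.
interface ProviderStub {
  tier: SpeechTier;
  baseURL: string;
}

function buildTtsProviders(
  config: Record<SpeechTier, TtsTierConfig>
): Map<SpeechTier, ProviderStub> {
  const providers = new Map<SpeechTier, ProviderStub>();
  for (const tier of ["default", "premium", "fallback"] as SpeechTier[]) {
    const tierConfig = config[tier];
    if (!tierConfig.enabled) continue; // disabled tiers are never instantiated
    if (!tierConfig.url) {
      // Fail-fast: enabled but misconfigured should abort startup.
      throw new Error(`TTS tier "${tier}" is enabled but its URL is missing`);
    }
    providers.set(tier, { tier, baseURL: tierConfig.url });
  }
  return providers;
}
```

The point of the pattern is that downstream code only ever sees the `Map` of registered tiers, so it never has to re-check the `*_ENABLED` flags.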
Provider Abstraction
STT Provider Interface
interface ISTTProvider {
readonly name: string;
transcribe(audio: Buffer, options?: TranscribeOptions): Promise<TranscriptionResult>;
isHealthy(): Promise<boolean>;
}
Currently implemented by SpeachesSttProvider which connects to a Speaches (faster-whisper) server.
TTS Provider Interface
interface ITTSProvider {
readonly name: string;
readonly tier: SpeechTier;
synthesize(text: string, options?: SynthesizeOptions): Promise<SynthesisResult>;
listVoices(): Promise<VoiceInfo[]>;
isHealthy(): Promise<boolean>;
}
All TTS providers extend BaseTTSProvider, an abstract class that implements common OpenAI-compatible synthesis logic. Concrete providers only need to set name and tier and optionally override listVoices() or synthesize().
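The pattern described above can be sketched as follows. This is a simplified stand-in, not the real base class: the actual BaseTTSProvider performs OpenAI-compatible HTTP synthesis, which is stubbed out here, and the sketch class names are illustrative:

```typescript
// Simplified sketch of the provider pattern; HTTP logic is stubbed.
type SpeechTier = "default" | "premium" | "fallback";

interface VoiceInfo { id: string; name: string; }
interface SynthesisResult { audio: Uint8Array; format: string; }

abstract class BaseTTSProviderSketch {
  abstract readonly name: string;
  abstract readonly tier: SpeechTier;

  // Shared OpenAI-compatible synthesis logic lives in the base class
  // (replaced here by a stub that returns empty audio).
  async synthesize(text: string): Promise<SynthesisResult> {
    return { audio: new Uint8Array(0), format: "mp3" };
  }

  async listVoices(): Promise<VoiceInfo[]> {
    return []; // concrete providers may override with a real voice list
  }

  async isHealthy(): Promise<boolean> {
    return true;
  }
}

// A concrete provider only needs to set name and tier,
// optionally overriding listVoices() or synthesize().
class KokoroTtsProviderSketch extends BaseTTSProviderSketch {
  readonly name = "kokoro";
  readonly tier: SpeechTier = "default";

  override async listVoices(): Promise<VoiceInfo[]> {
    return [{ id: "af_heart", name: "Heart (American Female)" }];
  }
}
```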
Provider Registration
Providers are created by the TTS Provider Factory (providers/tts-provider.factory.ts) based on configuration:
| Tier | Provider Class | Engine | Requirements |
|---|---|---|---|
| default | KokoroTtsProvider | Kokoro-FastAPI | CPU only |
| premium | ChatterboxTTSProvider | Chatterbox TTS Server | NVIDIA GPU |
| fallback | PiperTtsProvider | Piper via OpenedAI Speech | CPU only |
TTS Tier System and Fallback Chain
TTS uses a tiered architecture with automatic fallback:
Request with tier="premium"
|
v
[premium] Chatterbox available? --yes--> Use Chatterbox
| |
no (success/fail)
|
v
[default] Kokoro available? ------yes--> Use Kokoro
| |
no (success/fail)
|
v
[fallback] Piper available? -----yes--> Use Piper
| |
no (success/fail)
|
v
ServiceUnavailableException
Fallback order: premium -> default -> fallback
The fallback chain starts from the requested tier and proceeds downward. A tier is only attempted if:
- It is enabled in configuration (TTS_ENABLED, TTS_PREMIUM_ENABLED, TTS_FALLBACK_ENABLED)
- A provider is registered for that tier
If no tier is specified in the request, default is used as the starting point.
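The selection rule above can be expressed as a small pure function. This is a sketch of the documented behavior, not the actual service code; `tiersToAttempt` is a hypothetical helper name:

```typescript
// Sketch of the tier fallback order: start at the requested tier and
// proceed downward, skipping tiers with no registered provider.
type SpeechTier = "default" | "premium" | "fallback";

const FALLBACK_ORDER: SpeechTier[] = ["premium", "default", "fallback"];

function tiersToAttempt(
  requested: SpeechTier | undefined,
  registered: Set<SpeechTier>
): SpeechTier[] {
  // No tier in the request means "default" is the starting point.
  const start = FALLBACK_ORDER.indexOf(requested ?? "default");
  return FALLBACK_ORDER.slice(start).filter((tier) => registered.has(tier));
}
```

Note that the chain never moves upward: a request for `default` will never be served by the premium provider, even if it is the only one registered.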
API Endpoint Reference
All speech endpoints are under /api/speech/ and require authentication (Bearer token) plus workspace context (x-workspace-id header).
POST /api/speech/transcribe
Transcribe an uploaded audio file to text.
Authentication: Bearer token + workspace membership
Content-Type: multipart/form-data
Form Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| file | File | Yes | Audio file (max 25 MB) |
| language | string | No | Language code (e.g., "en", "fr"). Default: from config |
| model | string | No | Whisper model override. Default: from config |
| prompt | string | No | Prompt to guide transcription (max 1000 chars) |
| temperature | number | No | Temperature 0.0-1.0. Lower = more deterministic |
Accepted Audio Formats:
audio/wav, audio/mp3, audio/mpeg, audio/webm, audio/ogg, audio/flac, audio/x-m4a
Response:
{
"data": {
"text": "Hello, this is a transcription test.",
"language": "en",
"durationSeconds": 3.5,
"confidence": 0.95,
"segments": [
{
"text": "Hello, this is a transcription test.",
"start": 0.0,
"end": 3.5,
"confidence": 0.95
}
]
}
}
Example:
curl -X POST http://localhost:3001/api/speech/transcribe \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "x-workspace-id: WORKSPACE_ID" \
-F "file=@recording.wav" \
-F "language=en"
POST /api/speech/synthesize
Synthesize text to audio using TTS providers.
Authentication: Bearer token + workspace membership
Content-Type: application/json
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize (max 4096 chars) |
| voice | string | No | Voice ID. Default: from config (e.g., "af_heart") |
| speed | number | No | Speed multiplier 0.5-2.0. Default: 1.0 |
| format | string | No | Output format: mp3, wav, opus, flac, aac, pcm. Default: mp3 |
| tier | string | No | Provider tier: default, premium, fallback. Default: default |
Response: Binary audio data with appropriate Content-Type header.
| Format | Content-Type |
|---|---|
| mp3 | audio/mpeg |
| wav | audio/wav |
| opus | audio/opus |
| flac | audio/flac |
| aac | audio/aac |
| pcm | audio/pcm |
Example:
curl -X POST http://localhost:3001/api/speech/synthesize \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "x-workspace-id: WORKSPACE_ID" \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "voice": "af_heart", "format": "mp3"}' \
--output speech.mp3
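For implementers, the format-to-Content-Type mapping in the table above can be expressed as a small lookup. The helper name and the mp3 fallback are illustrative assumptions:

```typescript
// Format-to-Content-Type lookup, matching the response header table above.
const AUDIO_CONTENT_TYPES: Record<string, string> = {
  mp3: "audio/mpeg",
  wav: "audio/wav",
  opus: "audio/opus",
  flac: "audio/flac",
  aac: "audio/aac",
  pcm: "audio/pcm",
};

function contentTypeFor(format: string): string {
  // Assume mp3 (the documented default format) for unknown values.
  return AUDIO_CONTENT_TYPES[format] ?? "audio/mpeg";
}
```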
GET /api/speech/voices
List available TTS voices across all tiers.
Authentication: Bearer token + workspace access

Query Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| tier | string | No | Filter by tier: default, premium, fallback |
Response:
{
"data": [
{
"id": "af_heart",
"name": "Heart (American Female)",
"language": "en-US",
"tier": "default",
"isDefault": true
},
{
"id": "am_adam",
"name": "Adam (American Male)",
"language": "en-US",
"tier": "default",
"isDefault": false
}
]
}
Example:
curl -X GET 'http://localhost:3001/api/speech/voices?tier=default' \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "x-workspace-id: WORKSPACE_ID"
GET /api/speech/health
Check availability of STT and TTS providers.
Authentication: Bearer token + workspace access
Response:
{
"data": {
"stt": { "available": true },
"tts": { "available": true }
}
}
WebSocket Streaming Protocol
The speech module provides a WebSocket gateway at namespace /speech for real-time streaming transcription. Audio chunks are accumulated on the server and transcribed when the session is stopped.
Connection
Connect to the /speech namespace with authentication:
import { io } from "socket.io-client";
const socket = io("http://localhost:3001/speech", {
auth: { token: "YOUR_SESSION_TOKEN" },
});
Authentication methods (checked in order):
1. auth.token in the handshake
2. query.token in the handshake URL
3. Authorization: Bearer <token> header
Connection is rejected if:
- No valid token is provided
- Session verification fails
- User has no workspace membership
Connection timeout: 5 seconds for authentication.
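The token-resolution order can be sketched as a small helper. The handshake shape mirrors Socket.IO's, but this function is illustrative, not the gateway's actual code:

```typescript
// Sketch of resolving the auth token in the documented priority order:
// auth.token, then query.token, then the Authorization header.
interface HandshakeLike {
  auth?: { token?: string };
  query?: { token?: string };
  headers?: { authorization?: string };
}

function resolveToken(handshake: HandshakeLike): string | null {
  if (handshake.auth?.token) return handshake.auth.token;
  if (handshake.query?.token) return handshake.query.token;
  const header = handshake.headers?.authorization;
  if (header?.startsWith("Bearer ")) return header.slice("Bearer ".length);
  return null; // connection will be rejected
}
```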
Protocol Flow
Client Server
| |
|--- connect (with token) ----->|
| | (authenticate, check workspace)
|<--- connected ----------------|
| |
|--- start-transcription ------>| { language?: "en" }
|<--- transcription-started ----| { sessionId, language }
| |
|--- audio-chunk -------------->| (Buffer/Uint8Array)
|--- audio-chunk -------------->| (Buffer/Uint8Array)
|--- audio-chunk -------------->| (Buffer/Uint8Array)
| |
|--- stop-transcription ------->|
| | (concatenate chunks, transcribe)
|<--- transcription-final ------| { text, language, durationSeconds, ... }
| |
Client Events (emit)
| Event | Payload | Description |
|---|---|---|
| start-transcription | { language?: string } | Begin a new transcription session |
| audio-chunk | Buffer or Uint8Array | Send audio data chunk |
| stop-transcription | (none) | Stop recording and trigger transcription |
Server Events (listen)
| Event | Payload | Description |
|---|---|---|
| transcription-started | { sessionId, language } | Session created |
| transcription-final | { text, language, durationSeconds, confidence, segments } | Transcription result |
| transcription-error | { message } | Error during transcription |
Session Management
- One active transcription session per client connection
- Starting a new session replaces any existing session
- Sessions are cleaned up on client disconnect
- Audio chunks are accumulated in memory
- Total accumulated size is capped by SPEECH_MAX_UPLOAD_SIZE (default: 25 MB)
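The accumulate-then-transcribe behavior with a size cap can be sketched like this. The class name and error message are illustrative, not the gateway's actual implementation:

```typescript
// Sketch of in-memory chunk accumulation with a byte cap, mirroring
// the SPEECH_MAX_UPLOAD_SIZE limit (default 25 MB).
class ChunkAccumulatorSketch {
  private chunks: Uint8Array[] = [];
  private totalBytes = 0;

  constructor(private readonly maxBytes = 25_000_000) {}

  add(chunk: Uint8Array): void {
    if (this.totalBytes + chunk.length > this.maxBytes) {
      throw new Error("accumulated audio exceeds SPEECH_MAX_UPLOAD_SIZE");
    }
    this.chunks.push(chunk);
    this.totalBytes += chunk.length;
  }

  // On stop-transcription, chunks are concatenated into one buffer
  // and handed to the STT provider.
  concat(): Uint8Array {
    const out = new Uint8Array(this.totalBytes);
    let offset = 0;
    for (const chunk of this.chunks) {
      out.set(chunk, offset);
      offset += chunk.length;
    }
    return out;
  }
}
```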
Example Client Usage
import { io } from "socket.io-client";
const socket = io("http://localhost:3001/speech", {
auth: { token: sessionToken },
});
// Start recording
socket.emit("start-transcription", { language: "en" });
socket.on("transcription-started", ({ sessionId }) => {
console.log("Session started:", sessionId);
});
// Stream audio chunks from MediaRecorder
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0) {
event.data.arrayBuffer().then((buffer) => {
socket.emit("audio-chunk", new Uint8Array(buffer));
});
}
};
// Stop and get result
socket.emit("stop-transcription");
socket.on("transcription-final", (result) => {
console.log("Transcription:", result.text);
console.log("Duration:", result.durationSeconds, "seconds");
});
socket.on("transcription-error", ({ message }) => {
console.error("Transcription error:", message);
});
Environment Variable Reference
Speech-to-Text (STT)
| Variable | Default | Description |
|---|---|---|
| STT_ENABLED | false | Enable speech-to-text transcription |
| STT_BASE_URL | http://speaches:8000/v1 | Speaches server URL (required when STT_ENABLED=true) |
| STT_MODEL | Systran/faster-whisper-large-v3-turbo | Whisper model for transcription |
| STT_LANGUAGE | en | Default language code |
Text-to-Speech (TTS) - Default Engine (Kokoro)
| Variable | Default | Description |
|---|---|---|
| TTS_ENABLED | false | Enable default TTS engine |
| TTS_DEFAULT_URL | http://kokoro-tts:8880/v1 | Kokoro-FastAPI URL (required when TTS_ENABLED=true) |
| TTS_DEFAULT_VOICE | af_heart | Default Kokoro voice ID |
| TTS_DEFAULT_FORMAT | mp3 | Default audio output format |
Text-to-Speech (TTS) - Premium Engine (Chatterbox)
| Variable | Default | Description |
|---|---|---|
TTS_PREMIUM_ENABLED |
false |
Enable premium TTS engine |
TTS_PREMIUM_URL |
http://chatterbox-tts:8881/v1 |
Chatterbox TTS URL (required when TTS_PREMIUM_ENABLED=true) |
Text-to-Speech (TTS) - Fallback Engine (Piper/OpenedAI)
| Variable | Default | Description |
|---|---|---|
| TTS_FALLBACK_ENABLED | false | Enable fallback TTS engine |
| TTS_FALLBACK_URL | http://openedai-speech:8000/v1 | OpenedAI Speech URL (required when TTS_FALLBACK_ENABLED=true) |
Service Limits
| Variable | Default | Description |
|---|---|---|
| SPEECH_MAX_UPLOAD_SIZE | 25000000 | Maximum upload file size in bytes (25 MB) |
| SPEECH_MAX_DURATION_SECONDS | 600 | Maximum audio duration in seconds (10 minutes) |
| SPEECH_MAX_TEXT_LENGTH | 4096 | Maximum text length for TTS in characters |
Conditional Validation
When a service is enabled, its URL variable is required. If missing, the application fails at startup with a message like:
STT is enabled (STT_ENABLED=true) but required environment variables are missing or empty: STT_BASE_URL.
Either set these variables or disable by setting STT_ENABLED=false.
Boolean parsing: value === "true" or value === "1". Unset or empty values default to false.
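The boolean-parsing rule above is strict equality, so case variants and other truthy strings are treated as false. A sketch (function name illustrative):

```typescript
// Sketch of the documented boolean parsing rule: only the exact strings
// "true" and "1" enable a flag; everything else (including unset,
// empty, or "TRUE") defaults to false.
function parseBooleanEnv(value: string | undefined): boolean {
  return value === "true" || value === "1";
}
```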
Provider Configuration
Kokoro (Default Tier)
Engine: Kokoro-FastAPI
License: Apache 2.0
Requirements: CPU only
Docker Image: ghcr.io/remsky/kokoro-fastapi:latest-cpu
Capabilities:
- 54 built-in voices across 8 languages
- Speed control: 0.25x to 4.0x
- Output formats: mp3, wav, opus, flac
- Voice metadata derived from ID prefix (language, gender, accent)
Voice ID Format: {lang}{gender}_{name}
- First character: language/accent (a=American, b=British, e=Spanish, f=French, h=Hindi, j=Japanese, p=Portuguese, z=Chinese)
- Second character: gender (f=Female, m=Male)
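The prefix convention can be decoded with a small helper. This is a sketch built from the list above; the function name and the fallback behavior for unknown prefixes are assumptions:

```typescript
// Sketch: derive language and gender metadata from a Kokoro voice ID
// following the {lang}{gender}_{name} convention.
const LANG_PREFIXES: Record<string, string> = {
  a: "en-US", b: "en-GB", e: "es", f: "fr",
  h: "hi", j: "ja", p: "pt", z: "zh",
};

function parseKokoroVoiceId(id: string) {
  const [prefix, name] = id.split("_");
  return {
    language: LANG_PREFIXES[prefix[0]] ?? "unknown",
    gender: prefix[1] === "f" ? "Female" : "Male",
    name,
  };
}
```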
Example voices:
| Voice ID | Name | Language | Gender |
|---|---|---|---|
| af_heart | Heart | en-US | Female |
| am_adam | Adam | en-US | Male |
| bf_alice | Alice | en-GB | Female |
| bm_daniel | Daniel | en-GB | Male |
| ef_dora | Dora | es | Female |
| ff_camille | Camille | fr | Female |
| jf_alpha | Alpha | ja | Female |
| zf_xiaobei | Xiaobei | zh | Female |
Chatterbox (Premium Tier)
Engine: Chatterbox TTS Server
License: Proprietary
Requirements: NVIDIA GPU with CUDA
Docker Image: devnen/chatterbox-tts-server:latest
Capabilities:
- Voice cloning via reference audio sample
- Emotion exaggeration control (0.0 - 1.0)
- Cross-language voice transfer (23 languages)
- Higher quality synthesis than default tier
Supported Languages: en, fr, de, es, it, pt, nl, pl, ru, uk, ja, zh, ko, ar, hi, tr, sv, da, fi, no, cs, el, ro
Extended Options (Chatterbox-specific):
| Option | Type | Description |
|---|---|---|
| referenceAudio | Buffer | Audio sample for voice cloning (5-30 seconds recommended) |
| emotionExaggeration | number | Emotion intensity 0.0-1.0 (clamped) |
These are passed as extra body parameters to the OpenAI-compatible endpoint. Reference audio is base64-encoded before sending.
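Building those extra body parameters might look like the sketch below. The wire-format field names (`reference_audio`, `emotion_exaggeration`) are assumptions for illustration; the actual parameter names the Chatterbox server expects may differ:

```typescript
// Sketch: assemble Chatterbox-specific extra body parameters.
// Field names on the wire are assumed, not confirmed by this document.
interface ChatterboxExtras {
  referenceAudio?: Buffer;
  emotionExaggeration?: number;
}

function buildChatterboxBody(extras: ChatterboxExtras): Record<string, unknown> {
  const body: Record<string, unknown> = {};
  if (extras.referenceAudio) {
    // Reference audio is base64-encoded before sending.
    body.reference_audio = extras.referenceAudio.toString("base64");
  }
  if (extras.emotionExaggeration !== undefined) {
    // Clamp to the documented 0.0-1.0 range.
    body.emotion_exaggeration = Math.min(1, Math.max(0, extras.emotionExaggeration));
  }
  return body;
}
```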
Piper (Fallback Tier)
Engine: Piper via OpenedAI Speech
License: GPL (OpenedAI Speech)
Requirements: CPU only (runs on Raspberry Pi)
Docker Image: Use OpenedAI Speech image
Capabilities:
- 100+ voices across 40+ languages
- 6 standard OpenAI voice names (mapped to Piper voices)
- Output formats: mp3, wav, opus, flac
- Ultra-lightweight, designed for low-resource environments
Standard Voice Mapping:
| OpenAI Voice | Piper Voice | Gender | Description |
|---|---|---|---|
| alloy | en_US-amy-medium | Female | Warm, balanced |
| echo | en_US-ryan-medium | Male | Clear, articulate |
| fable | en_GB-alan-medium | Male | British narrator |
| onyx | en_US-danny-low | Male | Deep, resonant |
| nova | en_US-lessac-medium | Female | Expressive, versatile |
| shimmer | en_US-kristin-medium | Female | Bright, energetic |
Speaches (STT)
Engine: Speaches (faster-whisper backend)
License: MIT
Requirements: CPU (GPU optional for faster inference)
Docker Image: ghcr.io/speaches-ai/speaches:latest
Capabilities:
- OpenAI-compatible /v1/audio/transcriptions endpoint
- Whisper models via faster-whisper
- Verbose JSON response with segments and timestamps
- Language detection
Default model: Systran/faster-whisper-large-v3-turbo
Voice Cloning Setup (Chatterbox)
Voice cloning is available through the Chatterbox premium TTS provider.
Prerequisites
- NVIDIA GPU with CUDA support
- nvidia-container-toolkit installed on the Docker host
- Docker runtime configured for GPU access
- TTS premium tier enabled (TTS_PREMIUM_ENABLED=true)
Basic Voice Cloning
Provide a reference audio sample (WAV or MP3, 5-30 seconds) when calling synthesize:
import { SpeechService } from "./speech.service";
import type { ChatterboxSynthesizeOptions } from "./interfaces/speech-types";
const options: ChatterboxSynthesizeOptions = {
tier: "premium",
referenceAudio: myAudioBuffer, // 5-30 second audio sample
emotionExaggeration: 0.5, // 0.0 = neutral, 1.0 = maximum emotion
};
const result = await speechService.synthesize("Hello, this is my cloned voice!", options);
Voice Cloning Tips
- Audio quality: Use clean recordings without background noise
- Duration: 5-30 seconds works best; shorter clips may produce lower quality
- Format: WAV provides the best quality; MP3 is also accepted
- Emotion: Start with 0.5 (moderate) and adjust from there
- Cross-language: You can clone a voice in one language and synthesize in another
Docker Compose Setup
Development (Local)
Speech services are defined in a separate overlay file docker-compose.speech.yml. This keeps them optional and separate from core services.
Start basic speech services (STT + default TTS):
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml up -d
# Using Makefile
make speech-up
Start with premium TTS (requires NVIDIA GPU):
docker compose -f docker-compose.yml -f docker-compose.speech.yml --profile premium-tts up -d
Stop speech services:
# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml down --remove-orphans
# Using Makefile
make speech-down
View logs:
make speech-logs
Development Services
| Service | Container | Port | Image |
|---|---|---|---|
| Speaches (STT) | mosaic-speaches | 8090 (host) -> 8000 (container) | ghcr.io/speaches-ai/speaches:latest |
| Kokoro TTS | mosaic-kokoro-tts | 8880 (host) -> 8880 (container) | ghcr.io/remsky/kokoro-fastapi:latest-cpu |
| Chatterbox TTS | mosaic-chatterbox-tts | 8881 (host) -> 8000 (container) | devnen/chatterbox-tts-server:latest |
Production (Docker Swarm)
For production deployments, use docker/docker-compose.sample.speech.yml. This file is designed for Docker Swarm with Traefik integration.
Required environment variables:
STT_DOMAIN=stt.example.com
TTS_DOMAIN=tts.example.com
Optional environment variables:
WHISPER_MODEL=Systran/faster-whisper-large-v3-turbo
CHATTERBOX_TTS_DOMAIN=tts-premium.example.com
TRAEFIK_ENTRYPOINT=websecure
TRAEFIK_CERTRESOLVER=letsencrypt
TRAEFIK_DOCKER_NETWORK=traefik-public
TRAEFIK_TLS_ENABLED=true
Deploy:
docker stack deploy -c docker/docker-compose.sample.speech.yml speech
Connecting to Mosaic Stack: Set the speech URLs in your Mosaic Stack .env:
# Same Docker network
STT_BASE_URL=http://speaches:8000/v1
TTS_DEFAULT_URL=http://kokoro-tts:8880/v1
# External / different network
STT_BASE_URL=https://stt.example.com/v1
TTS_DEFAULT_URL=https://tts.example.com/v1
Health Checks
All speech containers include health checks:
| Service | Endpoint | Interval | Start Period |
|---|---|---|---|
| Speaches | http://localhost:8000/health | 30s | 120s |
| Kokoro TTS | http://localhost:8880/health | 30s | 120s |
| Chatterbox TTS | http://localhost:8000/health | 30s | 180s |
Chatterbox has a longer start period (180s) because GPU model loading takes additional time.
GPU VRAM Budget
Only Chatterbox requires GPU resources. The other providers (Speaches, Kokoro, Piper) are CPU-only.
Chatterbox VRAM Requirements
| Component | Approximate VRAM |
|---|---|
| Chatterbox TTS model | ~2-4 GB |
| Voice cloning inference | ~1-2 GB additional |
| Total recommended | 4-6 GB |
Shared GPU Considerations
If running multiple GPU services (e.g., Ollama for LLM + Chatterbox for TTS):
| Service | VRAM Usage | Notes |
|---|---|---|
| Ollama (7B model) | ~4-6 GB | Depends on model size |
| Ollama (13B model) | ~8-10 GB | Larger models need more |
| Chatterbox TTS | ~4-6 GB | Voice cloning is memory-intensive |
| Combined minimum | 8-12 GB | For 7B LLM + Chatterbox |
Recommendations:
- 8 GB VRAM: Adequate for small LLM + Chatterbox (may need to alternate)
- 12 GB VRAM: Comfortable for 7B LLM + Chatterbox simultaneously
- 24 GB VRAM: Supports larger LLMs + Chatterbox with headroom
If VRAM is limited, consider:
- Disabling Chatterbox (TTS_PREMIUM_ENABLED=false) and using Kokoro (CPU) as default
- Using the fallback chain so Kokoro handles requests when Chatterbox is busy
- Running Chatterbox on a separate GPU host
Docker Swarm GPU Scheduling
For Docker Swarm deployments with GPU, configure generic resources on the node:
// /etc/docker/daemon.json
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime"
}
},
"node-generic-resources": ["NVIDIA-GPU=0"]
}
See the Docker GPU Swarm documentation for details.
Frontend Integration
Speech services are consumed from the frontend through the REST API and WebSocket gateway.
REST API Usage
Transcribe audio:
async function transcribeAudio(file: File, token: string, workspaceId: string) {
const formData = new FormData();
formData.append("file", file);
formData.append("language", "en");
const response = await fetch("/api/speech/transcribe", {
method: "POST",
headers: {
Authorization: `Bearer ${token}`,
"x-workspace-id": workspaceId,
},
body: formData,
});
const { data } = await response.json();
return data.text;
}
Synthesize speech:
async function synthesizeSpeech(
text: string,
token: string,
workspaceId: string,
voice = "af_heart"
) {
const response = await fetch("/api/speech/synthesize", {
method: "POST",
headers: {
Authorization: `Bearer ${token}`,
"x-workspace-id": workspaceId,
"Content-Type": "application/json",
},
body: JSON.stringify({ text, voice, format: "mp3" }),
});
const audioBlob = await response.blob();
const audioUrl = URL.createObjectURL(audioBlob);
const audio = new Audio(audioUrl);
audio.play();
}
List voices:
async function listVoices(token: string, workspaceId: string, tier?: string) {
const url = tier ? `/api/speech/voices?tier=${tier}` : "/api/speech/voices";
const response = await fetch(url, {
headers: {
Authorization: `Bearer ${token}`,
"x-workspace-id": workspaceId,
},
});
const { data } = await response.json();
return data; // VoiceInfo[]
}
WebSocket Streaming Usage
For real-time transcription using the browser's MediaRecorder API:
import { io } from "socket.io-client";
function createSpeechSocket(token: string) {
const socket = io("/speech", {
auth: { token },
});
let mediaRecorder: MediaRecorder | null = null;
async function startRecording() {
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
mediaRecorder = new MediaRecorder(stream, {
mimeType: "audio/webm;codecs=opus",
});
socket.emit("start-transcription", { language: "en" });
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0) {
event.data.arrayBuffer().then((buffer) => {
socket.emit("audio-chunk", new Uint8Array(buffer));
});
}
};
mediaRecorder.start(250); // Send chunks every 250ms
}
async function stopRecording(): Promise<string> {
return new Promise((resolve, reject) => {
socket.once("transcription-final", (result) => {
resolve(result.text);
});
socket.once("transcription-error", ({ message }) => {
reject(new Error(message));
});
if (mediaRecorder) {
mediaRecorder.stop();
mediaRecorder.stream.getTracks().forEach((track) => track.stop());
mediaRecorder = null;
}
socket.emit("stop-transcription");
});
}
return { socket, startRecording, stopRecording };
}
Check Speech Availability
Before showing speech UI elements, check provider availability:
async function checkSpeechHealth(token: string, workspaceId: string) {
const response = await fetch("/api/speech/health", {
headers: {
Authorization: `Bearer ${token}`,
"x-workspace-id": workspaceId,
},
});
const { data } = await response.json();
return {
canTranscribe: data.stt.available,
canSynthesize: data.tts.available,
};
}