stack/docs/SPEECH.md
Jason Woltje 24065aa199
docs(#406): add speech services documentation
Comprehensive documentation for the speech services module:
- docs/SPEECH.md: Architecture, API reference, WebSocket protocol,
  environment variables, provider configuration, Docker setup,
  GPU VRAM budget, and frontend integration examples
- apps/api/src/speech/AGENTS.md: Module structure, provider pattern,
  how to add new providers, gotchas, and test patterns
- README.md: Speech capabilities section with quick start

Fixes #406

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 03:23:22 -06:00


Speech Services

Mosaic Stack provides integrated speech-to-text (STT) and text-to-speech (TTS) services through a provider abstraction layer. Speech services are optional and modular -- each component can be independently enabled, disabled, or pointed at external infrastructure.

Table of Contents

  • Architecture Overview
  • Provider Abstraction
  • TTS Tier System and Fallback Chain
  • API Endpoint Reference
  • WebSocket Streaming Protocol
  • Environment Variable Reference
  • Provider Configuration
  • Voice Cloning Setup (Chatterbox)
  • Docker Compose Setup
  • GPU VRAM Budget
  • Frontend Integration

Architecture Overview

                    +-------------------+
                    |  SpeechController |
                    |  (REST endpoints) |
                    +---------+---------+
                              |
                 +------------+------------+
                 |      SpeechService      |
                 |  (provider selection,   |
                 | fallback orchestration) |
                 +---+-----------------+---+
                     |                 |
              +------+-------+ +-------+------------------------+
              | STT Provider | |         TTS Providers          |
              |  (Speaches)  | |   Map<SpeechTier, Provider>    |
              +------+-------+ +----+------------+------------+-+
                     |              |            |            |
              +------+-------+ +----+-----+ +----+-----+ +----+-----+
              |   Speaches   | |  Kokoro  | |Chatterbox| |  Piper   |
              |  (Whisper)   | | (default)| | (premium)| |(fallback)|
              +--------------+ +----------+ +----------+ +----------+

          +---------------------+
          |    SpeechGateway    |
          | (WebSocket /speech) |
          +----------+----------+
                     |
          Uses SpeechService.transcribe()

The speech module (apps/api/src/speech/) is a self-contained NestJS module consisting of:

| Component  | File                 | Purpose                                     |
|------------|----------------------|---------------------------------------------|
| Module     | speech.module.ts     | Registers providers, controllers, gateway   |
| Config     | speech.config.ts     | Environment validation and typed config     |
| Service    | speech.service.ts    | High-level speech operations with fallback  |
| Controller | speech.controller.ts | REST API endpoints                          |
| Gateway    | speech.gateway.ts    | WebSocket streaming transcription           |
| Constants  | speech.constants.ts  | NestJS injection tokens                     |

Key Design Decisions

  1. OpenAI-compatible APIs: All providers (Speaches, Kokoro, Chatterbox, Piper/OpenedAI) expose OpenAI-compatible endpoints. The official OpenAI SDK is used as the HTTP client with a custom baseURL.

  2. Provider abstraction: STT and TTS providers implement well-defined interfaces (ISTTProvider, ITTSProvider). New providers can be added without modifying the service layer.

  3. Conditional registration: Providers are only instantiated when their corresponding *_ENABLED flag is true. The STT provider uses NestJS @Optional() injection.

  4. Fail-fast validation: Configuration is validated at module initialization. If a service is enabled but its URL is missing, the application fails on startup with a descriptive error.
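As a rough illustration of decisions 3 and 4, the conditional-registration and fail-fast ideas can be sketched in plain TypeScript. This is not the actual speech.module.ts code — the real module wires this through a NestJS factory provider and an injection token, and all names below are illustrative:

```typescript
// Illustrative sketch only: the real module binds this through a NestJS
// factory provider and @Optional() injection; names here are made up.
interface ISTTProvider {
  readonly name: string;
}

class SpeachesSttProvider implements ISTTProvider {
  readonly name = "speaches";
}

// Boolean parsing as documented: only "true" or "1" count as enabled.
function parseFlag(value: string | undefined): boolean {
  return value === "true" || value === "1";
}

function createSttProvider(env: Record<string, string | undefined>): ISTTProvider | null {
  // Disabled: register nothing; consumers receive null via @Optional().
  if (!parseFlag(env.STT_ENABLED)) return null;
  if (!env.STT_BASE_URL) {
    // Fail-fast: an enabled service with a missing URL aborts startup.
    throw new Error(
      "STT is enabled (STT_ENABLED=true) but required environment variables are missing or empty: STT_BASE_URL."
    );
  }
  return new SpeachesSttProvider();
}
```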


Provider Abstraction

STT Provider Interface

interface ISTTProvider {
  readonly name: string;
  transcribe(audio: Buffer, options?: TranscribeOptions): Promise<TranscriptionResult>;
  isHealthy(): Promise<boolean>;
}

Currently implemented by SpeachesSttProvider, which connects to a Speaches (faster-whisper) server.
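A minimal implementation along these lines might look as follows. This is a hedged sketch, not the real provider: it assumes Node 18+ globals (fetch, FormData, Blob), and the option/result field names beyond the interface above are illustrative:

```typescript
// Hypothetical field names; the real TranscribeOptions/TranscriptionResult
// types in the speech module may differ.
interface TranscribeOptions {
  language?: string;
  model?: string;
}

interface TranscriptionResult {
  text: string;
  language?: string;
}

class SpeachesSttProvider {
  readonly name = "speaches";

  constructor(private readonly baseUrl: string) {}

  async transcribe(audio: Buffer, options?: TranscribeOptions): Promise<TranscriptionResult> {
    // OpenAI-compatible multipart upload to /audio/transcriptions.
    const form = new FormData();
    form.append("file", new Blob([new Uint8Array(audio)]), "audio.wav");
    form.append("model", options?.model ?? "Systran/faster-whisper-large-v3-turbo");
    if (options?.language) form.append("language", options.language);

    const res = await fetch(`${this.baseUrl}/audio/transcriptions`, {
      method: "POST",
      body: form,
    });
    if (!res.ok) throw new Error(`STT request failed: ${res.status}`);
    return (await res.json()) as TranscriptionResult;
  }

  async isHealthy(): Promise<boolean> {
    try {
      // Speaches exposes /health outside the /v1 prefix.
      const res = await fetch(`${this.baseUrl.replace(/\/v1$/, "")}/health`);
      return res.ok;
    } catch {
      return false;
    }
  }
}
```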

TTS Provider Interface

interface ITTSProvider {
  readonly name: string;
  readonly tier: SpeechTier;
  synthesize(text: string, options?: SynthesizeOptions): Promise<SynthesisResult>;
  listVoices(): Promise<VoiceInfo[]>;
  isHealthy(): Promise<boolean>;
}

All TTS providers extend BaseTTSProvider, an abstract class that implements common OpenAI-compatible synthesis logic. Concrete providers only need to set name and tier and optionally override listVoices() or synthesize().
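The pattern can be sketched as below. The constructor arguments and the exact OpenAI-compatible request fields are assumptions for illustration; only the ITTSProvider surface comes from the interface above:

```typescript
type SpeechTier = "default" | "premium" | "fallback";

interface SynthesizeOptions { voice?: string; speed?: number; format?: string; }
interface SynthesisResult { audio: Buffer; contentType: string; }
interface VoiceInfo { id: string; name: string; tier: SpeechTier; }

abstract class BaseTTSProvider {
  abstract readonly name: string;
  abstract readonly tier: SpeechTier;

  constructor(protected readonly baseUrl: string, protected readonly defaultVoice: string) {}

  // Shared OpenAI-compatible synthesis path: POST {baseUrl}/audio/speech.
  async synthesize(text: string, options?: SynthesizeOptions): Promise<SynthesisResult> {
    const res = await fetch(`${this.baseUrl}/audio/speech`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        input: text,
        voice: options?.voice ?? this.defaultVoice,
        speed: options?.speed ?? 1.0,
        response_format: options?.format ?? "mp3",
      }),
    });
    if (!res.ok) throw new Error(`TTS request failed: ${res.status}`);
    return {
      audio: Buffer.from(await res.arrayBuffer()),
      contentType: res.headers.get("content-type") ?? "audio/mpeg",
    };
  }

  abstract listVoices(): Promise<VoiceInfo[]>;
}

// A concrete provider only pins down name, tier, and voice listing.
class KokoroTtsProvider extends BaseTTSProvider {
  readonly name = "kokoro";
  readonly tier: SpeechTier = "default";

  async listVoices(): Promise<VoiceInfo[]> {
    return [{ id: "af_heart", name: "Heart", tier: this.tier }];
  }
}
```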

Provider Registration

Providers are created by the TTS Provider Factory (providers/tts-provider.factory.ts) based on configuration:

| Tier     | Provider Class        | Engine                    | Requirements |
|----------|-----------------------|---------------------------|--------------|
| default  | KokoroTtsProvider     | Kokoro-FastAPI            | CPU only     |
| premium  | ChatterboxTTSProvider | Chatterbox TTS Server     | NVIDIA GPU   |
| fallback | PiperTtsProvider      | Piper via OpenedAI Speech | CPU only     |

TTS Tier System and Fallback Chain

TTS uses a tiered architecture with automatic fallback:

Request with tier="premium"
    |
    v
[premium] Chatterbox available? --yes--> Use Chatterbox
    |                                         |
    no                                   (success/fail)
    |
    v
[default] Kokoro available? ------yes--> Use Kokoro
    |                                         |
    no                                   (success/fail)
    |
    v
[fallback] Piper available? -----yes--> Use Piper
    |                                         |
    no                                   (success/fail)
    |
    v
ServiceUnavailableException

Fallback order: premium -> default -> fallback

The fallback chain starts from the requested tier and proceeds downward. A tier is only attempted if:

  1. It is enabled in configuration (TTS_ENABLED, TTS_PREMIUM_ENABLED, TTS_FALLBACK_ENABLED)
  2. A provider is registered for that tier

If no tier is specified in the request, default is used as the starting point.
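The tier walk described above can be sketched as follows (illustrative names; the real service raises NestJS's ServiceUnavailableException rather than a plain Error):

```typescript
type SpeechTier = "default" | "premium" | "fallback";

// Chain in descending order: premium -> default -> fallback.
const TIER_ORDER: SpeechTier[] = ["premium", "default", "fallback"];

function tiersToTry(requested: SpeechTier = "default"): SpeechTier[] {
  // Start from the requested tier and proceed downward.
  return TIER_ORDER.slice(TIER_ORDER.indexOf(requested));
}
// tiersToTry("premium") -> ["premium", "default", "fallback"]
// tiersToTry()          -> ["default", "fallback"]

async function synthesizeWithFallback(
  providers: Map<SpeechTier, { synthesize(text: string): Promise<Buffer> }>,
  text: string,
  requested?: SpeechTier
): Promise<Buffer> {
  for (const tier of tiersToTry(requested)) {
    // Disabled or unregistered tiers simply have no map entry.
    const provider = providers.get(tier);
    if (!provider) continue;
    try {
      return await provider.synthesize(text);
    } catch {
      // Provider failed: fall through to the next tier.
    }
  }
  throw new Error("Service unavailable: no TTS tier could handle the request");
}
```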


API Endpoint Reference

All speech endpoints are under /api/speech/ and require authentication (Bearer token) plus workspace context (x-workspace-id header).

POST /api/speech/transcribe

Transcribe an uploaded audio file to text.

Authentication: Bearer token + workspace membership
Content-Type: multipart/form-data

Form Fields:

| Field       | Type   | Required | Description                                              |
|-------------|--------|----------|----------------------------------------------------------|
| file        | File   | Yes      | Audio file (max 25 MB)                                   |
| language    | string | No       | Language code (e.g., "en", "fr"). Default: from config   |
| model       | string | No       | Whisper model override. Default: from config             |
| prompt      | string | No       | Prompt to guide transcription (max 1000 chars)           |
| temperature | number | No       | Temperature 0.0-1.0. Lower = more deterministic          |

Accepted Audio Formats: audio/wav, audio/mp3, audio/mpeg, audio/webm, audio/ogg, audio/flac, audio/x-m4a

Response:

{
  "data": {
    "text": "Hello, this is a transcription test.",
    "language": "en",
    "durationSeconds": 3.5,
    "confidence": 0.95,
    "segments": [
      {
        "text": "Hello, this is a transcription test.",
        "start": 0.0,
        "end": 3.5,
        "confidence": 0.95
      }
    ]
  }
}

Example:

curl -X POST http://localhost:3001/api/speech/transcribe \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -F "file=@recording.wav" \
  -F "language=en"

POST /api/speech/synthesize

Synthesize text to audio using TTS providers.

Authentication: Bearer token + workspace membership
Content-Type: application/json

Request Body:

| Field  | Type   | Required | Description                                                  |
|--------|--------|----------|--------------------------------------------------------------|
| text   | string | Yes      | Text to synthesize (max 4096 chars)                          |
| voice  | string | No       | Voice ID. Default: from config (e.g., "af_heart")            |
| speed  | number | No       | Speed multiplier 0.5-2.0. Default: 1.0                       |
| format | string | No       | Output format: mp3, wav, opus, flac, aac, pcm. Default: mp3  |
| tier   | string | No       | Provider tier: default, premium, fallback. Default: default  |

Response: Binary audio data with appropriate Content-Type header.

| Format | Content-Type |
|--------|--------------|
| mp3    | audio/mpeg   |
| wav    | audio/wav    |
| opus   | audio/opus   |
| flac   | audio/flac   |
| aac    | audio/aac    |
| pcm    | audio/pcm    |

Example:

curl -X POST http://localhost:3001/api/speech/synthesize \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "af_heart", "format": "mp3"}' \
  --output speech.mp3

GET /api/speech/voices

List available TTS voices across all tiers.

Authentication: Bearer token + workspace access

Query Parameters:

| Parameter | Type   | Required | Description                                |
|-----------|--------|----------|--------------------------------------------|
| tier      | string | No       | Filter by tier: default, premium, fallback |

Response:

{
  "data": [
    {
      "id": "af_heart",
      "name": "Heart (American Female)",
      "language": "en-US",
      "tier": "default",
      "isDefault": true
    },
    {
      "id": "am_adam",
      "name": "Adam (American Male)",
      "language": "en-US",
      "tier": "default",
      "isDefault": false
    }
  ]
}

Example:

curl -X GET 'http://localhost:3001/api/speech/voices?tier=default' \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "x-workspace-id: WORKSPACE_ID"

GET /api/speech/health

Check availability of STT and TTS providers.

Authentication: Bearer token + workspace access

Response:

{
  "data": {
    "stt": { "available": true },
    "tts": { "available": true }
  }
}

WebSocket Streaming Protocol

The speech module provides a WebSocket gateway at namespace /speech for real-time streaming transcription. Audio chunks are accumulated on the server and transcribed when the session is stopped.

Connection

Connect to the /speech namespace with authentication:

import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: "YOUR_SESSION_TOKEN" },
});

Authentication methods (checked in order):

  1. auth.token in handshake
  2. query.token in handshake URL
  3. Authorization: Bearer <token> header

Connection is rejected if:

  • No valid token is provided
  • Session verification fails
  • User has no workspace membership

Connection timeout: 5 seconds for authentication.
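The token lookup order can be sketched with a small helper. The handshake shape mirrors socket.io's, but this function is illustrative, not the gateway's actual code:

```typescript
// Minimal handshake shape for illustration; socket.io's real Handshake
// type carries more fields.
interface Handshake {
  auth?: { token?: string };
  query?: { token?: string };
  headers?: { authorization?: string };
}

function extractToken(handshake: Handshake): string | null {
  if (handshake.auth?.token) return handshake.auth.token;     // 1. auth.token
  if (handshake.query?.token) return handshake.query.token;   // 2. query.token
  const header = handshake.headers?.authorization;            // 3. Authorization header
  if (header?.startsWith("Bearer ")) return header.slice("Bearer ".length);
  return null; // no token -> connection is rejected
}
```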

Protocol Flow

Client                          Server
  |                               |
  |--- connect (with token) ----->|
  |                               |  (authenticate, check workspace)
  |<--- connected ----------------|
  |                               |
  |--- start-transcription ------>|  { language?: "en" }
  |<--- transcription-started ----|  { sessionId, language }
  |                               |
  |--- audio-chunk -------------->|  (Buffer/Uint8Array)
  |--- audio-chunk -------------->|  (Buffer/Uint8Array)
  |--- audio-chunk -------------->|  (Buffer/Uint8Array)
  |                               |
  |--- stop-transcription ------->|
  |                               |  (concatenate chunks, transcribe)
  |<--- transcription-final ------|  { text, language, durationSeconds, ... }
  |                               |

Client Events (emit)

| Event               | Payload               | Description                           |
|---------------------|-----------------------|---------------------------------------|
| start-transcription | { language?: string } | Begin a new transcription session     |
| audio-chunk         | Buffer or Uint8Array  | Send audio data chunk                 |
| stop-transcription  | (none)                | Stop recording and trigger transcription |

Server Events (listen)

| Event                 | Payload                                                     | Description                |
|-----------------------|-------------------------------------------------------------|----------------------------|
| transcription-started | { sessionId, language }                                     | Session created            |
| transcription-final   | { text, language, durationSeconds, confidence, segments }   | Transcription result       |
| transcription-error   | { message }                                                 | Error during transcription |

Session Management

  • One active transcription session per client connection
  • Starting a new session replaces any existing session
  • Sessions are cleaned up on client disconnect
  • Audio chunks are accumulated in memory
  • Total accumulated size is capped by SPEECH_MAX_UPLOAD_SIZE (default: 25 MB)
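The accumulation-with-cap behavior can be sketched like this (class and field names are illustrative, not the gateway's actual code):

```typescript
// SPEECH_MAX_UPLOAD_SIZE default: 25 MB.
const MAX_UPLOAD_SIZE = 25_000_000;

class TranscriptionSession {
  private chunks: Uint8Array[] = [];
  private totalBytes = 0;

  addChunk(chunk: Uint8Array): void {
    // Enforce the cap on total accumulated size, not per chunk.
    if (this.totalBytes + chunk.byteLength > MAX_UPLOAD_SIZE) {
      throw new Error("Audio exceeds SPEECH_MAX_UPLOAD_SIZE");
    }
    this.chunks.push(chunk);
    this.totalBytes += chunk.byteLength;
  }

  // On stop-transcription: concatenate all chunks into a single buffer
  // and hand it to SpeechService.transcribe().
  finish(): Buffer {
    return Buffer.concat(this.chunks.map((c) => Buffer.from(c)));
  }
}
```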

Example Client Usage

import { io } from "socket.io-client";

const socket = io("http://localhost:3001/speech", {
  auth: { token: sessionToken },
});

// Start recording
socket.emit("start-transcription", { language: "en" });

socket.on("transcription-started", ({ sessionId }) => {
  console.log("Session started:", sessionId);
});

// Stream audio chunks from MediaRecorder
mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    event.data.arrayBuffer().then((buffer) => {
      socket.emit("audio-chunk", new Uint8Array(buffer));
    });
  }
};

// Stop and get result
socket.emit("stop-transcription");

socket.on("transcription-final", (result) => {
  console.log("Transcription:", result.text);
  console.log("Duration:", result.durationSeconds, "seconds");
});

socket.on("transcription-error", ({ message }) => {
  console.error("Transcription error:", message);
});

Environment Variable Reference

Speech-to-Text (STT)

| Variable     | Default                               | Description                                        |
|--------------|---------------------------------------|----------------------------------------------------|
| STT_ENABLED  | false                                 | Enable speech-to-text transcription                |
| STT_BASE_URL | http://speaches:8000/v1               | Speaches server URL (required when STT_ENABLED=true) |
| STT_MODEL    | Systran/faster-whisper-large-v3-turbo | Whisper model for transcription                    |
| STT_LANGUAGE | en                                    | Default language code                              |

Text-to-Speech (TTS) - Default Engine (Kokoro)

| Variable           | Default                   | Description                                       |
|--------------------|---------------------------|---------------------------------------------------|
| TTS_ENABLED        | false                     | Enable default TTS engine                         |
| TTS_DEFAULT_URL    | http://kokoro-tts:8880/v1 | Kokoro-FastAPI URL (required when TTS_ENABLED=true) |
| TTS_DEFAULT_VOICE  | af_heart                  | Default Kokoro voice ID                           |
| TTS_DEFAULT_FORMAT | mp3                       | Default audio output format                       |

Text-to-Speech (TTS) - Premium Engine (Chatterbox)

| Variable            | Default                       | Description                                             |
|---------------------|-------------------------------|---------------------------------------------------------|
| TTS_PREMIUM_ENABLED | false                         | Enable premium TTS engine                               |
| TTS_PREMIUM_URL     | http://chatterbox-tts:8881/v1 | Chatterbox TTS URL (required when TTS_PREMIUM_ENABLED=true) |

Text-to-Speech (TTS) - Fallback Engine (Piper/OpenedAI)

| Variable             | Default                        | Description                                                |
|----------------------|--------------------------------|------------------------------------------------------------|
| TTS_FALLBACK_ENABLED | false                          | Enable fallback TTS engine                                 |
| TTS_FALLBACK_URL     | http://openedai-speech:8000/v1 | OpenedAI Speech URL (required when TTS_FALLBACK_ENABLED=true) |

Service Limits

| Variable                    | Default  | Description                                  |
|-----------------------------|----------|----------------------------------------------|
| SPEECH_MAX_UPLOAD_SIZE      | 25000000 | Maximum upload file size in bytes (25 MB)    |
| SPEECH_MAX_DURATION_SECONDS | 600      | Maximum audio duration in seconds (10 minutes) |
| SPEECH_MAX_TEXT_LENGTH      | 4096     | Maximum text length for TTS in characters    |

Conditional Validation

When a service is enabled, its URL variable is required. If missing, the application fails at startup with a message like:

STT is enabled (STT_ENABLED=true) but required environment variables are missing or empty: STT_BASE_URL.
Either set these variables or disable by setting STT_ENABLED=false.

Boolean parsing: value === "true" or value === "1". Unset or empty values default to false.


Provider Configuration

Kokoro (Default Tier)

Engine: Kokoro-FastAPI
License: Apache 2.0
Requirements: CPU only
Docker Image: ghcr.io/remsky/kokoro-fastapi:latest-cpu

Capabilities:

  • 54 built-in voices across 8 languages
  • Speed control: 0.25x to 4.0x
  • Output formats: mp3, wav, opus, flac
  • Voice metadata derived from ID prefix (language, gender, accent)

Voice ID Format: {lang}{gender}_{name}

  • First character: language/accent (a=American, b=British, e=Spanish, f=French, h=Hindi, j=Japanese, p=Portuguese, z=Chinese)
  • Second character: gender (f=Female, m=Male)

Example voices:

| Voice ID   | Name    | Language | Gender |
|------------|---------|----------|--------|
| af_heart   | Heart   | en-US    | Female |
| am_adam    | Adam    | en-US    | Male   |
| bf_alice   | Alice   | en-GB    | Female |
| bm_daniel  | Daniel  | en-GB    | Male   |
| ef_dora    | Dora    | es       | Female |
| ff_camille | Camille | fr       | Female |
| jf_alpha   | Alpha   | ja       | Female |
| zf_xiaobei | Xiaobei | zh       | Female |
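The {lang}{gender}_{name} convention can be decoded with a small helper; the language map below follows the prefix list in this section:

```typescript
// Language-prefix map taken from the Voice ID Format description above.
const LANGUAGE_PREFIXES: Record<string, string> = {
  a: "en-US", b: "en-GB", e: "es", f: "fr",
  h: "hi", j: "ja", p: "pt", z: "zh",
};

function decodeKokoroVoiceId(id: string): { language: string; gender: string; name: string } {
  const [prefix, name] = id.split("_");
  return {
    language: LANGUAGE_PREFIXES[prefix[0]] ?? "unknown", // 1st char: language/accent
    gender: prefix[1] === "f" ? "Female" : "Male",       // 2nd char: gender
    name,
  };
}
// decodeKokoroVoiceId("af_heart") -> { language: "en-US", gender: "Female", name: "heart" }
```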

Chatterbox (Premium Tier)

Engine: Chatterbox TTS Server
License: Proprietary
Requirements: NVIDIA GPU with CUDA
Docker Image: devnen/chatterbox-tts-server:latest

Capabilities:

  • Voice cloning via reference audio sample
  • Emotion exaggeration control (0.0 - 1.0)
  • Cross-language voice transfer (23 languages)
  • Higher quality synthesis than default tier

Supported Languages: en, fr, de, es, it, pt, nl, pl, ru, uk, ja, zh, ko, ar, hi, tr, sv, da, fi, no, cs, el, ro

Extended Options (Chatterbox-specific):

| Option              | Type   | Description                                             |
|---------------------|--------|---------------------------------------------------------|
| referenceAudio      | Buffer | Audio sample for voice cloning (5-30 seconds recommended) |
| emotionExaggeration | number | Emotion intensity 0.0-1.0 (clamped)                     |

These are passed as extra body parameters to the OpenAI-compatible endpoint. Reference audio is base64-encoded before sending.
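For illustration only, such a request could be assembled with plain fetch. Note that the extension field names used here ("reference_audio", "exaggeration") are assumptions about the server's OpenAI-compatible extension, not confirmed Chatterbox parameter names:

```typescript
// Clamp helper mirroring the documented 0.0-1.0 clamping.
function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

async function synthesizeWithClone(
  baseUrl: string,
  text: string,
  referenceAudio: Buffer,
  emotionExaggeration: number
): Promise<Buffer> {
  const res = await fetch(`${baseUrl}/audio/speech`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      input: text,
      response_format: "mp3",
      // Hypothetical extra, non-OpenAI parameters carried in the body:
      reference_audio: referenceAudio.toString("base64"), // base64-encoded sample
      exaggeration: clamp01(emotionExaggeration),         // clamped to 0.0-1.0
    }),
  });
  if (!res.ok) throw new Error(`Chatterbox request failed: ${res.status}`);
  return Buffer.from(await res.arrayBuffer());
}
```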

Piper (Fallback Tier)

Engine: Piper via OpenedAI Speech
License: GPL (OpenedAI Speech)
Requirements: CPU only (runs on Raspberry Pi)
Docker Image: Use OpenedAI Speech image

Capabilities:

  • 100+ voices across 40+ languages
  • 6 standard OpenAI voice names (mapped to Piper voices)
  • Output formats: mp3, wav, opus, flac
  • Ultra-lightweight, designed for low-resource environments

Standard Voice Mapping:

| OpenAI Voice | Piper Voice          | Gender | Description            |
|--------------|----------------------|--------|------------------------|
| alloy        | en_US-amy-medium     | Female | Warm, balanced         |
| echo         | en_US-ryan-medium    | Male   | Clear, articulate      |
| fable        | en_GB-alan-medium    | Male   | British narrator       |
| onyx         | en_US-danny-low      | Male   | Deep, resonant         |
| nova         | en_US-lessac-medium  | Female | Expressive, versatile  |
| shimmer      | en_US-kristin-medium | Female | Bright, energetic      |

Speaches (STT)

Engine: Speaches (faster-whisper backend)
License: MIT
Requirements: CPU (GPU optional for faster inference)
Docker Image: ghcr.io/speaches-ai/speaches:latest

Capabilities:

  • OpenAI-compatible /v1/audio/transcriptions endpoint
  • Whisper models via faster-whisper
  • Verbose JSON response with segments and timestamps
  • Language detection

Default model: Systran/faster-whisper-large-v3-turbo


Voice Cloning Setup (Chatterbox)

Voice cloning is available through the Chatterbox premium TTS provider.

Prerequisites

  1. NVIDIA GPU with CUDA support
  2. nvidia-container-toolkit installed on the Docker host
  3. Docker runtime configured for GPU access
  4. TTS premium tier enabled (TTS_PREMIUM_ENABLED=true)

Basic Voice Cloning

Provide a reference audio sample (WAV or MP3, 5-30 seconds) when calling synthesize:

import { SpeechService } from "./speech.service";
import type { ChatterboxSynthesizeOptions } from "./interfaces/speech-types";

const options: ChatterboxSynthesizeOptions = {
  tier: "premium",
  referenceAudio: myAudioBuffer, // 5-30 second audio sample
  emotionExaggeration: 0.5, // 0.0 = neutral, 1.0 = maximum emotion
};

const result = await speechService.synthesize("Hello, this is my cloned voice!", options);

Voice Cloning Tips

  • Audio quality: Use clean recordings without background noise
  • Duration: 5-30 seconds works best; shorter clips may produce lower quality
  • Format: WAV provides the best quality; MP3 is also accepted
  • Emotion: Start with 0.5 (moderate) and adjust from there
  • Cross-language: You can clone a voice in one language and synthesize in another

Docker Compose Setup

Development (Local)

Speech services are defined in a separate overlay file docker-compose.speech.yml. This keeps them optional and separate from core services.

Start basic speech services (STT + default TTS):

# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml up -d

# Using Makefile
make speech-up

Start with premium TTS (requires NVIDIA GPU):

docker compose -f docker-compose.yml -f docker-compose.speech.yml --profile premium-tts up -d

Stop speech services:

# Using docker compose directly
docker compose -f docker-compose.yml -f docker-compose.speech.yml down --remove-orphans

# Using Makefile
make speech-down

View logs:

make speech-logs

Development Services

| Service        | Container             | Port                            | Image                                    |
|----------------|-----------------------|---------------------------------|------------------------------------------|
| Speaches (STT) | mosaic-speaches       | 8090 (host) -> 8000 (container) | ghcr.io/speaches-ai/speaches:latest      |
| Kokoro TTS     | mosaic-kokoro-tts     | 8880 (host) -> 8880 (container) | ghcr.io/remsky/kokoro-fastapi:latest-cpu |
| Chatterbox TTS | mosaic-chatterbox-tts | 8881 (host) -> 8000 (container) | devnen/chatterbox-tts-server:latest      |

Production (Docker Swarm)

For production deployments, use docker/docker-compose.sample.speech.yml. This file is designed for Docker Swarm with Traefik integration.

Required environment variables:

STT_DOMAIN=stt.example.com
TTS_DOMAIN=tts.example.com

Optional environment variables:

WHISPER_MODEL=Systran/faster-whisper-large-v3-turbo
CHATTERBOX_TTS_DOMAIN=tts-premium.example.com
TRAEFIK_ENTRYPOINT=websecure
TRAEFIK_CERTRESOLVER=letsencrypt
TRAEFIK_DOCKER_NETWORK=traefik-public
TRAEFIK_TLS_ENABLED=true

Deploy:

docker stack deploy -c docker/docker-compose.sample.speech.yml speech

Connecting to Mosaic Stack: Set the speech URLs in your Mosaic Stack .env:

# Same Docker network
STT_BASE_URL=http://speaches:8000/v1
TTS_DEFAULT_URL=http://kokoro-tts:8880/v1

# External / different network
STT_BASE_URL=https://stt.example.com/v1
TTS_DEFAULT_URL=https://tts.example.com/v1

Health Checks

All speech containers include health checks:

| Service        | Endpoint                     | Interval | Start Period |
|----------------|------------------------------|----------|--------------|
| Speaches       | http://localhost:8000/health | 30s      | 120s         |
| Kokoro TTS     | http://localhost:8880/health | 30s      | 120s         |
| Chatterbox TTS | http://localhost:8000/health | 30s      | 180s         |

Chatterbox has a longer start period (180s) because GPU model loading takes additional time.


GPU VRAM Budget

Only Chatterbox requires GPU resources. The other providers (Speaches, Kokoro, Piper) are CPU-only.

Chatterbox VRAM Requirements

| Component               | Approximate VRAM   |
|-------------------------|--------------------|
| Chatterbox TTS model    | ~2-4 GB            |
| Voice cloning inference | ~1-2 GB additional |
| Total recommended       | 4-6 GB             |

Shared GPU Considerations

If running multiple GPU services (e.g., Ollama for LLM + Chatterbox for TTS):

| Service            | VRAM Usage | Notes                             |
|--------------------|------------|-----------------------------------|
| Ollama (7B model)  | ~4-6 GB    | Depends on model size             |
| Ollama (13B model) | ~8-10 GB   | Larger models need more           |
| Chatterbox TTS     | ~4-6 GB    | Voice cloning is memory-intensive |
| Combined minimum   | 8-12 GB    | For 7B LLM + Chatterbox           |

Recommendations:

  • 8 GB VRAM: Adequate for small LLM + Chatterbox (may need to alternate)
  • 12 GB VRAM: Comfortable for 7B LLM + Chatterbox simultaneously
  • 24 GB VRAM: Supports larger LLMs + Chatterbox with headroom

If VRAM is limited, consider:

  1. Disabling Chatterbox (TTS_PREMIUM_ENABLED=false) and using Kokoro (CPU) as default
  2. Using the fallback chain so Kokoro handles requests when Chatterbox is busy
  3. Running Chatterbox on a separate GPU host

Docker Swarm GPU Scheduling

For Docker Swarm deployments with GPU, configure generic resources on the node:

// /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime"
    }
  },
  "node-generic-resources": ["NVIDIA-GPU=0"]
}

See the Docker GPU Swarm documentation for details.


Frontend Integration

Speech services are consumed from the frontend through the REST API and WebSocket gateway.

REST API Usage

Transcribe audio:

async function transcribeAudio(file: File, token: string, workspaceId: string) {
  const formData = new FormData();
  formData.append("file", file);
  formData.append("language", "en");

  const response = await fetch("/api/speech/transcribe", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
    body: formData,
  });

  const { data } = await response.json();
  return data.text;
}

Synthesize speech:

async function synthesizeSpeech(
  text: string,
  token: string,
  workspaceId: string,
  voice = "af_heart"
) {
  const response = await fetch("/api/speech/synthesize", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text, voice, format: "mp3" }),
  });

  const audioBlob = await response.blob();
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  audio.play();
}

List voices:

async function listVoices(token: string, workspaceId: string, tier?: string) {
  const url = tier ? `/api/speech/voices?tier=${tier}` : "/api/speech/voices";

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });

  const { data } = await response.json();
  return data; // VoiceInfo[]
}

WebSocket Streaming Usage

For real-time transcription using the browser's MediaRecorder API:

import { io } from "socket.io-client";

function createSpeechSocket(token: string) {
  const socket = io("/speech", {
    auth: { token },
  });

  let mediaRecorder: MediaRecorder | null = null;

  async function startRecording() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    mediaRecorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
    });

    socket.emit("start-transcription", { language: "en" });

    mediaRecorder.ondataavailable = (event) => {
      if (event.data.size > 0) {
        event.data.arrayBuffer().then((buffer) => {
          socket.emit("audio-chunk", new Uint8Array(buffer));
        });
      }
    };

    mediaRecorder.start(250); // Send chunks every 250ms
  }

  async function stopRecording(): Promise<string> {
    return new Promise((resolve, reject) => {
      socket.once("transcription-final", (result) => {
        resolve(result.text);
      });

      socket.once("transcription-error", ({ message }) => {
        reject(new Error(message));
      });

      if (mediaRecorder) {
        mediaRecorder.stop();
        mediaRecorder.stream.getTracks().forEach((track) => track.stop());
        mediaRecorder = null;
      }

      socket.emit("stop-transcription");
    });
  }

  return { socket, startRecording, stopRecording };
}

Check Speech Availability

Before showing speech UI elements, check provider availability:

async function checkSpeechHealth(token: string, workspaceId: string) {
  const response = await fetch("/api/speech/health", {
    headers: {
      Authorization: `Bearer ${token}`,
      "x-workspace-id": workspaceId,
    },
  });

  const { data } = await response.json();
  return {
    canTranscribe: data.stt.available,
    canSynthesize: data.tts.available,
  };
}