EPIC: M13-SpeechServices — TTS & STT Integration #388

Closed
opened 2026-02-15 07:33:00 +00:00 by jason.woltje · 1 comment
Overview

Integrate Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities into Mosaic Stack using a tiered, OpenAI-compatible API architecture.

Architecture

All speech providers expose OpenAI-compatible endpoints for both TTS and STT, enabling a single NestJS integration pattern through one OpenAI-compatible npm client with configurable base URLs.
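As a sketch of what "one integration pattern, many base URLs" means in practice, the helpers below build OpenAI-compatible speech requests against any provider host. The endpoint paths (`/audio/speech`, `/audio/transcriptions`) are the standard OpenAI audio routes and the host URLs are placeholders; none of this is confirmed project code.

```typescript
// Sketch only: OpenAI-compatible request builders with a configurable base URL.
// Hosts, ports, and model names below are illustrative assumptions.

interface SpeechProviderConfig {
  baseUrl: string;   // e.g. "http://kokoro:8880/v1" (hypothetical service host)
  apiKey?: string;   // most self-hosted providers accept any bearer token
}

// Text-to-speech: POST JSON to the standard OpenAI speech route.
function ttsRequest(cfg: SpeechProviderConfig, text: string, voice: string): Request {
  return new Request(`${cfg.baseUrl}/audio/speech`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      ...(cfg.apiKey ? { Authorization: `Bearer ${cfg.apiKey}` } : {}),
    },
    body: JSON.stringify({ model: "tts-1", input: text, voice }),
  });
}

// Speech-to-text: POST multipart form data to the standard transcription route.
function sttRequest(cfg: SpeechProviderConfig, audio: Blob): Request {
  const form = new FormData();
  form.append("file", audio, "audio.wav");
  form.append("model", "whisper-1");
  return new Request(`${cfg.baseUrl}/audio/transcriptions`, {
    method: "POST",
    headers: cfg.apiKey ? { Authorization: `Bearer ${cfg.apiKey}` } : undefined,
    body: form,
  });
}
```

Because every provider speaks the same dialect, swapping tiers is just a change of `baseUrl`.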

TTS Tiers

| Tier | Provider | License | Runs On | Use Case |
|------|----------|---------|---------|----------|
| Default | Kokoro-82M via Kokoro-FastAPI | Apache 2.0 | CPU | Fast, good quality, always available |
| Premium | Chatterbox via Chatterbox-TTS-Server | MIT | GPU | Voice cloning, best quality, emotion control |
| Fallback | Piper via OpenedAI Speech | GPL | CPU | Ultra-lightweight, Home Assistant compatible |
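The tier ordering above implies a fallback chain: prefer Premium when the GPU service is healthy, otherwise Default, otherwise the lightweight Fallback. A minimal selection sketch (tier names mirror the table; the function itself is illustrative, not the epic's actual API):

```typescript
// Sketch: ordered tier fallback for TTS, mirroring the tiers table.
type TtsTier = "premium" | "default" | "fallback";

// Best-to-worst preference order.
const TIER_ORDER: TtsTier[] = ["premium", "default", "fallback"];

// Return the first available tier at or below the preferred one.
function pickTier(
  available: Set<TtsTier>,
  preferred: TtsTier = "premium",
): TtsTier | undefined {
  const start = TIER_ORDER.indexOf(preferred);
  return TIER_ORDER.slice(start).find((t) => available.has(t));
}
```

So a request that asks for Premium while only the CPU services are up would degrade to Default rather than fail.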

STT

| Provider | License | Runs On | Use Case |
|----------|---------|---------|----------|
| Speaches + faster-whisper | MIT | GPU (CPU fallback) | Primary STT, OpenAI-compatible, 7-8% WER |

Key Insight

Speaches can serve both STT (faster-whisper) and TTS (Kokoro/Piper) in a single container for simplified deployment.
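For the simplified deployment, that single container could look roughly like the Compose fragment below. This is a hedged sketch only: the image name, tag, and port are assumptions and would need to be verified against the Speaches project's own documentation.

```yaml
# Sketch, not a verified deployment: one Speaches service covering both
# STT (faster-whisper) and TTS. Image/tag/port are assumptions.
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest   # assumed image name
    ports:
      - "8000:8000"                              # assumed service port
```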

Research

Full research: jarvis-brain

Scope

  • NestJS SpeechModule with provider abstraction
  • STT transcription (REST + WebSocket streaming)
  • TTS synthesis (REST + streaming)
  • Three TTS providers (Kokoro, Chatterbox, Piper)
  • Docker Compose dev + swarm deployment
  • Frontend voice I/O components
  • E2E integration tests
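The first scope item, a SpeechModule with provider abstraction, suggests a shape like the following. Interface and class names here are hypothetical placeholders, not the module's real API:

```typescript
// Sketch of a provider abstraction for the SpeechModule scope item.
// All names are hypothetical.

interface TtsProvider {
  readonly name: string;
  // Synthesize text to raw audio bytes.
  synthesize(text: string, voice: string): Promise<ArrayBuffer>;
}

// Simple registry the module could expose; NestJS DI wiring omitted.
class ProviderRegistry {
  private providers = new Map<string, TtsProvider>();

  register(p: TtsProvider): void {
    this.providers.set(p.name, p);
  }

  get(name: string): TtsProvider {
    const p = this.providers.get(name);
    if (!p) throw new Error(`unknown TTS provider: ${name}`);
    return p;
  }
}
```

Each of the three TTS providers (Kokoro, Chatterbox, Piper) would then be one `TtsProvider` implementation registered under its own name.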

Issues

Track sub-issues in this milestone.

jason.woltje added this to the M13-SpeechServices (0.0.13) milestone 2026-02-15 07:33:00 +00:00

M13-SpeechServices milestone complete. All 18 sub-issues (#389-#406) implemented and closed. 62 files changed, 13,613 lines added. 500+ tests across API and web packages. Branch: feature/m13-speech-services. PR to develop pending.


Reference: mosaic/stack#388