feat: M13-SpeechServices — TTS & STT integration #409

jason.woltje · 2026-02-15T09:30:58Z

Summary

Implements the complete M13-SpeechServices milestone (0.0.13) — speech-to-text and text-to-speech integration for Mosaic Stack.

Epic: #388 | Issues: #389-#406 (18 issues, all closed)
Stats: 62 files changed, 13,613 insertions, 500+ tests

What's included

API Backend (NestJS):

- SpeechModule with provider abstraction layer (ISTTProvider, ITTSProvider interfaces)
- STT provider using Speaches/faster-whisper via OpenAI-compatible API
- Tiered TTS architecture: Default (Kokoro-FastAPI), Premium (Chatterbox with voice cloning), Fallback (Piper via OpenedAI Speech)
- REST endpoints: POST /speech/transcribe, POST /speech/synthesize, GET /speech/voices, GET /speech/health
- WebSocket gateway at /speech namespace for streaming transcription
- Audio/text validation pipes, DTOs, ConfigModule integration with 14 env vars

Frontend (Next.js):

- VoiceInput component with microphone capture, audio visualization, WebSocket streaming
- AudioPlayer component with progress bar, speed control, download
- TextToSpeechButton for inline TTS playback
- SpeechSettings page with provider selection, voice config, health status

DevOps:

- Docker Compose dev overlay (speaches, kokoro-tts, chatterbox-tts containers)
- Docker Compose Swarm/prod deployment with Traefik labels, GPU reservation, health checks

Documentation:

- Comprehensive docs/SPEECH.md (architecture, API reference, deployment guide)

Test plan

- [ ] All 500+ unit tests pass (pnpm test in worktree)
- [ ] Lint and typecheck pass (pnpm lint && pnpm typecheck)
- [ ] Docker Compose dev overlay starts correctly
- [ ] REST endpoints respond with proper auth guards
- [ ] WebSocket streaming transcription works with audio chunks
- [ ] Frontend components render and handle state correctly
- [ ] E2E integration tests pass

Closes #388

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jason.woltje added 18 commits 2026-02-15 09:30:59 +00:00

chore(orchestrator): Bootstrap M13-SpeechServices tasks.md fb53272fa9

18 tasks across 7 phases for TTS & STT integration.
Estimated total: ~322K tokens.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(#401): add speech services config and env vars

ci/woodpecker/push/api Pipeline was successful

4cc43bece6

Add SpeechConfig with typed configuration and startup validation for
STT (Whisper/Speaches), TTS default (Kokoro), TTS premium (Chatterbox),
and TTS fallback (Piper/OpenedAI). Includes registerAs factory for
NestJS ConfigModule integration, .env.example documentation, and 51
unit tests covering all validation paths.

Refs #401

feat(#399): add Docker Compose dev overlay for speech services 52553c8266

Add docker-compose.speech.yml with three speech services:
- Speaches (STT via Whisper + basic TTS) on port 8090
- Kokoro-FastAPI (default TTS) on port 8880
- Chatterbox TTS (premium, GPU-required) on port 8881 behind
  the premium-tts profile

All services include health checks, connect to the mosaic-internal
network, and follow existing naming/labeling conventions. Makefile
targets added: speech-up, speech-down, speech-logs.

Fixes #399

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(#389): create SpeechModule with provider abstraction layer

ci/woodpecker/push/api Pipeline was successful

c40373fa3b

Add SpeechModule with provider interfaces and service skeleton for
multi-tier TTS fallback (premium -> default -> fallback) and STT
transcription support. Includes 27 unit tests covering provider
selection, fallback logic, and availability checks.

- ISTTProvider interface with transcribe/isHealthy methods
- ITTSProvider interface with synthesize/listVoices/isHealthy methods
- Shared types: SpeechTier, TranscriptionResult, SynthesisResult, etc.
- SpeechService with graceful TTS fallback chain
- NestJS injection tokens (STT_PROVIDER, TTS_PROVIDERS)
- SpeechModule registered in AppModule
- ConfigModule integration via speechConfig registerAs factory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
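
A minimal sketch of the premium -> default -> fallback chain this commit describes, assuming a trimmed-down `ITTSProvider` (only `synthesize`) and a hypothetical `synthesizeWithFallback` helper. The tier names come from the PR; the `SynthesisResult` shape, the map-based registry, and the error message are illustrative and may differ from the actual `SpeechService`.

```typescript
// Tier names are taken from the PR; all other shapes here are hypothetical.
type SpeechTier = "premium" | "default" | "fallback";

interface SynthesisResult {
  audio: Uint8Array;
  tier: SpeechTier;
}

// Trimmed interface: the real one also exposes listVoices() and isHealthy().
interface ITTSProvider {
  readonly tier: SpeechTier;
  synthesize(text: string): Promise<SynthesisResult>;
}

// Order in which tiers are attempted, starting from the requested tier.
const FALLBACK_ORDER: readonly SpeechTier[] = ["premium", "default", "fallback"];

async function synthesizeWithFallback(
  providers: ReadonlyMap<SpeechTier, ITTSProvider>,
  text: string,
  preferred: SpeechTier = "default",
): Promise<SynthesisResult> {
  const errors: string[] = [];
  const start = FALLBACK_ORDER.indexOf(preferred);

  for (const tier of FALLBACK_ORDER.slice(start)) {
    const provider = providers.get(tier);
    if (!provider) continue; // tier not configured, try the next one

    try {
      return await provider.synthesize(text);
    } catch (err) {
      // Record the failure and degrade to the next tier.
      errors.push(`${tier}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }

  throw new Error(`All TTS providers failed: ${errors.join("; ") || "none configured"}`);
}
```

Starting the loop at the requested tier means a `default` request never escalates to the GPU-backed premium tier; it can only degrade downward.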

feat(#391): implement tiered TTS provider architecture with base class 3ae9e53bcc

Add abstract BaseTTSProvider class that implements common OpenAI-compatible
TTS logic using the OpenAI SDK with configurable baseURL. Includes synthesize(),
listVoices(), and isHealthy() methods. Create TTS provider factory that
dynamically registers Kokoro (default), Chatterbox (premium), and Piper
(fallback) providers based on configuration. Update SpeechModule to use
the factory for TTS_PROVIDERS injection token.

Also fixes lint error in speaches-stt.provider.ts (Array<T> -> T[]).

30 tests added (22 base provider + 8 factory), all passing.

Fixes #391
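
As a rough illustration of a factory that registers providers from configuration, the sketch below builds a tier-keyed map from a hypothetical config object. The `TierConfig` shape, the `enabled` flag, and `TtsProviderStub` are assumptions for illustration; the PR's factory instantiates real provider classes derived from `BaseTTSProvider`.

```typescript
type SpeechTier = "default" | "premium" | "fallback";

// Hypothetical config shape: one entry per tier with an on/off flag and a base URL.
interface TierConfig {
  enabled: boolean;
  url: string;
}
type TtsConfig = Record<SpeechTier, TierConfig>;

// Stand-in for a real provider instance.
interface TtsProviderStub {
  name: string;
  tier: SpeechTier;
  baseURL: string;
}

// Tier-to-engine mapping as described in the PR.
const PROVIDER_NAMES: Record<SpeechTier, string> = {
  default: "kokoro",
  premium: "chatterbox",
  fallback: "piper",
};

function createTtsProviders(config: TtsConfig): Map<SpeechTier, TtsProviderStub> {
  const providers = new Map<SpeechTier, TtsProviderStub>();

  for (const tier of Object.keys(PROVIDER_NAMES) as SpeechTier[]) {
    const tierConfig = config[tier];
    if (!tierConfig.enabled) continue; // skip tiers that are switched off

    providers.set(tier, {
      name: PROVIDER_NAMES[tier],
      tier,
      baseURL: tierConfig.url,
    });
  }

  return providers;
}
```

Only enabled tiers end up in the map, so consumers can treat a missing key as "not configured" without consulting the config again.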

feat(#391): add base TTS provider and factory classes

ci/woodpecker/push/api Pipeline was successful

b5edb4f37e

Add the BaseTTSProvider abstract class and TTS provider factory that were
part of the tiered TTS architecture but missed from the previous commit.

- BaseTTSProvider: abstract base with synthesize(), listVoices(), isHealthy()
- tts-provider.factory: creates Kokoro/Chatterbox/Piper providers from config
- 30 tests (22 base provider + 8 factory)

Refs #391

feat(#393): implement Kokoro-FastAPI TTS provider with voice catalog

ci/woodpecker/push/api Pipeline failed

79b1d81d27

Extract KokoroTtsProvider from factory into its own module with:
- Full voice catalog of 54 built-in voices across 8 languages
- Voice metadata parsing from ID prefix (language, gender, accent)
- Exported constants for supported formats and speed range
- Comprehensive unit tests (48 tests)
- Fix lint/type errors in chatterbox provider (Prettier + unsafe cast)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
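
One way the ID-prefix parsing could look, assuming Kokoro's usual naming scheme in which the first letter encodes language or accent and the second encodes gender (for example `af_heart`). The prefix table below reflects that convention as commonly documented and is an assumption here; `parseVoiceId` and `VoiceMetadata` are hypothetical names, not necessarily those used in `KokoroTtsProvider`.

```typescript
interface VoiceMetadata {
  id: string;
  language: string;
  accent?: string;
  gender: "female" | "male";
}

// Assumed prefix table: first letter of the voice ID -> language (and accent).
const LANGUAGE_BY_PREFIX: Record<string, { language: string; accent?: string }> = {
  a: { language: "en", accent: "American" },
  b: { language: "en", accent: "British" },
  e: { language: "es" },
  f: { language: "fr" },
  h: { language: "hi" },
  i: { language: "it" },
  j: { language: "ja" },
  p: { language: "pt", accent: "Brazilian" },
  z: { language: "zh" },
};

function parseVoiceId(id: string): VoiceMetadata | null {
  // Expected form: <language letter><f|m>_<name>, e.g. "af_heart".
  const match = /^([a-z])([fm])_[a-z0-9]+$/.exec(id);
  if (!match) return null;

  const languageKey = match[1] ?? "";
  const genderKey = match[2] ?? "";
  const lang = LANGUAGE_BY_PREFIX[languageKey];
  if (!lang) return null; // unknown language prefix

  return { id, ...lang, gender: genderKey === "f" ? "female" : "male" };
}
```

Returning `null` for unrecognised IDs keeps the catalog code free of exceptions when an upstream release adds a prefix the table does not know yet.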

feat(#394): implement Chatterbox TTS provider with voice cloning

ci/woodpecker/push/api Pipeline was successful

d37c78f503

Add ChatterboxSynthesizeOptions interface with referenceAudio and
emotionExaggeration fields, and comprehensive unit tests (26 tests)
covering voice cloning, emotion control, clamping, graceful degradation,
and cross-language support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(#398): add audio/text validation pipes and speech DTOs

ci/woodpecker/push/api Pipeline was successful

7b4fda6011

Create AudioValidationPipe for MIME type and file size validation,
TextValidationPipe for TTS text input validation, and DTOs for
transcribe/synthesize endpoints. Includes 36 unit tests.

Fixes #398
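
The core check behind a pipe like `AudioValidationPipe` might be sketched as a plain function, shown here without NestJS so it runs standalone. The 25 MB cap matches the limit listed in the E2E scenarios for this PR; the allowed MIME list, the `UploadedAudio` shape, and the error messages are assumptions, and a real pipe would presumably throw `BadRequestException` rather than a bare `Error`.

```typescript
// 25 MB, matching the limit named in the PR's E2E scenarios.
const MAX_AUDIO_BYTES = 25 * 1024 * 1024;

// Assumed allow-list; the actual pipe may accept a different set.
const ALLOWED_AUDIO_MIME_TYPES: ReadonlySet<string> = new Set([
  "audio/wav",
  "audio/mpeg",
  "audio/webm",
  "audio/ogg",
  "audio/flac",
  "audio/mp4",
]);

// Minimal subset of the fields a multipart upload typically carries.
interface UploadedAudio {
  mimetype: string;
  size: number;
}

function validateAudioUpload(file: UploadedAudio | undefined): UploadedAudio {
  if (!file) {
    throw new Error("No audio file provided");
  }

  // Browsers often append codec parameters, e.g. "audio/webm;codecs=opus".
  const baseType = (file.mimetype.split(";")[0] ?? "").trim().toLowerCase();
  if (!ALLOWED_AUDIO_MIME_TYPES.has(baseType)) {
    throw new Error(`Unsupported audio format: ${file.mimetype}`);
  }

  if (file.size > MAX_AUDIO_BYTES) {
    throw new Error(`Audio file exceeds ${MAX_AUDIO_BYTES} bytes`);
  }

  return file;
}
```

Stripping MIME parameters before the lookup matters for recordings captured with `MediaRecorder`, which commonly report the codec alongside the container type.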

feat(#395): implement Piper TTS provider via OpenedAI Speech

ci/woodpecker/push/api Pipeline was successful

6c465566f6

Add fallback-tier TTS provider using Piper via OpenedAI Speech for
ultra-lightweight CPU-only synthesis. Maps 6 standard OpenAI voice
names (alloy, echo, fable, onyx, nova, shimmer) to Piper voices.
Update factory to use the new PiperTtsProvider class, replacing the
inline stub. Includes 37 unit tests covering provider identity,
voice mapping, and voice listing.

Fixes #395

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(#392): create /api/speech/transcribe REST endpoint

ci/woodpecker/push/api Pipeline was successful

527262af38

Add SpeechController with POST /api/speech/transcribe for audio
transcription and GET /api/speech/health for provider status.
Uses AudioValidationPipe for file upload validation and returns
results in standard { data: T } envelope.

Includes 10 unit tests covering transcribe with options, error
propagation, and all health status combinations.

Fixes #392

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(#400): add Docker Compose swarm/prod deployment for speech services

ci/woodpecker/push/infra Pipeline was successful

b3d6d73348

Add docker/docker-compose.sample.speech.yml for standalone speech services
deployment in Docker Swarm with Portainer compatibility:

- Speaches (STT + basic TTS) with Whisper model configuration
- Kokoro TTS (default high-quality TTS) always deployed
- Chatterbox TTS (premium, GPU) commented out as optional
- Traefik labels for reverse proxy routing with TLS
- Health checks on all services
- Volume persistence for Whisper models
- GPU reservation via Swarm generic resources for Chatterbox
- Environment variable substitution for Portainer
- Comprehensive header documentation

Fixes #400

feat(#397): implement WebSocket streaming transcription gateway

ci/woodpecker/push/api Pipeline was successful

28c9e6fe65

Add SpeechGateway with Socket.IO namespace /speech for real-time
streaming transcription. Supports start-transcription, audio-chunk,
and stop-transcription events with session management, authentication,
and buffer size rate limiting. Includes 29 unit tests covering
authentication, session lifecycle, error handling, cleanup, and
client isolation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
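
The session lifecycle behind the `start-transcription`, `audio-chunk`, and `stop-transcription` events could be modelled roughly as below. This is a sketch with no Socket.IO dependency: `TranscriptionSessions`, its method names, and the byte limit passed to the constructor are all hypothetical, and the real gateway additionally handles authentication and emits error events.

```typescript
interface Session {
  chunks: Uint8Array[];
  bytes: number;
}

// Per-client audio buffers with a hard size cap (hypothetical helper).
class TranscriptionSessions {
  private readonly sessions = new Map<string, Session>();
  private readonly maxBytes: number;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  // start-transcription: open (or reset) the client's buffer.
  start(clientId: string): void {
    this.sessions.set(clientId, { chunks: [], bytes: 0 });
  }

  // audio-chunk: append data, rejecting anything past the cap.
  append(clientId: string, chunk: Uint8Array): void {
    const session = this.sessions.get(clientId);
    if (!session) throw new Error("No active transcription session");

    if (session.bytes + chunk.byteLength > this.maxBytes) {
      this.sessions.delete(clientId); // drop the session so memory is released
      throw new Error("Audio buffer limit exceeded");
    }
    session.chunks.push(chunk);
    session.bytes += chunk.byteLength;
  }

  // stop-transcription: return the concatenated audio and close the session.
  stop(clientId: string): Uint8Array {
    const session = this.sessions.get(clientId);
    if (!session) throw new Error("No active transcription session");
    this.sessions.delete(clientId);

    const audio = new Uint8Array(session.bytes);
    let offset = 0;
    for (const chunk of session.chunks) {
      audio.set(chunk, offset);
      offset += chunk.byteLength;
    }
    return audio;
  }

  // Disconnect cleanup: discard whatever the client had buffered.
  drop(clientId: string): void {
    this.sessions.delete(clientId);
  }
}
```

Keying buffers by client ID is what gives the client isolation the tests cover: one socket's chunks can never be appended to another socket's session.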

feat(#403): add audio playback component for TTS output

ci/woodpecker/push/web Pipeline was successful

74d6c1092e

Implements AudioPlayer inline component with play/pause, progress bar,
speed control (0.5x-2x), download, and duration display. Adds
TextToSpeechButton "Read aloud" component that synthesizes text via
the speech API and integrates AudioPlayer for playback. Includes
useTextToSpeech hook with API integration, audio caching, and
playback state management. All 32 tests passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(#404): add speech settings page with provider config

ci/woodpecker/push/web Pipeline was successful

bc86947d01

Implements the SpeechSettings component with four sections:
- STT settings (enable/disable, language preference)
- TTS settings (enable/disable, voice selector, tier preference, auto-play, speed control)
- Voice preview with test button
- Provider status with health indicators

Also adds Slider UI component and getHealthStatus API client function.
30 unit tests covering all sections, toggles, voice loading, and PDA-friendly design.

Fixes #404

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs(#406): add speech services documentation

ci/woodpecker/push/api Pipeline was successful

24065aa199

Comprehensive documentation for the speech services module:
- docs/SPEECH.md: Architecture, API reference, WebSocket protocol,
  environment variables, provider configuration, Docker setup,
  GPU VRAM budget, and frontend integration examples
- apps/api/src/speech/AGENTS.md: Module structure, provider pattern,
  how to add new providers, gotchas, and test patterns
- README.md: Speech capabilities section with quick start

Fixes #406

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

test(#405): add E2E integration tests for speech services

ci/woodpecker/push/api Pipeline was successful

d2c7602430

Adds comprehensive integration tests covering all 9 required scenarios:
1. REST transcription (POST /speech/transcribe)
2. REST synthesis (POST /speech/synthesize)
3. Provider fallback (premium -> default -> fallback chain)
4. WebSocket streaming transcription lifecycle
5. Audio MIME type validation (reject invalid formats)
6. File size limit enforcement (25 MB max)
7. Authentication on all endpoints (401 without token)
8. Voice listing with tier filtering (GET /speech/voices)
9. Health check status (GET /speech/health)

Uses NestJS testing module with mocked providers (CI-compatible).
30 test cases, all passing.

Fixes #405

chore(orchestrator): finalize M13-SpeechServices tasks.md — all 18/18 done dcbc8d1053

All tasks completed successfully across 7 phases:
- Phase 1: Config + Module foundation (2/2)
- Phase 2: STT + TTS providers (5/5)
- Phase 3: Middleware + REST endpoints (3/3)
- Phase 4: WebSocket streaming (1/1)
- Phase 5: Docker/DevOps (2/2)
- Phase 6: Frontend components (3/3)
- Phase 7: E2E tests + Documentation (2/2)

Total: 500+ tests across API and web packages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jason.woltje added this to the M13-SpeechServices (0.0.13) milestone 2026-02-15 09:31:08 +00:00

jason.woltje added 1 commit 2026-02-15 09:45:08 +00:00

fix(#388): address PR review findings — fix WebSocket/REST bugs, improve error handling, fix types and comments

ci/woodpecker/push/web Pipeline was successful

ci/woodpecker/push/api Pipeline was successful

af9c5799af

Critical fixes:
- Fix FormData field name mismatch (audio -> file) to match backend FileInterceptor
- Add /speech namespace to WebSocket connection URL
- Pass auth token in WebSocket handshake options
- Wrap audio.play() in try-catch for NotAllowedError and DOMException handling
- Replace bare catch block with named error parameter and descriptive message
- Add connect_error and disconnect event handlers to WebSocket
- Update JSDoc to accurately describe batch transcription (not real-time partial)

Important fixes:
- Emit transcription-error before disconnect in gateway auth failures
- Capture MediaRecorder error details and clean up media tracks on error
- Change TtsDefaultConfig.format type from string to AudioFormat
- Define canonical SPEECH_TIERS and AUDIO_FORMATS arrays as single source of truth
- Fix voice count from 54 to 53 in provider, AGENTS.md, and docs
- Fix inaccurate comments (Piper formats, tier prop, SpeachesProvider, TextValidationPipe)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
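
The "single source of truth" fix for `SPEECH_TIERS` and `AUDIO_FORMATS` is commonly done with `as const` arrays from which the union types are derived. The sketch below shows that pattern under assumptions: the tier values come from the PR, while the format list and the `isSpeechTier` / `isAudioFormat` guards are illustrative.

```typescript
// Tier values as named in the PR.
const SPEECH_TIERS = ["default", "premium", "fallback"] as const;
type SpeechTier = (typeof SPEECH_TIERS)[number];

// Assumed format list; the PR's canonical array may differ.
const AUDIO_FORMATS = ["mp3", "wav", "opus", "flac", "aac", "pcm"] as const;
type AudioFormat = (typeof AUDIO_FORMATS)[number];

// Runtime guards built from the same arrays, so types and validation cannot drift.
function isSpeechTier(value: unknown): value is SpeechTier {
  return SPEECH_TIERS.some((tier) => tier === value);
}

function isAudioFormat(value: unknown): value is AudioFormat {
  return AUDIO_FORMATS.some((format) => format === value);
}

// With the narrower type, a typo such as "mp4" is rejected at compile time.
interface TtsDefaultConfig {
  format: AudioFormat;
}

const exampleConfig: TtsDefaultConfig = { format: "mp3" };
```

Because the union types are computed from the arrays, adding a tier or format in one place updates both compile-time checking and runtime validation.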

jason.woltje added 1 commit 2026-02-15 18:31:51 +00:00

merge: resolve conflicts with develop (M10-Telemetry + M12-MatrixBridge)

ci/woodpecker/push/infra Pipeline was successful

ci/woodpecker/push/coordinator Pipeline was successful

ci/woodpecker/push/orchestrator Pipeline was successful

ci/woodpecker/push/api Pipeline was successful

ci/woodpecker/push/web Pipeline was successful

cf28efa880

Merge origin/develop into feature/m13-speech-services to incorporate
M10-Telemetry and M12-MatrixBridge changes. Resolved 4 conflicts:
- .env.example: Added speech config alongside telemetry + matrix config
- Makefile: Added speech targets alongside matrix targets
- app.module.ts: Import both MosaicTelemetryModule and SpeechModule
- docs/tasks.md: Combined all milestone task tracking sections

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jason.woltje merged commit 1fde25760a into develop

2026-02-15 18:37:54 +00:00

jason.woltje referenced this issue from a commit

2026-02-15 18:37:56 +00:00

Merge pull request 'feat: M13-SpeechServices — TTS & STT integration' (#409) from feature/m13-speech-services into develop

Reference: mosaic/stack#409