Mosaic Telemetry Integration Guide
1. Overview
What is Mosaic Telemetry?
Mosaic Telemetry is a task completion tracking system purpose-built for AI operations within Mosaic Stack. It captures detailed metrics about every AI task execution -- token usage, cost, duration, outcome, and quality gate results -- and submits them to a central telemetry API for aggregation and analysis.
The aggregated data powers a prediction system that provides pre-task estimates for cost, token usage, and expected quality, enabling informed decisions before dispatching work to AI agents.
How It Differs from OpenTelemetry
Mosaic Stack uses two separate telemetry systems that serve different purposes:
| Aspect | OpenTelemetry (OTEL) | Mosaic Telemetry |
|---|---|---|
| Purpose | Distributed request tracing and observability | AI task completion metrics and predictions |
| What it tracks | HTTP requests, spans, latency, errors | Token counts, costs, outcomes, quality gates |
| Data destination | OTEL Collector (Jaeger, Grafana, etc.) | Mosaic Telemetry API (PostgreSQL-backed) |
| Module location (API) | `apps/api/src/telemetry/` | `apps/api/src/mosaic-telemetry/` |
| Module location (Coordinator) | `apps/coordinator/src/telemetry.py` | `apps/coordinator/src/mosaic_telemetry.py` |
Both systems can run simultaneously. They are completely independent.
Architecture
```
+------------------+        +------------------+
|   Mosaic API     |        |   Coordinator    |
|    (NestJS)      |        |    (FastAPI)     |
+--------+---------+        +--------+---------+
         |                           |
    Track events                Track events
         |                           |
         v                           v
+------------------------------------------+
|          Telemetry Client SDK            |
|  (JS: @mosaicstack/telemetry-client)     |
|  (Py: mosaicstack-telemetry)             |
|                                          |
|  - Event queue (in-memory)               |
|  - Batch submission (5-min intervals)    |
|  - Prediction cache (6hr TTL)            |
+-------------------+----------------------+
                    |
           HTTP POST /events
           HTTP POST /predictions
                    |
                    v
+------------------------------------------+
|          Mosaic Telemetry API            |
|          (Separate service)              |
|                                          |
|  - Event ingestion & validation          |
|  - Aggregation & statistics              |
|  - Prediction generation                 |
+-------------------+----------------------+
                    |
                    v
            +---------------+
            |  PostgreSQL   |
            +---------------+
```
Data flow:
1. Application code calls `trackTaskCompletion()` (JS) or `client.track()` (Python)
2. Events are queued in memory (up to 1,000 events)
3. A background timer flushes the queue every 5 minutes in batches of up to 100
4. The telemetry API ingests events, validates them, and stores them in PostgreSQL
5. Prediction queries are served from aggregated data with a 6-hour cache TTL
2. Configuration Guide
Environment Variables
All configuration is done through environment variables prefixed with MOSAIC_TELEMETRY_:
| Variable | Type | Default | Description |
|---|---|---|---|
| `MOSAIC_TELEMETRY_ENABLED` | boolean | `true` | Master switch. Set to `false` to completely disable telemetry (no HTTP calls). |
| `MOSAIC_TELEMETRY_SERVER_URL` | string | (none) | URL of the telemetry API server. For Docker Compose: `http://telemetry-api:8000`. For production: `https://tel-api.mosaicstack.dev`. |
| `MOSAIC_TELEMETRY_API_KEY` | string | (none) | API key for authenticating with the telemetry server. Generate with `openssl rand -hex 32` (64-char hex string). |
| `MOSAIC_TELEMETRY_INSTANCE_ID` | string | (none) | Unique UUID identifying this Mosaic Stack instance. Generate with `uuidgen` or `python -c "import uuid; print(uuid.uuid4())"`. |
| `MOSAIC_TELEMETRY_DRY_RUN` | boolean | `false` | When `true`, events are logged to the console instead of being sent via HTTP. Useful for development. |
Enabling Telemetry
To enable telemetry, set `MOSAIC_TELEMETRY_ENABLED=true` along with the three required variables in your `.env` file:

```bash
MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_SERVER_URL=http://telemetry-api:8000
MOSAIC_TELEMETRY_API_KEY=<your-64-char-hex-api-key>
MOSAIC_TELEMETRY_INSTANCE_ID=<your-uuid>
```
If MOSAIC_TELEMETRY_ENABLED is true but any of SERVER_URL, API_KEY, or INSTANCE_ID is missing, the service logs a warning and disables telemetry gracefully. This is intentional: telemetry configuration issues never prevent the application from starting.
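The warn-and-disable behavior can be sketched as follows (the function and logger names here are illustrative, not the actual SDK internals):

```python
import logging
import os

logger = logging.getLogger("mosaic.telemetry")

REQUIRED_VARS = [
    "MOSAIC_TELEMETRY_SERVER_URL",
    "MOSAIC_TELEMETRY_API_KEY",
    "MOSAIC_TELEMETRY_INSTANCE_ID",
]

def resolve_telemetry_config():
    """Return a config dict, or None to disable telemetry gracefully."""
    if os.getenv("MOSAIC_TELEMETRY_ENABLED", "true").lower() != "true":
        return None  # explicitly disabled: all tracking becomes a no-op
    missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
    if missing:
        # Never crash the application over telemetry: warn and disable instead.
        logger.warning(
            "Mosaic Telemetry is enabled but missing configuration: %s", missing
        )
        return None
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

The key property is that every failure path returns `None` rather than raising, so a misconfigured instance boots normally with telemetry off.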
Disabling Telemetry
Set MOSAIC_TELEMETRY_ENABLED=false in your .env. No HTTP calls will be made, and all tracking methods become safe no-ops.
Dry-Run Mode
For local development and debugging, enable dry-run mode:
```bash
MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_DRY_RUN=true
MOSAIC_TELEMETRY_SERVER_URL=http://localhost:8000  # Not actually called
MOSAIC_TELEMETRY_API_KEY=0000000000000000000000000000000000000000000000000000000000000000
MOSAIC_TELEMETRY_INSTANCE_ID=00000000-0000-0000-0000-000000000000
```
In dry-run mode, the SDK logs event payloads to the console instead of submitting them via HTTP. This lets you verify that tracking points are firing correctly without needing a running telemetry API.
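Conceptually, dry-run mode is just a branch at the submission step; a minimal sketch (class and method names are hypothetical, not the real SDK API):

```python
import json
import logging

logger = logging.getLogger("mosaic.telemetry")

class BatchSubmitter:
    """Hypothetical sketch of the dry-run branch in an SDK submit path."""

    def __init__(self, dry_run: bool = False):
        self.dry_run = dry_run
        self.sent = []  # stands in for the real HTTP transport

    def submit_batch(self, events: list) -> None:
        if self.dry_run:
            # Log payloads to the console instead of POSTing them anywhere.
            for event in events:
                logger.info("[dry-run] %s", json.dumps(event))
            return
        self.sent.extend(events)  # real mode: hand the batch to the HTTP client
```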
Docker Compose Configuration
Both docker-compose.yml (root) and docker/docker-compose.yml pass telemetry environment variables to the API service:
```yaml
services:
  mosaic-api:
    environment:
      # Telemetry (task completion tracking & predictions)
      MOSAIC_TELEMETRY_ENABLED: ${MOSAIC_TELEMETRY_ENABLED:-false}
      MOSAIC_TELEMETRY_SERVER_URL: ${MOSAIC_TELEMETRY_SERVER_URL:-http://telemetry-api:8000}
      MOSAIC_TELEMETRY_API_KEY: ${MOSAIC_TELEMETRY_API_KEY:-}
      MOSAIC_TELEMETRY_INSTANCE_ID: ${MOSAIC_TELEMETRY_INSTANCE_ID:-}
      MOSAIC_TELEMETRY_DRY_RUN: ${MOSAIC_TELEMETRY_DRY_RUN:-false}
```
Note that telemetry defaults to false in Docker Compose. Set MOSAIC_TELEMETRY_ENABLED=true in your .env to activate it.
An optional local telemetry API service is available (commented out in docker/docker-compose.yml). Uncomment it to run a self-contained development environment:
```yaml
# Uncomment in docker/docker-compose.yml
telemetry-api:
  image: git.mosaicstack.dev/mosaic/telemetry-api:latest
  container_name: mosaic-telemetry-api
  restart: unless-stopped
  environment:
    HOST: 0.0.0.0
    PORT: 8000
  ports:
    - "8001:8000"
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 10s
  networks:
    - mosaic-network
```
3. What Gets Tracked
TaskCompletionEvent Schema
Every tracked event conforms to the TaskCompletionEvent interface. This is the core data structure submitted to the telemetry API:
| Field | Type | Description |
|---|---|---|
| `instance_id` | string | UUID of the Mosaic Stack instance that generated the event |
| `event_id` | string | Unique UUID for this event (auto-generated by the SDK) |
| `schema_version` | string | Schema version for forward compatibility (auto-set by the SDK) |
| `timestamp` | string | ISO 8601 timestamp of event creation (auto-set by the SDK) |
| `task_duration_ms` | number | How long the task took, in milliseconds |
| `task_type` | TaskType | Type of task performed (see enum below) |
| `complexity` | Complexity | Complexity level of the task |
| `harness` | Harness | The coding harness or tool used |
| `model` | string | AI model name (e.g., "claude-sonnet-4-5") |
| `provider` | Provider | AI model provider |
| `estimated_input_tokens` | number | Pre-task estimated input tokens (from predictions) |
| `estimated_output_tokens` | number | Pre-task estimated output tokens (from predictions) |
| `actual_input_tokens` | number | Actual input tokens consumed |
| `actual_output_tokens` | number | Actual output tokens generated |
| `estimated_cost_usd_micros` | number | Pre-task estimated cost in microdollars (USD * 1,000,000) |
| `actual_cost_usd_micros` | number | Actual cost in microdollars |
| `quality_gate_passed` | boolean | Whether all quality gates passed |
| `quality_gates_run` | QualityGate[] | List of quality gates that were executed |
| `quality_gates_failed` | QualityGate[] | List of quality gates that failed |
| `context_compactions` | number | Number of context window compactions during the task |
| `context_rotations` | number | Number of context window rotations during the task |
| `context_utilization_final` | number | Final context window utilization (0.0 to 1.0) |
| `outcome` | Outcome | Task outcome |
| `retry_count` | number | Number of retries before completion |
| `language` | string? | Primary programming language (optional) |
| `repo_size_category` | RepoSizeCategory? | Repository size category (optional) |
Enum Values
TaskType:
planning, implementation, code_review, testing, debugging, refactoring, documentation, configuration, security_audit, unknown
Complexity:
low, medium, high, critical
Harness:
claude_code, opencode, kilo_code, aider, api_direct, ollama_local, custom, unknown
Provider:
anthropic, openai, openrouter, ollama, google, mistral, custom, unknown
QualityGate:
build, lint, test, coverage, typecheck, security
Outcome:
success, failure, partial, timeout
RepoSizeCategory:
tiny, small, medium, large, huge
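The `*_usd_micros` fields store money as integer microdollars to sidestep floating-point rounding; converting is just a scale by 1,000,000. A small sketch (helper names are illustrative):

```python
def usd_to_micros(usd: float) -> int:
    """Convert a dollar amount to integer microdollars (USD * 1,000,000)."""
    return round(usd * 1_000_000)

def micros_to_usd(micros: int) -> float:
    """Convert microdollars back to dollars for display."""
    return micros / 1_000_000
```

For example, a $0.03 task is stored as `30000` microdollars.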
API Service: LLM Call Tracking
The NestJS API tracks every LLM service call (chat, streaming chat, and embeddings) via LlmTelemetryTrackerService at apps/api/src/llm/llm-telemetry-tracker.service.ts.
Tracked operations:
- `chat` -- Synchronous chat completions
- `chatStream` -- Streaming chat completions
- `embed` -- Embedding generation
For each call, the tracker captures:
- Model name and provider type
- Input and output token counts
- Duration in milliseconds
- Success or failure outcome
- Calculated cost from the built-in cost table (`apps/api/src/llm/llm-cost-table.ts`)
- Task type inferred from calling context (e.g., `"brain"` maps to `planning`, `"review"` maps to `code_review`)
The cost table uses longest-prefix matching on model names and covers all major Anthropic and OpenAI models. Ollama/local models are treated as zero-cost.
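Longest-prefix matching picks the most specific table entry whose key is a prefix of the model name. A sketch with made-up rates (these are not the real values in `llm-cost-table.ts`):

```python
# Illustrative per-million-token rates in microdollars; NOT the real cost table.
COST_TABLE = {
    "claude-sonnet-4": {"input": 3_000_000, "output": 15_000_000},
    "claude-sonnet-4-5": {"input": 3_000_000, "output": 15_000_000},
    "gpt-4o": {"input": 2_500_000, "output": 10_000_000},
}

def lookup_rates(model: str):
    """Return rates for the longest table key that prefixes the model name."""
    matches = [key for key in COST_TABLE if model.startswith(key)]
    if not matches:
        return None  # unknown models (e.g., Ollama/local) are treated as zero-cost
    return COST_TABLE[max(matches, key=len)]

def cost_micros(model: str, input_tokens: int, output_tokens: int) -> int:
    """Total cost in microdollars for one call, zero for unknown models."""
    rates = lookup_rates(model)
    if rates is None:
        return 0
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) // 1_000_000
```

A dated model name like `claude-sonnet-4-5-20250929` matches both `claude-sonnet-4` and `claude-sonnet-4-5`; taking the longest match resolves it to the more specific entry.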
Coordinator: Agent Task Dispatch Tracking
The FastAPI coordinator tracks agent task completions in apps/coordinator/src/mosaic_telemetry.py and apps/coordinator/src/coordinator.py.
After each agent task dispatch (success or failure), the coordinator emits a TaskCompletionEvent capturing:
- Task duration from start to finish
- Agent model, provider, and harness (resolved from the `assigned_agent` field)
- Task outcome (`success`, `failure`, `partial`, `timeout`)
- Quality gate results (build, lint, test, etc.)
- Retry count for the issue
- Complexity level from issue metadata
The coordinator uses the `build_task_event()` helper function, which provides sensible defaults for the coordinator context (Claude Code harness, Anthropic provider, TypeScript language).
Event Lifecycle
```
1. Application code calls trackTaskCompletion() or client.track()
         |
         v
2. Event is added to in-memory queue (max 1,000 events)
         |
         v
3. Background timer fires every 5 minutes (submitIntervalMs)
         |
         v
4. Queue is drained in batches of up to 100 events (batchSize)
         |
         v
5. Each batch is POSTed to the telemetry API
         |
         v
6. API validates, stores, and acknowledges each event
```
If the telemetry API is unreachable, events remain in the queue and are retried on the next interval (up to 3 retries per submission). Telemetry errors are logged but never propagated to calling code.
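The queue, batch, and retry behavior can be sketched as follows (simplified and synchronous; the real SDKs run this from a background timer with async HTTP):

```python
class EventQueue:
    """Simplified sketch of an in-memory queue with batched, retried flushes."""

    def __init__(self, send, max_queue_size=1000, batch_size=100, max_retries=3):
        self.send = send              # callable that POSTs one batch; raises on failure
        self.max_queue_size = max_queue_size
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.queue = []

    def track(self, event: dict) -> None:
        # Never block or throw: on overflow the event is silently dropped.
        if len(self.queue) < self.max_queue_size:
            self.queue.append(event)

    def flush(self) -> None:
        """Drain the queue in batches; failed batches stay queued for next time."""
        while self.queue:
            batch = self.queue[: self.batch_size]
            if not self._send_with_retries(batch):
                return  # leave events queued; retry on the next interval
            del self.queue[: self.batch_size]

    def _send_with_retries(self, batch) -> bool:
        for _ in range(self.max_retries):
            try:
                self.send(batch)
                return True
            except Exception:
                continue
        return False
```

Events are only removed from the queue after a successful send, which is what makes the "unreachable API" case safe: the batch simply waits for the next flush.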
4. Prediction System
How Predictions Work
The Mosaic Telemetry API aggregates historical task completion data across all contributing instances. From this data, it generates statistical predictions for new tasks based on their characteristics (task type, model, provider, complexity).
Predictions include percentile distributions (p10, p25, median, p75, p90) for token usage and cost, plus quality metrics (gate pass rate, success rate).
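Given the raw samples for a task bucket, such a percentile summary can be computed directly; a sketch using Python's standard library (the API's actual aggregation pipeline may differ):

```python
from statistics import quantiles

def summarize(samples: list) -> dict:
    """Compute the p10/p25/median/p75/p90 summary reported in predictions."""
    # n=20 yields cut points at 5%, 10%, ..., 95%; pick out the ones we need.
    cuts = quantiles(samples, n=20, method="inclusive")
    return {
        "p10": cuts[1],
        "p25": cuts[4],
        "median": cuts[9],
        "p75": cuts[14],
        "p90": cuts[17],
    }
```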
Querying Predictions via API
The API exposes a prediction endpoint at:
```
GET /api/telemetry/estimate?taskType=<taskType>&model=<model>&provider=<provider>&complexity=<complexity>
```
Authentication: Requires a valid session (Bearer token via AuthGuard).
Query Parameters (all required):
| Parameter | Type | Example | Description |
|---|---|---|---|
| `taskType` | TaskType | `implementation` | Task type to estimate |
| `model` | string | `claude-sonnet-4-5` | Model name |
| `provider` | Provider | `anthropic` | Provider name |
| `complexity` | Complexity | `medium` | Complexity level |
Example Request:
```bash
curl -X GET \
  'http://localhost:3001/api/telemetry/estimate?taskType=implementation&model=claude-sonnet-4-5&provider=anthropic&complexity=medium' \
  -H 'Authorization: Bearer YOUR_SESSION_TOKEN'
```
Response:
```json
{
  "data": {
    "prediction": {
      "input_tokens": {
        "p10": 500,
        "p25": 1200,
        "median": 2500,
        "p75": 5000,
        "p90": 10000
      },
      "output_tokens": {
        "p10": 200,
        "p25": 800,
        "median": 1500,
        "p75": 3000,
        "p90": 6000
      },
      "cost_usd_micros": {
        "median": 30000
      },
      "duration_ms": {
        "median": 5000
      },
      "correction_factors": {
        "input": 1.0,
        "output": 1.0
      },
      "quality": {
        "gate_pass_rate": 0.85,
        "success_rate": 0.92
      }
    },
    "metadata": {
      "sample_size": 150,
      "fallback_level": 0,
      "confidence": "high",
      "last_updated": "2026-02-15T10:00:00Z",
      "cache_hit": true
    }
  }
}
```
If no prediction data is available, the response returns { "data": null }.
Confidence Levels
The prediction system reports a confidence level based on sample size and data freshness:
| Confidence | Meaning |
|---|---|
| `high` | Substantial sample size, recent data, all dimensions matched |
| `medium` | Moderate sample, some dimension fallback |
| `low` | Small sample or significant fallback from requested dimensions |
| `none` | No data available for this combination |
Fallback Behavior
When exact matches are unavailable, the prediction system falls back through progressively broader aggregations:
1. Exact match -- `task_type` + `model` + `provider` + `complexity`
2. Drop complexity -- `task_type` + `model` + `provider`
3. Drop model -- `task_type` + `provider`
4. Global -- `task_type` only
The fallback_level field in metadata indicates which level was used (0 = exact match).
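The cascade can be sketched as a lookup over a pre-aggregated stats store (the key layout here is an assumption for illustration, not the real schema):

```python
def get_prediction(stats: dict, task_type: str, model: str,
                   provider: str, complexity: str):
    """Try progressively broader keys; return (prediction, fallback_level)."""
    levels = [
        (task_type, model, provider, complexity),  # level 0: exact match
        (task_type, model, provider),              # level 1: drop complexity
        (task_type, provider),                     # level 2: drop model
        (task_type,),                              # level 3: global
    ]
    for level, key in enumerate(levels):
        if key in stats:
            return stats[key], level
    return None, None  # no data at any level -> confidence "none"
```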
Cache Strategy
Predictions are cached in-memory by the SDK with a 6-hour TTL (predictionCacheTtlMs: 21_600_000). The PredictionService pre-fetches common combinations on startup to warm the cache:
- Models: claude-sonnet-4-5, claude-opus-4, claude-haiku-4-5, gpt-4o, gpt-4o-mini
- Task types: implementation, planning, code_review
- Complexities: low, medium
This produces 30 pre-cached queries (5 models x 3 task types x 2 complexities). Subsequent requests for these combinations are served from cache without any HTTP call.
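The warm-up set is simply the cross product of those three lists; a quick sketch:

```python
from itertools import product

MODELS = ["claude-sonnet-4-5", "claude-opus-4", "claude-haiku-4-5", "gpt-4o", "gpt-4o-mini"]
TASK_TYPES = ["implementation", "planning", "code_review"]
COMPLEXITIES = ["low", "medium"]

# One prediction query per (model, task type, complexity) combination.
WARM_QUERIES = [
    {"model": m, "taskType": t, "complexity": c}
    for m, t, c in product(MODELS, TASK_TYPES, COMPLEXITIES)
]
```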
5. SDK Reference
JavaScript: @mosaicstack/telemetry-client
Registry: Gitea npm registry at git.mosaicstack.dev
Version: 0.1.0
Installation:
```bash
pnpm add @mosaicstack/telemetry-client
```
Key Exports:
```typescript
// Client
import {
  TelemetryClient,
  EventBuilder,
  resolveConfig,
} from "@mosaicstack/telemetry-client";

// Types
import type {
  TelemetryConfig,
  TaskCompletionEvent,
  EventBuilderParams,
  PredictionQuery,
  PredictionResponse,
  PredictionData,
  PredictionMetadata,
  TokenDistribution,
} from "@mosaicstack/telemetry-client";

// Enums
import {
  TaskType,
  Complexity,
  Harness,
  Provider,
  QualityGate,
  Outcome,
  RepoSizeCategory,
} from "@mosaicstack/telemetry-client";
```
TelemetryClient API:
| Method | Description |
|---|---|
| `constructor(config: TelemetryConfig)` | Create a new client with the given configuration |
| `start(): void` | Start background batch submission (idempotent) |
| `stop(): Promise<void>` | Stop background submission and flush remaining events |
| `track(event: TaskCompletionEvent): void` | Queue an event for batch submission (never throws) |
| `getPrediction(query: PredictionQuery): PredictionResponse \| null` | Get a cached prediction (returns null if not cached or expired) |
| `refreshPredictions(queries: PredictionQuery[]): Promise<void>` | Force-refresh predictions from the server |
| `eventBuilder: EventBuilder` | Get the EventBuilder for constructing events |
| `queueSize: number` | Number of events currently queued |
| `isRunning: boolean` | Whether the client is currently running |
TelemetryConfig Options:
| Option | Type | Default | Description |
|---|---|---|---|
| `serverUrl` | string | (required) | Base URL of the telemetry server |
| `apiKey` | string | (required) | 64-char hex API key |
| `instanceId` | string | (required) | UUID for this instance |
| `enabled` | boolean | `true` | Enable/disable telemetry |
| `submitIntervalMs` | number | `300_000` (5 min) | Interval between batch submissions |
| `maxQueueSize` | number | `1000` | Maximum queued events |
| `batchSize` | number | `100` | Maximum events per batch |
| `requestTimeoutMs` | number | `10_000` (10 sec) | HTTP request timeout |
| `predictionCacheTtlMs` | number | `21_600_000` (6 hr) | Prediction cache TTL |
| `dryRun` | boolean | `false` | Log events instead of sending |
| `maxRetries` | number | `3` | Retries per submission |
| `onError` | `(error: Error) => void` | noop | Error callback |
EventBuilder Usage:
```typescript
const event = client.eventBuilder.build({
  task_duration_ms: 1500,
  task_type: TaskType.IMPLEMENTATION,
  complexity: Complexity.LOW,
  harness: Harness.API_DIRECT,
  model: "claude-sonnet-4-5",
  provider: Provider.ANTHROPIC,
  estimated_input_tokens: 0,
  estimated_output_tokens: 0,
  actual_input_tokens: 200,
  actual_output_tokens: 500,
  estimated_cost_usd_micros: 0,
  actual_cost_usd_micros: 8100,
  quality_gate_passed: true,
  quality_gates_run: [QualityGate.LINT, QualityGate.TEST],
  quality_gates_failed: [],
  context_compactions: 0,
  context_rotations: 0,
  context_utilization_final: 0.3,
  outcome: Outcome.SUCCESS,
  retry_count: 0,
  language: "typescript",
});

client.track(event);
```
Python: mosaicstack-telemetry
Registry: Gitea PyPI registry at git.mosaicstack.dev
Version: 0.1.0
Installation:
```bash
pip install mosaicstack-telemetry
```
Key Imports:
```python
from mosaicstack_telemetry import (
    TelemetryClient,
    TelemetryConfig,
    EventBuilder,
    TaskType,
    Complexity,
    Harness,
    Provider,
    QualityGate,
    Outcome,
)
```
Python Client Usage:
```python
# Create config (reads MOSAIC_TELEMETRY_* env vars automatically)
config = TelemetryConfig()
errors = config.validate()

# Create and start client
client = TelemetryClient(config)
await client.start_async()

# Build and track an event
builder = EventBuilder(instance_id=config.instance_id)
event = (
    builder
    .task_type(TaskType.IMPLEMENTATION)
    .complexity_level(Complexity.MEDIUM)
    .harness_type(Harness.CLAUDE_CODE)
    .model("claude-sonnet-4-5")
    .provider(Provider.ANTHROPIC)
    .duration_ms(5000)
    .outcome_value(Outcome.SUCCESS)
    .tokens(
        estimated_in=0,
        estimated_out=0,
        actual_in=3000,
        actual_out=1500,
    )
    .cost(estimated=0, actual=52500)
    .quality(
        passed=True,
        gates_run=[QualityGate.BUILD, QualityGate.LINT, QualityGate.TEST],
        gates_failed=[],
    )
    .context(compactions=0, rotations=0, utilization=0.4)
    .retry_count(0)
    .language("typescript")
    .build()
)
client.track(event)

# Shutdown (flushes remaining events)
await client.stop_async()
```
6. Development Guide
Testing Locally with Dry-Run Mode
The fastest way to develop with telemetry is to use dry-run mode. This logs event payloads to the console without needing a running telemetry API:
```bash
# In your .env
MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_DRY_RUN=true
MOSAIC_TELEMETRY_SERVER_URL=http://localhost:8000
MOSAIC_TELEMETRY_API_KEY=0000000000000000000000000000000000000000000000000000000000000000
MOSAIC_TELEMETRY_INSTANCE_ID=00000000-0000-0000-0000-000000000000
```
Start the API server and trigger LLM operations. You will see telemetry event payloads logged in the console output.
Adding New Tracking Points
To add telemetry tracking to a new service in the NestJS API:
Step 1: Inject MosaicTelemetryService into your service. Because MosaicTelemetryModule is global, no module import is needed:
```typescript
import { Injectable } from "@nestjs/common";
import { MosaicTelemetryService } from "../mosaic-telemetry/mosaic-telemetry.service";
import { TaskType, Complexity, Harness, Provider, Outcome } from "@mosaicstack/telemetry-client";

@Injectable()
export class MyService {
  constructor(private readonly telemetry: MosaicTelemetryService) {}
}
```
Step 2: Build and track events after task completion:
```typescript
async performTask(): Promise<void> {
  const start = Date.now();

  // ... perform the task ...

  const duration = Date.now() - start;
  const builder = this.telemetry.eventBuilder;
  if (builder) {
    const event = builder.build({
      task_duration_ms: duration,
      task_type: TaskType.IMPLEMENTATION,
      complexity: Complexity.MEDIUM,
      harness: Harness.API_DIRECT,
      model: "claude-sonnet-4-5",
      provider: Provider.ANTHROPIC,
      estimated_input_tokens: 0,
      estimated_output_tokens: 0,
      // inputTokens, outputTokens, and costMicros come from your task's result
      actual_input_tokens: inputTokens,
      actual_output_tokens: outputTokens,
      estimated_cost_usd_micros: 0,
      actual_cost_usd_micros: costMicros,
      quality_gate_passed: true,
      quality_gates_run: [],
      quality_gates_failed: [],
      context_compactions: 0,
      context_rotations: 0,
      context_utilization_final: 0,
      outcome: Outcome.SUCCESS,
      retry_count: 0,
    });
    this.telemetry.trackTaskCompletion(event);
  }
}
```
Step 3: For LLM-specific tracking, use LlmTelemetryTrackerService instead, which handles cost calculation and task type inference automatically:
```typescript
import { Injectable } from "@nestjs/common";
import { LlmTelemetryTrackerService } from "../llm/llm-telemetry-tracker.service";

@Injectable()
export class MyLlmService {
  constructor(private readonly telemetryTracker: LlmTelemetryTrackerService) {}

  async chat(): Promise<void> {
    const start = Date.now();

    // ... call LLM ...

    this.telemetryTracker.trackLlmCompletion({
      model: "claude-sonnet-4-5",
      providerType: "claude",
      operation: "chat",
      durationMs: Date.now() - start,
      inputTokens: 150,
      outputTokens: 300,
      callingContext: "brain", // Used for task type inference
      success: true,
    });
  }
}
```
Adding Tracking in the Coordinator (Python)
Use the build_task_event() helper from src/mosaic_telemetry.py:
```python
from mosaicstack_telemetry import Complexity, Harness, Outcome, Provider, TaskType
from src.mosaic_telemetry import build_task_event, get_telemetry_client

client = get_telemetry_client(app)
if client is not None:
    event = build_task_event(
        instance_id=instance_id,  # this instance's configured UUID
        task_type=TaskType.IMPLEMENTATION,
        complexity=Complexity.MEDIUM,
        outcome=Outcome.SUCCESS,
        duration_ms=5000,
        model="claude-sonnet-4-5",
        provider=Provider.ANTHROPIC,
        harness=Harness.CLAUDE_CODE,
        actual_input_tokens=3000,
        actual_output_tokens=1500,
        actual_cost_micros=52500,
    )
    client.track(event)
```
Troubleshooting
Telemetry events not appearing:
- Check that `MOSAIC_TELEMETRY_ENABLED=true` is set
- Verify all three required variables are set: `SERVER_URL`, `API_KEY`, `INSTANCE_ID`
- Look for warning logs: "Mosaic Telemetry is enabled but missing configuration" indicates a missing variable
- Try dry-run mode to confirm events are being generated
Console shows "Mosaic Telemetry is disabled":
This is the expected message when MOSAIC_TELEMETRY_ENABLED=false. If you intended telemetry to be active, set it to true.
Events queuing but not submitting:
- Check that the telemetry API server at `MOSAIC_TELEMETRY_SERVER_URL` is reachable
- Verify the API key is a valid 64-character hex string
- The default submission interval is 5 minutes; wait at least one interval or call `stop()` to force a flush
Prediction endpoint returns null:
- Predictions require sufficient historical data in the telemetry API
- Check the `metadata.confidence` field; `"none"` means no data exists for this combination
- Predictions are cached for 6 hours; new data takes time to appear
- The `PredictionService` logs startup refresh status; check its logs for errors
"Telemetry client error" in logs:
- These are non-fatal. The SDK never blocks application logic.
- Common causes: network timeout, invalid API key, server-side validation failure
- Check the telemetry API logs for corresponding errors