Mosaic Telemetry Integration Guide
1. Overview
What is Mosaic Telemetry?
Mosaic Telemetry is a task completion tracking system purpose-built for AI operations within Mosaic Stack. It captures detailed metrics about every AI task execution -- token usage, cost, duration, outcome, and quality gate results -- and submits them to a central telemetry API for aggregation and analysis.
The aggregated data powers a prediction system that provides pre-task estimates for cost, token usage, and expected quality, enabling informed decisions before dispatching work to AI agents.
How It Differs from OpenTelemetry
Mosaic Stack uses two separate telemetry systems that serve different purposes:
| Aspect | OpenTelemetry (OTEL) | Mosaic Telemetry |
|---|---|---|
| Purpose | Distributed request tracing and observability | AI task completion metrics and predictions |
| What it tracks | HTTP requests, spans, latency, errors | Token counts, costs, outcomes, quality gates |
| Data destination | OTEL Collector (Jaeger, Grafana, etc.) | Mosaic Telemetry API (PostgreSQL-backed) |
| Module location (API) | `apps/api/src/telemetry/` | `apps/api/src/mosaic-telemetry/` |
| Module location (Coordinator) | `apps/coordinator/src/telemetry.py` | `apps/coordinator/src/mosaic_telemetry.py` |
Both systems can run simultaneously. They are completely independent.
Architecture
```
+------------------+        +------------------+
|   Mosaic API     |        |   Coordinator    |
|    (NestJS)      |        |    (FastAPI)     |
+--------+---------+        +--------+---------+
         |                           |
    Track events                Track events
         |                           |
         v                           v
+------------------------------------------+
|          Telemetry Client SDK            |
|  (JS: @mosaicstack/telemetry-client)     |
|  (Py: mosaicstack-telemetry)             |
|                                          |
|  - Event queue (in-memory)               |
|  - Batch submission (5-min intervals)    |
|  - Prediction cache (6hr TTL)            |
+-------------------+----------------------+
                    |
           HTTP POST /events
           HTTP POST /predictions
                    |
                    v
+------------------------------------------+
|          Mosaic Telemetry API            |
|          (Separate service)              |
|                                          |
|  - Event ingestion & validation          |
|  - Aggregation & statistics              |
|  - Prediction generation                 |
+-------------------+----------------------+
                    |
                    v
            +---------------+
            |  PostgreSQL   |
            +---------------+
```
Data flow:
1. Application code calls `trackTaskCompletion()` (JS) or `client.track()` (Python)
2. Events are queued in memory (up to 1,000 events)
3. A background timer flushes the queue every 5 minutes in batches of up to 100
4. The telemetry API ingests events, validates them, and stores them in PostgreSQL
5. Prediction queries are served from aggregated data with a 6-hour cache TTL
2. Configuration Guide
Environment Variables
All configuration is done through environment variables prefixed with MOSAIC_TELEMETRY_:
| Variable | Type | Default | Description |
|---|---|---|---|
| `MOSAIC_TELEMETRY_ENABLED` | boolean | `true` | Master switch. Set to `false` to completely disable telemetry (no HTTP calls). |
| `MOSAIC_TELEMETRY_SERVER_URL` | string | (none) | URL of the telemetry API server. For Docker Compose: `http://telemetry-api:8000`. For production: `https://tel-api.mosaicstack.dev`. |
| `MOSAIC_TELEMETRY_API_KEY` | string | (none) | API key for authenticating with the telemetry server. Generate with `openssl rand -hex 32` (64-char hex string). |
| `MOSAIC_TELEMETRY_INSTANCE_ID` | string | (none) | Unique UUID identifying this Mosaic Stack instance. Generate with `uuidgen` or `python -c "import uuid; print(uuid.uuid4())"`. |
| `MOSAIC_TELEMETRY_DRY_RUN` | boolean | `false` | When `true`, events are logged to the console instead of being sent via HTTP. Useful for development. |
Enabling Telemetry
To enable telemetry, set `MOSAIC_TELEMETRY_ENABLED=true` along with the three required variables in your `.env` file:

```bash
MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_SERVER_URL=http://telemetry-api:8000
MOSAIC_TELEMETRY_API_KEY=<your-64-char-hex-api-key>
MOSAIC_TELEMETRY_INSTANCE_ID=<your-uuid>
```
If MOSAIC_TELEMETRY_ENABLED is true but any of SERVER_URL, API_KEY, or INSTANCE_ID is missing, the service logs a warning and disables telemetry gracefully. This is intentional: telemetry configuration issues never prevent the application from starting.
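The warn-and-disable behavior can be sketched as follows (the function and logger names here are illustrative, not the actual SDK internals):

```python
import logging
import os

logger = logging.getLogger("mosaic.telemetry")

REQUIRED_VARS = [
    "MOSAIC_TELEMETRY_SERVER_URL",
    "MOSAIC_TELEMETRY_API_KEY",
    "MOSAIC_TELEMETRY_INSTANCE_ID",
]

def resolve_telemetry_config():
    """Return a config dict, or None to disable telemetry gracefully."""
    if os.getenv("MOSAIC_TELEMETRY_ENABLED", "true").lower() != "true":
        return None  # explicitly disabled: all tracking becomes a no-op
    missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
    if missing:
        # Never crash the application over telemetry: warn and disable instead.
        logger.warning(
            "Mosaic Telemetry is enabled but missing configuration: %s", missing
        )
        return None
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

The key property is that every failure path returns `None` rather than raising, so a misconfigured instance boots normally with telemetry off.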
Disabling Telemetry
Set MOSAIC_TELEMETRY_ENABLED=false in your .env. No HTTP calls will be made, and all tracking methods become safe no-ops.
Dry-Run Mode
For local development and debugging, enable dry-run mode:
```bash
MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_DRY_RUN=true
MOSAIC_TELEMETRY_SERVER_URL=http://localhost:8000  # Not actually called
MOSAIC_TELEMETRY_API_KEY=0000000000000000000000000000000000000000000000000000000000000000
MOSAIC_TELEMETRY_INSTANCE_ID=00000000-0000-0000-0000-000000000000
```
In dry-run mode, the SDK logs event payloads to the console instead of submitting them via HTTP. This lets you verify that tracking points are firing correctly without needing a running telemetry API.
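Conceptually, dry-run mode is just a branch at the submission step; a minimal sketch (class and method names are hypothetical, not the real SDK API):

```python
import json
import logging

logger = logging.getLogger("mosaic.telemetry")

class BatchSubmitter:
    """Hypothetical sketch of the dry-run branch in an SDK submit path."""

    def __init__(self, dry_run: bool = False):
        self.dry_run = dry_run
        self.sent = []  # stands in for the real HTTP transport

    def submit_batch(self, events: list) -> None:
        if self.dry_run:
            # Log payloads to the console instead of POSTing them anywhere.
            for event in events:
                logger.info("[dry-run] %s", json.dumps(event))
            return
        self.sent.extend(events)  # real mode: hand the batch to the HTTP client
```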
Docker Compose Configuration
Both docker-compose.yml (root) and docker/docker-compose.yml pass telemetry environment variables to the API service:
```yaml
services:
  mosaic-api:
    environment:
      # Telemetry (task completion tracking & predictions)
      MOSAIC_TELEMETRY_ENABLED: ${MOSAIC_TELEMETRY_ENABLED:-false}
      MOSAIC_TELEMETRY_SERVER_URL: ${MOSAIC_TELEMETRY_SERVER_URL:-http://telemetry-api:8000}
      MOSAIC_TELEMETRY_API_KEY: ${MOSAIC_TELEMETRY_API_KEY:-}
      MOSAIC_TELEMETRY_INSTANCE_ID: ${MOSAIC_TELEMETRY_INSTANCE_ID:-}
      MOSAIC_TELEMETRY_DRY_RUN: ${MOSAIC_TELEMETRY_DRY_RUN:-false}
```
Note that telemetry defaults to false in Docker Compose. Set MOSAIC_TELEMETRY_ENABLED=true in your .env to activate it.
An optional local telemetry API service is available (commented out in docker/docker-compose.yml). Uncomment it to run a self-contained development environment:
```yaml
# Uncomment in docker/docker-compose.yml
telemetry-api:
  image: git.mosaicstack.dev/mosaic/telemetry-api:latest
  container_name: mosaic-telemetry-api
  restart: unless-stopped
  environment:
    HOST: 0.0.0.0
    PORT: 8000
  ports:
    - "8001:8000"
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 10s
  networks:
    - mosaic-network
```
3. What Gets Tracked
TaskCompletionEvent Schema
Every tracked event conforms to the TaskCompletionEvent interface. This is the core data structure submitted to the telemetry API:
| Field | Type | Description |
|---|---|---|
| `instance_id` | string | UUID of the Mosaic Stack instance that generated the event |
| `event_id` | string | Unique UUID for this event (auto-generated by the SDK) |
| `schema_version` | string | Schema version for forward compatibility (auto-set by the SDK) |
| `timestamp` | string | ISO 8601 timestamp of event creation (auto-set by the SDK) |
| `task_duration_ms` | number | How long the task took, in milliseconds |
| `task_type` | TaskType | Type of task performed (see enum below) |
| `complexity` | Complexity | Complexity level of the task |
| `harness` | Harness | The coding harness or tool used |
| `model` | string | AI model name (e.g., "claude-sonnet-4-5") |
| `provider` | Provider | AI model provider |
| `estimated_input_tokens` | number | Pre-task estimated input tokens (from predictions) |
| `estimated_output_tokens` | number | Pre-task estimated output tokens (from predictions) |
| `actual_input_tokens` | number | Actual input tokens consumed |
| `actual_output_tokens` | number | Actual output tokens generated |
| `estimated_cost_usd_micros` | number | Pre-task estimated cost in microdollars (USD * 1,000,000) |
| `actual_cost_usd_micros` | number | Actual cost in microdollars |
| `quality_gate_passed` | boolean | Whether all quality gates passed |
| `quality_gates_run` | QualityGate[] | List of quality gates that were executed |
| `quality_gates_failed` | QualityGate[] | List of quality gates that failed |
| `context_compactions` | number | Number of context window compactions during the task |
| `context_rotations` | number | Number of context window rotations during the task |
| `context_utilization_final` | number | Final context window utilization (0.0 to 1.0) |
| `outcome` | Outcome | Task outcome |
| `retry_count` | number | Number of retries before completion |
| `language` | string? | Primary programming language (optional) |
| `repo_size_category` | RepoSizeCategory? | Repository size category (optional) |
Enum Values
TaskType:
planning, implementation, code_review, testing, debugging, refactoring, documentation, configuration, security_audit, unknown
Complexity:
low, medium, high, critical
Harness:
claude_code, opencode, kilo_code, aider, api_direct, ollama_local, custom, unknown
Provider:
anthropic, openai, openrouter, ollama, google, mistral, custom, unknown
QualityGate:
build, lint, test, coverage, typecheck, security
Outcome:
success, failure, partial, timeout
RepoSizeCategory:
tiny, small, medium, large, huge
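The `*_usd_micros` fields store money as integer microdollars to sidestep floating-point rounding; converting is just a scale by 1,000,000. A small sketch (helper names are illustrative):

```python
def usd_to_micros(usd: float) -> int:
    """Convert a dollar amount to integer microdollars (USD * 1,000,000)."""
    return round(usd * 1_000_000)

def micros_to_usd(micros: int) -> float:
    """Convert microdollars back to dollars for display."""
    return micros / 1_000_000
```

For example, a $0.03 task is stored as `30000` microdollars.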
API Service: LLM Call Tracking
The NestJS API tracks every LLM service call (chat, streaming chat, and embeddings) via LlmTelemetryTrackerService at apps/api/src/llm/llm-telemetry-tracker.service.ts.
Tracked operations:
- `chat` -- Synchronous chat completions
- `chatStream` -- Streaming chat completions
- `embed` -- Embedding generation
For each call, the tracker captures:
- Model name and provider type
- Input and output token counts
- Duration in milliseconds
- Success or failure outcome
- Calculated cost from the built-in cost table (`apps/api/src/llm/llm-cost-table.ts`)
- Task type inferred from calling context (e.g., `"brain"` maps to `planning`, `"review"` maps to `code_review`)
The cost table uses longest-prefix matching on model names and covers all major Anthropic and OpenAI models. Ollama/local models are treated as zero-cost.
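Longest-prefix matching picks the most specific table entry whose key is a prefix of the model name. A sketch with made-up rates (these are not the real values in `llm-cost-table.ts`):

```python
# Illustrative per-million-token rates in microdollars; NOT the real cost table.
COST_TABLE = {
    "claude-sonnet-4": {"input": 3_000_000, "output": 15_000_000},
    "claude-sonnet-4-5": {"input": 3_000_000, "output": 15_000_000},
    "gpt-4o": {"input": 2_500_000, "output": 10_000_000},
}

def lookup_rates(model: str):
    """Return rates for the longest table key that prefixes the model name."""
    matches = [key for key in COST_TABLE if model.startswith(key)]
    if not matches:
        return None  # unknown models (e.g., Ollama/local) are treated as zero-cost
    return COST_TABLE[max(matches, key=len)]

def cost_micros(model: str, input_tokens: int, output_tokens: int) -> int:
    """Total cost in microdollars for one call, zero for unknown models."""
    rates = lookup_rates(model)
    if rates is None:
        return 0
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) // 1_000_000
```

A dated model name like `claude-sonnet-4-5-20250929` matches both `claude-sonnet-4` and `claude-sonnet-4-5`; taking the longest match resolves it to the more specific entry.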
Coordinator: Agent Task Dispatch Tracking
The FastAPI coordinator tracks agent task completions in apps/coordinator/src/mosaic_telemetry.py and apps/coordinator/src/coordinator.py.
After each agent task dispatch (success or failure), the coordinator emits a TaskCompletionEvent capturing:
- Task duration from start to finish
- Agent model, provider, and harness (resolved from the `assigned_agent` field)
- Task outcome (`success`, `failure`, `partial`, `timeout`)
- Quality gate results (build, lint, test, etc.)
- Retry count for the issue
- Complexity level from issue metadata
The coordinator uses the `build_task_event()` helper function, which provides sensible defaults for the coordinator context (Claude Code harness, Anthropic provider, TypeScript language).
Event Lifecycle
```
1. Application code calls trackTaskCompletion() or client.track()
         |
         v
2. Event is added to in-memory queue (max 1,000 events)
         |
         v
3. Background timer fires every 5 minutes (submitIntervalMs)
         |
         v
4. Queue is drained in batches of up to 100 events (batchSize)
         |
         v
5. Each batch is POSTed to the telemetry API
         |
         v
6. API validates, stores, and acknowledges each event
```
If the telemetry API is unreachable, events remain in the queue and are retried on the next interval (up to 3 retries per submission). Telemetry errors are logged but never propagated to calling code.
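The queue, batch, and retry behavior can be sketched as follows (simplified and synchronous; the real SDKs run this from a background timer with async HTTP):

```python
class EventQueue:
    """Simplified sketch of an in-memory queue with batched, retried flushes."""

    def __init__(self, send, max_queue_size=1000, batch_size=100, max_retries=3):
        self.send = send              # callable that POSTs one batch; raises on failure
        self.max_queue_size = max_queue_size
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.queue = []

    def track(self, event: dict) -> None:
        # Never block or throw: on overflow the event is silently dropped.
        if len(self.queue) < self.max_queue_size:
            self.queue.append(event)

    def flush(self) -> None:
        """Drain the queue in batches; failed batches stay queued for next time."""
        while self.queue:
            batch = self.queue[: self.batch_size]
            if not self._send_with_retries(batch):
                return  # leave events queued; retry on the next interval
            del self.queue[: self.batch_size]

    def _send_with_retries(self, batch) -> bool:
        for _ in range(self.max_retries):
            try:
                self.send(batch)
                return True
            except Exception:
                continue
        return False
```

Events are only removed from the queue after a successful send, which is what makes the "unreachable API" case safe: the batch simply waits for the next flush.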
4. Prediction System
How Predictions Work
The Mosaic Telemetry API aggregates historical task completion data across all contributing instances. From this data, it generates statistical predictions for new tasks based on their characteristics (task type, model, provider, complexity).
Predictions include percentile distributions (p10, p25, median, p75, p90) for token usage and cost, plus quality metrics (gate pass rate, success rate).
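Given the raw samples for a task bucket, such a percentile summary can be computed directly; a sketch using Python's standard library (the API's actual aggregation pipeline may differ):

```python
from statistics import quantiles

def summarize(samples: list) -> dict:
    """Compute the p10/p25/median/p75/p90 summary reported in predictions."""
    # n=20 yields cut points at 5%, 10%, ..., 95%; pick out the ones we need.
    cuts = quantiles(samples, n=20, method="inclusive")
    return {
        "p10": cuts[1],
        "p25": cuts[4],
        "median": cuts[9],
        "p75": cuts[14],
        "p90": cuts[17],
    }
```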
Querying Predictions via API
The API exposes a prediction endpoint at:
```
GET /api/telemetry/estimate?taskType=<taskType>&model=<model>&provider=<provider>&complexity=<complexity>
```
Authentication: Requires a valid session (Bearer token via AuthGuard).
Query Parameters (all required):
| Parameter | Type | Example | Description |
|---|---|---|---|
| `taskType` | TaskType | `implementation` | Task type to estimate |
| `model` | string | `claude-sonnet-4-5` | Model name |
| `provider` | Provider | `anthropic` | Provider name |
| `complexity` | Complexity | `medium` | Complexity level |
Example Request:
```bash
curl -X GET \
  'http://localhost:3001/api/telemetry/estimate?taskType=implementation&model=claude-sonnet-4-5&provider=anthropic&complexity=medium' \
  -H 'Authorization: Bearer YOUR_SESSION_TOKEN'
```
Response:
```json
{
  "data": {
    "prediction": {
      "input_tokens": {
        "p10": 500,
        "p25": 1200,
        "median": 2500,
        "p75": 5000,
        "p90": 10000
      },
      "output_tokens": {
        "p10": 200,
        "p25": 800,
        "median": 1500,
        "p75": 3000,
        "p90": 6000
      },
      "cost_usd_micros": {
        "median": 30000
      },
      "duration_ms": {
        "median": 5000
      },
      "correction_factors": {
        "input": 1.0,
        "output": 1.0
      },
      "quality": {
        "gate_pass_rate": 0.85,
        "success_rate": 0.92
      }
    },
    "metadata": {
      "sample_size": 150,
      "fallback_level": 0,
      "confidence": "high",
      "last_updated": "2026-02-15T10:00:00Z",
      "cache_hit": true
    }
  }
}
```
If no prediction data is available, the response returns { "data": null }.
Confidence Levels
The prediction system reports a confidence level based on sample size and data freshness:
| Confidence | Meaning |
|---|---|
| `high` | Substantial sample size, recent data, all dimensions matched |
| `medium` | Moderate sample, some dimension fallback |
| `low` | Small sample or significant fallback from requested dimensions |
| `none` | No data available for this combination |
Fallback Behavior
When exact matches are unavailable, the prediction system falls back through progressively broader aggregations:
1. Exact match -- `task_type` + `model` + `provider` + `complexity`
2. Drop complexity -- `task_type` + `model` + `provider`
3. Drop model -- `task_type` + `provider`
4. Global -- `task_type` only
The fallback_level field in metadata indicates which level was used (0 = exact match).
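The cascade can be sketched as a lookup over a pre-aggregated stats store (the key layout here is an assumption for illustration, not the real schema):

```python
def get_prediction(stats: dict, task_type: str, model: str,
                   provider: str, complexity: str):
    """Try progressively broader keys; return (prediction, fallback_level)."""
    levels = [
        (task_type, model, provider, complexity),  # level 0: exact match
        (task_type, model, provider),              # level 1: drop complexity
        (task_type, provider),                     # level 2: drop model
        (task_type,),                              # level 3: global
    ]
    for level, key in enumerate(levels):
        if key in stats:
            return stats[key], level
    return None, None  # no data at any level -> confidence "none"
```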
Cache Strategy
Predictions are cached in-memory by the SDK with a 6-hour TTL (predictionCacheTtlMs: 21_600_000). The PredictionService pre-fetches common combinations on startup to warm the cache:
- Models: claude-sonnet-4-5, claude-opus-4, claude-haiku-4-5, gpt-4o, gpt-4o-mini
- Task types: implementation, planning, code_review
- Complexities: low, medium
This produces 30 pre-cached queries (5 models x 3 task types x 2 complexities). Subsequent requests for these combinations are served from cache without any HTTP call.
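The warm-up set is simply the cross product of those three lists; a quick sketch:

```python
from itertools import product

MODELS = ["claude-sonnet-4-5", "claude-opus-4", "claude-haiku-4-5", "gpt-4o", "gpt-4o-mini"]
TASK_TYPES = ["implementation", "planning", "code_review"]
COMPLEXITIES = ["low", "medium"]

# One prediction query per (model, task type, complexity) combination.
WARM_QUERIES = [
    {"model": m, "taskType": t, "complexity": c}
    for m, t, c in product(MODELS, TASK_TYPES, COMPLEXITIES)
]
```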
5. SDK Reference
JavaScript: @mosaicstack/telemetry-client
Registry: Gitea npm registry at git.mosaicstack.dev
Version: 0.1.0
Installation:
```bash
pnpm add @mosaicstack/telemetry-client
```
Key Exports:
```typescript
// Client
import {
  TelemetryClient,
  EventBuilder,
  resolveConfig,
} from "@mosaicstack/telemetry-client";

// Types
import type {
  TelemetryConfig,
  TaskCompletionEvent,
  EventBuilderParams,
  PredictionQuery,
  PredictionResponse,
  PredictionData,
  PredictionMetadata,
  TokenDistribution,
} from "@mosaicstack/telemetry-client";

// Enums
import {
  TaskType,
  Complexity,
  Harness,
  Provider,
  QualityGate,
  Outcome,
  RepoSizeCategory,
} from "@mosaicstack/telemetry-client";
```
TelemetryClient API:
| Method | Description |
|---|---|
| `constructor(config: TelemetryConfig)` | Create a new client with the given configuration |
| `start(): void` | Start background batch submission (idempotent) |
| `stop(): Promise<void>` | Stop background submission and flush remaining events |
| `track(event: TaskCompletionEvent): void` | Queue an event for batch submission (never throws) |
| `getPrediction(query: PredictionQuery): PredictionResponse \| null` | Get a cached prediction (returns null if not cached or expired) |
| `refreshPredictions(queries: PredictionQuery[]): Promise<void>` | Force-refresh predictions from the server |
| `eventBuilder: EventBuilder` | Get the EventBuilder for constructing events |
| `queueSize: number` | Number of events currently queued |
| `isRunning: boolean` | Whether the client is currently running |
TelemetryConfig Options:
| Option | Type | Default | Description |
|---|---|---|---|
| `serverUrl` | string | (required) | Base URL of the telemetry server |
| `apiKey` | string | (required) | 64-char hex API key |
| `instanceId` | string | (required) | UUID for this instance |
| `enabled` | boolean | `true` | Enable/disable telemetry |
| `submitIntervalMs` | number | `300_000` (5 min) | Interval between batch submissions |
| `maxQueueSize` | number | `1000` | Maximum queued events |
| `batchSize` | number | `100` | Maximum events per batch |
| `requestTimeoutMs` | number | `10_000` (10 sec) | HTTP request timeout |
| `predictionCacheTtlMs` | number | `21_600_000` (6 hr) | Prediction cache TTL |
| `dryRun` | boolean | `false` | Log events instead of sending |
| `maxRetries` | number | `3` | Retries per submission |
| `onError` | `(error: Error) => void` | noop | Error callback |
EventBuilder Usage:
```typescript
const event = client.eventBuilder.build({
  task_duration_ms: 1500,
  task_type: TaskType.IMPLEMENTATION,
  complexity: Complexity.LOW,
  harness: Harness.API_DIRECT,
  model: "claude-sonnet-4-5",
  provider: Provider.ANTHROPIC,
  estimated_input_tokens: 0,
  estimated_output_tokens: 0,
  actual_input_tokens: 200,
  actual_output_tokens: 500,
  estimated_cost_usd_micros: 0,
  actual_cost_usd_micros: 8100,
  quality_gate_passed: true,
  quality_gates_run: [QualityGate.LINT, QualityGate.TEST],
  quality_gates_failed: [],
  context_compactions: 0,
  context_rotations: 0,
  context_utilization_final: 0.3,
  outcome: Outcome.SUCCESS,
  retry_count: 0,
  language: "typescript",
});

client.track(event);
```
Python: mosaicstack-telemetry
Registry: Gitea PyPI registry at git.mosaicstack.dev
Version: 0.1.0
Installation:
```bash
pip install mosaicstack-telemetry
```
Key Imports:
```python
from mosaicstack_telemetry import (
    TelemetryClient,
    TelemetryConfig,
    EventBuilder,
    TaskType,
    Complexity,
    Harness,
    Provider,
    QualityGate,
    Outcome,
)
```
Python Client Usage:
```python
# Create config (reads MOSAIC_TELEMETRY_* env vars automatically)
config = TelemetryConfig()
errors = config.validate()

# Create and start client
client = TelemetryClient(config)
await client.start_async()

# Build and track an event
builder = EventBuilder(instance_id=config.instance_id)
event = (
    builder
    .task_type(TaskType.IMPLEMENTATION)
    .complexity_level(Complexity.MEDIUM)
    .harness_type(Harness.CLAUDE_CODE)
    .model("claude-sonnet-4-5")
    .provider(Provider.ANTHROPIC)
    .duration_ms(5000)
    .outcome_value(Outcome.SUCCESS)
    .tokens(
        estimated_in=0,
        estimated_out=0,
        actual_in=3000,
        actual_out=1500,
    )
    .cost(estimated=0, actual=52500)
    .quality(
        passed=True,
        gates_run=[QualityGate.BUILD, QualityGate.LINT, QualityGate.TEST],
        gates_failed=[],
    )
    .context(compactions=0, rotations=0, utilization=0.4)
    .retry_count(0)
    .language("typescript")
    .build()
)
client.track(event)

# Shutdown (flushes remaining events)
await client.stop_async()
```
6. Development Guide
Testing Locally with Dry-Run Mode
The fastest way to develop with telemetry is to use dry-run mode. This logs event payloads to the console without needing a running telemetry API:
```bash
# In your .env
MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_DRY_RUN=true
MOSAIC_TELEMETRY_SERVER_URL=http://localhost:8000
MOSAIC_TELEMETRY_API_KEY=0000000000000000000000000000000000000000000000000000000000000000
MOSAIC_TELEMETRY_INSTANCE_ID=00000000-0000-0000-0000-000000000000
```
Start the API server and trigger LLM operations. You will see telemetry event payloads logged in the console output.
Adding New Tracking Points
To add telemetry tracking to a new service in the NestJS API:
Step 1: Inject MosaicTelemetryService into your service. Because MosaicTelemetryModule is global, no module import is needed:
```typescript
import { Injectable } from "@nestjs/common";
import { MosaicTelemetryService } from "../mosaic-telemetry/mosaic-telemetry.service";
import { TaskType, Complexity, Harness, Provider, Outcome } from "@mosaicstack/telemetry-client";

@Injectable()
export class MyService {
  constructor(private readonly telemetry: MosaicTelemetryService) {}
}
```
Step 2: Build and track events after task completion:
```typescript
async performTask(): Promise<void> {
  const start = Date.now();

  // ... perform the task ...

  const duration = Date.now() - start;
  const builder = this.telemetry.eventBuilder;
  if (builder) {
    const event = builder.build({
      task_duration_ms: duration,
      task_type: TaskType.IMPLEMENTATION,
      complexity: Complexity.MEDIUM,
      harness: Harness.API_DIRECT,
      model: "claude-sonnet-4-5",
      provider: Provider.ANTHROPIC,
      estimated_input_tokens: 0,
      estimated_output_tokens: 0,
      // inputTokens, outputTokens, and costMicros come from your task's result
      actual_input_tokens: inputTokens,
      actual_output_tokens: outputTokens,
      estimated_cost_usd_micros: 0,
      actual_cost_usd_micros: costMicros,
      quality_gate_passed: true,
      quality_gates_run: [],
      quality_gates_failed: [],
      context_compactions: 0,
      context_rotations: 0,
      context_utilization_final: 0,
      outcome: Outcome.SUCCESS,
      retry_count: 0,
    });
    this.telemetry.trackTaskCompletion(event);
  }
}
```
Step 3: For LLM-specific tracking, use LlmTelemetryTrackerService instead, which handles cost calculation and task type inference automatically:
```typescript
import { Injectable } from "@nestjs/common";
import { LlmTelemetryTrackerService } from "../llm/llm-telemetry-tracker.service";

@Injectable()
export class MyLlmService {
  constructor(private readonly telemetryTracker: LlmTelemetryTrackerService) {}

  async chat(): Promise<void> {
    const start = Date.now();

    // ... call LLM ...

    this.telemetryTracker.trackLlmCompletion({
      model: "claude-sonnet-4-5",
      providerType: "claude",
      operation: "chat",
      durationMs: Date.now() - start,
      inputTokens: 150,
      outputTokens: 300,
      callingContext: "brain", // Used for task type inference
      success: true,
    });
  }
}
```
Adding Tracking in the Coordinator (Python)
Use the build_task_event() helper from src/mosaic_telemetry.py:
```python
from mosaicstack_telemetry import Complexity, Harness, Outcome, Provider, TaskType
from src.mosaic_telemetry import build_task_event, get_telemetry_client

client = get_telemetry_client(app)
if client is not None:
    event = build_task_event(
        instance_id=instance_id,  # this instance's configured UUID
        task_type=TaskType.IMPLEMENTATION,
        complexity=Complexity.MEDIUM,
        outcome=Outcome.SUCCESS,
        duration_ms=5000,
        model="claude-sonnet-4-5",
        provider=Provider.ANTHROPIC,
        harness=Harness.CLAUDE_CODE,
        actual_input_tokens=3000,
        actual_output_tokens=1500,
        actual_cost_micros=52500,
    )
    client.track(event)
```
Troubleshooting
Telemetry events not appearing:
- Check that `MOSAIC_TELEMETRY_ENABLED=true` is set
- Verify all three required variables are set: `SERVER_URL`, `API_KEY`, `INSTANCE_ID`
- Look for warning logs: "Mosaic Telemetry is enabled but missing configuration" indicates a missing variable
- Try dry-run mode to confirm events are being generated
Console shows "Mosaic Telemetry is disabled":
This is the expected message when MOSAIC_TELEMETRY_ENABLED=false. If you intended telemetry to be active, set it to true.
Events queuing but not submitting:
- Check that the telemetry API server at `MOSAIC_TELEMETRY_SERVER_URL` is reachable
- Verify the API key is a valid 64-character hex string
- The default submission interval is 5 minutes; wait at least one interval or call `stop()` to force a flush
Prediction endpoint returns null:
- Predictions require sufficient historical data in the telemetry API
- Check the `metadata.confidence` field; `"none"` means no data exists for this combination
- Predictions are cached for 6 hours; new data takes time to appear
- The `PredictionService` logs startup refresh status; check its logs for errors
"Telemetry client error" in logs:
- These are non-fatal. The SDK never blocks application logic.
- Common causes: network timeout, invalid API key, server-side validation failure
- Check the telemetry API logs for corresponding errors