Mosaic Telemetry Integration Guide

1. Overview

What is Mosaic Telemetry?

Mosaic Telemetry is a task completion tracking system purpose-built for AI operations within Mosaic Stack. It captures detailed metrics about every AI task execution -- token usage, cost, duration, outcome, and quality gate results -- and submits them to a central telemetry API for aggregation and analysis.

The aggregated data powers a prediction system that provides pre-task estimates for cost, token usage, and expected quality, enabling informed decisions before dispatching work to AI agents.

How It Differs from OpenTelemetry

Mosaic Stack uses two separate telemetry systems that serve different purposes:

| Aspect | OpenTelemetry (OTEL) | Mosaic Telemetry |
|---|---|---|
| Purpose | Distributed request tracing and observability | AI task completion metrics and predictions |
| What it tracks | HTTP requests, spans, latency, errors | Token counts, costs, outcomes, quality gates |
| Data destination | OTEL Collector (Jaeger, Grafana, etc.) | Mosaic Telemetry API (PostgreSQL-backed) |
| Module location (API) | apps/api/src/telemetry/ | apps/api/src/mosaic-telemetry/ |
| Module location (Coordinator) | apps/coordinator/src/telemetry.py | apps/coordinator/src/mosaic_telemetry.py |

Both systems can run simultaneously. They are completely independent.

Architecture

+------------------+     +------------------+
|   Mosaic API     |     |  Coordinator     |
|   (NestJS)       |     |  (FastAPI)       |
+--------+---------+     +--------+---------+
         |                        |
    Track events             Track events
         |                        |
         v                        v
+------------------------------------------+
|    Telemetry Client SDK                  |
|    (JS: @mosaicstack/telemetry-client)   |
|    (Py: mosaicstack-telemetry)           |
|                                          |
|  - Event queue (in-memory)               |
|  - Batch submission (5-min intervals)    |
|  - Prediction cache (6hr TTL)            |
+-------------------+----------------------+
                    |
              HTTP POST /events
              HTTP POST /predictions
                    |
                    v
+------------------------------------------+
|    Mosaic Telemetry API                  |
|    (Separate service)                    |
|                                          |
|  - Event ingestion & validation          |
|  - Aggregation & statistics              |
|  - Prediction generation                 |
+-------------------+----------------------+
                    |
                    v
            +---------------+
            |  PostgreSQL   |
            +---------------+

Data flow:

  1. Application code calls trackTaskCompletion() (JS) or client.track() (Python)
  2. Events are queued in memory (up to 1,000 events)
  3. A background timer flushes the queue every 5 minutes in batches of up to 100
  4. The telemetry API ingests events, validates them, and stores them in PostgreSQL
  5. Prediction queries are served from aggregated data with a 6-hour cache TTL
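Steps 2 through 4 above can be sketched as a minimal in-memory batcher. This is an illustrative sketch, not the SDK's actual code; names like `EventQueue` and `flush` are made up here:

```python
from collections import deque

MAX_QUEUE = 1_000   # step 2: bounded in-memory queue
BATCH_SIZE = 100    # step 4: flushed in batches of up to 100

class EventQueue:
    """Minimal sketch of the SDK's queue-and-flush behavior."""

    def __init__(self):
        self._queue = deque()

    def track(self, event: dict) -> None:
        # Drop new events when full so telemetry memory never grows unbounded.
        if len(self._queue) < MAX_QUEUE:
            self._queue.append(event)

    def flush(self) -> list[list[dict]]:
        """Drain the queue into batches of up to BATCH_SIZE events."""
        batches = []
        while self._queue:
            batch = [self._queue.popleft()
                     for _ in range(min(BATCH_SIZE, len(self._queue)))]
            batches.append(batch)
        return batches

q = EventQueue()
for i in range(250):
    q.track({"event_id": i})
batches = q.flush()  # 250 events drain as batches of 100, 100, 50
```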

2. Configuration Guide

Environment Variables

All configuration is done through environment variables prefixed with MOSAIC_TELEMETRY_:

| Variable | Type | Default | Description |
|---|---|---|---|
| MOSAIC_TELEMETRY_ENABLED | boolean | true | Master switch. Set to false to completely disable telemetry (no HTTP calls). |
| MOSAIC_TELEMETRY_SERVER_URL | string | (none) | URL of the telemetry API server. For Docker Compose: http://telemetry-api:8000. For production: https://tel-api.mosaicstack.dev. |
| MOSAIC_TELEMETRY_API_KEY | string | (none) | API key for authenticating with the telemetry server. Generate with: openssl rand -hex 32 (64-char hex string). |
| MOSAIC_TELEMETRY_INSTANCE_ID | string | (none) | Unique UUID identifying this Mosaic Stack instance. Generate with: uuidgen or python -c "import uuid; print(uuid.uuid4())". |
| MOSAIC_TELEMETRY_DRY_RUN | boolean | false | When true, events are logged to console instead of being sent via HTTP. Useful for development. |

Enabling Telemetry

To enable telemetry, set MOSAIC_TELEMETRY_ENABLED=true together with the three required variables in your .env file:

MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_SERVER_URL=http://telemetry-api:8000
MOSAIC_TELEMETRY_API_KEY=<your-64-char-hex-api-key>
MOSAIC_TELEMETRY_INSTANCE_ID=<your-uuid>

If MOSAIC_TELEMETRY_ENABLED is true but any of SERVER_URL, API_KEY, or INSTANCE_ID is missing, the service logs a warning and disables telemetry gracefully. This is intentional: telemetry configuration issues never prevent the application from starting.
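The fail-open check described above amounts to something like the following sketch (a hypothetical helper; the real services implement this internally):

```python
import logging

logger = logging.getLogger("mosaic-telemetry")

def resolve_enabled(env: dict[str, str]) -> bool:
    """Return True only if telemetry is switched on AND fully configured."""
    if env.get("MOSAIC_TELEMETRY_ENABLED", "true").lower() != "true":
        return False
    required = (
        "MOSAIC_TELEMETRY_SERVER_URL",
        "MOSAIC_TELEMETRY_API_KEY",
        "MOSAIC_TELEMETRY_INSTANCE_ID",
    )
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Warn and fall back to disabled -- never block application startup.
        logger.warning(
            "Mosaic Telemetry is enabled but missing configuration: %s", missing
        )
        return False
    return True
```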

Disabling Telemetry

Set MOSAIC_TELEMETRY_ENABLED=false in your .env. No HTTP calls will be made, and all tracking methods become safe no-ops.

Dry-Run Mode

For local development and debugging, enable dry-run mode:

MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_DRY_RUN=true
MOSAIC_TELEMETRY_SERVER_URL=http://localhost:8000  # Not actually called
MOSAIC_TELEMETRY_API_KEY=0000000000000000000000000000000000000000000000000000000000000000
MOSAIC_TELEMETRY_INSTANCE_ID=00000000-0000-0000-0000-000000000000

In dry-run mode, the SDK logs event payloads to the console instead of submitting them via HTTP. This lets you verify that tracking points are firing correctly without needing a running telemetry API.
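Conceptually, the dry-run branch looks like this sketch (illustrative only; `submit` and `post` are stand-ins, not SDK names):

```python
import json
import logging

logger = logging.getLogger("mosaic-telemetry")

def submit(events, dry_run, post):
    """In dry-run mode, log event payloads instead of POSTing them."""
    if dry_run:
        for event in events:
            logger.info("[dry-run] telemetry event: %s", json.dumps(event))
        return 0                 # nothing sent over the wire
    post(events)                 # stand-in for HTTP POST /events
    return len(events)
```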

Docker Compose Configuration

Both docker-compose.yml (root) and docker/docker-compose.yml pass telemetry environment variables to the API service:

services:
  mosaic-api:
    environment:
      # Telemetry (task completion tracking & predictions)
      MOSAIC_TELEMETRY_ENABLED: ${MOSAIC_TELEMETRY_ENABLED:-false}
      MOSAIC_TELEMETRY_SERVER_URL: ${MOSAIC_TELEMETRY_SERVER_URL:-http://telemetry-api:8000}
      MOSAIC_TELEMETRY_API_KEY: ${MOSAIC_TELEMETRY_API_KEY:-}
      MOSAIC_TELEMETRY_INSTANCE_ID: ${MOSAIC_TELEMETRY_INSTANCE_ID:-}
      MOSAIC_TELEMETRY_DRY_RUN: ${MOSAIC_TELEMETRY_DRY_RUN:-false}

Note that telemetry defaults to false in Docker Compose. Set MOSAIC_TELEMETRY_ENABLED=true in your .env to activate it.

An optional local telemetry API service is available (commented out in docker/docker-compose.yml). Uncomment it to run a self-contained development environment:

# Uncomment in docker/docker-compose.yml
telemetry-api:
  image: git.mosaicstack.dev/mosaic/telemetry-api:latest
  container_name: mosaic-telemetry-api
  restart: unless-stopped
  environment:
    HOST: 0.0.0.0
    PORT: 8000
  ports:
    - "8001:8000"
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 10s
  networks:
    - mosaic-network

3. What Gets Tracked

TaskCompletionEvent Schema

Every tracked event conforms to the TaskCompletionEvent interface. This is the core data structure submitted to the telemetry API:

| Field | Type | Description |
|---|---|---|
| instance_id | string | UUID of the Mosaic Stack instance that generated the event |
| event_id | string | Unique UUID for this event (auto-generated by the SDK) |
| schema_version | string | Schema version for forward compatibility (auto-set by the SDK) |
| timestamp | string | ISO 8601 timestamp of event creation (auto-set by the SDK) |
| task_duration_ms | number | How long the task took in milliseconds |
| task_type | TaskType | Type of task performed (see enum below) |
| complexity | Complexity | Complexity level of the task |
| harness | Harness | The coding harness or tool used |
| model | string | AI model name (e.g., "claude-sonnet-4-5") |
| provider | Provider | AI model provider |
| estimated_input_tokens | number | Pre-task estimated input tokens (from predictions) |
| estimated_output_tokens | number | Pre-task estimated output tokens (from predictions) |
| actual_input_tokens | number | Actual input tokens consumed |
| actual_output_tokens | number | Actual output tokens generated |
| estimated_cost_usd_micros | number | Pre-task estimated cost in microdollars (USD * 1,000,000) |
| actual_cost_usd_micros | number | Actual cost in microdollars |
| quality_gate_passed | boolean | Whether all quality gates passed |
| quality_gates_run | QualityGate[] | List of quality gates that were executed |
| quality_gates_failed | QualityGate[] | List of quality gates that failed |
| context_compactions | number | Number of context window compactions during the task |
| context_rotations | number | Number of context window rotations during the task |
| context_utilization_final | number | Final context window utilization (0.0 to 1.0) |
| outcome | Outcome | Task outcome |
| retry_count | number | Number of retries before completion |
| language | string? | Primary programming language (optional) |
| repo_size_category | RepoSizeCategory? | Repository size category (optional) |
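Cost fields are integers in microdollars, presumably so aggregation avoids floating-point rounding; converting for display is a one-liner (the helper name here is illustrative, not an SDK export):

```python
def micros_to_usd(cost_usd_micros: int) -> float:
    """Convert microdollars (USD * 1,000,000) back to dollars for display."""
    return cost_usd_micros / 1_000_000

# e.g. an actual_cost_usd_micros of 52_500 is $0.0525
```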

Enum Values

TaskType: planning, implementation, code_review, testing, debugging, refactoring, documentation, configuration, security_audit, unknown

Complexity: low, medium, high, critical

Harness: claude_code, opencode, kilo_code, aider, api_direct, ollama_local, custom, unknown

Provider: anthropic, openai, openrouter, ollama, google, mistral, custom, unknown

QualityGate: build, lint, test, coverage, typecheck, security

Outcome: success, failure, partial, timeout

RepoSizeCategory: tiny, small, medium, large, huge

API Service: LLM Call Tracking

The NestJS API tracks every LLM service call (chat, streaming chat, and embeddings) via LlmTelemetryTrackerService at apps/api/src/llm/llm-telemetry-tracker.service.ts.

Tracked operations:

  • chat -- Synchronous chat completions
  • chatStream -- Streaming chat completions
  • embed -- Embedding generation

For each call, the tracker captures:

  • Model name and provider type
  • Input and output token counts
  • Duration in milliseconds
  • Success or failure outcome
  • Calculated cost from the built-in cost table (apps/api/src/llm/llm-cost-table.ts)
  • Task type inferred from calling context (e.g., "brain" maps to planning, "review" maps to code_review)
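The calling-context inference mentioned above amounts to a small lookup table. Only the "brain" and "review" mappings are stated in this guide; the fallback to unknown is an assumption in this sketch:

```python
# Known mappings from the guide; anything else falls back to "unknown" (assumed).
CONTEXT_TO_TASK_TYPE = {
    "brain": "planning",
    "review": "code_review",
}

def infer_task_type(calling_context: str) -> str:
    """Map a calling context string to a telemetry task type."""
    return CONTEXT_TO_TASK_TYPE.get(calling_context, "unknown")
```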

The cost table uses longest-prefix matching on model names and covers all major Anthropic and OpenAI models. Ollama/local models are treated as zero-cost.
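Longest-prefix matching means a dated model name like "claude-sonnet-4-5-20250929" resolves to the most specific table entry it starts with. A sketch of the idea; the entries and rates below are invented for illustration, not the real cost table:

```python
# Hypothetical (input, output) rates in microdollars per 1M tokens,
# keyed by model-name prefix. NOT the real llm-cost-table values.
COST_TABLE = {
    "claude": (3_000_000, 15_000_000),
    "claude-haiku-4-5": (1_000_000, 5_000_000),
    "gpt-4o": (2_500_000, 10_000_000),
    "llama": (0, 0),  # local models are treated as zero-cost
}

def lookup_rates(model: str):
    """Pick the table entry with the longest prefix matching `model`."""
    matches = [prefix for prefix in COST_TABLE if model.startswith(prefix)]
    if not matches:
        return None
    return COST_TABLE[max(matches, key=len)]
```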

Coordinator: Agent Task Dispatch Tracking

The FastAPI coordinator tracks agent task completions in apps/coordinator/src/mosaic_telemetry.py and apps/coordinator/src/coordinator.py.

After each agent task dispatch (success or failure), the coordinator emits a TaskCompletionEvent capturing:

  • Task duration from start to finish
  • Agent model, provider, and harness (resolved from the assigned_agent field)
  • Task outcome (success, failure, partial, timeout)
  • Quality gate results (build, lint, test, etc.)
  • Retry count for the issue
  • Complexity level from issue metadata

The coordinator uses the build_task_event() helper, which provides sensible defaults for the coordinator context (Claude Code harness, Anthropic provider, TypeScript language).

Event Lifecycle

1. Application code calls trackTaskCompletion() or client.track()
          |
          v
2. Event is added to in-memory queue (max 1,000 events)
          |
          v
3. Background timer fires every 5 minutes (submitIntervalMs)
          |
          v
4. Queue is drained in batches of up to 100 events (batchSize)
          |
          v
5. Each batch is POSTed to the telemetry API
          |
          v
6. API validates, stores, and acknowledges each event

If the telemetry API is unreachable, events remain in the queue and are retried on the next interval (up to 3 retries per submission). Telemetry errors are logged but never propagated to calling code.
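The retry-and-swallow behavior can be sketched as follows; `post_batch` stands in for the real HTTP call, and the function shape is an assumption, not the SDK's API:

```python
def submit_batch(post_batch, batch, max_retries=3):
    """Try to POST one batch up to max_retries times.

    Returns True on success. On persistent failure the caller keeps the
    batch queued for the next interval. Errors are swallowed so telemetry
    can never break application code.
    """
    for _attempt in range(max_retries):
        try:
            post_batch(batch)   # e.g. HTTP POST /events
            return True
        except Exception:
            continue            # the real SDK logs and retries
    return False

# A transport that fails twice, then succeeds:
calls = {"n": 0}
def flaky(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("telemetry API unreachable")

assert submit_batch(flaky, [{"event_id": 1}])  # succeeds on the 3rd attempt
```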


4. Prediction System

How Predictions Work

The Mosaic Telemetry API aggregates historical task completion data across all contributing instances. From this data, it generates statistical predictions for new tasks based on their characteristics (task type, model, provider, complexity).

Predictions include percentile distributions (p10, p25, median, p75, p90) for token usage and cost, plus quality metrics (gate pass rate, success rate).
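One way such percentiles could be computed from historical samples, using the standard library; this is a sketch of the statistics involved, not the API's actual aggregation code:

```python
import statistics

def token_distribution(samples: list[int]) -> dict[str, float]:
    """Summarize samples into the percentile fields used in predictions."""
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%.
    q = statistics.quantiles(samples, n=20)
    return {
        "p10": q[1],
        "p25": q[4],
        "median": q[9],
        "p75": q[14],
        "p90": q[17],
    }
```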

Querying Predictions via API

The API exposes a prediction endpoint at:

GET /api/telemetry/estimate?taskType=<taskType>&model=<model>&provider=<provider>&complexity=<complexity>

Authentication: Requires a valid session (Bearer token via AuthGuard).

Query Parameters (all required):

| Parameter | Type | Example | Description |
|---|---|---|---|
| taskType | TaskType | implementation | Task type to estimate |
| model | string | claude-sonnet-4-5 | Model name |
| provider | Provider | anthropic | Provider name |
| complexity | Complexity | medium | Complexity level |

Example Request:

curl -X GET \
  'http://localhost:3001/api/telemetry/estimate?taskType=implementation&model=claude-sonnet-4-5&provider=anthropic&complexity=medium' \
  -H 'Authorization: Bearer YOUR_SESSION_TOKEN'

Response:

{
  "data": {
    "prediction": {
      "input_tokens": {
        "p10": 500,
        "p25": 1200,
        "median": 2500,
        "p75": 5000,
        "p90": 10000
      },
      "output_tokens": {
        "p10": 200,
        "p25": 800,
        "median": 1500,
        "p75": 3000,
        "p90": 6000
      },
      "cost_usd_micros": {
        "median": 30000
      },
      "duration_ms": {
        "median": 5000
      },
      "correction_factors": {
        "input": 1.0,
        "output": 1.0
      },
      "quality": {
        "gate_pass_rate": 0.85,
        "success_rate": 0.92
      }
    },
    "metadata": {
      "sample_size": 150,
      "fallback_level": 0,
      "confidence": "high",
      "last_updated": "2026-02-15T10:00:00Z",
      "cache_hit": true
    }
  }
}

If no prediction data is available, the response returns { "data": null }.

Confidence Levels

The prediction system reports a confidence level based on sample size and data freshness:

| Confidence | Meaning |
|---|---|
| high | Substantial sample size, recent data, all dimensions matched |
| medium | Moderate sample, some dimension fallback |
| low | Small sample or significant fallback from requested dimensions |
| none | No data available for this combination |

Fallback Behavior

When exact matches are unavailable, the prediction system falls back through progressively broader aggregations:

  1. Exact match -- task_type + model + provider + complexity
  2. Drop complexity -- task_type + model + provider
  3. Drop model -- task_type + provider
  4. Global -- task_type only

The fallback_level field in metadata indicates which level was used (0 = exact match).
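The cascade above can be sketched against a hypothetical in-memory aggregate store (the real service queries PostgreSQL; `predict` and the key shapes are illustrative):

```python
def predict(aggregates: dict, task_type, model, provider, complexity):
    """Try progressively broader keys; return (prediction, fallback_level)."""
    keys = [
        (task_type, model, provider, complexity),  # level 0: exact match
        (task_type, model, provider),              # level 1: drop complexity
        (task_type, provider),                     # level 2: drop model
        (task_type,),                              # level 3: global
    ]
    for level, key in enumerate(keys):
        if key in aggregates:
            return aggregates[key], level
    return None, None

# Only a task_type + provider aggregate exists, so we land at level 2:
aggs = {("implementation", "anthropic"): {"median": 2500}}
pred, level = predict(aggs, "implementation", "claude-sonnet-4-5",
                      "anthropic", "medium")
```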

Cache Strategy

Predictions are cached in-memory by the SDK with a 6-hour TTL (predictionCacheTtlMs: 21_600_000). The PredictionService pre-fetches common combinations on startup to warm the cache:

  • Models: claude-sonnet-4-5, claude-opus-4, claude-haiku-4-5, gpt-4o, gpt-4o-mini
  • Task types: implementation, planning, code_review
  • Complexities: low, medium

This produces 30 pre-cached queries (5 models x 3 task types x 2 complexities). Subsequent requests for these combinations are served from cache without any HTTP call.
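The 30 warm-cache queries are just the Cartesian product of those three lists. A sketch; the query field names follow the estimate endpoint's parameters, and the model-to-provider mapping here is an assumption:

```python
from itertools import product

MODELS = ["claude-sonnet-4-5", "claude-opus-4", "claude-haiku-4-5",
          "gpt-4o", "gpt-4o-mini"]
TASK_TYPES = ["implementation", "planning", "code_review"]
COMPLEXITIES = ["low", "medium"]

warm_queries = [
    {
        "taskType": task_type,
        "model": model,
        # Assumed mapping: claude-* -> anthropic, gpt-* -> openai.
        "provider": "anthropic" if model.startswith("claude") else "openai",
        "complexity": complexity,
    }
    for model, task_type, complexity in product(MODELS, TASK_TYPES, COMPLEXITIES)
]
assert len(warm_queries) == 30  # 5 models x 3 task types x 2 complexities
```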


5. SDK Reference

JavaScript: @mosaicstack/telemetry-client

Registry: Gitea npm registry at git.mosaicstack.dev
Version: 0.1.0

Installation:

pnpm add @mosaicstack/telemetry-client

Key Exports:

// Client
import { TelemetryClient, EventBuilder, resolveConfig } from "@mosaicstack/telemetry-client";

// Types
import type {
  TelemetryConfig,
  TaskCompletionEvent,
  EventBuilderParams,
  PredictionQuery,
  PredictionResponse,
  PredictionData,
  PredictionMetadata,
  TokenDistribution,
} from "@mosaicstack/telemetry-client";

// Enums
import {
  TaskType,
  Complexity,
  Harness,
  Provider,
  QualityGate,
  Outcome,
  RepoSizeCategory,
} from "@mosaicstack/telemetry-client";

TelemetryClient API:

| Method | Description |
|---|---|
| constructor(config: TelemetryConfig) | Create a new client with the given configuration |
| start(): void | Start background batch submission (idempotent) |
| stop(): Promise<void> | Stop background submission, flush remaining events |
| track(event: TaskCompletionEvent): void | Queue an event for batch submission (never throws) |
| getPrediction(query: PredictionQuery): PredictionResponse \| null | Get a cached prediction (returns null if not cached/expired) |
| refreshPredictions(queries: PredictionQuery[]): Promise<void> | Force-refresh predictions from the server |
| eventBuilder: EventBuilder | Get the EventBuilder for constructing events |
| queueSize: number | Number of events currently queued |
| isRunning: boolean | Whether the client is currently running |

TelemetryConfig Options:

| Option | Type | Default | Description |
|---|---|---|---|
| serverUrl | string | (required) | Base URL of the telemetry server |
| apiKey | string | (required) | 64-char hex API key |
| instanceId | string | (required) | UUID for this instance |
| enabled | boolean | true | Enable/disable telemetry |
| submitIntervalMs | number | 300_000 (5 min) | Interval between batch submissions |
| maxQueueSize | number | 1000 | Maximum queued events |
| batchSize | number | 100 | Maximum events per batch |
| requestTimeoutMs | number | 10_000 (10 sec) | HTTP request timeout |
| predictionCacheTtlMs | number | 21_600_000 (6 hr) | Prediction cache TTL |
| dryRun | boolean | false | Log events instead of sending |
| maxRetries | number | 3 | Retries per submission |
| onError | (error: Error) => void | noop | Error callback |

EventBuilder Usage:

const event = client.eventBuilder.build({
  task_duration_ms: 1500,
  task_type: TaskType.IMPLEMENTATION,
  complexity: Complexity.LOW,
  harness: Harness.API_DIRECT,
  model: "claude-sonnet-4-5",
  provider: Provider.ANTHROPIC,
  estimated_input_tokens: 0,
  estimated_output_tokens: 0,
  actual_input_tokens: 200,
  actual_output_tokens: 500,
  estimated_cost_usd_micros: 0,
  actual_cost_usd_micros: 8100,
  quality_gate_passed: true,
  quality_gates_run: [QualityGate.LINT, QualityGate.TEST],
  quality_gates_failed: [],
  context_compactions: 0,
  context_rotations: 0,
  context_utilization_final: 0.3,
  outcome: Outcome.SUCCESS,
  retry_count: 0,
  language: "typescript",
});

client.track(event);

Python: mosaicstack-telemetry

Registry: Gitea PyPI registry at git.mosaicstack.dev
Version: 0.1.0

Installation:

pip install mosaicstack-telemetry

Key Imports:

from mosaicstack_telemetry import (
    TelemetryClient,
    TelemetryConfig,
    EventBuilder,
    TaskType,
    Complexity,
    Harness,
    Provider,
    QualityGate,
    Outcome,
)

Python Client Usage:

# Create config (reads MOSAIC_TELEMETRY_* env vars automatically)
config = TelemetryConfig()
errors = config.validate()

# Create and start client
client = TelemetryClient(config)
await client.start_async()

# Build and track an event
builder = EventBuilder(instance_id=config.instance_id)
event = (
    builder
    .task_type(TaskType.IMPLEMENTATION)
    .complexity_level(Complexity.MEDIUM)
    .harness_type(Harness.CLAUDE_CODE)
    .model("claude-sonnet-4-5")
    .provider(Provider.ANTHROPIC)
    .duration_ms(5000)
    .outcome_value(Outcome.SUCCESS)
    .tokens(
        estimated_in=0,
        estimated_out=0,
        actual_in=3000,
        actual_out=1500,
    )
    .cost(estimated=0, actual=52500)
    .quality(
        passed=True,
        gates_run=[QualityGate.BUILD, QualityGate.LINT, QualityGate.TEST],
        gates_failed=[],
    )
    .context(compactions=0, rotations=0, utilization=0.4)
    .retry_count(0)
    .language("typescript")
    .build()
)

client.track(event)

# Shutdown (flushes remaining events)
await client.stop_async()

6. Development Guide

Testing Locally with Dry-Run Mode

The fastest way to develop with telemetry is to use dry-run mode. This logs event payloads to the console without needing a running telemetry API:

# In your .env
MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_DRY_RUN=true
MOSAIC_TELEMETRY_SERVER_URL=http://localhost:8000
MOSAIC_TELEMETRY_API_KEY=0000000000000000000000000000000000000000000000000000000000000000
MOSAIC_TELEMETRY_INSTANCE_ID=00000000-0000-0000-0000-000000000000

Start the API server and trigger LLM operations. You will see telemetry event payloads logged in the console output.

Adding New Tracking Points

To add telemetry tracking to a new service in the NestJS API:

Step 1: Inject MosaicTelemetryService into your service. Because MosaicTelemetryModule is global, no module import is needed:

import { Injectable } from "@nestjs/common";
import { MosaicTelemetryService } from "../mosaic-telemetry/mosaic-telemetry.service";
import { TaskType, Complexity, Harness, Provider, Outcome } from "@mosaicstack/telemetry-client";

@Injectable()
export class MyService {
  constructor(private readonly telemetry: MosaicTelemetryService) {}
}

Step 2: Build and track events after task completion:

async performTask(): Promise<void> {
  const start = Date.now();

  // ... perform the task, capturing inputTokens, outputTokens, costMicros ...

  const duration = Date.now() - start;
  const builder = this.telemetry.eventBuilder;

  if (builder) {
    const event = builder.build({
      task_duration_ms: duration,
      task_type: TaskType.IMPLEMENTATION,
      complexity: Complexity.MEDIUM,
      harness: Harness.API_DIRECT,
      model: "claude-sonnet-4-5",
      provider: Provider.ANTHROPIC,
      estimated_input_tokens: 0,
      estimated_output_tokens: 0,
      actual_input_tokens: inputTokens,
      actual_output_tokens: outputTokens,
      estimated_cost_usd_micros: 0,
      actual_cost_usd_micros: costMicros,
      quality_gate_passed: true,
      quality_gates_run: [],
      quality_gates_failed: [],
      context_compactions: 0,
      context_rotations: 0,
      context_utilization_final: 0,
      outcome: Outcome.SUCCESS,
      retry_count: 0,
    });

    this.telemetry.trackTaskCompletion(event);
  }
}

Step 3: For LLM-specific tracking, use LlmTelemetryTrackerService instead, which handles cost calculation and task type inference automatically:

import { LlmTelemetryTrackerService } from "../llm/llm-telemetry-tracker.service";

@Injectable()
export class MyLlmService {
  constructor(private readonly telemetryTracker: LlmTelemetryTrackerService) {}

  async chat(): Promise<void> {
    const start = Date.now();

    // ... call LLM ...

    this.telemetryTracker.trackLlmCompletion({
      model: "claude-sonnet-4-5",
      providerType: "claude",
      operation: "chat",
      durationMs: Date.now() - start,
      inputTokens: 150,
      outputTokens: 300,
      callingContext: "brain", // Used for task type inference
      success: true,
    });
  }
}

Adding Tracking in the Coordinator (Python)

Use the build_task_event() helper from src/mosaic_telemetry.py:

from mosaicstack_telemetry import Complexity, Harness, Outcome, Provider, TaskType
from src.mosaic_telemetry import build_task_event, get_telemetry_client

client = get_telemetry_client(app)
if client is not None:
    event = build_task_event(
        instance_id=instance_id,
        task_type=TaskType.IMPLEMENTATION,
        complexity=Complexity.MEDIUM,
        outcome=Outcome.SUCCESS,
        duration_ms=5000,
        model="claude-sonnet-4-5",
        provider=Provider.ANTHROPIC,
        harness=Harness.CLAUDE_CODE,
        actual_input_tokens=3000,
        actual_output_tokens=1500,
        actual_cost_micros=52500,
    )
    client.track(event)

Troubleshooting

Telemetry events not appearing:

  1. Check that MOSAIC_TELEMETRY_ENABLED=true is set
  2. Verify all three required variables are set: SERVER_URL, API_KEY, INSTANCE_ID
  3. Look for warning logs: "Mosaic Telemetry is enabled but missing configuration" indicates a missing variable
  4. Try dry-run mode to confirm events are being generated

Console shows "Mosaic Telemetry is disabled":

This is the expected message when MOSAIC_TELEMETRY_ENABLED=false. If you intended telemetry to be active, set it to true.

Events queuing but not submitting:

  • Check that the telemetry API server at MOSAIC_TELEMETRY_SERVER_URL is reachable
  • Verify the API key is a valid 64-character hex string
  • The default submission interval is 5 minutes; wait at least one interval or call stop() to force a flush

Prediction endpoint returns null:

  • Predictions require sufficient historical data in the telemetry API
  • Check the metadata.confidence field; "none" means no data exists for this combination
  • Predictions are cached for 6 hours; new data takes time to appear
  • The PredictionService logs startup refresh status; check logs for errors

"Telemetry client error" in logs:

  • These are non-fatal. The SDK never blocks application logic.
  • Common causes: network timeout, invalid API key, server-side validation failure
  • Check the telemetry API logs for corresponding errors