merge: resolve conflicts with develop (telemetry + lockfile)

Keep both Mosaic Telemetry section (from develop) and Matrix Dev
Environment section (from feature branch) in .env.example.
Regenerate pnpm-lock.yaml with both dependency trees merged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 12:12:43 -06:00
42 changed files with 6276 additions and 15 deletions

docs/telemetry.md Normal file

@@ -0,0 +1,735 @@
# Mosaic Telemetry Integration Guide
## 1. Overview
### What is Mosaic Telemetry?
Mosaic Telemetry is a task completion tracking system purpose-built for AI operations within Mosaic Stack. It captures detailed metrics about every AI task execution -- token usage, cost, duration, outcome, and quality gate results -- and submits them to a central telemetry API for aggregation and analysis.
The aggregated data powers a **prediction system** that provides pre-task estimates for cost, token usage, and expected quality, enabling informed decisions before dispatching work to AI agents.
### How It Differs from OpenTelemetry
Mosaic Stack uses **two separate telemetry systems** that serve different purposes:
| Aspect | OpenTelemetry (OTEL) | Mosaic Telemetry |
| --------------------------------- | --------------------------------------------- | -------------------------------------------- |
| **Purpose** | Distributed request tracing and observability | AI task completion metrics and predictions |
| **What it tracks** | HTTP requests, spans, latency, errors | Token counts, costs, outcomes, quality gates |
| **Data destination** | OTEL Collector (Jaeger, Grafana, etc.) | Mosaic Telemetry API (PostgreSQL-backed) |
| **Module location (API)** | `apps/api/src/telemetry/` | `apps/api/src/mosaic-telemetry/` |
| **Module location (Coordinator)** | `apps/coordinator/src/telemetry.py` | `apps/coordinator/src/mosaic_telemetry.py` |
Both systems can run simultaneously. They are completely independent.
### Architecture
```
+------------------+ +------------------+
| Mosaic API | | Coordinator |
| (NestJS) | | (FastAPI) |
+--------+---------+ +--------+---------+
| |
Track events Track events
| |
v v
+------------------------------------------+
| Telemetry Client SDK |
| (JS: @mosaicstack/telemetry-client) |
| (Py: mosaicstack-telemetry) |
| |
| - Event queue (in-memory) |
| - Batch submission (5-min intervals) |
| - Prediction cache (6hr TTL) |
+-------------------+----------------------+
|
HTTP POST /events
HTTP POST /predictions
|
v
+------------------------------------------+
| Mosaic Telemetry API |
| (Separate service) |
| |
| - Event ingestion & validation |
| - Aggregation & statistics |
| - Prediction generation |
+-------------------+----------------------+
|
v
+---------------+
| PostgreSQL |
+---------------+
```
**Data flow:**
1. Application code calls `trackTaskCompletion()` (JS) or `client.track()` (Python)
2. Events are queued in memory (up to 1,000 events)
3. A background timer flushes the queue every 5 minutes in batches of up to 100
4. The telemetry API ingests events, validates them, and stores them in PostgreSQL
5. Prediction queries are served from aggregated data with a 6-hour cache TTL
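The queue-and-flush behavior above can be sketched roughly as follows. This is a simplified, synchronous illustration, not the SDK's actual code; the real client runs the flush on a 5-minute background timer and submits batches over HTTP:

```python
from collections import deque

MAX_QUEUE_SIZE = 1000  # matches maxQueueSize
BATCH_SIZE = 100       # matches batchSize

queue: deque = deque()

def track(event: dict) -> None:
    """Queue an event; drop it silently if the queue is full (never raises)."""
    if len(queue) < MAX_QUEUE_SIZE:
        queue.append(event)

def flush(submit) -> None:
    """Drain the queue in batches of up to BATCH_SIZE, calling submit() per batch."""
    while queue:
        batch = [queue.popleft() for _ in range(min(BATCH_SIZE, len(queue)))]
        submit(batch)

# In the real SDK this runs on a timer; here we call it directly.
for i in range(250):
    track({"event_id": i})

batches = []
flush(batches.append)
# 250 queued events drain as batches of 100, 100, 50
```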
---
## 2. Configuration Guide
### Environment Variables
All configuration is done through environment variables prefixed with `MOSAIC_TELEMETRY_`:
| Variable | Type | Default | Description |
| ------------------------------ | ------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| `MOSAIC_TELEMETRY_ENABLED` | boolean | `true` | Master switch. Set to `false` to completely disable telemetry (no HTTP calls). |
| `MOSAIC_TELEMETRY_SERVER_URL` | string | (none) | URL of the telemetry API server. For Docker Compose: `http://telemetry-api:8000`. For production: `https://tel-api.mosaicstack.dev`. |
| `MOSAIC_TELEMETRY_API_KEY` | string | (none) | API key for authenticating with the telemetry server. Generate with: `openssl rand -hex 32` (64-char hex string). |
| `MOSAIC_TELEMETRY_INSTANCE_ID` | string | (none) | Unique UUID identifying this Mosaic Stack instance. Generate with: `uuidgen` or `python -c "import uuid; print(uuid.uuid4())"`. |
| `MOSAIC_TELEMETRY_DRY_RUN` | boolean | `false` | When `true`, events are logged to console instead of being sent via HTTP. Useful for development. |
### Enabling Telemetry
To enable telemetry, set `MOSAIC_TELEMETRY_ENABLED=true` along with the three required variables in your `.env` file:

```bash
MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_SERVER_URL=http://telemetry-api:8000
MOSAIC_TELEMETRY_API_KEY=<your-64-char-hex-api-key>
MOSAIC_TELEMETRY_INSTANCE_ID=<your-uuid>
```
If `MOSAIC_TELEMETRY_ENABLED` is `true` but any of `SERVER_URL`, `API_KEY`, or `INSTANCE_ID` is missing, the service logs a warning and disables telemetry gracefully. This is intentional: telemetry configuration issues never prevent the application from starting.
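The graceful-disable logic amounts to something like the following. This is a hypothetical helper for illustration, not the SDK's actual validation code:

```python
def resolve_telemetry_enabled(env: dict) -> bool:
    """Telemetry runs only if explicitly enabled AND fully configured."""
    if env.get("MOSAIC_TELEMETRY_ENABLED", "true").lower() != "true":
        return False
    required = (
        "MOSAIC_TELEMETRY_SERVER_URL",
        "MOSAIC_TELEMETRY_API_KEY",
        "MOSAIC_TELEMETRY_INSTANCE_ID",
    )
    missing = [k for k in required if not env.get(k)]
    if missing:
        # Warn and disable rather than crash: telemetry configuration
        # issues never prevent the application from starting.
        print(f"Mosaic Telemetry is enabled but missing configuration: {missing}")
        return False
    return True
```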
### Disabling Telemetry
Set `MOSAIC_TELEMETRY_ENABLED=false` in your `.env`. No HTTP calls will be made, and all tracking methods become safe no-ops.
### Dry-Run Mode
For local development and debugging, enable dry-run mode:
```bash
MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_DRY_RUN=true
MOSAIC_TELEMETRY_SERVER_URL=http://localhost:8000 # Not actually called
MOSAIC_TELEMETRY_API_KEY=0000000000000000000000000000000000000000000000000000000000000000
MOSAIC_TELEMETRY_INSTANCE_ID=00000000-0000-0000-0000-000000000000
```
In dry-run mode, the SDK logs event payloads to the console instead of submitting them via HTTP. This lets you verify that tracking points are firing correctly without needing a running telemetry API.
### Docker Compose Configuration
Both `docker-compose.yml` (root) and `docker/docker-compose.yml` pass telemetry environment variables to the API service:
```yaml
services:
mosaic-api:
environment:
# Telemetry (task completion tracking & predictions)
MOSAIC_TELEMETRY_ENABLED: ${MOSAIC_TELEMETRY_ENABLED:-false}
MOSAIC_TELEMETRY_SERVER_URL: ${MOSAIC_TELEMETRY_SERVER_URL:-http://telemetry-api:8000}
MOSAIC_TELEMETRY_API_KEY: ${MOSAIC_TELEMETRY_API_KEY:-}
MOSAIC_TELEMETRY_INSTANCE_ID: ${MOSAIC_TELEMETRY_INSTANCE_ID:-}
MOSAIC_TELEMETRY_DRY_RUN: ${MOSAIC_TELEMETRY_DRY_RUN:-false}
```
Note that telemetry defaults to `false` in Docker Compose. Set `MOSAIC_TELEMETRY_ENABLED=true` in your `.env` to activate it.
An optional local telemetry API service is available (commented out in `docker/docker-compose.yml`). Uncomment it to run a self-contained development environment:
```yaml
# Uncomment in docker/docker-compose.yml
telemetry-api:
image: git.mosaicstack.dev/mosaic/telemetry-api:latest
container_name: mosaic-telemetry-api
restart: unless-stopped
environment:
HOST: 0.0.0.0
PORT: 8000
ports:
- "8001:8000"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 10s
networks:
- mosaic-network
```
---
## 3. What Gets Tracked
### TaskCompletionEvent Schema
Every tracked event conforms to the `TaskCompletionEvent` interface. This is the core data structure submitted to the telemetry API:
| Field | Type | Description |
| --------------------------- | ------------------- | -------------------------------------------------------------- |
| `instance_id` | `string` | UUID of the Mosaic Stack instance that generated the event |
| `event_id` | `string` | Unique UUID for this event (auto-generated by the SDK) |
| `schema_version` | `string` | Schema version for forward compatibility (auto-set by the SDK) |
| `timestamp` | `string` | ISO 8601 timestamp of event creation (auto-set by the SDK) |
| `task_duration_ms` | `number` | How long the task took in milliseconds |
| `task_type` | `TaskType` | Type of task performed (see enum below) |
| `complexity` | `Complexity` | Complexity level of the task |
| `harness` | `Harness` | The coding harness or tool used |
| `model` | `string` | AI model name (e.g., `"claude-sonnet-4-5"`) |
| `provider` | `Provider` | AI model provider |
| `estimated_input_tokens` | `number` | Pre-task estimated input tokens (from predictions) |
| `estimated_output_tokens` | `number` | Pre-task estimated output tokens (from predictions) |
| `actual_input_tokens` | `number` | Actual input tokens consumed |
| `actual_output_tokens` | `number` | Actual output tokens generated |
| `estimated_cost_usd_micros` | `number` | Pre-task estimated cost in microdollars (USD \* 1,000,000) |
| `actual_cost_usd_micros` | `number` | Actual cost in microdollars |
| `quality_gate_passed` | `boolean` | Whether all quality gates passed |
| `quality_gates_run` | `QualityGate[]` | List of quality gates that were executed |
| `quality_gates_failed` | `QualityGate[]` | List of quality gates that failed |
| `context_compactions` | `number` | Number of context window compactions during the task |
| `context_rotations` | `number` | Number of context window rotations during the task |
| `context_utilization_final` | `number` | Final context window utilization (0.0 to 1.0) |
| `outcome` | `Outcome` | Task outcome |
| `retry_count` | `number` | Number of retries before completion |
| `language` | `string?` | Primary programming language (optional) |
| `repo_size_category` | `RepoSizeCategory?` | Repository size category (optional) |
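Cost fields use integer microdollars (USD * 1,000,000) so that aggregation avoids floating-point drift. Converting is just a scale factor; these helpers are illustrative, not part of the SDK:

```python
MICROS_PER_USD = 1_000_000

def usd_to_micros(usd: float) -> int:
    """$0.0525 -> 52_500 microdollars, rounded to the nearest micro."""
    return round(usd * MICROS_PER_USD)

def micros_to_usd(micros: int) -> float:
    """52_500 microdollars -> $0.0525."""
    return micros / MICROS_PER_USD
```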
### Enum Values
**TaskType:**
`planning`, `implementation`, `code_review`, `testing`, `debugging`, `refactoring`, `documentation`, `configuration`, `security_audit`, `unknown`
**Complexity:**
`low`, `medium`, `high`, `critical`
**Harness:**
`claude_code`, `opencode`, `kilo_code`, `aider`, `api_direct`, `ollama_local`, `custom`, `unknown`
**Provider:**
`anthropic`, `openai`, `openrouter`, `ollama`, `google`, `mistral`, `custom`, `unknown`
**QualityGate:**
`build`, `lint`, `test`, `coverage`, `typecheck`, `security`
**Outcome:**
`success`, `failure`, `partial`, `timeout`
**RepoSizeCategory:**
`tiny`, `small`, `medium`, `large`, `huge`
### API Service: LLM Call Tracking
The NestJS API tracks every LLM service call (chat, streaming chat, and embeddings) via `LlmTelemetryTrackerService` at `apps/api/src/llm/llm-telemetry-tracker.service.ts`.
Tracked operations:
- **`chat`** -- Synchronous chat completions
- **`chatStream`** -- Streaming chat completions
- **`embed`** -- Embedding generation
For each call, the tracker captures:
- Model name and provider type
- Input and output token counts
- Duration in milliseconds
- Success or failure outcome
- Calculated cost from the built-in cost table (`apps/api/src/llm/llm-cost-table.ts`)
- Task type inferred from calling context (e.g., `"brain"` maps to `planning`, `"review"` maps to `code_review`)
The cost table uses longest-prefix matching on model names and covers all major Anthropic and OpenAI models. Ollama/local models are treated as zero-cost.
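Longest-prefix matching on model names can be sketched like this. The table entries below are made-up placeholders, not the real cost table in `llm-cost-table.ts`:

```python
# Hypothetical per-million-token input prices, keyed by model-name prefix.
COST_TABLE = {
    "claude-sonnet": 3.0,
    "claude-sonnet-4-5": 3.0,
    "claude-opus": 15.0,
    "gpt-4o": 2.5,
    "gpt-4o-mini": 0.15,
}

def lookup_price(model: str) -> float:
    """Pick the entry whose key is the longest prefix of the model name."""
    matches = [key for key in COST_TABLE if model.startswith(key)]
    if not matches:
        return 0.0  # Unknown (e.g. Ollama/local) models are treated as zero-cost.
    return COST_TABLE[max(matches, key=len)]
```

The longest-prefix rule lets a dated model name like `gpt-4o-mini-2024-07-18` resolve to the `gpt-4o-mini` entry rather than the broader `gpt-4o` one.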
### Coordinator: Agent Task Dispatch Tracking
The FastAPI coordinator tracks agent task completions in `apps/coordinator/src/mosaic_telemetry.py` and `apps/coordinator/src/coordinator.py`.
After each agent task dispatch (success or failure), the coordinator emits a `TaskCompletionEvent` capturing:
- Task duration from start to finish
- Agent model, provider, and harness (resolved from the `assigned_agent` field)
- Task outcome (`success`, `failure`, `partial`, `timeout`)
- Quality gate results (build, lint, test, etc.)
- Retry count for the issue
- Complexity level from issue metadata
The coordinator uses the `build_task_event()` helper function which provides sensible defaults for the coordinator context (Claude Code harness, Anthropic provider, TypeScript language).
### Event Lifecycle
```
1. Application code calls trackTaskCompletion() or client.track()
|
v
2. Event is added to in-memory queue (max 1,000 events)
|
v
3. Background timer fires every 5 minutes (submitIntervalMs)
|
v
4. Queue is drained in batches of up to 100 events (batchSize)
|
v
5. Each batch is POSTed to the telemetry API
|
v
6. API validates, stores, and acknowledges each event
```
If the telemetry API is unreachable, events remain in the queue and are retried on the next interval (up to 3 retries per submission). Telemetry errors are logged but never propagated to calling code.
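The retry-and-swallow behavior can be pictured as follows. This is a hypothetical sketch of the pattern, not the SDK's actual submission code:

```python
def submit_batch_with_retries(batch: list, post, max_retries: int = 3) -> bool:
    """Attempt a submission up to max_retries times; swallow all errors.

    Returns True on success; False tells the caller to keep the batch
    queued for the next interval. Errors are logged, never raised.
    """
    for attempt in range(max_retries):
        try:
            post(batch)
            return True
        except Exception as err:
            print(f"Telemetry client error (attempt {attempt + 1}): {err}")
    return False
```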
---
## 4. Prediction System
### How Predictions Work
The Mosaic Telemetry API aggregates historical task completion data across all contributing instances. From this data, it generates statistical predictions for new tasks based on their characteristics (task type, model, provider, complexity).
Predictions include percentile distributions (p10, p25, median, p75, p90) for token usage and cost, plus quality metrics (gate pass rate, success rate).
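A percentile summary of historical samples can be computed with standard quantile math; here is a rough sketch using Python's `statistics` module (the real aggregation runs server-side and its interpolation details may differ):

```python
import statistics

def percentile_summary(samples: list[int]) -> dict[str, float]:
    """Summarize token-count samples as the percentiles a prediction reports."""
    # quantiles(n=100) yields the 1st..99th percentile cut points.
    q = statistics.quantiles(samples, n=100)
    return {
        "p10": q[9],
        "p25": q[24],
        "median": statistics.median(samples),
        "p75": q[74],
        "p90": q[89],
    }
```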
### Querying Predictions via API
The API exposes a prediction endpoint at:
```
GET /api/telemetry/estimate?taskType=<taskType>&model=<model>&provider=<provider>&complexity=<complexity>
```
**Authentication:** Requires a valid session (Bearer token via `AuthGuard`).
**Query Parameters (all required):**
| Parameter | Type | Example | Description |
| ------------ | ------------ | ------------------- | --------------------- |
| `taskType` | `TaskType` | `implementation` | Task type to estimate |
| `model` | `string` | `claude-sonnet-4-5` | Model name |
| `provider` | `Provider` | `anthropic` | Provider name |
| `complexity` | `Complexity` | `medium` | Complexity level |
**Example Request:**
```bash
curl -X GET \
'http://localhost:3001/api/telemetry/estimate?taskType=implementation&model=claude-sonnet-4-5&provider=anthropic&complexity=medium' \
-H 'Authorization: Bearer YOUR_SESSION_TOKEN'
```
**Response:**
```json
{
"data": {
"prediction": {
"input_tokens": {
"p10": 500,
"p25": 1200,
"median": 2500,
"p75": 5000,
"p90": 10000
},
"output_tokens": {
"p10": 200,
"p25": 800,
"median": 1500,
"p75": 3000,
"p90": 6000
},
"cost_usd_micros": {
"median": 30000
},
"duration_ms": {
"median": 5000
},
"correction_factors": {
"input": 1.0,
"output": 1.0
},
"quality": {
"gate_pass_rate": 0.85,
"success_rate": 0.92
}
},
"metadata": {
"sample_size": 150,
"fallback_level": 0,
"confidence": "high",
"last_updated": "2026-02-15T10:00:00Z",
"cache_hit": true
}
}
}
```
If no prediction data is available, the response returns `{ "data": null }`.
### Confidence Levels
The prediction system reports a confidence level based on sample size and data freshness:
| Confidence | Meaning |
| ---------- | -------------------------------------------------------------- |
| `high` | Substantial sample size, recent data, all dimensions matched |
| `medium` | Moderate sample, some dimension fallback |
| `low` | Small sample or significant fallback from requested dimensions |
| `none` | No data available for this combination |
### Fallback Behavior
When exact matches are unavailable, the prediction system falls back through progressively broader aggregations:
1. **Exact match** -- task_type + model + provider + complexity
2. **Drop complexity** -- task_type + model + provider
3. **Drop model** -- task_type + provider
4. **Global** -- task_type only
The `fallback_level` field in metadata indicates which level was used (0 = exact match).
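The fallback chain can be pictured as a loop over progressively broader keys. This is a hypothetical client-side illustration; the real service resolves fallbacks server-side:

```python
def resolve_prediction(lookup, task_type, model, provider, complexity):
    """Try progressively broader keys; return (prediction, fallback_level)."""
    keys = [
        (task_type, model, provider, complexity),  # level 0: exact match
        (task_type, model, provider),              # level 1: drop complexity
        (task_type, provider),                     # level 2: drop model
        (task_type,),                              # level 3: global
    ]
    for level, key in enumerate(keys):
        prediction = lookup(key)
        if prediction is not None:
            return prediction, level
    return None, None
```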
### Cache Strategy
Predictions are cached in-memory by the SDK with a **6-hour TTL** (`predictionCacheTtlMs: 21_600_000`). The `PredictionService` pre-fetches common combinations on startup to warm the cache:
- **Models:** claude-sonnet-4-5, claude-opus-4, claude-haiku-4-5, gpt-4o, gpt-4o-mini
- **Task types:** implementation, planning, code_review
- **Complexities:** low, medium
This produces 30 pre-cached queries (5 models x 3 task types x 2 complexities). Subsequent requests for these combinations are served from cache without any HTTP call.
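The 30 warm-cache queries are just the Cartesian product of those three lists; a sketch of the enumeration (not `PredictionService`'s actual code):

```python
from itertools import product

MODELS = ["claude-sonnet-4-5", "claude-opus-4", "claude-haiku-4-5", "gpt-4o", "gpt-4o-mini"]
TASK_TYPES = ["implementation", "planning", "code_review"]
COMPLEXITIES = ["low", "medium"]

# 5 models x 3 task types x 2 complexities = 30 prediction queries to pre-fetch.
warm_queries = [
    {"model": m, "taskType": t, "complexity": c}
    for m, t, c in product(MODELS, TASK_TYPES, COMPLEXITIES)
]
```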
---
## 5. SDK Reference
### JavaScript: @mosaicstack/telemetry-client
**Registry:** Gitea npm registry at `git.mosaicstack.dev`
**Version:** 0.1.0
**Installation:**
```bash
pnpm add @mosaicstack/telemetry-client
```
**Key Exports:**
```typescript
// Client
import { TelemetryClient, EventBuilder, resolveConfig } from "@mosaicstack/telemetry-client";
// Types
import type {
TelemetryConfig,
TaskCompletionEvent,
EventBuilderParams,
PredictionQuery,
PredictionResponse,
PredictionData,
PredictionMetadata,
TokenDistribution,
} from "@mosaicstack/telemetry-client";
// Enums
import {
TaskType,
Complexity,
Harness,
Provider,
QualityGate,
Outcome,
RepoSizeCategory,
} from "@mosaicstack/telemetry-client";
```
**TelemetryClient API:**
| Method | Description |
| ------------------------------------------------------------------- | ------------------------------------------------------------ |
| `constructor(config: TelemetryConfig)` | Create a new client with the given configuration |
| `start(): void` | Start background batch submission (idempotent) |
| `stop(): Promise<void>` | Stop background submission, flush remaining events |
| `track(event: TaskCompletionEvent): void` | Queue an event for batch submission (never throws) |
| `getPrediction(query: PredictionQuery): PredictionResponse \| null` | Get a cached prediction (returns null if not cached/expired) |
| `refreshPredictions(queries: PredictionQuery[]): Promise<void>` | Force-refresh predictions from the server |
| `eventBuilder: EventBuilder` | Get the EventBuilder for constructing events |
| `queueSize: number` | Number of events currently queued |
| `isRunning: boolean` | Whether the client is currently running |
**TelemetryConfig Options:**
| Option | Type | Default | Description |
| ---------------------- | ------------------------ | ------------------- | ---------------------------------- |
| `serverUrl` | `string` | (required) | Base URL of the telemetry server |
| `apiKey` | `string` | (required) | 64-char hex API key |
| `instanceId` | `string` | (required) | UUID for this instance |
| `enabled` | `boolean` | `true` | Enable/disable telemetry |
| `submitIntervalMs` | `number` | `300_000` (5 min) | Interval between batch submissions |
| `maxQueueSize` | `number` | `1000` | Maximum queued events |
| `batchSize` | `number` | `100` | Maximum events per batch |
| `requestTimeoutMs` | `number` | `10_000` (10 sec) | HTTP request timeout |
| `predictionCacheTtlMs` | `number` | `21_600_000` (6 hr) | Prediction cache TTL |
| `dryRun` | `boolean` | `false` | Log events instead of sending |
| `maxRetries` | `number` | `3` | Retries per submission |
| `onError` | `(error: Error) => void` | noop | Error callback |
**EventBuilder Usage:**
```typescript
const event = client.eventBuilder.build({
task_duration_ms: 1500,
task_type: TaskType.IMPLEMENTATION,
complexity: Complexity.LOW,
harness: Harness.API_DIRECT,
model: "claude-sonnet-4-5",
provider: Provider.ANTHROPIC,
estimated_input_tokens: 0,
estimated_output_tokens: 0,
actual_input_tokens: 200,
actual_output_tokens: 500,
estimated_cost_usd_micros: 0,
actual_cost_usd_micros: 8100,
quality_gate_passed: true,
quality_gates_run: [QualityGate.LINT, QualityGate.TEST],
quality_gates_failed: [],
context_compactions: 0,
context_rotations: 0,
context_utilization_final: 0.3,
outcome: Outcome.SUCCESS,
retry_count: 0,
language: "typescript",
});
client.track(event);
```
### Python: mosaicstack-telemetry
**Registry:** Gitea PyPI registry at `git.mosaicstack.dev`
**Version:** 0.1.0
**Installation:**
```bash
pip install mosaicstack-telemetry
```
**Key Imports:**
```python
from mosaicstack_telemetry import (
TelemetryClient,
TelemetryConfig,
EventBuilder,
TaskType,
Complexity,
Harness,
Provider,
QualityGate,
Outcome,
)
```
**Python Client Usage:**
```python
# Create config (reads MOSAIC_TELEMETRY_* env vars automatically)
config = TelemetryConfig()
errors = config.validate()
# Create and start client
client = TelemetryClient(config)
await client.start_async()
# Build and track an event
builder = EventBuilder(instance_id=config.instance_id)
event = (
builder
.task_type(TaskType.IMPLEMENTATION)
.complexity_level(Complexity.MEDIUM)
.harness_type(Harness.CLAUDE_CODE)
.model("claude-sonnet-4-5")
.provider(Provider.ANTHROPIC)
.duration_ms(5000)
.outcome_value(Outcome.SUCCESS)
.tokens(
estimated_in=0,
estimated_out=0,
actual_in=3000,
actual_out=1500,
)
.cost(estimated=0, actual=52500)
.quality(
passed=True,
gates_run=[QualityGate.BUILD, QualityGate.LINT, QualityGate.TEST],
gates_failed=[],
)
.context(compactions=0, rotations=0, utilization=0.4)
.retry_count(0)
.language("typescript")
.build()
)
client.track(event)
# Shutdown (flushes remaining events)
await client.stop_async()
```
---
## 6. Development Guide
### Testing Locally with Dry-Run Mode
The fastest way to develop with telemetry is to use dry-run mode. This logs event payloads to the console without needing a running telemetry API:
```bash
# In your .env
MOSAIC_TELEMETRY_ENABLED=true
MOSAIC_TELEMETRY_DRY_RUN=true
MOSAIC_TELEMETRY_SERVER_URL=http://localhost:8000
MOSAIC_TELEMETRY_API_KEY=0000000000000000000000000000000000000000000000000000000000000000
MOSAIC_TELEMETRY_INSTANCE_ID=00000000-0000-0000-0000-000000000000
```
Start the API server and trigger LLM operations. You will see telemetry event payloads logged in the console output.
### Adding New Tracking Points
To add telemetry tracking to a new service in the NestJS API:
**Step 1:** Inject `MosaicTelemetryService` into your service. Because `MosaicTelemetryModule` is global, no module import is needed:
```typescript
import { Injectable } from "@nestjs/common";
import { MosaicTelemetryService } from "../mosaic-telemetry/mosaic-telemetry.service";
import { TaskType, Complexity, Harness, Provider, Outcome } from "@mosaicstack/telemetry-client";
@Injectable()
export class MyService {
constructor(private readonly telemetry: MosaicTelemetryService) {}
}
```
**Step 2:** Build and track events after task completion:
```typescript
async performTask(): Promise<void> {
const start = Date.now();
// ... perform the task ...
const duration = Date.now() - start;
const builder = this.telemetry.eventBuilder;
if (builder) {
const event = builder.build({
task_duration_ms: duration,
task_type: TaskType.IMPLEMENTATION,
complexity: Complexity.MEDIUM,
harness: Harness.API_DIRECT,
model: "claude-sonnet-4-5",
provider: Provider.ANTHROPIC,
estimated_input_tokens: 0,
estimated_output_tokens: 0,
actual_input_tokens: inputTokens,
actual_output_tokens: outputTokens,
estimated_cost_usd_micros: 0,
actual_cost_usd_micros: costMicros,
quality_gate_passed: true,
quality_gates_run: [],
quality_gates_failed: [],
context_compactions: 0,
context_rotations: 0,
context_utilization_final: 0,
outcome: Outcome.SUCCESS,
retry_count: 0,
});
this.telemetry.trackTaskCompletion(event);
}
}
```
**Step 3:** For LLM-specific tracking, use `LlmTelemetryTrackerService` instead, which handles cost calculation and task type inference automatically:
```typescript
import { LlmTelemetryTrackerService } from "../llm/llm-telemetry-tracker.service";
@Injectable()
export class MyLlmService {
constructor(private readonly telemetryTracker: LlmTelemetryTrackerService) {}
async chat(): Promise<void> {
const start = Date.now();
// ... call LLM ...
this.telemetryTracker.trackLlmCompletion({
model: "claude-sonnet-4-5",
providerType: "claude",
operation: "chat",
durationMs: Date.now() - start,
inputTokens: 150,
outputTokens: 300,
callingContext: "brain", // Used for task type inference
success: true,
});
}
}
```
### Adding Tracking in the Coordinator (Python)
Use the `build_task_event()` helper from `src/mosaic_telemetry.py`:
```python
from src.mosaic_telemetry import build_task_event, get_telemetry_client
client = get_telemetry_client(app)
if client is not None:
event = build_task_event(
instance_id=instance_id,
task_type=TaskType.IMPLEMENTATION,
complexity=Complexity.MEDIUM,
outcome=Outcome.SUCCESS,
duration_ms=5000,
model="claude-sonnet-4-5",
provider=Provider.ANTHROPIC,
harness=Harness.CLAUDE_CODE,
actual_input_tokens=3000,
actual_output_tokens=1500,
actual_cost_micros=52500,
)
client.track(event)
```
### Troubleshooting
**Telemetry events not appearing:**
1. Check that `MOSAIC_TELEMETRY_ENABLED=true` is set
2. Verify all three required variables are set: `SERVER_URL`, `API_KEY`, `INSTANCE_ID`
3. Look for warning logs: `"Mosaic Telemetry is enabled but missing configuration"` indicates a missing variable
4. Try dry-run mode to confirm events are being generated
**Console shows "Mosaic Telemetry is disabled":**
This is the expected message when `MOSAIC_TELEMETRY_ENABLED=false`. If you intended telemetry to be active, set it to `true`.
**Events queuing but not submitting:**
- Check that the telemetry API server at `MOSAIC_TELEMETRY_SERVER_URL` is reachable
- Verify the API key is a valid 64-character hex string
- The default submission interval is 5 minutes; wait at least one interval or call `stop()` to force a flush
**Prediction endpoint returns null:**
- Predictions require sufficient historical data in the telemetry API
- Check the `metadata.confidence` field; `"none"` means no data exists for this combination
- Predictions are cached for 6 hours; new data takes time to appear
- The `PredictionService` logs startup refresh status; check logs for errors
**"Telemetry client error" in logs:**
- These are non-fatal. The SDK never blocks application logic.
- Common causes: network timeout, invalid API key, server-side validation failure
- Check the telemetry API logs for corresponding errors