telemetry-client-js/docs/integration-guide.md
Jason Woltje 20f56edb49
docs(#1): document dev/release package versioning convention
Add versioning table to README and integration guide showing dist-tags,
version formats, and .npmrc registry configuration for the Gitea npm
registry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 23:02:46 -06:00


# Integration Guide
This guide covers how to integrate `@mosaicstack/telemetry-client` into your applications. The SDK targets **Mosaic Telemetry API v1** (event schema version `1.0`).
## Prerequisites
- Node.js >= 18 (for native `fetch` and `crypto.randomUUID()`)
- A Mosaic Telemetry API key and instance ID (issued by an administrator via the admin API)
## Installation
Configure the Gitea npm registry in your project's `.npmrc`:
```ini
@mosaicstack:registry=https://git.mosaicstack.dev/api/packages/mosaic/npm/
```
Then install:
```bash
# Latest stable release (from main)
npm install @mosaicstack/telemetry-client
# Latest dev build (from develop)
npm install @mosaicstack/telemetry-client@dev
```
| Branch | Dist-tag | Version format | Example |
|--------|----------|----------------|---------|
| `main` | `latest` | `{version}` | `0.1.0` |
| `develop` | `dev` | `{version}-dev.{YYYYMMDDHHmmss}` | `0.1.0-dev.20260215050000` |

The package ships ESM-only with TypeScript declarations and has zero runtime dependencies.
## Environment Setup
Store your credentials in environment variables — never hardcode them.
```bash
# .env (not committed — add to .gitignore)
TELEMETRY_API_URL=https://tel-api.mosaicstack.dev
TELEMETRY_API_KEY=msk_your_api_key_here
TELEMETRY_INSTANCE_ID=a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d
```
```bash
# .env.example (committed — documents required variables)
TELEMETRY_API_URL=https://tel-api.mosaicstack.dev
TELEMETRY_API_KEY=your-api-key
TELEMETRY_INSTANCE_ID=your-instance-uuid
```
---
## Instrumenting a Next.js App
Next.js server actions and API routes run on Node.js, so the SDK works directly. Create a shared singleton and track events from your server-side code.
### 1. Create a telemetry singleton
```typescript
// lib/telemetry.ts
import {
  TelemetryClient,
  TaskType,
  Complexity,
  Harness,
  Provider,
  Outcome,
  QualityGate,
} from '@mosaicstack/telemetry-client';

let client: TelemetryClient | null = null;

export function getTelemetryClient(): TelemetryClient {
  if (!client) {
    client = new TelemetryClient({
      serverUrl: process.env.TELEMETRY_API_URL!,
      apiKey: process.env.TELEMETRY_API_KEY!,
      instanceId: process.env.TELEMETRY_INSTANCE_ID!,
      enabled: process.env.NODE_ENV === 'production',
      onError: (err) => console.error('[telemetry]', err.message),
    });
    client.start();
  }
  return client;
}

// Re-export enums for convenience
export { TaskType, Complexity, Harness, Provider, Outcome, QualityGate };
```
### 2. Track events from an API route
```typescript
// app/api/task-complete/route.ts
import { NextResponse } from 'next/server';
import { getTelemetryClient, TaskType, Complexity, Harness, Provider, Outcome } from '@/lib/telemetry';
export async function POST(request: Request) {
  const body = await request.json();
  const client = getTelemetryClient();

  const event = client.eventBuilder.build({
    task_duration_ms: body.durationMs,
    task_type: TaskType.IMPLEMENTATION,
    complexity: Complexity.MEDIUM,
    harness: Harness.CLAUDE_CODE,
    model: body.model,
    provider: Provider.ANTHROPIC,
    estimated_input_tokens: body.estimatedInputTokens,
    estimated_output_tokens: body.estimatedOutputTokens,
    actual_input_tokens: body.actualInputTokens,
    actual_output_tokens: body.actualOutputTokens,
    estimated_cost_usd_micros: body.estimatedCostMicros,
    actual_cost_usd_micros: body.actualCostMicros,
    quality_gate_passed: body.qualityGatePassed,
    quality_gates_run: body.qualityGatesRun,
    quality_gates_failed: body.qualityGatesFailed,
    context_compactions: body.contextCompactions,
    context_rotations: body.contextRotations,
    context_utilization_final: body.contextUtilization,
    outcome: Outcome.SUCCESS,
    retry_count: 0,
    language: 'typescript',
  });

  client.track(event);
  return NextResponse.json({ status: 'queued' });
}
```
### 3. Graceful shutdown
Next.js doesn't provide a built-in shutdown hook, but you can handle `SIGTERM`:
```typescript
// instrumentation.ts (Next.js instrumentation file)
export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    const { getTelemetryClient } = await import('./lib/telemetry');

    // Ensure the client starts on server boot
    getTelemetryClient();

    // Flush remaining events on shutdown
    const shutdown = async () => {
      const client = getTelemetryClient();
      await client.stop();
      process.exit(0);
    };
    process.on('SIGTERM', shutdown);
    process.on('SIGINT', shutdown);
  }
}
```
---
## Instrumenting a Node.js Service
The same pattern applies to any standalone Node.js service (Express, Fastify, a plain script, etc.).
### 1. Initialize and start
```typescript
// src/telemetry.ts
import { TelemetryClient } from '@mosaicstack/telemetry-client';
export const telemetry = new TelemetryClient({
  serverUrl: process.env.TELEMETRY_API_URL ?? 'https://tel-api.mosaicstack.dev',
  apiKey: process.env.TELEMETRY_API_KEY!,
  instanceId: process.env.TELEMETRY_INSTANCE_ID!,
  onError: (err) => console.error('[telemetry]', err.message),
});

telemetry.start();
```
### 2. Track events after task completion
```typescript
// src/task-runner.ts
import {
  TaskType,
  Complexity,
  Harness,
  Provider,
  Outcome,
  QualityGate,
} from '@mosaicstack/telemetry-client';
import { telemetry } from './telemetry.js';

async function runTask() {
  const startTime = Date.now();

  // ... run your AI coding task ...

  const durationMs = Date.now() - startTime;

  const event = telemetry.eventBuilder.build({
    task_duration_ms: durationMs,
    task_type: TaskType.IMPLEMENTATION,
    complexity: Complexity.HIGH,
    harness: Harness.CLAUDE_CODE,
    model: 'claude-sonnet-4-5-20250929',
    provider: Provider.ANTHROPIC,
    estimated_input_tokens: 200000,
    estimated_output_tokens: 80000,
    actual_input_tokens: 215000,
    actual_output_tokens: 72000,
    estimated_cost_usd_micros: 1200000,
    actual_cost_usd_micros: 1150000,
    quality_gate_passed: true,
    quality_gates_run: [
      QualityGate.BUILD,
      QualityGate.LINT,
      QualityGate.TEST,
      QualityGate.TYPECHECK,
    ],
    quality_gates_failed: [],
    context_compactions: 3,
    context_rotations: 1,
    context_utilization_final: 0.85,
    outcome: Outcome.SUCCESS,
    retry_count: 0,
    language: 'typescript',
    repo_size_category: 'medium',
  });

  telemetry.track(event);
}
```
### 3. Graceful shutdown
```typescript
// src/main.ts
import { telemetry } from './telemetry.js';
async function main() {
  // ... your application logic ...

  // On shutdown, flush remaining events
  process.on('SIGTERM', async () => {
    await telemetry.stop();
    process.exit(0);
  });
}

main();
```
---
## Using Predictions
The telemetry API provides crowd-sourced predictions for token usage, cost, and duration based on historical data. The SDK caches these predictions locally.
### Pre-populate the cache
Call `refreshPredictions()` at startup with the dimension combinations your application uses:
```typescript
import { TaskType, Provider, Complexity } from '@mosaicstack/telemetry-client';
import { telemetry } from './telemetry.js';
// Fetch predictions for all combinations you'll need
await telemetry.refreshPredictions([
  { task_type: TaskType.IMPLEMENTATION, model: 'claude-sonnet-4-5-20250929', provider: Provider.ANTHROPIC, complexity: Complexity.LOW },
  { task_type: TaskType.IMPLEMENTATION, model: 'claude-sonnet-4-5-20250929', provider: Provider.ANTHROPIC, complexity: Complexity.MEDIUM },
  { task_type: TaskType.IMPLEMENTATION, model: 'claude-sonnet-4-5-20250929', provider: Provider.ANTHROPIC, complexity: Complexity.HIGH },
  { task_type: TaskType.TESTING, model: 'claude-haiku-4-5-20251001', provider: Provider.ANTHROPIC, complexity: Complexity.LOW },
]);
```
### Read cached predictions
```typescript
const prediction = telemetry.getPrediction({
  task_type: TaskType.IMPLEMENTATION,
  model: 'claude-sonnet-4-5-20250929',
  provider: Provider.ANTHROPIC,
  complexity: Complexity.MEDIUM,
});

if (prediction?.prediction) {
  const p = prediction.prediction;

  console.log('Token predictions (median):', {
    inputTokens: p.input_tokens.median,
    outputTokens: p.output_tokens.median,
  });
  console.log('Cost prediction:', `$${(p.cost_usd_micros.median / 1_000_000).toFixed(2)}`);
  console.log('Duration prediction:', `${(p.duration_ms.median / 1000).toFixed(0)}s`);
  console.log('Correction factors:', {
    input: p.correction_factors.input, // >1.0 means estimates tend to be too low
    output: p.correction_factors.output,
  });
  console.log('Quality:', {
    gatePassRate: `${(p.quality.gate_pass_rate * 100).toFixed(0)}%`,
    successRate: `${(p.quality.success_rate * 100).toFixed(0)}%`,
  });

  // Check confidence level
  if (prediction.metadata.confidence === 'low') {
    console.warn('Low confidence — small sample size or fallback was applied');
  }
}
```
### Understand fallback behavior
When the server doesn't have enough data for an exact match, it broadens the query by dropping dimensions (e.g., ignoring complexity). The `metadata` fields tell you what happened:
| `fallback_level` | Meaning |
|-------------------|---------|
| `0` | Exact match on all dimensions |
| `1+` | Some dimensions were dropped to find data |
| `-1` | No prediction data available at any level |
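One way to act on these levels in code. The numeric values follow the table above; `classifyFallback` and the category names are our own illustration, not part of the SDK:

```typescript
// Map a prediction's fallback_level to a coarse trust category.
// Sketch only — the category names are ours, not the SDK's.
type FallbackTrust = 'exact' | 'broadened' | 'unavailable';

function classifyFallback(level: number): FallbackTrust {
  if (level === -1) return 'unavailable'; // no data at any level
  if (level === 0) return 'exact';        // matched all dimensions
  return 'broadened';                     // some dimensions were dropped
}
```

A `'broadened'` result suggests treating the medians as rough guidance rather than a per-complexity estimate.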
---
## Environment-Specific Configuration
### Development
```typescript
const client = new TelemetryClient({
  serverUrl: 'http://localhost:8000', // Local dev server
  apiKey: process.env.TELEMETRY_API_KEY!,
  instanceId: process.env.TELEMETRY_INSTANCE_ID!,
  dryRun: true, // Don't send real data
  submitIntervalMs: 10_000, // Flush more frequently for debugging
  onError: (err) => console.error('[telemetry]', err),
});
```
### Production
```typescript
const client = new TelemetryClient({
  serverUrl: 'https://tel-api.mosaicstack.dev',
  apiKey: process.env.TELEMETRY_API_KEY!,
  instanceId: process.env.TELEMETRY_INSTANCE_ID!,
  submitIntervalMs: 300_000, // 5 min (default)
  maxRetries: 3, // Retry on transient failures
  onError: (err) => {
    // Route to your observability stack
    logger.error('Telemetry submission failed', { error: err.message });
  },
});
```
### Conditional enable/disable
```typescript
const client = new TelemetryClient({
  serverUrl: process.env.TELEMETRY_API_URL!,
  apiKey: process.env.TELEMETRY_API_KEY!,
  instanceId: process.env.TELEMETRY_INSTANCE_ID!,
  enabled: process.env.TELEMETRY_ENABLED !== 'false', // Opt-out via env var
});
```
When `enabled` is `false`, `track()` returns immediately without queuing.
---
## Error Handling
The SDK is designed to never disrupt your application:
- **`track()` never throws.** All errors are caught and routed to the `onError` callback.
- **Failed batches are re-queued.** If a submission fails, events are prepended back to the queue for the next flush cycle.
- **Exponential backoff with jitter.** Retries use 1s base delay, doubling up to 60s, with random jitter to prevent thundering herd.
- **`Retry-After` header support.** On HTTP 429 (rate limited), the SDK respects the server's `Retry-After` header.
- **HTTP 403 is not retried.** An API key / instance ID mismatch is a permanent error.
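The retry delays described above can be sketched roughly as follows. This is illustrative only, not the SDK's actual implementation; in particular, the exact jitter scheme is an assumption (equal jitter shown here):

```typescript
// Sketch of the documented retry policy: 1s base delay, doubling up to
// a 60s cap, with jitter; a Retry-After value (HTTP 429) wins outright.
function backoffDelayMs(attempt: number, retryAfterSeconds?: number): number {
  if (retryAfterSeconds !== undefined) {
    return retryAfterSeconds * 1000; // honor the server's Retry-After header
  }
  const BASE_MS = 1_000; // 1s base delay
  const CAP_MS = 60_000; // doubling is capped at 60s
  const exponential = Math.min(CAP_MS, BASE_MS * 2 ** attempt);
  // Equal jitter: half fixed, half random, to avoid thundering herd.
  return exponential / 2 + Math.random() * (exponential / 2);
}
```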
### Custom error handling
```typescript
const client = new TelemetryClient({
  // ...
  onError: (error) => {
    if (error.message.includes('HTTP 403')) {
      console.error('Telemetry auth failed — check API key and instance ID');
    } else if (error.message.includes('HTTP 429')) {
      console.warn('Telemetry rate limited — events will be retried');
    } else {
      console.error('Telemetry error:', error.message);
    }
  },
});
```
---
## Batch Submission Behavior
The SDK batches events for efficiency:
1. `track(event)` adds the event to an in-memory queue (bounded, FIFO eviction at capacity).
2. Every `submitIntervalMs` (default: 5 minutes), the background timer drains the queue in batches of up to `batchSize` (default/max: 100).
3. Each batch is POSTed to `POST /v1/events/batch` with exponential backoff on failure.
4. Calling `stop()` flushes all remaining events before resolving.
The server accepts up to **100 events per batch** and supports **partial success** — some events may be accepted while others (e.g., duplicates) are rejected.
---
## API Version Compatibility
| SDK Version | API Version | Schema Version |
|-------------|-------------|----------------|
| 0.1.x | v1 (`/v1/` endpoints) | `1.0` |
The `EventBuilder` automatically sets `schema_version: "1.0"` on every event. The SDK submits to `/v1/events/batch` and queries `/v1/predictions/batch`.
When the telemetry API introduces a v2, this SDK will add support in a new major release. The server supports two API versions simultaneously during a 6-month deprecation window.