Issues resolved:

- #68: pgvector Setup
  * Added pgvector vector index migration for knowledge_embeddings
  * Vector index uses HNSW algorithm with cosine distance
  * Optimized for 1536-dimension OpenAI embeddings
- #69: Embedding Generation Pipeline
  * Created EmbeddingService with OpenAI integration
  * Automatic embedding generation on entry create/update
  * Batch processing endpoint for existing entries
  * Async generation to avoid blocking API responses
  * Content preparation with title weighting
- #70: Semantic Search API
  * POST /api/knowledge/search/semantic - pure vector search
  * POST /api/knowledge/search/hybrid - RRF combined search
  * POST /api/knowledge/embeddings/batch - batch generation
  * Comprehensive test coverage
  * Full documentation in docs/SEMANTIC_SEARCH.md

Technical details:
- Uses OpenAI text-embedding-3-small model (1536 dims)
- HNSW index for O(log n) similarity search
- Reciprocal Rank Fusion for hybrid search
- Graceful degradation when OpenAI not configured
- Async embedding generation for performance

Configuration:
- Added OPENAI_API_KEY to .env.example
- Optional feature - disabled if API key not set
- Falls back to keyword search in hybrid mode
Semantic Search Implementation
This document describes the semantic search implementation for the Mosaic Stack Knowledge Module using OpenAI embeddings and PostgreSQL pgvector.
Overview
The semantic search feature enables AI-powered similarity search across knowledge entries using vector embeddings. It complements the existing full-text search with semantic understanding, allowing users to find relevant content even when exact keywords don't match.
Architecture
Components
- EmbeddingService - Generates and manages OpenAI embeddings
- SearchService - Enhanced with semantic and hybrid search methods
- KnowledgeService - Automatically generates embeddings on entry create/update
- pgvector - PostgreSQL extension for vector similarity search
Database Schema
Knowledge Embeddings Table
model KnowledgeEmbedding {
  id        String         @id @default(uuid()) @db.Uuid
  entryId   String         @unique @map("entry_id") @db.Uuid
  entry     KnowledgeEntry @relation(fields: [entryId], references: [id], onDelete: Cascade)
  embedding Unsupported("vector(1536)")
  model     String
  createdAt DateTime       @default(now()) @map("created_at") @db.Timestamptz
  updatedAt DateTime       @updatedAt @map("updated_at") @db.Timestamptz

  @@index([entryId])
  @@map("knowledge_embeddings")
}
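Because Prisma maps the vector column as Unsupported, reads and writes of the embedding typically go through raw SQL, which means serializing the number array into pgvector's text literal. A minimal sketch (the helper name toVectorLiteral is illustrative, not the project's actual API):

```typescript
// Sketch: pgvector accepts vectors as a text literal like "[0.1,0.2,0.3]".
// Since Prisma treats the column as Unsupported("vector(1536)"), inserts
// and updates usually go through $executeRaw with a serialized literal.

function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}

// Illustrative raw write (shown as a comment; names are assumptions):
// await prisma.$executeRaw`
//   UPDATE knowledge_embeddings
//   SET embedding = ${toVectorLiteral(vec)}::vector
//   WHERE entry_id = ${entryId}::uuid
// `;
```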
Vector Index
An HNSW (Hierarchical Navigable Small World) index is created for fast similarity search:
CREATE INDEX knowledge_embeddings_embedding_idx
ON knowledge_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
Configuration
Environment Variables
Add to your .env file:
# Optional: Required for semantic search
OPENAI_API_KEY=sk-...
Get your API key from: https://platform.openai.com/api-keys
OpenAI Model
The default embedding model is text-embedding-3-small (1536 dimensions). This provides:
- High quality embeddings
- Cost-effective pricing
- Fast generation speed
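As a rough sketch of what generating one of these embeddings looks like (this is not the project's actual EmbeddingService code, just a direct POST to OpenAI's /v1/embeddings REST endpoint using Node 18+ global fetch):

```typescript
// Assumed sketch: calling OpenAI's embeddings endpoint directly.
const EMBEDDING_MODEL = "text-embedding-3-small";

// Pure helper so the request shape is visible on its own.
function buildEmbeddingRequest(input: string) {
  return { model: EMBEDDING_MODEL, input };
}

async function embed(text: string): Promise<number[]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildEmbeddingRequest(text)),
  });
  if (!res.ok) throw new Error(`OpenAI embeddings request failed: ${res.status}`);
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding; // 1536 floats for text-embedding-3-small
}
```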
API Endpoints
1. Semantic Search
POST /api/knowledge/search/semantic
Search using vector similarity only.
Request:
{
"query": "database performance optimization",
"status": "PUBLISHED"
}
Query Parameters:
- page (optional): Page number (default: 1)
- limit (optional): Results per page (default: 20)
Response:
{
"data": [
{
"id": "uuid",
"slug": "postgres-indexing",
"title": "PostgreSQL Indexing Strategies",
"content": "...",
"rank": 0.87,
"tags": [...],
...
}
],
"pagination": {
"page": 1,
"limit": 20,
"total": 15,
"totalPages": 1
},
"query": "database performance optimization"
}
2. Hybrid Search (Recommended)
POST /api/knowledge/search/hybrid
Combines vector similarity and full-text search using Reciprocal Rank Fusion (RRF).
Request:
{
"query": "indexing strategies",
"status": "PUBLISHED"
}
Benefits of Hybrid Search:
- Best of both worlds: semantic understanding + keyword matching
- Better ranking for exact matches
- Improved recall and precision
- Resilient to edge cases
3. Batch Embedding Generation
POST /api/knowledge/embeddings/batch
Generate embeddings for all existing entries. Useful for:
- Initial setup after enabling semantic search
- Regenerating embeddings after model updates
Request:
{
"status": "PUBLISHED"
}
Response:
{
"message": "Generated 42 embeddings out of 45 entries",
"total": 45,
"success": 42
}
Permissions: Requires ADMIN role
Automatic Embedding Generation
Embeddings are automatically generated when:
- Creating an entry - Embedding generated asynchronously after creation
- Updating an entry - Embedding regenerated if title or content changes
The generation happens asynchronously to avoid blocking API responses.
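The fire-and-forget pattern behind this can be sketched as follows (names like onEntrySaved and generateEmbeddingFor are hypothetical, not the project's actual functions):

```typescript
// Sketch of fire-and-forget embedding generation: the entry is saved and
// the response returns immediately; the embedding promise runs in the
// background and only logs on failure.

type Entry = { id: string; title: string; content: string };

const events: string[] = [];

async function generateEmbeddingFor(entry: Entry): Promise<void> {
  // Stand-in for the real API call + knowledge_embeddings upsert.
  await Promise.resolve();
  events.push(`embedded:${entry.id}`);
}

function onEntrySaved(entry: Entry): Entry {
  // Deliberately NOT awaited: a slow or failing OpenAI call must not
  // block or fail the create/update response.
  generateEmbeddingFor(entry).catch((err) =>
    console.error(`embedding failed for ${entry.id}:`, err),
  );
  events.push(`responded:${entry.id}`);
  return entry;
}
```

Errors in the background task are caught and logged rather than propagated, which is what makes the feature degrade gracefully when OpenAI is unavailable.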
Content Preparation
Before generating embeddings, content is prepared by:
- Combining title and content
- Weighting title more heavily (appears twice)
- This improves semantic matching on titles
function prepareContentForEmbedding(title, content) {
  // Title appears twice so it carries more weight in the embedding
  return `${title}\n\n${title}\n\n${content}`.trim();
}
Search Algorithms
Vector Similarity Search
Uses cosine distance to find semantically similar entries:
SELECT *
FROM knowledge_entries e
INNER JOIN knowledge_embeddings emb ON e.id = emb.entry_id
ORDER BY emb.embedding <=> query_embedding
LIMIT 20
- <=> operator: cosine distance
- Lower distance = higher similarity
- Efficient with HNSW index
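The cosine distance that `<=>` computes is 1 minus cosine similarity; a minimal sketch of the math:

```typescript
// Cosine distance as computed by pgvector's <=> operator:
// distance = 1 - (a . b) / (|a| * |b|). Lower distance = more similar.

function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The `rank` value in the semantic search response (e.g. 0.87) is presumably a similarity score derived from this distance, so higher rank means a closer match.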
Hybrid Search (RRF Algorithm)
Reciprocal Rank Fusion combines rankings from multiple sources:
RRF(d) = sum(1 / (k + rank_i))
Where:
- d = document
- k = constant (60 is standard)
- rank_i = rank from source i
Example:
Document ranks in two searches:
- Vector search: rank 3
- Keyword search: rank 1
RRF score = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
Higher RRF score = better combined ranking.
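The fusion step above can be sketched in a few lines (illustrative code, not the actual SearchService implementation):

```typescript
// Reciprocal Rank Fusion over ranked result lists. Each input is an
// ordered array of document ids (best first); k = 60 by convention.

function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, idx) => {
      const rank = idx + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// Fused order: ids sorted by descending RRF score.
function rrfOrder(rankings: string[][], k = 60): string[] {
  return [...rrfFuse(rankings, k).entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Note that a document only found by one source still gets a score, so hybrid search never returns fewer results than either source alone would for the fused candidate set.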
Performance Considerations
Index Parameters
The HNSW index uses:
- m = 16: Max connections per layer (balances accuracy/memory)
- ef_construction = 64: Build quality (higher = more accurate, slower build)
Query Performance
- Typical query time: 10-50ms (with index)
- Without index: 1000ms+ (not recommended)
- Embedding generation: 100-300ms per entry
Cost (OpenAI API)
Using text-embedding-3-small:
- ~$0.00002 per 1000 tokens
- Average entry (~500 tokens): $0.00001
- 10,000 entries: ~$0.10
Very cost-effective for most use cases.
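The estimates above follow from simple arithmetic; a sketch using the ~$0.00002/1K-token rate quoted here (OpenAI pricing may change):

```typescript
// Embedding cost estimate for text-embedding-3-small at the rate
// quoted above (~$0.00002 per 1,000 tokens; subject to change).

const USD_PER_1K_TOKENS = 0.00002;

function embeddingCostUSD(entries: number, avgTokensPerEntry = 500): number {
  return (entries * avgTokensPerEntry * USD_PER_1K_TOKENS) / 1000;
}
```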
Migration Guide
1. Run Migrations
cd apps/api
pnpm prisma migrate deploy
This creates:
- knowledge_embeddings table
- Vector index on embeddings
2. Configure OpenAI API Key
# Add to .env
OPENAI_API_KEY=sk-...
3. Generate Embeddings for Existing Entries
curl -X POST http://localhost:3001/api/knowledge/embeddings/batch \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"status": "PUBLISHED"}'
Or use the web UI (Admin dashboard → Knowledge → Generate Embeddings).
4. Test Semantic Search
curl -X POST http://localhost:3001/api/knowledge/search/hybrid \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"query": "your search query"}'
Troubleshooting
"OpenAI API key not configured"
Cause: OPENAI_API_KEY environment variable not set
Solution: Add the API key to your .env file and restart the API server
Semantic search returns no results
Possible causes:
- No embeddings generated
  - Run the batch generation endpoint
  - Check the knowledge_embeddings table
- Query too specific
  - Try broader terms
  - Use hybrid search for better recall
- Index not created
  - Check migration status
  - Verify the index exists: \di knowledge_embeddings_embedding_idx in psql
Slow query performance
Solutions:
1. Verify the index exists and is being used:
   EXPLAIN ANALYZE
   SELECT * FROM knowledge_embeddings
   ORDER BY embedding <=> '[...]'::vector
   LIMIT 20;
2. Adjust index parameters (requires recreating the index):
   DROP INDEX knowledge_embeddings_embedding_idx;
   CREATE INDEX knowledge_embeddings_embedding_idx
   ON knowledge_embeddings
   USING hnsw (embedding vector_cosine_ops)
   WITH (m = 32, ef_construction = 128); -- Higher values improve accuracy at the cost of build time and memory
Future Enhancements
Potential improvements:
- Custom embeddings: Support for local embedding models (Ollama, etc.)
- Chunking: Split large entries into chunks for better granularity
- Reranking: Add cross-encoder reranking for top results
- Caching: Cache query embeddings for repeated searches
- Multi-modal: Support image/file embeddings