# Semantic Search Implementation

This document describes the semantic search implementation for the Mosaic Stack Knowledge Module using OpenAI embeddings and PostgreSQL pgvector.

## Overview

The semantic search feature enables AI-powered similarity search across knowledge entries using vector embeddings. It complements the existing full-text search with semantic understanding, allowing users to find relevant content even when exact keywords don't match.

## Architecture

### Components

1. **EmbeddingService** - Generates and manages OpenAI embeddings
2. **SearchService** - Enhanced with semantic and hybrid search methods
3. **KnowledgeService** - Automatically generates embeddings on entry create/update
4. **pgvector** - PostgreSQL extension for vector similarity search

### Database Schema

#### Knowledge Embeddings Table

```prisma
model KnowledgeEmbedding {
  id        String         @id @default(uuid()) @db.Uuid
  entryId   String         @unique @map("entry_id") @db.Uuid
  entry     KnowledgeEntry @relation(fields: [entryId], references: [id], onDelete: Cascade)
  embedding Unsupported("vector(1536)")
  model     String
  createdAt DateTime       @default(now()) @map("created_at") @db.Timestamptz
  updatedAt DateTime       @updatedAt @map("updated_at") @db.Timestamptz

  @@index([entryId])
  @@map("knowledge_embeddings")
}
```

#### Vector Index

An HNSW (Hierarchical Navigable Small World) index is created for fast similarity search:

```sql
CREATE INDEX knowledge_embeddings_embedding_idx
ON knowledge_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

## Configuration

### Environment Variables

Add to your `.env` file:

```env
# Required for semantic search; the rest of the module works without it
OPENAI_API_KEY=sk-...
```

Get your API key from https://platform.openai.com/api-keys.

### OpenAI Model

The default embedding model is `text-embedding-3-small` (1536 dimensions). It provides:

- High-quality embeddings
- Cost-effective pricing
- Fast generation speed

## API Endpoints
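All three search endpoints below take a JSON body over POST with a bearer token. As a reference, here is a small TypeScript sketch of building such a request; the helper name is hypothetical, and the host and token placeholders match the curl examples later in this guide:

```typescript
// Hypothetical helper: builds a Request for the knowledge search endpoints.
// The endpoint paths come from this document; host and token are placeholders.
interface SearchBody {
  query: string;
  status?: string; // e.g. "PUBLISHED"
}

function buildSearchRequest(
  endpoint: "semantic" | "hybrid",
  body: SearchBody,
  token: string,
  baseUrl = "http://localhost:3001",
): Request {
  return new Request(`${baseUrl}/api/knowledge/search/${endpoint}`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
}

// Usage: fetch(buildSearchRequest("hybrid", { query: "indexing strategies" }, "YOUR_TOKEN"))
```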
### 1. Semantic Search

**POST** `/api/knowledge/search/semantic`

Search using vector similarity only.

**Request:**

```json
{
  "query": "database performance optimization",
  "status": "PUBLISHED"
}
```

**Query Parameters:**

- `page` (optional): Page number (default: 1)
- `limit` (optional): Results per page (default: 20)

**Response:**

```json
{
  "data": [
    {
      "id": "uuid",
      "slug": "postgres-indexing",
      "title": "PostgreSQL Indexing Strategies",
      "content": "...",
      "rank": 0.87,
      "tags": [...],
      ...
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 15,
    "totalPages": 1
  },
  "query": "database performance optimization"
}
```

### 2. Hybrid Search (Recommended)

**POST** `/api/knowledge/search/hybrid`

Combines vector similarity and full-text search using Reciprocal Rank Fusion (RRF).

**Request:**

```json
{
  "query": "indexing strategies",
  "status": "PUBLISHED"
}
```

**Benefits of hybrid search:**

- Best of both worlds: semantic understanding plus keyword matching
- Better ranking for exact matches
- Improved recall and precision
- Resilient to edge cases

### 3. Batch Embedding Generation

**POST** `/api/knowledge/embeddings/batch`

Generates embeddings for all existing entries. Useful for:

- Initial setup after enabling semantic search
- Regenerating embeddings after model updates

**Request:**

```json
{
  "status": "PUBLISHED"
}
```

**Response:**

```json
{
  "message": "Generated 42 embeddings out of 45 entries",
  "total": 45,
  "success": 42
}
```

**Permissions:** Requires the ADMIN role.

## Automatic Embedding Generation

Embeddings are automatically generated when:

1. **Creating an entry** - Embedding generated asynchronously after creation
2. **Updating an entry** - Embedding regenerated if the title or content changes

Generation happens asynchronously to avoid blocking API responses.

### Content Preparation

Before generating embeddings, content is prepared by:

1. Combining the title and content
2. Weighting the title more heavily (it appears twice)
This improves semantic matching on titles:

```typescript
// The title appears twice so its terms carry more weight in the embedding
prepareContentForEmbedding(title: string, content: string): string {
  return `${title}\n\n${title}\n\n${content}`.trim();
}
```

## Search Algorithms

### Vector Similarity Search

Uses cosine distance to find semantically similar entries:

```sql
SELECT *
FROM knowledge_entries e
INNER JOIN knowledge_embeddings emb ON e.id = emb.entry_id
ORDER BY emb.embedding <=> query_embedding
LIMIT 20;
```

- `<=>` operator: cosine distance
- Lower distance = higher similarity
- Efficient with the HNSW index

### Hybrid Search (RRF Algorithm)

Reciprocal Rank Fusion combines rankings from multiple sources:

```
RRF(d) = sum(1 / (k + rank_i))
```

Where:

- `d` = document
- `k` = constant (60 is standard)
- `rank_i` = rank of `d` from source i

**Example:** A document ranks in two searches as follows:

- Vector search: rank 3
- Keyword search: rank 1

RRF score = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323

A higher RRF score means a better combined ranking.

## Performance Considerations

### Index Parameters

The HNSW index uses:

- `m = 16`: Max connections per layer (balances accuracy and memory)
- `ef_construction = 64`: Build quality (higher = more accurate, slower build)

### Query Performance

- **Typical query time:** 10-50 ms (with index)
- **Without index:** 1000 ms+ (not recommended)
- **Embedding generation:** 100-300 ms per entry

### Cost (OpenAI API)

Using `text-embedding-3-small`:

- ~$0.00002 per 1,000 tokens
- Average entry (~500 tokens): ~$0.00001
- 10,000 entries: ~$0.10

Very cost-effective for most use cases.

## Migration Guide

### 1. Run Migrations

```bash
cd apps/api
pnpm prisma migrate deploy
```

This creates:

- The `knowledge_embeddings` table
- The vector index on embeddings

### 2. Configure OpenAI API Key

```bash
# Add to .env
OPENAI_API_KEY=sk-...
```
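Before moving on, you can sanity-check the key against OpenAI's embeddings endpoint directly (this hits the live API; assumes curl and a POSIX shell with `OPENAI_API_KEY` exported):

```bash
# A successful response is JSON whose data[0].embedding is a 1536-number vector;
# a 401 means the key is wrong or missing, independent of the Knowledge Module.
curl https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "text-embedding-3-small", "input": "hello"}'
```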
### 3. Generate Embeddings for Existing Entries

```bash
curl -X POST http://localhost:3001/api/knowledge/embeddings/batch \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "PUBLISHED"}'
```

Or use the web UI (Admin dashboard → Knowledge → Generate Embeddings).

### 4. Test Semantic Search

```bash
curl -X POST http://localhost:3001/api/knowledge/search/hybrid \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "your search query"}'
```

## Troubleshooting

### "OpenAI API key not configured"

**Cause:** The `OPENAI_API_KEY` environment variable is not set.

**Solution:** Add the API key to your `.env` file and restart the API server.

### Semantic search returns no results

Possible causes:

1. **No embeddings generated**
   - Run the batch generation endpoint
   - Check the `knowledge_embeddings` table
2. **Query too specific**
   - Try broader terms
   - Use hybrid search for better recall
3. **Index not created**
   - Check migration status
   - Verify the index exists: `\di knowledge_embeddings_embedding_idx` in psql

### Slow query performance

**Solutions:**

1. Verify the index exists and is being used:

   ```sql
   EXPLAIN ANALYZE
   SELECT * FROM knowledge_embeddings
   ORDER BY embedding <=> '[...]'::vector
   LIMIT 20;
   ```

2. Adjust index parameters (requires recreating the index):

   ```sql
   DROP INDEX knowledge_embeddings_embedding_idx;
   CREATE INDEX knowledge_embeddings_embedding_idx
   ON knowledge_embeddings
   USING hnsw (embedding vector_cosine_ops)
   WITH (m = 32, ef_construction = 128); -- higher values
   ```

## Future Enhancements

Potential improvements:

1. **Custom embeddings**: Support for local embedding models (Ollama, etc.)
2. **Chunking**: Split large entries into chunks for better granularity
3. **Reranking**: Add cross-encoder reranking for top results
4. **Caching**: Cache query embeddings for repeated searches
5. **Multi-modal**: Support image/file embeddings

## References

- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [pgvector Documentation](https://github.com/pgvector/pgvector)
- [HNSW Algorithm Paper](https://arxiv.org/abs/1603.09320)
- [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)