feat: add semantic search with pgvector (closes #68, #69, #70)

Issues resolved:
- #68: pgvector Setup
  * Added pgvector vector index migration for knowledge_embeddings
  * Vector index uses HNSW algorithm with cosine distance
  * Optimized for 1536-dimension OpenAI embeddings

- #69: Embedding Generation Pipeline
  * Created EmbeddingService with OpenAI integration
  * Automatic embedding generation on entry create/update
  * Batch processing endpoint for existing entries
  * Async generation to avoid blocking API responses
  * Content preparation with title weighting

- #70: Semantic Search API
  * POST /api/knowledge/search/semantic - pure vector search
  * POST /api/knowledge/search/hybrid - RRF combined search
  * POST /api/knowledge/embeddings/batch - batch generation
  * Comprehensive test coverage
  * Full documentation in docs/SEMANTIC_SEARCH.md

Technical details:
- Uses OpenAI text-embedding-3-small model (1536 dims)
- HNSW index for O(log n) similarity search
- Reciprocal Rank Fusion for hybrid search
- Graceful degradation when OpenAI not configured
- Async embedding generation for performance

Configuration:
- Added OPENAI_API_KEY to .env.example
- Optional feature - disabled if API key not set
- Falls back to keyword search in hybrid mode

Author: Jason Woltje
Date: 2026-01-30 00:24:41 -06:00
Parent: 22cd68811d
Commit: 3ec2059470
14 changed files with 1408 additions and 5 deletions

docs/SEMANTIC_SEARCH.md (new file, 346 lines)

# Semantic Search Implementation
This document describes the semantic search implementation for the Mosaic Stack Knowledge Module using OpenAI embeddings and PostgreSQL pgvector.
## Overview
The semantic search feature enables AI-powered similarity search across knowledge entries using vector embeddings. It complements the existing full-text search with semantic understanding, allowing users to find relevant content even when exact keywords don't match.
## Architecture
### Components
1. **EmbeddingService** - Generates and manages OpenAI embeddings
2. **SearchService** - Enhanced with semantic and hybrid search methods
3. **KnowledgeService** - Automatically generates embeddings on entry create/update
4. **pgvector** - PostgreSQL extension for vector similarity search
### Database Schema
#### Knowledge Embeddings Table
```prisma
model KnowledgeEmbedding {
  id        String         @id @default(uuid()) @db.Uuid
  entryId   String         @unique @map("entry_id") @db.Uuid
  entry     KnowledgeEntry @relation(fields: [entryId], references: [id], onDelete: Cascade)
  embedding Unsupported("vector(1536)")
  model     String
  createdAt DateTime       @default(now()) @map("created_at") @db.Timestamptz
  updatedAt DateTime       @updatedAt @map("updated_at") @db.Timestamptz

  @@index([entryId])
  @@map("knowledge_embeddings")
}
```
#### Vector Index
An HNSW (Hierarchical Navigable Small World) index is created for fast similarity search:
```sql
CREATE INDEX knowledge_embeddings_embedding_idx
ON knowledge_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
## Configuration
### Environment Variables
Add to your `.env` file:
```env
# Optional overall; required only for semantic search
OPENAI_API_KEY=sk-...
```
Get your API key from: https://platform.openai.com/api-keys
### OpenAI Model
The default embedding model is `text-embedding-3-small` (1536 dimensions). This provides:
- High quality embeddings
- Cost-effective pricing
- Fast generation speed
## API Endpoints
### 1. Semantic Search
**POST** `/api/knowledge/search/semantic`
Search using vector similarity only.
**Request:**
```json
{
  "query": "database performance optimization",
  "status": "PUBLISHED"
}
```
**Query Parameters:**
- `page` (optional): Page number (default: 1)
- `limit` (optional): Results per page (default: 20)
**Response:**
```json
{
  "data": [
    {
      "id": "uuid",
      "slug": "postgres-indexing",
      "title": "PostgreSQL Indexing Strategies",
      "content": "...",
      "rank": 0.87,
      "tags": [...],
      ...
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 15,
    "totalPages": 1
  },
  "query": "database performance optimization"
}
```
### 2. Hybrid Search (Recommended)
**POST** `/api/knowledge/search/hybrid`
Combines vector similarity and full-text search using Reciprocal Rank Fusion (RRF).
**Request:**
```json
{
  "query": "indexing strategies",
  "status": "PUBLISHED"
}
```
**Benefits of Hybrid Search:**
- Best of both worlds: semantic understanding + keyword matching
- Better ranking for exact matches
- Improved recall and precision
- Resilient to edge cases
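A minimal client sketch for calling this endpoint (the base URL, token handling, and helper name are illustrative, not part of the API):

```typescript
// Hypothetical helper that builds the fetch arguments for a hybrid search.
// Kept pure so the request shape is easy to test.
function hybridSearchRequest(
  baseUrl: string,
  token: string,
  query: string,
  status = "PUBLISHED"
) {
  return {
    url: `${baseUrl}/api/knowledge/search/hybrid`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ query, status }),
    },
  };
}

// Usage:
// const { url, init } = hybridSearchRequest("http://localhost:3001", token, "indexing strategies");
// const results = await fetch(url, init).then((r) => r.json());
```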
### 3. Batch Embedding Generation
**POST** `/api/knowledge/embeddings/batch`
Generate embeddings for all existing entries. Useful for:
- Initial setup after enabling semantic search
- Regenerating embeddings after model updates
**Request:**
```json
{
  "status": "PUBLISHED"
}
```
**Response:**
```json
{
  "message": "Generated 42 embeddings out of 45 entries",
  "total": 45,
  "success": 42
}
```
**Permissions:** Requires ADMIN role
## Automatic Embedding Generation
Embeddings are automatically generated when:
1. **Creating an entry** - Embedding generated asynchronously after creation
2. **Updating an entry** - Embedding regenerated if title or content changes
The generation happens asynchronously to avoid blocking API responses.
### Content Preparation
Before generating embeddings, content is prepared by:
1. Combining the title and content
2. Weighting the title more heavily (it appears twice)

This improves semantic matching on titles.
```typescript
prepareContentForEmbedding(title, content) {
  return `${title}\n\n${title}\n\n${content}`.trim();
}
```
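A quick sanity check of the weighting, using a standalone copy of the helper above:

```typescript
// Standalone copy of the preparation helper, for illustration.
function prepareContentForEmbedding(title: string, content: string): string {
  return `${title}\n\n${title}\n\n${content}`.trim();
}

// The title appears twice, so it carries more weight in the embedding:
// prepareContentForEmbedding("Indexing", "HNSW basics")
//   → "Indexing\n\nIndexing\n\nHNSW basics"
```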
## Search Algorithms
### Vector Similarity Search
Uses cosine distance to find semantically similar entries:
```sql
SELECT *
FROM knowledge_entries e
INNER JOIN knowledge_embeddings emb ON e.id = emb.entry_id
ORDER BY emb.embedding <=> query_embedding
LIMIT 20;
```
- `<=>` operator: cosine distance
- Lower distance = higher similarity
- Efficient with HNSW index
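To make the `<=>` semantics concrete, here is the same computation in plain TypeScript (for illustration only; pgvector computes this natively):

```typescript
// Cosine distance = 1 - cosine similarity; this is what pgvector's
// <=> operator returns for a pair of vectors.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Identical direction → 0, orthogonal → 1, opposite direction → 2.
```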
### Hybrid Search (RRF Algorithm)
Reciprocal Rank Fusion combines rankings from multiple sources:
```
RRF(d) = sum(1 / (k + rank_i))
```
Where:
- `d` = document
- `k` = constant (60 is standard)
- `rank_i` = rank from source i
**Example:** Suppose a document ranks in two searches:
- Vector search: rank 3
- Keyword search: rank 1

RRF score = 1/(60+3) + 1/(60+1) ≈ 0.0159 + 0.0164 = 0.0323

A higher RRF score means a better combined ranking.
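The fusion step can be sketched in a few lines of TypeScript (illustrative; the service-side implementation may differ):

```typescript
// Reciprocal Rank Fusion: sum 1/(k + rank_i) over each source's ranking.
// k = 60 is the conventional constant.
function rrfScore(ranks: number[], k = 60): number {
  return ranks.reduce((sum, rank) => sum + 1 / (k + rank), 0);
}

// Matching the worked example above:
// rrfScore([3, 1]) = 1/63 + 1/61 ≈ 0.0323
```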
## Performance Considerations
### Index Parameters
The HNSW index uses:
- `m = 16`: Max connections per layer (balances accuracy/memory)
- `ef_construction = 64`: Build quality (higher = more accurate, slower build)
### Query Performance
- **Typical query time:** 10-50ms (with index)
- **Without index:** 1000ms+ (not recommended)
- **Embedding generation:** 100-300ms per entry
### Cost (OpenAI API)
Using `text-embedding-3-small`:
- ~$0.00002 per 1000 tokens
- Average entry (~500 tokens): $0.00001
- 10,000 entries: ~$0.10
Very cost-effective for most use cases.
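The arithmetic above, as a small helper (the price is the published rate at the time of writing and may change):

```typescript
// Back-of-envelope embedding cost for text-embedding-3-small.
const PRICE_PER_1K_TOKENS_USD = 0.00002;

function embeddingCostUsd(entries: number, avgTokensPerEntry: number): number {
  return (entries * avgTokensPerEntry * PRICE_PER_1K_TOKENS_USD) / 1000;
}

// embeddingCostUsd(10_000, 500) → 0.1, i.e. ~$0.10 for 10,000 entries.
```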
## Migration Guide
### 1. Run Migrations
```bash
cd apps/api
pnpm prisma migrate deploy
```
This creates:
- `knowledge_embeddings` table
- Vector index on embeddings
### 2. Configure OpenAI API Key
```bash
# Add to .env
OPENAI_API_KEY=sk-...
```
### 3. Generate Embeddings for Existing Entries
```bash
curl -X POST http://localhost:3001/api/knowledge/embeddings/batch \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "PUBLISHED"}'
```
Or use the web UI (Admin dashboard → Knowledge → Generate Embeddings).
### 4. Test Semantic Search
```bash
curl -X POST http://localhost:3001/api/knowledge/search/hybrid \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "your search query"}'
```
## Troubleshooting
### "OpenAI API key not configured"
**Cause:** `OPENAI_API_KEY` environment variable not set
**Solution:** Add the API key to your `.env` file and restart the API server
### Semantic search returns no results
**Possible causes:**
1. **No embeddings generated**
- Run batch generation endpoint
- Check `knowledge_embeddings` table
2. **Query too specific**
- Try broader terms
- Use hybrid search for better recall
3. **Index not created**
- Check migration status
- Verify index exists: `\di knowledge_embeddings_embedding_idx` in psql
### Slow query performance
**Solutions:**
1. Verify index exists and is being used:
```sql
EXPLAIN ANALYZE
SELECT * FROM knowledge_embeddings
ORDER BY embedding <=> '[...]'::vector
LIMIT 20;
```
2. Adjust index parameters (requires recreation):
```sql
DROP INDEX knowledge_embeddings_embedding_idx;
CREATE INDEX knowledge_embeddings_embedding_idx
ON knowledge_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128); -- Higher values
```
## Future Enhancements
Potential improvements:
1. **Custom embeddings**: Support for local embedding models (Ollama, etc.)
2. **Chunking**: Split large entries into chunks for better granularity
3. **Reranking**: Add cross-encoder reranking for top results
4. **Caching**: Cache query embeddings for repeated searches
5. **Multi-modal**: Support image/file embeddings
## References
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [pgvector Documentation](https://github.com/pgvector/pgvector)
- [HNSW Algorithm Paper](https://arxiv.org/abs/1603.09320)
- [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)