feat: add semantic search with pgvector (closes #68, #69, #70)

Issues resolved:
- #68: pgvector Setup
  * Added pgvector vector index migration for knowledge_embeddings
  * Vector index uses HNSW algorithm with cosine distance
  * Optimized for 1536-dimension OpenAI embeddings

- #69: Embedding Generation Pipeline
  * Created EmbeddingService with OpenAI integration
  * Automatic embedding generation on entry create/update
  * Batch processing endpoint for existing entries
  * Async generation to avoid blocking API responses
  * Content preparation with title weighting

- #70: Semantic Search API
  * POST /api/knowledge/search/semantic - pure vector search
  * POST /api/knowledge/search/hybrid - RRF combined search
  * POST /api/knowledge/embeddings/batch - batch generation
  * Comprehensive test coverage
  * Full documentation in docs/SEMANTIC_SEARCH.md

Technical details:
- Uses OpenAI text-embedding-3-small model (1536 dims)
- HNSW index for O(log n) similarity search
- Reciprocal Rank Fusion for hybrid search
- Graceful degradation when OpenAI not configured
- Async embedding generation for performance

Configuration:
- Added OPENAI_API_KEY to .env.example
- Optional feature - disabled if API key not set
- Falls back to keyword search in hybrid mode

Author: Jason Woltje
Date: 2026-01-30 00:24:41 -06:00
Parent: 22cd68811d
Commit: 3ec2059470
14 changed files with 1408 additions and 5 deletions

docs/SEMANTIC_SEARCH.md (new file, 346 lines)

# Semantic Search Implementation
This document describes the semantic search implementation for the Mosaic Stack Knowledge Module using OpenAI embeddings and PostgreSQL pgvector.
## Overview
The semantic search feature enables AI-powered similarity search across knowledge entries using vector embeddings. It complements the existing full-text search with semantic understanding, allowing users to find relevant content even when exact keywords don't match.
## Architecture
### Components
1. **EmbeddingService** - Generates and manages OpenAI embeddings
2. **SearchService** - Enhanced with semantic and hybrid search methods
3. **KnowledgeService** - Automatically generates embeddings on entry create/update
4. **pgvector** - PostgreSQL extension for vector similarity search
### Database Schema
#### Knowledge Embeddings Table
```prisma
model KnowledgeEmbedding {
  id        String         @id @default(uuid()) @db.Uuid
  entryId   String         @unique @map("entry_id") @db.Uuid
  entry     KnowledgeEntry @relation(fields: [entryId], references: [id], onDelete: Cascade)
  embedding Unsupported("vector(1536)")
  model     String
  createdAt DateTime       @default(now()) @map("created_at") @db.Timestamptz
  updatedAt DateTime       @updatedAt @map("updated_at") @db.Timestamptz

  @@index([entryId])
  @@map("knowledge_embeddings")
}
```
#### Vector Index
An HNSW (Hierarchical Navigable Small World) index is created for fast similarity search:
```sql
CREATE INDEX knowledge_embeddings_embedding_idx
ON knowledge_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
## Configuration
### Environment Variables
Add to your `.env` file:
```env
# Optional overall; required only for semantic search
OPENAI_API_KEY=sk-...
```
Get your API key from: https://platform.openai.com/api-keys
### OpenAI Model
The default embedding model is `text-embedding-3-small` (1536 dimensions). This provides:
- High quality embeddings
- Cost-effective pricing
- Fast generation speed
## API Endpoints
### 1. Semantic Search
**POST** `/api/knowledge/search/semantic`
Search using vector similarity only.
**Request:**
```json
{
  "query": "database performance optimization",
  "status": "PUBLISHED"
}
```
**Query Parameters:**
- `page` (optional): Page number (default: 1)
- `limit` (optional): Results per page (default: 20)
**Response:**
```json
{
  "data": [
    {
      "id": "uuid",
      "slug": "postgres-indexing",
      "title": "PostgreSQL Indexing Strategies",
      "content": "...",
      "rank": 0.87,
      "tags": [...],
      ...
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 15,
    "totalPages": 1
  },
  "query": "database performance optimization"
}
```
### 2. Hybrid Search (Recommended)
**POST** `/api/knowledge/search/hybrid`
Combines vector similarity and full-text search using Reciprocal Rank Fusion (RRF).
**Request:**
```json
{
  "query": "indexing strategies",
  "status": "PUBLISHED"
}
```
**Benefits of Hybrid Search:**
- Best of both worlds: semantic understanding + keyword matching
- Better ranking for exact matches
- Improved recall and precision
- Resilient to edge cases
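A minimal client sketch for calling this endpoint (the base URL, token handling, and helper name are illustrative, not part of the API):

```typescript
// Hypothetical helper that builds the fetch arguments for a hybrid search.
// Kept pure so the request shape is easy to test.
function hybridSearchRequest(
  baseUrl: string,
  token: string,
  query: string,
  status = "PUBLISHED"
) {
  return {
    url: `${baseUrl}/api/knowledge/search/hybrid`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ query, status }),
    },
  };
}

// Usage:
// const { url, init } = hybridSearchRequest("http://localhost:3001", token, "indexing strategies");
// const results = await fetch(url, init).then((r) => r.json());
```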
### 3. Batch Embedding Generation
**POST** `/api/knowledge/embeddings/batch`
Generate embeddings for all existing entries. Useful for:
- Initial setup after enabling semantic search
- Regenerating embeddings after model updates
**Request:**
```json
{
  "status": "PUBLISHED"
}
```
**Response:**
```json
{
  "message": "Generated 42 embeddings out of 45 entries",
  "total": 45,
  "success": 42
}
```
**Permissions:** Requires ADMIN role
## Automatic Embedding Generation
Embeddings are automatically generated when:
1. **Creating an entry** - Embedding generated asynchronously after creation
2. **Updating an entry** - Embedding regenerated if title or content changes
The generation happens asynchronously to avoid blocking API responses.
### Content Preparation
Before generating embeddings, content is prepared by:
1. Combining the title and content
2. Weighting the title more heavily (it appears twice)

This improves semantic matching on titles.
```typescript
prepareContentForEmbedding(title, content) {
  return `${title}\n\n${title}\n\n${content}`.trim();
}
```
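A quick sanity check of the weighting, using a standalone copy of the helper above:

```typescript
// Standalone copy of the preparation helper, for illustration.
function prepareContentForEmbedding(title: string, content: string): string {
  return `${title}\n\n${title}\n\n${content}`.trim();
}

// The title appears twice, so it carries more weight in the embedding:
// prepareContentForEmbedding("Indexing", "HNSW basics")
//   → "Indexing\n\nIndexing\n\nHNSW basics"
```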
## Search Algorithms
### Vector Similarity Search
Uses cosine distance to find semantically similar entries:
```sql
SELECT *
FROM knowledge_entries e
INNER JOIN knowledge_embeddings emb ON e.id = emb.entry_id
ORDER BY emb.embedding <=> query_embedding
LIMIT 20;
```
- `<=>` operator: cosine distance
- Lower distance = higher similarity
- Efficient with HNSW index
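To make the `<=>` semantics concrete, here is the same computation in plain TypeScript (for illustration only; pgvector computes this natively):

```typescript
// Cosine distance = 1 - cosine similarity; this is what pgvector's
// <=> operator returns for a pair of vectors.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Identical direction → 0, orthogonal → 1, opposite direction → 2.
```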
### Hybrid Search (RRF Algorithm)
Reciprocal Rank Fusion combines rankings from multiple sources:
```
RRF(d) = sum(1 / (k + rank_i))
```
Where:
- `d` = document
- `k` = constant (60 is standard)
- `rank_i` = rank from source i
**Example:** Suppose a document ranks in two searches:
- Vector search: rank 3
- Keyword search: rank 1

RRF score = 1/(60+3) + 1/(60+1) ≈ 0.0159 + 0.0164 = 0.0323

A higher RRF score means a better combined ranking.
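The fusion step can be sketched in a few lines of TypeScript (illustrative; the service-side implementation may differ):

```typescript
// Reciprocal Rank Fusion: sum 1/(k + rank_i) over each source's ranking.
// k = 60 is the conventional constant.
function rrfScore(ranks: number[], k = 60): number {
  return ranks.reduce((sum, rank) => sum + 1 / (k + rank), 0);
}

// Matching the worked example above:
// rrfScore([3, 1]) = 1/63 + 1/61 ≈ 0.0323
```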
## Performance Considerations
### Index Parameters
The HNSW index uses:
- `m = 16`: Max connections per layer (balances accuracy/memory)
- `ef_construction = 64`: Build quality (higher = more accurate, slower build)
### Query Performance
- **Typical query time:** 10-50ms (with index)
- **Without index:** 1000ms+ (not recommended)
- **Embedding generation:** 100-300ms per entry
### Cost (OpenAI API)
Using `text-embedding-3-small`:
- ~$0.00002 per 1000 tokens
- Average entry (~500 tokens): $0.00001
- 10,000 entries: ~$0.10
Very cost-effective for most use cases.
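The arithmetic above, as a small helper (the price is the published rate at the time of writing and may change):

```typescript
// Back-of-envelope embedding cost for text-embedding-3-small.
const PRICE_PER_1K_TOKENS_USD = 0.00002;

function embeddingCostUsd(entries: number, avgTokensPerEntry: number): number {
  return (entries * avgTokensPerEntry * PRICE_PER_1K_TOKENS_USD) / 1000;
}

// embeddingCostUsd(10_000, 500) → 0.1, i.e. ~$0.10 for 10,000 entries.
```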
## Migration Guide
### 1. Run Migrations
```bash
cd apps/api
pnpm prisma migrate deploy
```
This creates:
- `knowledge_embeddings` table
- Vector index on embeddings
### 2. Configure OpenAI API Key
```bash
# Add to .env
OPENAI_API_KEY=sk-...
```
### 3. Generate Embeddings for Existing Entries
```bash
curl -X POST http://localhost:3001/api/knowledge/embeddings/batch \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "PUBLISHED"}'
```
Or use the web UI (Admin dashboard → Knowledge → Generate Embeddings).
### 4. Test Semantic Search
```bash
curl -X POST http://localhost:3001/api/knowledge/search/hybrid \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "your search query"}'
```
## Troubleshooting
### "OpenAI API key not configured"
**Cause:** `OPENAI_API_KEY` environment variable not set
**Solution:** Add the API key to your `.env` file and restart the API server
### Semantic search returns no results
**Possible causes:**
1. **No embeddings generated**
- Run batch generation endpoint
- Check `knowledge_embeddings` table
2. **Query too specific**
- Try broader terms
- Use hybrid search for better recall
3. **Index not created**
- Check migration status
- Verify index exists: `\di knowledge_embeddings_embedding_idx` in psql
### Slow query performance
**Solutions:**
1. Verify index exists and is being used:
```sql
EXPLAIN ANALYZE
SELECT * FROM knowledge_embeddings
ORDER BY embedding <=> '[...]'::vector
LIMIT 20;
```
2. Adjust index parameters (requires recreation):
```sql
DROP INDEX knowledge_embeddings_embedding_idx;
CREATE INDEX knowledge_embeddings_embedding_idx
ON knowledge_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128); -- Higher values
```
## Future Enhancements
Potential improvements:
1. **Custom embeddings**: Support for local embedding models (Ollama, etc.)
2. **Chunking**: Split large entries into chunks for better granularity
3. **Reranking**: Add cross-encoder reranking for top results
4. **Caching**: Cache query embeddings for repeated searches
5. **Multi-modal**: Support image/file embeddings
## References
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [pgvector Documentation](https://github.com/pgvector/pgvector)
- [HNSW Algorithm Paper](https://arxiv.org/abs/1603.09320)
- [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)