# Database Schema & Vector Storage
## Overview
The YouTube KB uses **Supabase with pgvector** for storing video metadata and embeddings. This enables hybrid search combining semantic similarity (vectors) with keyword matching (full-text search).
---
## Schema Design
### Entity Relationship Diagram
```
┌──────────────────────┐          ┌──────────────────────────────────────┐
│ videos               │          │ chunks                               │
├──────────────────────┤          ├──────────────────────────────────────┤
│ id (uuid) PK         │          │ id (uuid) PK                         │
│ video_id (text) UK   │◄─────────│ video_id (text) FK                   │
│ title (text)         │   1:N    │ chunk_index (int)                    │
│ channel_name (text)  │          │ text (text)                          │
│ channel_id (text)    │          │ embedding (vector[1536])             │
│ domain (text)        │          │ domain (text)                        │
│ published_at (ts)    │          │ fts (tsvector) GENERATED             │
│ view_count (bigint)  │          │ created_at (timestamptz)             │
│ duration_seconds     │          └──────────────────────────────────────┘
│ url (text) GENERATED │
│ created_at (ts)      │
│ updated_at (ts)      │
└──────────────────────┘
```
---
## Complete Schema SQL
```sql
-- =============================================================================
-- YouTube Knowledge Base - Supabase Schema
-- =============================================================================
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- For fuzzy text matching
-- =============================================================================
-- VIDEOS TABLE
-- =============================================================================
-- Stores metadata about each YouTube video
CREATE TABLE videos (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- YouTube identifiers
video_id TEXT UNIQUE NOT NULL,
channel_id TEXT,
-- Metadata
title TEXT NOT NULL,
channel_name TEXT NOT NULL,
domain TEXT NOT NULL,
-- Video details
published_at TIMESTAMPTZ,
view_count BIGINT DEFAULT 0,
duration_seconds INTEGER,
-- Generated column for URL
url TEXT GENERATED ALWAYS AS (
'https://youtube.com/watch?v=' || video_id
) STORED,
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes for videos
CREATE INDEX idx_videos_domain ON videos(domain);
CREATE INDEX idx_videos_channel ON videos(channel_name);
CREATE INDEX idx_videos_published ON videos(published_at DESC);
CREATE INDEX idx_videos_views ON videos(view_count DESC);
-- =============================================================================
-- CHUNKS TABLE
-- =============================================================================
-- Stores text chunks with embeddings for semantic search
CREATE TABLE chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Foreign key to videos
video_id TEXT NOT NULL REFERENCES videos(video_id) ON DELETE CASCADE,
-- Chunk identification
chunk_index INTEGER NOT NULL,
-- Content
text TEXT NOT NULL,
-- Vector embedding (OpenAI text-embedding-3-small = 1536 dimensions)
embedding vector(1536),
-- Metadata (denormalized for query performance)
domain TEXT NOT NULL,
-- Full-text search column (auto-generated)
fts tsvector GENERATED ALWAYS AS (
to_tsvector('english', text)
) STORED,
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
-- Ensure unique chunks per video
UNIQUE(video_id, chunk_index)
);
-- =============================================================================
-- INDEXES
-- =============================================================================
-- HNSW index for fast approximate nearest neighbor search
-- m = 16: Number of connections per layer (default, good balance)
-- ef_construction = 64: Build quality (higher = better but slower)
CREATE INDEX idx_chunks_embedding ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Full-text search index using GIN
CREATE INDEX idx_chunks_fts ON chunks USING gin(fts);
-- Domain filtering index
CREATE INDEX idx_chunks_domain ON chunks(domain);
-- Video lookup index
CREATE INDEX idx_chunks_video ON chunks(video_id);
-- =============================================================================
-- FUNCTIONS
-- =============================================================================
-- Function: Semantic search with optional domain filter
CREATE OR REPLACE FUNCTION search_chunks(
query_embedding vector(1536),
match_domain TEXT DEFAULT NULL,
match_count INTEGER DEFAULT 5
)
RETURNS TABLE (
id UUID,
video_id TEXT,
chunk_index INTEGER,
text TEXT,
domain TEXT,
similarity FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
c.id,
c.video_id,
c.chunk_index,
c.text,
c.domain,
1 - (c.embedding <=> query_embedding) AS similarity
FROM chunks c
WHERE (match_domain IS NULL OR c.domain = match_domain)
ORDER BY c.embedding <=> query_embedding
LIMIT match_count;
END;
$$;
-- Function: Hybrid search (semantic + keyword)
CREATE OR REPLACE FUNCTION hybrid_search(
query_embedding vector(1536),
query_text TEXT,
match_domain TEXT DEFAULT NULL,
match_count INTEGER DEFAULT 5,
semantic_weight FLOAT DEFAULT 0.7,
keyword_weight FLOAT DEFAULT 0.3
)
RETURNS TABLE (
id UUID,
video_id TEXT,
chunk_index INTEGER,
text TEXT,
domain TEXT,
semantic_score FLOAT,
keyword_score FLOAT,
combined_score FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
WITH semantic AS (
SELECT
c.id,
c.video_id,
c.chunk_index,
c.text,
c.domain,
1 - (c.embedding <=> query_embedding) AS score
FROM chunks c
WHERE (match_domain IS NULL OR c.domain = match_domain)
ORDER BY c.embedding <=> query_embedding
LIMIT match_count * 3 -- Over-fetch for reranking
),
keyword AS (
SELECT
c.id,
ts_rank(c.fts, websearch_to_tsquery('english', query_text))::float AS score -- ts_rank() returns real; cast to match the declared FLOAT column
FROM chunks c
WHERE c.fts @@ websearch_to_tsquery('english', query_text)
AND (match_domain IS NULL OR c.domain = match_domain)
)
SELECT
s.id,
s.video_id,
s.chunk_index,
s.text,
s.domain,
s.score AS semantic_score,
COALESCE(k.score, 0.0) AS keyword_score,
(s.score * semantic_weight + COALESCE(k.score, 0.0) * keyword_weight) AS combined_score
FROM semantic s
LEFT JOIN keyword k ON s.id = k.id
ORDER BY combined_score DESC
LIMIT match_count;
END;
$$;
-- Function: Get domain statistics
CREATE OR REPLACE FUNCTION get_domain_stats()
RETURNS TABLE (
domain TEXT,
video_count BIGINT,
chunk_count BIGINT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
c.domain,
COUNT(DISTINCT c.video_id) AS video_count,
COUNT(*) AS chunk_count
FROM chunks c
GROUP BY c.domain
ORDER BY chunk_count DESC;
END;
$$;
-- Function: Get overall statistics
CREATE OR REPLACE FUNCTION get_stats()
RETURNS TABLE (
total_videos BIGINT,
total_chunks BIGINT,
total_domains BIGINT,
domains TEXT[]
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
(SELECT COUNT(*) FROM videos),
(SELECT COUNT(*) FROM chunks),
(SELECT COUNT(DISTINCT domain) FROM chunks),
(SELECT ARRAY_AGG(DISTINCT domain) FROM chunks);
END;
$$;
-- =============================================================================
-- ROW LEVEL SECURITY (Public Read-Only Access)
-- =============================================================================
-- RLS is enabled with SELECT-only policies: the anon key can read, while
-- writes must go through the service role key (which bypasses RLS)
ALTER TABLE videos ENABLE ROW LEVEL SECURITY;
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
-- Allow public read access
CREATE POLICY "Allow public read access to videos" ON videos
FOR SELECT USING (true);
CREATE POLICY "Allow public read access to chunks" ON chunks
FOR SELECT USING (true);
-- =============================================================================
-- TRIGGERS
-- =============================================================================
-- Update timestamp trigger
CREATE OR REPLACE FUNCTION update_updated_at()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_videos_updated_at
BEFORE UPDATE ON videos
FOR EACH ROW
EXECUTE FUNCTION update_updated_at();
```
---
## Index Strategy
### Vector Index (HNSW)
```sql
CREATE INDEX idx_chunks_embedding ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
**Parameters**:
- `m = 16`: Connections per layer (higher = better recall, more memory)
- `ef_construction = 64`: Build quality (higher = better index, slower build)
**Why HNSW over IVFFlat?**
- Better recall at comparable query speed
- No training step: IVFFlat must build its cluster lists from data already in the table
- Works well at this scale (< 1M vectors) without tuning a list count
### Full-Text Search Index (GIN)
```sql
CREATE INDEX idx_chunks_fts ON chunks USING gin(fts);
```
**Why GIN?**
- Optimized for full-text search
- Handles tsvector columns efficiently
- Good for "contains" queries
---
## Query Patterns
### Pattern 1: Pure Semantic Search
```sql
SELECT
c.id,
c.text,
c.video_id,
v.title,
v.channel_name,
v.url,
1 - (c.embedding <=> $1) AS similarity
FROM chunks c
JOIN videos v ON c.video_id = v.video_id
WHERE c.domain = $2 -- Optional domain filter
ORDER BY c.embedding <=> $1
LIMIT $3;
```
**Use When**: User query is conceptual, searching for meaning.
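The `<=>` operator is pgvector's cosine-distance operator, so ordering by it ascending is equivalent to ordering by similarity descending. A minimal pure-Python sketch of the arithmetic (the `cosine_distance` helper is illustrative, not a pgvector API):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing the same way: distance 0, so similarity = 1 - 0 = 1
aligned = cosine_distance([1.0, 0.0], [2.0, 0.0])
# Orthogonal vectors: distance 1, similarity 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
```

This is why the query selects `1 - (c.embedding <=> $1) AS similarity` while the `ORDER BY` uses the raw distance ascending.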
### Pattern 2: Hybrid Search
```sql
SELECT * FROM hybrid_search(
$1::vector(1536), -- Query embedding
$2::text, -- Query text
$3::text, -- Domain filter (optional)
$4::integer, -- Max results
0.7, -- Semantic weight
0.3 -- Keyword weight
);
```
**Use When**: Query contains specific terms that should match exactly.
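The weighting inside `hybrid_search` reduces to one line of arithmetic; a sketch mirroring its `combined_score` expression (the helper is ours, for illustration):

```python
def combined_score(semantic: float, keyword: float,
                   semantic_weight: float = 0.7,
                   keyword_weight: float = 0.3) -> float:
    """Mirrors the combined_score expression in hybrid_search()."""
    return semantic * semantic_weight + keyword * keyword_weight

# A chunk with a strong semantic match but no keyword hit still ranks:
no_keyword = combined_score(0.8, 0.0)    # 0.8 * 0.7 ≈ 0.56
# A keyword hit adds on top:
with_keyword = combined_score(0.8, 0.5)  # 0.56 + 0.15 ≈ 0.71
```

One caveat: `ts_rank` scores are not normalized to [0, 1] by default the way cosine similarity is, so the 0.7/0.3 split is a practical heuristic rather than a principled normalization.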
### Pattern 3: Domain Listing
```sql
SELECT * FROM get_domain_stats();
```
### Pattern 4: Video Listing
```sql
SELECT
video_id,
title,
channel_name,
domain,
url,
view_count,
published_at
FROM videos
WHERE domain = $1 -- Optional
ORDER BY view_count DESC
LIMIT $2;
```
---
## Data Types
### Embedding Vector
```typescript
// TypeScript representation
type Embedding = number[]; // Array of 1536 floats
// Supabase insert
await supabase.from('chunks').insert({
video_id: 'abc123',
chunk_index: 0,
text: 'This is the chunk text...',
embedding: [0.123, -0.456, ...], // 1536 dimensions
domain: 'growth'
});
```
### Domain Enum (Soft)
```typescript
// Recommended domains (not enforced at DB level for flexibility)
type Domain =
| 'growth'
| 'strategy'
| 'engineering'
| 'product'
| 'startup'
| 'sales'
| 'design'
| 'ai-tools'
| 'ai-builder'
| 'research';
```
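Because the vocabulary is enforced only by convention, ingestion code can normalize and sanity-check domains at the application layer. A sketch (the helper name and warning behavior are hypothetical, not part of any library):

```python
RECOMMENDED_DOMAINS = {
    "growth", "strategy", "engineering", "product", "startup",
    "sales", "design", "ai-tools", "ai-builder", "research",
}

def normalize_domain(domain: str) -> str:
    """Lowercase and trim, flagging values outside the recommended set."""
    cleaned = domain.strip().lower()
    if cleaned not in RECOMMENDED_DOMAINS:
        # Soft enum: warn rather than reject, keeping the schema flexible
        print(f"warning: unrecognized domain {cleaned!r}; inserting anyway")
    return cleaned
```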
---
## Migration from LanceDB
### Current LanceDB Structure
```python
# Current schema in LanceDB
{
"id": str,
"video_id": str,
"chunk_index": int,
"text": str,
"embedding": list[float],
"video_title": str,
"channel_name": str,
"domain": str,
"published_at": str,
"view_count": int
}
```
### Migration Script Outline
```python
# See implementation/MIGRATION.md for full script
async def migrate_to_supabase():
    # 1. Read all data from LanceDB
    lance_db = lancedb.connect("knowledge/lancedb")
    chunks_table = lance_db.open_table("chunks")
    # 2. Extract unique videos
    videos = extract_unique_videos(chunks_table)
    # 3. Insert videos into Supabase (`from` is a Python keyword, so the
    #    Python client exposes .table() instead of .from())
    await supabase.table("videos").upsert(videos).execute()
    # 4. Insert chunks in batches of 100 rows
    for batch in batched(chunks_table.to_pandas().to_dict("records"), 100):
        await supabase.table("chunks").insert(batch).execute()
```
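The outline relies on a `batched` helper that isn't defined there; one possible implementation over any iterable of rows:

```python
from itertools import islice

def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch
```

Note that iterating a pandas DataFrame directly yields column names, not rows, so convert first with `.to_dict("records")` before batching.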
---
## Performance Considerations
### Query Performance
| Query Type | Expected Latency | Index Used |
|------------|------------------|------------|
| Semantic search (5 results) | ~50ms | HNSW |
| Hybrid search (5 results) | ~100ms | HNSW + GIN |
| Domain listing | ~10ms | idx_chunks_domain |
| Video listing | ~20ms | idx_videos_domain |
### Scaling Thresholds
| Metric | Current | Supabase Free | Supabase Pro |
|--------|---------|---------------|--------------|
| DB Size | ~100MB | 500MB | 8GB |
| Vectors | ~50K | ~85K | ~1.4M |
| Connections | 10 | 50 | 200 |
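The vector-capacity figures above are rough estimates; a back-of-envelope bound from the raw embedding payload alone (1536 float4 components, ignoring chunk text, per-row overhead, and index size):

```python
DIMS = 1536
BYTES_PER_DIM = 4  # pgvector stores each component as a 4-byte float

bytes_per_vector = DIMS * BYTES_PER_DIM  # 6144 bytes ≈ 6 KB per embedding

def vectors_per_budget(budget_mb: int) -> int:
    """Upper bound on embeddings fitting in a storage budget (raw payload only)."""
    return (budget_mb * 1024 * 1024) // bytes_per_vector
```

Since text, indexes, and row overhead add on top, real capacity is lower than this bound.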
### Optimization Techniques
1. **Connection Pooling**: Use Supabase connection pooler
2. **Query Caching**: Cache common searches (future)
3. **Partial Indexes**: Index only recent content if needed
4. **Materialized Views**: Pre-compute domain stats
---
## Backup & Recovery
### Automatic Backups
Supabase provides:
- **Daily backups**: 7-day retention (free), 30-day (pro)
- **Point-in-time recovery**: Pro tier only
- **Logical backups**: pg_dump export
### Manual Export
```bash
# Export via Supabase CLI
supabase db dump -f backup.sql
# Export specific tables
pg_dump --table=videos --table=chunks > youtube_kb_backup.sql
```
### Recovery Procedure
```bash
# Restore from backup
psql $DATABASE_URL < backup.sql
# Or use Supabase dashboard restore feature
```
---
## Supabase Project Setup
### 1. Create Project
```bash
# Via CLI
supabase projects create youtube-kb --org-id YOUR_ORG
# Or via dashboard: https://supabase.com/dashboard
```
### 2. Apply Schema
```bash
# Save the "Complete Schema SQL" block above (the SQL only, not this markdown
# file) as supabase/migrations/001_initial_schema.sql
# Apply migrations
supabase db push
```
### 3. Get Connection Details
```bash
# Get API URL and keys
supabase status
# Or from dashboard: Project Settings > API
```
### 4. Test Connection
```typescript
import { createClient } from '@supabase/supabase-js';
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_ANON_KEY!
);
// Test query
const { data, error } = await supabase.rpc('get_stats');
console.log(data);
```