# Database Schema & Vector Storage
## Overview
The YouTube KB uses **Supabase with pgvector** for storing video metadata and embeddings. This enables hybrid search combining semantic similarity (vectors) with keyword matching (full-text search).
---
## Schema Design
### Entity Relationship Diagram
```
┌──────────────────────┐          ┌──────────────────────────────────────┐
│ videos               │          │ chunks                               │
├──────────────────────┤          ├──────────────────────────────────────┤
│ id (uuid) PK         │          │ id (uuid) PK                         │
│ video_id (text) UK   │◄─────────│ video_id (text) FK                   │
│ title (text)         │   1:N    │ chunk_index (int)                    │
│ channel_name (text)  │          │ text (text)                          │
│ channel_id (text)    │          │ embedding (vector[1536])             │
│ domain (text)        │          │ domain (text)                        │
│ published_at (ts)    │          │ fts (tsvector) GENERATED             │
│ view_count (bigint)  │          │ created_at (timestamptz)             │
│ duration_seconds     │          └──────────────────────────────────────┘
│ url (text) GENERATED │
│ created_at (ts)      │
│ updated_at (ts)      │
└──────────────────────┘
```
---
## Complete Schema SQL
```sql
-- =============================================================================
-- YouTube Knowledge Base - Supabase Schema
-- =============================================================================
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- For fuzzy text matching
-- =============================================================================
-- VIDEOS TABLE
-- =============================================================================
-- Stores metadata about each YouTube video
CREATE TABLE videos (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- YouTube identifiers
video_id TEXT UNIQUE NOT NULL,
channel_id TEXT,
-- Metadata
title TEXT NOT NULL,
channel_name TEXT NOT NULL,
domain TEXT NOT NULL,
-- Video details
published_at TIMESTAMPTZ,
view_count BIGINT DEFAULT 0,
duration_seconds INTEGER,
-- Generated column for URL
url TEXT GENERATED ALWAYS AS (
'https://youtube.com/watch?v=' || video_id
) STORED,
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes for videos
CREATE INDEX idx_videos_domain ON videos(domain);
CREATE INDEX idx_videos_channel ON videos(channel_name);
CREATE INDEX idx_videos_published ON videos(published_at DESC);
CREATE INDEX idx_videos_views ON videos(view_count DESC);
-- =============================================================================
-- CHUNKS TABLE
-- =============================================================================
-- Stores text chunks with embeddings for semantic search
CREATE TABLE chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Foreign key to videos
video_id TEXT NOT NULL REFERENCES videos(video_id) ON DELETE CASCADE,
-- Chunk identification
chunk_index INTEGER NOT NULL,
-- Content
text TEXT NOT NULL,
-- Vector embedding (OpenAI text-embedding-3-small = 1536 dimensions)
embedding vector(1536),
-- Metadata (denormalized for query performance)
domain TEXT NOT NULL,
-- Full-text search column (auto-generated)
fts tsvector GENERATED ALWAYS AS (
to_tsvector('english', text)
) STORED,
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
-- Ensure unique chunks per video
UNIQUE(video_id, chunk_index)
);
-- =============================================================================
-- INDEXES
-- =============================================================================
-- HNSW index for fast approximate nearest neighbor search
-- m = 16: Number of connections per layer (default, good balance)
-- ef_construction = 64: Build quality (higher = better but slower)
CREATE INDEX idx_chunks_embedding ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Full-text search index using GIN
CREATE INDEX idx_chunks_fts ON chunks USING gin(fts);
-- Domain filtering index
CREATE INDEX idx_chunks_domain ON chunks(domain);
-- Video lookup index
CREATE INDEX idx_chunks_video ON chunks(video_id);
-- =============================================================================
-- FUNCTIONS
-- =============================================================================
-- Function: Semantic search with optional domain filter
CREATE OR REPLACE FUNCTION search_chunks(
query_embedding vector(1536),
match_domain TEXT DEFAULT NULL,
match_count INTEGER DEFAULT 5
)
RETURNS TABLE (
id UUID,
video_id TEXT,
chunk_index INTEGER,
text TEXT,
domain TEXT,
similarity FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
c.id,
c.video_id,
c.chunk_index,
c.text,
c.domain,
1 - (c.embedding <=> query_embedding) AS similarity
FROM chunks c
WHERE (match_domain IS NULL OR c.domain = match_domain)
ORDER BY c.embedding <=> query_embedding
LIMIT match_count;
END;
$$;
-- Function: Hybrid search (semantic + keyword)
CREATE OR REPLACE FUNCTION hybrid_search(
query_embedding vector(1536),
query_text TEXT,
match_domain TEXT DEFAULT NULL,
match_count INTEGER DEFAULT 5,
semantic_weight FLOAT DEFAULT 0.7,
keyword_weight FLOAT DEFAULT 0.3
)
RETURNS TABLE (
id UUID,
video_id TEXT,
chunk_index INTEGER,
text TEXT,
domain TEXT,
semantic_score FLOAT,
keyword_score FLOAT,
combined_score FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
WITH semantic AS (
SELECT
c.id,
c.video_id,
c.chunk_index,
c.text,
c.domain,
1 - (c.embedding <=> query_embedding) AS score
FROM chunks c
WHERE (match_domain IS NULL OR c.domain = match_domain)
ORDER BY c.embedding <=> query_embedding
LIMIT match_count * 3 -- Over-fetch for reranking
),
keyword AS (
SELECT
c.id,
ts_rank(c.fts, websearch_to_tsquery('english', query_text))::float AS score -- ts_rank() returns real; cast to match the declared FLOAT column
FROM chunks c
WHERE c.fts @@ websearch_to_tsquery('english', query_text)
AND (match_domain IS NULL OR c.domain = match_domain)
)
SELECT
s.id,
s.video_id,
s.chunk_index,
s.text,
s.domain,
s.score AS semantic_score,
COALESCE(k.score, 0.0) AS keyword_score,
(s.score * semantic_weight + COALESCE(k.score, 0.0) * keyword_weight) AS combined_score
FROM semantic s
LEFT JOIN keyword k ON s.id = k.id
ORDER BY combined_score DESC
LIMIT match_count;
END;
$$;
-- Function: Get domain statistics
CREATE OR REPLACE FUNCTION get_domain_stats()
RETURNS TABLE (
domain TEXT,
video_count BIGINT,
chunk_count BIGINT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
c.domain,
COUNT(DISTINCT c.video_id) AS video_count,
COUNT(*) AS chunk_count
FROM chunks c
GROUP BY c.domain
ORDER BY chunk_count DESC;
END;
$$;
-- Function: Get overall statistics
CREATE OR REPLACE FUNCTION get_stats()
RETURNS TABLE (
total_videos BIGINT,
total_chunks BIGINT,
total_domains BIGINT,
domains TEXT[]
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
(SELECT COUNT(*) FROM videos),
(SELECT COUNT(*) FROM chunks),
(SELECT COUNT(DISTINCT domain) FROM chunks),
(SELECT ARRAY_AGG(DISTINCT domain) FROM chunks);
END;
$$;
-- =============================================================================
-- ROW LEVEL SECURITY (Public Read-Only Access)
-- =============================================================================
-- RLS is enabled with SELECT-only policies: the anon key can read, while
-- writes must go through the service role key (which bypasses RLS)
ALTER TABLE videos ENABLE ROW LEVEL SECURITY;
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
-- Allow public read access
CREATE POLICY "Allow public read access to videos" ON videos
FOR SELECT USING (true);
CREATE POLICY "Allow public read access to chunks" ON chunks
FOR SELECT USING (true);
-- =============================================================================
-- TRIGGERS
-- =============================================================================
-- Update timestamp trigger
CREATE OR REPLACE FUNCTION update_updated_at()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_videos_updated_at
BEFORE UPDATE ON videos
FOR EACH ROW
EXECUTE FUNCTION update_updated_at();
```
---
## Index Strategy
### Vector Index (HNSW)
```sql
CREATE INDEX idx_chunks_embedding ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
**Parameters**:
- `m = 16`: Connections per layer (higher = better recall, more memory)
- `ef_construction = 64`: Build quality (higher = better index, slower build)
**Why HNSW over IVFFlat?**
- Better recall at comparable query speed
- No training step: IVFFlat must build its cluster lists from data already in the table
- Works well at this scale (< 1M vectors) without tuning a list count
### Full-Text Search Index (GIN)
```sql
CREATE INDEX idx_chunks_fts ON chunks USING gin(fts);
```
**Why GIN?**
- Optimized for full-text search
- Handles tsvector columns efficiently
- Good for "contains" queries
---
## Query Patterns
### Pattern 1: Pure Semantic Search
```sql
SELECT
c.id,
c.text,
c.video_id,
v.title,
v.channel_name,
v.url,
1 - (c.embedding <=> $1) AS similarity
FROM chunks c
JOIN videos v ON c.video_id = v.video_id
WHERE c.domain = $2 -- Optional domain filter
ORDER BY c.embedding <=> $1
LIMIT $3;
```
**Use When**: User query is conceptual, searching for meaning.
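The `<=>` operator is pgvector's cosine-distance operator, so ordering by it ascending is equivalent to ordering by similarity descending. A minimal pure-Python sketch of the arithmetic (the `cosine_distance` helper is illustrative, not a pgvector API):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing the same way: distance 0, so similarity = 1 - 0 = 1
aligned = cosine_distance([1.0, 0.0], [2.0, 0.0])
# Orthogonal vectors: distance 1, similarity 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
```

This is why the query selects `1 - (c.embedding <=> $1) AS similarity` while the `ORDER BY` uses the raw distance ascending.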
### Pattern 2: Hybrid Search
```sql
SELECT * FROM hybrid_search(
$1::vector(1536), -- Query embedding
$2::text, -- Query text
$3::text, -- Domain filter (optional)
$4::integer, -- Max results
0.7, -- Semantic weight
0.3 -- Keyword weight
);
```
**Use When**: Query contains specific terms that should match exactly.
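The weighting inside `hybrid_search` reduces to one line of arithmetic; a sketch mirroring its `combined_score` expression (the helper is ours, for illustration):

```python
def combined_score(semantic: float, keyword: float,
                   semantic_weight: float = 0.7,
                   keyword_weight: float = 0.3) -> float:
    """Mirrors the combined_score expression in hybrid_search()."""
    return semantic * semantic_weight + keyword * keyword_weight

# A chunk with a strong semantic match but no keyword hit still ranks:
no_keyword = combined_score(0.8, 0.0)    # 0.8 * 0.7 ≈ 0.56
# A keyword hit adds on top:
with_keyword = combined_score(0.8, 0.5)  # 0.56 + 0.15 ≈ 0.71
```

One caveat: `ts_rank` scores are not normalized to [0, 1] by default the way cosine similarity is, so the 0.7/0.3 split is a practical heuristic rather than a principled normalization.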
### Pattern 3: Domain Listing
```sql
SELECT * FROM get_domain_stats();
```
### Pattern 4: Video Listing
```sql
SELECT
video_id,
title,
channel_name,
domain,
url,
view_count,
published_at
FROM videos
WHERE domain = $1 -- Optional
ORDER BY view_count DESC
LIMIT $2;
```
---
## Data Types
### Embedding Vector
```typescript
// TypeScript representation
type Embedding = number[]; // Array of 1536 floats
// Supabase insert
await supabase.from('chunks').insert({
video_id: 'abc123',
chunk_index: 0,
text: 'This is the chunk text...',
embedding: [0.123, -0.456, ...], // 1536 dimensions
domain: 'growth'
});
```
### Domain Enum (Soft)
```typescript
// Recommended domains (not enforced at DB level for flexibility)
type Domain =
| 'growth'
| 'strategy'
| 'engineering'
| 'product'
| 'startup'
| 'sales'
| 'design'
| 'ai-tools'
| 'ai-builder'
| 'research';
```
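Because the vocabulary is enforced only by convention, ingestion code can normalize and sanity-check domains at the application layer. A sketch (the helper name and warning behavior are hypothetical, not part of any library):

```python
RECOMMENDED_DOMAINS = {
    "growth", "strategy", "engineering", "product", "startup",
    "sales", "design", "ai-tools", "ai-builder", "research",
}

def normalize_domain(domain: str) -> str:
    """Lowercase and trim, flagging values outside the recommended set."""
    cleaned = domain.strip().lower()
    if cleaned not in RECOMMENDED_DOMAINS:
        # Soft enum: warn rather than reject, keeping the schema flexible
        print(f"warning: unrecognized domain {cleaned!r}; inserting anyway")
    return cleaned
```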
---
## Migration from LanceDB
### Current LanceDB Structure
```python
# Current schema in LanceDB
{
"id": str,
"video_id": str,
"chunk_index": int,
"text": str,
"embedding": list[float],
"video_title": str,
"channel_name": str,
"domain": str,
"published_at": str,
"view_count": int
}
```
### Migration Script Outline
```python
# See implementation/MIGRATION.md for full script
async def migrate_to_supabase():
    # 1. Read all data from LanceDB
    lance_db = lancedb.connect("knowledge/lancedb")
    chunks_table = lance_db.open_table("chunks")
    # 2. Extract unique videos
    videos = extract_unique_videos(chunks_table)
    # 3. Insert videos into Supabase (`from` is a Python keyword, so the
    #    Python client exposes .table() instead of .from())
    await supabase.table("videos").upsert(videos).execute()
    # 4. Insert chunks in batches of 100 rows
    for batch in batched(chunks_table.to_pandas().to_dict("records"), 100):
        await supabase.table("chunks").insert(batch).execute()
```
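The outline relies on a `batched` helper that isn't defined there; one possible implementation over any iterable of rows:

```python
from itertools import islice

def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch
```

Note that iterating a pandas DataFrame directly yields column names, not rows, so convert first with `.to_dict("records")` before batching.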
---
## Performance Considerations
### Query Performance
| Query Type | Expected Latency | Index Used |
|------------|------------------|------------|
| Semantic search (5 results) | ~50ms | HNSW |
| Hybrid search (5 results) | ~100ms | HNSW + GIN |
| Domain listing | ~10ms | idx_chunks_domain |
| Video listing | ~20ms | idx_videos_domain |
### Scaling Thresholds
| Metric | Current | Supabase Free | Supabase Pro |
|--------|---------|---------------|--------------|
| DB Size | ~100MB | 500MB | 8GB |
| Vectors | ~50K | ~85K | ~1.4M |
| Connections | 10 | 50 | 200 |
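The vector-capacity figures above are rough estimates; a back-of-envelope bound from the raw embedding payload alone (1536 float4 components, ignoring chunk text, per-row overhead, and index size):

```python
DIMS = 1536
BYTES_PER_DIM = 4  # pgvector stores each component as a 4-byte float

bytes_per_vector = DIMS * BYTES_PER_DIM  # 6144 bytes ≈ 6 KB per embedding

def vectors_per_budget(budget_mb: int) -> int:
    """Upper bound on embeddings fitting in a storage budget (raw payload only)."""
    return (budget_mb * 1024 * 1024) // bytes_per_vector
```

Since text, indexes, and row overhead add on top, real capacity is lower than this bound.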
### Optimization Techniques
1. **Connection Pooling**: Use Supabase connection pooler
2. **Query Caching**: Cache common searches (future)
3. **Partial Indexes**: Index only recent content if needed
4. **Materialized Views**: Pre-compute domain stats
---
## Backup & Recovery
### Automatic Backups
Supabase provides:
- **Daily backups**: 7-day retention (free), 30-day (pro)
- **Point-in-time recovery**: Pro tier only
- **Logical backups**: pg_dump export
### Manual Export
```bash
# Export via Supabase CLI
supabase db dump -f backup.sql
# Export specific tables
pg_dump --table=videos --table=chunks > youtube_kb_backup.sql
```
### Recovery Procedure
```bash
# Restore from backup
psql $DATABASE_URL < backup.sql
# Or use Supabase dashboard restore feature
```
---
## Supabase Project Setup
### 1. Create Project
```bash
# Via CLI
supabase projects create youtube-kb --org-id YOUR_ORG
# Or via dashboard: https://supabase.com/dashboard
```
### 2. Apply Schema
```bash
# Save the "Complete Schema SQL" block above (the SQL only, not this markdown
# file) as supabase/migrations/001_initial_schema.sql
# Apply migrations
supabase db push
```
### 3. Get Connection Details
```bash
# Get API URL and keys
supabase status
# Or from dashboard: Project Settings > API
```
### 4. Test Connection
```typescript
import { createClient } from '@supabase/supabase-js';
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_ANON_KEY!
);
// Test query
const { data, error } = await supabase.rpc('get_stats');
console.log(data);
```