# Migration Guide: LanceDB to Supabase
## Overview
This guide covers migrating the existing YouTube KB data from local LanceDB to hosted Supabase pgvector.
---
## Current State (LanceDB)
### Location
```
/Users/zaytsev/Documents/Projects/YT-transcripts/knowledge/lancedb/
```
### Schema
```python
# Current LanceDB structure
chunks_table:
- id: str (UUID)
- video_id: str
- chunk_index: int
- text: str
- embedding: list[float] # 1536 dimensions
- video_title: str
- channel_name: str
- domain: str
- published_at: str (optional)
- view_count: int (optional)
```
### Estimated Data Size
- Videos: ~500-1000
- Chunks: ~50,000-100,000
- Embeddings: ~150MB-300MB
---
## Target State (Supabase)
### Schema
See [DATABASE.md](../architecture/DATABASE.md) for full schema.
### Key Differences
| Aspect | LanceDB | Supabase |
|--------|---------|----------|
| Storage | Local files | Cloud PostgreSQL |
| Vector index | DiskANN | HNSW (pgvector) |
| Text search | N/A | tsvector (native) |
| Metadata | Denormalized | Normalized (videos + chunks) |
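The denormalized-to-normalized change is the crux of the migration: LanceDB stores video metadata on every chunk row, while Supabase splits it into `videos` and `chunks` tables. The deduplication the migration script's `extract_videos` performs can be sketched on plain dicts (the sample rows below are illustrative, not real data):

```python
def unique_videos(rows: list[dict]) -> list[dict]:
    """Collapse denormalized chunk rows into one record per video."""
    videos: dict[str, dict] = {}
    for row in rows:
        # First chunk of each video wins; later duplicates are ignored
        videos.setdefault(row["video_id"], {
            "video_id": row["video_id"],
            "title": row.get("video_title", "Unknown"),
            "channel_name": row.get("channel_name", "Unknown"),
        })
    return list(videos.values())

rows = [
    {"video_id": "abc123", "video_title": "How to Price Your SaaS Product", "channel_name": "A"},
    {"video_id": "abc123", "video_title": "How to Price Your SaaS Product", "channel_name": "A"},
    {"video_id": "def456", "video_title": "Growth Hacking 101", "channel_name": "B"},
]
print(len(unique_videos(rows)))  # → 2
```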
---
## Migration Script
### Full Migration Script (Python)
```python
#!/usr/bin/env python3
"""
Migration script: LanceDB → Supabase pgvector

Usage:
    python migrate_to_supabase.py --dry-run   # Preview only
    python migrate_to_supabase.py             # Execute migration
"""
import os
import argparse
from datetime import datetime
from pathlib import Path
from typing import Iterator

import lancedb
from supabase import create_client, Client
from dotenv import load_dotenv
from tqdm import tqdm

# Load environment (credentials file created in Step 2)
load_dotenv(".env.migration")

# Configuration
LANCEDB_PATH = Path("knowledge/lancedb")
BATCH_SIZE = 100  # Records per batch
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_SERVICE_KEY"]  # Use service key for migration


def get_lancedb_connection():
    """Connect to local LanceDB."""
    return lancedb.connect(str(LANCEDB_PATH))


def get_supabase_client() -> Client:
    """Connect to Supabase."""
    return create_client(SUPABASE_URL, SUPABASE_KEY)


def parse_date(date_str: str | None) -> str | None:
    """Parse a date string in any of the known formats to ISO 8601."""
    if not date_str:
        return None
    for fmt in ["%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M:%SZ"]:
        try:
            return datetime.strptime(date_str, fmt).isoformat()
        except ValueError:
            continue
    return None


def extract_videos(lance_db) -> list[dict]:
    """Extract unique videos from the denormalized chunks table."""
    chunks_table = lance_db.open_table("chunks")
    df = chunks_table.to_pandas()

    videos = {}
    for _, row in df.iterrows():
        video_id = row["video_id"]
        if video_id not in videos:
            videos[video_id] = {
                "video_id": video_id,
                "title": row.get("video_title", "Unknown"),
                "channel_name": row.get("channel_name", "Unknown"),
                "channel_id": row.get("channel_id"),
                "domain": row.get("domain", "general"),
                "published_at": parse_date(row.get("published_at")),
                "view_count": row.get("view_count", 0),
                "duration_seconds": row.get("duration_seconds"),
            }
    return list(videos.values())


def extract_chunks(lance_db) -> Iterator[list[dict]]:
    """Extract chunks in batches of BATCH_SIZE."""
    chunks_table = lance_db.open_table("chunks")
    df = chunks_table.to_pandas()

    batch = []
    for _, row in df.iterrows():
        embedding = row["embedding"]
        chunk = {
            "video_id": row["video_id"],
            "chunk_index": int(row.get("chunk_index", 0)),
            "text": row["text"],
            # pgvector expects a plain list, not a numpy array
            "embedding": embedding.tolist() if hasattr(embedding, "tolist") else embedding,
            "domain": row.get("domain", "general"),
        }
        batch.append(chunk)
        if len(batch) >= BATCH_SIZE:
            yield batch
            batch = []
    if batch:
        yield batch


def migrate_videos(supabase: Client, videos: list[dict], dry_run: bool = False):
    """Migrate videos to Supabase."""
    print(f"\nMigrating {len(videos)} videos...")
    if dry_run:
        print("  [DRY RUN] Would insert videos:")
        for v in videos[:3]:
            print(f"    - {v['title'][:50]}... ({v['video_id']})")
        if len(videos) > 3:
            print(f"    ... and {len(videos) - 3} more")
        return

    # Batch upsert so re-runs don't violate the unique constraint
    for i in range(0, len(videos), BATCH_SIZE):
        batch = videos[i:i + BATCH_SIZE]
        supabase.table("videos").upsert(batch, on_conflict="video_id").execute()
        print(f"  Inserted batch {i // BATCH_SIZE + 1}: {len(batch)} videos")
    print(f"  ✓ Migrated {len(videos)} videos")


def migrate_chunks(supabase: Client, lance_db, dry_run: bool = False):
    """Migrate chunks to Supabase."""
    chunks_table = lance_db.open_table("chunks")
    total_chunks = chunks_table.count_rows()
    print(f"\nMigrating {total_chunks} chunks...")
    if dry_run:
        print("  [DRY RUN] Would insert chunks in batches of", BATCH_SIZE)
        return

    inserted = 0
    for batch in tqdm(extract_chunks(lance_db), desc="Migrating chunks"):
        # Embeddings were already converted to plain lists in extract_chunks
        supabase.table("chunks").insert(batch).execute()
        inserted += len(batch)
    print(f"  ✓ Migrated {inserted} chunks")


def verify_migration(supabase: Client, lance_db) -> bool:
    """Verify that Supabase row counts match LanceDB."""
    print("\nVerifying migration...")

    # Count videos
    lance_videos = len(extract_videos(lance_db))
    supa_videos = supabase.table("videos").select("*", count="exact").execute()
    print(f"  Videos: LanceDB={lance_videos}, Supabase={supa_videos.count}")

    # Count chunks
    lance_chunks = lance_db.open_table("chunks").count_rows()
    supa_chunks = supabase.table("chunks").select("*", count="exact").execute()
    print(f"  Chunks: LanceDB={lance_chunks}, Supabase={supa_chunks.count}")

    if lance_videos == supa_videos.count and lance_chunks == supa_chunks.count:
        print("  ✓ Migration verified successfully!")
        return True
    print("  ✗ Count mismatch detected!")
    return False


def test_search(supabase: Client):
    """Test the search functions defined in the schema."""
    print("\nTesting search functionality...")

    result = supabase.rpc("get_domain_stats").execute()
    print(f"  Domain stats: {len(result.data)} domains found")

    result = supabase.rpc("get_stats").execute()
    stats = result.data[0]
    print(f"  Total videos: {stats['total_videos']}")
    print(f"  Total chunks: {stats['total_chunks']}")
    print(f"  Domains: {', '.join(stats['domains'])}")
    print("  ✓ Search functions working!")


def main():
    parser = argparse.ArgumentParser(description="Migrate LanceDB to Supabase")
    parser.add_argument("--dry-run", action="store_true", help="Preview without executing")
    args = parser.parse_args()

    print("=" * 60)
    print("YouTube KB Migration: LanceDB → Supabase")
    print("=" * 60)
    if args.dry_run:
        print("\n🔍 DRY RUN MODE - No data will be modified\n")

    # Connect to databases
    print("Connecting to databases...")
    lance_db = get_lancedb_connection()
    supabase = get_supabase_client()
    print("  ✓ Connected to LanceDB")
    print("  ✓ Connected to Supabase")

    # Extract and migrate videos, then chunks
    videos = extract_videos(lance_db)
    migrate_videos(supabase, videos, dry_run=args.dry_run)
    migrate_chunks(supabase, lance_db, dry_run=args.dry_run)

    if not args.dry_run:
        verify_migration(supabase, lance_db)
        test_search(supabase)

    print("\n" + "=" * 60)
    if args.dry_run:
        print("DRY RUN COMPLETE - Run without --dry-run to execute")
    else:
        print("MIGRATION COMPLETE!")
    print("=" * 60)


if __name__ == "__main__":
    main()
```
---
## Step-by-Step Migration
### Step 1: Prepare Supabase
1. **Create Supabase Project**
```bash
# Create via the dashboard: https://supabase.com/dashboard
```
2. **Apply Schema**
```bash
# Copy schema from architecture/DATABASE.md
# Run in Supabase SQL editor
```
3. **Get Credentials**
```bash
# From Project Settings > API
SUPABASE_URL=https://xxxxx.supabase.co
SUPABASE_ANON_KEY=eyJ...
SUPABASE_SERVICE_KEY=eyJ... # For migration only
```
### Step 2: Prepare Environment
```bash
cd /Users/zaytsev/Documents/Projects/YT-transcripts
# Activate virtual environment
source .venv/bin/activate
# Install migration dependencies
pip install supabase python-dotenv tqdm
# Create .env file
cat > .env.migration << EOF
SUPABASE_URL=https://xxxxx.supabase.co
SUPABASE_SERVICE_KEY=eyJ...
EOF
```
### Step 3: Run Dry Run
```bash
python prd/migrations/migrate_to_supabase.py --dry-run
```
Expected output:
```
============================================================
YouTube KB Migration: LanceDB → Supabase
============================================================
🔍 DRY RUN MODE - No data will be modified
Connecting to databases...
✓ Connected to LanceDB
✓ Connected to Supabase
Migrating 523 videos...
[DRY RUN] Would insert videos:
- How to Price Your SaaS Product... (abc123)
- Growth Hacking 101... (def456)
- System Design Interview... (ghi789)
... and 520 more
Migrating 52,345 chunks...
[DRY RUN] Would insert chunks in batches of 100
============================================================
DRY RUN COMPLETE - Run without --dry-run to execute
============================================================
```
### Step 4: Execute Migration
```bash
python prd/migrations/migrate_to_supabase.py
```
### Step 5: Verify
```sql
-- Check via the Supabase dashboard, or run these in the SQL editor:
SELECT COUNT(*) FROM videos;
SELECT COUNT(*) FROM chunks;
SELECT * FROM get_domain_stats();
```
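The same checks can be scripted with supabase-py (fetching counts via `.select("*", count="exact")`, as the migration script does); the comparison itself is pure and easy to test:

```python
def compare_counts(source: dict[str, int], target: dict[str, int]) -> list[str]:
    """Compare LanceDB (source) and Supabase (target) row counts per table."""
    mismatches = []
    for table, expected in source.items():
        actual = target.get(table, 0)
        if actual != expected:
            mismatches.append(f"{table}: expected {expected}, got {actual}")
    return mismatches

# Counts would come from len(extract_videos(...)) / count_rows() on the LanceDB
# side, and .select("*", count="exact").execute().count on the Supabase side.
print(compare_counts({"videos": 523, "chunks": 52345},
                     {"videos": 523, "chunks": 52345}))  # → []
```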
---
## Rollback Plan
If migration fails or data is corrupted:
### Option 1: Clear and Retry
```sql
-- In Supabase SQL editor
TRUNCATE TABLE chunks CASCADE;
TRUNCATE TABLE videos CASCADE;
-- Then re-run migration
```
### Option 2: Restore from LanceDB
The original LanceDB data remains untouched at:
```
/Users/zaytsev/Documents/Projects/YT-transcripts/knowledge/lancedb/
```
Simply re-run the migration script after fixing the issue.
---
## Post-Migration Tasks
1. **Update Local MCP Server**
- Point to Supabase instead of LanceDB
- Test locally before deploying
2. **Monitor Performance**
- Check query latencies
- Monitor Supabase usage
3. **Archive LanceDB**
```bash
# Keep as backup
mv knowledge/lancedb knowledge/lancedb-archive-$(date +%Y%m%d)
```
4. **Update Documentation**
- Remove references to local LanceDB
- Add Supabase configuration docs
---
## Troubleshooting
### Error: "relation does not exist"
**Cause**: Schema not applied
**Fix**: Run schema SQL in Supabase editor
### Error: "invalid input syntax for type vector"
**Cause**: Embedding format mismatch
**Fix**: Ensure embeddings are plain arrays, not numpy arrays
```python
# Convert numpy to list
embedding = embedding.tolist() if hasattr(embedding, "tolist") else embedding
```
### Error: "duplicate key value violates unique constraint"
**Cause**: Re-running migration without clearing
**Fix**: Use upsert or truncate tables first
### Error: Rate limiting
**Cause**: Too many requests
**Fix**: Increase BATCH_SIZE or add delays
```python
import time
time.sleep(0.1) # Add between batches
```
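A fixed sleep works, but a more robust pattern is retry with exponential backoff around each insert. A sketch (`with_backoff` is a hypothetical helper; the lambda in the usage comment stands in for the script's actual insert call):

```python
import time

def with_backoff(fn, retries: int = 5, base_delay: float = 0.5):
    """Call fn(), retrying with exponentially growing delays on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # Out of retries: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Usage:
# with_backoff(lambda: supabase.table("chunks").insert(batch).execute())
```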
### Slow migration
**Cause**: Large dataset
**Fix**:
- Increase BATCH_SIZE to 500-1000
- Run during off-peak hours
- Use connection pooler
---
## Migration Checklist
- [ ] Supabase project created
- [ ] Schema applied
- [ ] Extensions enabled (pgvector)
- [ ] Indexes created
- [ ] Functions created
- [ ] Environment variables set
- [ ] Dry run successful
- [ ] Migration executed
- [ ] Video count verified
- [ ] Chunk count verified
- [ ] Search function tested
- [ ] Local MCP updated
- [ ] LanceDB archived