# Migration Guide: LanceDB to Supabase
## Overview
This guide covers migrating the existing YouTube KB data from local LanceDB to hosted Supabase pgvector.
---
## Current State (LanceDB)
### Location
```
/Users/zaytsev/Documents/Projects/YT-transcripts/knowledge/lancedb/
```
### Schema
```python
# Current LanceDB structure
chunks_table:
- id: str (UUID)
- video_id: str
- chunk_index: int
- text: str
- embedding: list[float] # 1536 dimensions
- video_title: str
- channel_name: str
- domain: str
- published_at: str (optional)
- view_count: int (optional)
```
### Estimated Data Size
- Videos: ~500-1000
- Chunks: ~50,000-100,000
- Embeddings: ~150MB-300MB
---
## Target State (Supabase)
### Schema
See [DATABASE.md](../architecture/DATABASE.md) for full schema.
### Key Differences
| Aspect | LanceDB | Supabase |
|--------|---------|----------|
| Storage | Local files | Cloud PostgreSQL |
| Vector index | DiskANN | HNSW (pgvector) |
| Text search | N/A | tsvector (native) |
| Metadata | Denormalized | Normalized (videos + chunks) |
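The denormalized-to-normalized change is the crux of the migration: LanceDB stores video metadata on every chunk row, while Supabase splits it into `videos` and `chunks` tables. The deduplication the migration script's `extract_videos` performs can be sketched on plain dicts (the sample rows below are illustrative, not real data):

```python
def unique_videos(rows: list[dict]) -> list[dict]:
    """Collapse denormalized chunk rows into one record per video."""
    videos: dict[str, dict] = {}
    for row in rows:
        # First chunk of each video wins; later duplicates are ignored
        videos.setdefault(row["video_id"], {
            "video_id": row["video_id"],
            "title": row.get("video_title", "Unknown"),
            "channel_name": row.get("channel_name", "Unknown"),
        })
    return list(videos.values())

rows = [
    {"video_id": "abc123", "video_title": "How to Price Your SaaS Product", "channel_name": "A"},
    {"video_id": "abc123", "video_title": "How to Price Your SaaS Product", "channel_name": "A"},
    {"video_id": "def456", "video_title": "Growth Hacking 101", "channel_name": "B"},
]
print(len(unique_videos(rows)))  # → 2
```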
---
## Migration Script
### Full Migration Script (Python)
```python
#!/usr/bin/env python3
"""
Migration script: LanceDB → Supabase pgvector

Usage:
    python migrate_to_supabase.py --dry-run   # Preview only
    python migrate_to_supabase.py             # Execute migration
"""
import os
import argparse
from datetime import datetime
from pathlib import Path
from typing import Iterator

import lancedb
from supabase import create_client, Client
from dotenv import load_dotenv
from tqdm import tqdm

# Load environment (credentials file created in Step 2)
load_dotenv(".env.migration")

# Configuration
LANCEDB_PATH = Path("knowledge/lancedb")
BATCH_SIZE = 100  # Records per batch
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_SERVICE_KEY"]  # Use service key for migration


def get_lancedb_connection():
    """Connect to local LanceDB."""
    return lancedb.connect(str(LANCEDB_PATH))


def get_supabase_client() -> Client:
    """Connect to Supabase."""
    return create_client(SUPABASE_URL, SUPABASE_KEY)


def parse_date(date_str: str | None) -> str | None:
    """Parse a date string in any of the known formats to ISO 8601."""
    if not date_str:
        return None
    for fmt in ["%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M:%SZ"]:
        try:
            return datetime.strptime(date_str, fmt).isoformat()
        except ValueError:
            continue
    return None


def extract_videos(lance_db) -> list[dict]:
    """Extract unique videos from the denormalized chunks table."""
    chunks_table = lance_db.open_table("chunks")
    df = chunks_table.to_pandas()

    videos = {}
    for _, row in df.iterrows():
        video_id = row["video_id"]
        if video_id not in videos:
            videos[video_id] = {
                "video_id": video_id,
                "title": row.get("video_title", "Unknown"),
                "channel_name": row.get("channel_name", "Unknown"),
                "channel_id": row.get("channel_id"),
                "domain": row.get("domain", "general"),
                "published_at": parse_date(row.get("published_at")),
                "view_count": row.get("view_count", 0),
                "duration_seconds": row.get("duration_seconds"),
            }
    return list(videos.values())


def extract_chunks(lance_db) -> Iterator[list[dict]]:
    """Extract chunks in batches of BATCH_SIZE."""
    chunks_table = lance_db.open_table("chunks")
    df = chunks_table.to_pandas()

    batch = []
    for _, row in df.iterrows():
        embedding = row["embedding"]
        chunk = {
            "video_id": row["video_id"],
            "chunk_index": int(row.get("chunk_index", 0)),
            "text": row["text"],
            # pgvector expects a plain list, not a numpy array
            "embedding": embedding.tolist() if hasattr(embedding, "tolist") else embedding,
            "domain": row.get("domain", "general"),
        }
        batch.append(chunk)
        if len(batch) >= BATCH_SIZE:
            yield batch
            batch = []
    if batch:
        yield batch


def migrate_videos(supabase: Client, videos: list[dict], dry_run: bool = False):
    """Migrate videos to Supabase."""
    print(f"\nMigrating {len(videos)} videos...")
    if dry_run:
        print("  [DRY RUN] Would insert videos:")
        for v in videos[:3]:
            print(f"    - {v['title'][:50]}... ({v['video_id']})")
        if len(videos) > 3:
            print(f"    ... and {len(videos) - 3} more")
        return

    # Batch upsert so re-runs don't violate the unique constraint
    for i in range(0, len(videos), BATCH_SIZE):
        batch = videos[i:i + BATCH_SIZE]
        supabase.table("videos").upsert(batch, on_conflict="video_id").execute()
        print(f"  Inserted batch {i // BATCH_SIZE + 1}: {len(batch)} videos")
    print(f"  ✓ Migrated {len(videos)} videos")


def migrate_chunks(supabase: Client, lance_db, dry_run: bool = False):
    """Migrate chunks to Supabase."""
    chunks_table = lance_db.open_table("chunks")
    total_chunks = chunks_table.count_rows()
    print(f"\nMigrating {total_chunks} chunks...")
    if dry_run:
        print("  [DRY RUN] Would insert chunks in batches of", BATCH_SIZE)
        return

    inserted = 0
    for batch in tqdm(extract_chunks(lance_db), desc="Migrating chunks"):
        # Embeddings were already converted to plain lists in extract_chunks
        supabase.table("chunks").insert(batch).execute()
        inserted += len(batch)
    print(f"  ✓ Migrated {inserted} chunks")


def verify_migration(supabase: Client, lance_db) -> bool:
    """Verify that Supabase row counts match LanceDB."""
    print("\nVerifying migration...")

    # Count videos
    lance_videos = len(extract_videos(lance_db))
    supa_videos = supabase.table("videos").select("*", count="exact").execute()
    print(f"  Videos: LanceDB={lance_videos}, Supabase={supa_videos.count}")

    # Count chunks
    lance_chunks = lance_db.open_table("chunks").count_rows()
    supa_chunks = supabase.table("chunks").select("*", count="exact").execute()
    print(f"  Chunks: LanceDB={lance_chunks}, Supabase={supa_chunks.count}")

    if lance_videos == supa_videos.count and lance_chunks == supa_chunks.count:
        print("  ✓ Migration verified successfully!")
        return True
    print("  ✗ Count mismatch detected!")
    return False


def test_search(supabase: Client):
    """Test the search functions defined in the schema."""
    print("\nTesting search functionality...")

    result = supabase.rpc("get_domain_stats").execute()
    print(f"  Domain stats: {len(result.data)} domains found")

    result = supabase.rpc("get_stats").execute()
    stats = result.data[0]
    print(f"  Total videos: {stats['total_videos']}")
    print(f"  Total chunks: {stats['total_chunks']}")
    print(f"  Domains: {', '.join(stats['domains'])}")
    print("  ✓ Search functions working!")


def main():
    parser = argparse.ArgumentParser(description="Migrate LanceDB to Supabase")
    parser.add_argument("--dry-run", action="store_true", help="Preview without executing")
    args = parser.parse_args()

    print("=" * 60)
    print("YouTube KB Migration: LanceDB → Supabase")
    print("=" * 60)
    if args.dry_run:
        print("\n🔍 DRY RUN MODE - No data will be modified\n")

    # Connect to databases
    print("Connecting to databases...")
    lance_db = get_lancedb_connection()
    supabase = get_supabase_client()
    print("  ✓ Connected to LanceDB")
    print("  ✓ Connected to Supabase")

    # Extract and migrate videos, then chunks
    videos = extract_videos(lance_db)
    migrate_videos(supabase, videos, dry_run=args.dry_run)
    migrate_chunks(supabase, lance_db, dry_run=args.dry_run)

    if not args.dry_run:
        verify_migration(supabase, lance_db)
        test_search(supabase)

    print("\n" + "=" * 60)
    if args.dry_run:
        print("DRY RUN COMPLETE - Run without --dry-run to execute")
    else:
        print("MIGRATION COMPLETE!")
    print("=" * 60)


if __name__ == "__main__":
    main()
```
---
## Step-by-Step Migration
### Step 1: Prepare Supabase
1. **Create Supabase Project**
```bash
# Create via the dashboard: https://supabase.com/dashboard
```
2. **Apply Schema**
```bash
# Copy schema from architecture/DATABASE.md
# Run in Supabase SQL editor
```
3. **Get Credentials**
```bash
# From Project Settings > API
SUPABASE_URL=https://xxxxx.supabase.co
SUPABASE_ANON_KEY=eyJ...
SUPABASE_SERVICE_KEY=eyJ... # For migration only
```
### Step 2: Prepare Environment
```bash
cd /Users/zaytsev/Documents/Projects/YT-transcripts
# Activate virtual environment
source .venv/bin/activate
# Install migration dependencies
pip install supabase python-dotenv tqdm
# Create .env file
cat > .env.migration << EOF
SUPABASE_URL=https://xxxxx.supabase.co
SUPABASE_SERVICE_KEY=eyJ...
EOF
```
### Step 3: Run Dry Run
```bash
python prd/migrations/migrate_to_supabase.py --dry-run
```
Expected output:
```
============================================================
YouTube KB Migration: LanceDB → Supabase
============================================================
🔍 DRY RUN MODE - No data will be modified
Connecting to databases...
✓ Connected to LanceDB
✓ Connected to Supabase
Migrating 523 videos...
[DRY RUN] Would insert videos:
- How to Price Your SaaS Product... (abc123)
- Growth Hacking 101... (def456)
- System Design Interview... (ghi789)
... and 520 more
Migrating 52,345 chunks...
[DRY RUN] Would insert chunks in batches of 100
============================================================
DRY RUN COMPLETE - Run without --dry-run to execute
============================================================
```
### Step 4: Execute Migration
```bash
python prd/migrations/migrate_to_supabase.py
```
### Step 5: Verify
```sql
-- Check via the Supabase dashboard, or run these in the SQL editor:
SELECT COUNT(*) FROM videos;
SELECT COUNT(*) FROM chunks;
SELECT * FROM get_domain_stats();
```
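The same checks can be scripted with supabase-py (fetching counts via `.select("*", count="exact")`, as the migration script does); the comparison itself is pure and easy to test:

```python
def compare_counts(source: dict[str, int], target: dict[str, int]) -> list[str]:
    """Compare LanceDB (source) and Supabase (target) row counts per table."""
    mismatches = []
    for table, expected in source.items():
        actual = target.get(table, 0)
        if actual != expected:
            mismatches.append(f"{table}: expected {expected}, got {actual}")
    return mismatches

# Counts would come from len(extract_videos(...)) / count_rows() on the LanceDB
# side, and .select("*", count="exact").execute().count on the Supabase side.
print(compare_counts({"videos": 523, "chunks": 52345},
                     {"videos": 523, "chunks": 52345}))  # → []
```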
---
## Rollback Plan
If migration fails or data is corrupted:
### Option 1: Clear and Retry
```sql
-- In Supabase SQL editor
TRUNCATE TABLE chunks CASCADE;
TRUNCATE TABLE videos CASCADE;
-- Then re-run migration
```
### Option 2: Restore from LanceDB
The original LanceDB data remains untouched at:
```
/Users/zaytsev/Documents/Projects/YT-transcripts/knowledge/lancedb/
```
Simply re-run the migration script after fixing the issue.
---
## Post-Migration Tasks
1. **Update Local MCP Server**
- Point to Supabase instead of LanceDB
- Test locally before deploying
2. **Monitor Performance**
- Check query latencies
- Monitor Supabase usage
3. **Archive LanceDB**
```bash
# Keep as backup
mv knowledge/lancedb knowledge/lancedb-archive-$(date +%Y%m%d)
```
4. **Update Documentation**
- Remove references to local LanceDB
- Add Supabase configuration docs
---
## Troubleshooting
### Error: "relation does not exist"
**Cause**: Schema not applied
**Fix**: Run schema SQL in Supabase editor
### Error: "invalid input syntax for type vector"
**Cause**: Embedding format mismatch
**Fix**: Ensure embeddings are plain arrays, not numpy arrays
```python
# Convert numpy to list
embedding = embedding.tolist() if hasattr(embedding, "tolist") else embedding
```
### Error: "duplicate key value violates unique constraint"
**Cause**: Re-running migration without clearing
**Fix**: Use upsert or truncate tables first
### Error: Rate limiting
**Cause**: Too many requests
**Fix**: Increase BATCH_SIZE or add delays
```python
import time
time.sleep(0.1) # Add between batches
```
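A fixed sleep works, but a more robust pattern is retry with exponential backoff around each insert. A sketch (`with_backoff` is a hypothetical helper; the lambda in the usage comment stands in for the script's actual insert call):

```python
import time

def with_backoff(fn, retries: int = 5, base_delay: float = 0.5):
    """Call fn(), retrying with exponentially growing delays on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # Out of retries: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Usage:
# with_backoff(lambda: supabase.table("chunks").insert(batch).execute())
```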
### Slow migration
**Cause**: Large dataset
**Fix**:
- Increase BATCH_SIZE to 500-1000
- Run during off-peak hours
- Use connection pooler
---
## Migration Checklist
- [ ] Supabase project created
- [ ] Schema applied
- [ ] Extensions enabled (pgvector)
- [ ] Indexes created
- [ ] Functions created
- [ ] Environment variables set
- [ ] Dry run successful
- [ ] Migration executed
- [ ] Video count verified
- [ ] Chunk count verified
- [ ] Search function tested
- [ ] Local MCP updated
- [ ] LanceDB archived