BerryRAG supports OpenAI embeddings as one of several embedding providers for generating vector representations of documents in its RAG system.
BerryRAG: Local Vector Database with Playwright MCP Integration
A complete local RAG (Retrieval-Augmented Generation) system that integrates Playwright MCP web scraping with vector database storage for Claude.
Features
Zero-cost self-hosted vector database
Playwright MCP integration for automated web scraping
Multiple embedding providers (sentence-transformers, OpenAI, fallback)
Smart content processing with quality filters
Claude-optimized context formatting
MCP server for direct Claude integration
Command-line tools for manual operation
Quick Start
1. Installation
2. Configure Claude Desktop
Add to your claude_desktop_config.json:
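As a rough sketch of the shape such an entry takes, the Python snippet below builds and prints an `mcpServers` block in the layout Claude Desktop reads. The server name `berry-rag` and the script path are placeholders assumed for illustration, not values taken from this project; substitute the real path to the compiled MCP server.

```python
import json
from typing import Dict, List


def build_mcp_config(server_name: str, command: str, args: List[str]) -> Dict:
    """Return a dict shaped like the mcpServers section of claude_desktop_config.json."""
    return {"mcpServers": {server_name: {"command": command, "args": args}}}


# Hypothetical name and path: replace both with the values for your checkout.
config = build_mcp_config(
    "berry-rag",
    "node",
    ["/absolute/path/to/your/compiled/server.js"],
)
print(json.dumps(config, indent=2))
```

The printed JSON can be merged into an existing configuration file by hand; other servers already listed under `mcpServers` should be kept alongside it.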
3. Start Using
Project Structure
Commands
Streamlit Web Interface
Launch the web interface for easy interaction with your RAG system:
The web interface provides:
Search: Interactive document search with similarity controls
Context: Generate formatted context for AI assistants
Add Document: Upload files or paste content directly
List Documents: Browse your document library
Statistics: System health and performance metrics
NPM Scripts
| Command | Description |
| | Install all dependencies |
| | Initialize directories and instructions |
| | Compile TypeScript MCP server |
| | Process scraped files into vector DB |
| | Search the knowledge base |
| | List all documents |
Python CLI
Usage with Claude
1. Scraping Documentation
2. Processing into Vector Database
3. Querying Knowledge Base
MCP Tools Available to Claude
BerryRAG provides two powerful MCP servers for Claude integration:
Vector DB Server Tools
add_document - Add content directly to vector DB
search_documents - Search for similar content
get_context - Get formatted context for queries
list_documents - List all stored documents
get_stats - Vector database statistics
process_scraped_files - Process Playwright scraped content
save_scraped_content - Save content for later processing
BerryExa Server Tools
crawl_content - Advanced web content extraction with subpage support
extract_links - Extract internal links for subpage discovery
get_content_preview - Quick content preview without full processing
For the complete MCP setup and usage guide, see
Embedding Providers
The system supports multiple embedding providers with automatic fallback:
sentence-transformers (recommended, free, local)
OpenAI embeddings (requires API key, set OPENAI_API_KEY)
Simple hash-based (fallback, not recommended for production)
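The fallback order above could be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `choose_provider` and `hash_embedding` are hypothetical names, the priority order is inferred from the list, and the 64-dimension token-hashing scheme is just one simple way to build a hash-based vector.

```python
import hashlib
import importlib.util
import math
import os
from typing import List


def hash_embedding(text: str, dim: int = 64) -> List[float]:
    """Deterministic hash-based vector; carries no semantic meaning (last-resort fallback)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0       # signed to reduce collision bias
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def choose_provider() -> str:
    """Pick the first usable provider, in the assumed priority order."""
    if importlib.util.find_spec("sentence_transformers") is not None:
        return "sentence-transformers"
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    return "hash"


print(choose_provider())
print(len(hash_embedding("local vector database")))
```

Because the hash-based vectors only reflect which tokens appear, two texts with the same meaning but different wording will not land near each other, which is why this option is a fallback rather than a production choice.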
Configuration
Environment Variables
Content Quality Filters
The system automatically filters out:
Content shorter than 100 characters
Navigation-only content
Repetitive/duplicate content
Files larger than 500KB
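A minimal sketch of how filters like these might be applied is shown below. The 100-character and 500KB thresholds come from the list above; everything else is assumed for illustration, including the function name `rejection_reason`, the reading of 500KB as 500 * 1024 bytes, and the simple heuristics used to spot navigation-only and repetitive content.

```python
import hashlib
from typing import Optional, Set

MIN_CHARS = 100          # minimum content length from the list above
MAX_BYTES = 500 * 1024   # "500KB", interpreted here as 500 * 1024 bytes


def rejection_reason(text: str, seen: Set[str]) -> Optional[str]:
    """Return why the content would be filtered out, or None if it passes."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return "too short"
    if len(text.encode("utf-8")) > MAX_BYTES:
        return "too large"
    lines = [ln.strip() for ln in stripped.splitlines() if ln.strip()]
    # Assumed heuristic: navigation menus consist mostly of very short lines.
    short = sum(1 for ln in lines if len(ln.split()) <= 3)
    if short / len(lines) > 0.8:
        return "navigation-only"
    # Assumed heuristic: repetitive content has few distinct lines.
    if len(lines) >= 5 and len(set(lines)) / len(lines) < 0.5:
        return "repetitive"
    # Exact duplicates are detected by hashing whitespace-normalized text.
    digest = hashlib.sha256(" ".join(stripped.split()).encode("utf-8")).hexdigest()
    if digest in seen:
        return "duplicate"
    seen.add(digest)
    return None


seen_hashes: Set[str] = set()
menu = "\n".join(["Home", "About us", "Docs", "Blog", "Contact", "Sign in"] * 5)
print(rejection_reason(menu, seen_hashes))
```

Returning a reason string rather than a bare boolean makes it easier to see which filter is responsible when expected documents do not show up in the knowledge base.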
Chunking Strategy
Default chunk size: 500 characters
Overlap: 50 characters
Smart boundary detection (sentences, paragraphs)
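The chunking parameters above can be illustrated with a short sketch. It assumes a character-based splitter that prefers to cut at a paragraph break, then a sentence end, then a space; the function name `chunk_text` and the rule of only accepting a boundary in the second half of the window are assumptions made for this example.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into chunks of at most chunk_size characters that overlap by `overlap`."""
    if not 0 <= overlap <= chunk_size // 2:
        raise ValueError("overlap must be between 0 and half of chunk_size")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer a paragraph break, then a sentence end, then a plain space.
            for sep in ("\n\n", ". ", " "):
                cut = window.rfind(sep)
                if cut > chunk_size // 2:  # avoid producing very small chunks
                    end = start + cut + len(sep)
                    break
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        start = end - overlap  # step back so neighbouring chunks share context
    return chunks


sample = "Sentence number one is here. " * 100
parts = chunk_text(sample)
print(len(parts), max(len(p) for p in parts))
```

The overlap means a sentence that straddles a chunk boundary is still partly visible in both neighbouring chunks, at the cost of storing a small amount of duplicated text.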
Monitoring
Check System Status
Storage Information
Database: storage/documents.db (SQLite metadata)
Vectors: storage/vectors/ (NumPy arrays)
Scraped Content: scraped_content/ (Markdown files)
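Keeping metadata in SQLite and vectors in separate arrays implies a two-step lookup at query time: rank the vectors by similarity, then fetch the matching metadata. The sketch below illustrates that idea only. The `documents` table schema, the `search` function, and the use of plain Python lists in place of NumPy arrays are all assumptions made so the example runs with the standard library.

```python
import math
import sqlite3
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(conn: sqlite3.Connection, vectors: Dict[str, List[float]],
           query_vec: List[float], top_k: int = 3,
           min_similarity: float = 0.0) -> List[Tuple[str, float]]:
    """Rank stored vectors against query_vec, then attach titles from SQLite."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()),
        reverse=True,
    )
    results = []
    for score, doc_id in scored[:top_k]:
        if score < min_similarity:
            continue
        row = conn.execute(
            "SELECT title FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        results.append((row[0], round(score, 3)))
    return results


# Hypothetical in-memory data standing in for storage/documents.db and storage/vectors/.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)", [("a", "Intro"), ("b", "API guide")])
vectors = {"a": [1.0, 0.0], "b": [0.6, 0.8]}
print(search(conn, vectors, [1.0, 0.0]))
```

Raising `min_similarity` trims weak matches from the result list, which is the kind of control a similarity threshold in a search interface would expose.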
Example Workflows
Academic Research
Scrape research papers with Playwright
Process into vector database
Query for specific concepts across all papers
Documentation Management
Scrape API documentation from multiple sources
Build unified searchable knowledge base
Get contextual answers about implementation details
Content Aggregation
Scrape blog posts and articles
Create topic-based knowledge clusters
Find related content across sources
Development
Building the MCP Server
Running in Development Mode
Testing
Troubleshooting
Common Issues
Python dependencies missing:
TypeScript compilation errors:
Embedding model download slow: The first run downloads the sentence-transformers model (~90MB), so a slow start on first use is normal.
No results from search:
Check if documents were processed: python src/rag_system.py list
Verify content quality filters aren't too strict
Try broader search terms
Logs and Debugging
Python logs: Check console output
MCP server logs: Stderr output
Processing status: scraped_content/.processed_files.json
License
MIT License - feel free to modify and extend for your needs.
Contributing
This is a personal project for Eric Berry, but feel free to fork and adapt for your own use cases.
Happy scraping and searching!