# BerryRAG: Local Vector Database with Playwright MCP Integration
A complete local RAG (Retrieval-Augmented Generation) system that integrates Playwright MCP web scraping with vector database storage for Claude.
## Features
- **Zero-cost self-hosted** vector database
- **Playwright MCP integration** for automated web scraping
- **Multiple embedding providers** (sentence-transformers, OpenAI, fallback)
- **Smart content processing** with quality filters
- **Claude-optimized** context formatting
- **MCP server** for direct Claude integration
- **Command-line tools** for manual operation
## Quick Start
### 1. Installation
```bash
git clone https://github.com/berrydev-ai/berry-rag.git
cd berry-rag
# Install dependencies
npm run install-deps
# Setup directories and instructions
npm run setup
```
### 2. Configure Claude Desktop
Add to your `claude_desktop_config.json`:
```json
{
"mcpServers": {
"playwright": {
"command": "npx",
"args": ["@playwright/mcp@latest"]
},
"berry-rag": {
"command": "node",
"args": ["mcp_servers/vector_db_server.js"],
"cwd": "/Users/eberry/BerryDev/berry-rag"
}
}
}
```
Replace the `cwd` value with the absolute path to your local clone of the repository.
### 3. Start Using
```bash
# Example workflow:
# 1. Scrape with Playwright MCP through Claude
# 2. Process into vector DB
npm run process-scraped
# 3. Search your knowledge base
npm run search "React hooks"
```
## Project Structure
```
berry-rag/
├── src/                          # Python source code
│   ├── rag_system.py             # Core vector database system
│   └── playwright_integration.py # Playwright MCP integration
├── mcp_servers/                  # MCP server implementations
│   └── vector_db_server.ts       # TypeScript MCP server
├── storage/                      # Vector database storage
│   ├── documents.db              # SQLite metadata
│   └── vectors/                  # NumPy embedding files
├── scraped_content/              # Playwright saves content here
└── dist/                         # Compiled TypeScript
```
## Commands
### Streamlit Web Interface
Launch the web interface for easy interaction with your RAG system:
```bash
# Start the Streamlit web interface
python run_streamlit.py
# Or directly with streamlit
streamlit run streamlit_app.py
```
The web interface provides:
- **Search**: Interactive document search with similarity controls
- **Context**: Generate formatted context for AI assistants
- **Add Document**: Upload files or paste content directly
- **List Documents**: Browse your document library
- **Statistics**: System health and performance metrics
### NPM Scripts
| Command | Description |
| ------------------------- | --------------------------------------- |
| `npm run install-deps` | Install all dependencies |
| `npm run setup` | Initialize directories and instructions |
| `npm run build` | Compile TypeScript MCP server |
| `npm run process-scraped` | Process scraped files into vector DB |
| `npm run search` | Search the knowledge base |
| `npm run list-docs` | List all documents |
### Python CLI
```bash
# RAG System Operations
python src/rag_system.py search "query"
python src/rag_system.py context "query" # Claude-formatted
python src/rag_system.py add <url> <title> <file>
python src/rag_system.py list
python src/rag_system.py stats
# Playwright Integration
python src/playwright_integration.py process
python src/playwright_integration.py setup
python src/playwright_integration.py stats
```
## Usage with Claude
### 1. Scraping Documentation
```
"Use Playwright to scrape the React hooks documentation from https://react.dev/reference/react and save it to the scraped_content directory"
```
### 2. Processing into Vector Database
```
"Process all new scraped files and add them to the BerryRAG vector database"
```
### 3. Querying Knowledge Base
```
"Search the BerryRAG database for information about React useState best practices"
"Get context from the vector database about implementing custom hooks"
```
## MCP Tools Available to Claude
BerryRAG provides two powerful MCP servers for Claude integration:
### Vector DB Server Tools
- `add_document` - Add content directly to vector DB
- `search_documents` - Search for similar content
- `get_context` - Get formatted context for queries
- `list_documents` - List all stored documents
- `get_stats` - Vector database statistics
- `process_scraped_files` - Process Playwright scraped content
- `save_scraped_content` - Save content for later processing
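
As a rough illustration of what "search for similar content" can mean in a vector database, here is a minimal sketch that ranks stored vectors by cosine similarity. This README does not state which metric `search_documents` uses, so cosine similarity is an assumption, and the names `cosine_similarity`, `search`, `top_k`, and `min_similarity` are hypothetical rather than part of BerryRAG's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, documents, top_k=5, min_similarity=0.0):
    """Rank (doc_id, vector) pairs by similarity to the query, best first.

    Hypothetical helper for illustration; not BerryRAG's actual search code.
    """
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in documents]
    # Drop anything below the similarity threshold before ranking.
    scored = [item for item in scored if item[1] >= min_similarity]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Calling `search(query_vector, documents, top_k=3)` returns up to three `(doc_id, score)` pairs. Raising `min_similarity` trades recall for precision.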
### BerryExa Server Tools
- `crawl_content` - Advanced web content extraction with subpage support
- `extract_links` - Extract internal links for subpage discovery
- `get_content_preview` - Quick content preview without full processing
**For the complete MCP setup and usage guide, see [BERRY_MCP.md](BERRY_MCP.md).**
## Embedding Providers
The system supports multiple embedding providers with automatic fallback:
1. **sentence-transformers** (recommended, free, local)
2. **OpenAI embeddings** (requires API key, set `OPENAI_API_KEY`)
3. **Simple hash-based** (fallback, not recommended for production)
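
The fallback order above can be sketched roughly as follows. This is an illustrative sketch under assumptions, not the code in `src/rag_system.py`: the function names `choose_provider` and `hash_embedding`, the MD5 bucketing scheme, and the 64-dimension default are all hypothetical. It also shows why a hash-based embedding is a weak fallback: it only captures which tokens appear, not what they mean.

```python
import hashlib
import importlib.util
import math
import os


def choose_provider(env=None):
    """Return the name of the first usable provider, in the order listed above."""
    env = os.environ if env is None else env
    # 1. Prefer the local sentence-transformers package when it is installed.
    if importlib.util.find_spec("sentence_transformers") is not None:
        return "sentence-transformers"
    # 2. Otherwise use OpenAI, but only if an API key is configured.
    if env.get("OPENAI_API_KEY"):
        return "openai"
    # 3. Last resort: the hash-based fallback below.
    return "hash"


def hash_embedding(text, dim=64):
    """Toy hash-based embedding (assumed scheme): bucket tokens by MD5, then L2-normalize."""
    vector = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    # An empty text yields the all-zero vector, which cannot be normalized.
    return [v / norm for v in vector] if norm else vector
```

Two texts that share no tokens get unrelated vectors even when they mean the same thing, which is the practical reason to prefer one of the first two providers.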
## Configuration
### Environment Variables
```bash
# Optional: for OpenAI embeddings
export OPENAI_API_KEY=your_key_here
```
### Content Quality Filters
The system automatically filters out:
- Content shorter than 100 characters
- Navigation-only content
- Repetitive/duplicate content
- Files larger than 500KB
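
To make the four filters concrete, here is a minimal sketch of how such checks could be combined. Only the 100-character and 500KB thresholds come from this README. The navigation heuristic (mostly very short lines), the exact-duplicate hashing, the reading of 500KB as 500 × 1024 bytes, and the names `looks_like_navigation` and `passes_filters` are assumptions for illustration and may differ from the real implementation.

```python
import hashlib

MIN_CHARS = 100          # README: drop content shorter than 100 characters
MAX_BYTES = 500 * 1024   # README: drop files larger than 500KB (1024-byte KB assumed)


def looks_like_navigation(text, max_words_per_line=3):
    """Heuristic (assumed): navigation-only pages consist mostly of very short lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return True
    short = sum(1 for line in lines if len(line.split()) <= max_words_per_line)
    return short / len(lines) > 0.8


def passes_filters(text, seen_hashes):
    """Return True if the text survives all four filters; record its hash when it does."""
    if len(text.strip()) < MIN_CHARS:
        return False
    if len(text.encode("utf-8")) > MAX_BYTES:
        return False
    if looks_like_navigation(text):
        return False
    # Exact-duplicate check on whitespace-normalized text (assumed strategy).
    fingerprint = hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
    if fingerprint in seen_hashes:
        return False
    seen_hashes.add(fingerprint)
    return True
```

Pass the same `seen_hashes` set across a whole processing run so that duplicates are detected between files, not just within one.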
### Chunking Strategy
- Default chunk size: 500 characters
- Overlap: 50 characters
- Smart boundary detection (sentences, paragraphs)
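
The chunking parameters above can be illustrated with a short sketch. The 500-character size and 50-character overlap are taken from this README. The boundary rule (prefer a paragraph break, then a sentence end, but only in the second half of the window) and the function name `chunk_text` are assumptions, not the project's actual algorithm.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters with overlapping edges."""
    # Keeping overlap below half the chunk size guarantees forward progress.
    if not 0 <= overlap < chunk_size // 2:
        raise ValueError("overlap must be non-negative and less than half of chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to cut at a paragraph break, then at a sentence end,
            # but only if that keeps the chunk at least half full.
            window = text[start:end]
            for separator in ("\n\n", ". "):
                cut = window.rfind(separator)
                if cut >= chunk_size // 2:
                    end = start + cut + len(separator)
                    break
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Step back by `overlap` so neighbouring chunks share some context.
        start = end - overlap
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears with some surrounding context in the next chunk, at the cost of storing a little duplicated text.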
## Monitoring
### Check System Status
```bash
# Vector database statistics
python src/rag_system.py stats
# Processing status
python src/playwright_integration.py stats
# View recent documents
python src/rag_system.py list
```
### Storage Information
- **Database**: `storage/documents.db` (SQLite metadata)
- **Vectors**: `storage/vectors/` (NumPy arrays)
- **Scraped Content**: `scraped_content/` (Markdown files)
## Example Workflows
### Academic Research
1. Scrape research papers with Playwright
2. Process into vector database
3. Query for specific concepts across all papers
### Documentation Management
1. Scrape API documentation from multiple sources
2. Build unified searchable knowledge base
3. Get contextual answers about implementation details
### Content Aggregation
1. Scrape blog posts and articles
2. Create topic-based knowledge clusters
3. Find related content across sources
## Development
### Building the MCP Server
```bash
npm run build
```
### Running in Development Mode
```bash
npm run dev # TypeScript watch mode
```
### Testing
```bash
# Test RAG system
python src/rag_system.py stats
# Test integration
python src/playwright_integration.py setup
# Test MCP server
node mcp_servers/vector_db_server.js
```
## Troubleshooting
### Common Issues
**Python dependencies missing:**
```bash
pip install -r requirements.txt
```
**TypeScript compilation errors:**
```bash
npm install
npm run build
```
**Embedding model download slow:**
The first run downloads the sentence-transformers model (~90 MB), so a slow first start is expected.
**No results from search:**
- Check if documents were processed: `python src/rag_system.py list`
- Verify content quality filters aren't too strict
- Try broader search terms
### Logs and Debugging
- Python logs: Check console output
- MCP server logs: Stderr output
- Processing status: `scraped_content/.processed_files.json`
## License
MIT License - feel free to modify and extend for your needs.
## Contributing
This is a personal project for Eric Berry, but feel free to fork and adapt for your own use cases.
---
**Happy scraping and searching!**