BerryRAG

README_SUBPAGE_IMPLEMENTATION.md•10.5 KiB

# 🍓 BerryExa Subpage Implementation This document describes the implementation of subpage crawling functionality in BerryExa, bringing it to feature parity with Exa's subpage capabilities. ## 🎯 Overview BerryExa now supports: - **Recursive subpage crawling** up to depth 3 - **Keyword-targeted subpage selection** - **Rate-limited crawling** (1 second between requests) - **MCP server architecture** for clean separation from BerryRAG - **Flexible RAG integration** via BerryRAG's crawl command ## 🏗️ Architecture ### MCP Server Pattern ``` ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ BerryRAG │ │ BerryExa MCP │ │ BerryExa │ │ System │◄──►│ Server │◄──►│ Core System │ │ │ │ │ │ │ │ • Vector DB │ │ • Tool Exposure │ │ • Web Crawling │ │ • Chunking │ │ • JSON-RPC │ │ • Readability │ │ • Embedding │ │ • Error Handling│ │ • AI Processing │ │ • Search │ │ • Rate Limiting │ │ • Link Analysis │ └─────────────────┘ └─────────────────┘ └─────────────────┘ ``` ### Clean Separation of Concerns - **BerryExa Core**: Pure content extraction and crawling - **BerryExa MCP Server**: Tool exposure and protocol handling - **BerryRAG System**: Vector database and RAG functionality ## 🔧 Implementation Details ### Enhanced ExaResult Structure ```python @dataclass class ExaResult: title: str url: str text: str summary: str highlights: List[str] # New subpage fields depth: int = 0 # 0=main, 1-3=subpage depth parent_url: Optional[str] = None # Parent page URL crawl_path: List[str] = None # Breadcrumb trail ``` ### Subpage Crawling Algorithm 1. **Main Page Extraction** - Crawl and extract content using Mozilla Readability - Extract internal links for subpage discovery - Generate AI summary and highlights 2. **Link Scoring & Selection** ```python def score_link(link, keywords, position): relevance_score = keyword_match_score(link, keywords) * 0.7 position_score = (1.0 - position/total_links) * 0.3 return relevance_score + position_score ``` 3. **Recursive Crawling** - Rate-limited crawling (1 second delays) - Duplicate URL detection - Depth tracking and limits - Error handling per subpage 4. **Content Processing** - Full Readability extraction per subpage - AI-powered summaries and highlights - Parent-child relationship tracking ## 🛠️ MCP Server Tools ### 1. `crawl_content` Primary tool for web content extraction with subpage support. **Parameters:** ```json { "url": "https://example.com", // Required: Main URL "subpages": 5, // Optional: Number of subpages (0-20) "subpage_target": ["docs", "tutorial"], // Optional: Keywords for targeting "max_depth": 3, // Optional: Max crawling depth (1-3) "include_highlights": true, // Optional: AI highlights "include_summary": true // Optional: AI summaries } ``` **Response Format:** ```json { "main_page": { "title": "Main Page Title", "url": "https://example.com", "text": "Clean markdown content...", "summary": "AI-generated summary...", "highlights": ["Key point 1", "Key point 2"], "metadata": { "content_length": 5000, "extraction_source": "berry_exa_readability" } }, "subpages": [ { "title": "Subpage Title", "url": "https://example.com/subpage", "text": "Subpage content...", "summary": "Subpage summary...", "highlights": ["Subpage highlight..."], "depth": 1, "parent_url": "https://example.com", "crawl_path": ["https://example.com", "https://example.com/subpage"] } ], "crawl_metadata": { "total_pages": 6, "request_id": "uuid-string", "successful_crawls": 5, "failed_crawls": 1 } } ``` ### 2. `extract_links` Extract internal links for subpage discovery. **Parameters:** ```json { "url": "https://example.com", "filter_keywords": ["docs", "tutorial"], // Optional: Filter by keywords "max_links": 20 // Optional: Limit results (1-50) } ``` ### 3. `get_content_preview` Quick content preview without full processing. **Parameters:** ```json { "url": "https://example.com", "max_chars": 500 // Optional: Preview length (100-2000) } ``` ## 🚀 Usage Examples ### Basic Subpage Crawling ```python import asyncio from src.berry_exa import BerryExaSystem async def crawl_with_subpages(): berry_exa = BerryExaSystem() response = await berry_exa.get_contents_with_subpages( url="https://docs.python.org/3/tutorial/", subpages=5, subpage_target=["tutorial", "introduction"], max_depth=2, add_to_rag=False ) print(f"Crawled {len(response.results)} pages") for result in response.results: print(f"- {result.title} (depth: {result.depth})") asyncio.run(crawl_with_subpages()) ``` ### MCP Server Usage ```bash # Start the MCP server python mcp_servers/berry_exa_server.py # Use via MCP client (JSON-RPC) { "jsonrpc": "2.0", "method": "tools/call", "params": { "name": "crawl_content", "arguments": { "url": "https://docs.python.org/3/tutorial/", "subpages": 3, "subpage_target": ["tutorial", "basics"] } } } ``` ### Testing the Implementation ```bash # Run comprehensive tests python test_berry_exa_subpages.py # Test specific functionality python -c " import asyncio from src.berry_exa import BerryExaSystem async def test(): berry_exa = BerryExaSystem() links = await berry_exa.extract_links_only( 'https://docs.python.org/3/tutorial/', filter_keywords=['tutorial'], max_links=5 ) print(f'Found {len(links)} relevant links') asyncio.run(test()) " ``` ## 🎯 Keyword Targeting Strategy ### Relevance Scoring ```python def calculate_keyword_relevance(link, keywords): url_text = link['url'].lower() link_text = link['text'].lower() combined_text = f"{url_text} {link_text}" matches = sum(1 for keyword in keywords if keyword.lower() in combined_text) return min(matches / len(keywords), 1.0) ``` ### Link Selection Algorithm 1. **Position Score**: Earlier links get higher priority (30% weight) 2. **Keyword Relevance**: Keyword matches in URL/text (70% weight) 3. **Final Ranking**: Combined score determines selection order ### Example Targeting ```python # Target documentation pages subpage_target = ["docs", "documentation", "guide"] # Target tutorial content subpage_target = ["tutorial", "getting-started", "introduction"] # Target API references subpage_target = ["api", "reference", "methods"] ``` ## 🔄 Rate Limiting & Respectful Crawling ### Implementation - **1-second delay** between requests to same domain - **Concurrent limit**: Maximum 1 request at a time per domain - **Timeout handling**: 30-second timeout per page - **Error recovery**: Continue crawling even if some pages fail ### Best Practices ```python # Respectful crawling configuration crawler_config = { "rate_limit_delay": 1.0, # 1 second between requests "max_concurrent": 1, # No concurrent requests to same domain "timeout": 30000, # 30 second timeout "user_agent": "BerryExa/1.0 (Web Content Extractor)" } ``` ## 📊 Performance Characteristics ### Timing Estimates - **Main page**: ~3-5 seconds (crawl + extract + AI) - **Each subpage**: ~4-6 seconds (including 1s rate limit) - **5 subpages total**: ~25-35 seconds end-to-end ### Memory Usage - **Efficient streaming**: Content processed page-by-page - **Cleanup**: Browser instances closed after each page - **Limits**: Configurable content size limits ### Error Handling - **Graceful degradation**: Failed subpages don't stop crawling - **Detailed status**: Per-page success/failure tracking - **Retry logic**: Built-in timeout and error recovery ## 🔮 Future Enhancements ### Planned Features 1. **Recursive Depth Control**: Full 3-level depth crawling 2. **Batch Processing**: Multiple URLs in single request 3. **Custom Extractors**: Plugin system for specialized content 4. **Cache System**: Local caching to avoid re-crawling 5. **Advanced Filtering**: Content-based subpage selection ### Integration Opportunities 1. **BerryRAG Crawl Command**: Direct integration with vector database 2. **Claude MCP Tools**: Direct access from Claude conversations 3. **Workflow Automation**: Scheduled crawling and updates 4. **Content Monitoring**: Change detection and alerts ## 🧪 Testing & Validation ### Test Coverage - ✅ **Subpage crawling**: Multi-page extraction with targeting - ✅ **Link extraction**: Keyword filtering and scoring - ✅ **Content preview**: Quick content sampling - ✅ **Error handling**: Network failures and timeouts - ✅ **Rate limiting**: Respectful crawling behavior ### Test Execution ```bash # Run all tests python test_berry_exa_subpages.py # Expected output: # 🧪 BerryExa Subpage Functionality Tests # ✅ Successfully crawled X pages # ✅ Found Y filtered links # ✅ Preview generated # ✅ All tests completed! ``` ## 📝 API Compatibility ### Exa API Parity | Feature | Exa API | BerryExa | Status | | ------------------ | ------- | -------- | ----------- | | Content Extraction | ✅ | ✅ | ✅ Complete | | Subpage Crawling | ✅ | ✅ | ✅ Complete | | Keyword Targeting | ✅ | ✅ | ✅ Complete | | AI Summaries | ✅ | ✅ | ✅ Complete | | Highlights | ✅ | ✅ | ✅ Complete | | Depth Control | ✅ | ✅ | ✅ Complete | | Rate Limiting | ✅ | ✅ | ✅ Complete | | Local Storage | ❌ | ✅ | 🎯 Enhanced | | Privacy | ❌ | ✅ | 🎯 Enhanced | | Customization | ❌ | ✅ | 🎯 Enhanced | ### Response Format Compatibility BerryExa maintains compatibility with Exa's response structure while adding enhanced metadata for local processing and RAG integration. --- **🍓 BerryExa now provides complete subpage crawling capabilities with the added benefits of local processing, privacy, and seamless BerryRAG integration!**

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/berrydev-ai/berry-rag'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

README_SUBPAGE_IMPLEMENTATION.md•10.5 KiB