# Documentation Scraper & MCP Server
A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.
## Features
### Core Functionality
- **Universal Documentation Scraper**: Works with any documentation website
- **Structured Database**: SQLite database with full-text search capabilities
- **MCP Server Integration**: Native Claude Desktop integration via Model Context Protocol
- **LLM-Optimized Output**: Ready-to-use context files for AI applications
- **Configuration-Driven**: Single config file controls all settings
### Advanced Tools
- **Query Interface**: Command-line tool for searching and analyzing scraped content
- **Debug Suite**: Comprehensive debugging tools for testing and validation
- **Auto-Configuration**: Automatic MCP setup file generation
- **Progress Tracking**: Detailed logging and error handling
- **Resumable Crawls**: Smart caching for interrupted crawls
## Prerequisites
- Python 3.8 or higher
- Internet connection
- ~500MB free disk space per documentation site
## Quick Start
### 1. Installation
```bash
# Clone the repository
git clone <repository-url>
cd documentation-scraper
# Install dependencies
pip install -r requirements.txt
```
### 2. Configure Your Target
Edit `config.py` to set your documentation site:
```python
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",  # Your documentation site
    "output_dir": "docs_db",
    "max_pages": 200,
    # ... other settings
}
```
### 3. Run the Scraper
```bash
python docs_scraper.py
```
### 4. Query Your Documentation
```bash
# Search for content
python query_docs.py --search "tutorial"
# Browse by section
python query_docs.py --section "getting-started"
# Get statistics
python query_docs.py --stats
```
### 5. Set Up Claude Integration
```bash
# Generate MCP configuration files
python utils/gen_mcp.py
# Follow the instructions to add to Claude Desktop
```
## Project Structure
```
documentation-scraper/
├── config.py                        # Central configuration file
├── docs_scraper.py                  # Main scraper script
├── query_docs.py                    # Query and analysis tool
├── mcp_docs_server.py               # MCP server for Claude integration
├── requirements.txt                 # Python dependencies
├── utils/                           # Debug and utility tools
│   ├── gen_mcp.py                   # Generate MCP config files
│   ├── debug_scraper.py             # Test scraper functionality
│   ├── debug_mcp_server.py          # Debug MCP server
│   ├── debug_mcp_client.py          # Test MCP tools directly
│   ├── debug_mcp_server_protocol.py # Test MCP via JSON-RPC
│   └── debug_site_content.py        # Debug content extraction
├── docs_db/                         # Generated documentation database
│   ├── documentation.db             # SQLite database
│   ├── documentation.json           # JSON export
│   ├── scrape_summary.json          # Statistics
│   └── llm_context/                 # LLM-ready context files
└── mcp/                             # Generated MCP configuration
    ├── run_mcp_server.bat           # Windows launcher script
    └── claude_mcp_config.json       # Claude Desktop config
```
## Configuration
### Main Configuration (`config.py`)
The entire system is controlled by a single configuration file:
```python
# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}

# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}
```
### Environment Overrides
You can override any setting with environment variables:
```bash
export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
```
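As a rough sketch, such an override layer could look like the following; the helper function and the `db_path` config key are illustrative assumptions, and only `DOCS_BASE_URL` and `DOCS_DB_PATH` come from the commands above:
```python
# Sketch of an environment-override layer; not necessarily how mcp_docs_server.py
# implements it. DOCS_BASE_URL and DOCS_DB_PATH match the variables shown above;
# the "db_path" config key is a hypothetical example.
import os
from typing import Dict

from config import MCP_CONFIG, SCRAPER_CONFIG

def with_env_overrides(config: Dict, env_map: Dict[str, str]) -> Dict:
    """Return a copy of config in which any set environment variables win."""
    merged = dict(config)
    for key, env_var in env_map.items():
        value = os.environ.get(env_var)
        if value:
            merged[key] = value
    return merged

scraper_config = with_env_overrides(SCRAPER_CONFIG, {"base_url": "DOCS_BASE_URL"})
mcp_config = with_env_overrides(MCP_CONFIG, {"db_path": "DOCS_DB_PATH"})
```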
## Claude Desktop Integration
### Automatic Setup
1. **Generate configuration files**:
```bash
python utils/gen_mcp.py
```
2. **Copy the generated config** to Claude Desktop:
- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
3. **Restart Claude Desktop**
### Manual Setup
If you prefer manual setup, add this to your Claude Desktop config:
```json
{
  "mcpServers": {
    "docs": {
      "command": "python",
      "args": ["path/to/mcp_docs_server.py"],
      "cwd": "path/to/project",
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db"
      }
    }
  }
}
```
### Available MCP Tools
Once connected, Claude can use these tools (a minimal server-side sketch follows the list):
- **search_documentation**: Search for content across all documentation
- **get_documentation_sections**: List all available sections
- **get_page_content**: Get full content of specific pages
- **browse_section**: Browse pages within a section
- **get_documentation_stats**: Get database statistics
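For orientation, here is a minimal sketch of how one of these tools could be exposed, assuming the official MCP Python SDK (the `mcp` package) and the database layout described below; the project's `mcp_docs_server.py` may be structured differently:
```python
# Minimal sketch using the official MCP Python SDK (FastMCP); the shipped
# mcp_docs_server.py may register more tools and handle errors differently.
import sqlite3

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-server")
DB_PATH = "docs_db/documentation.db"

@mcp.tool()
def search_documentation(query: str, limit: int = 10) -> str:
    """Full-text search across all scraped pages."""
    conn = sqlite3.connect(DB_PATH)
    # Assumes the FTS index rowid maps to pages.id (see the schema section below).
    rows = conn.execute(
        "SELECT title, url FROM pages "
        "WHERE id IN (SELECT rowid FROM pages_fts WHERE pages_fts MATCH ?) "
        "LIMIT ?",
        (query, limit),
    ).fetchall()
    conn.close()
    return "\n".join(f"{title}: {url}" for title, url in rows)

if __name__ == "__main__":
    mcp.run()  # serves over stdio for Claude Desktop
```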
## Command Line Tools
### Documentation Scraper
```bash
# Run the scraper; all settings are read from config.py
python docs_scraper.py
```
### Query Tool
```bash
# Search for content
python query_docs.py --search "authentication guide"
# Browse specific sections
python query_docs.py --section "api-reference"
# Get database statistics
python query_docs.py --stats
# List all sections
python query_docs.py --list-sections
# Export section to file
python query_docs.py --export-section "tutorials" --format markdown > tutorials.md
# Use custom database
python query_docs.py --db "custom/path/docs.db" --search "example"
```
### Debug Tools
```bash
# Test scraper functionality
python utils/debug_scraper.py
# Test MCP server
python utils/debug_mcp_server.py
# Test MCP tools directly
python utils/debug_mcp_client.py
# Test MCP protocol
python utils/debug_mcp_server_protocol.py
# Debug content extraction
python utils/debug_site_content.py
# Generate MCP config files
python utils/gen_mcp.py
```
## Database Schema
### Pages Table
```sql
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    markdown TEXT,
    word_count INTEGER,
    section TEXT,
    subsection TEXT,
    scraped_at TIMESTAMP,
    metadata TEXT
);
```
### Full-Text Search
```sql
-- Search using FTS5
SELECT * FROM pages_fts WHERE pages_fts MATCH 'your search term';
-- Or use the query tool: python query_docs.py --search "your search term"
```
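The `pages_fts` index used above is an FTS5 virtual table kept in sync with `pages`. Below is a minimal sketch of how such an index could be created, assuming an external-content FTS5 table keyed on `pages.id`; the scraper's actual DDL may differ:
```python
# Sketch only: one common way to build the pages_fts index referenced above.
import sqlite3

conn = sqlite3.connect("docs_db/documentation.db")
conn.executescript("""
    CREATE VIRTUAL TABLE IF NOT EXISTS pages_fts
        USING fts5(title, content, content='pages', content_rowid='id');
    -- Rebuild the index from the pages table.
    INSERT INTO pages_fts(pages_fts) VALUES('rebuild');
""")
conn.commit()
conn.close()
```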
## Example Use Cases
### 1. Documentation Analysis
```bash
# Get overview of documentation
python query_docs.py --stats
# Find all tutorial content
python query_docs.py --search "tutorial guide example"
# Export specific sections
python query_docs.py --export-section "getting-started" > onboarding.md
```
### 2. AI Integration with Claude
```python
# Once MCP is set up, ask Claude:
# "Search the documentation for authentication examples"
# "What sections are available in the documentation?"
# "Show me the content for the API reference page"
```
### 3. Custom Applications
```python
import sqlite3

# Connect to your scraped documentation
conn = sqlite3.connect('docs_db/documentation.db')

# Query for specific content
results = conn.execute("""
    SELECT title, url, markdown
    FROM pages
    WHERE section = 'tutorials'
      AND word_count > 500
    ORDER BY word_count DESC
""").fetchall()

# Build your own tools on top of the structured data
```
## Debugging and Testing
### Test Scraper Before Full Run
```bash
python utils/debug_scraper.py
```
### Validate Content Extraction
```bash
python utils/debug_site_content.py
```
### Test MCP Integration
```bash
# Test server functionality
python utils/debug_mcp_server.py
# Test tools directly
python utils/debug_mcp_client.py
# Test JSON-RPC protocol
python utils/debug_mcp_server_protocol.py
```
## Performance and Optimization
### Scraping Performance
- **Start small**: Use `max_pages=50` for testing (a sample test configuration follows this list)
- **Adjust depth**: `max_depth=2` covers most content efficiently
- **Rate limiting**: Increase `delay_between_requests` if getting blocked
- **Caching**: Enabled by default for resumable crawls
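For example, a cautious first run might use settings like these in `config.py` (values are illustrative, not recommendations for any specific site):
```python
# config.py: conservative settings for a first test run (values are illustrative)
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 2,                 # shallow crawl covers most content
    "max_pages": 50,                # small page budget while validating the setup
    "delay_between_requests": 1.0,  # back off further if the site rate-limits you
}
```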
### Database Performance
- **Full-text search**: Automatic FTS5 index for fast searching
- **Indexing**: Optimized indexes on URL and section columns
- **Word counts**: Pre-calculated for quick statistics
### MCP Performance
- **Configurable limits**: Set appropriate search and section limits
- **Snippet length**: Adjust snippet size for optimal response times
- **Connection pooling**: Efficient database connections
## Supported Documentation Sites
This scraper works with most documentation websites including:
- **Static sites**: Hugo, Jekyll, MkDocs, Docusaurus
- **Documentation platforms**: GitBook, Notion, Confluence
- **API docs**: Swagger/OpenAPI documentation
- **Wiki-style**: MediaWiki, TiddlyWiki
- **Custom sites**: Any site with consistent HTML structure
### Site-Specific Configuration
Customize URL filtering and content extraction for your target site:
```python
URL_FILTER_CONFIG = {
    "skip_patterns": [
        r'/api/',   # Skip API endpoint docs
        r'/edit/',  # Skip edit pages
        r'\.pdf$',  # Skip PDF files
    ],
    "allowed_domains": ["docs.yoursite.com"],
}

CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',      # Remove edit links
        r'Was this helpful\?.*?\n',  # Remove feedback sections
    ],
}
```
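As an illustration of how `remove_patterns` might be applied, the sketch below strips matching boilerplate with `re.sub`; the real cleaning logic lives in `docs_scraper.py` and may differ:
```python
# Illustrative only: applying remove_patterns with re.sub.
import re

from config import CONTENT_FILTER_CONFIG  # assumes the config shown above

def clean_content(text: str) -> str:
    """Strip boilerplate phrases from extracted page text."""
    for pattern in CONTENT_FILTER_CONFIG["remove_patterns"]:
        text = re.sub(pattern, "", text)
    return text

print(clean_content("Intro paragraph.\nEdit this page on GitHub\nReal content.\n"))
# -> "Intro paragraph.\nReal content.\n"
```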
## Contributing
We welcome contributions! Here are some areas where you can help:
- **New export formats**: PDF, EPUB, Word documents
- **Enhanced content filtering**: Better noise removal
- **Additional debug tools**: More comprehensive testing
- **Documentation**: Improve guides and examples
- **Performance optimizations**: Faster scraping and querying
## Responsible Usage
- **Respect robots.txt**: Check the target site's robots.txt file
- **Rate limiting**: Use appropriate delays between requests
- **Terms of service**: Respect the documentation site's terms
- **Fair use**: Use for educational, research, or personal purposes
- **Attribution**: Credit the original documentation source
## License
This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.
---
## Getting Started Examples
### Example 1: Scrape Python Documentation
```python
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://docs.python.org/3/",
    "max_pages": 500,
    "max_depth": 3,
}
```
### Example 2: Scrape API Documentation
```python
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://api-docs.example.com/",
    "max_pages": 200,
}

URL_FILTER_CONFIG = {
    "skip_patterns": [r'/changelog/', r'/releases/'],
}
```
### Example 3: Corporate Documentation
```python
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://internal-docs.company.com/",
    "output_dir": "company_docs",
}

MCP_CONFIG = {
    "server_name": "company-docs-server",
    "docs_display_name": "Company Internal Docs",
}
```
---
**Happy Documenting!**
For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.
---
## Attribution
This project is powered by **[Crawl4AI](https://github.com/unclecode/crawl4ai)** - an amazing open-source LLM-friendly web crawler and scraper.
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/>
</a>
Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to [@unclecode](https://github.com/unclecode) and the Crawl4AI community for building such an incredible tool!
**Check out Crawl4AI:**
- **Repository**: https://github.com/unclecode/crawl4ai
- **Documentation**: https://crawl4ai.com
- **Discord Community**: https://discord.gg/jP8KfhDhyN