Documentation Scraper & MCP Server
A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.
🚀 Features
Core Functionality
- 🌐 Universal Documentation Scraper: Works with most documentation websites
- 📊 Structured Database: SQLite database with full-text search capabilities
- 🤖 MCP Server Integration: Native Claude Desktop integration via Model Context Protocol
- 📝 LLM-Optimized Output: Ready-to-use context files for AI applications
- ⚙️ Configuration-Driven: Single config file controls all settings
Advanced Tools
- 🔍 Query Interface: Command-line tool for searching and analyzing scraped content
- 🛠️ Debug Suite: Comprehensive debugging tools for testing and validation
- 📋 Auto-Configuration: Automatic MCP setup file generation
- 📈 Progress Tracking: Detailed logging and error handling
- 💾 Resumable Crawls: Smart caching for interrupted crawls
📋 Prerequisites
- Python 3.8 or higher
- Internet connection
- ~500MB free disk space per documentation site
🛠️ Quick Start
1. Installation
2. Configure Your Target
Edit config.py to set your documentation site:
3. Run the Scraper
4. Query Your Documentation
5. Set Up Claude Integration
🏗️ Project Structure
⚙️ Configuration
Main Configuration (config.py)
The entire system is controlled by a single configuration file:
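As a rough sketch only, a single-file configuration could look like the following. The setting names max_pages, max_depth, and delay_between_requests come from the performance notes in this README; everything else (ScraperConfig, SCRAPER_CONFIG, base_url, output_dir, the example URL, and all default values) is a hypothetical placeholder and may not match the project's actual config.py.

```python
# Hypothetical sketch of a single-file configuration.
# This is NOT the project's real config.py; names are assumptions unless noted.
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ScraperConfig:
    # Assumed names, for illustration only.
    base_url: str = "https://docs.example.com/"
    output_dir: str = "docs_db"
    # These three setting names are mentioned in this README.
    max_pages: int = 50
    max_depth: int = 2
    delay_between_requests: float = 0.5


# One shared instance that the scraper, query tool, and MCP server could import.
SCRAPER_CONFIG = ScraperConfig()

if __name__ == "__main__":
    for key, value in asdict(SCRAPER_CONFIG).items():
        print(f"{key} = {value}")
```

Keeping every setting in one frozen object means each tool reads the same values and cannot change them by accident at runtime.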
Environment Overrides
You can override any setting with environment variables:
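One way such overrides could work is sketched below. The DOCS_ prefix, the upper-casing rule, the load_settings helper, and the default values are all assumptions made for illustration; check the project's config.py for the variable names it actually reads.

```python
import os

# Hypothetical defaults; only the three setting names come from this README.
DEFAULTS = {"max_pages": 50, "max_depth": 2, "delay_between_requests": 0.5}


def load_settings(environ=None, prefix="DOCS_"):
    """Return DEFAULTS with any matching environment variables applied.

    A variable named PREFIX + SETTING_NAME in upper case overrides the default.
    The prefix and naming rule are assumptions made for this sketch.
    """
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        raw = environ.get(prefix + key.upper())
        if raw is not None:
            # Cast to the type of the default, so "100" becomes the int 100.
            settings[key] = type(default)(raw)
    return settings


if __name__ == "__main__":
    print(load_settings({"DOCS_MAX_PAGES": "100"}))
```

Casting each value to the type of its default keeps numeric settings numeric even though environment variables are always strings.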
🤖 Claude Desktop Integration
Automatic Setup
- Generate configuration files:
- Copy the generated config to Claude Desktop:
  - Windows: %APPDATA%\Claude\claude_desktop_config.json
  - macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Restart Claude Desktop
Manual Setup
If you prefer manual setup, add this to your Claude Desktop config:
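Claude Desktop reads MCP server entries from a top-level mcpServers object in claude_desktop_config.json, where each entry has a command and an args list. The sketch below generates such an entry; the server name "docs" and the script path are placeholders, not this project's actual values.

```python
import json


def build_claude_config(server_name, python_exe, server_script):
    """Build an mcpServers entry in the shape Claude Desktop expects."""
    return {
        "mcpServers": {
            server_name: {
                "command": python_exe,
                "args": [server_script],
            }
        }
    }


if __name__ == "__main__":
    # Placeholder name and paths; substitute your own.
    config = build_claude_config("docs", "python", "/path/to/mcp_server.py")
    print(json.dumps(config, indent=2))
```

If your config file already lists other servers, merge the new entry into the existing mcpServers object instead of overwriting the file.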
Available MCP Tools
Once connected, Claude can use these tools:
- 🔍 search_documentation: Search for content across all documentation
- 📚 get_documentation_sections: List all available sections
- 📄 get_page_content: Get full content of specific pages
- 🗂️ browse_section: Browse pages within a section
- 📊 get_documentation_stats: Get database statistics
🔧 Command Line Tools
Documentation Scraper
Query Tool
Debug Tools
📊 Database Schema
Pages Table
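The performance notes in this README mention indexes on the URL and section columns and pre-calculated word counts. A minimal sketch of a table with those properties follows; the exact column names, the title and content columns, and the helper functions are assumptions, so the real schema may differ.

```python
import sqlite3

# Assumed schema: url, section, and word_count are implied by this README;
# id, title, and content are illustrative guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    id         INTEGER PRIMARY KEY,
    url        TEXT UNIQUE NOT NULL,   -- UNIQUE also creates an index on url
    title      TEXT,
    section    TEXT,
    content    TEXT,
    word_count INTEGER
);
CREATE INDEX IF NOT EXISTS idx_pages_section ON pages(section);
"""


def create_pages_db(path=":memory:"):
    """Open (or create) the database and make sure the pages table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_page(conn, url, title, section, content):
    """Insert or replace a page, storing its word count at write time."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, title, section, content, word_count) "
        "VALUES (?, ?, ?, ?, ?)",
        (url, title, section, content, len(content.split())),
    )
    conn.commit()
```

Storing word_count on insert is what makes statistics queries cheap: they can sum a column instead of re-reading every page body.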
Full-Text Search
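A minimal sketch of full-text search with SQLite's FTS5 extension, assuming your Python build ships SQLite with FTS5 enabled (most do). The table name pages_fts, its columns, and both helper functions are hypothetical and stand in for whatever the project actually defines.

```python
import sqlite3


def build_index(pages):
    """Create an in-memory FTS5 index from (url, title, content) tuples."""
    conn = sqlite3.connect(":memory:")
    # url is stored but not tokenized; title and content are searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE pages_fts USING fts5(url UNINDEXED, title, content)"
    )
    conn.executemany(
        "INSERT INTO pages_fts (url, title, content) VALUES (?, ?, ?)", pages
    )
    return conn


def search(conn, query, limit=5):
    """Return (url, snippet) rows for the query, best matches first."""
    rows = conn.execute(
        # snippet() args: table, column index (2 = content), open and close
        # markers, ellipsis text, and the maximum number of tokens to return.
        "SELECT url, snippet(pages_fts, 2, '[', ']', '...', 8) "
        "FROM pages_fts WHERE pages_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_index([
        ("https://example.com/install", "Install", "Run pip install to set up"),
        ("https://example.com/usage", "Usage", "Query the database"),
    ])
    for url, snip in search(conn, "install"):
        print(url, snip)
```

Ordering by rank uses FTS5's built-in relevance ranking, and the token limit in snippet() keeps responses short.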
🎯 Example Use Cases
1. Documentation Analysis
2. AI Integration with Claude
3. Custom Applications
🔍 Debugging and Testing
Test Scraper Before Full Run
Validate Content Extraction
Test MCP Integration
📈 Performance and Optimization
Scraping Performance
- Start small: Use max_pages=50 for testing
- Adjust depth: max_depth=2 covers most content efficiently
- Rate limiting: Increase delay_between_requests if getting blocked
- Caching: Enabled by default for resumable crawls
Database Performance
- Full-text search: Automatic FTS5 index for fast searching
- Indexing: Optimized indexes on URL and section columns
- Word counts: Pre-calculated for quick statistics
MCP Performance
- Configurable limits: Set appropriate search and section limits
- Snippet length: Adjust snippet size for optimal response times
- Connection pooling: Efficient database connections
🌐 Supported Documentation Sites
This scraper works with most documentation websites including:
- Static sites: Hugo, Jekyll, MkDocs, Docusaurus
- Documentation platforms: GitBook, Notion, Confluence
- API docs: Swagger/OpenAPI documentation
- Wiki-style: MediaWiki, TiddlyWiki
- Custom sites: Any site with consistent HTML structure
Site-Specific Configuration
Customize URL filtering and content extraction for your target site:
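A hedged sketch of what URL filtering can look like. The INCLUDE and EXCLUDE pattern lists and the should_crawl helper are hypothetical names chosen for this example; adapt the patterns to the structure of your target site.

```python
import re
from urllib.parse import urldefrag, urlparse

# Hypothetical patterns: keep pages under /docs/, skip binary assets and search.
INCLUDE = [re.compile(r"^/docs/")]
EXCLUDE = [re.compile(r"\.(png|jpg|pdf|zip)$"), re.compile(r"/search")]


def should_crawl(url, allowed_host):
    """Return True if the URL is on the allowed host and passes both filters."""
    url, _fragment = urldefrag(url)  # "#section" anchors point at the same page
    parts = urlparse(url)
    if parts.netloc != allowed_host:
        return False
    path = parts.path or "/"
    if any(pattern.search(path) for pattern in EXCLUDE):
        return False
    return any(pattern.search(path) for pattern in INCLUDE)


if __name__ == "__main__":
    for candidate in (
        "https://docs.example.com/docs/intro#setup",
        "https://docs.example.com/docs/logo.png",
        "https://docs.example.com/blog/post",
    ):
        print(candidate, should_crawl(candidate, "docs.example.com"))
```

Checking exclusions before inclusions lets a narrow deny rule (for example, image files) win over a broad allow rule.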
🤝 Contributing
We welcome contributions! Here are some areas where you can help:
- New export formats: PDF, EPUB, Word documents
- Enhanced content filtering: Better noise removal
- Additional debug tools: More comprehensive testing
- Documentation: Improve guides and examples
- Performance optimizations: Faster scraping and querying
⚠️ Responsible Usage
- Respect robots.txt: Check the target site's robots.txt file
- Rate limiting: Use appropriate delays between requests
- Terms of service: Respect the documentation site's terms
- Fair use: Use for educational, research, or personal purposes
- Attribution: Credit the original documentation source
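The robots.txt check in the list above can be automated with Python's standard urllib.robotparser module. This is a minimal sketch: the make_robots_checker helper is a hypothetical name, and whether the scraper performs this check for you is not stated in this README.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent="*"):
    """Parse robots.txt text; return (allowed(url) function, crawl delay or None)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed, parser.crawl_delay(user_agent)


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
    allowed, delay = make_robots_checker(sample)
    print(allowed("https://docs.example.com/docs/intro"))
    print(allowed("https://docs.example.com/private/notes"))
    print(delay)
```

Against a live site, call set_url() with the site's robots.txt address and then read() instead of parse(), and use any reported crawl delay as a lower bound for delay_between_requests.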
📄 License
This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.
🎉 Getting Started Examples
Example 1: Scrape Python Documentation
Example 2: Scrape API Documentation
Example 3: Corporate Documentation
Happy Documenting! 📚✨
For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.
🙏 Attribution
This project is powered by Crawl4AI - an amazing open-source LLM-friendly web crawler and scraper.
Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool! 🚀
Check out Crawl4AI:
- Repository: https://github.com/unclecode/crawl4ai
- Documentation: https://crawl4ai.com
- Discord Community: https://discord.gg/jP8KfhDhyN