Documentation Scraper & MCP Server
A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.
🚀 Features
Core Functionality
🌐 Universal Documentation Scraper: Works with any documentation website
📊 Structured Database: SQLite database with full-text search capabilities
🤖 MCP Server Integration: Native Claude Desktop integration via Model Context Protocol
📝 LLM-Optimized Output: Ready-to-use context files for AI applications
⚙️ Configuration-Driven: Single config file controls all settings
Advanced Tools
🔍 Query Interface: Command-line tool for searching and analyzing scraped content
🛠️ Debug Suite: Comprehensive debugging tools for testing and validation
📋 Auto-Configuration: Automatic MCP setup file generation
📈 Progress Tracking: Detailed logging and error handling
💾 Resumable Crawls: Smart caching for interrupted crawls
📋 Prerequisites
Python 3.8 or higher
Internet connection
~500MB free disk space per documentation site
🛠️ Quick Start
1. Installation
2. Configure Your Target
Edit config.py to set your documentation site:
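As a sketch of what that step might look like, here is a minimal set of module-level settings. The names (BASE_URL, MAX_PAGES, and so on) are illustrative assumptions; check the actual config.py for the names this project uses.

```python
# Hypothetical excerpt from config.py -- setting names are illustrative,
# not guaranteed to match the project's actual configuration.
BASE_URL = "https://docs.example.com"  # root of the documentation site
MAX_PAGES = 500                        # upper bound on pages to crawl
MAX_DEPTH = 3                          # how many links deep to follow
DELAY_BETWEEN_REQUESTS = 1.0           # seconds to wait between requests
DB_PATH = "docs.db"                    # SQLite output database
```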
3. Run the Scraper
4. Query Your Documentation
5. Set Up Claude Integration
🏗️ Project Structure
⚙️ Configuration
Main Configuration (config.py)
The entire system is controlled by a single configuration file:
Environment Overrides
You can override any setting with environment variables:
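One common way to implement this is to read each setting from the environment with a fallback default. The variable names below (DOCS_BASE_URL, DOCS_MAX_PAGES, DOCS_DELAY) are assumptions for illustration; the project's actual override names may differ.

```python
import os

# Sketch of environment-variable overrides with fallback defaults.
# Variable names are illustrative assumptions.
BASE_URL = os.environ.get("DOCS_BASE_URL", "https://docs.example.com")
MAX_PAGES = int(os.environ.get("DOCS_MAX_PAGES", "500"))
DELAY = float(os.environ.get("DOCS_DELAY", "1.0"))
```

With this pattern, `DOCS_MAX_PAGES=50 python scraper.py` would override the default without editing the file.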
🤖 Claude Desktop Integration
Automatic Setup
Generate configuration files:
python utils/gen_mcp.py
Copy the generated config to Claude Desktop:
Windows:
%APPDATA%\Claude\claude_desktop_config.json
macOS:
~/Library/Application Support/Claude/claude_desktop_config.json
Restart Claude Desktop
Manual Setup
If you prefer manual setup, add this to your Claude Desktop config:
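Claude Desktop expects an `mcpServers` map in its config file. The entry below is a sketch; the server script name and path are placeholders for wherever you cloned this project.

```json
{
  "mcpServers": {
    "documentation": {
      "command": "python",
      "args": ["/absolute/path/to/your/clone/mcp_server.py"]
    }
  }
}
```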
Available MCP Tools
Once connected, Claude can use these tools:
🔍 search_documentation: Search for content across all documentation
📚 get_documentation_sections: List all available sections
📄 get_page_content: Get full content of specific pages
🗂️ browse_section: Browse pages within a section
📊 get_documentation_stats: Get database statistics
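To illustrate how a tool like search_documentation can be backed by the SQLite database, here is a minimal sketch using FTS5. The table and column names (`pages`, `pages_fts`) are assumptions, not the project's confirmed schema.

```python
import sqlite3

# Minimal sketch of an MCP search tool backed by SQLite FTS5.
# Table/column names (pages, pages_fts) are illustrative assumptions.
def search_documentation(db_path: str, query: str, limit: int = 5):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT p.url, p.title,
                  snippet(pages_fts, 2, '[', ']', '...', 12)
           FROM pages_fts
           JOIN pages p ON p.id = pages_fts.rowid
           WHERE pages_fts MATCH ?
           LIMIT ?""",
        (query, limit),
    ).fetchall()
    conn.close()
    return rows
```

Each row returns the page URL, title, and a highlighted snippet, which maps naturally onto an MCP tool response.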
🔧 Command Line Tools
Documentation Scraper
Query Tool
Debug Tools
📊 Database Schema
Pages Table
Full-Text Search
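The schema below is a plausible reconstruction of the pages table and its FTS5 index based on the features described above (URL and section indexes, pre-calculated word counts); the actual column names may differ.

```python
import sqlite3

# Hypothetical reconstruction of the database schema; actual
# column names in this project may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    section TEXT,
    content TEXT,
    word_count INTEGER
);
CREATE INDEX IF NOT EXISTS idx_pages_url ON pages(url);
CREATE INDEX IF NOT EXISTS idx_pages_section ON pages(section);
-- External-content FTS5 table kept in sync with pages by rowid
CREATE VIRTUAL TABLE IF NOT EXISTS pages_fts USING fts5(
    title, content, content='pages', content_rowid='id'
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```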
🎯 Example Use Cases
1. Documentation Analysis
2. AI Integration with Claude
3. Custom Applications
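As one example of a custom application, any script can query the scraped database directly. The sketch below builds a per-section word-count report, assuming the hypothetical `pages` schema with `section` and `word_count` columns.

```python
import sqlite3

# Example custom application: per-section statistics from the scraped
# database. Table/column names are illustrative assumptions.
def section_stats(db_path: str):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT section, COUNT(*) AS pages, SUM(word_count) AS words
           FROM pages
           GROUP BY section
           ORDER BY words DESC"""
    ).fetchall()
    conn.close()
    return rows
```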
🔍 Debugging and Testing
Test Scraper Before Full Run
Validate Content Extraction
Test MCP Integration
📈 Performance and Optimization
Scraping Performance
Start small: Use max_pages=50 for testing
Adjust depth: max_depth=2 covers most content efficiently
Rate limiting: Increase delay_between_requests if getting blocked
Caching: Enabled by default for resumable crawls
Database Performance
Full-text search: Automatic FTS5 index for fast searching
Indexing: Optimized indexes on URL and section columns
Word counts: Pre-calculated for quick statistics
MCP Performance
Configurable limits: Set appropriate search and section limits
Snippet length: Adjust snippet size for optimal response times
Connection pooling: Efficient database connections
🌐 Supported Documentation Sites
This scraper works with most documentation websites including:
Static sites: Hugo, Jekyll, MkDocs, Docusaurus
Documentation platforms: GitBook, Notion, Confluence
API docs: Swagger/OpenAPI documentation
Wiki-style: MediaWiki, TiddlyWiki
Custom sites: Any site with consistent HTML structure
Site-Specific Configuration
Customize URL filtering and content extraction for your target site:
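A simple way to express such filtering is a pair of include/exclude regex lists. The patterns and function below are a sketch under assumed names, not the project's confirmed API.

```python
import re

# Illustrative URL filter; the real project's filtering config may differ.
INCLUDE_PATTERNS = [r"^https://docs\.example\.com/"]
EXCLUDE_PATTERNS = [r"/blog/", r"\.(png|jpg|zip)$"]

def should_crawl(url: str) -> bool:
    """Crawl only URLs that match an include pattern and no exclude pattern."""
    if not any(re.search(p, url) for p in INCLUDE_PATTERNS):
        return False
    return not any(re.search(p, url) for p in EXCLUDE_PATTERNS)
```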
🤝 Contributing
We welcome contributions! Here are some areas where you can help:
New export formats: PDF, EPUB, Word documents
Enhanced content filtering: Better noise removal
Additional debug tools: More comprehensive testing
Documentation: Improve guides and examples
Performance optimizations: Faster scraping and querying
⚠️ Responsible Usage
Respect robots.txt: Check the target site's robots.txt file
Rate limiting: Use appropriate delays between requests
Terms of service: Respect the documentation site's terms
Fair use: Use for educational, research, or personal purposes
Attribution: Credit the original documentation source
📄 License
This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.
🎉 Getting Started Examples
Example 1: Scrape Python Documentation
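A hypothetical config.py for this example might look like the following; the setting names are illustrative assumptions, and the modest page limit plus delay keeps the crawl polite toward python.org.

```python
# Hypothetical config.py values for scraping the Python documentation
# (setting names are illustrative assumptions):
BASE_URL = "https://docs.python.org/3/"
MAX_PAGES = 200                 # keep the first crawl small
MAX_DEPTH = 2
DELAY_BETWEEN_REQUESTS = 1.0    # be polite to python.org
DB_PATH = "python_docs.db"
```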
Example 2: Scrape API Documentation
Example 3: Corporate Documentation
Happy Documenting! 📚✨
For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.
🙏 Attribution
This project is powered by Crawl4AI - an amazing open-source LLM-friendly web crawler and scraper.
Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool! 🚀
Check out Crawl4AI:
Repository: https://github.com/unclecode/crawl4ai
Documentation: https://crawl4ai.com
Discord Community: https://discord.gg/jP8KfhDhyN