Documentation Scraper & MCP Server
A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.
Features
Core Functionality
- Universal Documentation Scraper: Works with any documentation website
- Structured Database: SQLite database with full-text search capabilities
- MCP Server Integration: Native Claude Desktop integration via Model Context Protocol
- LLM-Optimized Output: Ready-to-use context files for AI applications
- Configuration-Driven: Single config file controls all settings
Advanced Tools
- Query Interface: Command-line tool for searching and analyzing scraped content
- Debug Suite: Comprehensive debugging tools for testing and validation
- Auto-Configuration: Automatic MCP setup file generation
- Progress Tracking: Detailed logging and error handling
- Resumable Crawls: Smart caching for interrupted crawls
Prerequisites
- Python 3.8 or higher
- Internet connection
- ~500 MB free disk space per documentation site
Quick Start
1. Installation
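For example, assuming a standard Python project layout with a `requirements.txt` (the exact commands and paths may differ):

```bash
# Clone the repository and install dependencies (URL and folder are placeholders)
git clone <repository-url>
cd <repository-directory>
pip install -r requirements.txt

# Recent Crawl4AI releases provide a post-install step that fetches browser dependencies
crawl4ai-setup
```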
2. Configure Your Target
Edit `config.py` to set your documentation site:
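At minimum, point the scraper at the site you want to crawl; the setting name below is illustrative rather than the project's actual identifier:

```python
# config.py
base_url = "https://docs.example.com"   # the documentation site to scrape
```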
3. Run the Scraper
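For example (the script name is a placeholder; everything else comes from `config.py`):

```bash
python scraper.py
```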
4. Query Your Documentation
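For example, with a hypothetical query script:

```bash
python query.py "authentication"   # full-text search over the scraped pages
```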
5. Set Up Claude Integration
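The MCP configuration can be generated automatically (see Claude Desktop Integration below for where to copy the result):

```bash
python utils/gen_mcp.py
```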
Project Structure
Configuration
Main Configuration (`config.py`)
The entire system is controlled by a single configuration file:
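A rough sketch of the kinds of settings involved; apart from `max_pages`, `max_depth`, and `delay_between_requests` (referenced under Performance and Optimization below), the names are illustrative:

```python
# config.py -- illustrative settings only
base_url = "https://docs.example.com"   # site to scrape
max_pages = 500                         # upper bound on crawled pages
max_depth = 3                           # how far to follow links
delay_between_requests = 1.0            # polite rate limiting, in seconds
database_path = "docs.db"               # SQLite output with FTS index
search_limit = 10                       # MCP: maximum search results returned
snippet_length = 300                    # MCP: characters per result snippet
```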
Environment Overrides
You can override any setting with environment variables:
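For example (the variable names are assumed to mirror the config keys above):

```bash
export BASE_URL="https://docs.example.com"
export MAX_PAGES=100
export DELAY_BETWEEN_REQUESTS=2.0
python scraper.py
```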
Claude Desktop Integration
Automatic Setup
1. Generate the configuration files: `python utils/gen_mcp.py`
2. Copy the generated config to Claude Desktop:
   - Windows: `%APPDATA%\Claude\claude_desktop_config.json`
   - macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
3. Restart Claude Desktop.
Manual Setup
If you prefer manual setup, add this to your Claude Desktop config:
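The file uses Claude Desktop's standard `mcpServers` layout; the server name and script path below are placeholders:

```json
{
  "mcpServers": {
    "documentation-scraper": {
      "command": "python",
      "args": ["/absolute/path/to/mcp_server.py"]
    }
  }
}
```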
Available MCP Tools
Once connected, Claude can use these tools:
- `search_documentation`: Search for content across all documentation
- `get_documentation_sections`: List all available sections
- `get_page_content`: Get full content of specific pages
- `browse_section`: Browse pages within a section
- `get_documentation_stats`: Get database statistics
Command Line Tools
Documentation Scraper
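For example (the script name is a placeholder; rerunning resumes an interrupted crawl thanks to the built-in caching):

```bash
python scraper.py
```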
Query Tool
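For example (script name and arguments are placeholders):

```bash
python query.py "error handling"   # full-text search across scraped pages
python query.py --stats            # database statistics
```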
Debug Tools
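For example (script names are placeholders; see Debugging and Testing below for what each step checks):

```bash
python debug/test_scraper.py      # small trial crawl before a full run
python debug/test_extraction.py   # validate content extraction on a few pages
python debug/test_mcp.py          # exercise the MCP server tools locally
```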
Database Schema
Pages Table
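A plausible shape, based on the columns referenced elsewhere in this README (URL, section, word count); the exact names may differ:

```sql
-- Illustrative schema; column names beyond url, section and word_count are assumptions
CREATE TABLE IF NOT EXISTS pages (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    url         TEXT UNIQUE NOT NULL,
    title       TEXT,
    section     TEXT,
    content     TEXT,
    word_count  INTEGER,
    scraped_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_pages_url     ON pages(url);
CREATE INDEX IF NOT EXISTS idx_pages_section ON pages(section);
```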
Full-Text Search
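A sketch of how such an index might be defined and queried, following the hypothetical schema above:

```sql
-- FTS5 index over the pages table (exact setup is an assumption)
CREATE VIRTUAL TABLE IF NOT EXISTS pages_fts USING fts5(
    title, section, content, content='pages', content_rowid='id'
);

-- Rank-ordered matches with a short snippet from the content column
SELECT p.url, p.title,
       snippet(pages_fts, 2, '[', ']', ' ... ', 20) AS excerpt
FROM pages_fts
JOIN pages p ON p.id = pages_fts.rowid
WHERE pages_fts MATCH 'authentication'
ORDER BY rank
LIMIT 10;
```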
Example Use Cases
1. Documentation Analysis
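For example, straight against the SQLite database, using the hypothetical schema above:

```sql
-- How many pages and words each section contains
SELECT section, COUNT(*) AS pages, SUM(word_count) AS words
FROM pages
GROUP BY section
ORDER BY words DESC;
```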
2. AI Integration with Claude
3. Custom Applications
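For instance, a small script built on the scraped database could export one section as a single LLM context file (table and column names follow the hypothetical schema above):

```python
# custom_export.py -- illustrative only
import sqlite3

conn = sqlite3.connect("docs.db")   # database path is an assumption

rows = conn.execute(
    "SELECT title, url, content FROM pages WHERE section = ? ORDER BY url",
    ("getting-started",),
)

# Write each page as a titled block so an LLM can cite its source
with open("context_getting_started.txt", "w", encoding="utf-8") as out:
    for title, url, content in rows:
        out.write(f"# {title}\n# Source: {url}\n\n{content}\n\n")
```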
Debugging and Testing
Test Scraper Before Full Run
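One low-risk approach, using the trial settings suggested under Performance and Optimization below (script names as assumed earlier):

```bash
# Temporarily set max_pages = 50 and max_depth = 2 in config.py, then:
python scraper.py
python query.py "installation"   # spot-check that real content came through
```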
Validate Content Extraction
Test MCP Integration
Performance and Optimization
Scraping Performance
- Start small: Use `max_pages=50` for testing
- Adjust depth: `max_depth=2` covers most content efficiently
- Rate limiting: Increase `delay_between_requests` if getting blocked
- Caching: Enabled by default for resumable crawls
Database Performance
- Full-text search: Automatic FTS5 index for fast searching
- Indexing: Optimized indexes on URL and section columns
- Word counts: Pre-calculated for quick statistics
MCP Performance
- Configurable limits: Set appropriate search and section limits
- Snippet length: Adjust snippet size for optimal response times
- Connection pooling: Efficient database connections
Supported Documentation Sites
This scraper works with most documentation websites including:
- Static sites: Hugo, Jekyll, MkDocs, Docusaurus
- Documentation platforms: GitBook, Notion, Confluence
- API docs: Swagger/OpenAPI documentation
- Wiki-style: MediaWiki, TiddlyWiki
- Custom sites: Any site with consistent HTML structure
Site-Specific Configuration
Customize URL filtering and content extraction for your target site:
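For example (all keys below are illustrative, not the project's actual names):

```python
# config.py -- hypothetical site-specific filters
include_url_patterns = [r"/docs/", r"/guide/"]      # only crawl documentation paths
exclude_url_patterns = [r"/blog/", r"/changelog/"]  # skip noisy sections
content_selector = "main.article-content"           # CSS selector for the page body
remove_selectors = ["nav", "footer", ".sidebar"]    # strip navigation and chrome
```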
Contributing
We welcome contributions! Here are some areas where you can help:
- New export formats: PDF, EPUB, Word documents
- Enhanced content filtering: Better noise removal
- Additional debug tools: More comprehensive testing
- Documentation: Improve guides and examples
- Performance optimizations: Faster scraping and querying
Responsible Usage
- Respect robots.txt: Check the target site's robots.txt file
- Rate limiting: Use appropriate delays between requests
- Terms of service: Respect the documentation site's terms
- Fair use: Use for educational, research, or personal purposes
- Attribution: Credit the original documentation source
License
This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.
Getting Started Examples
Example 1: Scrape Python Documentation
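For example, a `config.py` along these lines (values are illustrative):

```python
base_url = "https://docs.python.org/3/"
max_pages = 200
max_depth = 3
delay_between_requests = 1.0
```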
Example 2: Scrape API Documentation
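For example (values are illustrative):

```python
base_url = "https://api.example.com/docs/"   # a Swagger/OpenAPI reference site
max_pages = 100
max_depth = 2
```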
Example 3: Corporate Documentation
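For example (values are illustrative):

```python
base_url = "https://docs.internal.example.com/"
max_pages = 500
delay_between_requests = 2.0   # be gentle with internal infrastructure
```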
Happy Documenting!
For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.
Attribution
This project is powered by Crawl4AI - an amazing open-source LLM-friendly web crawler and scraper.
Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool!
Check out Crawl4AI:
- Repository: https://github.com/unclecode/crawl4ai
- Documentation: https://crawl4ai.com
- Discord Community: https://discord.gg/jP8KfhDhyN