Crawl4Claude

Documentation Scraper & MCP Server

A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.

🚀 Features

Core Functionality

  • 🌐 Universal Documentation Scraper: Works with any documentation website
  • 📊 Structured Database: SQLite database with full-text search capabilities
  • 🤖 MCP Server Integration: Native Claude Desktop integration via Model Context Protocol
  • 📝 LLM-Optimized Output: Ready-to-use context files for AI applications
  • ⚙️ Configuration-Driven: Single config file controls all settings

Advanced Tools

  • 🔍 Query Interface: Command-line tool for searching and analyzing scraped content
  • 🛠️ Debug Suite: Comprehensive debugging tools for testing and validation
  • 📋 Auto-Configuration: Automatic MCP setup file generation
  • 📈 Progress Tracking: Detailed logging and error handling
  • 💾 Resumable Crawls: Smart caching for interrupted crawls

📋 Prerequisites

  • Python 3.8 or higher
  • Internet connection
  • ~500MB free disk space per documentation site

🛠️ Quick Start

1. Installation

```bash
# Clone the repository
git clone <repository-url>
cd documentation-scraper

# Install dependencies
pip install -r requirements.txt
```

2. Configure Your Target

Edit config.py to set your documentation site:

```python
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",  # Your documentation site
    "output_dir": "docs_db",
    "max_pages": 200,
    # ... other settings
}
```

3. Run the Scraper

```bash
python docs_scraper.py
```

4. Query Your Documentation

```bash
# Search for content
python query_docs.py --search "tutorial"

# Browse by section
python query_docs.py --section "getting-started"

# Get statistics
python query_docs.py --stats
```

5. Set Up Claude Integration

```bash
# Generate MCP configuration files
python utils/gen_mcp.py

# Follow the instructions to add to Claude Desktop
```

🏗️ Project Structure

```
📁 documentation-scraper/
├── 📄 config.py                        # Central configuration file
├── 🕷️ docs_scraper.py                  # Main scraper script
├── 🔍 query_docs.py                    # Query and analysis tool
├── 🤖 mcp_docs_server.py               # MCP server for Claude integration
├── 📋 requirements.txt                 # Python dependencies
├── 📁 utils/                           # Debug and utility tools
│   ├── 🛠️ gen_mcp.py                   # Generate MCP config files
│   ├── 🧪 debug_scraper.py             # Test scraper functionality
│   ├── 🔧 debug_mcp_server.py          # Debug MCP server
│   ├── 🎯 debug_mcp_client.py          # Test MCP tools directly
│   ├── 📡 debug_mcp_server_protocol.py # Test MCP via JSON-RPC
│   └── 🌐 debug_site_content.py        # Debug content extraction
├── 📁 docs_db/                         # Generated documentation database
│   ├── 📊 documentation.db             # SQLite database
│   ├── 📄 documentation.json           # JSON export
│   ├── 📋 scrape_summary.json          # Statistics
│   └── 📁 llm_context/                 # LLM-ready context files
└── 📁 mcp/                             # Generated MCP configuration
    ├── 🔧 run_mcp_server.bat           # Windows launcher script
    └── ⚙️ claude_mcp_config.json       # Claude Desktop config
```

⚙️ Configuration

Main Configuration (config.py)

The entire system is controlled by a single configuration file:

```python
# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}

# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}
```

Environment Overrides

You can override any setting with environment variables:

```bash
export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
```
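As a rough illustration (not the project's actual code), overrides of this kind are typically resolved at import time in config.py. Only DOCS_DB_PATH and DOCS_BASE_URL are named in this README; the structure below is an assumption:

```python
# Hypothetical sketch of how config.py might honor the environment overrides above.
import os

SCRAPER_CONFIG = {
    "base_url": os.environ.get("DOCS_BASE_URL", "https://docs.example.com/"),
    "output_dir": "docs_db",
}

# Resolve the database path the same way, falling back to the output directory.
DB_PATH = os.environ.get(
    "DOCS_DB_PATH",
    os.path.join(SCRAPER_CONFIG["output_dir"], "documentation.db"),
)
```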

🤖 Claude Desktop Integration

Automatic Setup

  1. Generate configuration files:
    python utils/gen_mcp.py
  2. Copy the generated config to Claude Desktop:
    • Windows: %APPDATA%\Claude\claude_desktop_config.json
    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  3. Restart Claude Desktop

Manual Setup

If you prefer manual setup, add this to your Claude Desktop config:

{ "mcpServers": { "docs": { "command": "python", "args": ["path/to/mcp_docs_server.py"], "cwd": "path/to/project", "env": { "DOCS_DB_PATH": "path/to/docs_db/documentation.db" } } } }

Available MCP Tools

Once connected, Claude can use these tools:

  • 🔍 search_documentation: Search for content across all documentation
  • 📚 get_documentation_sections: List all available sections
  • 📄 get_page_content: Get full content of specific pages
  • 🗂️ browse_section: Browse pages within a section
  • 📊 get_documentation_stats: Get database statistics
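For orientation, here is a minimal, hypothetical sketch of how a tool like search_documentation could be exposed over MCP. It assumes the official MCP Python SDK's FastMCP helper and that pages_fts (see the Database Schema section below) is an FTS5 index whose rowid lines up with pages.id; the project's actual mcp_docs_server.py may be structured differently:

```python
# Hypothetical sketch only - not the project's mcp_docs_server.py.
# Assumes the official MCP Python SDK ("pip install mcp") and an FTS5 table
# pages_fts whose rowid matches pages.id (both are assumptions).
import os
import sqlite3

from mcp.server.fastmcp import FastMCP

DB_PATH = os.environ.get("DOCS_DB_PATH", "docs_db/documentation.db")
mcp = FastMCP("docs-server")


@mcp.tool()
def search_documentation(query: str, limit: int = 10) -> list:
    """Full-text search across the scraped documentation."""
    with sqlite3.connect(DB_PATH) as conn:
        conn.row_factory = sqlite3.Row
        rows = conn.execute(
            "SELECT p.title, p.url, "
            "snippet(pages_fts, -1, '[', ']', '...', 20) AS snippet "
            "FROM pages_fts JOIN pages p ON p.id = pages_fts.rowid "
            "WHERE pages_fts MATCH ? LIMIT ?",
            (query, limit),
        ).fetchall()
    return [dict(row) for row in rows]


if __name__ == "__main__":
    mcp.run()  # stdio transport, which is what Claude Desktop launches
```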

🔧 Command Line Tools

Documentation Scraper

```bash
# Basic scraping
python docs_scraper.py

# To override settings, edit config.py and run again (all settings come from config.py)
python docs_scraper.py
```

Query Tool

```bash
# Search for content
python query_docs.py --search "authentication guide"

# Browse specific sections
python query_docs.py --section "api-reference"

# Get database statistics
python query_docs.py --stats

# List all sections
python query_docs.py --list-sections

# Export section to file
python query_docs.py --export-section "tutorials" --format markdown > tutorials.md

# Use custom database
python query_docs.py --db "custom/path/docs.db" --search "example"
```

Debug Tools

```bash
# Test scraper functionality
python utils/debug_scraper.py

# Test MCP server
python utils/debug_mcp_server.py

# Test MCP tools directly
python utils/debug_mcp_client.py

# Test MCP protocol
python utils/debug_mcp_server_protocol.py

# Debug content extraction
python utils/debug_site_content.py

# Generate MCP config files
python utils/gen_mcp.py
```

📊 Database Schema

Pages Table

```sql
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    markdown TEXT,
    word_count INTEGER,
    section TEXT,
    subsection TEXT,
    scraped_at TIMESTAMP,
    metadata TEXT
);
```

Full-Text Search

```sql
-- Search using FTS5
SELECT * FROM pages_fts WHERE pages_fts MATCH 'your search term';
```

```bash
# Or use the query tool
python query_docs.py --search "your search term"
```

🎯 Example Use Cases

1. Documentation Analysis

```bash
# Get an overview of the documentation
python query_docs.py --stats

# Find all tutorial content
python query_docs.py --search "tutorial guide example"

# Export specific sections
python query_docs.py --export-section "getting-started" > onboarding.md
```

2. AI Integration with Claude

```
# Once MCP is set up, ask Claude:
# "Search the documentation for authentication examples"
# "What sections are available in the documentation?"
# "Show me the content for the API reference page"
```

3. Custom Applications

```python
import sqlite3

# Connect to your scraped documentation
conn = sqlite3.connect('docs_db/documentation.db')

# Query for specific content
results = conn.execute("""
    SELECT title, url, markdown
    FROM pages
    WHERE section = 'tutorials' AND word_count > 500
    ORDER BY word_count DESC
""").fetchall()

# Build your own tools on top of the structured data
```

🔍 Debugging and Testing

Test Scraper Before Full Run

```bash
python utils/debug_scraper.py
```

Validate Content Extraction

```bash
python utils/debug_site_content.py
```

Test MCP Integration

```bash
# Test server functionality
python utils/debug_mcp_server.py

# Test tools directly
python utils/debug_mcp_client.py

# Test JSON-RPC protocol
python utils/debug_mcp_server_protocol.py
```

📈 Performance and Optimization

Scraping Performance

  • Start small: Use max_pages=50 for testing
  • Adjust depth: max_depth=2 covers most content efficiently
  • Rate limiting: Increase delay_between_requests if getting blocked
  • Caching: Enabled by default for resumable crawls

Database Performance

  • Full-text search: Automatic FTS5 index for fast searching
  • Indexing: Optimized indexes on URL and section columns
  • Word counts: Pre-calculated for quick statistics (see the sketch after this list)
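Because word counts are stored per page, section-level statistics reduce to a single aggregate query. A minimal sketch, using the column names from the pages schema shown earlier (the output formatting is illustrative):

```python
# Sketch: per-section statistics from the pre-computed word_count column.
import sqlite3

conn = sqlite3.connect("docs_db/documentation.db")
rows = conn.execute(
    """
    SELECT section, COUNT(*) AS pages, SUM(word_count) AS words
    FROM pages
    GROUP BY section
    ORDER BY words DESC
    """
).fetchall()

for section, pages, words in rows:
    print(f"{section or '(none)'}: {pages} pages, {words} words")

conn.close()
```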

MCP Performance

  • Configurable limits: Set appropriate search and section limits
  • Snippet length: Adjust snippet size for optimal response times
  • Connection pooling: Efficient database connections

🌐 Supported Documentation Sites

This scraper works with most documentation websites including:

  • Static sites: Hugo, Jekyll, MkDocs, Docusaurus
  • Documentation platforms: GitBook, Notion, Confluence
  • API docs: Swagger/OpenAPI documentation
  • Wiki-style: MediaWiki, TiddlyWiki
  • Custom sites: Any site with consistent HTML structure

Site-Specific Configuration

Customize URL filtering and content extraction for your target site:

```python
URL_FILTER_CONFIG = {
    "skip_patterns": [
        r'/api/',     # Skip API endpoint docs
        r'/edit/',    # Skip edit pages
        r'\.pdf$',    # Skip PDF files
    ],
    "allowed_domains": ["docs.yoursite.com"],
}

CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',      # Remove edit links
        r'Was this helpful\?.*?\n',  # Remove feedback sections
    ],
}
```
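How remove_patterns is applied is internal to the scraper; as a rough, generic sketch of the idea (not the project's actual cleaning code), each pattern is stripped from the extracted markdown with re.sub:

```python
# Generic illustration of applying remove_patterns to scraped markdown.
# Patterns copied from the config above; the helper name is hypothetical.
import re

REMOVE_PATTERNS = [
    r'Edit this page.*?\n',
    r'Was this helpful\?.*?\n',
]

def clean_markdown(markdown: str) -> str:
    for pattern in REMOVE_PATTERNS:
        markdown = re.sub(pattern, '', markdown)
    return markdown

print(clean_markdown("Intro\nEdit this page on GitHub\nReal content\n"))
```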

🤝 Contributing

We welcome contributions! Here are some areas where you can help:

  • New export formats: PDF, EPUB, Word documents
  • Enhanced content filtering: Better noise removal
  • Additional debug tools: More comprehensive testing
  • Documentation: Improve guides and examples
  • Performance optimizations: Faster scraping and querying

⚠️ Responsible Usage

  • Respect robots.txt: Check the target site's robots.txt file (a minimal check is sketched after this list)
  • Rate limiting: Use appropriate delays between requests
  • Terms of service: Respect the documentation site's terms
  • Fair use: Use for educational, research, or personal purposes
  • Attribution: Credit the original documentation source
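The robots.txt check can be done in a few lines with Python's standard library. This is a generic sketch, independent of the toolkit; the user agent and URLs are placeholders:

```python
# Minimal robots.txt check using only the standard library.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://docs.example.com/robots.txt")
robots.read()

if robots.can_fetch("docs-scraper", "https://docs.example.com/tutorial/"):
    print("Allowed to crawl this path")
else:
    print("Disallowed by robots.txt - skip it")
```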

📄 License

This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.


🎉 Getting Started Examples

Example 1: Scrape Python Documentation

```python
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://docs.python.org/3/",
    "max_pages": 500,
    "max_depth": 3,
}
```

Example 2: Scrape API Documentation

```python
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://api-docs.example.com/",
    "max_pages": 200,
}

URL_FILTER_CONFIG = {
    "skip_patterns": [r'/changelog/', r'/releases/'],
}
```

Example 3: Corporate Documentation

```python
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://internal-docs.company.com/",
    "output_dir": "company_docs",
}

MCP_CONFIG = {
    "server_name": "company-docs-server",
    "docs_display_name": "Company Internal Docs",
}
```

Happy Documenting! 📚✨

For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.


🙏 Attribution

This project is powered by Crawl4AI - an amazing open-source LLM-friendly web crawler and scraper.

Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool! 🚀

Check out Crawl4AI: https://github.com/unclecode/crawl4ai
