
Crawl4Claude

Documentation Scraper & MCP Server

A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.

🚀 Features

Core Functionality

  • 🌐 Universal Documentation Scraper: Works with any documentation website

  • 📊 Structured Database: SQLite database with full-text search capabilities

  • 🤖 MCP Server Integration: Native Claude Desktop integration via Model Context Protocol

  • 📝 LLM-Optimized Output: Ready-to-use context files for AI applications

  • ⚙️ Configuration-Driven: Single config file controls all settings

Advanced Tools

  • 🔍 Query Interface: Command-line tool for searching and analyzing scraped content

  • 🛠️ Debug Suite: Comprehensive debugging tools for testing and validation

  • 📋 Auto-Configuration: Automatic MCP setup file generation

  • 📈 Progress Tracking: Detailed logging and error handling

  • 💾 Resumable Crawls: Smart caching for interrupted crawls

📋 Prerequisites

  • Python 3.8 or higher

  • Internet connection

  • ~500MB free disk space per documentation site

🛠️ Quick Start

1. Installation

# Clone the repository
git clone <repository-url>
cd documentation-scraper

# Install dependencies
pip install -r requirements.txt

2. Configure Your Target

Edit config.py to set your documentation site:

SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",  # Your documentation site
    "output_dir": "docs_db",
    "max_pages": 200,
    # ... other settings
}

3. Run the Scraper

python docs_scraper.py

4. Query Your Documentation

# Search for content
python query_docs.py --search "tutorial"

# Browse by section
python query_docs.py --section "getting-started"

# Get statistics
python query_docs.py --stats

5. Set Up Claude Integration

# Generate MCP configuration files
python utils/gen_mcp.py

# Follow the instructions to add to Claude Desktop

๐Ÿ—๏ธ Project Structure

๐Ÿ“ documentation-scraper/ โ”œโ”€โ”€ ๐Ÿ“„ config.py # Central configuration file โ”œโ”€โ”€ ๐Ÿ•ท๏ธ docs_scraper.py # Main scraper script โ”œโ”€โ”€ ๐Ÿ” query_docs.py # Query and analysis tool โ”œโ”€โ”€ ๐Ÿค– mcp_docs_server.py # MCP server for Claude integration โ”œโ”€โ”€ ๐Ÿ“‹ requirements.txt # Python dependencies โ”œโ”€โ”€ ๐Ÿ“ utils/ # Debug and utility tools โ”‚ โ”œโ”€โ”€ ๐Ÿ› ๏ธ gen_mcp.py # Generate MCP config files โ”‚ โ”œโ”€โ”€ ๐Ÿงช debug_scraper.py # Test scraper functionality โ”‚ โ”œโ”€โ”€ ๐Ÿ”ง debug_mcp_server.py # Debug MCP server โ”‚ โ”œโ”€โ”€ ๐ŸŽฏ debug_mcp_client.py # Test MCP tools directly โ”‚ โ”œโ”€โ”€ ๐Ÿ“ก debug_mcp_server_protocol.py # Test MCP via JSON-RPC โ”‚ โ””โ”€โ”€ ๐ŸŒ debug_site_content.py # Debug content extraction โ”œโ”€โ”€ ๐Ÿ“ docs_db/ # Generated documentation database โ”‚ โ”œโ”€โ”€ ๐Ÿ“Š documentation.db # SQLite database โ”‚ โ”œโ”€โ”€ ๐Ÿ“„ documentation.json # JSON export โ”‚ โ”œโ”€โ”€ ๐Ÿ“‹ scrape_summary.json # Statistics โ”‚ โ””โ”€โ”€ ๐Ÿ“ llm_context/ # LLM-ready context files โ””โ”€โ”€ ๐Ÿ“ mcp/ # Generated MCP configuration โ”œโ”€โ”€ ๐Ÿ”ง run_mcp_server.bat # Windows launcher script โ””โ”€โ”€ โš™๏ธ claude_mcp_config.json # Claude Desktop config

⚙️ Configuration

Main Configuration (config.py)

The entire system is controlled by a single configuration file:

# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}

# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}

Environment Overrides

You can override any setting with environment variables:

export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
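
How config.py picks these up is an implementation detail of this project, but the usual pattern is a minimal sketch like the following, reusing the variable names from the commands above (the exact config keys are assumptions):

import os

# Hypothetical sketch: read overrides with the config.py defaults as fallback
DB_PATH = os.environ.get("DOCS_DB_PATH", "docs_db/documentation.db")
BASE_URL = os.environ.get("DOCS_BASE_URL", "https://docs.example.com/")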

🤖 Claude Desktop Integration

Automatic Setup

  1. Generate configuration files:

    python utils/gen_mcp.py
  2. Copy the generated config to Claude Desktop:

    • Windows: %APPDATA%\Claude\claude_desktop_config.json

    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

  3. Restart Claude Desktop

Manual Setup

If you prefer manual setup, add this to your Claude Desktop config:

{ "mcpServers": { "docs": { "command": "python", "args": ["path/to/mcp_docs_server.py"], "cwd": "path/to/project", "env": { "DOCS_DB_PATH": "path/to/docs_db/documentation.db" } } } }

Available MCP Tools

Once connected, Claude can use these tools (a raw protocol example follows the list):

  • 🔍 search_documentation: Search for content across all documentation

  • 📚 get_documentation_sections: List all available sections

  • 📄 get_page_content: Get full content of specific pages

  • 🗂️ browse_section: Browse pages within a section

  • 📊 get_documentation_stats: Get database statistics
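
Claude Desktop issues these calls for you, but if you want to see what crosses the wire (for example with utils/debug_mcp_server_protocol.py), a tools/call request in MCP's JSON-RPC framing looks roughly like this (the argument names are illustrative):

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_documentation",
    "arguments": { "query": "authentication", "limit": 5 }
  }
}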

🔧 Command Line Tools

Documentation Scraper

# Run the scraper -- all settings are read from config.py
python docs_scraper.py

Query Tool

# Search for content
python query_docs.py --search "authentication guide"

# Browse specific sections
python query_docs.py --section "api-reference"

# Get database statistics
python query_docs.py --stats

# List all sections
python query_docs.py --list-sections

# Export section to file
python query_docs.py --export-section "tutorials" --format markdown > tutorials.md

# Use custom database
python query_docs.py --db "custom/path/docs.db" --search "example"

Debug Tools

# Test scraper functionality
python utils/debug_scraper.py

# Test MCP server
python utils/debug_mcp_server.py

# Test MCP tools directly
python utils/debug_mcp_client.py

# Test MCP protocol
python utils/debug_mcp_server_protocol.py

# Debug content extraction
python utils/debug_site_content.py

# Generate MCP config files
python utils/gen_mcp.py

📊 Database Schema

Pages Table

CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    markdown TEXT,
    word_count INTEGER,
    section TEXT,
    subsection TEXT,
    scraped_at TIMESTAMP,
    metadata TEXT
);

Full-Text Search

-- Search using FTS5
SELECT * FROM pages_fts WHERE pages_fts MATCH 'your search term';

# Or use the query tool from the shell
python query_docs.py --search "your search term"
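
The pages_fts table used above is presumably an FTS5 virtual table kept in sync with pages; a minimal sketch of how such an index is typically defined (the choice of indexed columns is an assumption):

-- Assumed external-content FTS5 index over the pages table
CREATE VIRTUAL TABLE pages_fts USING fts5(
    title, content,
    content='pages', content_rowid='id'
);

-- Rebuild the index from existing rows
INSERT INTO pages_fts(pages_fts) VALUES('rebuild');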

🎯 Example Use Cases

1. Documentation Analysis

# Get an overview of the documentation
python query_docs.py --stats

# Find all tutorial content
python query_docs.py --search "tutorial guide example"

# Export specific sections
python query_docs.py --export-section "getting-started" > onboarding.md

2. AI Integration with Claude

# Once MCP is set up, ask Claude:
#   "Search the documentation for authentication examples"
#   "What sections are available in the documentation?"
#   "Show me the content for the API reference page"

3. Custom Applications

import sqlite3

# Connect to your scraped documentation
conn = sqlite3.connect('docs_db/documentation.db')

# Query for specific content
results = conn.execute("""
    SELECT title, url, markdown
    FROM pages
    WHERE section = 'tutorials' AND word_count > 500
    ORDER BY word_count DESC
""").fetchall()

# Build your own tools on top of the structured data

🔍 Debugging and Testing

Test Scraper Before Full Run

python utils/debug_scraper.py

Validate Content Extraction

python utils/debug_site_content.py

Test MCP Integration

# Test server functionality
python utils/debug_mcp_server.py

# Test tools directly
python utils/debug_mcp_client.py

# Test JSON-RPC protocol
python utils/debug_mcp_server_protocol.py

📈 Performance and Optimization

Scraping Performance

  • Start small: Use max_pages=50 for testing (see the config sketch after this list)

  • Adjust depth: max_depth=2 covers most content efficiently

  • Rate limiting: Increase delay_between_requests if getting blocked

  • Caching: Enabled by default for resumable crawls
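
Putting those guidelines together, a conservative first-run configuration (using the SCRAPER_CONFIG keys shown earlier) might look like:

# config.py -- trial-run settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "max_pages": 50,                 # start small
    "max_depth": 2,                  # covers most content
    "delay_between_requests": 1.0,   # raise this if you get blocked
}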

Database Performance

  • Full-text search: Automatic FTS5 index for fast searching

  • Indexing: Optimized indexes on URL and section columns (illustrative definitions after this list)

  • Word counts: Pre-calculated for quick statistics
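
For reference, a sketch of the kind of definitions this implies (index names are illustrative, and url already gets an implicit index from its UNIQUE constraint):

-- Illustrative index definitions; actual names in the project may differ
CREATE INDEX idx_pages_section ON pages(section);
CREATE INDEX idx_pages_subsection ON pages(subsection);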

MCP Performance

  • Configurable limits: Set appropriate search and section limits

  • Snippet length: Adjust snippet size for optimal response times

  • Connection pooling: Efficient database connections

🌐 Supported Documentation Sites

This scraper works with most documentation websites, including:

  • Static sites: Hugo, Jekyll, MkDocs, Docusaurus

  • Documentation platforms: GitBook, Notion, Confluence

  • API docs: Swagger/OpenAPI documentation

  • Wiki-style: MediaWiki, TiddlyWiki

  • Custom sites: Any site with consistent HTML structure

Site-Specific Configuration

Customize URL filtering and content extraction for your target site:

URL_FILTER_CONFIG = {
    "skip_patterns": [
        r'/api/',    # Skip API endpoint docs
        r'/edit/',   # Skip edit pages
        r'\.pdf$',   # Skip PDF files
    ],
    "allowed_domains": ["docs.yoursite.com"],
}

CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',      # Remove edit links
        r'Was this helpful\?.*?\n',  # Remove feedback sections
    ],
}
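
Before launching a full crawl, it can help to sanity-check your skip patterns against a few known URLs. A small standalone test using plain re (independent of the scraper's own matching logic, which may differ):

import re

skip_patterns = [r'/api/', r'/edit/', r'\.pdf$']

def should_skip(url: str) -> bool:
    """Return True if any skip pattern matches the URL."""
    return any(re.search(p, url) for p in skip_patterns)

print(should_skip("https://docs.yoursite.com/api/v1"))      # True
print(should_skip("https://docs.yoursite.com/guide.html"))  # False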

🤝 Contributing

We welcome contributions! Here are some areas where you can help:

  • New export formats: PDF, EPUB, Word documents

  • Enhanced content filtering: Better noise removal

  • Additional debug tools: More comprehensive testing

  • Documentation: Improve guides and examples

  • Performance optimizations: Faster scraping and querying

⚠️ Responsible Usage

  • Respect robots.txt: Check the target site's robots.txt file before crawling (see the snippet after this list)

  • Rate limiting: Use appropriate delays between requests

  • Terms of service: Respect the documentation site's terms

  • Fair use: Use for educational, research, or personal purposes

  • Attribution: Credit the original documentation source
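
For the robots.txt check, Python's standard library is enough; a minimal example with urllib.robotparser:

from urllib.robotparser import RobotFileParser

# Fetch and parse the target site's robots.txt
rp = RobotFileParser("https://docs.example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://docs.example.com/tutorial/"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")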

📄 License

This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.


🎉 Getting Started Examples

Example 1: Scrape Python Documentation

# config.py
SCRAPER_CONFIG = {
    "base_url": "https://docs.python.org/3/",
    "max_pages": 500,
    "max_depth": 3,
}

Example 2: Scrape API Documentation

# config.py
SCRAPER_CONFIG = {
    "base_url": "https://api-docs.example.com/",
    "max_pages": 200,
}

URL_FILTER_CONFIG = {
    "skip_patterns": [r'/changelog/', r'/releases/'],
}

Example 3: Corporate Documentation

# config.py
SCRAPER_CONFIG = {
    "base_url": "https://internal-docs.company.com/",
    "output_dir": "company_docs",
}

MCP_CONFIG = {
    "server_name": "company-docs-server",
    "docs_display_name": "Company Internal Docs",
}

Happy Documenting! 📚✨

For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.


🙏 Attribution

This project is powered by Crawl4AI - an amazing open-source LLM-friendly web crawler and scraper.

Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool! 🚀

Check out Crawl4AI: https://github.com/unclecode/crawl4ai
