Documentation Scraper & MCP Server

A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.

πŸš€ Features

Core Functionality

  • 🌐 Universal Documentation Scraper: Works with any documentation website

  • πŸ“Š Structured Database: SQLite database with full-text search capabilities

  • πŸ€– MCP Server Integration: Native Claude Desktop integration via Model Context Protocol

  • πŸ“ LLM-Optimized Output: Ready-to-use context files for AI applications

  • βš™οΈ Configuration-Driven: Single config file controls all settings

Advanced Tools

  • πŸ” Query Interface: Command-line tool for searching and analyzing scraped content

  • πŸ› οΈ Debug Suite: Comprehensive debugging tools for testing and validation

  • πŸ“‹ Auto-Configuration: Automatic MCP setup file generation

  • πŸ“ˆ Progress Tracking: Detailed logging and error handling

  • πŸ’Ύ Resumable Crawls: Smart caching for interrupted crawls

πŸ“‹ Prerequisites

  • Python 3.8 or higher

  • Internet connection

  • ~500MB free disk space per documentation site

πŸ› οΈ Quick Start

1. Installation

# Clone the repository
git clone <repository-url>
cd documentation-scraper

# Install dependencies
pip install -r requirements.txt
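requirements.txt itself isn't reproduced in this README; based on the Attribution section below and the MCP server component, it presumably includes at least the following (a guess, not the actual pinned file):

crawl4ai   # crawler engine (see Attribution below)
mcp        # Model Context Protocol SDK used by the Claude server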

2. Configure Your Target

Edit config.py to set your documentation site:

SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",  # Your documentation site
    "output_dir": "docs_db",
    "max_pages": 200,
    # ... other settings
}

3. Run the Scraper

python docs_scraper.py

4. Query Your Documentation

# Search for content
python query_docs.py --search "tutorial"

# Browse by section
python query_docs.py --section "getting-started"

# Get statistics
python query_docs.py --stats

5. Set Up Claude Integration

# Generate MCP configuration files
python utils/gen_mcp.py

# Follow the printed instructions to add the server to Claude Desktop

πŸ—οΈ Project Structure

πŸ“ documentation-scraper/ β”œβ”€β”€ πŸ“„ config.py # Central configuration file β”œβ”€β”€ πŸ•·οΈ docs_scraper.py # Main scraper script β”œβ”€β”€ πŸ” query_docs.py # Query and analysis tool β”œβ”€β”€ πŸ€– mcp_docs_server.py # MCP server for Claude integration β”œβ”€β”€ πŸ“‹ requirements.txt # Python dependencies β”œβ”€β”€ πŸ“ utils/ # Debug and utility tools β”‚ β”œβ”€β”€ πŸ› οΈ gen_mcp.py # Generate MCP config files β”‚ β”œβ”€β”€ πŸ§ͺ debug_scraper.py # Test scraper functionality β”‚ β”œβ”€β”€ πŸ”§ debug_mcp_server.py # Debug MCP server β”‚ β”œβ”€β”€ 🎯 debug_mcp_client.py # Test MCP tools directly β”‚ β”œβ”€β”€ πŸ“‘ debug_mcp_server_protocol.py # Test MCP via JSON-RPC β”‚ └── 🌐 debug_site_content.py # Debug content extraction β”œβ”€β”€ πŸ“ docs_db/ # Generated documentation database β”‚ β”œβ”€β”€ πŸ“Š documentation.db # SQLite database β”‚ β”œβ”€β”€ πŸ“„ documentation.json # JSON export β”‚ β”œβ”€β”€ πŸ“‹ scrape_summary.json # Statistics β”‚ └── πŸ“ llm_context/ # LLM-ready context files └── πŸ“ mcp/ # Generated MCP configuration β”œβ”€β”€ πŸ”§ run_mcp_server.bat # Windows launcher script └── βš™οΈ claude_mcp_config.json # Claude Desktop config

βš™οΈ Configuration

Main Configuration (config.py)

The entire system is controlled by a single configuration file:

# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}

# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}

Environment Overrides

You can override any setting with environment variables:

export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
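The override mechanism itself isn't shown in this README; a minimal sketch of how config.py might honor these variables, assuming plain os.environ lookups (the fallback values here are illustrative):

import os

# Values from config.py act as defaults...
SCRAPER_CONFIG = {"base_url": "https://docs.example.com/"}

# ...and environment variables, when set, take precedence
DB_PATH = os.environ.get("DOCS_DB_PATH", "docs_db/documentation.db")
BASE_URL = os.environ.get("DOCS_BASE_URL", SCRAPER_CONFIG["base_url"])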

πŸ€– Claude Desktop Integration

Automatic Setup

  1. Generate configuration files:

    python utils/gen_mcp.py
  2. Copy the generated config to Claude Desktop:

    • Windows: %APPDATA%\Claude\claude_desktop_config.json

    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

  3. Restart Claude Desktop

Manual Setup

If you prefer manual setup, add this to your Claude Desktop config:

{
  "mcpServers": {
    "docs": {
      "command": "python",
      "args": ["path/to/mcp_docs_server.py"],
      "cwd": "path/to/project",
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db"
      }
    }
  }
}

Available MCP Tools

Once connected, Claude can use the following tools (a hedged sketch of one handler appears after the list):

  • πŸ” search_documentation: Search for content across all documentation

  • πŸ“š get_documentation_sections: List all available sections

  • πŸ“„ get_page_content: Get full content of specific pages

  • πŸ—‚οΈ browse_section: Browse pages within a section

  • πŸ“Š get_documentation_stats: Get database statistics

πŸ”§ Command Line Tools

Documentation Scraper

# Basic scraping (all settings are read from config.py)
python docs_scraper.py

# To change behavior, edit config.py and run again
python docs_scraper.py

Query Tool

# Search for content
python query_docs.py --search "authentication guide"

# Browse specific sections
python query_docs.py --section "api-reference"

# Get database statistics
python query_docs.py --stats

# List all sections
python query_docs.py --list-sections

# Export a section to a file
python query_docs.py --export-section "tutorials" --format markdown > tutorials.md

# Use a custom database
python query_docs.py --db "custom/path/docs.db" --search "example"

Debug Tools

# Test scraper functionality
python utils/debug_scraper.py

# Test MCP server
python utils/debug_mcp_server.py

# Test MCP tools directly
python utils/debug_mcp_client.py

# Test MCP protocol
python utils/debug_mcp_server_protocol.py

# Debug content extraction
python utils/debug_site_content.py

# Generate MCP config files
python utils/gen_mcp.py

πŸ“Š Database Schema

Pages Table

CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    markdown TEXT,
    word_count INTEGER,
    section TEXT,
    subsection TEXT,
    scraped_at TIMESTAMP,
    metadata TEXT
);

Full-Text Search

-- Search using FTS5
SELECT * FROM pages_fts WHERE pages_fts MATCH 'your search term';

Or use the query tool:

python query_docs.py --search "your search term"
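The README does not show how pages_fts is created; a plausible FTS5 setup, assuming an external-content table kept in sync with pages (the indexed columns are an assumption):

-- Hypothetical setup: an external-content FTS5 table over pages
CREATE VIRTUAL TABLE pages_fts USING fts5(
    title,
    content,
    content='pages',
    content_rowid='id'
);

-- Backfill the index from already-scraped rows
INSERT INTO pages_fts(rowid, title, content)
SELECT id, title, content FROM pages;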

🎯 Example Use Cases

1. Documentation Analysis

# Get an overview of the documentation
python query_docs.py --stats

# Find all tutorial content
python query_docs.py --search "tutorial guide example"

# Export specific sections
python query_docs.py --export-section "getting-started" > onboarding.md

2. AI Integration with Claude

# Once MCP is set up, ask Claude:
# "Search the documentation for authentication examples"
# "What sections are available in the documentation?"
# "Show me the content for the API reference page"

3. Custom Applications

import sqlite3

# Connect to your scraped documentation
conn = sqlite3.connect('docs_db/documentation.db')

# Query for specific content
results = conn.execute("""
    SELECT title, url, markdown
    FROM pages
    WHERE section = 'tutorials' AND word_count > 500
    ORDER BY word_count DESC
""").fetchall()

# Build your own tools on top of the structured data

πŸ” Debugging and Testing

Test Scraper Before Full Run

python utils/debug_scraper.py

Validate Content Extraction

python utils/debug_site_content.py

Test MCP Integration

# Test server functionality
python utils/debug_mcp_server.py

# Test tools directly
python utils/debug_mcp_client.py

# Test JSON-RPC protocol
python utils/debug_mcp_server_protocol.py

πŸ“ˆ Performance and Optimization

Scraping Performance

  • Start small: Use max_pages=50 for testing

  • Adjust depth: max_depth=2 covers most content efficiently

  • Rate limiting: Increase delay_between_requests if getting blocked

  • Caching: Enabled by default for resumable crawls

Database Performance

  • Full-text search: Automatic FTS5 index for fast searching

  • Indexing: Optimized indexes on URL and section columns

  • Word counts: Pre-calculated for quick statistics

MCP Performance

  • Configurable limits: Set appropriate search and section limits

  • Snippet length: Adjust snippet size for optimal response times

  • Connection pooling: Efficient database connections

🌐 Supported Documentation Sites

This scraper works with most documentation websites including:

  • Static sites: Hugo, Jekyll, MkDocs, Docusaurus

  • Documentation platforms: GitBook, Notion, Confluence

  • API docs: Swagger/OpenAPI documentation

  • Wiki-style: MediaWiki, TiddlyWiki

  • Custom sites: Any site with consistent HTML structure

Site-Specific Configuration

Customize URL filtering and content extraction for your target site:

URL_FILTER_CONFIG = {
    "skip_patterns": [
        r'/api/',    # Skip API endpoint docs
        r'/edit/',   # Skip edit pages
        r'\.pdf$',   # Skip PDF files
    ],
    "allowed_domains": ["docs.yoursite.com"],
}

CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',      # Remove edit links
        r'Was this helpful\?.*?\n',  # Remove feedback sections
    ],
}
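How the scraper applies remove_patterns internally isn't shown; a minimal sketch of the likely mechanics, assuming a plain re.sub pass over the extracted text:

import re

CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',
        r'Was this helpful\?.*?\n',
    ],
}

def clean_content(text: str) -> str:
    # Strip each configured noise pattern from the extracted page text
    for pattern in CONTENT_FILTER_CONFIG["remove_patterns"]:
        text = re.sub(pattern, "", text)
    return text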

🀝 Contributing

We welcome contributions! Here are some areas where you can help:

  • New export formats: PDF, EPUB, Word documents

  • Enhanced content filtering: Better noise removal

  • Additional debug tools: More comprehensive testing

  • Documentation: Improve guides and examples

  • Performance optimizations: Faster scraping and querying

⚠️ Responsible Usage

  • Respect robots.txt: Check the target site's robots.txt file

  • Rate limiting: Use appropriate delays between requests

  • Terms of service: Respect the documentation site's terms

  • Fair use: Use for educational, research, or personal purposes

  • Attribution: Credit the original documentation source

πŸ“„ License

This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.


πŸŽ‰ Getting Started Examples

Example 1: Scrape Python Documentation

# config.py
SCRAPER_CONFIG = {
    "base_url": "https://docs.python.org/3/",
    "max_pages": 500,
    "max_depth": 3,
}

Example 2: Scrape API Documentation

# config.py
SCRAPER_CONFIG = {
    "base_url": "https://api-docs.example.com/",
    "max_pages": 200,
}

URL_FILTER_CONFIG = {
    "skip_patterns": [r'/changelog/', r'/releases/'],
}

Example 3: Corporate Documentation

# config.py
SCRAPER_CONFIG = {
    "base_url": "https://internal-docs.company.com/",
    "output_dir": "company_docs",
}

MCP_CONFIG = {
    "server_name": "company-docs-server",
    "docs_display_name": "Company Internal Docs",
}

Happy Documenting! πŸ“šβœ¨

For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.


πŸ™ Attribution

This project is powered by Crawl4AI - an amazing open-source LLM-friendly web crawler and scraper.

Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool! πŸš€

Check out Crawl4AI: https://github.com/unclecode/crawl4ai
