
Documentation Scraper & MCP Server

A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.

🚀 Features

Core Functionality

  • 🌐 Universal Documentation Scraper: Works with any documentation website

  • 📊 Structured Database: SQLite database with full-text search capabilities

  • 🤖 MCP Server Integration: Native Claude Desktop integration via Model Context Protocol

  • 📝 LLM-Optimized Output: Ready-to-use context files for AI applications

  • ⚙️ Configuration-Driven: Single config file controls all settings

Advanced Tools

  • 🔍 Query Interface: Command-line tool for searching and analyzing scraped content

  • 🛠️ Debug Suite: Comprehensive debugging tools for testing and validation

  • 📋 Auto-Configuration: Automatic MCP setup file generation

  • 📈 Progress Tracking: Detailed logging and error handling

  • 💾 Resumable Crawls: Smart caching for interrupted crawls


📋 Prerequisites

  • Python 3.8 or higher

  • Internet connection

  • ~500MB free disk space per documentation site

🛠️ Quick Start

1. Installation

# Clone the repository
git clone <repository-url>
cd documentation-scraper

# Install dependencies
pip install -r requirements.txt

2. Configure Your Target

Edit config.py to set your documentation site:

SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",  # Your documentation site
    "output_dir": "docs_db",
    "max_pages": 200,
    # ... other settings
}

3. Run the Scraper

python docs_scraper.py

4. Query Your Documentation

# Search for content
python query_docs.py --search "tutorial"

# Browse by section
python query_docs.py --section "getting-started"

# Get statistics
python query_docs.py --stats

5. Set Up Claude Integration

# Generate MCP configuration files
python utils/gen_mcp.py

# Follow the instructions to add to Claude Desktop

🏗️ Project Structure

📁 documentation-scraper/
├── 📄 config.py                    # Central configuration file
├── 🕷️ docs_scraper.py              # Main scraper script
├── 🔍 query_docs.py                # Query and analysis tool
├── 🤖 mcp_docs_server.py           # MCP server for Claude integration
├── 📋 requirements.txt             # Python dependencies
├── 📁 utils/                       # Debug and utility tools
│   ├── 🛠️ gen_mcp.py               # Generate MCP config files
│   ├── 🧪 debug_scraper.py         # Test scraper functionality
│   ├── 🔧 debug_mcp_server.py      # Debug MCP server
│   ├── 🎯 debug_mcp_client.py      # Test MCP tools directly
│   ├── 📡 debug_mcp_server_protocol.py # Test MCP via JSON-RPC
│   └── 🌐 debug_site_content.py    # Debug content extraction
├── 📁 docs_db/                     # Generated documentation database
│   ├── 📊 documentation.db         # SQLite database
│   ├── 📄 documentation.json       # JSON export
│   ├── 📋 scrape_summary.json      # Statistics
│   └── 📁 llm_context/             # LLM-ready context files
└── 📁 mcp/                         # Generated MCP configuration
    ├── 🔧 run_mcp_server.bat       # Windows launcher script
    └── ⚙️ claude_mcp_config.json   # Claude Desktop config

⚙️ Configuration

Main Configuration (config.py)

The entire system is controlled by a single configuration file:

# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}

# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}
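
As a rough illustration of how the URL filtering rules above could be applied, the sketch below checks a URL against `allowed_domains` and `skip_patterns`. The `should_crawl` helper is hypothetical and is not taken from the project; the scraper's actual matching logic may differ.

```python
import re
from urllib.parse import urlparse

# Values copied from the config.py example above.
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

def should_crawl(url, config=URL_FILTER_CONFIG):
    """Hypothetical helper: True if the URL passes the domain and pattern filters."""
    host = urlparse(url).netloc
    if host not in config["allowed_domains"]:
        return False
    # Reject the URL if any skip pattern matches anywhere in it.
    return not any(re.search(p, url) for p in config["skip_patterns"])

print(should_crawl("https://docs.example.com/guide/intro"))   # True
print(should_crawl("https://docs.example.com/api/users"))     # False
print(should_crawl("https://docs.example.com/manual.pdf"))    # False
print(should_crawl("https://other.example.org/guide"))        # False
```

Because the patterns are regular expressions searched anywhere in the URL, `/api/` also excludes a path such as `/guide/api/overview`; anchor the pattern if that is too broad.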

Environment Overrides

You can override any setting with environment variables:

export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
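
One common way to implement this kind of override is to read the environment first and fall back to the configured default. The sketch below shows that pattern for the two variables above; `resolve_setting` and the default values are assumptions for illustration, not the project's own code.

```python
import os

# Assumed defaults, chosen to match the config.py example above.
DEFAULT_DB_PATH = "docs_db/documentation.db"
DEFAULT_BASE_URL = "https://docs.example.com/"

def resolve_setting(env_name, default, environ=None):
    """Return the environment value if it is set and non-empty, else the default."""
    environ = os.environ if environ is None else environ
    value = environ.get(env_name, "").strip()
    return value or default

db_path = resolve_setting("DOCS_DB_PATH", DEFAULT_DB_PATH)
base_url = resolve_setting("DOCS_BASE_URL", DEFAULT_BASE_URL)
print(db_path, base_url)
```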

🤖 Claude Desktop Integration

Automatic Setup

  1. Generate configuration files:

    python utils/gen_mcp.py
  2. Copy the generated config to Claude Desktop:

    • Windows: %APPDATA%\Claude\claude_desktop_config.json

    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

  3. Restart Claude Desktop

Manual Setup

If you prefer manual setup, add this to your Claude Desktop config:

{
  "mcpServers": {
    "docs": {
      "command": "python",
      "args": ["path/to/mcp_docs_server.py"],
      "cwd": "path/to/project",
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db"
      }
    }
  }
}
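
If the Claude Desktop config already lists other servers, the entry has to be merged in rather than pasted over the file. The following is a minimal sketch of such a merge; `add_docs_server` is a hypothetical helper, the paths are the same placeholders as above, and the demo writes to a temporary file instead of the real config.

```python
import json
import tempfile
from pathlib import Path

def add_docs_server(config_path, project_dir, server_key="docs"):
    """Merge a server entry into a Claude Desktop config file, keeping other entries."""
    config_path = Path(config_path)
    project_dir = Path(project_dir)
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    servers = config.setdefault("mcpServers", {})
    servers[server_key] = {
        "command": "python",
        "args": [str(project_dir / "mcp_docs_server.py")],
        "cwd": str(project_dir),
        "env": {"DOCS_DB_PATH": str(project_dir / "docs_db" / "documentation.db")},
    }
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config

# Demo against a throwaway file rather than the real Claude Desktop config.
with tempfile.TemporaryDirectory() as tmp:
    cfg_file = Path(tmp) / "claude_desktop_config.json"
    cfg_file.write_text(json.dumps({"mcpServers": {"other": {"command": "node"}}}))
    merged = add_docs_server(cfg_file, "path/to/project")
    print(sorted(merged["mcpServers"]))
```

Back up the existing config before running anything like this against it; the sketch does not handle an empty or malformed JSON file.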

Available MCP Tools

Once connected, Claude can use these tools:

  • 🔍 search_documentation: Search for content across all documentation

  • 📚 get_documentation_sections: List all available sections

  • 📄 get_page_content: Get full content of specific pages

  • 🗂️ browse_section: Browse pages within a section

  • 📊 get_documentation_stats: Get database statistics
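
MCP clients invoke tools by sending JSON-RPC 2.0 requests with the method `tools/call`. The sketch below builds such a request for `search_documentation`. The argument names `query` and `limit` are assumptions, since this README does not list each tool's parameters.

```python
import json
from itertools import count

_request_ids = count(1)  # simple incrementing request id

def build_tool_call(tool_name, arguments):
    """Build a JSON-RPC 2.0 request that asks an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# "query" and "limit" are assumed argument names, not confirmed by this README.
request = build_tool_call("search_documentation", {"query": "authentication", "limit": 5})
print(json.dumps(request, indent=2))
```

A message like this only works after the usual MCP initialization handshake; Claude Desktop performs that step itself.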

🔧 Command Line Tools

Documentation Scraper

# Basic scraping
python docs_scraper.py

# Settings are read from config.py; edit that file to change a run
python docs_scraper.py

Query Tool

# Search for content
python query_docs.py --search "authentication guide"

# Browse specific sections  
python query_docs.py --section "api-reference"

# Get database statistics
python query_docs.py --stats

# List all sections
python query_docs.py --list-sections

# Export section to file
python query_docs.py --export-section "tutorials" --format markdown > tutorials.md

# Use custom database
python query_docs.py --db "custom/path/docs.db" --search "example"

Debug Tools

# Test scraper functionality
python utils/debug_scraper.py

# Test MCP server
python utils/debug_mcp_server.py

# Test MCP tools directly
python utils/debug_mcp_client.py

# Test MCP protocol
python utils/debug_mcp_server_protocol.py

# Debug content extraction
python utils/debug_site_content.py

# Generate MCP config files
python utils/gen_mcp.py

📊 Database Schema

Pages Table

CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    markdown TEXT,
    word_count INTEGER,
    section TEXT,
    subsection TEXT,
    scraped_at TIMESTAMP,
    metadata TEXT
);

-- Search using FTS5
SELECT * FROM pages_fts WHERE pages_fts MATCH 'your search term';

# Or use the query tool
python query_docs.py --search "your search term"
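
The same FTS5 query can be run from Python. The sketch below builds a small in-memory index and searches it. The `pages_fts` column layout (`title`, `content`) is an assumption, because only the `pages` table schema is shown here, and it requires a Python build whose SQLite includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: the real pages_fts columns are not shown in this README.
conn.execute("CREATE VIRTUAL TABLE pages_fts USING fts5(title, content)")
conn.executemany(
    "INSERT INTO pages_fts (title, content) VALUES (?, ?)",
    [
        ("Authentication guide", "Use an API token to authenticate each request."),
        ("Installation", "Install the package with pip."),
    ],
)

def search(conn, term, limit=10):
    """Return (title, snippet) pairs ordered by FTS5 relevance."""
    rows = conn.execute(
        "SELECT title, snippet(pages_fts, 1, '[', ']', '...', 8) "
        "FROM pages_fts WHERE pages_fts MATCH ? ORDER BY rank LIMIT ?",
        (term, limit),
    )
    return rows.fetchall()

results = search(conn, "token")
print(results)
conn.close()
```

Passing the term as a bound parameter avoids SQL injection, but the term is still parsed as FTS5 query syntax, so input containing quotes or operators may need escaping.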

🎯 Example Use Cases

1. Documentation Analysis

# Get overview of documentation
python query_docs.py --stats

# Find all tutorial content
python query_docs.py --search "tutorial guide example"

# Export specific sections
python query_docs.py --export-section "getting-started" > onboarding.md

2. AI Integration with Claude

# Once MCP is set up, ask Claude:
# "Search the documentation for authentication examples"
# "What sections are available in the documentation?"
# "Show me the content for the API reference page"

3. Custom Applications

import sqlite3

# Connect to your scraped documentation
conn = sqlite3.connect('docs_db/documentation.db')

# Query for specific content
results = conn.execute("""
    SELECT title, url, markdown 
    FROM pages 
    WHERE section = 'tutorials' 
    AND word_count > 500
    ORDER BY word_count DESC
""").fetchall()

# Build your own tools on top of the structured data

🔍 Debugging and Testing

Test Scraper Before Full Run

python utils/debug_scraper.py

Validate Content Extraction

python utils/debug_site_content.py

Test MCP Integration

# Test server functionality
python utils/debug_mcp_server.py

# Test tools directly
python utils/debug_mcp_client.py

# Test JSON-RPC protocol
python utils/debug_mcp_server_protocol.py

📈 Performance and Optimization

Scraping Performance

  • Start small: Use max_pages=50 for testing

  • Adjust depth: max_depth=2 covers most content efficiently

  • Rate limiting: Increase delay_between_requests if getting blocked

  • Caching: Enabled by default for resumable crawls
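
To see how `max_depth` and `max_pages` bound a crawl, the sketch below runs a breadth-first traversal over a toy link graph. It is an illustration only: `plan_crawl` is a hypothetical function, the start page is assumed to count as depth 0, and a real crawl would also wait `delay_between_requests` seconds between fetches.

```python
from collections import deque

def plan_crawl(start, get_links, max_depth=2, max_pages=50):
    """Breadth-first traversal that stops at max_depth and max_pages."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages; no network access involved.
SITE = {
    "/": ["/guide", "/api"],
    "/guide": ["/guide/install", "/guide/usage"],
    "/api": ["/api/auth"],
    "/guide/install": ["/guide/install/windows"],
}
pages = plan_crawl("/", lambda u: SITE.get(u, []), max_depth=2, max_pages=50)
print(pages)
```

With `max_depth=2` the page `/guide/install/windows` (depth 3) is never reached, which is why a shallow depth can keep a test run small.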

Database Performance

  • Full-text search: Automatic FTS5 index for fast searching

  • Indexing: Optimized indexes on URL and section columns

  • Word counts: Pre-calculated for quick statistics

MCP Performance

  • Configurable limits: Set appropriate search and section limits

  • Snippet length: Adjust snippet size for optimal response times

  • Connection pooling: Efficient database connections
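
The limits and snippet length mentioned above might be enforced along these lines. This is a sketch under assumptions: `effective_limit`, `make_snippet`, and `SNIPPET_LENGTH` are hypothetical names, and only `default_search_limit` and `max_search_limit` come from the configuration shown earlier.

```python
MCP_CONFIG = {"default_search_limit": 10, "max_search_limit": 50}
SNIPPET_LENGTH = 200  # assumed setting name and value

def effective_limit(requested=None, config=MCP_CONFIG):
    """Use the default when no limit is given; otherwise clamp to 1..max."""
    if requested is None:
        return config["default_search_limit"]
    return max(1, min(int(requested), config["max_search_limit"]))

def make_snippet(text, length=SNIPPET_LENGTH):
    """Collapse whitespace and cut the text at a word boundary."""
    text = " ".join(text.split())
    if len(text) <= length:
        return text
    return text[:length].rsplit(" ", 1)[0] + "..."

print(effective_limit(), effective_limit(500), effective_limit(0))
print(make_snippet("word " * 100, 20))
```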

🌐 Supported Documentation Sites

This scraper works with most documentation websites including:

  • Static sites: Hugo, Jekyll, MkDocs, Docusaurus

  • Documentation platforms: GitBook, Notion, Confluence

  • API docs: Swagger/OpenAPI documentation

  • Wiki-style: MediaWiki, TiddlyWiki

  • Custom sites: Any site with consistent HTML structure

Site-Specific Configuration

Customize URL filtering and content extraction for your target site:

URL_FILTER_CONFIG = {
    "skip_patterns": [
        r'/api/',           # Skip API endpoint docs
        r'/edit/',          # Skip edit pages  
        r'\.pdf$',          # Skip PDF files
    ],
    "allowed_domains": ["docs.yoursite.com"],
}

CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',      # Remove edit links
        r'Was this helpful\?.*?\n',  # Remove feedback sections
    ],
}
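
The `remove_patterns` entries are regular expressions, so one plausible way to apply them is a `re.sub` pass per pattern. The sketch below assumes that approach; `clean_content` is a hypothetical helper and the scraper's actual filtering may use different flags or ordering.

```python
import re

# Patterns copied from the CONTENT_FILTER_CONFIG example above.
CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',
        r'Was this helpful\?.*?\n',
    ],
}

def clean_content(text, config=CONTENT_FILTER_CONFIG):
    """Strip every match of each configured pattern from the text."""
    for pattern in config["remove_patterns"]:
        text = re.sub(pattern, "", text)
    return text

page = "Installation\nEdit this page on GitHub\nRun pip install.\nWas this helpful? Yes No\n"
print(clean_content(page))
```

Both patterns end in `\n`, so a matching line at the very end of a page with no trailing newline would be left in place.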

🤝 Contributing

We welcome contributions! Here are some areas where you can help:

  • New export formats: PDF, EPUB, Word documents

  • Enhanced content filtering: Better noise removal

  • Additional debug tools: More comprehensive testing

  • Documentation: Improve guides and examples

  • Performance optimizations: Faster scraping and querying

⚠️ Responsible Usage

  • Respect robots.txt: Check the target site's robots.txt file

  • Rate limiting: Use appropriate delays between requests

  • Terms of service: Respect the documentation site's terms

  • Fair use: Use for educational, research, or personal purposes

  • Attribution: Credit the original documentation source
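
A robots.txt check can be done with Python's standard library before a crawl starts. The sketch below parses an inline example file so that it runs without network access; the user agent name and the rules themselves are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real run would load the target site's own robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "docs-scraper"  # hypothetical user agent string
allowed = parser.can_fetch(USER_AGENT, "https://docs.example.com/guide/")
blocked = parser.can_fetch(USER_AGENT, "https://docs.example.com/private/notes")
delay = parser.crawl_delay(USER_AGENT)
print(allowed, blocked, delay)
```

To check a live site, call `parser.set_url(...)` with the robots.txt URL followed by `parser.read()` instead of `parse`, and use any reported crawl delay as a lower bound for `delay_between_requests`.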

📄 License

This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.


🎉 Getting Started Examples

Example 1: Scrape Python Documentation

# config.py
SCRAPER_CONFIG = {
    "base_url": "https://docs.python.org/3/",
    "max_pages": 500,
    "max_depth": 3,
}

Example 2: Scrape API Documentation

# config.py  
SCRAPER_CONFIG = {
    "base_url": "https://api-docs.example.com/",
    "max_pages": 200,
}

URL_FILTER_CONFIG = {
    "skip_patterns": [r'/changelog/', r'/releases/'],
}

Example 3: Corporate Documentation

# config.py
SCRAPER_CONFIG = {
    "base_url": "https://internal-docs.company.com/",
    "output_dir": "company_docs",
}

MCP_CONFIG = {
    "server_name": "company-docs-server",
    "docs_display_name": "Company Internal Docs",
}

Happy Documenting! 📚✨

For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.


🙏 Attribution

This project is powered by Crawl4AI - an amazing open-source LLM-friendly web crawler and scraper.

Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool! 🚀

Check out Crawl4AI:

  • Repository: https://github.com/unclecode/crawl4ai

  • Documentation: https://crawl4ai.com

  • Discord Community: https://discord.gg/jP8KfhDhyN
