# ScrapeGraph MCP Server - Project Architecture
**Last Updated:** October 2025
**Version:** 1.0.0
## Table of Contents
- [System Overview](#system-overview)
- [Technology Stack](#technology-stack)
- [Project Structure](#project-structure)
- [Core Architecture](#core-architecture)
- [MCP Tools](#mcp-tools)
- [API Integration](#api-integration)
- [Deployment](#deployment)
- [Recent Updates](#recent-updates)
---
## System Overview
The ScrapeGraph MCP Server is a production-ready [Model Context Protocol](https://modelcontextprotocol.io/introduction) (MCP) server that provides seamless integration between AI assistants (like Claude, Cursor, etc.) and the [ScrapeGraphAI API](https://scrapegraphai.com). This server enables language models to leverage advanced AI-powered web scraping capabilities with enterprise-grade reliability.
**Key Capabilities:**
- **Markdownify** - Convert webpages to clean, structured markdown
- **SmartScraper** - AI-powered structured data extraction from webpages
- **SearchScraper** - AI-powered web searches with structured results
- **SmartCrawler** - Intelligent multi-page web crawling with AI extraction or markdown conversion
**Purpose:**
- Bridge AI assistants (Claude, Cursor, etc.) with web scraping capabilities
- Enable LLMs to extract structured data from any website
- Provide clean, formatted markdown conversion of web content
- Execute multi-page crawling operations with AI-powered extraction
---
## Technology Stack
### Core Framework
- **Python 3.10+** - Programming language (minimum version)
- **mcp[cli] 1.3.0+** - Model Context Protocol SDK for Python
- **FastMCP** - Lightweight MCP server framework built on top of mcp
### HTTP Client
- **httpx 0.24.0+** - Modern async HTTP client for API requests
- **Timeout:** 60 seconds for all API requests
### Development Tools
- **Ruff 0.1.0+** - Fast Python linter
- **mypy 1.0.0+** - Static type checker
### Build System
- **Hatchling** - Modern Python build backend
- **pyproject.toml** - PEP 621 compliant project metadata
### Deployment
- **Docker** - Containerization with Alpine Linux base
- **Smithery** - Automated MCP server deployment and distribution
- **stdio transport** - Standard input/output for MCP communication
---
## Project Structure
```
scrapegraph-mcp/
├── src/
│ └── scrapegraph_mcp/
│ ├── __init__.py # Package initialization
│ └── server.py # Main MCP server implementation
│
├── assets/
│ ├── sgai_smithery.png # Smithery integration badge
│ └── cursor_mcp.png # Cursor integration screenshot
│
├── .github/
│ └── workflows/ # CI/CD workflows (if any)
│
├── pyproject.toml # Project metadata and dependencies
├── Dockerfile # Docker container definition
├── smithery.yaml # Smithery deployment configuration
├── README.md # User-facing documentation
├── LICENSE # MIT License
└── .python-version # Python version specification
```
### Key Files
**`src/scrapegraph_mcp/server.py`**
- Main server implementation
- `ScapeGraphClient` - API client wrapper
- MCP tool definitions (`@mcp.tool()` decorators)
- Server initialization and main entry point
**`pyproject.toml`**
- Project metadata (name, version, authors)
- Dependencies (mcp, httpx)
- Build configuration (hatchling)
- Tool configuration (ruff, mypy)
- Entry point: `scrapegraph-mcp` → `scrapegraph_mcp.server:main`
**`Dockerfile`**
- Python 3.12 Alpine base image
- Build dependencies (gcc, musl-dev, libffi-dev)
- Package installation
- Entrypoint: `scrapegraph-mcp`
**`smithery.yaml`**
- Smithery deployment configuration
- NPM package metadata
- Installation instructions
---
## Core Architecture
### Model Context Protocol (MCP)
The server implements the Model Context Protocol, which defines a standard way for AI assistants to interact with external tools and services.
**MCP Components:**
1. **Server** - Exposes tools to AI assistants (this project)
2. **Client** - AI assistant that uses the tools (Claude, Cursor, etc.)
3. **Transport** - Communication layer (stdio)
4. **Tools** - Functions that the AI can call
**Communication Flow:**
```
AI Assistant (Claude/Cursor)
↓ (stdio via MCP)
FastMCP Server (this project)
↓ (HTTPS API calls)
ScrapeGraphAI API (https://api.scrapegraphai.com/v1)
↓ (web scraping)
Target Websites
```
### Server Architecture
The server follows a simple, single-file architecture:
**`ScapeGraphClient` Class:**
- HTTP client wrapper for ScrapeGraphAI API
- Base URL: `https://api.scrapegraphai.com/v1`
- API key authentication via `SGAI-APIKEY` header
- Methods: `markdownify()`, `smartscraper()`, `searchscraper()`, `smartcrawler_initiate()`, `smartcrawler_fetch_results()`
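A condensed, illustrative sketch of this wrapper is shown below; only `markdownify()` is included, and the request/response shapes follow the [API Integration](#api-integration) section. This is a sketch of the documented behavior, not the verbatim implementation.
```python
from typing import Any, Dict

import httpx


class ScapeGraphClient:
    """Illustrative sketch of the API client wrapper (one method shown)."""

    BASE_URL = "https://api.scrapegraphai.com/v1"

    def __init__(self, api_key: str) -> None:
        # API key authentication via the SGAI-APIKEY header
        self.headers = {
            "SGAI-APIKEY": api_key,
            "Content-Type": "application/json",
        }
        # Synchronous client with the 60-second timeout noted in the stack
        self.client = httpx.Client(timeout=60.0)

    def markdownify(self, website_url: str) -> Dict[str, Any]:
        """POST /v1/markdownify - convert a webpage to markdown."""
        response = self.client.post(
            f"{self.BASE_URL}/markdownify",
            headers=self.headers,
            json={"website_url": website_url},
        )
        if response.status_code != 200:
            raise Exception(f"Error {response.status_code}: {response.text}")
        return response.json()
```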
**FastMCP Server:**
- Created with `FastMCP("ScapeGraph API MCP Server")`
- Exposes tools via `@mcp.tool()` decorators
- Tool functions wrap `ScapeGraphClient` methods
- Error handling with try/except blocks
- Returns dictionaries with results or error messages
**Initialization Flow:**
1. Import dependencies (`httpx`, `mcp.server.fastmcp`)
2. Define `ScapeGraphClient` class
3. Create `FastMCP` server instance
4. Initialize `ScapeGraphClient` with API key from env or config
5. Define MCP tools with `@mcp.tool()` decorators
6. Start server with `mcp.run(transport="stdio")`
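A minimal sketch of this flow, assuming the `ScapeGraphClient` sketched above (the tool body and config handling are illustrative, not the verbatim server code):
```python
import os

from mcp.server.fastmcp import FastMCP

# Assumes the ScapeGraphClient class sketched earlier in this section
mcp = FastMCP("ScapeGraph API MCP Server")

# Single shared client instance, initialized from the environment
api_key = os.environ.get("SGAI_API_KEY")
scrapegraph_client = ScapeGraphClient(api_key) if api_key else None


@mcp.tool()
def markdownify(website_url: str) -> dict:
    """Convert a webpage into clean, formatted markdown."""
    if scrapegraph_client is None:
        return {"error": "ScapeGraph client not initialized"}
    try:
        return scrapegraph_client.markdownify(website_url)
    except Exception as e:  # return an error dict instead of raising
        return {"error": str(e)}


def main() -> None:
    mcp.run(transport="stdio")


if __name__ == "__main__":
    main()
```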
### Design Patterns
**1. Wrapper Pattern**
- `ScapeGraphClient` wraps the ScrapeGraphAI REST API
- Simplifies API interactions with typed methods
- Centralizes authentication and error handling
**2. Decorator Pattern**
- `@mcp.tool()` decorators expose functions as MCP tools
- Automatic serialization/deserialization
- Type hints → MCP schema generation
**3. Singleton Pattern**
- Single `scrapegraph_client` instance
- Shared across all tool invocations
- Reused HTTP client connection
**4. Error Handling Pattern**
- Try/except blocks in all tool functions
- Return error dictionaries instead of raising exceptions
- Ensures graceful degradation for AI assistants
---
## MCP Tools
The server exposes 5 tools to AI assistants:
### 1. `markdownify(website_url: str)`
**Purpose:** Convert a webpage into clean, formatted markdown
**Parameters:**
- `website_url` (str) - URL of the webpage to convert
**Returns:**
```json
{
"result": "# Page Title\n\nContent in markdown format..."
}
```
**Error Response:**
```json
{
"error": "Error 404: Not Found"
}
```
**Example Usage (from AI):**
```
"Convert https://scrapegraphai.com to markdown"
→ AI calls: markdownify("https://scrapegraphai.com")
```
**API Endpoint:** `POST /v1/markdownify`
**Credits:** 2 credits per request
---
### 2. `smartscraper(user_prompt: str, website_url: str, number_of_scrolls: int = None, markdown_only: bool = None)`
**Purpose:** Extract structured data from a webpage using AI
**Parameters:**
- `user_prompt` (str) - Instructions for what data to extract
- `website_url` (str) - URL of the webpage to scrape
- `number_of_scrolls` (int, optional) - Number of infinite scrolls to perform
- `markdown_only` (bool, optional) - Return only markdown without AI processing
**Returns:**
```json
{
"result": {
"extracted_field_1": "value1",
"extracted_field_2": "value2"
}
}
```
**Example Usage:**
```
"Extract all product names and prices from https://example.com/products"
→ AI calls: smartscraper(
user_prompt="Extract product names and prices",
website_url="https://example.com/products"
)
```
**API Endpoint:** `POST /v1/smartscraper`
**Credits:** 10 credits (base) + 1 credit per scroll + additional charges
---
### 3. `searchscraper(user_prompt: str, num_results: int = None, number_of_scrolls: int = None)`
**Purpose:** Perform AI-powered web searches with structured results
**Parameters:**
- `user_prompt` (str) - Search query or instructions
- `num_results` (int, optional) - Number of websites to search (default: 3, which costs 30 credits)
- `number_of_scrolls` (int, optional) - Number of infinite scrolls per website
**Returns:**
```json
{
"result": {
"answer": "Aggregated answer from multiple sources",
"sources": [
{"url": "https://source1.com", "data": {...}},
{"url": "https://source2.com", "data": {...}}
]
}
}
```
**Example Usage:**
```
"Research the latest AI developments in 2025"
→ AI calls: searchscraper(
user_prompt="Latest AI developments in 2025",
num_results=5
)
```
**API Endpoint:** `POST /v1/searchscraper`
**Credits:** Variable (3-20 websites × 10 credits per website)
---
### 4. `smartcrawler_initiate(url: str, prompt: str = None, extraction_mode: str = "ai", depth: int = None, max_pages: int = None, same_domain_only: bool = None)`
**Purpose:** Initiate intelligent multi-page web crawling (asynchronous)
**Parameters:**
- `url` (str) - Starting URL to crawl
- `prompt` (str, optional) - AI prompt for data extraction (required for AI mode)
- `extraction_mode` (str) - "ai" for AI extraction (10 credits/page) or "markdown" for markdown conversion (2 credits/page)
- `depth` (int, optional) - Maximum link traversal depth
- `max_pages` (int, optional) - Maximum number of pages to crawl
- `same_domain_only` (bool, optional) - Crawl only within the same domain
**Returns:**
```json
{
"request_id": "uuid-here",
"status": "processing"
}
```
**Example Usage:**
```
"Crawl https://docs.python.org and extract all function signatures"
→ AI calls: smartcrawler_initiate(
url="https://docs.python.org",
prompt="Extract function signatures and descriptions",
extraction_mode="ai",
max_pages=50,
same_domain_only=True
)
```
**API Endpoint:** `POST /v1/crawl`
**Credits:** 100 credits (base) + 10 credits per page (AI mode) or 2 credits per page (markdown mode)
**Note:** This is an asynchronous operation. Use `smartcrawler_fetch_results()` to retrieve results.
---
### 5. `smartcrawler_fetch_results(request_id: str)`
**Purpose:** Fetch the results of a SmartCrawler operation
**Parameters:**
- `request_id` (str) - The request ID returned by `smartcrawler_initiate()`
**Returns (while processing):**
```json
{
"status": "processing",
"pages_processed": 15,
"total_pages": 50
}
```
**Returns (completed):**
```json
{
"status": "completed",
"results": [
{"url": "https://page1.com", "data": {...}},
{"url": "https://page2.com", "data": {...}}
],
"pages_processed": 50,
"total_pages": 50
}
```
**Example Usage:**
```
AI: "Check the status of crawl request abc-123"
→ AI calls: smartcrawler_fetch_results("abc-123")
If status is "processing":
→ AI: "Still processing, 15/50 pages completed"
If status is "completed":
→ AI: "Crawl complete! Here are the results..."
```
**API Endpoint:** `GET /v1/crawl/{request_id}`
**Polling Strategy:**
- AI assistants should poll this endpoint until `status == "completed"`
- Recommended polling interval: 5-10 seconds
- Maximum wait time: ~30 minutes for large crawls
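A minimal polling helper following this strategy might look like the sketch below (illustrative; assumes a `ScapeGraphClient` instance exposing `smartcrawler_fetch_results()`):
```python
import time
from typing import Any, Dict


def wait_for_crawl(client: "ScapeGraphClient", request_id: str,
                   interval: float = 10.0, max_wait: float = 1800.0) -> Dict[str, Any]:
    """Poll smartcrawler_fetch_results() until the crawl completes."""
    deadline = time.monotonic() + max_wait  # ~30 minutes by default
    while time.monotonic() < deadline:
        result = client.smartcrawler_fetch_results(request_id)
        if result.get("status") == "completed":
            return result
        time.sleep(interval)  # recommended 5-10 second interval
    raise TimeoutError(f"Crawl {request_id} did not finish within {max_wait}s")
```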
---
## API Integration
### ScrapeGraphAI API
**Base URL:** `https://api.scrapegraphai.com/v1`
**Authentication:**
- Header: `SGAI-APIKEY: your-api-key`
- Obtain API key from: [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)
**Endpoints Used:**
| Endpoint | Method | Tool |
|----------|--------|------|
| `/v1/markdownify` | POST | `markdownify()` |
| `/v1/smartscraper` | POST | `smartscraper()` |
| `/v1/searchscraper` | POST | `searchscraper()` |
| `/v1/crawl` | POST | `smartcrawler_initiate()` |
| `/v1/crawl/{request_id}` | GET | `smartcrawler_fetch_results()` |
**Request Format:**
```json
{
"website_url": "https://example.com",
"user_prompt": "Extract product names"
}
```
**Response Format:**
```json
{
"result": {...},
"credits_used": 10
}
```
**Error Handling:**
```python
response = self.client.post(url, headers=self.headers, json=data)
if response.status_code != 200:
error_msg = f"Error {response.status_code}: {response.text}"
raise Exception(error_msg)
return response.json()
```
**HTTP Client Configuration:**
- Library: `httpx`
- Timeout: 60 seconds
- Synchronous client (not async)
### Credit System
The MCP server is a pass-through to the ScrapeGraphAI API, so all credit costs are determined by the API:
- **Markdownify:** 2 credits
- **SmartScraper:** 10 credits (base) + variable
- **SearchScraper:** Variable (websites × 10 credits)
- **SmartCrawler (AI mode):** 100 + (pages × 10) credits
- **SmartCrawler (Markdown mode):** 100 + (pages × 2) credits
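For example, a 50-page crawl in AI mode costs 100 + (50 × 10) = 600 credits, while the same crawl in markdown mode costs 100 + (50 × 2) = 200 credits.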
Credits are deducted from the API key balance on the ScrapeGraphAI platform.
---
## Deployment
### Installation Methods
#### 1. Automated Installation via Smithery
**Smithery** is the recommended deployment method for MCP servers.
```bash
npx -y @smithery/cli install @ScrapeGraphAI/scrapegraph-mcp --client claude
```
This automatically:
- Installs the MCP server
- Configures the AI client (Claude Desktop)
- Prompts for API key
#### 2. Manual Claude Desktop Configuration
**macOS:** `~/Library/Application Support/Claude/claude_desktop_config.json`
**Windows:** `%APPDATA%/Claude/claude_desktop_config.json`
```json
{
"mcpServers": {
"@ScrapeGraphAI-scrapegraph-mcp": {
"command": "npx",
"args": [
"-y",
"@smithery/cli@latest",
"run",
"@ScrapeGraphAI/scrapegraph-mcp",
"--config",
"{\"scrapegraphApiKey\":\"YOUR-SGAI-API-KEY\"}"
]
}
}
}
```
**Windows-Specific Command:**
```bash
C:\Windows\System32\cmd.exe /c npx -y @smithery/cli@latest run @ScrapeGraphAI/scrapegraph-mcp --config "{\"scrapegraphApiKey\":\"YOUR-SGAI-API-KEY\"}"
```
#### 3. Cursor Integration
Add the MCP server in Cursor settings:
1. Open Cursor settings
2. Navigate to MCP section
3. Add ScrapeGraphAI MCP server
4. Configure API key
(See `assets/cursor_mcp.png` for screenshot)
#### 4. Docker Deployment
**Build:**
```bash
docker build -t scrapegraph-mcp .
```
**Run:**
```bash
docker run -e SGAI_API_KEY=your-api-key scrapegraph-mcp
```
**Dockerfile:**
- Base: Python 3.12 Alpine
- Build deps: gcc, musl-dev, libffi-dev
- Install via pip: `pip install .`
- Entrypoint: `scrapegraph-mcp`
#### 5. Python Package Installation
**From PyPI (once published):**
```bash
pip install scrapegraph-mcp
```
**From Source:**
```bash
git clone https://github.com/ScrapeGraphAI/scrapegraph-mcp
cd scrapegraph-mcp
pip install .
```
**Run:**
```bash
export SGAI_API_KEY=your-api-key
scrapegraph-mcp
```
### Configuration
**API Key Sources (in order of precedence):**
1. `--config` parameter (Smithery): `"{\"scrapegraphApiKey\":\"key\"}"`
2. Environment variable: `SGAI_API_KEY`
3. Default: `None` (the client is never initialized, and every tool call returns an error)
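As a sketch of this precedence (illustrative; the real server's argument handling may differ):
```python
import json
import os


def resolve_api_key(config_json: str | None = None) -> str | None:
    """Resolve the API key using the precedence order above."""
    if config_json:  # 1. Smithery --config JSON, e.g. '{"scrapegraphApiKey": "..."}'
        key = json.loads(config_json).get("scrapegraphApiKey")
        if key:
            return key
    return os.environ.get("SGAI_API_KEY")  # 2. env var; otherwise 3. None
```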
**Server Transport:**
- **stdio** - Standard input/output (default for MCP)
- Communication via JSON-RPC over stdin/stdout
### Production Considerations
**Error Handling:**
- All tool functions return error dictionaries instead of raising exceptions
- Prevents server crashes on API errors
- Graceful degradation for AI assistants
**Timeout:**
- 60-second timeout for all API requests
- Prevents hanging on slow websites
- Consider increasing for large crawls
**API Key Security:**
- Never commit API keys to version control
- Use environment variables or config files
- Rotate keys periodically
**Rate Limiting:**
- Handled by the ScrapeGraphAI API
- MCP server has no built-in rate limiting
- Consider implementing client-side throttling for high-volume use
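One way to add such throttling on the caller's side (illustrative; not part of this server) is a lock-guarded minimum-interval gate:
```python
import threading
import time


class Throttle:
    """Allow at most one call per `min_interval` seconds across threads."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._last_call = 0.0

    def wait(self) -> None:
        with self._lock:
            delay = self._last_call + self.min_interval - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            self._last_call = time.monotonic()
```
Calling `throttle.wait()` before each `scrapegraph_client` request caps the request rate without changing any tool logic.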
---
## Recent Updates
### October 2025
**SmartCrawler Integration (Latest):**
- Added `smartcrawler_initiate()` tool for multi-page crawling
- Added `smartcrawler_fetch_results()` tool for async result retrieval
- Support for AI extraction mode (10 credits/page) and markdown mode (2 credits/page)
- Configurable depth, max_pages, and same_domain_only parameters
- Enhanced error handling for extraction mode validation
**Recent Commits:**
- `aebeebd` - Merge PR #5: Update to new features and add SmartCrawler
- `b75053d` - Merge PR #4: Fix SmartCrawler issues
- `54b330d` - Enhance error handling in ScapeGraphClient for extraction modes
- `b3139dc` - Refactor web crawling methods to SmartCrawler terminology
- `94173b0` - Add MseeP.ai security assessment badge
- `53c2d99` - Add MCP server badge
**Key Features:**
1. **SmartCrawler Support** - Multi-page crawling with AI or markdown modes
2. **Enhanced Error Handling** - Validation for extraction modes and prompts
3. **Async Operation Support** - Initiate/fetch pattern for long-running crawls
4. **Security Badges** - MseeP.ai security assessment and MCP server badges
---
## Development
### Running Locally
**Prerequisites:**
- Python 3.10+
- pip or pipx
**Install Dependencies:**
```bash
pip install -e ".[dev]"
```
**Run Server:**
```bash
export SGAI_API_KEY=your-api-key
python -m scrapegraph_mcp.server
# or
scrapegraph-mcp
```
**Test with MCP Inspector:**
```bash
npx @modelcontextprotocol/inspector scrapegraph-mcp
```
### Code Quality
**Linting:**
```bash
ruff check src/
```
**Type Checking:**
```bash
mypy src/
```
**Configuration:**
- **Ruff:** Line length 100, target Python 3.12, rules: E, F, I, B, W
- **mypy:** Python 3.12, strict mode, disallow untyped defs
### Project Structure Best Practices
**Single-File Architecture:**
- All code in `src/scrapegraph_mcp/server.py`
- Simple, easy to understand
- Minimal dependencies
- No complex abstractions
**When to Refactor:**
- If adding 5+ new tools, consider splitting into modules
- If adding authentication logic, create separate auth module
- If adding caching, create separate cache module
---
## Testing
### Manual Testing
**Note:** MCP servers speak JSON-RPC 2.0 over stdio and expect an `initialize` handshake before `tools/call`, so the one-shot pipes below show the request shape rather than a complete session; prefer the MCP Inspector above for interactive testing.
**Test markdownify:**
```bash
echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"markdownify","arguments":{"website_url":"https://scrapegraphai.com"}}}' | scrapegraph-mcp
```
**Test smartscraper:**
```bash
echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"smartscraper","arguments":{"user_prompt":"Extract main features","website_url":"https://scrapegraphai.com"}}}' | scrapegraph-mcp
```
**Test searchscraper:**
```bash
echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"searchscraper","arguments":{"user_prompt":"Latest AI news"}}}' | scrapegraph-mcp
```
### Integration Testing
**Claude Desktop:**
1. Configure MCP server in Claude Desktop
2. Restart Claude
3. Ask: "Convert https://scrapegraphai.com to markdown"
4. Verify tool is called and results returned
**Cursor:**
1. Add MCP server in settings
2. Test with chat prompts
3. Verify tool integration
---
## Troubleshooting
### Common Issues
**Issue: "ScapeGraph client not initialized"**
- **Cause:** Missing API key
- **Solution:** Set `SGAI_API_KEY` environment variable or pass via `--config`
**Issue: "Error 401: Unauthorized"**
- **Cause:** Invalid API key
- **Solution:** Verify API key at [dashboard.scrapegraphai.com](https://dashboard.scrapegraphai.com)
**Issue: "Error 402: Payment Required"**
- **Cause:** Insufficient credits
- **Solution:** Add credits to your account
**Issue: "Error 504: Gateway Timeout"**
- **Cause:** Website took too long to scrape
- **Solution:** Retry or use `markdown_only=True` for faster processing
**Issue: Windows cmd.exe not found**
- **Cause:** Smithery can't find Windows command prompt
- **Solution:** Use full path `C:\Windows\System32\cmd.exe`
**Issue: SmartCrawler not returning results**
- **Cause:** Still processing (async operation)
- **Solution:** Keep polling `smartcrawler_fetch_results()` until `status == "completed"`
---
## Contributing
### Adding New Tools
1. Add method to `ScapeGraphClient` class:
```python
def new_tool(self, param: str) -> Dict[str, Any]:
"""Tool description."""
url = f"{self.BASE_URL}/new-endpoint"
data = {"param": param}
response = self.client.post(url, headers=self.headers, json=data)
if response.status_code != 200:
raise Exception(f"Error {response.status_code}: {response.text}")
return response.json()
```
2. Add MCP tool decorator:
```python
@mcp.tool()
def new_tool(param: str) -> Dict[str, Any]:
"""Tool description for AI."""
if scrapegraph_client is None:
return {"error": "Client not initialized"}
try:
return scrapegraph_client.new_tool(param)
except Exception as e:
return {"error": str(e)}
```
3. Update documentation:
- Add tool to [MCP Tools](#mcp-tools) section
- Update README.md
- Update API integration section
### Submitting Changes
1. Fork the repository
2. Create a feature branch
3. Make changes
4. Run linting and type checking
5. Test with Claude Desktop or Cursor
6. Submit pull request
---
## License
This project is distributed under the MIT License. See [LICENSE](../../LICENSE) file for details.
---
## Acknowledgments
- **[tomekkorbak](https://github.com/tomekkorbak)** - For [oura-mcp-server](https://github.com/tomekkorbak/oura-mcp-server) implementation inspiration
- **[Model Context Protocol](https://modelcontextprotocol.io/)** - For the MCP specification
- **[Smithery](https://smithery.ai/)** - For MCP server distribution platform
- **[ScrapeGraphAI Team](https://scrapegraphai.com)** - For the API and support
---
**Made with ❤️ by [ScrapeGraphAI](https://scrapegraphai.com) Team**