Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@Dev Tool MCP crawl the latest blog posts from techcrunch.com and save as markdown"
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
MCP Server for Web Crawling
An MCP (Model Context Protocol) server that provides web crawling capabilities using crawl4ai. It supports multiple output formats (HTML, JSON, PDF, screenshots, Markdown) and browser interaction features.
Quick Start
To launch the server, you have several options (example commands follow the list):
Option 1: Using the Python launch script
Option 2: Using the shell script (Unix-based systems)
Option 3: Using the batch script (Windows)
Option 4: Direct execution
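The exact script names depend on the repository contents; the commands below are a sketch that assumes launch scripts named run_server.py, run_server.sh, and run_server.bat, plus the dev-tool-mcp console script described later in this README:

```bash
# Option 1: Python launch script (file name is an assumption)
python run_server.py

# Option 2: shell script on Unix-based systems (file name is an assumption)
./run_server.sh

# Option 3: batch script on Windows (file name is an assumption)
run_server.bat

# Option 4: direct execution via the installed console script
dev-tool-mcp
```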
MCP Configuration
To configure this server with MCP, you have multiple options depending on your setup:
Method 1: Using the launch script (Recommended)
This is the recommended approach as it handles virtual environment setup and dependency installation automatically:
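A minimal client configuration sketch, using the mcpServers format common to MCP clients such as Claude Desktop; the launch script name run_server.sh is an assumption, so substitute the script actually shipped with the project:

```json
{
  "mcpServers": {
    "dev-tool-mcp": {
      "command": "/absolute/path/to/your/dev-tool-mcp/run_server.sh",
      "args": []
    }
  }
}
```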
Make sure to replace /absolute/path/to/your/dev-tool-mcp with the actual absolute path to your project directory.
Method 2: Using the virtual environment Python directly
If you prefer to run the server directly with the virtual environment:
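A sketch of this variant, assuming the virtual environment lives in .venv and the server module is dev_tool_mcp.server (both names are assumptions; adjust them to your layout):

```json
{
  "mcpServers": {
    "dev-tool-mcp": {
      "command": "/absolute/path/to/your/dev-tool-mcp/.venv/bin/python",
      "args": ["-m", "dev_tool_mcp.server"]
    }
  }
}
```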
Method 3: Using the installed console script
If the package is installed in the virtual environment, you can use the console script:
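For example, pointing the client at the console script installed inside the virtual environment (the .venv directory name is an assumption):

```json
{
  "mcpServers": {
    "dev-tool-mcp": {
      "command": "/absolute/path/to/your/dev-tool-mcp/.venv/bin/dev-tool-mcp",
      "args": []
    }
  }
}
```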
Method 4: Windows Configuration
For Windows systems, use the batch script:
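A sketch assuming the batch script is named run_server.bat (the name is an assumption); note the doubled backslashes required inside JSON strings:

```json
{
  "mcpServers": {
    "dev-tool-mcp": {
      "command": "C:\\absolute\\path\\to\\your\\dev-tool-mcp\\run_server.bat",
      "args": []
    }
  }
}
```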
Note: The launch scripts (Method 1 on Unix/Linux/Mac and Method 4 on Windows) are recommended because they will:
Check if the virtual environment exists
Create it if needed
Install dependencies from pyproject.toml
Activate the environment
Start the server with all necessary dependencies
This ensures a consistent and reliable setup for the MCP server.
Features
Web Crawling: Advanced web crawling capabilities using crawl4ai
Multiple Output Formats: Support for HTML, JSON, PDF, screenshots, and Markdown output
Browser Interaction: Get page content, console messages, and network requests
LLM Integration: Supports LLM extraction strategies for content processing
File Download: Automatic download and saving of files found on crawled pages
Progress Tracking: Streaming progress updates during crawling operations
Security: URL validation and sanitization to prevent security issues
Architecture
The MCP server is structured into the following modules:
Core Components
Server: Implements MCP protocol with tool registration and execution
BrowserService: Manages Playwright browser instances for page content and network monitoring
Crawler: Uses crawl4ai for advanced web crawling with multiple output formats
Utils: Provides file handling utilities for saving content
Installation
Prerequisites
Python 3.8 or higher
pip package manager
System dependencies for Playwright (Chromium browser)
Steps
Clone the repository:
git clone <repository-url>
cd mcp-server
Install the package and dependencies:
pip install -e .
Install Playwright browsers:
python -m playwright install chromium
Dependencies
The server requires the following dependencies (automatically installed via pip):
crawl4ai>=0.7.7: Web crawling library
pydantic>=2.0.0: Data validation
mcp==1.0.0: Model Context Protocol implementation
httpx[socks]: HTTP client
litellm: LLM interface
beautifulsoup4>=4.12.2: HTML parsing
lxml>=4.9.3: XML/HTML processing
sentencepiece: Text processing
playwright>=1.40.0: Browser automation
Configuration
Environment Variables
The server uses the following environment variables (optional):
TEST_URL: URL for testing (used in test files)
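For example, to point the tests at a different page before running them (the value is only an illustration):

```bash
export TEST_URL="https://example.com"
```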
Available Tools
The server exposes the following tools via MCP:
say_hello
Description: A simple greeting tool that returns personalized messages to users
Parameters:
name (string, optional): The name to greet, defaults to "World"
Returns: Greeting message text
echo_message
Description: Echo tool that returns user-provided information as-is
Parameters:
message (string, required): The message to echo back
Returns: Echoed message text
crawl_web_page
Description: Crawl web page content and save in multiple formats (HTML, JSON, PDF, screenshots) while downloading file resources from the page
Parameters:
url (string, required): The URL of the web page to crawl
save_path (string, required): The base file path to save the crawled content and downloaded files
instruction (string, optional): The instruction to use for the LLM (default: "")
save_screenshot (boolean, optional): Save a screenshot of the page (default: false)
save_pdf (boolean, optional): Save a PDF of the page (default: false)
generate_markdown (boolean, optional): Generate a Markdown representation of the page (default: false)
Returns: Success message with file count and save location
get_page_content
Description: Get the complete content of the webpage at a specified URL, including HTML structure and page data
Parameters:
url (string, required): The URL of the web page to get content from
wait_for_selector (string, optional): Optional CSS selector to wait for before getting content
wait_timeout (integer, optional): Wait timeout in milliseconds, default 30000
Returns: JSON object containing page content, title, HTML, text, metadata, links, and images
get_console_messages
Description: Capture console output from the webpage at a specified URL (including logs, warnings, errors, etc.)
Parameters:
url (string, required): The URL of the web page to get console messages from
wait_for_selector (string, optional): Optional CSS selector to wait for before getting console messages
wait_timeout (integer, optional): Wait timeout in milliseconds, default 30000
Returns: JSON object containing console messages with type, text, location, and stack information
get_network_requests
Description: Monitor and retrieve all network requests initiated by the webpage at a specified URL (API calls, resource loading, etc.)
Parameters:
url (string, required): The URL of the web page to get network requests from
wait_for_selector (string, optional): Optional CSS selector to wait for before getting network requests
wait_timeout (integer, optional): Wait timeout in milliseconds, default 30000
Returns: JSON object containing requests and responses with URLs, status, headers, and timing information
Usage
Running the Server
The server can be started using the console script defined in pyproject.toml:
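The console script shares the package name:

```bash
dev-tool-mcp
```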
Or directly via Python:
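The module path below is an assumption (dev_tool_mcp.server); check pyproject.toml for the actual entry point:

```bash
python -m dev_tool_mcp.server
```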
The server uses stdio for MCP communication, making it compatible with MCP clients.
Tool Examples
Crawling a Web Page
To crawl a web page and save content in multiple formats:
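An illustrative set of arguments for the crawl_web_page tool (the URL and save path are placeholders):

```json
{
  "url": "https://example.com/blog",
  "save_path": "/path/to/output",
  "instruction": "Extract the main article text",
  "save_screenshot": true,
  "save_pdf": true,
  "generate_markdown": true
}
```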
This will create a timestamped subdirectory with:
output.html - Page HTML content
output.json - Page content in JSON format
output.png - Screenshot of the page (if requested)
output.pdf - PDF of the page (if requested)
raw_markdown.md - Markdown representation of the page (if requested)
downloaded_files.json - List of downloaded files
files/ - Directory containing downloaded files
Getting Page Content
To retrieve page content:
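Example arguments for get_page_content (placeholder values):

```json
{
  "url": "https://example.com",
  "wait_for_selector": "#main-content",
  "wait_timeout": 30000
}
```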
Getting Console Messages
To capture console messages from a page:
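Example arguments for get_console_messages (placeholder values):

```json
{
  "url": "https://example.com",
  "wait_timeout": 30000
}
```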
Getting Network Requests
To monitor network requests made by a page:
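Example arguments for get_network_requests (placeholder values):

```json
{
  "url": "https://example.com",
  "wait_for_selector": "body",
  "wait_timeout": 30000
}
```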
Testing
The project includes comprehensive tests for both browser and crawler functionality:
Run Tests
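Assuming the tests are driven by pytest from the project root (the use of pytest is an assumption; the file names come from the coverage list below):

```bash
pytest test_crawler.py test_browser.py -v
```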
Test Coverage
test_crawler.py: Tests the complete crawler functionality, including file saving and format generation
test_browser.py: Tests browser service functions for page content, console messages, and network requests
The crawler test specifically verifies:
HTML, JSON, PDF, screenshot, and Markdown file generation
File download and saving functionality
Proper error handling and directory creation
Deployment
Production Deployment
Install the package in your production environment:
pip install dev-tool-mcp
Install Playwright browsers:
python -m playwright install chromium --with-deps
Run the server:
dev-tool-mcp
Docker Deployment (Optional)
Create a Dockerfile for containerized deployment:
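A minimal sketch, assuming a Debian-based Python 3.11 image and the same pip and Playwright steps shown above (the base image tag and working directory are assumptions):

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install the package and its dependencies
COPY . .
RUN pip install --no-cache-dir .

# Install Chromium plus the system libraries Playwright needs
RUN python -m playwright install chromium --with-deps

# The server speaks MCP over stdio, so no port is exposed
CMD ["dev-tool-mcp"]
```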
Security
URL Validation
The server includes security measures to prevent malicious URLs:
URL length limit (2048 characters)
Protocol validation (only http/https allowed)
Input sanitization to prevent injection attacks
Validation to prevent access to local network addresses
Browser Security
Playwright runs in headless mode by default
Security flags are set to disable dangerous features
Browser runs with limited privileges
File System Security
Files are saved only to explicitly specified paths
No arbitrary file system access is allowed
Temporary files are properly cleaned up
Troubleshooting
Common Issues
Playwright Browser Not Found
If you encounter browser-related errors:
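Reinstall the Chromium browser used by Playwright:

```bash
python -m playwright install chromium
```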
Permission Errors
Ensure the server has write permissions to the specified save paths:
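For example (the save path is a placeholder):

```bash
chmod -R u+w /path/to/save/directory
```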
Network Issues
For network-related crawling issues, verify:
The target URL is accessible
No firewall restrictions exist
Appropriate timeout values are set
Memory Usage
Large page crawls can consume significant memory. For large-scale crawling:
Monitor memory usage
Process pages sequentially rather than in parallel
Clean up temporary files regularly
Debugging
Enable verbose logging to troubleshoot issues:
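The project does not document a dedicated logging switch here, so the one-liner below is only a sketch: it assumes the server is importable as dev_tool_mcp.server with a main() entry point and that it uses Python's standard logging module. Adjust the module and function names to the actual code:

```bash
python -c "import logging; logging.basicConfig(level=logging.DEBUG); from dev_tool_mcp.server import main; main()"
```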
Version Information
Package: dev-tool-mcp
Version: 0.1.0
Python Support: 3.8+
Dependencies: See pyproject.toml for exact versions
License
This project is licensed under the terms specified in the pyproject.toml file. See the project repository for full license details.