# crawl-mcp-server
A comprehensive MCP (Model Context Protocol) server providing **11 powerful tools** for web crawling and search. Transform web content into clean, LLM-optimized Markdown or search the web with SearXNG integration.
[CI](https://github.com/Git-Fg/searchcrawl-mcp-server/actions) · [Coverage](https://codecov.io/gh/Git-Fg/searchcrawl-mcp-server)
## Features
- **SearXNG Web Search** - Search the web with automatic browser management
- **4 Crawling Tools** - Extract and convert web content to Markdown
- **Auto-Browser-Launch** - Search tools automatically manage the browser lifecycle
- **11 Total Tools** - Complete toolkit for web interaction
- **Built-in Caching** - SHA-256-based caching with graceful fallbacks
- **Concurrent Processing** - Handle up to 50 URLs per batch
- **LLM-Optimized Output** - Clean Markdown suited for AI consumption
- **Robust Error Handling** - Graceful failure with detailed error messages
- **Comprehensive Testing** - Full CI/CD with performance benchmarks
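The SHA-256 caching mentioned in the features can be sketched as a deterministic key derivation over the URL plus options. The key format below is an assumption for illustration, not the server's actual scheme:

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache-key derivation: identical (url, options) inputs always
// map to the same 64-character hex digest, so repeat fetches can hit the cache.
function cacheKey(url: string, options: Record<string, unknown> = {}): string {
  const payload = JSON.stringify({ url, options });
  return createHash("sha256").update(payload).digest("hex");
}

console.log(cacheKey("https://example.com") === cacheKey("https://example.com")); // true
```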
## Installation
### Method 1: npm (Recommended)
```bash
npm install crawl-mcp-server
```
### Method 2: Direct from Git
```bash
# Install latest from GitHub
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
# Or specific branch
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git#main
# Or from a fork
npm install git+https://github.com/YOUR_FORK/searchcrawl-mcp-server.git
```
### Method 3: Clone and Build
```bash
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
npm install
npm run build
```
### Method 4: npx (No Installation)
```bash
# Run directly without installing
npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
```
## Setup for Claude Code
### Option 1: Claude Desktop (Recommended)
Add to your Claude Desktop configuration file:
**macOS/Linux:** `~/.config/claude/claude_desktop_config.json`
```json
{
"mcpServers": {
"crawl-server": {
"command": "npx",
"args": [
"git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
],
"env": {
"NODE_ENV": "production"
}
}
}
}
```
**Windows:** `%APPDATA%\Claude\claude_desktop_config.json`
```json
{
"mcpServers": {
"crawl-server": {
"command": "npx",
"args": [
"git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
],
"env": {
"NODE_ENV": "production"
}
}
}
}
```
### Option 2: Local Installation
If you've installed locally:
```json
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": [
"/path/to/crawl-mcp-server/dist/index.js"
],
"env": {}
}
}
}
```
### Option 3: Custom Path
For a specific installation:
```json
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": [
"/usr/local/lib/node_modules/crawl-mcp-server/dist/index.js"
],
"env": {}
}
}
}
```
After configuration, restart Claude Desktop.
## Setup for Other MCP Clients
### Claude CLI
```bash
# Using npx
claude mcp add crawl-server npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
# Using local installation
claude mcp add crawl-server node /path/to/crawl-mcp-server/dist/index.js
```
### Zed Editor
Add to `~/.config/zed/settings.json`:
```json
{
"assistant": {
"mcp": {
"servers": {
"crawl-server": {
"command": "node",
"args": ["/path/to/crawl-mcp-server/dist/index.js"]
}
}
}
}
}
```
### VSCode with Copilot Chat
```json
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": ["/path/to/crawl-mcp-server/dist/index.js"]
}
}
}
```
## Quick Start
### Using MCP Inspector (Testing)
```bash
# Install MCP Inspector globally
npm install -g @modelcontextprotocol/inspector
# Run the server
node dist/index.js
# In another terminal, test tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list
```
### Development Mode
```bash
# Watch mode (auto-rebuild on changes)
npm run dev
# Build TypeScript
npm run build
# Run tests
npm run test:run
```
## Available Tools
### Search Tools (7 tools)
#### 1. search_searx
Search the web using SearXNG with automatic browser management.
```typescript
// Example call
{
"query": "TypeScript MCP server",
"maxResults": 10,
"category": "general",
"timeRange": "week",
"language": "en"
}
```
**Parameters:**
- `query` (string, required): Search query
- `maxResults` (number, default: 20): Results to return (1-50)
- `category` (enum, default: general): one of general, images, videos, news, map, music, it, science
- `timeRange` (enum, optional): one of day, week, month, year
- `language` (string, default: en): Language code
**Returns:** JSON with search results array, URLs, and metadata
---
#### 2. launch_chrome_cdp
Launch system Chrome with remote debugging for advanced SearX usage.
```typescript
{
"headless": true,
"port": 9222,
"userDataDir": "/path/to/profile"
}
```
**Parameters:**
- `headless` (boolean, default: true): Run Chrome headless
- `port` (number, default: 9222): Remote debugging port
- `userDataDir` (string, optional): Custom Chrome profile
---
#### 3. connect_cdp
Connect to remote CDP browser (Browserbase, etc.).
```typescript
{
"cdpWsUrl": "http://localhost:9222"
}
```
**Parameters:**
- `cdpWsUrl` (string, required): CDP WebSocket URL or HTTP endpoint
---
#### 4. launch_local
Launch bundled Chromium for SearX search.
```typescript
{
"headless": true,
"userAgent": "custom user agent string"
}
```
**Parameters:**
- `headless` (boolean, default: true): Run headless
- `userAgent` (string, optional): Custom user agent
---
#### 5. chrome_status
Check Chrome CDP status and health.
```typescript
{}
```
**Returns:** Running status, health, endpoint URL, and PID
---
#### 6. close
Close browser session (keeps Chrome CDP running).
```typescript
{}
```
---
#### 7. shutdown_chrome_cdp
Shutdown Chrome CDP and cleanup resources.
```typescript
{}
```
### Crawling Tools (4 tools)
#### 1. crawl_read (Simple & Fast)
Quick single-page extraction to Markdown.
```typescript
{
"url": "https://example.com/article",
"options": {
"timeout": 30000
}
}
```
**Best for:**
- News articles
- Blog posts
- Documentation pages
- Simple content extraction
**Returns:** Clean Markdown content
---
#### 2. crawl_read_batch (Multiple URLs)
Process 1-50 URLs concurrently.
```typescript
{
"urls": [
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3"
],
"options": {
"maxConcurrency": 5,
"timeout": 30000,
"maxResults": 10
}
}
```
**Best for:**
- Processing multiple articles
- Building content aggregates
- Bulk content extraction
**Returns:** Array of Markdown results with summary statistics
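Batch processing with a `maxConcurrency` cap can be sketched as a small promise pool. This is illustrative client-side logic, not the server's internal implementation:

```typescript
// Run an async function over items with at most `limit` in flight at once,
// preserving result order (mirrors the maxConcurrency option, capped at 20).
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index
      results[i] = await fn(items[i]);
    }
  }
  const workers = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workers }, () => worker()));
  return results;
}

mapWithLimit([1, 2, 3, 4], 2, async (n) => n * 10).then((r) => console.log(r)); // [10, 20, 30, 40]
```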
---
#### 3. crawl_fetch_markdown
Controlled single-page extraction with full option control.
```typescript
{
"url": "https://example.com/article",
"options": {
"timeout": 30000
}
}
```
**Best for:**
- Advanced crawling options
- Custom timeout control
- Detailed extraction
---
#### 4. crawl_fetch
Multi-page crawling with intelligent link extraction.
```typescript
{
"url": "https://example.com",
"options": {
"pages": 5,
"maxConcurrency": 3,
"sameOriginOnly": true,
"timeout": 30000,
"maxResults": 20
}
}
```
**Best for:**
- Crawling entire sites
- Link-based discovery
- Multi-page scraping
**Features:**
- Extracts links from starting page
- Crawls discovered pages
- Concurrent processing
- Same-origin filtering (configurable)
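The same-origin filtering described above can be sketched with the standard `URL` API; the server's actual filter may differ, but the idea is comparing origins after resolving each discovered link against the starting page:

```typescript
// Decide whether a discovered link stays within the starting page's origin.
// Relative links resolve against the base URL and therefore pass.
function sameOrigin(base: string, candidate: string): boolean {
  try {
    return new URL(candidate, base).origin === new URL(base).origin;
  } catch {
    return false; // unparseable base or link: filter it out
  }
}

console.log(sameOrigin("https://example.com/docs", "/about")); // true
console.log(sameOrigin("https://example.com", "https://other.com/page")); // false
```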
## Usage Examples
### Example 1: Search + Crawl Workflow
```typescript
// Step 1: Search for topics
{
"tool": "search_searx",
"arguments": {
"query": "TypeScript best practices 2024",
"maxResults": 5
}
}
// Step 2: Extract URLs from results
// (Parse the search results to get URLs)
// Step 3: Crawl selected articles
{
"tool": "crawl_read_batch",
"arguments": {
"urls": [
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3"
]
}
}
```
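Step 2 of this workflow, extracting URLs from the search results, can be sketched as below. The result field names (`results`, `url`) are assumptions about the search response shape, not the server's documented schema:

```typescript
// Hypothetical shape of a search_searx response (field names assumed).
interface SearchResult {
  title: string;
  url: string;
  snippet?: string;
}
interface SearchResponse {
  query: string;
  results: SearchResult[];
}

// Pull out up to `limit` URLs to feed into crawl_read_batch.
function extractUrls(response: SearchResponse, limit = 10): string[] {
  return response.results.slice(0, limit).map((r) => r.url);
}

const sample: SearchResponse = {
  query: "TypeScript MCP server",
  results: [
    { title: "A", url: "https://example.com/article1" },
    { title: "B", url: "https://example.com/article2" },
  ],
};
console.log(extractUrls(sample, 1)); // ["https://example.com/article1"]
```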
### Example 2: Batch Content Extraction
```typescript
{
"tool": "crawl_read_batch",
"arguments": {
"urls": [
"https://news.site/article1",
"https://news.site/article2",
"https://news.site/article3"
],
"options": {
"maxConcurrency": 10,
"timeout": 30000,
"maxResults": 3
}
}
}
```
### Example 3: Site Crawling
```typescript
{
"tool": "crawl_fetch",
"arguments": {
"url": "https://docs.example.com",
"options": {
"pages": 10,
"maxConcurrency": 5,
"sameOriginOnly": true,
"timeout": 30000,
"maxResults": 10
}
}
}
```
## Tool Selection Guide
| Use Case | Recommended Tool | Complexity |
|----------|----------------|------------|
| Single article | `crawl_read` | Simple |
| Multiple articles | `crawl_read_batch` | Simple |
| Advanced options | `crawl_fetch_markdown` | Medium |
| Site crawling | `crawl_fetch` | Complex |
| Web search | `search_searx` | Simple |
| Research workflow | `search_searx` → `crawl_read` | Medium |
## Architecture
### Core Components
```
┌─────────────────────────────────────────┐
│            crawl-mcp-server             │
├─────────────────────────────────────────┤
│                                         │
│  ┌───────────────────────────────────┐  │
│  │  MCP Server Core                  │  │
│  │  - 11 registered tools            │  │
│  │  - STDIO/HTTP transport           │  │
│  └───────────────────────────────────┘  │
│                   │                     │
│  ┌───────────────────────────────────┐  │
│  │  @just-every/crawl                │  │
│  │  - HTML → Markdown                │  │
│  │  - Mozilla Readability            │  │
│  │  - Concurrent crawling            │  │
│  └───────────────────────────────────┘  │
│                   │                     │
│  ┌───────────────────────────────────┐  │
│  │  Playwright (Browser)             │  │
│  │  - SearXNG integration            │  │
│  │  - Auto browser management        │  │
│  │  - Anti-detection                 │  │
│  └───────────────────────────────────┘  │
│                                         │
└─────────────────────────────────────────┘
```
### Technology Stack
- **Runtime**: Node.js 18+
- **Language**: TypeScript 5.7
- **Framework**: MCP SDK (@modelcontextprotocol/sdk)
- **Crawling**: @just-every/crawl
- **Browser**: Playwright Core
- **Validation**: Zod
- **Transport**: STDIO (local) + HTTP (remote)
### Data Flow
```
Client Request
      ↓
 MCP Protocol
      ↓
 Tool Handler
      ↓
┌─────────────────────┐
│  Crawl/Search       │
│  @just-every/crawl  │ → HTML content
│  or SearXNG         │ → Search results
└─────────────────────┘
      ↓
HTML → Markdown
      ↓
Result Formatting
      ↓
 MCP Response
      ↓
   Client
```
## Testing
### Run Test Suite
```bash
# All unit tests
npm run test:run
# Performance benchmarks
npm run test:performance
# Full CI suite
npm run test:ci
# Individual tool test
npx @modelcontextprotocol/inspector --cli node dist/index.js \
--method tools/call \
--tool-name crawl_read \
--tool-arg url="https://example.com"
```
### Test Coverage
- All 11 tools tested
- Error handling validated
- Performance benchmarks
- Integration workflows
- Multi-Node support (Node 18, 20, 22)
### CI/CD Pipeline
```
┌────────────────────────────────────┐
│           GitHub Actions           │
├────────────────────────────────────┤
│ 1. Test (Matrix: Node 18, 20, 22)  │
│ 2. Integration Tests (PR only)     │
│ 3. Performance Tests (main)        │
│ 4. Security Scan                   │
│ 5. Coverage Report                 │
└────────────────────────────────────┘
```
## Development
### Prerequisites
- Node.js 18 or higher
- npm or yarn
### Setup
```bash
# Clone the repository
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
# Install dependencies
npm install
# Build TypeScript
npm run build
# Run in development mode (watch)
npm run dev
```
### Development Commands
```bash
# Build project
npm run build
# Watch mode (auto-rebuild)
npm run dev
# Run tests
npm run test:run
# Lint code
npm run lint
# Type check
npm run typecheck
# Clean build artifacts
npm run clean
```
### Project Structure
```
crawl-mcp-server/
├── src/
│   ├── index.ts          # Main server (11 tools)
│   ├── types.ts          # TypeScript interfaces
│   └── cdp.ts            # Chrome CDP manager
├── test/
│   ├── run-tests.ts      # Unit test suite
│   ├── performance.ts    # Performance tests
│   └── config.ts         # Test configuration
├── dist/                 # Compiled JavaScript
├── .github/workflows/    # CI/CD pipeline
└── package.json
```
## Performance
### Benchmarks
| Operation | Avg Duration | Max Memory |
|-----------|-------------|------------|
| crawl_read | ~1500ms | 32MB |
| crawl_read_batch (2 URLs) | ~2500ms | 64MB |
| search_searx | ~4000ms | 128MB |
| crawl_fetch | ~2000ms | 48MB |
| tools/list | ~100ms | 8MB |
### Performance Features
- Concurrent request processing (up to 20 requests)
- Built-in caching (SHA-256)
- Automatic timeout management
- Memory optimization
- Resource cleanup
## Error Handling
All tools include comprehensive error handling:
- **Network errors**: Graceful degradation with error messages
- **Timeout handling**: Configurable timeouts
- **Partial failures**: Batch operations continue on individual failures
- **Structured errors**: Clear error codes and messages
- **Recovery**: Automatic retries where appropriate
Example error response:
```json
{
"content": [
{
"type": "text",
"text": "Error: Failed to fetch https://example.com: Timeout after 30000ms"
}
],
"structuredContent": {
"error": "Network timeout",
"url": "https://example.com",
"code": "TIMEOUT"
}
}
```
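Since batch operations continue past individual failures, a client typically partitions the results afterwards. The per-item shape below (`markdown`/`error` fields) is an assumption for illustration:

```typescript
// Hypothetical per-URL result from a batch crawl: either markdown or an error.
interface BatchItem {
  url: string;
  markdown?: string;
  error?: string;
}

// Separate successful extractions from failed ones so partial results
// remain usable even when some URLs time out.
function splitResults(items: BatchItem[]): { ok: BatchItem[]; failed: BatchItem[] } {
  return {
    ok: items.filter((i) => i.error === undefined),
    failed: items.filter((i) => i.error !== undefined),
  };
}

const { ok, failed } = splitResults([
  { url: "https://a.example", markdown: "# A" },
  { url: "https://b.example", error: "TIMEOUT" },
]);
console.log(ok.length, failed.length); // 1 1
```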
## Security
- **No API keys required** for basic crawling
- **Respect robots.txt** (configurable)
- **User agent rotation**
- **Rate limiting** (built-in via concurrency limits)
- **Input validation** (Zod schemas)
- **Dependency scanning** (npm audit, Snyk)
## Transport Modes
### STDIO (Default)
For local MCP clients:
```bash
node dist/index.js
```
### HTTP
For remote access:
```bash
TRANSPORT=http PORT=3000 node dist/index.js
```
Server runs on: `http://localhost:3000/mcp`
## Configuration
### Environment Variables
```bash
# Transport mode (stdio or http)
TRANSPORT=stdio
# HTTP port (when TRANSPORT=http)
PORT=3000
# Node environment
NODE_ENV=production
```
### Tool Configuration
Each tool accepts an `options` object:
```typescript
{
"timeout": 30000, // Request timeout (ms)
"maxConcurrency": 5, // Concurrent requests (1-20)
"maxResults": 10, // Limit results (1-50)
"respectRobots": false, // Respect robots.txt
"sameOriginOnly": true // Only same-origin URLs
}
```
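The documented ranges and defaults can be sketched as a normalization step. The server itself validates with Zod; the plain-TypeScript version below just mirrors the README's constraints for illustration:

```typescript
// Options as documented above; all fields optional on input.
interface CrawlOptions {
  timeout?: number;
  maxConcurrency?: number;
  maxResults?: number;
  respectRobots?: boolean;
  sameOriginOnly?: boolean;
}

// Apply the documented defaults and clamp values into their allowed ranges.
function normalizeOptions(opts: CrawlOptions = {}): Required<CrawlOptions> {
  const clamp = (v: number, lo: number, hi: number) => Math.min(hi, Math.max(lo, v));
  return {
    timeout: opts.timeout ?? 30000,
    maxConcurrency: clamp(opts.maxConcurrency ?? 5, 1, 20),
    maxResults: clamp(opts.maxResults ?? 10, 1, 50),
    respectRobots: opts.respectRobots ?? false,
    sameOriginOnly: opts.sameOriginOnly ?? true,
  };
}

console.log(normalizeOptions({ maxConcurrency: 99 }).maxConcurrency); // 20
```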
## Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make changes and add tests
4. Run tests: `npm run test:ci`
5. Commit: `git commit -m 'Add amazing feature'`
6. Push: `git push origin feature/amazing-feature`
7. Open a Pull Request
### Development Guidelines
- Follow TypeScript strict mode
- Add tests for new features
- Update documentation
- Run linting: `npm run lint`
- Ensure CI passes
## License
MIT License - see [LICENSE](LICENSE) file
## Acknowledgments
- [@just-every/crawl](https://github.com/just-every/crawl) - Web crawling
- [Model Context Protocol](https://modelcontextprotocol.io) - MCP specification
- [SearXNG](https://github.com/searxng/searxng) - Search aggregator
- [Playwright](https://playwright.dev) - Browser automation
## Support
- **Issues**: [GitHub Issues](https://github.com/Git-Fg/searchcrawl-mcp-server/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Git-Fg/searchcrawl-mcp-server/discussions)
- **Email**: your-email@example.com
## What's Next?
- [ ] Add DuckDuckGo search support
- [ ] Implement content filtering
- [ ] Add screenshot capabilities
- [ ] Support for authenticated content
- [ ] PDF extraction
- [ ] Real-time monitoring
---
**Built with TypeScript, MCP, and modern web technologies.**