crawl-mcp-server

A comprehensive MCP (Model Context Protocol) server providing 11 powerful tools for web crawling and search. Transform web content into clean, LLM-optimized Markdown or search the web with SearXNG integration.


✨ Features

  • 🔍 SearXNG Web Search - Search the web with automatic browser management

  • 📄 4 Crawling Tools - Extract and convert web content to Markdown

  • 🚀 Auto-Browser-Launch - Search tools automatically manage browser lifecycle

  • 📦 11 Total Tools - Complete toolkit for web interaction

  • 💾 Built-in Caching - SHA-256 based caching with graceful fallbacks

  • ⚡ Concurrent Processing - Handle multiple URLs simultaneously (up to 50)

  • 🎯 LLM-Optimized Output - Clean Markdown perfect for AI consumption

  • 🛡️ Robust Error Handling - Graceful failure with detailed error messages

  • 🧪 Comprehensive Testing - Full CI/CD with performance benchmarks

📦 Installation

Method 1: npm

npm install crawl-mcp-server

Method 2: Direct from Git

# Install latest from GitHub
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

# Or specific branch
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git#main

# Or from a fork
npm install git+https://github.com/YOUR_FORK/searchcrawl-mcp-server.git

Method 3: Clone and Build

git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
npm install
npm run build

Method 4: npx (No Installation)

# Run directly without installing
npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

🔧 Setup for Claude Desktop

Add to your Claude Desktop configuration file:

Option 1: npx

macOS/Linux: ~/.config/claude/claude_desktop_config.json

{
  "mcpServers": {
    "crawl-server": {
      "command": "npx",
      "args": ["git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"],
      "env": { "NODE_ENV": "production" }
    }
  }
}

Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "crawl-server": {
      "command": "npx",
      "args": ["git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"],
      "env": { "NODE_ENV": "production" }
    }
  }
}

Option 2: Local Installation

If you've installed locally:

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": ["/path/to/crawl-mcp-server/dist/index.js"],
      "env": {}
    }
  }
}

Option 3: Custom Path

For a specific installation:

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": ["/usr/local/lib/node_modules/crawl-mcp-server/dist/index.js"],
      "env": {}
    }
  }
}

After configuration, restart Claude Desktop.

🔧 Setup for Other MCP Clients

Claude CLI

# Using npx
claude mcp add crawl-server npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

# Using local installation
claude mcp add crawl-server node /path/to/crawl-mcp-server/dist/index.js

Zed Editor

Add to ~/.config/zed/settings.json:

{
  "assistant": {
    "mcp": {
      "servers": {
        "crawl-server": {
          "command": "node",
          "args": ["/path/to/crawl-mcp-server/dist/index.js"]
        }
      }
    }
  }
}

VSCode with Copilot Chat

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": ["/path/to/crawl-mcp-server/dist/index.js"]
    }
  }
}

🚀 Quick Start

Using MCP Inspector (Testing)

# Install MCP Inspector globally
npm install -g @modelcontextprotocol/inspector

# Run the server
node dist/index.js

# In another terminal, test tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list

Development Mode

# Watch mode (auto-rebuild on changes)
npm run dev

# Build TypeScript
npm run build

# Run tests
npm run test:run

📚 Available Tools

Search Tools (7 tools)

1. search_searx

Search the web using SearXNG with automatic browser management.

// Example call
{
  "query": "TypeScript MCP server",
  "maxResults": 10,
  "category": "general",
  "timeRange": "week",
  "language": "en"
}

Parameters:

  • query (string, required): Search query

  • maxResults (number, default: 20): Results to return (1-50)

  • category (enum, default: general): one of general, images, videos, news, map, music, it, science

  • timeRange (enum, optional): one of day, week, month, year

  • language (string, default: en): Language code

Returns: JSON with search results array, URLs, and metadata


2. launch_chrome_cdp

Launch system Chrome with remote debugging for advanced SearX usage.

{ "headless": true, "port": 9222, "userDataDir": "/path/to/profile" }

Parameters:

  • headless (boolean, default: true): Run Chrome headless

  • port (number, default: 9222): Remote debugging port

  • userDataDir (string, optional): Custom Chrome profile


3. connect_cdp

Connect to remote CDP browser (Browserbase, etc.).

{ "cdpWsUrl": "http://localhost:9222" }

Parameters:

  • cdpWsUrl (string, required): CDP WebSocket URL or HTTP endpoint


4. launch_local

Launch bundled Chromium for SearX search.

{ "headless": true, "userAgent": "custom user agent string" }

Parameters:

  • headless (boolean, default: true): Run headless

  • userAgent (string, optional): Custom user agent


5. chrome_status

Check Chrome CDP status and health.

{}

Returns: Running status, health, endpoint URL, and PID


6. close

Close browser session (keeps Chrome CDP running).

{}

7. shutdown_chrome_cdp

Shutdown Chrome CDP and cleanup resources.

{}

Crawling Tools (4 tools)

1. crawl_read ⭐ (Simple & Fast)

Quick single-page extraction to Markdown.

{ "url": "https://example.com/article", "options": { "timeout": 30000 } }

Best for:

  • ✅ News articles

  • ✅ Blog posts

  • ✅ Documentation pages

  • ✅ Simple content extraction

Returns: Clean Markdown content


2. crawl_read_batch ⭐ (Multiple URLs)

Process 1-50 URLs concurrently.

{
  "urls": [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3"
  ],
  "options": {
    "maxConcurrency": 5,
    "timeout": 30000,
    "maxResults": 10
  }
}

Best for:

  • ✅ Processing multiple articles

  • ✅ Building content aggregates

  • ✅ Bulk content extraction

Returns: Array of Markdown results with summary statistics


3. crawl_fetch_markdown

Controlled single-page extraction with full option control.

{ "url": "https://example.com/article", "options": { "timeout": 30000 } }

Best for:

  • ✅ Advanced crawling options

  • ✅ Custom timeout control

  • ✅ Detailed extraction


4. crawl_fetch

Multi-page crawling with intelligent link extraction.

{
  "url": "https://example.com",
  "options": {
    "pages": 5,
    "maxConcurrency": 3,
    "sameOriginOnly": true,
    "timeout": 30000,
    "maxResults": 20
  }
}

Best for:

  • ✅ Crawling entire sites

  • ✅ Link-based discovery

  • ✅ Multi-page scraping

Features:

  • Extracts links from starting page

  • Crawls discovered pages

  • Concurrent processing

  • Same-origin filtering (configurable)

💡 Usage Examples

Example 1: Search + Crawl Workflow

// Step 1: Search for topics
{
  "tool": "search_searx",
  "arguments": {
    "query": "TypeScript best practices 2024",
    "maxResults": 5
  }
}

// Step 2: Extract URLs from results
// (Parse the search results to get URLs)

// Step 3: Crawl selected articles
{
  "tool": "crawl_read_batch",
  "arguments": {
    "urls": [
      "https://example.com/article1",
      "https://example.com/article2",
      "https://example.com/article3"
    ]
  }
}
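
The same workflow can be scripted from any MCP client. Below is a minimal sketch using the MCP TypeScript SDK over STDIO; the script name, the result-parsing step, and the exact shape of the search results are assumptions for illustration, and the import paths may differ between SDK versions.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "research-workflow", version: "1.0.0" });
await client.connect(
  new StdioClientTransport({ command: "node", args: ["dist/index.js"] })
);

// Step 1: search the web
const search = await client.callTool({
  name: "search_searx",
  arguments: { query: "TypeScript best practices 2024", maxResults: 5 },
});

// Step 2: extract URLs from the returned JSON
// (result shape assumed here; adjust to what search_searx actually returns)
const text = (search.content as Array<{ type: string; text: string }>)[0].text;
const urls: string[] = JSON.parse(text).results.map((r: { url: string }) => r.url);

// Step 3: crawl the selected articles concurrently
const batch = await client.callTool({
  name: "crawl_read_batch",
  arguments: { urls, options: { maxConcurrency: 5, timeout: 30000 } },
});
console.log(batch);

Because the search tools manage the browser lifecycle automatically, no launch tool needs to be called before search_searx.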

Example 2: Batch Content Extraction

{
  "tool": "crawl_read_batch",
  "arguments": {
    "urls": [
      "https://news.site/article1",
      "https://news.site/article2",
      "https://news.site/article3"
    ],
    "options": {
      "maxConcurrency": 10,
      "timeout": 30000,
      "maxResults": 3
    }
  }
}

Example 3: Site Crawling

{
  "tool": "crawl_fetch",
  "arguments": {
    "url": "https://docs.example.com",
    "options": {
      "pages": 10,
      "maxConcurrency": 5,
      "sameOriginOnly": true,
      "timeout": 30000,
      "maxResults": 10
    }
  }
}

🎯 Tool Selection Guide

Use Case             Recommended Tool            Complexity
Single article       crawl_read                  Simple
Multiple articles    crawl_read_batch            Simple
Advanced options     crawl_fetch_markdown        Medium
Site crawling        crawl_fetch                 Complex
Web search           search_searx                Simple
Research workflow    search_searx → crawl_read   Medium

๐Ÿ—๏ธ Architecture

Core Components

crawl-mcp-server
│
├─ MCP Server Core
│    - 11 registered tools
│    - STDIO/HTTP transport
│
├─ @just-every/crawl
│    - HTML → Markdown
│    - Mozilla Readability
│    - Concurrent crawling
│
└─ Playwright (Browser)
     - SearXNG integration
     - Auto browser management
     - Anti-detection
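
For orientation, the "MCP Server Core" layer boils down to registering each of the 11 tools with the SDK and binding a transport. The sketch below is illustrative rather than the project's actual source: the tool name crawl_read is real, but the schema and handler body are simplified placeholders, and the tool() signature and import paths follow recent @modelcontextprotocol/sdk releases and may differ in other versions.

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "crawl-mcp-server", version: "1.0.0" });

// Placeholder handler: the real implementation delegates to @just-every/crawl
// and converts the fetched HTML to Markdown before responding.
server.tool(
  "crawl_read",
  {
    url: z.string().url(),
    options: z.object({ timeout: z.number().int().positive() }).optional(),
  },
  async ({ url }) => ({
    content: [{ type: "text" as const, text: `# Markdown extracted from ${url}` }],
  })
);

await server.connect(new StdioServerTransport());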

Technology Stack

  • Runtime: Node.js 18+

  • Language: TypeScript 5.7

  • Framework: MCP SDK (@modelcontextprotocol/sdk)

  • Crawling: @just-every/crawl

  • Browser: Playwright Core

  • Validation: Zod

  • Transport: STDIO (local) + HTTP (remote)

Data Flow

Client Request
    ↓
MCP Protocol
    ↓
Tool Handler
    ↓
Crawl/Search
  @just-every/crawl → HTML content
  or SearXNG        → Search results
    ↓
HTML → Markdown
    ↓
Result Formatting
    ↓
MCP Response
    ↓
Client

🧪 Testing

Run Test Suite

# All unit tests
npm run test:run

# Performance benchmarks
npm run test:performance

# Full CI suite
npm run test:ci

# Individual tool test
npx @modelcontextprotocol/inspector --cli node dist/index.js \
  --method tools/call \
  --tool-name crawl_read \
  --tool-arg url="https://example.com"

Test Coverage

  • ✅ All 11 tools tested

  • ✅ Error handling validated

  • ✅ Performance benchmarks

  • ✅ Integration workflows

  • ✅ Multi-Node support (Node 18, 20, 22)

CI/CD Pipeline

GitHub Actions
  1. Test (Matrix: Node 18, 20, 22)
  2. Integration Tests (PR only)
  3. Performance Tests (main)
  4. Security Scan
  5. Coverage Report

🔧 Development

Prerequisites

  • Node.js 18 or higher

  • npm or yarn

Setup

# Clone the repository
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run in development mode (watch)
npm run dev

Development Commands

# Build project
npm run build

# Watch mode (auto-rebuild)
npm run dev

# Run tests
npm run test:run

# Lint code
npm run lint

# Type check
npm run typecheck

# Clean build artifacts
npm run clean

Project Structure

crawl-mcp-server/
├── src/
│   ├── index.ts          # Main server (11 tools)
│   ├── types.ts          # TypeScript interfaces
│   └── cdp.ts            # Chrome CDP manager
├── test/
│   ├── run-tests.ts      # Unit test suite
│   ├── performance.ts    # Performance tests
│   └── config.ts         # Test configuration
├── dist/                 # Compiled JavaScript
├── .github/workflows/    # CI/CD pipeline
└── package.json

📊 Performance

Benchmarks

Operation                    Avg Duration    Max Memory
crawl_read                   ~1500ms         32MB
crawl_read_batch (2 URLs)    ~2500ms         64MB
search_searx                 ~4000ms         128MB
crawl_fetch                  ~2000ms         48MB
tools/list                   ~100ms          8MB

Performance Features

  • ✅ Concurrent request processing (up to 20)

  • ✅ Built-in caching (SHA-256; see the sketch below)

  • ✅ Automatic timeout management

  • ✅ Memory optimization

  • ✅ Resource cleanup
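
The SHA-256 caching mentioned above works by keying results on a content hash. The snippet below is only a hedged illustration of what such a key derivation can look like; it is not the server's actual cache code, and the cacheKey helper is hypothetical.

import { createHash } from "node:crypto";

// Hypothetical helper: derive a stable cache key from the URL plus the options
// that influence the output, so repeated identical requests can be served from cache.
function cacheKey(url: string, options: Record<string, unknown> = {}): string {
  const canonical = JSON.stringify({ url, options });
  return createHash("sha256").update(canonical).digest("hex");
}

// Prints a 64-character hex digest for this URL/options combination.
console.log(cacheKey("https://example.com/article", { timeout: 30000 }));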

🛡️ Error Handling

All tools include comprehensive error handling:

  • Network errors: Graceful degradation with error messages

  • Timeout handling: Configurable timeouts

  • Partial failures: Batch operations continue on individual failures

  • Structured errors: Clear error codes and messages

  • Recovery: Automatic retries where appropriate

Example error response:

{
  "content": [
    {
      "type": "text",
      "text": "Error: Failed to fetch https://example.com: Timeout after 30000ms"
    }
  ],
  "structuredContent": {
    "error": "Network timeout",
    "url": "https://example.com",
    "code": "TIMEOUT"
  }
}
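
On the client side this means a failed call still resolves to a well-formed result rather than throwing, so handling reduces to inspecting the response. A minimal sketch, assuming the MCP TypeScript SDK client and the structuredContent fields from the example above (the isError flag comes from the MCP tool-result format):

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "error-demo", version: "1.0.0" });
await client.connect(
  new StdioClientTransport({ command: "node", args: ["dist/index.js"] })
);

const result = await client.callTool({
  name: "crawl_read",
  arguments: { url: "https://example.com", options: { timeout: 30000 } },
});

if (result.isError) {
  // Field names mirror the example error response above.
  const err = result.structuredContent as { error: string; url: string; code: string };
  console.error(`Crawl failed (${err.code}) for ${err.url}: ${err.error}`);
} else {
  console.log((result.content as Array<{ type: string; text: string }>)[0].text);
}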

🔐 Security

  • No API keys required for basic crawling

  • Respect robots.txt (configurable)

  • User agent rotation

  • Rate limiting (built-in via concurrency limits)

  • Input validation (Zod schemas)

  • Dependency scanning (npm audit, Snyk)

🌐 Transport Modes

STDIO (Default)

For local MCP clients:

node dist/index.js

HTTP

For remote access:

TRANSPORT=http PORT=3000 node dist/index.js

Server runs on: http://localhost:3000/mcp
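
A remote client then connects to that endpoint rather than spawning the process itself. A minimal sketch with the MCP TypeScript SDK's streamable HTTP client transport; the class name and import path match recent SDK releases and may need adjusting for other versions:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

const client = new Client({ name: "remote-crawler", version: "1.0.0" });
await client.connect(
  new StreamableHTTPClientTransport(new URL("http://localhost:3000/mcp"))
);

// Should list all 11 tools exposed by the server.
console.log(await client.listTools());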

📝 Configuration

Environment Variables

# Transport mode (stdio or http)
TRANSPORT=stdio

# HTTP port (when TRANSPORT=http)
PORT=3000

# Node environment
NODE_ENV=production

Tool Configuration

Each tool accepts an options object:

{
  "timeout": 30000,        // Request timeout (ms)
  "maxConcurrency": 5,     // Concurrent requests (1-20)
  "maxResults": 10,        // Limit results (1-50)
  "respectRobots": false,  // Respect robots.txt
  "sameOriginOnly": true   // Only same-origin URLs
}
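
Since inputs are validated with Zod (see Technology Stack), this options object maps naturally onto a schema. The schema below is illustrative only, reconstructed from the documented names, defaults, and ranges rather than copied from the server's source:

import { z } from "zod";

// Illustrative reconstruction of the documented option names, defaults, and ranges.
const crawlOptionsSchema = z.object({
  timeout: z.number().int().positive().default(30000),         // request timeout (ms)
  maxConcurrency: z.number().int().min(1).max(20).default(5),  // concurrent requests (1-20)
  maxResults: z.number().int().min(1).max(50).default(10),     // limit results (1-50)
  respectRobots: z.boolean().default(false),                   // respect robots.txt
  sameOriginOnly: z.boolean().default(true),                   // only same-origin URLs
});

type CrawlOptions = z.infer<typeof crawlOptionsSchema>;
// Missing fields are filled with their defaults.
const parsed: CrawlOptions = crawlOptionsSchema.parse({ timeout: 30000 });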

🤝 Contributing

  1. Fork the repository

  2. Create a feature branch: git checkout -b feature/amazing-feature

  3. Make changes and add tests

  4. Run tests: npm run test:ci

  5. Commit: git commit -m 'Add amazing feature'

  6. Push: git push origin feature/amazing-feature

  7. Open a Pull Request

Development Guidelines

  • Follow TypeScript strict mode

  • Add tests for new features

  • Update documentation

  • Run linting: npm run lint

  • Ensure CI passes

📄 License

MIT License - see LICENSE file

🙏 Acknowledgments

📞 Support

🚀 What's Next?

  • Add DuckDuckGo search support

  • Implement content filtering

  • Add screenshot capabilities

  • Support for authenticated content

  • PDF extraction

  • Real-time monitoring


Built with ❤️ using TypeScript, MCP, and modern web technologies.
