crawl-mcp-server
crawl-mcp-server

A comprehensive MCP (Model Context Protocol) server providing 11 powerful tools for web crawling and search. Transform web content into clean, LLM-optimized Markdown or search the web with SearXNG integration.


✨ Features

  • 🔍 SearXNG Web Search - Search the web with automatic browser management

  • 📄 4 Crawling Tools - Extract and convert web content to Markdown

  • 🚀 Auto Browser Launch - Search tools automatically manage the browser lifecycle

  • 📦 11 Total Tools - Complete toolkit for web interaction

  • 💾 Built-in Caching - SHA-256-based caching with graceful fallbacks

  • ⚡ Concurrent Processing - Handle multiple URLs simultaneously (up to 50)

  • 🎯 LLM-Optimized Output - Clean Markdown suited to AI consumption

  • 🛡️ Robust Error Handling - Graceful failure with detailed error messages

  • 🧪 Comprehensive Testing - Full CI/CD with performance benchmarks
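The built-in caching above keys entries by SHA-256. A minimal sketch of that keying, assuming the key covers the URL plus any output-affecting options (names here are illustrative, not the server's internals):

```typescript
import { createHash } from "node:crypto";

// Hash the URL together with the options that affect output, so the same
// page crawled with the same settings maps to one stable cache key.
function cacheKey(url: string, options: Record<string, unknown> = {}): string {
  const canonical = JSON.stringify({ url, options });
  return createHash("sha256").update(canonical).digest("hex");
}
```

Because the key is a plain hex digest, it doubles as a safe filename for an on-disk cache.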

📦 Installation

Method 1: npm

npm install crawl-mcp-server

Method 2: Direct from Git

# Install latest from GitHub
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

# Or specific branch
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git#main

# Or from a fork
npm install git+https://github.com/YOUR_FORK/searchcrawl-mcp-server.git

Method 3: Clone and Build

git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
npm install
npm run build

Method 4: npx (No Installation)

# Run directly without installing
npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

🔧 Setup for Claude Desktop

Add to your Claude Desktop configuration file:

Option 1: npx (Recommended)

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "crawl-server": {
      "command": "npx",
      "args": [
        "git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
      ],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "crawl-server": {
      "command": "npx",
      "args": [
        "git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
      ],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

Option 2: Local Installation

If you've installed locally:

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": [
        "/path/to/crawl-mcp-server/dist/index.js"
      ],
      "env": {}
    }
  }
}

Option 3: Custom Path

For a specific installation:

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": [
        "/usr/local/lib/node_modules/crawl-mcp-server/dist/index.js"
      ],
      "env": {}
    }
  }
}

After configuration, restart Claude Desktop.

🔧 Setup for Other MCP Clients

Claude CLI

# Using npx
claude mcp add crawl-server npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

# Using local installation
claude mcp add crawl-server node /path/to/crawl-mcp-server/dist/index.js

Zed Editor

Add to ~/.config/zed/settings.json:

{
  "assistant": {
    "mcp": {
      "servers": {
        "crawl-server": {
          "command": "node",
          "args": ["/path/to/crawl-mcp-server/dist/index.js"]
        }
      }
    }
  }
}

VSCode with Copilot Chat

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": ["/path/to/crawl-mcp-server/dist/index.js"]
    }
  }
}

🚀 Quick Start

Using MCP Inspector (Testing)

# Install MCP Inspector globally
npm install -g @modelcontextprotocol/inspector

# Run the server
node dist/index.js

# In another terminal, test tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list

Development Mode

# Watch mode (auto-rebuild on changes)
npm run dev

# Build TypeScript
npm run build

# Run tests
npm run test:run

📚 Available Tools

Search Tools (7 tools)

1. search_searx

Search the web using SearXNG with automatic browser management.

// Example call
{
  "query": "TypeScript MCP server",
  "maxResults": 10,
  "category": "general",
  "timeRange": "week",
  "language": "en"
}

Parameters:

  • query (string, required): Search query

  • maxResults (number, default: 20): Results to return (1-50)

  • category (enum, default: general): one of general, images, videos, news, map, music, it, science

  • timeRange (enum, optional): one of day, week, month, year

  • language (string, default: en): Language code

Returns: JSON with search results array, URLs, and metadata
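The exact response schema isn't documented here, so a consumer has to validate what it parses. A hypothetical TypeScript shape plus a type guard (field names are assumptions; adapt to the server's actual output):

```typescript
// Assumed shape of a search_searx result — not the server's published schema.
interface SearchResult {
  title: string;
  url: string;
  snippet?: string;
}

interface SearchResponse {
  query: string;
  results: SearchResult[];
}

// Narrowing guard so downstream code can safely read parsed JSON.
function isSearchResponse(value: unknown): value is SearchResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.query === "string" && Array.isArray(v.results);
}
```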


2. launch_chrome_cdp

Launch system Chrome with remote debugging for advanced SearX usage.

{
  "headless": true,
  "port": 9222,
  "userDataDir": "/path/to/profile"
}

Parameters:

  • headless (boolean, default: true): Run Chrome headless

  • port (number, default: 9222): Remote debugging port

  • userDataDir (string, optional): Custom Chrome profile


3. connect_cdp

Connect to remote CDP browser (Browserbase, etc.).

{
  "cdpWsUrl": "http://localhost:9222"
}

Parameters:

  • cdpWsUrl (string, required): CDP WebSocket URL or HTTP endpoint


4. launch_local

Launch bundled Chromium for SearX search.

{
  "headless": true,
  "userAgent": "custom user agent string"
}

Parameters:

  • headless (boolean, default: true): Run headless

  • userAgent (string, optional): Custom user agent


5. chrome_status

Check Chrome CDP status and health.

{}

Returns: Running status, health, endpoint URL, and PID


6. close

Close browser session (keeps Chrome CDP running).

{}

7. shutdown_chrome_cdp

Shutdown Chrome CDP and cleanup resources.

{}

Crawling Tools (4 tools)

1. crawl_read ⭐ (Simple & Fast)

Quick single-page extraction to Markdown.

{
  "url": "https://example.com/article",
  "options": {
    "timeout": 30000
  }
}

Best for:

  • ✅ News articles

  • ✅ Blog posts

  • ✅ Documentation pages

  • ✅ Simple content extraction

Returns: Clean Markdown content


2. crawl_read_batch ⭐ (Multiple URLs)

Process 1-50 URLs concurrently.

{
  "urls": [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3"
  ],
  "options": {
    "maxConcurrency": 5,
    "timeout": 30000,
    "maxResults": 10
  }
}

Best for:

  • ✅ Processing multiple articles

  • ✅ Building content aggregates

  • ✅ Bulk content extraction

Returns: Array of Markdown results with summary statistics
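The maxConcurrency option above caps how many URLs are in flight at once. A minimal sketch of that pattern (mapWithConcurrency and worker are illustrative names; the real fetch-and-convert step is inside the server):

```typescript
// Run `worker` over `items` with at most `limit` in flight, preserving order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each "lane" pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, lane),
  );
  return results;
}
```

Because JavaScript is single-threaded, claiming `next++` before the await is race-free, and results land at their original indices regardless of completion order.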


3. crawl_fetch_markdown

Controlled single-page extraction with full option control.

{
  "url": "https://example.com/article",
  "options": {
    "timeout": 30000
  }
}

Best for:

  • ✅ Advanced crawling options

  • ✅ Custom timeout control

  • ✅ Detailed extraction


4. crawl_fetch

Multi-page crawling with intelligent link extraction.

{
  "url": "https://example.com",
  "options": {
    "pages": 5,
    "maxConcurrency": 3,
    "sameOriginOnly": true,
    "timeout": 30000,
    "maxResults": 20
  }
}

Best for:

  • ✅ Crawling entire sites

  • ✅ Link-based discovery

  • ✅ Multi-page scraping

Features:

  • Extracts links from starting page

  • Crawls discovered pages

  • Concurrent processing

  • Same-origin filtering (configurable)
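The same-origin filtering above can be sketched with the WHATWG URL API (isSameOrigin is an illustrative helper, not the server's code): a discovered link is kept only when it resolves to the starting URL's origin.

```typescript
// Keep a discovered link only if it resolves to the start URL's origin.
// Relative links resolve against the start URL, so they always pass.
function isSameOrigin(startUrl: string, candidate: string): boolean {
  try {
    return new URL(candidate, startUrl).origin === new URL(startUrl).origin;
  } catch {
    return false; // unparsable links are dropped
  }
}
```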

💡 Usage Examples

Example 1: Search + Crawl Workflow

// Step 1: Search for topics
{
  "tool": "search_searx",
  "arguments": {
    "query": "TypeScript best practices 2024",
    "maxResults": 5
  }
}

// Step 2: Extract URLs from results
// (Parse the search results to get URLs)

// Step 3: Crawl selected articles
{
  "tool": "crawl_read_batch",
  "arguments": {
    "urls": [
      "https://example.com/article1",
      "https://example.com/article2",
      "https://example.com/article3"
    ]
  }
}

Example 2: Batch Content Extraction

{
  "tool": "crawl_read_batch",
  "arguments": {
    "urls": [
      "https://news.site/article1",
      "https://news.site/article2",
      "https://news.site/article3"
    ],
    "options": {
      "maxConcurrency": 10,
      "timeout": 30000,
      "maxResults": 3
    }
  }
}

Example 3: Site Crawling

{
  "tool": "crawl_fetch",
  "arguments": {
    "url": "https://docs.example.com",
    "options": {
      "pages": 10,
      "maxConcurrency": 5,
      "sameOriginOnly": true,
      "timeout": 30000,
      "maxResults": 10
    }
  }
}

🎯 Tool Selection Guide

| Use Case          | Recommended Tool          | Complexity |
|-------------------|---------------------------|------------|
| Single article    | crawl_read                | Simple     |
| Multiple articles | crawl_read_batch          | Simple     |
| Advanced options  | crawl_fetch_markdown      | Medium     |
| Site crawling     | crawl_fetch               | Complex    |
| Web search        | search_searx              | Simple     |
| Research workflow | search_searx → crawl_read | Medium     |

๐Ÿ—๏ธ Architecture

Core Components

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚         crawl-mcp-server                โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                          โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”      โ”‚
โ”‚  โ”‚     MCP Server Core         โ”‚      โ”‚
โ”‚  โ”‚  - 11 registered tools      โ”‚      โ”‚
โ”‚  โ”‚  - STDIO/HTTP transport    โ”‚      โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜      โ”‚
โ”‚              โ”‚                           โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”      โ”‚
โ”‚  โ”‚   @just-every/crawl         โ”‚      โ”‚
โ”‚  โ”‚  - HTML โ†’ Markdown          โ”‚      โ”‚
โ”‚  โ”‚  - Mozilla Readability       โ”‚      โ”‚
โ”‚  โ”‚  - Concurrent crawling      โ”‚      โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜      โ”‚
โ”‚              โ”‚                           โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”      โ”‚
โ”‚  โ”‚   Playwright (Browser)       โ”‚      โ”‚
โ”‚  โ”‚  - SearXNG integration       โ”‚      โ”‚
โ”‚  โ”‚  - Auto browser management   โ”‚      โ”‚
โ”‚  โ”‚  - Anti-detection            โ”‚      โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜      โ”‚
โ”‚                                          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Technology Stack

  • Runtime: Node.js 18+

  • Language: TypeScript 5.7

  • Framework: MCP SDK (@modelcontextprotocol/sdk)

  • Crawling: @just-every/crawl

  • Browser: Playwright Core

  • Validation: Zod

  • Transport: STDIO (local) + HTTP (remote)

Data Flow

Client Request
    ↓
MCP Protocol
    ↓
Tool Handler
    ↓
┌────────────────────┐
│   Crawl/Search     │
│  @just-every/crawl │  →  HTML content
│   or SearXNG       │  →  Search results
└────────────────────┘
    ↓
HTML → Markdown
    ↓
Result Formatting
    ↓
MCP Response
    ↓
Client
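The HTML → Markdown step in the flow above is handled by @just-every/crawl with Mozilla Readability. Purely as a toy illustration of the transformation (not the actual implementation), a few tag rules could look like:

```typescript
// Toy HTML → Markdown converter covering a handful of tags.
// The real pipeline extracts the main article first, then converts.
function htmlToMarkdown(html: string): string {
  return html
    .replace(/<h1[^>]*>(.*?)<\/h1>/gis, "# $1\n")
    .replace(/<h2[^>]*>(.*?)<\/h2>/gis, "## $1\n")
    .replace(/<a[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gis, "[$2]($1)")
    .replace(/<(strong|b)>(.*?)<\/\1>/gis, "**$2**")
    .replace(/<p[^>]*>(.*?)<\/p>/gis, "$1\n\n")
    .replace(/<[^>]+>/g, "") // drop any remaining tags
    .trim();
}
```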

🧪 Testing

Run Test Suite

# All unit tests
npm run test:run

# Performance benchmarks
npm run test:performance

# Full CI suite
npm run test:ci

# Individual tool test
npx @modelcontextprotocol/inspector --cli node dist/index.js \
  --method tools/call \
  --tool-name crawl_read \
  --tool-arg url="https://example.com"

Test Coverage

  • ✅ All 11 tools tested

  • ✅ Error handling validated

  • ✅ Performance benchmarks

  • ✅ Integration workflows

  • ✅ Multi-Node support (Node 18, 20, 22)

CI/CD Pipeline

┌────────────────────────────────────┐
│           GitHub Actions           │
├────────────────────────────────────┤
│  1. Test (Matrix: Node 18, 20, 22) │
│  2. Integration Tests (PR only)    │
│  3. Performance Tests (main)       │
│  4. Security Scan                  │
│  5. Coverage Report                │
└────────────────────────────────────┘

🔧 Development

Prerequisites

  • Node.js 18 or higher

  • npm or yarn

Setup

# Clone the repository
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run in development mode (watch)
npm run dev

Development Commands

# Build project
npm run build

# Watch mode (auto-rebuild)
npm run dev

# Run tests
npm run test:run

# Lint code
npm run lint

# Type check
npm run typecheck

# Clean build artifacts
npm run clean

Project Structure

crawl-mcp-server/
├── src/
│   ├── index.ts           # Main server (11 tools)
│   ├── types.ts           # TypeScript interfaces
│   └── cdp.ts             # Chrome CDP manager
├── test/
│   ├── run-tests.ts       # Unit test suite
│   ├── performance.ts     # Performance tests
│   └── config.ts          # Test configuration
├── dist/                  # Compiled JavaScript
├── .github/workflows/     # CI/CD pipeline
└── package.json

📊 Performance

Benchmarks

| Operation                 | Avg Duration | Max Memory |
|---------------------------|--------------|------------|
| crawl_read                | ~1500ms      | 32MB       |
| crawl_read_batch (2 URLs) | ~2500ms      | 64MB       |
| search_searx              | ~4000ms      | 128MB      |
| crawl_fetch               | ~2000ms      | 48MB       |
| tools/list                | ~100ms       | 8MB        |

Performance Features

  • ✅ Concurrent request processing (up to 20)

  • ✅ Built-in caching (SHA-256)

  • ✅ Automatic timeout management

  • ✅ Memory optimization

  • ✅ Resource cleanup

๐Ÿ›ก๏ธ Error Handling

All tools include comprehensive error handling:

  • Network errors: Graceful degradation with error messages

  • Timeout handling: Configurable timeouts

  • Partial failures: Batch operations continue on individual failures

  • Structured errors: Clear error codes and messages

  • Recovery: Automatic retries where appropriate
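The timeout handling above can be sketched as a race between the operation and a timer (withTimeout and TimeoutError are illustrative names, not the server's internals):

```typescript
// Tagged error so callers can branch on the failure kind.
class TimeoutError extends Error {
  readonly code = "TIMEOUT";
}

// Resolve with the operation's result, or reject once `ms` elapses.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new TimeoutError(`Timeout after ${ms}ms`)),
      ms,
    );
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // always release the timer, win or lose
  }
}
```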

Example error response:

{
  "content": [
    {
      "type": "text",
      "text": "Error: Failed to fetch https://example.com: Timeout after 30000ms"
    }
  ],
  "structuredContent": {
    "error": "Network timeout",
    "url": "https://example.com",
    "code": "TIMEOUT"
  }
}

๐Ÿ” Security

  • No API keys required for basic crawling

  • Respect robots.txt (configurable)

  • User agent rotation

  • Rate limiting (built-in via concurrency limits)

  • Input validation (Zod schemas)

  • Dependency scanning (npm audit, Snyk)

๐ŸŒ Transport Modes

STDIO (Default)

For local MCP clients:

node dist/index.js

HTTP

For remote access:

TRANSPORT=http PORT=3000 node dist/index.js

Server runs on: http://localhost:3000/mcp

๐Ÿ“ Configuration

Environment Variables

# Transport mode (stdio or http)
TRANSPORT=stdio

# HTTP port (when TRANSPORT=http)
PORT=3000

# Node environment
NODE_ENV=production

Tool Configuration

Each tool accepts an options object:

{
  "timeout": 30000,          // Request timeout (ms)
  "maxConcurrency": 5,       // Concurrent requests (1-20)
  "maxResults": 10,          // Limit results (1-50)
  "respectRobots": false,    // Respect robots.txt
  "sameOriginOnly": true     // Only same-origin URLs
}
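A sketch of how the documented bounds might be enforced, using the defaults listed above (the server itself validates with Zod; normalizeOptions and clamp are illustrative stand-ins):

```typescript
interface CrawlOptions {
  timeout: number;
  maxConcurrency: number;
  maxResults: number;
  respectRobots: boolean;
  sameOriginOnly: boolean;
}

// Keep a number inside [lo, hi].
const clamp = (n: number, lo: number, hi: number) =>
  Math.min(hi, Math.max(lo, n));

// Fill in documented defaults and clamp values to their documented ranges.
function normalizeOptions(input: Partial<CrawlOptions> = {}): CrawlOptions {
  return {
    timeout: input.timeout ?? 30000,
    maxConcurrency: clamp(input.maxConcurrency ?? 5, 1, 20),
    maxResults: clamp(input.maxResults ?? 10, 1, 50),
    respectRobots: input.respectRobots ?? false,
    sameOriginOnly: input.sameOriginOnly ?? true,
  };
}
```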

๐Ÿค Contributing

  1. Fork the repository

  2. Create a feature branch: git checkout -b feature/amazing-feature

  3. Make changes and add tests

  4. Run tests: npm run test:ci

  5. Commit: git commit -m 'Add amazing feature'

  6. Push: git push origin feature/amazing-feature

  7. Open a Pull Request

Development Guidelines

  • Follow TypeScript strict mode

  • Add tests for new features

  • Update documentation

  • Run linting: npm run lint

  • Ensure CI passes

📄 License

MIT License - see LICENSE file

๐Ÿ™ Acknowledgments

๐Ÿ“ž Support

🚀 What's Next?

  • Add DuckDuckGo search support

  • Implement content filtering

  • Add screenshot capabilities

  • Support for authenticated content

  • PDF extraction

  • Real-time monitoring


Built with ❤️ using TypeScript, MCP, and modern web technologies.
