# crawl-mcp-server
A comprehensive MCP (Model Context Protocol) server providing **11 powerful tools** for web crawling and search. Transform web content into clean, LLM-optimized Markdown or search the web with SearXNG integration.
[CI](https://github.com/Git-Fg/searchcrawl-mcp-server/actions) · [Coverage](https://codecov.io/gh/Git-Fg/searchcrawl-mcp-server)
## Features
- **SearXNG Web Search** - Search the web with automatic browser management
- **4 Crawling Tools** - Extract and convert web content to Markdown
- **Auto-Browser-Launch** - Search tools automatically manage the browser lifecycle
- **11 Total Tools** - Complete toolkit for web interaction
- **Built-in Caching** - SHA-256-based caching with graceful fallbacks
- **Concurrent Processing** - Handle up to 50 URLs per batch
- **LLM-Optimized Output** - Clean Markdown suited for AI consumption
- **Robust Error Handling** - Graceful failure with detailed error messages
- **Comprehensive Testing** - Full CI/CD with performance benchmarks
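The SHA-256 caching mentioned in the features can be sketched as a deterministic key derivation over the URL plus options. The key format below is an assumption for illustration, not the server's actual scheme:

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache-key derivation: identical (url, options) inputs always
// map to the same 64-character hex digest, so repeat fetches can hit the cache.
function cacheKey(url: string, options: Record<string, unknown> = {}): string {
  const payload = JSON.stringify({ url, options });
  return createHash("sha256").update(payload).digest("hex");
}

console.log(cacheKey("https://example.com") === cacheKey("https://example.com")); // true
```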
## Installation
### Method 1: npm (Recommended)
```bash
npm install crawl-mcp-server
```
### Method 2: Direct from Git
```bash
# Install latest from GitHub
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
# Or specific branch
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git#main
# Or from a fork
npm install git+https://github.com/YOUR_FORK/searchcrawl-mcp-server.git
```
### Method 3: Clone and Build
```bash
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
npm install
npm run build
```
### Method 4: npx (No Installation)
```bash
# Run directly without installing
npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
```
## Setup for Claude Code
### Option 1: Claude Desktop (Recommended)
Add to your Claude Desktop configuration file:
**macOS/Linux:** `~/.config/claude/claude_desktop_config.json`
```json
{
"mcpServers": {
"crawl-server": {
"command": "npx",
"args": [
"git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
],
"env": {
"NODE_ENV": "production"
}
}
}
}
```
**Windows:** `%APPDATA%\Claude\claude_desktop_config.json`
```json
{
"mcpServers": {
"crawl-server": {
"command": "npx",
"args": [
"git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
],
"env": {
"NODE_ENV": "production"
}
}
}
}
```
### Option 2: Local Installation
If you've installed locally:
```json
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": [
"/path/to/crawl-mcp-server/dist/index.js"
],
"env": {}
}
}
}
```
### Option 3: Custom Path
For a specific installation:
```json
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": [
"/usr/local/lib/node_modules/crawl-mcp-server/dist/index.js"
],
"env": {}
}
}
}
```
After configuration, restart Claude Desktop.
## Setup for Other MCP Clients
### Claude CLI
```bash
# Using npx
claude mcp add crawl-server npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
# Using local installation
claude mcp add crawl-server node /path/to/crawl-mcp-server/dist/index.js
```
### Zed Editor
Add to `~/.config/zed/settings.json`:
```json
{
"assistant": {
"mcp": {
"servers": {
"crawl-server": {
"command": "node",
"args": ["/path/to/crawl-mcp-server/dist/index.js"]
}
}
}
}
}
```
### VSCode with Copilot Chat
```json
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": ["/path/to/crawl-mcp-server/dist/index.js"]
}
}
}
```
## Quick Start
### Using MCP Inspector (Testing)
```bash
# Install MCP Inspector globally
npm install -g @modelcontextprotocol/inspector
# Run the server
node dist/index.js
# In another terminal, test tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list
```
### Development Mode
```bash
# Watch mode (auto-rebuild on changes)
npm run dev
# Build TypeScript
npm run build
# Run tests
npm run test:run
```
## Available Tools
### Search Tools (7 tools)
#### 1. search_searx
Search the web using SearXNG with automatic browser management.
```typescript
// Example call
{
"query": "TypeScript MCP server",
"maxResults": 10,
"category": "general",
"timeRange": "week",
"language": "en"
}
```
**Parameters:**
- `query` (string, required): Search query
- `maxResults` (number, default: 20): Results to return (1-50)
- `category` (enum, default: general): one of general, images, videos, news, map, music, it, science
- `timeRange` (enum, optional): one of day, week, month, year
- `language` (string, default: en): Language code
**Returns:** JSON with search results array, URLs, and metadata
---
#### 2. launch_chrome_cdp
Launch system Chrome with remote debugging for advanced SearX usage.
```typescript
{
"headless": true,
"port": 9222,
"userDataDir": "/path/to/profile"
}
```
**Parameters:**
- `headless` (boolean, default: true): Run Chrome headless
- `port` (number, default: 9222): Remote debugging port
- `userDataDir` (string, optional): Custom Chrome profile
---
#### 3. connect_cdp
Connect to remote CDP browser (Browserbase, etc.).
```typescript
{
"cdpWsUrl": "http://localhost:9222"
}
```
**Parameters:**
- `cdpWsUrl` (string, required): CDP WebSocket URL or HTTP endpoint
---
#### 4. launch_local
Launch bundled Chromium for SearX search.
```typescript
{
"headless": true,
"userAgent": "custom user agent string"
}
```
**Parameters:**
- `headless` (boolean, default: true): Run headless
- `userAgent` (string, optional): Custom user agent
---
#### 5. chrome_status
Check Chrome CDP status and health.
```typescript
{}
```
**Returns:** Running status, health, endpoint URL, and PID
---
#### 6. close
Close browser session (keeps Chrome CDP running).
```typescript
{}
```
---
#### 7. shutdown_chrome_cdp
Shutdown Chrome CDP and cleanup resources.
```typescript
{}
```
### Crawling Tools (4 tools)
#### 1. crawl_read (Simple & Fast)
Quick single-page extraction to Markdown.
```typescript
{
"url": "https://example.com/article",
"options": {
"timeout": 30000
}
}
```
**Best for:**
- News articles
- Blog posts
- Documentation pages
- Simple content extraction
**Returns:** Clean Markdown content
---
#### 2. crawl_read_batch (Multiple URLs)
Process 1-50 URLs concurrently.
```typescript
{
"urls": [
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3"
],
"options": {
"maxConcurrency": 5,
"timeout": 30000,
"maxResults": 10
}
}
```
**Best for:**
- Processing multiple articles
- Building content aggregates
- Bulk content extraction
**Returns:** Array of Markdown results with summary statistics
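Batch processing with a `maxConcurrency` cap can be sketched as a small promise pool. This is illustrative client-side logic, not the server's internal implementation:

```typescript
// Run an async function over items with at most `limit` in flight at once,
// preserving result order (mirrors the maxConcurrency option, capped at 20).
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index
      results[i] = await fn(items[i]);
    }
  }
  const workers = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workers }, () => worker()));
  return results;
}

mapWithLimit([1, 2, 3, 4], 2, async (n) => n * 10).then((r) => console.log(r)); // [10, 20, 30, 40]
```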
---
#### 3. crawl_fetch_markdown
Controlled single-page extraction with full option control.
```typescript
{
"url": "https://example.com/article",
"options": {
"timeout": 30000
}
}
```
**Best for:**
- Advanced crawling options
- Custom timeout control
- Detailed extraction
---
#### 4. crawl_fetch
Multi-page crawling with intelligent link extraction.
```typescript
{
"url": "https://example.com",
"options": {
"pages": 5,
"maxConcurrency": 3,
"sameOriginOnly": true,
"timeout": 30000,
"maxResults": 20
}
}
```
**Best for:**
- Crawling entire sites
- Link-based discovery
- Multi-page scraping
**Features:**
- Extracts links from starting page
- Crawls discovered pages
- Concurrent processing
- Same-origin filtering (configurable)
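The same-origin filtering described above can be sketched with the standard `URL` API; the server's actual filter may differ, but the idea is comparing origins after resolving each discovered link against the starting page:

```typescript
// Decide whether a discovered link stays within the starting page's origin.
// Relative links resolve against the base URL and therefore pass.
function sameOrigin(base: string, candidate: string): boolean {
  try {
    return new URL(candidate, base).origin === new URL(base).origin;
  } catch {
    return false; // unparseable base or link: filter it out
  }
}

console.log(sameOrigin("https://example.com/docs", "/about")); // true
console.log(sameOrigin("https://example.com", "https://other.com/page")); // false
```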
## Usage Examples
### Example 1: Search + Crawl Workflow
```typescript
// Step 1: Search for topics
{
"tool": "search_searx",
"arguments": {
"query": "TypeScript best practices 2024",
"maxResults": 5
}
}
// Step 2: Extract URLs from results
// (Parse the search results to get URLs)
// Step 3: Crawl selected articles
{
"tool": "crawl_read_batch",
"arguments": {
"urls": [
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3"
]
}
}
```
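Step 2 of this workflow, extracting URLs from the search results, can be sketched as below. The result field names (`results`, `url`) are assumptions about the search response shape, not the server's documented schema:

```typescript
// Hypothetical shape of a search_searx response (field names assumed).
interface SearchResult {
  title: string;
  url: string;
  snippet?: string;
}
interface SearchResponse {
  query: string;
  results: SearchResult[];
}

// Pull out up to `limit` URLs to feed into crawl_read_batch.
function extractUrls(response: SearchResponse, limit = 10): string[] {
  return response.results.slice(0, limit).map((r) => r.url);
}

const sample: SearchResponse = {
  query: "TypeScript MCP server",
  results: [
    { title: "A", url: "https://example.com/article1" },
    { title: "B", url: "https://example.com/article2" },
  ],
};
console.log(extractUrls(sample, 1)); // ["https://example.com/article1"]
```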
### Example 2: Batch Content Extraction
```typescript
{
"tool": "crawl_read_batch",
"arguments": {
"urls": [
"https://news.site/article1",
"https://news.site/article2",
"https://news.site/article3"
],
"options": {
"maxConcurrency": 10,
"timeout": 30000,
"maxResults": 3
}
}
}
```
### Example 3: Site Crawling
```typescript
{
"tool": "crawl_fetch",
"arguments": {
"url": "https://docs.example.com",
"options": {
"pages": 10,
"maxConcurrency": 5,
"sameOriginOnly": true,
"timeout": 30000,
"maxResults": 10
}
}
}
```
## Tool Selection Guide
| Use Case | Recommended Tool | Complexity |
|----------|----------------|------------|
| Single article | `crawl_read` | Simple |
| Multiple articles | `crawl_read_batch` | Simple |
| Advanced options | `crawl_fetch_markdown` | Medium |
| Site crawling | `crawl_fetch` | Complex |
| Web search | `search_searx` | Simple |
| Research workflow | `search_searx` → `crawl_read` | Medium |
## Architecture
### Core Components
```
┌─────────────────────────────────────────┐
│            crawl-mcp-server             │
├─────────────────────────────────────────┤
│                                         │
│  ┌───────────────────────────────────┐  │
│  │  MCP Server Core                  │  │
│  │  - 11 registered tools            │  │
│  │  - STDIO/HTTP transport           │  │
│  └───────────────────────────────────┘  │
│                   │                     │
│  ┌───────────────────────────────────┐  │
│  │  @just-every/crawl                │  │
│  │  - HTML → Markdown                │  │
│  │  - Mozilla Readability            │  │
│  │  - Concurrent crawling            │  │
│  └───────────────────────────────────┘  │
│                   │                     │
│  ┌───────────────────────────────────┐  │
│  │  Playwright (Browser)             │  │
│  │  - SearXNG integration            │  │
│  │  - Auto browser management        │  │
│  │  - Anti-detection                 │  │
│  └───────────────────────────────────┘  │
│                                         │
└─────────────────────────────────────────┘
```
### Technology Stack
- **Runtime**: Node.js 18+
- **Language**: TypeScript 5.7
- **Framework**: MCP SDK (@modelcontextprotocol/sdk)
- **Crawling**: @just-every/crawl
- **Browser**: Playwright Core
- **Validation**: Zod
- **Transport**: STDIO (local) + HTTP (remote)
### Data Flow
```
Client Request
      ↓
 MCP Protocol
      ↓
 Tool Handler
      ↓
┌─────────────────────┐
│  Crawl/Search       │
│  @just-every/crawl  │ → HTML content
│  or SearXNG         │ → Search results
└─────────────────────┘
      ↓
HTML → Markdown
      ↓
Result Formatting
      ↓
 MCP Response
      ↓
   Client
```
## Testing
### Run Test Suite
```bash
# All unit tests
npm run test:run
# Performance benchmarks
npm run test:performance
# Full CI suite
npm run test:ci
# Individual tool test
npx @modelcontextprotocol/inspector --cli node dist/index.js \
--method tools/call \
--tool-name crawl_read \
--tool-arg url="https://example.com"
```
### Test Coverage
- All 11 tools tested
- Error handling validated
- Performance benchmarks
- Integration workflows
- Multi-Node support (Node 18, 20, 22)
### CI/CD Pipeline
```
┌────────────────────────────────────┐
│           GitHub Actions           │
├────────────────────────────────────┤
│ 1. Test (Matrix: Node 18, 20, 22)  │
│ 2. Integration Tests (PR only)     │
│ 3. Performance Tests (main)        │
│ 4. Security Scan                   │
│ 5. Coverage Report                 │
└────────────────────────────────────┘
```
## Development
### Prerequisites
- Node.js 18 or higher
- npm or yarn
### Setup
```bash
# Clone the repository
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
# Install dependencies
npm install
# Build TypeScript
npm run build
# Run in development mode (watch)
npm run dev
```
### Development Commands
```bash
# Build project
npm run build
# Watch mode (auto-rebuild)
npm run dev
# Run tests
npm run test:run
# Lint code
npm run lint
# Type check
npm run typecheck
# Clean build artifacts
npm run clean
```
### Project Structure
```
crawl-mcp-server/
├── src/
│   ├── index.ts          # Main server (11 tools)
│   ├── types.ts          # TypeScript interfaces
│   └── cdp.ts            # Chrome CDP manager
├── test/
│   ├── run-tests.ts      # Unit test suite
│   ├── performance.ts    # Performance tests
│   └── config.ts         # Test configuration
├── dist/                 # Compiled JavaScript
├── .github/workflows/    # CI/CD pipeline
└── package.json
```
## Performance
### Benchmarks
| Operation | Avg Duration | Max Memory |
|-----------|-------------|------------|
| crawl_read | ~1500ms | 32MB |
| crawl_read_batch (2 URLs) | ~2500ms | 64MB |
| search_searx | ~4000ms | 128MB |
| crawl_fetch | ~2000ms | 48MB |
| tools/list | ~100ms | 8MB |
### Performance Features
- Concurrent request processing (up to 20 requests)
- Built-in caching (SHA-256)
- Automatic timeout management
- Memory optimization
- Resource cleanup
## Error Handling
All tools include comprehensive error handling:
- **Network errors**: Graceful degradation with error messages
- **Timeout handling**: Configurable timeouts
- **Partial failures**: Batch operations continue on individual failures
- **Structured errors**: Clear error codes and messages
- **Recovery**: Automatic retries where appropriate
Example error response:
```json
{
"content": [
{
"type": "text",
"text": "Error: Failed to fetch https://example.com: Timeout after 30000ms"
}
],
"structuredContent": {
"error": "Network timeout",
"url": "https://example.com",
"code": "TIMEOUT"
}
}
```
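Since batch operations continue past individual failures, a client typically partitions the results afterwards. The per-item shape below (`markdown`/`error` fields) is an assumption for illustration:

```typescript
// Hypothetical per-URL result from a batch crawl: either markdown or an error.
interface BatchItem {
  url: string;
  markdown?: string;
  error?: string;
}

// Separate successful extractions from failed ones so partial results
// remain usable even when some URLs time out.
function splitResults(items: BatchItem[]): { ok: BatchItem[]; failed: BatchItem[] } {
  return {
    ok: items.filter((i) => i.error === undefined),
    failed: items.filter((i) => i.error !== undefined),
  };
}

const { ok, failed } = splitResults([
  { url: "https://a.example", markdown: "# A" },
  { url: "https://b.example", error: "TIMEOUT" },
]);
console.log(ok.length, failed.length); // 1 1
```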
## Security
- **No API keys required** for basic crawling
- **Respect robots.txt** (configurable)
- **User agent rotation**
- **Rate limiting** (built-in via concurrency limits)
- **Input validation** (Zod schemas)
- **Dependency scanning** (npm audit, Snyk)
## Transport Modes
### STDIO (Default)
For local MCP clients:
```bash
node dist/index.js
```
### HTTP
For remote access:
```bash
TRANSPORT=http PORT=3000 node dist/index.js
```
Server runs on: `http://localhost:3000/mcp`
## Configuration
### Environment Variables
```bash
# Transport mode (stdio or http)
TRANSPORT=stdio
# HTTP port (when TRANSPORT=http)
PORT=3000
# Node environment
NODE_ENV=production
```
### Tool Configuration
Each tool accepts an `options` object:
```typescript
{
"timeout": 30000, // Request timeout (ms)
"maxConcurrency": 5, // Concurrent requests (1-20)
"maxResults": 10, // Limit results (1-50)
"respectRobots": false, // Respect robots.txt
"sameOriginOnly": true // Only same-origin URLs
}
```
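The documented ranges and defaults can be sketched as a normalization step. The server itself validates with Zod; the plain-TypeScript version below just mirrors the README's constraints for illustration:

```typescript
// Options as documented above; all fields optional on input.
interface CrawlOptions {
  timeout?: number;
  maxConcurrency?: number;
  maxResults?: number;
  respectRobots?: boolean;
  sameOriginOnly?: boolean;
}

// Apply the documented defaults and clamp values into their allowed ranges.
function normalizeOptions(opts: CrawlOptions = {}): Required<CrawlOptions> {
  const clamp = (v: number, lo: number, hi: number) => Math.min(hi, Math.max(lo, v));
  return {
    timeout: opts.timeout ?? 30000,
    maxConcurrency: clamp(opts.maxConcurrency ?? 5, 1, 20),
    maxResults: clamp(opts.maxResults ?? 10, 1, 50),
    respectRobots: opts.respectRobots ?? false,
    sameOriginOnly: opts.sameOriginOnly ?? true,
  };
}

console.log(normalizeOptions({ maxConcurrency: 99 }).maxConcurrency); // 20
```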
## Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make changes and add tests
4. Run tests: `npm run test:ci`
5. Commit: `git commit -m 'Add amazing feature'`
6. Push: `git push origin feature/amazing-feature`
7. Open a Pull Request
### Development Guidelines
- Follow TypeScript strict mode
- Add tests for new features
- Update documentation
- Run linting: `npm run lint`
- Ensure CI passes
## License
MIT License - see [LICENSE](LICENSE) file
## Acknowledgments
- [@just-every/crawl](https://github.com/just-every/crawl) - Web crawling
- [Model Context Protocol](https://modelcontextprotocol.io) - MCP specification
- [SearXNG](https://github.com/searxng/searxng) - Search aggregator
- [Playwright](https://playwright.dev) - Browser automation
## Support
- **Issues**: [GitHub Issues](https://github.com/Git-Fg/searchcrawl-mcp-server/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Git-Fg/searchcrawl-mcp-server/discussions)
- **Email**: your-email@example.com
## What's Next?
- [ ] Add DuckDuckGo search support
- [ ] Implement content filtering
- [ ] Add screenshot capabilities
- [ ] Support for authenticated content
- [ ] PDF extraction
- [ ] Real-time monitoring
---
**Built with TypeScript, MCP, and modern web technologies.**