Open Crawler MCP Server

by elchika-inc

A Model Context Protocol (MCP) server that crawls web pages and extracts their content in multiple output formats.

Features

  • Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON

  • Smart Content Extraction: CSS selector support for targeted content extraction

  • Robots.txt Compliance: Checks robots.txt automatically before crawling a page

  • Rate Limiting: Built-in rate limiting (1 second minimum between requests)

  • Size Protection: Maximum page size limit (10MB) to prevent memory issues

  • Structured Content: Extract headings, paragraphs, links, images, and lists separately

  • Error Handling: Comprehensive error codes for different failure scenarios
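
The rate-limiting feature above (at least 1 second between requests) can be illustrated with a small sketch. This is not the server's actual implementation, which the README does not show; the class name `MinIntervalLimiter` and its injectable `clock` and `sleep` parameters are assumptions made for illustration and testability.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval has elapsed since the last call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is considered to have started.
        self._last = now + delay
        return delay
```

A caller would invoke `limiter.wait()` immediately before each outgoing request, so that bursts of tool calls are spaced out rather than rejected.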

MCP Client Configuration

Add this server to your MCP client configuration:

{
  "mcpServers": {
    "open-crawler": {
      "command": "npx",
      "args": ["@elchika-inc/open-crawler-mcp-server"]
    }
  }
}

Available Tools

crawl_page

Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.

Parameters:

  • url (required): Target URL to crawl

  • selector (optional): CSS selector for specific content extraction

  • format (optional): Output format - text, markdown, xml, or json (default: text)

  • text_only (optional): Legacy parameter for text-only extraction (deprecated, use format instead)

Output Formats:

  • text: Clean, plain text content with whitespace normalized

  • markdown: Well-formatted Markdown with headings, links, images, and lists preserved

  • xml: Structured XML with separate sections for headings, paragraphs, links, images, and lists

  • json: Structured JSON object containing categorized content elements
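
To make the idea of "categorized content elements" concrete, here is a rough sketch of structured extraction using only Python's standard library. The README does not document the server's JSON schema, so the keys used here (`headings`, `paragraphs`, `links`, `images`) and the names `StructuredExtractor` and `extract` are assumptions; lists are omitted to keep the example short.

```python
from html.parser import HTMLParser


class StructuredExtractor(HTMLParser):
    """Collect headings, paragraphs, links, and images from an HTML string."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.result = {"headings": [], "paragraphs": [], "links": [], "images": []}
        self._tag = None   # heading or paragraph tag currently open
        self._buffer = []  # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.HEADINGS or tag == "p":
            self._tag, self._buffer = tag, []
        elif tag == "a" and attrs.get("href"):
            self.result["links"].append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.result["images"].append(attrs["src"])

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace, as the text format is described as doing.
            text = " ".join("".join(self._buffer).split())
            key = "paragraphs" if tag == "p" else "headings"
            if text:
                self.result[key].append(text)
            self._tag = None


def extract(html):
    """Return a dict of categorized content found in the given HTML."""
    parser = StructuredExtractor()
    parser.feed(html)
    parser.close()
    return parser.result
```

The output of the real `json` format may be organized differently; inspect an actual response before relying on particular field names.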

Examples:

Basic text extraction:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "text"
  }
}

Markdown extraction with CSS selector:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "selector": "article",
    "format": "markdown"
  }
}

Structured JSON extraction:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "json"
  }
}
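
MCP clients normally build the wire-level request for you, but it can help to see how the `name` and `arguments` objects above fit into it. Under the MCP specification, a tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The helper name `tools_call` below is hypothetical, and the ID counter is just one simple way to produce unique request IDs.

```python
import json
from itertools import count

_request_ids = count(1)  # simple source of unique JSON-RPC request IDs


def tools_call(name, arguments):
    """Wrap a tool name and its arguments in a JSON-RPC 2.0 tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def to_wire(request):
    """Serialize a request to a single line of JSON, as used over stdio transports."""
    return json.dumps(request, separators=(",", ":"))
```

For example, `to_wire(tools_call("crawl_page", {"url": "https://example.com", "format": "json"}))` produces the serialized form of the structured JSON extraction example.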

check_robots

Checks whether a URL may be crawled according to the site's robots.txt file.

Parameters:

  • url (required): URL to check for crawling permission

Example:

{
  "name": "check_robots",
  "arguments": {
    "url": "https://example.com/page"
  }
}
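
For intuition about what a robots.txt check involves, the sketch below evaluates a URL against robots.txt rules using Python's standard `urllib.robotparser` module. It is an illustration only: the server's own checking logic and the user agent it identifies as are not described in this README, and the function name `is_allowed` and the `"*"` default are assumptions.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url.

    robots_txt is the already-downloaded text of the site's robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the rules `User-agent: *` and `Disallow: /private/`, a URL under `/private/` is rejected while other paths on the same site are allowed.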

Error Handling

Common error scenarios:

  • Network connection issues

  • Invalid HTML or missing content

  • Robots.txt restrictions

  • Request timeouts or rate limits

  • Content size too large (>10MB)
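
The size limit in the last item can be pictured as a capped read: stop as soon as the body exceeds the limit instead of buffering the whole page. The sketch below is an assumption about how such a guard might look, not the server's code. `PageTooLargeError` and `read_capped` are hypothetical names, and whether the server counts 10MB as 10 × 1024 × 1024 bytes is not stated in this README.

```python
MAX_PAGE_BYTES = 10 * 1024 * 1024  # assumed interpretation of the 10MB limit


class PageTooLargeError(Exception):
    """Raised when a response body exceeds the configured size limit."""


def read_capped(stream, limit=MAX_PAGE_BYTES, chunk_size=64 * 1024):
    """Read a binary stream fully, raising PageTooLargeError past limit bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > limit:
            # Abort early so an oversized page is never held in memory whole.
            raise PageTooLargeError(f"body exceeds {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)
```

Reading in chunks means memory use stays bounded by roughly the limit plus one chunk, even when the remote server does not send a `Content-Length` header.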

License

MIT