Spider MCP Server

Overview Schema Related Servers Score Discussions

spider-mcp

plan.md•8.21 KiB

# mcp documentation spider server plan ## overview a fully implemented mcp server that crawls and spiders entire documentation websites, extracts clean text content, and uses llm-powered analysis to summarize content and identify key links, providing intelligent context through the model context protocol. the server is production-ready with comprehensive test coverage and supports both basic crawling and advanced llm-enhanced analysis. ## core architecture ### 1. project structure ``` spider-mcp/ ├── src/ │ ├── index.ts # mcp server entry point │ ├── spider/ │ │ ├── crawler.ts # main spider engine │ │ ├── parser.ts # html parsing and content extraction │ │ ├── queue.ts # url queue management │ │ ├── cache.ts # caching layer │ │ ├── robots.ts # robots.txt parser │ │ └── types.ts # spider-related types │ ├── mcp/ │ │ ├── server.ts # mcp server implementation │ │ ├── tools.ts # mcp tool definitions │ │ ├── handlers.ts # request handlers │ │ └── types.ts # mcp-related types │ ├── extractors/ │ │ ├── base.ts # base extractor interface │ │ ├── readability.ts # readability-based extractor │ │ ├── cheerio.ts # cheerio-based extractor │ │ └── markdown.ts # html to markdown converter │ ├── llm/ │ │ ├── client.ts # llm client interface and factory │ │ ├── providers/ │ │ │ ├── anthropic.ts # claude integration │ │ │ ├── openai.ts # openai integration │ │ │ └── local.ts # local llm support (ollama) │ │ ├── prompts.ts # llm prompt templates │ │ ├── analyzer.ts # content analysis and summarization │ │ └── types.ts # llm-related types │ └── utils/ │ ├── url.ts # url utilities │ ├── text.ts # text processing │ ├── config.ts # configuration management │ ├── logger.ts # logging utilities │ └── retry.ts # retry logic with backoff ├── test/ │ ├── unit/ │ │ ├── spider/ │ │ │ ├── crawler.test.ts │ │ │ ├── parser.test.ts │ │ │ ├── queue.test.ts │ │ │ └── cache.test.ts │ │ ├── extractors/ │ │ │ └── markdown.test.ts │ │ ├── llm/ │ │ │ ├── analyzer.test.ts │ │ │ └── providers.test.ts │ │ └── utils/ │ │ ├── url.test.ts │ │ └── text.test.ts │ ├── integration/ │ │ ├── mcp-server.test.ts │ │ ├── crawling.test.ts │ │ └── caching.test.ts │ └── fixtures/ │ ├── sample-pages/ │ └── mock-responses/ ├── cache/ # file system cache directory │ └── .gitignore ├── config/ │ ├── default.json # default configuration │ └── example.json # example custom config ├── scripts/ │ ├── build.ts # build script │ └── test-crawl.ts # manual testing script ├── .env.example # environment variables example ├── .gitignore ├── package.json ├── tsconfig.json ├── bunfig.toml # bun configuration ├── LICENSE ├── README.md └── plan.md ``` ### 2. technology stack - [x] runtime: bun (for typescript execution and testing) - [x] mcp sdk: @modelcontextprotocol/sdk - [x] web scraping: fetch for http requests - [x] html parsing: cheerio and jsdom - [x] content extraction: readability and custom extractors - [x] storage: in-memory map with file system persistence for content cache - [x] queue: in-memory with state tracking - [x] llm integration: anthropic claude (haiku/sonnet) - [x] llm libraries: @anthropic-ai/sdk ## implementation phases ### phase 1: foundation (setup) - [x] initialize bun project with typescript - [x] install mcp sdk and dependencies - [x] create basic mcp server structure - [x] implement stdio transport for mcp - [x] add basic health check tool ### phase 2: spider engine - [x] implement url queue with breadth-first traversal - [x] create crawler with configurable depth limits - [x] add robots.txt compliance - [x] implement retry logic with exponential backoff - [x] add user-agent configuration ### phase 3: content extraction - [x] implement html to markdown conversion - [x] extract main content (remove nav, ads, etc) - [x] preserve code blocks and formatting - [x] handle different documentation layouts - [x] extract metadata (title, description, etc) ### phase 4: caching and storage - [x] implement persistent cache with ttl - [x] store crawled urls and metadata in memory - [x] add cache invalidation strategies - [x] implement incremental updates - [x] add compression for stored content ### phase 5: mcp tools - [x] `spider_docs` - initiate crawl of documentation site - params: url, max_depth, include_patterns, exclude_patterns, enable_llm_analysis, llm_analysis_type - [x] `get_page` - retrieve specific page content - params: url or path - [x] `search_docs` - search through crawled content - params: query, limit - [x] `list_pages` - list all crawled pages - params: filter, sort - [x] `clear_cache` - clear cache for specific site or all - params: url_pattern ### phase 6: testing - [x] unit tests for url utilities and text processing - [x] unit tests for parser and content extractor - [x] integration tests for crawler with mock server - [x] integration tests for cache persistence - [x] mcp protocol compliance tests - [x] end-to-end tests with real documentation sites - [x] performance tests for concurrent crawling - [x] memory usage tests for large crawls ### phase 7: llm integration - [x] add anthropic sdk dependency and configuration - [x] implement llm client with provider abstraction - [x] create structured prompt templates for analysis - [x] build content analyzer with multiple analysis types - [x] integrate llm analysis into crawler pipeline ### phase 8: enhanced mcp tools - [x] `analyze_content` - perform llm analysis on cached pages - params: url, analysis_type (full, summary, links, classification) - [x] `get_summary` - get intelligent summaries of cached pages - params: url, summary_length, focus_areas - [x] enhance existing tools with llm parameters and output ### phase 9: comprehensive testing - [x] unit tests for llm providers and client - [x] unit tests for content analyzer with mocking - [x] integration tests for llm-enhanced crawling - [x] test coverage for all analysis types and error handling - [x] performance tests for llm integration overhead ## project status **status**: ✅ complete and production-ready the mcp spider server is fully implemented with all core features operational: - complete spider engine with robots.txt compliance and rate limiting - multiple content extraction methods (readability, cheerio, markdown) - file-based caching with ttl and compression - llm-powered content analysis using anthropic claude - comprehensive mcp tool suite with 7 available tools - 90 passing tests with comprehensive coverage - detailed documentation and configuration examples the server can be run immediately with `bun run dev` and supports both basic crawling (without api key) and advanced llm analysis (with anthropic api key). ### phase 10: code example extraction enhancement - [x] enhance llm prompts to identify and extract code examples from documentation - [x] update llm analysis types to include dedicated code examples section - [x] modify content analyzer to handle code example extraction and categorization - [x] update crawler integration to preserve extracted code examples - [x] add comprehensive tests for code example extraction functionality - [x] update documentation to reflect code example extraction capabilities

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/oeo/spider-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

plan.md•8.21 KiB