docs-mcp-server



proposal.md•2.99 KiB

# Change: Add llms.txt discovery, Markdown URL preference, and Markdown content negotiation

## Why

BFS crawling from a root URL is noisy - it discovers nav bars, blog posts, changelogs, and marketing pages that aren't useful documentation. The [llms.txt specification](https://llmstxt.org/) is an emerging standard where websites provide a curated `/llms.txt` Markdown file listing the most important pages for LLM consumption. By automatically detecting and using this file during scraping, we can produce higher-quality indexes with less user configuration. Additionally, the spec proposes that pages offer clean Markdown at `url.md`, which would bypass our HTML-to-Markdown conversion and yield better chunks.

Separately, [Cloudflare's Markdown for Agents](https://blog.cloudflare.com/markdown-for-agents/) introduces server-side content negotiation: when a client sends `Accept: text/markdown`, supporting servers convert HTML to Markdown on the fly. This is complementary to the `.md` URL convention and benefits all web-scraped pages, not just those discovered via llms.txt. Cloudflare reports ~80% token reduction compared to raw HTML.

## What Changes

- **Automatic llms.txt probe**: Before BFS crawling begins, `WebScraperStrategy` probes for `/llms.txt` at the site (subpath directory first, then site root). If found, the listed URLs are parsed and added to the BFS crawl queue as seeds alongside the original URL.
- **llms.txt parser**: A new utility (`llmsTxtParser`) parses the well-defined Markdown format of llms.txt files, extracting project name, summary, section groupings, and link lists.
- **Markdown URL preference**: When fetching pages discovered via llms.txt, the fetcher first attempts to retrieve the `.md` variant of the URL (e.g., `page.html.md`). Falls back to the original URL on failure.
- **Markdown content negotiation**: All web scraper HTTP requests include `Accept: text/markdown, text/html;q=0.9, */*;q=0.8`. When a server responds with `Content-Type: text/markdown`, the response bypasses HTML-to-Markdown conversion. This applies to all web-scraped pages, not just llms.txt-discovered ones.
- **QueueItem extension**: `QueueItem` gains an optional `fromLlmsTxt` flag so `processItem` knows which pages to try `.md` variants for.
- **Logging**: Discovery and usage of llms.txt is logged at info level; probe failures logged at debug level.

## Impact

- Affected specs: New capability `llmstxt-discovery` (no existing specs modified)
- Affected code:
  - `src/scraper/strategies/WebScraperStrategy.ts` (probe logic, `.md` preference in `processItem`)
  - `src/scraper/strategies/BaseScraperStrategy.ts` (minor: handle additional seed URLs)
  - `src/scraper/types.ts` (`QueueItem` extension)
  - `src/scraper/fetchers/` (Accept header addition for content negotiation, Content-Type handling)
  - New: `src/scraper/utils/llmsTxtParser.ts` (parser)
  - New: `src/scraper/utils/llmsTxtParser.test.ts` (tests)
  - `src/scraper/strategies/WebScraperStrategy.test.ts` (tests for probe, `.md` fallback, and content negotiation)
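
To make the parser item more concrete, here is a minimal sketch of what an llms.txt parser could look like. The proposal names the utility `llmsTxtParser` but does not define its API, so the interfaces and the functions `parseLlmsTxt` and `collectSeedUrls` below are hypothetical. The sketch assumes the layout described at llmstxt.org: one H1 for the project name, a blockquote summary, and H2 sections containing Markdown link lists with optional `: notes` suffixes.

```typescript
// Hypothetical shapes: the proposal does not specify the parser's return type.
interface LlmsTxtLink {
  title: string;
  url: string;
  notes?: string;
}

interface LlmsTxtSection {
  name: string;
  links: LlmsTxtLink[];
}

interface LlmsTxtDocument {
  title?: string;
  summary?: string;
  sections: LlmsTxtSection[];
}

// Line-based parse of the assumed layout: H1 title, blockquote summary, H2 sections with link lists.
function parseLlmsTxt(text: string): LlmsTxtDocument {
  const doc: LlmsTxtDocument = { sections: [] };
  const summaryLines: string[] = [];
  let current: LlmsTxtSection | undefined;

  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();

    // The first "# ..." line is the project name ("## ..." does not match this pattern).
    const h1 = /^#\s+(.+)$/.exec(line);
    if (h1 && doc.title === undefined) {
      doc.title = h1[1].trim();
      continue;
    }

    // Each "## ..." line opens a new section grouping.
    const h2 = /^##\s+(.+)$/.exec(line);
    if (h2) {
      current = { name: h2[1].trim(), links: [] };
      doc.sections.push(current);
      continue;
    }

    // Blockquote lines before the first section form the summary.
    const quote = /^>\s?(.*)$/.exec(line);
    if (quote && !current) {
      summaryLines.push(quote[1].trim());
      continue;
    }

    // List items of the form "- [title](url): optional notes" inside a section.
    const link = /^[-*]\s+\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$/.exec(line);
    if (link && current) {
      const entry: LlmsTxtLink = { title: link[1], url: link[2] };
      const notes = link[3]?.trim();
      if (notes) entry.notes = notes;
      current.links.push(entry);
    }
  }

  if (summaryLines.length > 0) doc.summary = summaryLines.join(" ").trim();
  return doc;
}

// Resolve every listed link against the llms.txt location and de-duplicate,
// yielding absolute URLs that could be queued as extra crawl seeds.
function collectSeedUrls(doc: LlmsTxtDocument, baseUrl: string): string[] {
  const seen = new Set<string>();
  const seeds: string[] = [];
  for (const section of doc.sections) {
    for (const link of section.links) {
      let absolute: string;
      try {
        absolute = new URL(link.url, baseUrl).href;
      } catch {
        continue; // skip links that cannot be resolved to a URL
      }
      if (!seen.has(absolute)) {
        seen.add(absolute);
        seeds.push(absolute);
      }
    }
  }
  return seeds;
}
```

Keeping the parse step separate from seed collection is a design choice of this sketch: the section names stay available for logging or filtering, while the crawl queue only ever sees absolute, de-duplicated URLs.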
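
The probe order, `.md` preference, and content negotiation described under "What Changes" could be sketched as follows. This is an illustration under stated assumptions, not the project's fetcher code: the helper names (`llmsTxtProbeCandidates`, `markdownVariant`, `isMarkdownResponse`, `fetchCandidates`, `fetchPreferringMarkdown`) and the `QueueItemLike` type are hypothetical, the global `fetch` of Node 18+ stands in for whatever HTTP client `src/scraper/fetchers/` uses, and the `index.html.md` handling for directory URLs follows the llmstxt.org convention rather than anything stated in the proposal.

```typescript
// Accept header value taken from the proposal text.
const ACCEPT_HEADER = "text/markdown, text/html;q=0.9, */*;q=0.8";

// Hypothetical stand-in for the extended QueueItem; only the fields used here.
interface QueueItemLike {
  url: string;
  fromLlmsTxt?: boolean;
}

// Probe order: llms.txt in the start URL's directory first, then at the site root.
function llmsTxtProbeCandidates(startUrl: string): string[] {
  const url = new URL(startUrl);
  // Keep everything up to and including the last "/" of the path.
  const dir = url.pathname.slice(0, url.pathname.lastIndexOf("/") + 1);
  const root = new URL("/llms.txt", url.origin).href;
  if (dir === "/") return [root];
  return [new URL(`${dir}llms.txt`, url.origin).href, root];
}

// Build the `.md` variant of a page URL (e.g. page.html -> page.html.md).
// Assumption: directory URLs map to index.html.md, per the llmstxt.org convention.
function markdownVariant(pageUrl: string): string | null {
  const url = new URL(pageUrl);
  if (url.pathname.endsWith(".md")) return null; // already a Markdown URL
  url.pathname = url.pathname.endsWith("/")
    ? `${url.pathname}index.html.md`
    : `${url.pathname}.md`;
  return url.href;
}

// True when the Content-Type media type is text/markdown, ignoring parameters such as charset.
function isMarkdownResponse(contentType: string | null): boolean {
  if (!contentType) return false;
  return contentType.split(";")[0].trim().toLowerCase() === "text/markdown";
}

// URLs to try in order: the `.md` variant first for llms.txt-discovered pages, then the original.
function fetchCandidates(item: QueueItemLike): string[] {
  const variant = item.fromLlmsTxt ? markdownVariant(item.url) : null;
  return variant ? [variant, item.url] : [item.url];
}

// Try each candidate with the negotiation header and return the first successful response.
async function fetchPreferringMarkdown(
  item: QueueItemLike,
): Promise<{ url: string; body: string; isMarkdown: boolean }> {
  let lastError: unknown = new Error(`No candidates for ${item.url}`);
  for (const url of fetchCandidates(item)) {
    try {
      const res = await fetch(url, { headers: { Accept: ACCEPT_HEADER } });
      if (!res.ok) {
        lastError = new Error(`HTTP ${res.status} for ${url}`);
        continue; // fall back to the next candidate
      }
      return {
        url,
        body: await res.text(),
        // Only the Content-Type check is shown; a caller would skip HTML conversion when true.
        isMarkdown: isMarkdownResponse(res.headers.get("content-type")),
      };
    } catch (err) {
      lastError = err; // network error: fall back to the next candidate
    }
  }
  throw lastError;
}
```

In this sketch a start URL without a trailing slash, such as `https://example.com/docs`, is treated as a file in the root directory, so only the root `/llms.txt` is probed; whether the real probe should also try `/docs/llms.txt` in that case is left open by the proposal.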



