docs-mcp-server

Overview Schema Related Servers Score Discussions

docs-mcp-server
openspec
changes
add-llmstxt-discovery

tasks.md•4.87 KiB

## 1. llms.txt Parser - [ ] 1.1 Create `src/scraper/utils/llmsTxtParser.ts` with types (`LlmsTxtResult`, `LlmsTxtLink`, `LlmsTxtSection`) and a `parseLlmsTxt(content: string)` function that extracts H1 (project name), blockquote (summary), H2 sections, and link lists from the llms.txt Markdown format - [ ] 1.2 Create `src/scraper/utils/llmsTxtParser.test.ts` with tests covering: complete llms.txt with all sections, minimal (H1 + one link), optional section flagging, links with and without descriptions, invalid/empty content, HTML/binary content, edge cases (multiple H1s, malformed links) ## 2. QueueItem Extension - [ ] 2.1 Add optional `fromLlmsTxt?: boolean` field to `QueueItem` in `src/scraper/types.ts` ## 3. llms.txt Probe and URL Seeding - [ ] 3.1 Add a `probeLlmsTxt(baseUrl: string, inputUrl: string, signal?: AbortSignal)` method to `WebScraperStrategy` that derives candidate llms.txt URLs (parent directory of input URL path first, then site root), fetches via the existing fetcher (`AutoDetectFetcher`), parses valid responses, and returns the parsed result or null. Skip the probe entirely when `isRefresh` is true. - [ ] 3.2 Integrate the probe into the crawl initialization in `WebScraperStrategy` (or override the relevant hook in `BaseScraperStrategy.scrape()`): call `probeLlmsTxt` before the BFS loop, filter returned URLs through `shouldProcessUrl()`, and add passing URLs to the queue at depth 0 with `fromLlmsTxt: true` - [ ] 3.3 Hardcode llms.txt exclusion in `shouldProcessUrl()` or the URL filtering path in `BaseScraperStrategy` (not via configurable `defaultPatterns.ts`) so that `llms.txt` files are always excluded from indexing even when the user provides custom `excludePatterns` ## 4. Markdown Content Negotiation (Accept: text/markdown) - [ ] 4.1 Add `Accept: text/markdown, text/html;q=0.9, */*;q=0.8` header to all web page fetch requests in the fetcher layer (e.g., `AutoDetectFetcher` or the HTTP utility used by `WebScraperStrategy`). This applies to all web requests, not just llms.txt-discovered pages. - [ ] 4.2 In the content processing pipeline, detect `Content-Type: text/markdown` (or `text/plain`) responses and route them through the Markdown pipeline, bypassing HTML-to-Markdown conversion. If `Content-Type: text/html`, process through the normal HTML pipeline as before. ## 5. Markdown URL Preference (.md variant for llms.txt pages) - [ ] 5.1 In `WebScraperStrategy.processItem()`, when `item.fromLlmsTxt` is true, attempt to fetch the `.md` variant before fetching the original URL. For file-like URLs (path has a file extension), append `.md` (e.g., `page.html.md`). For directory-like URLs (trailing `/` or no extension in last segment), append `index.html.md`. Accept the `.md` response only if HTTP 200 and Content-Type is text-based (`text/markdown`, `text/plain`, `text/x-markdown`). Fall back to the original URL on failure. The `.md` variant request SHALL also include the `Accept: text/markdown` header. ## 6. Integration Tests - [ ] 6.1 Add tests to `src/scraper/strategies/WebScraperStrategy.test.ts` covering: llms.txt probe at subpath, probe fallback to root, probe failure (404), URL seeding with scope filtering, `.md` URL preference success and fallback, deduplication of llms.txt URLs with original URL, llms.txt exclusion from indexing - [ ] 6.2 Add tests for Markdown content negotiation: verify `Accept: text/markdown` header is sent on all web requests, verify `Content-Type: text/markdown` responses bypass HTML conversion, verify `Content-Type: text/html` responses are processed normally ## 7. Logging - [ ] 7.1 Add info-level log when llms.txt is detected (include URL and number of URLs extracted) - [ ] 7.2 Add debug-level log for probe failures (404, network error, parse failure) - [ ] 7.3 Add debug-level log when `.md` URL preference succeeds or falls back - [ ] 7.4 Add debug-level log when server responds with `Content-Type: text/markdown` via content negotiation ## 8. Documentation - [ ] 8.1 Update `README.md` to document automatic llms.txt discovery: explain that the scraper probes for `/llms.txt` before BFS crawling, uses discovered URLs as seeds, and prefers `.md` URL variants for llms.txt pages. Note that this is fully automatic with no configuration required. - [ ] 8.2 Update `README.md` to document Markdown content negotiation: explain that the scraper sends `Accept: text/markdown` on all web requests, and servers supporting this (e.g., Cloudflare's Markdown for Agents) will return Markdown directly, bypassing HTML-to-Markdown conversion for higher quality output. - [ ] 8.3 Update `ARCHITECTURE.md` if the llms.txt probe or content negotiation introduces meaningful changes to the scraping pipeline flow or data model (e.g., the `fromLlmsTxt` flag on `QueueItem`, the new parser utility). ## 9. Validation - [ ] 9.1 Run `npm run lint` and `npm run typecheck` to ensure no regressions - [ ] 9.2 Run `npm test` to confirm all existing and new tests pass

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/arabold/docs-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

tasks.md•4.87 KiB