docs-mcp-server

Overview Schema Related Servers Score Discussions

docs-mcp-server
openspec
changes
archive
2026-01-11-enhance-markdown-chunking

design.md•2.85 KiB

# Design: Enhanced Semantic Markdown Chunking ## Problem The current `SemanticMarkdownSplitter` treats most non-heading elements as generic text. This leads to suboptimal chunking for structured content like lists, blockquotes, and media. Additionally, horizontal rules (`<hr>`) are ignored or treated as text separators, whereas they should act as explicit chunk boundaries. ## Solution ### 1. New Content Types We will extend `SectionContentType` to include: - `list`: For `<ul>` and `<ol>` elements. - `blockquote`: For `<blockquote>` elements. - `media`: For `<img>` elements. ### 2. Horizontal Rules (`<hr>`) - In `splitIntoSections`, `<hr>` elements are detected and skipped (no `DocumentSection` is created for them). - Skipping `<hr>` acts as a hard split point between the sections created for the elements before and after it. - The `<hr>` itself will not be included in any section content, but its presence ensures that content before and after it are never merged into the same chunk. ### 3. List Handling - `<ul>` and `<ol>` elements will be captured as `list` content type. - A new `ListContentSplitter` will be implemented. - It will attempt to keep the list intact if it fits in `preferredChunkSize`. - If larger, it will split at `<li>` boundaries (preserving list structure/indentation if possible, or at least ensuring items are not split mid-text unless an individual item is huge). ### 4. Blockquote Handling - `<blockquote>` elements will be captured as `blockquote` content type. - They will be split using `TextContentSplitter` (or a specialized one) but preserving the `>` prefix on split chunks if possible is desirable (though standard `turndown` might just handle it as text with `>` markers). Keeping it as a separate type ensures we don't merge it with surrounding paragraphs. ### 5. Media Handling - `<img>` elements will be captured as `media` content type. - This ensures images (and their alt text/src) are kept as distinct chunks or at least recognized as such. ## Architecture Changes ### Reference Pattern: Code Block Handling We will follow the existing pattern used for code blocks (`PRE`/`code` tags): 1. **Detection**: Identify the specific HTML tag in `splitIntoSections` (e.g., `BLOCKQUOTE`, `UL`, `IMG`). 2. **Extraction**: Convert the inner HTML back to Markdown using `turndown`. 3. **Section Creation**: Create a new `DocumentSection` with the specific content type. 4. **Splitting**: Delegate to a specific splitter in `splitSectionContent` (e.g., `ListContentSplitter` for lists, `TextContentSplitter` for blockquotes). ### `SemanticMarkdownSplitter.ts` - Update `splitIntoSections` loop to handle `HR`, `UL/OL`, `BLOCKQUOTE`, `IMG`. - Update `splitSectionContent` switch case to handle new types. ### `splitters/ListContentSplitter.ts` - New class implementing `ContentSplitter`. - Logic to split markdown lists. ### `types.ts` - Update `SectionContentType`.

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/arabold/docs-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

design.md•2.85 KiB