semantic-code-mcp

Overview Schema Related Servers Score Discussions

005-markdown-chunking.md•4.39 KiB

# 005: Markdown Chunking **Status**: implemented **Date**: 2026-02-01 ## Context The semantic search index only covers code files (Python, Rust). Projects also contain significant knowledge in markdown files — READMEs, docs, decision records, changelogs, rule files. When an agent searches for "how authentication works," relevant documentation doesn't surface because `.md` files aren't indexed. `tree-sitter-markdown` (v0.5.1) is available on PyPI and follows the same `language()` pattern as the existing tree-sitter grammars. ### Spike findings (tree-sitter-markdown AST) The AST uses **nested `section` nodes** rather than flat heading siblings: ``` document section # H1 scope atx_heading atx_h1_marker # "#" inline # heading text (without markers) paragraph # content under H1 section # H2 nested inside H1 atx_heading atx_h2_marker inline paragraph section # another H1 scope ... ``` Key properties: - `section` nodes **nest by heading level** — H2 sections are children of H1 sections - Pre-heading content gets its own `section` (no `atx_heading` child) - No-heading documents: single `section` wrapping all content - Empty documents: zero children - `section` nodes carry correct line ranges spanning their full content ## Decision ### Chunking strategy: recursive section walking Walk `section` nodes recursively. Each `section` that contains an `atx_heading` becomes a `ChunkType.section` chunk. Each `section` without a heading (preamble) becomes a `ChunkType.module` chunk. **Flatten nested sections.** A document with `# H1`, `## H2a`, `## H2b` produces three separate chunks, not one big H1 chunk containing H2s. This keeps chunks small and focused — better for embedding similarity search. For each section node: 1. Find its `atx_heading` child (if any) to determine the chunk name. 2. Extract the heading text from the `inline` child of the heading node. 3. Collect the section's own non-section content (paragraphs, code blocks, lists, etc.) as the chunk content — excluding nested sub-sections. 4. Emit a chunk spanning from the section's heading to the end of its direct content (before first sub-section, or end of section if no sub-sections). 5. Recurse into child `section` nodes. ### ChunkType extension Add `section = auto()` to the `ChunkType` enum. Storage is string-based (LanceDB UTF-8), so fully backward-compatible. Existing indexes simply won't have rows with this value. Using a new type rather than mapping to existing code types (`module`/`function`) because markdown sections are semantically different from code constructs and consumers may want to filter by type. Exception: preamble content (before first heading) uses `ChunkType.module`, consistent with how Python/Rust chunkers use `module` for file-level docstrings. ### Chunk naming - Section with heading: name = heading text (e.g., "Installation", "API Reference") - Preamble without heading: name = file stem (e.g., "README", "CHANGELOG") - Document with no headings: name = file stem, type = module ## Alternatives Considered ### A: Flat sibling iteration (tracking "current section" state) The initial assumption before the spike. Would iterate document children, tracking heading boundaries manually. Rejected because the AST already provides `section` nodes with correct boundaries — reimplementing this logic would be redundant and error-prone. ### B: Map to existing ChunkTypes (headings as `module`, code blocks as `function`) Avoids adding a new enum value. Rejected because it's misleading — a markdown heading isn't a module, and search result consumers may filter or display based on chunk type. ### C: Extract fenced code blocks as separate chunks Would give code examples within docs their own chunks for better code-specific search. Rejected for now — adds complexity, fragments surrounding context the embedding model needs, and can be added later if search quality for in-doc code is poor. ## Consequences - New `ChunkType.section` value — any code that exhaustively matches on `ChunkType` needs updating (check with grep) - `tree-sitter-markdown` added as dependency (~small, pure wheel) - `.md` files will be indexed on next `index_codebase` call — existing indexes untouched until re-indexed - The `test_composite_chunker.py` test using `.md` as "unknown extension" needs updating

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/vrppaul/semantic-code-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

005-markdown-chunking.md•4.39 KiB