chunk_text
Split Markdown text into deterministic, embedding-ready chunks using configurable max characters and overlap for retrieval.
Instructions
Split Markdown text into deterministic, embedding-ready chunks. This is a read-only low-level retrieval primitive: it does not index, embed, or write anything.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| text | Yes | Markdown text to split | |
| source_id | No | Optional source identifier used in stable chunk ids | empty string |
| max_chars | No | Target maximum chunk body size, in characters, before overlap is added | 1200 |
| overlap_chars | No | Number of preceding characters prepended to each chunk after the first | 120 |
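
As a rough illustration of how these parameters behave, the sketch below applies the same defaults and clamps that the handler and `chunk_markdown()` apply in the implementation reference: `max_chars` is raised to at least 1, and `overlap_chars` is limited to the range from 0 to half of `max_chars`. The helper name `normalize_chunk_args` is hypothetical and is not part of nouz_mcp.

```python
from typing import Any


def normalize_chunk_args(args: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: resolve chunk_text arguments to effective values.

    Mirrors the defaults in the tool handler (1200 / 120) and the clamping
    done at the top of chunk_markdown().
    """
    # max_chars is coerced to int and never allowed below 1.
    max_chars = max(1, int(args.get("max_chars", 1200)))
    # overlap_chars is never negative and never more than half of max_chars.
    overlap_chars = max(0, min(int(args.get("overlap_chars", 120)), max_chars // 2))
    return {
        "text": args.get("text", ""),
        "source_id": args.get("source_id", ""),
        "max_chars": max_chars,
        "overlap_chars": overlap_chars,
    }


if __name__ == "__main__":
    # An oversized overlap is reduced to half of max_chars.
    print(normalize_chunk_args({"text": "# Title", "max_chars": 100, "overlap_chars": 80}))
```

In practice this means a request for an overlap larger than half the chunk size is reduced without an error, so callers who depend on an exact overlap may want to check the `overlap_chars` field reported on each returned chunk.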
Implementation Reference
- nouz_mcp/server.py:1537-1547 (handler) Tool handler for 'chunk_text' in handle_call_tool. It extracts text, source_id, max_chars, and overlap_chars from the arguments, calls chunk_markdown(), and returns the result as JSON.

      elif name == "chunk_text":
          text = args.get("text", "")
          source_id = args.get("source_id", "")
          chunks = chunk_markdown(
              text,
              source_id=source_id,
              max_chars=int(args.get("max_chars", 1200)),
              overlap_chars=int(args.get("overlap_chars", 120)),
          )
          result = {"source_id": source_id, "chunk_count": len(chunks), "chunks": chunks}
          return [types.TextContent(type="text", text=json.dumps(result, ensure_ascii=False, indent=2))]

- nouz_mcp/server.py:1239-1253 (registration) Tool registration for 'chunk_text' in handle_list_tools. Defines the name 'chunk_text', the description, and an inputSchema with text (required), source_id, max_chars, and overlap_chars.

      types.Tool(
          name="chunk_text",
          description="Split Markdown text into deterministic, embedding-ready chunks. This is a read-only low-level "
                      "retrieval primitive: it does not index, embed, or write anything.",
          inputSchema={
              "type": "object",
              "properties": {
                  "text": {"type": "string", "description": "Markdown text to split"},
                  "source_id": {"type": "string", "description": "Optional source identifier used in stable chunk ids"},
                  "max_chars": {"type": "integer", "description": "Target maximum chunk body size before overlap. Default 1200."},
                  "overlap_chars": {"type": "integer", "description": "Prefix overlap for chunks after the first. Default 120."},
              },
              "required": ["text"]
          }
      ),

- nouz_mcp/chunks.py:24-77 (helper) The chunk_markdown() function, which splits Markdown into deterministic chunks. It is called by the chunk_text tool handler.

      def chunk_markdown(
          text: str,
          *,
          source_id: str = "",
          max_chars: int = 1200,
          overlap_chars: int = 120,
      ) -> list[dict[str, Any]]:
          """Split Markdown into stable, embedding-ready chunks.

          This is a pure low-level primitive: it does not read files, write SQLite, or
          call an embedding model. Higher-level retrieval and context tools can build
          on this contract without coupling chunking to a specific workflow.
          """
          normalized = text.replace("\r\n", "\n").replace("\r", "\n")
          if not normalized.strip():
              return []

          max_chars = max(1, int(max_chars))
          overlap_chars = max(0, min(int(overlap_chars), max_chars // 2))
          base_chunks = _make_base_chunks(normalized, max_chars=max_chars)

          chunks: list[dict[str, Any]] = []
          body_hash_counts: dict[str, int] = {}
          for index, block in enumerate(base_chunks):
              prefix_start = max(0, block.start - overlap_chars) if index else block.start
              overlap = block.start - prefix_start
              chunk_text = normalized[prefix_start:block.end]
              body_text = normalized[block.start:block.end]
              body_hash = hashlib.sha1(body_text.encode("utf-8")).hexdigest()[:12]
              text_hash = hashlib.sha1(chunk_text.encode("utf-8")).hexdigest()[:12]
              occurrence = body_hash_counts.get(body_hash, 0)
              body_hash_counts[body_hash] = occurrence + 1
              digest = hashlib.sha1(
                  f"{CHUNKER_VERSION}\0{source_id}\0{body_hash}\0{occurrence}".encode("utf-8")
              ).hexdigest()[:12]
              chunks.append(
                  {
                      "id": f"chunk:{digest}",
                      "chunker_version": CHUNKER_VERSION,
                      "source_id": source_id,
                      "index": index,
                      "start_char": prefix_start,
                      "end_char": block.end,
                      "body_start_char": block.start,
                      "body_end_char": block.end,
                      "overlap_chars": overlap,
                      "char_count": len(chunk_text),
                      "heading": block.heading,
                      "body_hash": body_hash,
                      "text_hash": text_hash,
                      "text": chunk_text,
                  }
              )
          return chunks

- nouz_mcp/server.py:31-31 (helper) Import of chunk_markdown from nouz_mcp.chunks into server.py, which lets the tool handler call the chunking logic.

      from nouz_mcp.chunks import CHUNKER_VERSION, chunk_markdown
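
To make the id and overlap logic in chunk_markdown() easier to follow, here is a minimal standalone sketch of just those two steps. It assumes a placeholder value for `CHUNKER_VERSION` (the real constant lives in nouz_mcp/chunks.py and its value is not shown here), and the helper names `short_sha1`, `chunk_ids`, and `chunk_start` are hypothetical. It does not reproduce `_make_base_chunks`, whose splitting rules are not shown.

```python
import hashlib

# Assumption: placeholder only. The real CHUNKER_VERSION is defined in
# nouz_mcp/chunks.py, so ids produced here will not match real chunk ids.
CHUNKER_VERSION = "example-v1"


def short_sha1(value: str) -> str:
    """First 12 hex characters of the SHA-1 of a UTF-8 string."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()[:12]


def chunk_ids(source_id: str, bodies: list[str]) -> list[str]:
    """Derive one id per chunk body, following the scheme in chunk_markdown().

    The id depends on the chunker version, the source id, the body hash, and
    how many identical bodies came before it, but not on the overlap prefix.
    """
    counts: dict[str, int] = {}
    ids: list[str] = []
    for body in bodies:
        body_hash = short_sha1(body)
        occurrence = counts.get(body_hash, 0)
        counts[body_hash] = occurrence + 1
        digest = short_sha1(f"{CHUNKER_VERSION}\0{source_id}\0{body_hash}\0{occurrence}")
        ids.append(f"chunk:{digest}")
    return ids


def chunk_start(body_start: int, index: int, overlap_chars: int) -> int:
    """Return start_char: the first chunk has no prefix, later ones reach back."""
    return max(0, body_start - overlap_chars) if index else body_start


if __name__ == "__main__":
    # Repeated bodies still receive distinct ids because of the occurrence counter.
    print(chunk_ids("notes/example.md", ["## A\ntext", "## B\ntext", "## A\ntext"]))
    # A chunk body starting at offset 1000 with a 120-character overlap.
    print(chunk_start(1000, 1, 120))
```

Because the overlap prefix is excluded from the id, changing `overlap_chars` alters `text` and `text_hash` but leaves the ids unchanged, as long as the base chunk boundaries stay the same. The occurrence counter keeps ids unique when the same body text appears more than once in a source.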