document-qa-prep
Chunks documents at paragraph boundaries for RAG pipelines. Provides deterministic IDs, token counts, metadata, and overlap support—no LLM required.
Instructions
Prepares a document for question-answering and RAG pipelines. Chunks the input text at paragraph/sentence boundaries, assigns deterministic chunk IDs, estimates token counts, and extracts document metadata (word count, type, headings). Returns ready-to-embed chunks with overlap support. No LLM or external API is involved: this is pure text processing. Use mid-task when you've fetched a document and need it split before embedding it or querying a vector store.
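To make the chunking behaviour described above concrete, here is a minimal Python sketch. It is not the tool's actual implementation: the function names (`estimate_tokens`, `chunk_text`), the output field names, the rounding rule, and the SHA-256-based ID scheme are all assumptions chosen for illustration. It splits only at blank-line paragraph boundaries, so sentence-level splitting of oversized paragraphs is left out.

```python
import hashlib
import re


def estimate_tokens(text: str) -> int:
    """Estimate tokens with the 4-characters-per-token heuristic (rounded up; assumed)."""
    return (len(text) + 3) // 4


def chunk_text(text: str, chunk_size_tokens: int = 512, overlap_tokens: int = 50) -> list:
    """Greedily pack paragraphs into chunks, carrying a character tail as overlap."""
    # Paragraphs are taken to be blocks separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and estimate_tokens(candidate) > chunk_size_tokens:
            chunks.append(current)
            # Overlap: reuse roughly overlap_tokens * 4 trailing characters
            # of the chunk that was just closed (assumed overlap strategy).
            tail = current[-overlap_tokens * 4:] if overlap_tokens > 0 else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [
        {
            # Deterministic ID: hash of position plus content (hypothetical scheme).
            "id": hashlib.sha256(f"{i}:{c}".encode("utf-8")).hexdigest()[:16],
            "index": i,
            "text": c,
            "token_estimate": estimate_tokens(c),
        }
        for i, c in enumerate(chunks)
    ]
```

Because the ID depends only on the chunk's position and text, running the sketch twice on the same input yields the same IDs, which is the property that lets a pipeline upsert chunks without creating duplicates.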
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| text | No | Document text to prepare (plain text, Markdown, or lightly-structured prose). Max 500,000 chars. | |
| chunk_size_tokens | No | Target chunk size in tokens (max 4096). Uses a 4-characters-per-token estimate. | 512 |
| overlap_tokens | No | Token overlap between consecutive chunks for context continuity (max 512). | 50 |
| metadata | No | Optional key-value metadata to attach to every chunk (e.g. source URL, document ID). | |
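As an illustration of the schema, the following Python sketch assembles an argument payload and checks it against the limits listed in the table. The helper name `build_arguments` is hypothetical, and the lower bounds (at least 1 token per chunk, non-negative overlap) are assumptions, since the table states only maximums.

```python
def build_arguments(text, chunk_size_tokens=512, overlap_tokens=50, metadata=None):
    """Build an input payload, enforcing the documented maximums."""
    if len(text) > 500_000:
        raise ValueError("text exceeds the 500,000-character limit")
    # Lower bound of 1 is an assumption; the schema documents only the max.
    if not 1 <= chunk_size_tokens <= 4096:
        raise ValueError("chunk_size_tokens must be between 1 and 4096")
    # Lower bound of 0 is an assumption; the schema documents only the max.
    if not 0 <= overlap_tokens <= 512:
        raise ValueError("overlap_tokens must be between 0 and 512")
    args = {
        "text": text,
        "chunk_size_tokens": chunk_size_tokens,
        "overlap_tokens": overlap_tokens,
    }
    if metadata:
        # Copied so later changes to the caller's dict do not leak in.
        args["metadata"] = dict(metadata)
    return args


example = build_arguments(
    "First paragraph.\n\nSecond paragraph.",
    chunk_size_tokens=256,
    metadata={"source": "example-doc"},
)
```

Validating on the caller's side is optional, but it surfaces an out-of-range value before the request is sent.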