---
description: Guidelines for document chunking strategies in AGNO knowledge management
globs:
alwaysApply: false
---
# AGNO Chunking Strategies
Document chunking is a critical part of knowledge management in AGNO, breaking down large documents into smaller, semantically meaningful pieces for efficient storage and retrieval. This rule provides guidelines for implementing effective chunking strategies.
## Chunking Basics
Chunking in AGNO follows these general principles, illustrated by the sketch after the list:
1. Documents are split into smaller segments called "chunks"
2. Each chunk should contain coherent, semantically related content
3. Chunks are individually embedded and stored in vector databases
4. Agents search and retrieve relevant chunks based on query similarity
5. Controlled chunk size balances context retention and relevance
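As a minimal illustration of principles 1 and 5, the sliding-window sketch below splits a token list with overlap. It is plain Python for intuition only, not AGNO's implementation; the whitespace tokenization is a stand-in for a real tokenizer.
```python
# Illustrative only (not AGNO's implementation): a sliding window that
# advances by chunk_size - chunk_overlap tokens per step.
def chunk_tokens(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - chunk_overlap
    return [
        " ".join(tokens[i : i + chunk_size])
        for i in range(0, max(len(tokens) - chunk_overlap, 1), step)
    ]
```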
## Default Chunking Configuration
Most knowledge sources in AGNO accept chunking parameters:
```python
from agno.knowledge.file import FileKnowledge

knowledge = FileKnowledge(
    files=["data/document.pdf"],
    vector_db=your_vector_db,
    # Chunking parameters
    chunk_size=500,    # Target size of each chunk in tokens
    chunk_overlap=50,  # Overlap between consecutive chunks in tokens
)
```
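A quick sanity check for these parameters: each chunk after the first adds `chunk_size - chunk_overlap` new tokens, so a document of n tokens yields roughly `ceil((n - overlap) / (size - overlap))` chunks. The numbers below are illustrative arithmetic, not an AGNO API.
```python
import math

# Back-of-envelope chunk count for a 10,000-token document with the
# defaults above: each chunk advances by 500 - 50 = 450 new tokens.
n_tokens, chunk_size, chunk_overlap = 10_000, 500, 50
estimated = math.ceil((n_tokens - chunk_overlap) / (chunk_size - chunk_overlap))
print(estimated)  # 23
```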
## Chunking Methods
AGNO supports different chunking methods:
### Token-based Chunking
Splits text based on token count (default method):
```python
from agno.chunking.token_text_chunker import TokenTextChunker

chunker = TokenTextChunker(
    chunk_size=500,    # Target chunk size in tokens
    chunk_overlap=50,  # Overlap between chunks in tokens
)
```
### Character-based Chunking
Splits text based on character count:
```python
from agno.chunking.character_text_chunker import CharacterTextChunker

chunker = CharacterTextChunker(
    chunk_size=2000,    # Target chunk size in characters
    chunk_overlap=200,  # Overlap between chunks in characters
)
```
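A rough rule of thumb, not an AGNO guarantee: English prose averages around four characters per token, so the 2000/200 character settings above land close to the 500/50 token defaults. Character-based chunking is handy when you want deterministic sizes without running a tokenizer.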
### Sentence-based Chunking
Splits text at sentence boundaries:
```python
from agno.chunking.sentence_text_chunker import SentenceTextChunker

chunker = SentenceTextChunker(
    chunk_size=500,    # Target chunk size in tokens
    chunk_overlap=50,  # Overlap between chunks in tokens
)
```
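The idea behind sentence-based splitting can be sketched in a few lines: detect sentence boundaries, then greedily pack whole sentences into a chunk until the size budget is hit, so no sentence is ever cut mid-way. This is a simplified illustration with an assumed boundary regex and a character budget, not AGNO's implementation:
```python
import re

# Simplified illustration (not AGNO internals): pack whole sentences into
# chunks; a real implementation would budget tokens rather than characters.
def sentence_chunks(text: str, max_chars: int = 2000) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```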
### Paragraph-based Chunking
Splits text at paragraph boundaries:
```python
from agno.chunking.paragraph_text_chunker import ParagraphTextChunker

chunker = ParagraphTextChunker(
    chunk_size=500,    # Target chunk size in tokens
    chunk_overlap=50,  # Overlap between chunks in tokens
)
```
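Paragraph boundaries usually coincide with topic shifts, so paragraph-based chunks tend to be self-contained; here `chunk_size` acts mainly as an upper bound for unusually long paragraphs rather than as the primary splitting rule.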
### Document-Specific Chunking
Specialized chunkers for specific document types:
```python
# Markdown-aware chunking
from agno.chunking.markdown_text_chunker import MarkdownTextChunker

md_chunker = MarkdownTextChunker(
    chunk_size=500,
    chunk_overlap=50,
    # Honors header hierarchy in Markdown documents
    split_on_headings=True,
)

# Code-aware chunking
from agno.chunking.code_text_chunker import CodeTextChunker

code_chunker = CodeTextChunker(
    chunk_size=500,
    chunk_overlap=50,
    # Respects function and class boundaries
    respect_code_structure=True,
)
```
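What structure-aware splitting buys you is that a heading stays attached to the text beneath it. A minimal sketch of the Markdown case, illustrative only and not AGNO internals:
```python
import re

markdown_text = "# Intro\nSome text.\n## Details\nMore text.\n"

# Split immediately before each ATX heading, keeping each heading
# attached to the section it introduces.
sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", markdown_text) if s]
print(sections)  # ['# Intro\nSome text.\n', '## Details\nMore text.\n']
```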
## Custom Chunker with Knowledge Source
You can provide a custom chunker to knowledge sources:
```python
from agno.chunking.sentence_text_chunker import SentenceTextChunker
from agno.knowledge.file import FileKnowledge

# Create a custom chunker
chunker = SentenceTextChunker(
    chunk_size=800,
    chunk_overlap=100,
)

# Use the chunker with a knowledge source
knowledge = FileKnowledge(
    files=["data/document.pdf"],
    vector_db=your_vector_db,
    chunker=chunker,  # Use custom chunker
)
```
## Chunking Best Practices
### Chunk Size Considerations
Choosing the right chunk size depends on your use case:
```python
# Small chunks (200-300 tokens)
# Best for: Precise information retrieval, FAQ-type content
small_chunk_knowledge = FileKnowledge(
    files=["data/faqs.pdf"],
    vector_db=your_vector_db,
    chunk_size=250,
    chunk_overlap=25,
)

# Medium chunks (500-800 tokens)
# Best for: General purpose, balanced context/precision
medium_chunk_knowledge = FileKnowledge(
    files=["data/documentation.pdf"],
    vector_db=your_vector_db,
    chunk_size=600,
    chunk_overlap=50,
)

# Large chunks (1000+ tokens)
# Best for: Complex topics requiring extensive context
large_chunk_knowledge = FileKnowledge(
    files=["data/research_paper.pdf"],
    vector_db=your_vector_db,
    chunk_size=1200,
    chunk_overlap=100,
)
```
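Note that all three profiles hold the overlap near 10% of the chunk size; that ratio is a common starting point because it preserves continuity across chunk boundaries without storing much duplicated text.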
### Document Structure Considerations
Choose a chunking method based on the document's structure (a dispatch sketch follows the list):
- **Well-structured documents** (articles, documentation): Paragraph or header-based chunking
- **Narrative text** (books, essays): Sentence-based chunking
- **Technical content** (code, specifications): Code-aware or specialized chunking
- **Mixed content**: Default token-based chunking
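One way to apply these guidelines programmatically is to pick a chunker from the file extension. The helper below is hypothetical, not part of AGNO; it only reuses the chunker classes shown earlier:
```python
from pathlib import Path

from agno.chunking.code_text_chunker import CodeTextChunker
from agno.chunking.markdown_text_chunker import MarkdownTextChunker
from agno.chunking.sentence_text_chunker import SentenceTextChunker
from agno.chunking.token_text_chunker import TokenTextChunker

# Hypothetical helper (not an AGNO API): choose a chunker by file type.
def chunker_for(path: str):
    suffix = Path(path).suffix.lower()
    if suffix in {".md", ".mdx"}:
        return MarkdownTextChunker(chunk_size=500, chunk_overlap=50, split_on_headings=True)
    if suffix in {".py", ".js", ".ts"}:
        return CodeTextChunker(chunk_size=500, chunk_overlap=50, respect_code_structure=True)
    if suffix == ".txt":
        return SentenceTextChunker(chunk_size=500, chunk_overlap=50)
    return TokenTextChunker(chunk_size=500, chunk_overlap=50)  # mixed/unknown content
```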
## Metadata and Chunk Enrichment
Add metadata to chunks for improved retrieval:
```python
from agno.knowledge.file import FileKnowledge

knowledge = FileKnowledge(
    files=["data/document.pdf"],
    vector_db=your_vector_db,
    chunk_size=500,
    chunk_overlap=50,
    # Add metadata to chunks
    metadata={
        "source": "company_handbook",
        "department": "HR",
        "version": "2.3",
        "date": "2024-06-01",
    },
)
```
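The metadata dictionary is applied to every chunk produced from the source, so these fields travel with each chunk into the vector database. Most vector databases can filter search results on such fields; consult your vector_db's documentation for its filter syntax.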
## Advanced Chunking Strategies
### Hierarchical Chunking
For complex documents, implement hierarchical chunking:
```python
from agno.chunking.hierarchical_text_chunker import HierarchicalTextChunker
from agno.knowledge.file import FileKnowledge

# Create hierarchical chunker
chunker = HierarchicalTextChunker(
    parent_chunk_size=1500,  # Larger parent chunks
    child_chunk_size=300,    # Smaller child chunks
    parent_chunk_overlap=150,
    child_chunk_overlap=30,
)

# Use with knowledge source
knowledge = FileKnowledge(
    files=["data/complex_document.pdf"],
    vector_db=your_vector_db,
    chunker=chunker,
)
```
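The usual motivation for this pattern, often called small-to-big retrieval in the RAG literature, is that small child chunks embed precisely and match queries well, while the larger parent chunks supply surrounding context once results are passed to the model.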
### Dynamic Chunking
Adjust chunk size based on content complexity:
```python
from agno.chunking.dynamic_text_chunker import DynamicTextChunker
from agno.knowledge.file import FileKnowledge

# Create dynamic chunker
chunker = DynamicTextChunker(
    min_chunk_size=300,  # Minimum chunk size
    max_chunk_size=800,  # Maximum chunk size
    # Dynamically adjusts based on content complexity
    adjust_by_complexity=True,
)

# Use with knowledge source
knowledge = FileKnowledge(
    files=["data/mixed_content.pdf"],
    vector_db=your_vector_db,
    chunker=chunker,
)
```
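How complexity is measured is an implementation detail; purely as an assumption about what such adjustment could look like (not AGNO's actual rule), one plausible heuristic shrinks the chunk target for denser text:
```python
# Hypothetical complexity heuristic (an assumption, not AGNO's rule):
# denser text (longer average words) gets a smaller chunk target.
def target_chunk_size(text: str, min_size: int = 300, max_size: int = 800) -> int:
    words = text.split()
    if not words:
        return max_size
    avg_word_len = sum(len(w) for w in words) / len(words)
    density = min(avg_word_len / 8.0, 1.0)  # normalize: ~8 chars/word is "dense"
    return round(max_size - density * (max_size - min_size))
```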
## Multi-Document Strategy
For diverse document collections, tailor chunking by document type:
```python
from agno.chunking.token_text_chunker import TokenTextChunker
from agno.chunking.markdown_text_chunker import MarkdownTextChunker
from agno.chunking.code_text_chunker import CodeTextChunker
from agno.knowledge.file import FileKnowledge

# Strategy for documentation (Markdown files)
md_knowledge = FileKnowledge(
    files=["data/docs/*.md"],
    vector_db=your_vector_db,
    chunker=MarkdownTextChunker(
        chunk_size=600,
        chunk_overlap=60,
        split_on_headings=True,
    ),
)

# Strategy for code files
code_knowledge = FileKnowledge(
    files=["data/src/*.py"],
    vector_db=your_vector_db,
    chunker=CodeTextChunker(
        chunk_size=400,
        chunk_overlap=40,
        respect_code_structure=True,
    ),
)

# Strategy for general text files
text_knowledge = FileKnowledge(
    files=["data/text/*.txt"],
    vector_db=your_vector_db,
    chunker=TokenTextChunker(
        chunk_size=500,
        chunk_overlap=50,
    ),
)
```
## Complete RAG Pipeline with Chunking
```python
from agno.agent import Agent
from agno.chunking.paragraph_text_chunker import ParagraphTextChunker
from agno.embedder.openai import OpenAIEmbedder
from agno.knowledge.directory import DirectoryKnowledge
from agno.models.anthropic import Claude
from agno.tools.reasoning import ReasoningTools
from agno.vectordb.lancedb import LanceDb, SearchType

# Create embedder
embedder = OpenAIEmbedder(
    id="text-embedding-3-small",
    dimensions=1536,
)

# Create vector database
vector_db = LanceDb(
    uri="data/lancedb",
    table_name="documentation",
    search_type=SearchType.hybrid,
    embedder=embedder,
)

# Create custom chunker for technical documentation
chunker = ParagraphTextChunker(
    chunk_size=700,    # Larger chunks for technical content
    chunk_overlap=70,  # 10% overlap
)

# Create knowledge source with custom chunker
knowledge = DirectoryKnowledge(
    directory="data/technical_docs",
    file_extensions=[".md", ".txt", ".pdf"],
    vector_db=vector_db,
    chunker=chunker,  # Use custom chunker
    metadata={
        "domain": "technical",
        "audience": "developers",
    },
)

# Create agent with knowledge
agent = Agent(
    model=Claude(id="claude-3-7-sonnet-latest"),
    knowledge=knowledge,
    tools=[ReasoningTools(add_instructions=True)],
    instructions=[
        "When searching knowledge, focus on technical accuracy",
        "Provide context when citing from the knowledge base",
        "Use code examples when applicable",
    ],
)

# Load knowledge
knowledge.load()

# Run the agent
agent.print_response(
    "How do I implement hierarchical chunking for complex documents?",
    stream=True,
)
```