Code-Index-MCP

PHASE3_CONTEXTUAL_EMBEDDINGS_SUMMARY.md•5.4 KiB

# Phase 3: Contextual Embeddings Implementation Summary

## Overview

Phase 3 enhances the base document plugin to use contextual embeddings when indexing chunks. This implementation provides richer semantic understanding by incorporating document structure, section hierarchy, and surrounding context into the embedding generation process.

## Key Enhancements

### 1. Enhanced `_index_chunks_semantically` Method

The method now builds comprehensive contextual information for each chunk:

```python
def _index_chunks_semantically(self, file_path: str, chunks: List[DocumentChunk], metadata: DocumentMetadata):
    """Index document chunks for semantic search with contextual embeddings."""
```

#### Contextual Components

1. **Document-level Context**
   - Document title
   - Document type (markdown, plaintext, etc.)
   - Document tags

2. **Section Hierarchy**
   - Current section name
   - Full hierarchy path (e.g., "Installation > Prerequisites > Python")
   - Parent-child relationships

3. **Surrounding Context**
   - Last 100 characters from previous chunk
   - First 100 characters from next chunk
   - Provides continuity across chunk boundaries

4. **Enhanced Metadata Storage**
   - Contextual text for embeddings
   - Original content preservation
   - Section hierarchy information
   - Document metadata

### 2. Modified Chunk Metadata Structure

Each chunk now includes:

```python
chunk.metadata = {
    'contextual_text': str,          # Full contextual embedding text
    'context_before': str,           # Previous chunk context
    'context_after': str,            # Next chunk context
    'section_hierarchy': List[str],  # Full section path
    'document_title': str,           # Document title
    'document_type': str,            # Document type
    'document_tags': List[str],      # Document tags
    # ... existing metadata ...
}
```

### 3. Enhanced Semantic Indexer Integration

The semantic indexer now accepts additional metadata:

```python
self.semantic_indexer.index_symbol(
    file=file_path,
    name=chunk_id,
    kind="chunk",
    signature=f"Chunk {chunk.chunk_index} of {metadata.title}",
    line=chunk.chunk_index,
    span=(chunk.start_pos, chunk.end_pos),
    doc=chunk.content[:200],
    content=contextual_text,  # Contextual embedding text
    metadata={...}            # Enhanced metadata
)
```

### 4. Improved Search Results

Search results now include rich contextual information:

```python
{
    "file": str,
    "line": int,
    "snippet": str,
    "score": float,
    "metadata": {
        "section": str,
        "section_hierarchy": List[str],
        "document_title": str,
        "document_type": str,
        "tags": List[str],
        "chunk_index": int,
        "total_chunks": int
    },
    "context_before": str,  # Optional
    "context_after": str    # Optional
}
```

## Implementation Details

### Contextual Text Generation

The contextual text for embeddings is built in layers:

```
Document: [Title]
Type: [Document Type]
Tags: [Tag1, Tag2, ...]
Section: [Parent > Child > Current]
Previous context: ...[last 100 chars]
Following context: [first 100 chars]...
Content: [Chunk content]
```

### Section Hierarchy Building

The implementation traverses the document structure to build complete section paths:

1. Identifies current section from chunk metadata
2. Finds parent sections recursively
3. Constructs hierarchy path from root to current section

### Embedding Pipeline

1. **Original Content**: Stored for retrieval
2. **Contextual Text**: Generated with all context
3. **Embedding Generation**: Uses contextual text
4. **Metadata Storage**: Both in chunk and vector DB

## Benefits

1. **Better Semantic Understanding**: Embeddings capture document structure and context
2. **Improved Search Relevance**: Queries match based on contextual meaning
3. **Structure Preservation**: Section hierarchy maintained for navigation
4. **Context Awareness**: Adjacent chunk context reduces boundary effects
5. **Rich Metadata**: Enhanced search results with full context

## Example Usage

```python
# The enhanced system automatically provides contextual embeddings
plugin = MarkdownPlugin(
    language_config={'name': 'markdown', 'code': 'md'},
    sqlite_store=store,
    enable_semantic=True
)

# Index a document - contextual embeddings are generated automatically
result = plugin.indexFile("README.md", content)

# Search with semantic understanding of context
results = plugin.search("install python dependencies", {
    'semantic': True,
    'limit': 10
})

# Results include full contextual information
for result in results:
    print(f"Section: {' > '.join(result['metadata']['section_hierarchy'])}")
    print(f"Context: {result.get('context_before', '')}...{result['snippet']}...{result.get('context_after', '')}")
```

## Testing

Run the test script to see contextual embeddings in action:

```bash
python test_contextual_embeddings.py
```

This demonstrates:

- Document indexing with contextual embeddings
- Semantic search with context awareness
- Rich metadata in search results
- Section hierarchy preservation

## Future Enhancements

1. **Dynamic Context Window**: Adjust context size based on chunk content
2. **Cross-Document Linking**: Include related document context
3. **Language-Specific Context**: Tailor context for different document types
4. **Query-Time Context**: Adjust context based on query intent
5. **Hierarchical Embeddings**: Separate embeddings for different hierarchy levels
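
As a rough illustration of the steps described under "Section Hierarchy Building" and "Contextual Text Generation", the following is a minimal standalone sketch, not the plugin's actual code. The function names `build_section_hierarchy` and `build_contextual_text`, the `parents` mapping, and the `CONTEXT_CHARS` constant are all hypothetical; only the layer labels and the 100-character windows come from this summary.

```python
from typing import Dict, List, Optional

CONTEXT_CHARS = 100  # neighbor-chunk window size stated in this summary


def build_section_hierarchy(
    section: str, parents: Dict[str, Optional[str]]
) -> List[str]:
    """Walk from a section up to the root, then return the root-first path.

    `parents` is a hypothetical child -> parent mapping; the real plugin
    may represent document structure differently.
    """
    path: List[str] = []
    seen = set()  # guards against accidental cycles in the mapping
    current: Optional[str] = section
    while current is not None and current not in seen:
        seen.add(current)
        path.append(current)
        current = parents.get(current)
    path.reverse()
    return path


def build_contextual_text(
    content: str,
    title: str,
    doc_type: str,
    tags: List[str],
    section_hierarchy: List[str],
    prev_content: Optional[str] = None,
    next_content: Optional[str] = None,
) -> str:
    """Assemble the layered text that would be passed to the embedder."""
    parts = [f"Document: {title}", f"Type: {doc_type}"]
    if tags:
        parts.append(f"Tags: {', '.join(tags)}")
    if section_hierarchy:
        parts.append(f"Section: {' > '.join(section_hierarchy)}")
    if prev_content:
        # Tail of the previous chunk, for continuity across the boundary
        parts.append(f"Previous context: ...{prev_content[-CONTEXT_CHARS:]}")
    if next_content:
        # Head of the next chunk
        parts.append(f"Following context: {next_content[:CONTEXT_CHARS]}...")
    parts.append(f"Content: {content}")
    return "\n".join(parts)


parents = {"Python": "Prerequisites", "Prerequisites": "Installation", "Installation": None}
hierarchy = build_section_hierarchy("Python", parents)
text = build_contextual_text(
    content="Install Python 3 before continuing.",
    title="README",
    doc_type="markdown",
    tags=["setup"],
    section_hierarchy=hierarchy,
    prev_content="x" * 150,
    next_content=None,
)
```

Optional layers (tags, section, neighbor context) are skipped when empty in this sketch, so the first and last chunks of a document simply omit the missing neighbor rather than embedding an empty label.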
