# Phase 3: Contextual Embeddings Implementation Summary
## Overview
Phase 3 enhances the base document plugin to use contextual embeddings when indexing chunks. This implementation provides richer semantic understanding by incorporating document structure, section hierarchy, and surrounding context into the embedding generation process.
## Key Enhancements
### 1. Enhanced `_index_chunks_semantically` Method
The method now builds comprehensive contextual information for each chunk:
```python
def _index_chunks_semantically(self, file_path: str, chunks: List[DocumentChunk],
metadata: DocumentMetadata):
"""Index document chunks for semantic search with contextual embeddings."""
```
#### Contextual Components
1. **Document-level Context**
- Document title
- Document type (markdown, plaintext, etc.)
- Document tags
2. **Section Hierarchy**
- Current section name
- Full hierarchy path (e.g., "Installation > Prerequisites > Python")
- Parent-child relationships
3. **Surrounding Context**
- Last 100 characters from previous chunk
- First 100 characters from next chunk
- Provides continuity across chunk boundaries
4. **Enhanced Metadata Storage**
- Contextual text for embeddings
- Original content preservation
- Section hierarchy information
- Document metadata
### 2. Modified Chunk Metadata Structure
Each chunk now includes:
```python
chunk.metadata = {
'contextual_text': str, # Full contextual embedding text
'context_before': str, # Previous chunk context
'context_after': str, # Next chunk context
'section_hierarchy': List[str], # Full section path
'document_title': str, # Document title
'document_type': str, # Document type
'document_tags': List[str], # Document tags
# ... existing metadata ...
}
```
### 3. Enhanced Semantic Indexer Integration
The semantic indexer now accepts additional metadata:
```python
self.semantic_indexer.index_symbol(
file=file_path,
name=chunk_id,
kind="chunk",
signature=f"Chunk {chunk.chunk_index} of {metadata.title}",
line=chunk.chunk_index,
span=(chunk.start_pos, chunk.end_pos),
doc=chunk.content[:200],
content=contextual_text, # Contextual embedding text
metadata={...} # Enhanced metadata
)
```
### 4. Improved Search Results
Search results now include rich contextual information:
```python
{
"file": str,
"line": int,
"snippet": str,
"score": float,
"metadata": {
"section": str,
"section_hierarchy": List[str],
"document_title": str,
"document_type": str,
"tags": List[str],
"chunk_index": int,
"total_chunks": int
},
"context_before": str, # Optional
"context_after": str # Optional
}
```
## Implementation Details
### Contextual Text Generation
The contextual text for embeddings is built in layers:
```
Document: [Title]
Type: [Document Type]
Tags: [Tag1, Tag2, ...]
Section: [Parent > Child > Current]
Previous context: ...[last 100 chars]
Following context: [first 100 chars]...
Content: [Chunk content]
```
### Section Hierarchy Building
The implementation traverses the document structure to build complete section paths:
1. Identifies current section from chunk metadata
2. Finds parent sections recursively
3. Constructs hierarchy path from root to current section
### Embedding Pipeline
1. **Original Content**: Stored for retrieval
2. **Contextual Text**: Generated with all context
3. **Embedding Generation**: Uses contextual text
4. **Metadata Storage**: Both in chunk and vector DB
## Benefits
1. **Better Semantic Understanding**: Embeddings capture document structure and context
2. **Improved Search Relevance**: Queries match based on contextual meaning
3. **Structure Preservation**: Section hierarchy maintained for navigation
4. **Context Awareness**: Adjacent chunk context reduces boundary effects
5. **Rich Metadata**: Enhanced search results with full context
## Example Usage
```python
# The enhanced system automatically provides contextual embeddings
plugin = MarkdownPlugin(
language_config={'name': 'markdown', 'code': 'md'},
sqlite_store=store,
enable_semantic=True
)
# Index a document - contextual embeddings are generated automatically
result = plugin.indexFile("README.md", content)
# Search with semantic understanding of context
results = plugin.search("install python dependencies", {
'semantic': True,
'limit': 10
})
# Results include full contextual information
for result in results:
print(f"Section: {' > '.join(result['metadata']['section_hierarchy'])}")
print(f"Context: {result.get('context_before', '')}...{result['snippet']}...{result.get('context_after', '')}")
```
## Testing
Run the test script to see contextual embeddings in action:
```bash
python test_contextual_embeddings.py
```
This demonstrates:
- Document indexing with contextual embeddings
- Semantic search with context awareness
- Rich metadata in search results
- Section hierarchy preservation
## Future Enhancements
1. **Dynamic Context Window**: Adjust context size based on chunk content
2. **Cross-Document Linking**: Include related document context
3. **Language-Specific Context**: Tailor context for different document types
4. **Query-Time Context**: Adjust context based on query intent
5. **Hierarchical Embeddings**: Separate embeddings for different hierarchy levels