Skip to main content
Glama
README.md11.3 kB
# Semantic Compactor The Semantic Compactor is an advanced code analysis and compression system designed to extract semantically rich information from codebases while significantly reducing token consumption for LLM context. ## Overview The Semantic Compactor implements the suggestions from `docs/contextCompaction.md` by providing a comprehensive pipeline that: 1. **Discovers** relevant source files using intelligent filtering 2. **Parses** code into Abstract Syntax Trees (ASTs) with language-aware processing 3. **Prunes** ASTs to retain only semantically important information 4. **Summarizes** symbols with contextual descriptions and relationships 5. **Deduplicates** similar or identical code patterns across files 6. **Scores** symbols for relevance based on queries and task types 7. **Generates** compact, queryable representations optimized for LLM consumption ## Architecture ``` ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ │ FileDiscovery │───▶│ ASTParser │───▶│ ASTPruner │ │ - globby-based │ │ - Babel/TS │ │ - Symbol │ │ - filtering │ │ - Tree-sitter │ │ extraction │ │ - relevance │ │ - Multi-language │ │ - Importance │ │ sorting │ │ support │ │ scoring │ └─────────────────┘ └──────────────────┘ └─────────────────┘ │ │ │ ▼ ▼ ▼ ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ │ Deduplicator │◀───│ SemanticSummarizer│◀───│ RelevanceScorer │ │ - Content hash │ │ - Symbol purpose │ │ - Query match │ │ - Signature │ │ - File summaries │ │ - Task context │ │ dedup │ │ - Project arch │ │ - Token budget │ │ - Cross-file │ │ analysis │ │ enforcement │ └─────────────────┘ └──────────────────┘ └─────────────────┘ │ │ │ └───────────────────────┼───────────────────────┘ ▼ ┌─────────────────────────┐ │ SemanticCompactor │ │ - Orchestrates pipeline │ │ - Caching │ │ - Error handling │ │ - Performance tracking │ └─────────────────────────┘ ``` ## Key Features ### 🔍 Intelligent File Discovery - Uses `globby` for efficient file pattern matching - Filters out irrelevant files (tests, build artifacts, dependencies) - Prioritizes entry points and important modules - Supports configurable file size limits and extensions ### 🌳 Advanced AST Parsing - **Babel Parser**: High-quality TypeScript/JavaScript parsing with full language support - **Tree-sitter**: Fallback for other languages and robust error recovery - **Symbol Extraction**: Functions, classes, interfaces, types, variables with full metadata - **Relationship Mapping**: Import/export relationships, function calls, inheritance ### ✂️ Semantic Pruning - **Importance Scoring**: Exported symbols, documented code, complexity analysis - **Body Compaction**: Intelligently truncates function bodies while preserving signatures - **Quality Filtering**: Removes low-value symbols (auto-generated, internal utilities) - **Context Preservation**: Maintains essential relationships and dependencies ### 🔄 Advanced Deduplication - **Content Hashing**: SHA-256 based duplicate detection - **Signature Matching**: Identifies functionally similar symbols - **Cross-File Analysis**: Finds duplicates across entire project - **Smart Preservation**: Keeps the most important instance (exported > documented > complex) ### 🎯 Relevance Scoring - **Query Matching**: Fuzzy string matching, semantic word overlap - **Task-Specific Scoring**: Debug, implement, understand, refactor, test, document - **Context Awareness**: File relationships, symbol dependencies - **Token Budget Enforcement**: Respects strict token limits ### 📊 Comprehensive Analytics - **Processing Statistics**: Files processed, symbols found, compression ratios - **Performance Metrics**: Processing time, memory usage, throughput - **Quality Measures**: Documentation coverage, export ratios, error rates - **Benchmark Comparisons**: Configuration optimization, regression testing ## API Usage ### Basic Compaction ```typescript import { SemanticCompactor } from './semanticCompactor'; const compactor = new SemanticCompactor('/path/to/project', { maxFileSize: 150000, supportedLanguages: ['typescript', 'javascript'], includeDocstrings: true, maxTotalTokens: 20000 }); const result = await compactor.compact(); console.log(`Compressed ${result.summary.totalSymbols} symbols to ${result.totalTokens} tokens`); ``` ### Query-Specific Context ```typescript const contextBundle = await compactor.compact({ query: 'authentication middleware', taskType: 'understand', maxTokens: 4000 }); // Use the compacted context for LLM prompts const promptContext = compactor.generatePromptContext(contextBundle, 4000); ``` ### File-Specific Analysis ```typescript // Get summary for a specific file const fileSummary = await compactor.getSummary('src/auth/middleware.ts'); // Get context for a specific symbol const symbolContext = await compactor.getContextForSymbol('middleware.ts:authenticateUser:15'); ``` ## Integration with MCP Tools The semantic compactor is integrated into the existing MCP tools to provide enhanced context: ### Enhanced Search Context ```typescript // Before: Basic keyword search const results = await searchContext({ queryText: 'authentication' }); // After: Semantic search with compaction const enhancedResults = await enhancedSearchContext({ queryText: 'authentication', taskType: 'understand', maxTokens: 3000 }); ``` ### Context Bundles with Compression Info ```typescript const bundle = await getContextBundle({ query: 'user authentication flow', token_budget: 4000 }); console.log(`Compression: ${bundle.compression_info.compression_ratio * 100}% of original`); console.log(`Symbols: ${bundle.compression_info.original_symbols} → ${bundle.compression_info.compacted_symbols}`); ``` ## Configuration Options ### File Processing ```typescript { maxFileSize: 100000, // Skip files larger than this supportedLanguages: ['typescript', 'javascript', 'python'], maxConcurrentFiles: 10, // Process files in batches enableCaching: true // Cache results for performance } ``` ### AST Processing ```typescript { astOptions: { includePrivateMethods: false, // Skip private/internal symbols includeComments: true, // Include docstrings and comments maxFunctionBodyLines: 5, // Truncate long function bodies preserveTypeAnnotations: true // Keep TypeScript type info } } ``` ### Deduplication ```typescript { deduplicationOptions: { enableSignatureDeduplication: true, // Dedupe by signature enableBodyDeduplication: true, // Dedupe by content enableCrossFileDeduplication: true, // Dedupe across files similarityThreshold: 0.8 // How similar to consider duplicates } } ``` ### Quality Control ```typescript { minSymbolImportance: 10, // Filter low-importance symbols prioritizeExports: true, // Prefer exported symbols maxTokensPerFile: 2000, // Limit per-file token usage maxTotalTokens: 20000 // Global token budget } ``` ## Performance Characteristics ### Benchmarks (typical project with 100 TypeScript files) - **Processing Time**: 2-5 seconds for initial compaction - **Compression Ratio**: 60-80% token reduction vs raw code - **Memory Usage**: ~50MB peak during processing - **Cache Performance**: 95%+ hit rate for repeated queries ### Scalability - **Small Projects** (< 50 files): Sub-second processing - **Medium Projects** (50-200 files): 2-10 seconds - **Large Projects** (200+ files): 10-30 seconds with progressive loading ## Error Handling The compactor includes comprehensive error handling: - **Parse Errors**: Graceful fallback to tree-sitter or skip problematic files - **Memory Limits**: Automatic cleanup and garbage collection - **Timeouts**: Configurable processing timeouts per file - **Validation**: Input validation and sanitization throughout pipeline ## Testing and Benchmarking ### Running Tests ```bash npm run test:compactor # Run semantic compactor tests ``` ### Performance Benchmarking ```bash npm run benchmark # Benchmark current project npm run benchmark /path # Benchmark specific project ``` ### Regression Testing The benchmark system includes regression testing to ensure performance doesn't degrade: - **Performance Regression**: Processing time increases - **Compression Regression**: Worse compression ratios - **Quality Regression**: Lower symbol importance scores ## Future Enhancements ### Planned Features - **Language Support**: Python, Go, Rust, Java parsing - **ML Integration**: CodeBERT embeddings for semantic similarity - **Incremental Updates**: Delta processing for changed files - **Visual Analysis**: Mermaid diagrams for code structure - **Custom Parsers**: Plugin system for domain-specific languages ### Optimization Opportunities - **Streaming Processing**: Process large projects in chunks - **Worker Threads**: Parallel file processing - **Persistent Caching**: Cross-session result caching - **Memory Optimization**: Reduce peak memory usage ## Conclusion The Semantic Compactor provides a sophisticated solution for code context compression, achieving significant token savings while preserving semantic richness. It integrates seamlessly with existing MCP tools and provides comprehensive analytics for optimization. Key benefits: - ✅ **60-80% token reduction** vs raw code - ✅ **Semantic preservation** of important symbols and relationships - ✅ **Query-aware context** generation with relevance scoring - ✅ **Comprehensive deduplication** across files and projects - ✅ **Performance monitoring** and optimization guidance - ✅ **Extensible architecture** for future enhancements The system successfully implements all requirements from the contextCompaction specification and provides a solid foundation for intelligent code context management in LLM applications.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/sbarron/AmbianceMCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server