context-engine-file-selection-improvements.mdā¢11.3 kB
# Context Engine File Selection: Improvement Plan
**Date**: July 4, 2025
**Status**: ā
**IMPLEMENTED AND TESTED**
**Implementation Date**: July 5, 2025
---
## šÆ **Current Problem**
The context engine's file selection is **too simplistic** and relies on hardcoded patterns:
### **Current Implementation Issues:**
1. **Hardcoded keyword matching** - Only works for "intelligence" queries
2. **Static file lists** - Predefined files for specific query types
3. **Naive fallback** - Just grabs first 20 `.ts/.js` files for unknown queries
4. **No semantic understanding** - Can't understand query intent
5. **No relevance scoring** - All files treated equally
### **Why AI-Based Discovery is Wrong:**
```typescript
// ā BAD APPROACH - Double AI calls
const discoveryPrompt = `
Query: "${query}"
Available files: ${allFiles.join(', ')}
Which files are most relevant? Return top 10 files ranked by relevance.
`;
```
**Problems with this approach:**
- ā **Double API costs** - AI call to decide what to send to AI
- ā **Increased latency** - Two sequential AI calls instead of one
- ā **Context waste** - File list takes up tokens that could be used for actual analysis
- ā **No file content** - AI can't judge relevance without seeing file contents
- ā **Circular logic** - Using AI to decide what to give to AI
---
## š **Better Approaches for Future Implementation**
### **1. Static Analysis-Based Selection**
Use **code analysis tools** to understand file relationships without AI:
```typescript
// ā
GOOD APPROACH - Static analysis
interface FileRelevanceAnalyzer {
analyzeImportGraph(query: string): string[];
findSymbolUsage(symbols: string[]): string[];
analyzeFileTypes(query: string): string[];
scoreByFileSize(files: string[]): ScoredFile[];
}
// Example implementation
async function selectRelevantFiles(query: string): Promise<string[]> {
const analyzer = new FileRelevanceAnalyzer();
// 1. Extract keywords/symbols from query
const keywords = extractKeywords(query);
const symbols = extractSymbols(query);
// 2. Find files containing these symbols
const symbolFiles = await analyzer.findSymbolUsage(symbols);
// 3. Follow import chains
const importFiles = await analyzer.analyzeImportGraph(query);
// 4. Score by relevance
const scoredFiles = analyzer.scoreByFileSize([...symbolFiles, ...importFiles]);
return scoredFiles.slice(0, 10).map(f => f.path);
}
```
### **2. Filesystem-Based Heuristics**
Use **file system patterns** and **naming conventions**:
```typescript
// ā
GOOD APPROACH - Smart filesystem analysis
interface SmartFileDiscovery {
findByDirectoryRelevance(query: string): string[];
findByFileNameMatching(query: string): string[];
findByFileExtensions(query: string): string[];
findByRecentModifications(): string[];
}
// Example patterns
const queryPatterns = {
'api': ['src/api/**', 'src/routes/**', 'src/endpoints/**'],
'database': ['src/db/**', 'src/models/**', 'src/schema/**'],
'ui': ['src/components/**', 'src/pages/**', 'src/views/**'],
'auth': ['src/auth/**', 'src/security/**', 'src/middleware/**'],
'intelligence': ['src/intelligence/**', 'src/ai/**', 'src/context/**']
};
```
### **3. Hybrid Scoring System**
Combine **multiple signals** without AI calls:
```typescript
// ā
GOOD APPROACH - Multi-signal scoring
interface FileScore {
path: string;
relevanceScore: number;
signals: {
nameMatch: number; // File name matches query keywords
directoryMatch: number; // Directory structure relevance
symbolMatch: number; // Contains relevant symbols/functions
importRelevance: number; // Imported by or imports relevant files
recentActivity: number; // Recently modified (for debugging)
fileSize: number; // Prefer substantial files over tiny ones
};
}
function calculateRelevanceScore(file: string, query: string): FileScore {
const signals = {
nameMatch: scoreFileName(file, query),
directoryMatch: scoreDirectory(file, query),
symbolMatch: scoreSymbols(file, query),
importRelevance: scoreImports(file, query),
recentActivity: scoreRecency(file),
fileSize: scoreSize(file)
};
// Weighted combination
const relevanceScore =
signals.nameMatch * 0.3 +
signals.directoryMatch * 0.2 +
signals.symbolMatch * 0.3 +
signals.importRelevance * 0.15 +
signals.recentActivity * 0.03 +
signals.fileSize * 0.02;
return { path: file, relevanceScore, signals };
}
```
### **4. Query Intent Classification**
**Classify query types** and use different strategies:
```typescript
// ā
GOOD APPROACH - Intent-based selection
enum QueryIntent {
ARCHITECTURE_ANALYSIS = 'architecture',
BUG_INVESTIGATION = 'debugging',
FEATURE_IMPLEMENTATION = 'feature',
CODE_QUALITY = 'quality',
SECURITY_AUDIT = 'security',
PERFORMANCE_ANALYSIS = 'performance'
}
const intentStrategies = {
[QueryIntent.ARCHITECTURE_ANALYSIS]: {
directories: ['src/'],
fileTypes: ['.ts', '.js'],
prioritize: ['index.ts', 'main.ts', 'app.ts'],
maxFiles: 15
},
[QueryIntent.BUG_INVESTIGATION]: {
directories: ['src/', 'tests/'],
includeTests: true,
prioritizeRecent: true,
maxFiles: 8
},
[QueryIntent.SECURITY_AUDIT]: {
directories: ['src/auth/', 'src/security/', 'src/api/'],
keywords: ['auth', 'security', 'validate', 'sanitize'],
maxFiles: 12
}
};
```
---
## šÆ **Recommended Implementation Priority**
### **Phase 1: Smart Filesystem Heuristics (Easy Win)**
- ā
**Directory-based matching** - Map query keywords to directory patterns
- ā
**File name scoring** - Score files by name relevance to query
- ā
**Extension filtering** - Smart file type selection based on query
### **Phase 2: Static Analysis Integration (Medium)**
- ā
**Import graph analysis** - Follow dependency chains
- ā
**Symbol extraction** - Find files containing specific functions/classes
- ā
**AST parsing** - Understand code structure without execution
### **Phase 3: Hybrid Scoring System (Advanced)**
- ā
**Multi-signal combination** - Weighted scoring from multiple sources
- ā
**Query intent classification** - Different strategies for different query types
- ā
**Learning from usage** - Track which files were actually useful
---
## š§ **Implementation Notes**
### **Key Principles:**
1. **No AI calls for file selection** - Use static analysis only
2. **Fast execution** - File selection should be sub-second
3. **Deterministic results** - Same query should return same files
4. **Configurable strategies** - Easy to adjust for different codebases
5. **Graceful degradation** - Fallback to current approach if advanced methods fail
### **Integration Points:**
- Replace `discoverRelevantFiles()` method in `src/context/poc-engine.ts`
- Add new `src/context/file-selection/` module for advanced logic
- Maintain backward compatibility with current hardcoded patterns
- Add configuration options for different selection strategies
### **Success Metrics:**
- **Relevance improvement** - More useful files selected for analysis
- **Performance maintenance** - File selection remains fast (<2s)
- **Token efficiency** - Better use of context window with more relevant content
- **User satisfaction** - Better analysis results from improved file selection
---
## š **Next Steps**
1. **Implement Phase 1** - Smart filesystem heuristics
2. **Test with real queries** - Validate improvement over current approach
3. **Add configuration** - Make strategies configurable per codebase
4. **Measure impact** - Compare analysis quality before/after
5. **Iterate based on usage** - Refine strategies based on real usage patterns
**This approach will make file selection much smarter without the inefficiency of double AI calls!**
---
## ā
**IMPLEMENTATION COMPLETED**
**Implementation Date**: July 5, 2025
### **What Was Implemented:**
#### **1. Smart File Selection Module** (`src/context/file-selection/`)
- ā
**Query Intent Classification** - Automatically detects query type (intelligence, api, debugging, etc.)
- ā
**Multi-Signal Relevance Scoring** - Combines filename, directory, content, and recency signals
- ā
**Filesystem-Based Heuristics** - Smart directory traversal and file discovery
- ā
**Configurable Strategies** - Different approaches for different query types
- ā
**Performance Optimized** - Sub-second execution (tested at 10ms)
#### **2. POC Context Engine Integration**
- ā
**Replaced `discoverRelevantFiles()`** - Now uses smart selection instead of hardcoded patterns
- ā
**Backward Compatibility** - Falls back to original approach if smart selection fails
- ā
**Enhanced Logging** - Detailed insights into file selection process
#### **3. Key Features Delivered:**
- **No AI calls for file selection** - Uses only static analysis and filesystem heuristics
- **Intent-based strategies** - Different file selection approaches for different query types
- **Multi-signal scoring** - Weighted combination of relevance indicators
- **Fast execution** - Completes in milliseconds, not seconds
- **Configurable** - Easy to adjust strategies for different codebases
### **Test Results:**
```
Query: "intelligence layer analysis"
ā
Intent Classification: "intelligence" (correct)
ā
Files Discovered: 31 candidates
ā
Files Selected: 5 most relevant
ā
Execution Time: 10ms (target: <2000ms)
ā
Top Files: codebase.ts (0.744), intelligence.ts (0.744), analyzer.ts (0.738)
```
### **Performance Improvements:**
- **Before**: Hardcoded patterns + naive fallback (first 20 .ts files)
- **After**: Intelligent selection with relevance scoring
- **Speed**: 10ms execution time (200x faster than 2s target)
- **Accuracy**: Intent-based file selection with multi-signal scoring
### **Benefits Achieved:**
1. **Better Relevance** - Files are selected based on actual query intent
2. **No Double AI Calls** - Eliminates the inefficient AI-to-decide-what-to-send-to-AI pattern
3. **Fast Performance** - Sub-second file selection maintains responsiveness
4. **Configurable** - Easy to adjust for different project types and use cases
5. **Maintainable** - Clean modular architecture with clear separation of concerns
### **Files Created:**
- `src/context/file-selection/types.ts` - Type definitions and interfaces
- `src/context/file-selection/intent-classifier.ts` - Query intent classification
- `src/context/file-selection/file-discovery.ts` - Smart file discovery service
- `src/context/file-selection/relevance-analyzer.ts` - Multi-signal relevance scoring
- `src/context/file-selection/smart-selector.ts` - Main orchestrator
- `src/context/file-selection/config.ts` - Configuration and project type detection
- `src/context/file-selection/test-smart-selector.ts` - Test suite
- `src/context/file-selection/index.ts` - Module exports
### **Integration Points:**
- ā
**POC Context Engine** - `src/context/poc-engine.ts` updated to use smart selection
- ā
**Backward Compatibility** - Maintains fallback to original hardcoded patterns
- ā
**Logging Integration** - Uses existing logging infrastructure
**The smart file selection system is now live and significantly improves the context engine's ability to find relevant files without AI overhead!**