Claude Conversation Memory System

SEARCH_OPTIMIZATION_ANALYSIS.md•14.8 KiB

# Search Optimization Analysis ## Current Search Implementation Audit ### 1.1.1 Existing Search Architecture **Location**: `src/conversation_memory.py` lines 195-216 **Core Algorithm**: Linear search through indexed conversations with simple scoring **Search Flow**: 1. Load index from `index.json` file 2. Split query into terms (simple whitespace split) 3. Iterate through ALL conversation records linearly 4. For each conversation: - Load conversation file from disk - Calculate relevance score - Add to results if score > 0 5. Sort results by score 6. Return top N results ### Current Implementation Details **Scoring Algorithm** (`_calculate_search_score`): - Content matches: +1 point per occurrence - Title matches: +3 points per occurrence - Topic matches: +5 points per exact match - Simple additive scoring **Search Process** (`search_conversations`): ```python def search_conversations(self, query: str, limit: int = 10) -> List[Dict[str, Any]]: # 1. Load index from JSON file with open(self.index_file, 'r') as f: index_data = json.load(f) # 2. Linear iteration through ALL conversations for conv_info in conversations: result = self._process_conversation_for_search(conv_info, query_terms) if result: results.append(result) # 3. Sort and limit results results.sort(key=lambda x: x["score"], reverse=True) return results[:limit] ``` **File Access Pattern** (`_process_conversation_for_search`): - **CRITICAL BOTTLENECK**: Opens and reads EVERY conversation file during search - No caching or pre-loading of content - JSON parsing for every file on every search ## Performance Bottlenecks Identified ### 1.1.2 Major Performance Issues **1. O(n) Linear Search Complexity** - Must check every conversation for every search - No early termination or indexing - Performance degrades linearly with dataset size **2. Excessive File I/O Operations** - Opens and reads EVERY conversation file during search - No content caching between searches - JSON parsing overhead for each file **3. Naive Text Matching** - Simple substring matching with `.count()` - No fuzzy matching or stemming - Case-sensitive after lowercasing (limited) **4. Memory Inefficiency** - Loads all results into memory before sorting - No streaming or pagination at search level - Temporary result objects created for all conversations **5. No Search Optimization Features** - No phrase searching or boolean operators - No filtering by date, topics, or metadata - No search result caching ### Performance Scaling Analysis **Current Performance Characteristics**: - **Time Complexity**: O(n × m) where n = conversations, m = avg file size - **Space Complexity**: O(n) for result storage - **I/O Complexity**: O(n) file reads per search **Estimated Performance with Dataset Growth**: - 100 conversations: ~50-100ms - 500 conversations: ~250-500ms - 1000 conversations: ~500ms-1s - 5000 conversations: ~2.5-5s (unusable) ## Current Search Limitations ### 1.1.3 Functional Limitations **Search Quality Issues**: - No relevanHow about **Technical Limitations**: - No concurrent search support - No search analytics or logging - No search result caching - No incremental search or real-time updates **Scalability Constraints**: - Linear performance degradation - High memory usage for large datasets - No database-level optimizations - No indexing beyond simple file listing ## Search Index Structure Analysis **Current Index Format** (`index.json`): ```json { "conversations": [ { "id": "conv_20240101_120000_1234", "title": "Sample Conversation", "date": "2024-01-01T12:00:00Z", "topics": ["python", "programming"], "file_path": "2024/01-january/conv_20240101_120000_1234.json" } ], "last_updated": "2024-01-01T12:00:00Z" } ``` **Index Limitations**: - Contains only metadata, no searchable content - No pre-computed search terms or keywords - No content fingerprints or summaries - No search-optimized data structures ## Recommendations for SQLite FTS Migration **Critical Improvements Needed**: 1. **Replace linear search** with SQLite FTS for sub-second searches 2. **Eliminate file I/O bottleneck** by storing content in database 3. **Add proper text indexing** with stemming and phrase support 4. **Implement relevance ranking** with BM25 or similar algorithms 5. **Add search features** like filtering, boolean operators, and highlighting **Performance Targets**: - Search time: <100ms for datasets up to 10,000 conversations - Memory usage: <50MB for search operations - Relevance: Improved ranking with modern search algorithms - Features: Boolean operators, phrase search, date filtering ## Section 1.2: Performance Benchmarking Requirements ### 1.1.2 Performance Bottleneck Analysis **Critical Performance Issues Identified**: 1. **File I/O Bottleneck (Most Critical)**: - Every search opens and reads EVERY conversation file - JSON parsing overhead for each file on each search - No content caching between searches - Scales linearly: O(n) file operations per search 2. **Linear Search Algorithm**: - Must check every conversation for every search - No early termination or indexing optimizations - Simple substring matching with `.count()` method - No ranking beyond basic term frequency 3. **Memory Inefficiency**: - Loads all search results into memory before sorting - Creates temporary result objects for ALL conversations - No streaming or pagination at search algorithm level ### 1.1.3 Current Search Limitations Documentation **Search Quality Limitations**: - No relevance ranking beyond simple term counting - No phrase matching or proximity search - No fuzzy matching or typo tolerance - No semantic understanding or synonym matching - No boolean operators (AND, OR, NOT) - No field-specific searching (title:, content:, topic:) **Performance Scaling Issues**: - **Current**: O(n × m) time complexity (n=conversations, m=avg file size) - **Memory**: O(n) space for results + file content during processing - **I/O**: O(n) file reads per search operation - **Estimated Performance Degradation**: - 100 conversations: ~50-100ms (acceptable) - 500 conversations: ~250-500ms (slow but usable) - 1000 conversations: ~500ms-1s (poor user experience) - 5000+ conversations: >2.5s (unusable) **Technical Debt**: - No search result caching - No concurrent search support - No search analytics or performance monitoring - No incremental search capabilities ## Section 1.2: Performance Baseline Requirements ### Benchmarking Strategy **Test Dataset Requirements**: - Small dataset: 50-100 conversations - Medium dataset: 500-1000 conversations - Large dataset: 2000-5000 conversations - Variable conversation sizes: 1KB, 10KB, 50KB, 100KB **Metrics to Measure**: 1. **Search Response Time**: Total time from query to results 2. **File I/O Time**: Time spent reading conversation files 3. **Search Algorithm Time**: Pure search logic execution time 4. **Memory Usage**: Peak memory during search operations 5. **Disk I/O Operations**: Number of file reads per search **Performance Targets for SQLite FTS**: - Search time: <100ms for datasets up to 10,000 conversations - Memory usage: <50MB for search operations - Relevance: Improved ranking with BM25 or similar algorithms - Features: Boolean operators, phrase search, date filtering **Implementation Note**: Benchmark suite will be created in Section 1.2.1 to establish these baseline metrics before implementing SQLite FTS solution. ## Section 1.2.1: Baseline Performance Measurements ### Benchmark Results Summary **Test Environment**: - Python 3.8.10 on WSL2 Ubuntu - Hardware: Modern development machine - Storage: SSD storage for conversation files - Memory monitoring: /proc/self/status (psutil fallback) **Dataset Characteristics**: | Dataset | Conversations | Total Size | Avg File Size | Files Read/Search | |---------|---------------|------------|---------------|-------------------| | Small | 100 | 0.33 MB | 3.34 KB | 100 | | Medium | 500 | 3.39 MB | 5.79 KB | 600 | **Performance Results**: | Dataset | Avg Search Time | Min Time | Max Time | Performance Notes | |---------|-----------------|----------|----------|-------------------| | Small | 25ms | 18ms | 40ms | Acceptable for small datasets | | Medium | 165ms | 123ms | 245ms | Poor user experience | ### Key Performance Findings **1. Linear Performance Degradation Confirmed**: - **6x increase** in dataset size (100 → 600 files) - **6.6x increase** in search time (25ms → 165ms) - Performance scales exactly with file count (O(n) confirmed) **2. File I/O Bottleneck Validation**: - Every search reads ALL conversation files (100% I/O overhead) - Medium dataset: 600 file operations per search - No caching between search operations **3. User Experience Impact**: - Small dataset (100 conversations): 25ms - acceptable - Medium dataset (600 conversations): 165ms - noticeable delay - **Projected large dataset (2000+ conversations): >500ms - unusable** **4. Search Quality Assessment**: - Consistent result counts across iterations (good accuracy) - Simple relevance scoring functional but basic - No advanced search features available ### Performance Bottleneck Analysis **Critical Issues Identified**: 1. **File I/O Dominates Performance** (Primary Bottleneck): - 600 file read operations per search (medium dataset) - JSON parsing overhead for each file - No content caching or pre-loading 2. **Linear Search Algorithm**: - O(n) complexity with no optimization - No early termination possible - Simple substring matching only 3. **Memory Usage**: - Low memory increase (0.01 MB average) - Efficient result storage but inefficient processing ### SQLite FTS Migration Justification **Current Performance Limits**: - **Unusable at scale**: >500ms for 2000+ conversations - **No search features**: No boolean ops, phrase search, or advanced ranking - **Maintenance burden**: No indexing, no optimization possible **Expected SQLite FTS Improvements**: - **Sub-100ms searches** for 10,000+ conversations - **Rich search features**: Boolean operators, phrase search, ranking - **Better relevance**: BM25 ranking algorithm - **Reduced I/O**: Content stored in database, no file reads per search **Migration Priority**: **HIGH** - Current system won't scale beyond 1000 conversations ## Section 1.3: SQLite FTS Requirements Definition ### 1.3.1 New Search Features and Capabilities **Core Search Features**: 1. **Full-Text Search with Ranking**: - BM25 relevance ranking algorithm - Phrase search with exact matching ("machine learning") - Prefix search for partial matches (prog* → programming, program) - Snippet generation with search term highlighting 2. **Advanced Query Syntax**: - Boolean operators: AND, OR, NOT - Field-specific search: title:"python" content:"django" - Proximity search: NEAR/2 (words within 2 positions) - Wildcard support: * for any characters 3. **Performance Features**: - Sub-100ms search response for 10,000+ conversations - Indexed content with no file I/O during search - Query result caching for repeated searches - Pagination support for large result sets 4. **Filter Capabilities**: - Date range filtering (last week, last month, custom range) - Topic filtering with multi-select - Platform filtering (once multi-platform support added) - Combined text search + filters ### 1.3.2 Performance Targets and Acceptance Criteria **Performance Requirements**: | Dataset Size | Target Search Time | Current Time | Improvement | |--------------|-------------------|--------------|-------------| | 1,000 convs | <50ms | ~300ms | 6x faster | | 5,000 convs | <75ms | ~1,500ms | 20x faster | | 10,000 convs | <100ms | ~3,000ms | 30x faster | **Acceptance Criteria**: 1. ✓ All existing search functionality maintained 2. ✓ Zero data loss during migration 3. ✓ Backward compatible API (search_conversations method signature) 4. ✓ Search accuracy equal or better than current 5. ✓ Memory usage under 100MB for largest operations 6. ✓ Database file size < 2x current JSON storage **Quality Metrics**: - Search relevance score > 90% user satisfaction - Zero search failures under normal operation - Database corruption recovery mechanisms - Comprehensive test coverage (>95%) ### 1.3.3 Backward Compatibility Requirements **API Compatibility**: ```python # Current API must continue working unchanged results = server.search_conversations(query="python", limit=10) # Returns same format: List[Dict[str, Any]] ``` **Data Compatibility**: 1. **Dual-mode operation** during transition: - Support both JSON and SQLite storage - Automatic migration on first run - Fallback to JSON if SQLite unavailable 2. **Migration Safety**: - Non-destructive migration (JSON files preserved) - Verification step after migration - Rollback capability if issues detected 3. **Storage Structure**: - Maintain year/month directory organization - Keep JSON files as backup/export format - SQLite database in conversations/ root **Feature Compatibility**: - All current MCP tools continue working - Weekly summaries use same data access - Topic extraction remains unchanged - Import/export functionality preserved ## Section 2: Database Design Plan ### 2.1 SQLite Schema Design **Core Tables**: ```sql -- Main conversations table CREATE TABLE conversations ( id TEXT PRIMARY KEY, title TEXT NOT NULL, content TEXT NOT NULL, date TEXT NOT NULL, created_at TEXT NOT NULL, platform TEXT DEFAULT 'claude', file_path TEXT ); -- Topics relationship table CREATE TABLE conversation_topics ( conversation_id TEXT, topic TEXT, FOREIGN KEY (conversation_id) REFERENCES conversations(id) ); -- FTS5 virtual table for full-text search CREATE VIRTUAL TABLE conversations_fts USING fts5( id UNINDEXED, title, content, topics, content=conversations, content_rowid=rowid, tokenize='porter unicode61' ); -- Indexes for performance CREATE INDEX idx_conversations_date ON conversations(date); CREATE INDEX idx_topics_topic ON conversation_topics(topic); CREATE INDEX idx_topics_conv ON conversation_topics(conversation_id); ``` ### 2.2 FTS5 Configuration **Tokenizer Selection**: - `porter unicode61`: Stemming + Unicode support - Handles multiple languages and special characters - Reduces "programming" and "programmed" to same stem **Ranking Configuration**: - BM25 algorithm (built into FTS5) - Weights: title=2.0, content=1.0, topics=1.5 - Snippet generation: 64 chars with ellipsis **Search Optimization**: - Prefix indexes for autocomplete - Auxiliary functions for custom ranking - Trigram tokenizer option for fuzzy matching **Next Steps**: Proceed to Section 2.3 - Migration Strategy for safe JSON to SQLite transition.

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/adamkwhite/claude-memory-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

SEARCH_OPTIMIZATION_ANALYSIS.md•14.8 KiB