# Smart Retrieval System Guide

Comprehensive guide to OpenZIM MCP's intelligent entry retrieval system with automatic fallback and path mapping cache.

## Overview

The Smart Retrieval System is one of OpenZIM MCP's most powerful features, designed to handle the complexities of ZIM file path encoding automatically. It provides transparent, reliable access to ZIM entries regardless of path format inconsistencies.

## How Smart Retrieval Works

### The Challenge

ZIM files often contain entries with inconsistent path encoding:

- Spaces vs underscores: `"Test Article"` vs `"Test_Article"`
- URL encoding: `"Café"` vs `"Caf%C3%A9"`
- Case variations: `"DNA"` vs `"dna"`
- Special characters: `"C++"` vs `"C%2B%2B"`

### The Solution

Smart Retrieval uses a multi-stage approach:

```
1. Direct Access Attempt
   ↓ (if fails)
2. Search-Based Fallback
   ↓
3. Path Mapping Cache
   ↓
4. Multiple Search Strategies
   ↓
5. Best Match Selection
   ↓
6. Cache Update
   ↓
7. Content Retrieval
```

## Retrieval Process

### Stage 1: Direct Access

First, the system attempts direct access using the provided path:

```python
# Example: User requests "A/Test Article"
try:
    entry = zim_file.get_entry_by_path("A/Test Article")
    return entry.content
except EntryNotFound:
    # Proceed to fallback
    pass
```

### Stage 2: Path Mapping Cache Check

Check whether this path has been resolved before:

```python
# Check cache for known mapping
cached_path = path_cache.get("A/Test Article")
if cached_path:
    return zim_file.get_entry_by_path(cached_path)
```

### Stage 3: Search-Based Fallback

If direct access fails and the cache has no mapping, multiple search strategies are employed:

#### Strategy 1: Exact Title Search

```python
# Search for exact title match
results = search_zim_file(query="Test Article", namespace="A")
```

#### Strategy 2: Normalized Search

```python
# Try with underscores
results = search_zim_file(query="Test_Article", namespace="A")
```

#### Strategy 3: URL Decoded Search

```python
# Try URL decoding
results = search_zim_file(query=url_decode("Test%20Article"), namespace="A")
```

#### Strategy 4: Fuzzy Matching

```python
# Fuzzy search for close matches
results = fuzzy_search(query="Test Article", threshold=0.8)
```

### Stage 4: Best Match Selection

The system evaluates search results using multiple criteria:

```python
def score_match(result, original_query, expected_namespace):
    score = 0.0

    # Exact title match (highest priority)
    if result.title.lower() == original_query.lower():
        score += 1.0

    # Path similarity
    score += path_similarity(result.path, original_query)

    # Namespace match
    if result.namespace == expected_namespace:
        score += 0.5

    # Content relevance
    score += content_relevance(result.snippet, original_query)

    return score
```

### Stage 5: Cache Update

Successful mappings are cached for future use:

```python
# Cache the successful mapping
path_cache.set(
    key="A/Test Article",
    value="A/Test_Article",
    ttl=3600  # 1 hour
)
```

## Path Mapping Cache

### Cache Structure

The path mapping cache stores successful path resolutions:

```json
{
  "cache_entries": {
    "A/Test Article": {
      "resolved_path": "A/Test_Article",
      "timestamp": "2025-09-15T10:30:00Z",
      "hit_count": 15,
      "confidence": 1.0
    },
    "C/Café": {
      "resolved_path": "C/Caf%C3%A9",
      "timestamp": "2025-09-15T10:25:00Z",
      "hit_count": 3,
      "confidence": 0.95
    }
  },
  "statistics": {
    "total_entries": 2,
    "hit_rate": 0.87,
    "cache_size_mb": 0.5
  }
}
```

### Cache Management

#### Automatic Invalidation

- **TTL-based**: Entries expire after the configured time
- **Size-based**: LRU eviction when the cache is full
- **Confidence-based**: Low-confidence entries expire sooner

#### Cache Optimization

```python
# High-confidence entries get longer TTL
if confidence > 0.9:
    ttl = 7200  # 2 hours
elif confidence > 0.7:
    ttl = 3600  # 1 hour
else:
    ttl = 1800  # 30 minutes
```

## Usage Examples

### Basic Usage

The Smart Retrieval System works transparently:

```json
{
  "name": "get_zim_entry",
  "arguments": {
    "zim_file_path": "/path/to/file.zim",
    "entry_path": "A/Test Article"
  }
}
```
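The request above shows only the tool name and its arguments. Over MCP's JSON-RPC transport, a tool invocation is wrapped in a `tools/call` request. The following is a minimal sketch of building that envelope; the helper name `build_tool_call` and the request id are illustrative assumptions, not part of OpenZIM MCP.

```python
import json


def build_tool_call(name, arguments, request_id=1):
    """Wrap a tool name and arguments in an MCP JSON-RPC `tools/call` request.

    `build_tool_call` is a hypothetical helper used only for illustration.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call(
    "get_zim_entry",
    {"zim_file_path": "/path/to/file.zim", "entry_path": "A/Test Article"},
)

# Serialize the envelope the way a client would before sending it.
print(json.dumps(request, indent=2))
```

How the serialized request reaches the server depends on the transport the server is configured with, which this sketch does not assume.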
**Response with Smart Retrieval**:

```
# Test Article

Requested Path: A/Test Article
Actual Path: A/Test_Article
Type: text/html
Retrieval Method: smart_fallback

## Content

This article demonstrates the smart retrieval system...
```

### Advanced Scenarios

#### Scenario 1: URL Encoded Paths

```json
{
  "name": "get_zim_entry",
  "arguments": {
    "entry_path": "C/Caf%C3%A9"
  }
}
```

**Smart Retrieval Process**:

1. Try direct: `C/Caf%C3%A9`
2. Try decoded: `C/Café`
3. Cache mapping: `C/Caf%C3%A9` → `C/Café`

#### Scenario 2: Space vs Underscore

```json
{
  "name": "get_zim_entry",
  "arguments": {
    "entry_path": "A/Machine Learning"
  }
}
```

**Smart Retrieval Process**:

1. Try direct: `A/Machine Learning`
2. Try with underscores: `A/Machine_Learning`
3. Cache mapping: `A/Machine Learning` → `A/Machine_Learning`

#### Scenario 3: Case Sensitivity

```json
{
  "name": "get_zim_entry",
  "arguments": {
    "entry_path": "C/dna"
  }
}
```

**Smart Retrieval Process**:

1. Try direct: `C/dna`
2. Search with case variations: `DNA`, `Dna`
3. Cache mapping: `C/dna` → `C/DNA`

## Performance Optimizations

### Cache Hit Rates

Typical cache performance:

- **First access**: 0% hit rate (cache miss)
- **Subsequent access**: 95%+ hit rate
- **Similar paths**: 80%+ hit rate (pattern recognition)

### Performance Metrics

```json
{
  "smart_retrieval_stats": {
    "total_requests": 1000,
    "direct_hits": 650,
    "cache_hits": 280,
    "fallback_successes": 65,
    "failures": 5,
    "average_resolution_time_ms": 45,
    "cache_hit_rate": 0.93
  }
}
```

### Optimization Strategies

#### 1. Preload Common Patterns

```python
# Preload known path mappings
common_patterns = {
    " ": "_",       # Space to underscore
    "%20": "_",     # URL encoded space
    "%C3%A9": "é",  # URL encoded é
}
```

#### 2. Pattern Learning

```python
# Learn from successful mappings
def learn_pattern(original, resolved):
    pattern = extract_pattern(original, resolved)
    pattern_cache.add(pattern)
```

#### 3. Batch Processing

```python
# Process multiple paths efficiently (outline only)
def resolve_paths_batch(paths):
    # Group by pattern
    # Process in parallel
    # Update cache in batch
    ...
```

## Configuration

### Smart Retrieval Settings

```bash
# Enable/disable smart retrieval (default: true)
export OPENZIM_MCP_SMART_RETRIEVAL__ENABLED=true

# Path cache size (default: 1000)
export OPENZIM_MCP_SMART_RETRIEVAL__CACHE_SIZE=2000

# Cache TTL in seconds (default: 3600)
export OPENZIM_MCP_SMART_RETRIEVAL__CACHE_TTL=7200

# Fallback search limit (default: 10)
export OPENZIM_MCP_SMART_RETRIEVAL__SEARCH_LIMIT=20

# Confidence threshold (default: 0.7)
export OPENZIM_MCP_SMART_RETRIEVAL__MIN_CONFIDENCE=0.8
```

### Performance Tuning

#### High-Performance Profile

```bash
export OPENZIM_MCP_SMART_RETRIEVAL__CACHE_SIZE=5000
export OPENZIM_MCP_SMART_RETRIEVAL__CACHE_TTL=14400
export OPENZIM_MCP_SMART_RETRIEVAL__SEARCH_LIMIT=50
```

#### Memory-Constrained Profile

```bash
export OPENZIM_MCP_SMART_RETRIEVAL__CACHE_SIZE=500
export OPENZIM_MCP_SMART_RETRIEVAL__CACHE_TTL=1800
export OPENZIM_MCP_SMART_RETRIEVAL__SEARCH_LIMIT=5
```

## Monitoring and Diagnostics

### Health Monitoring

Check smart retrieval performance:

```json
{
  "name": "get_server_health"
}
```

**Response includes**:

```json
{
  "smart_retrieval": {
    "enabled": true,
    "cache_size": 1500,
    "cache_hit_rate": 0.93,
    "average_resolution_time_ms": 45,
    "fallback_success_rate": 0.87
  }
}
```

### Performance Analysis

```json
{
  "name": "diagnose_server_state"
}
```

**Smart Retrieval Diagnostics**:

```json
{
  "smart_retrieval_analysis": {
    "cache_efficiency": "excellent",
    "fallback_performance": "good",
    "common_patterns": [
      "space_to_underscore: 45%",
      "url_encoding: 25%",
      "case_variations: 20%"
    ],
    "recommendations": [
      "Cache performance is optimal",
      "Consider increasing cache size for better hit rates"
    ]
  }
}
```

## Troubleshooting

### Common Issues

#### Issue: Low Cache Hit Rate

**Symptoms**: Slow response times, frequent fallback searches

**Causes**:

- Cache size too small
- TTL too short
- Highly variable path patterns

**Solutions**:

```bash
# Increase cache size
export OPENZIM_MCP_SMART_RETRIEVAL__CACHE_SIZE=3000

# Increase TTL
export OPENZIM_MCP_SMART_RETRIEVAL__CACHE_TTL=7200
```

#### Issue: Fallback Failures

**Symptoms**: "Entry not found" errors for existing content

**Causes**:

- Search limit too low
- Confidence threshold too high
- Unusual path encoding

**Solutions**:

```bash
# Increase search limit
export OPENZIM_MCP_SMART_RETRIEVAL__SEARCH_LIMIT=30

# Lower confidence threshold
export OPENZIM_MCP_SMART_RETRIEVAL__MIN_CONFIDENCE=0.6
```

#### Issue: Slow Resolution Times

**Symptoms**: High average resolution times

**Causes**:

- Too many fallback searches
- Large search limits
- Cache misses

**Solutions**:

1. Monitor cache hit rates
2. Optimize search strategies
3. Preload common patterns

### Diagnostic Commands

#### Check Cache Status

```bash
# View cache statistics
curl -X POST http://localhost:8000/mcp \
  -d '{"name": "get_server_health"}' | \
  jq '.smart_retrieval'
```

#### Test Path Resolution

```bash
# Test specific path resolution
curl -X POST http://localhost:8000/mcp \
  -d '{"name": "get_zim_entry", "arguments": {"zim_file_path": "/path/to/file.zim", "entry_path": "A/Test Article"}}' | \
  grep "Retrieval Method"
```

---

**Want to optimize performance?** Check the [Performance Optimization Guide](Performance-Optimization-Guide) for advanced tuning strategies.
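
The retrieval stages described in this guide (direct access, cache lookup, fallback, cache update) can be condensed into a small self-contained sketch. It is not OpenZIM MCP's implementation: the search-based fallback is swapped for simple enumeration of candidate spellings against an in-memory dictionary, and the names `candidate_paths`, `PathCache`, and `resolve` are hypothetical.

```python
import time
from urllib.parse import quote, unquote


def candidate_paths(path):
    """Yield plausible spellings of `path`, most literal first, without duplicates."""
    namespace, _, title = path.partition("/")
    decoded = unquote(title)  # "Caf%C3%A9" -> "Café"
    variants = [
        title,
        decoded,
        decoded.replace(" ", "_"),  # space to underscore
        decoded.replace("_", " "),  # underscore to space
        quote(decoded),             # percent-encoded form
        decoded.upper(),
        decoded.lower(),
        decoded.title(),
    ]
    seen = set()
    for variant in variants:
        candidate = f"{namespace}/{variant}"
        if candidate not in seen:
            seen.add(candidate)
            yield candidate


class PathCache:
    """Tiny TTL cache mapping requested paths to resolved paths (illustrative only)."""

    def __init__(self, ttl=3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        resolved, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return resolved

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl)


def resolve(path, entries, cache):
    """Return (resolved_path, method), or (None, "not_found") if nothing matches."""
    if path in entries:  # direct access
        return path, "direct"
    cached = cache.get(path)  # path mapping cache check
    if cached is not None and cached in entries:
        return cached, "cache"
    for candidate in candidate_paths(path):  # stand-in for the search fallback
        if candidate in entries:
            cache.set(path, candidate)  # cache update
            return candidate, "smart_fallback"
    return None, "not_found"


# Stand-in for a ZIM archive: a dict keyed by entry path.
entries = {"A/Test_Article": "<html>...</html>", "C/Café": "...", "C/DNA": "..."}
cache = PathCache()
print(resolve("A/Test Article", entries, cache))  # ('A/Test_Article', 'smart_fallback')
print(resolve("A/Test Article", entries, cache))  # ('A/Test_Article', 'cache')
```

The second call is answered from the cache because the first call stored the mapping. A real fallback would rank search results by score rather than take the first spelling that exists.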
