# Multi-Engine Search Result Ranking Optimization Strategy
## Current State Analysis
### Issues Identified
- **Engine tracking attribution failure**: results are not reliably attributed to their source engines
- **Naive ranking**: simple rank-based deduplication favors whichever engine's results are processed first
- **No diversity guarantee**: good results from other engines may be dropped entirely
- **LLM gets suboptimal choices**: the current system doesn't optimize the candidate pool for LLM selection
### Current Algorithm
```python
def deduplicate_results(all_results, num_results):
    url_to_best = {}
    for result in all_results:
        url = result["url"]
        rank = result.get("rank", 999)
        # Keep the best (lowest) rank seen for each URL; use .get() on the
        # stored result too, so an entry without a "rank" can't raise KeyError
        if url not in url_to_best or rank < url_to_best[url].get("rank", 999):
            url_to_best[url] = result
    return sorted(url_to_best.values(), key=lambda x: x.get("rank", 999))[:num_results]
```
**Problems**: Only considers individual engine ranking, ignores engine expertise, no diversity guarantee.
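A minimal illustration of that bias, using made-up URLs: ties between engines are broken purely by insertion order, so whichever engine's results are appended first wins every tie.
```python
# Hypothetical inputs (URLs are made up). Engine A's results were appended
# before engine B's, so A wins both the dedup tie and the final sort tie.
all_results = [
    {"url": "https://a.example", "rank": 1},  # engine A
    {"url": "https://b.example", "rank": 1},  # engine B
    {"url": "https://a.example", "rank": 2},  # engine B's copy of the same URL
]
# Python's sort is stable, so among equal ranks the first-appended result
# leads -- engine A is favored regardless of which engine did better.
print(deduplicate_results(all_results, 2))
# [{'url': 'https://a.example', 'rank': 1}, {'url': 'https://b.example', 'rank': 1}]
```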
## Use Case Clarification
**Goal**: Provide best candidate pool to LLM for second-level selection
- LLM currently receives ~15 results (5 from each engine)
- LLM decides which pages to fetch for content
- We need to maximize quality and diversity of choices
**This is federated search, not hybrid search** - a different problem requiring a different solution.
## Proposed Solution: Quality-First with Diversity
### Why Not RRF?
RRF (Reciprocal Rank Fusion) is designed for hybrid search: combining rankings from different query types over the same corpus (a minimal sketch follows the list below). Our use case is federated search (the same query across different engines). We want to:
1. **Trust each engine's expertise** - their top 5 are their best judgment
2. **Guarantee diversity** - ensure LLM gets options from all engines
3. **Preserve provenance** - LLM knows which engine found what
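For contrast, the standard RRF score sums 1/(k + rank) over the ranked lists that returned an item, with k=60 as the conventional constant. A minimal sketch:
```python
def rrf_score(ranks, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over the lists returning the item."""
    return sum(1.0 / (k + r) for r in ranks)

# With k=60, a URL ranked 3rd by both engines outscores one ranked 1st by
# one engine and 9th by the other:
print(rrf_score([3, 3]) > rrf_score([1, 9]))  # True
```
That consensus-rewarding behavior is exactly the blending we want to avoid here: in federated search, an engine's solitary #1 pick is signal, not noise.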
### Algorithm: Token-Optimized Quality Selection
**Process Flow**:
1. Take top 4 results from each engine (12 total)
2. Deduplicate by URL (typically reduces to 8-10 unique)
3. Rank by original engine position
4. Return top 10 with direct source attribution
```python
def create_llm_optimized_pool(ddg_results, bing_results, startpage_results):
    # Collect candidates with engine attribution
    candidates = []
    for i, result in enumerate(ddg_results[:4]):
        candidates.append({**result, "engine": "duckduckgo", "rank": i + 1})
    for i, result in enumerate(bing_results[:4]):
        candidates.append({**result, "engine": "bing", "rank": i + 1})
    for i, result in enumerate(startpage_results[:4]):
        candidates.append({**result, "engine": "startpage", "rank": i + 1})

    # Deduplicate - keep the best-ranked version of each URL
    unique_results = deduplicate_keep_best_rank(candidates)

    # Rank by original engine position (trust engine expertise)
    ranked_results = sorted(unique_results, key=lambda x: x["rank"])[:10]

    # Strip internal fields and expose a single "source" attribution field
    # (dropping "engine" here keeps the output matching the response format below)
    clean_results = []
    for result in ranked_results:
        clean_result = {k: v for k, v in result.items() if k not in ("rank", "engine")}
        clean_result["source"] = result["engine"]
        clean_results.append(clean_result)
    return clean_results

def deduplicate_keep_best_rank(candidates):
    seen = {}
    for candidate in candidates:
        url = candidate["url"]
        if url not in seen or candidate["rank"] < seen[url]["rank"]:
            seen[url] = candidate
    return list(seen.values())
```
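An illustrative call with hypothetical inputs (titles and URLs are made up; real results would also carry snippets):
```python
ddg = [{"url": "https://a.example", "title": "A"}, {"url": "https://b.example", "title": "B"}]
bing = [{"url": "https://a.example", "title": "A"}, {"url": "https://c.example", "title": "C"}]
sp = [{"url": "https://d.example", "title": "D"}]

for r in create_llm_optimized_pool(ddg, bing, sp):
    print(r["source"], r["url"])
# duckduckgo https://a.example   <- DuckDuckGo's rank-1 copy wins the tie with Bing
# startpage  https://d.example
# duckduckgo https://b.example
# bing       https://c.example
```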
### Response Format Optimization
**LLM receives clean format with direct attribution**:
```json
{
  "results": [
    {
      "title": "Complete Python Machine Learning Tutorial",
      "url": "https://example.com/ml-tutorial",
      "snippet": "Learn machine learning with Python from basics...",
      "source": "duckduckgo"
    },
    {
      "title": "Scikit-learn Documentation",
      "url": "https://scikit-learn.org/stable/tutorial",
      "snippet": "Official scikit-learn tutorials covering...",
      "source": "bing"
    }
  ]
}
```
**Benefits**: Self-contained attribution, LLM-friendly, no index mapping needed
## Algorithm Comparison
| Method | Pros | Cons | Use Case |
|--------|------|------|----------|
| **Current Simple** | Fast | Biased, no diversity | None; should be replaced |
| **RRF** | Good for hybrid search | Wrong problem type | Different query types on same corpus |
| **Quality-First** | Trusts engines, guarantees diversity | May have duplicates | **Our use case** |
## Implementation Plan
### Week 1: Core Algorithm & Token Optimization ✅ COMPLETED
- [x] Implement token-optimized candidate pool creation (4 per engine → dedupe → rank top 10)
- [x] Fix tracking system to ensure proper engine attribution
- [x] Add direct source attribution to each result (no metadata mapping)
- [x] Add comprehensive tests and token usage measurement
### Week 2: Deduplication & Ranking ✅ COMPLETED
- [x] Optimize deduplication performance with robust URL normalization
- [x] Implement ranking by original engine position
- [x] Add configurable per-engine limits (default 4; see the sketch after this checklist)
- [x] Performance benchmarking vs current system
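For the configurable-limit item above, a minimal sketch of what the parameterized pool builder might look like; the `per_engine_limit` and `pool_size` names and the dict-of-engines input shape are illustrative assumptions, not the shipped API.
```python
def create_llm_optimized_pool_v2(engine_results, per_engine_limit=4, pool_size=10):
    # engine_results shape assumed: {"duckduckgo": [...], "bing": [...], "startpage": [...]}
    candidates = []
    for engine, results in engine_results.items():
        for i, result in enumerate(results[:per_engine_limit]):
            candidates.append({**result, "engine": engine, "rank": i + 1})
    unique = deduplicate_keep_best_rank(candidates)
    return sorted(unique, key=lambda x: x["rank"])[:pool_size]
```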
### Week 3: Monitoring & Metrics ✅ COMPLETED
- [x] Add engine diversity metrics and deduplication stats
- [x] Track LLM selection patterns and token usage
- [x] Monitor for engine bias in final results
- [x] Create diversity and performance dashboard
### Week 4: Enhancement & Deployment
- [ ] Add query-type based engine selection
- [ ] Implement adaptive per-engine limits based on query length
- [ ] A/B test token optimization impact on LLM performance
- [ ] Production deployment with feature flags
## Expected Outcomes
### Primary Metrics
- **Token Efficiency**: ~60% reduction in tokens sent to LLM
- **Result Quality**: 8-10 unique, high-quality results (vs current ~10 with duplicates)
- **Engine Diversity**: Balanced representation from all working engines
- **Attribution Accuracy**: Clean engine tracking via per-result `source` fields
- **Performance**: Maintain current speed despite the added deduplication overhead
### Success Criteria
- Reduce LLM token consumption while improving result quality
- Eliminate duplicate results before LLM selection
- Ensure proper engine attribution for all results
- Maintain 4 results from each working engine before deduplication
- Clean, optimized response format for LLM consumption
## Risk Mitigation
### Technical Risks
- **Duplicate Handling**: Robust URL normalization (see the sketch after this list)
- **Engine Failures**: Graceful degradation when engines are down
- **Performance**: Efficient deduplication algorithm
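A minimal sketch of the URL normalization intended for deduplication; the exact rule set is an assumption, but lowercasing scheme and host, stripping fragments and common tracking parameters, and trimming trailing slashes are typical choices.
```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}

def normalize_url(url):
    """Collapse trivially different URL spellings onto one dedup key."""
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment entirely; it never changes the fetched document.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))

# normalize_url("HTTPS://Example.com/docs/?utm_source=x#top") -> "https://example.com/docs"
```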
### Operational Risks
- **Tracking Issues**: Comprehensive engine attribution testing
- **LLM Impact**: Monitor LLM selection patterns
- **Rollback**: Maintain ability to revert
## Implementation Details
### Fix Tracking System
```python
# In format_search_response(), set the engine key *after* the spread so it
# always wins, even if a result already carries a stale "source_engine" value
ddg_with_engine = [{**result, "source_engine": "ddg"} for result in ddg_results]
bing_with_engine = [{**result, "source_engine": "bing"} for result in bing_results]
startpage_with_engine = [{**result, "source_engine": "startpage"} for result in startpage_results]
```
### Engine Failure Handling
```python
def safe_get_top_results(results, count=5):
    """Safely get top N results, handling empty/failed engines."""
    return results[:count] if results else []
```
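Illustrative wiring (variable names assumed): a failed engine contributes an empty list instead of raising, so the candidate pool degrades gracefully to whichever engines responded.
```python
pool = create_llm_optimized_pool(
    safe_get_top_results(ddg_results),
    safe_get_top_results(None),  # e.g. the Bing request failed
    safe_get_top_results(startpage_results),
)
```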
---
**Status**: Core ranking optimization successfully implemented. Weeks 1-3 complete; Week 4 enhancements pending.