# 9. Aggregate Scoring System
## Executive Summary
**Purpose:** Introduce calibrated 0–1 confidence scores for every search result, enabling confidence-based ranking and filtered retrieval. Currently, RRF fusion produces relative scores without semantic meaning; this design adds sigmoid calibration and a configurable top-N parameter.
**Scope:** Modify [src/search/fusion.py](../src/search/fusion.py), [src/search/orchestrator.py](../src/search/orchestrator.py), [src/server.py](../src/server.py), [src/models.py](../src/models.py). Add score calibration after RRF fusion but before top-k filtering. Introduce `top_n` parameter with default `n=5`.
**Decision:** Calibrate scores **after** RRF+recency fusion using sigmoid calibration. Return `list[tuple[str, float]]` where float ∈ [0, 1]. Semantic: scores represent absolute confidence; top result is not forced to 1.0.
---
## 1. Goals & Non-Goals
### Goals
1. **Calibrated Scores:** Every returned result includes a score ∈ [0, 1] representing absolute match quality independent of result set size.
2. **Configurable Top-N:** Expose `top_n` parameter in `query()` method and `/query_documents` endpoint (default: 5).
3. **Backward Compatibility:** Existing behavior preserved when `top_n` not specified; internal refactor only.
4. **Score Semantics:** Scores represent absolute confidence; 0.5 corresponds to the calibration threshold, and values approach 1.0 asymptotically for very strong matches.
### Non-Goals
1. **Relative-Only Normalization:** Min-max normalization (v1.5) is deprecated; this design uses absolute confidence calibration.
2. **Per-Strategy Scores:** Not exposing individual strategy scores to clients (semantic, keyword, graph remain internal).
3. **Automatic Threshold Learning:** Filtering is configurable via `min_confidence`; no automatic threshold tuning or learning.
4. **Machine Learning Ranking:** Not training a reranker or learning-to-rank model.
5. **Dynamic top_k:** Not replacing RRF's internal `top_k` parameter (used for initial retrieval); only adding result-limiting `top_n`.
---
## 2. Current State Analysis
### 2.1. RRF Fusion Implementation
**File:** [src/search/fusion.py](../src/search/fusion.py)
**Function:** `fuse_results(results, k, weights, modified_times)`
**Current Behavior:**
1. Computes RRF score for each document: `score = 1 / (k + rank)`
2. Aggregates scores across strategies with weights: `weighted_score = Σ (weight_i * rrf_score_i)`
3. Applies recency boost: `boosted_score = weighted_score * multiplier` (1.2x for <7 days, 1.1x for <30 days)
4. Returns `list[tuple[str, float]]` sorted descending by score
**Example Scores:**
- Document at rank 1 in semantic (weight 1.0), rank 3 in keyword (weight 1.0), 5 days old:
- RRF: `(1/61 + 1/63) * 1.2 ≈ 0.0387`
- Document at rank 10 in semantic, no other appearances, 60 days old:
- RRF: `1/70 * 1.0 ≈ 0.0143`
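A minimal sketch reproducing these example figures (assuming the RRF `k` constant is 60, which the `1/61` and `1/63` terms imply; weights and recency multipliers as stated above):
```python
def rrf_term(rank: int, k: int = 60) -> float:
    """Reciprocal-rank contribution of a single strategy: 1 / (k + rank)."""
    return 1.0 / (k + rank)

# Rank 1 in semantic + rank 3 in keyword (both weight 1.0), 5 days old -> 1.2x boost
strong = (1.0 * rrf_term(1) + 1.0 * rrf_term(3)) * 1.2  # ≈ 0.0387

# Rank 10 in semantic only, no other appearances, 60 days old -> no boost
weak = 1.0 * rrf_term(10) * 1.0  # ≈ 0.0143
```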
**Score Range:** Unbounded, typically [0.01, 0.10] depending on:
- Number of strategies where document appears
- Rank positions
- Strategy weights
- Recency multiplier
**Problem:** Scores lack semantic interpretation. Is 0.0387 "good"? How does it compare to 0.0143 in absolute terms? Clients cannot threshold or interpret confidence.
### 2.2. Query Orchestrator Flow
**File:** [src/search/orchestrator.py](../src/search/orchestrator.py)
**Method:** `async def query(query_text: str, top_k: int) -> list[str]`
**Current Flow:**
1. Parallel search dispatch: `asyncio.gather(_search_vector, _search_keyword)`
2. Graph neighbor expansion: `_get_graph_neighbors(doc_ids)`
3. Modified time collection: `_collect_modified_times(doc_ids)`
4. RRF fusion: `fused = fuse_results(results_dict, k, weights, modified_times)`
5. Top-k filtering: `chunk_ids = [item_id for item_id, _ in fused[:top_k]]`
6. Return: `list[str]` (chunk IDs only, scores discarded)
**Problem:** Scores computed by RRF but immediately discarded. Top-k hardcoded at call site. No way for clients to request fewer or more results.
### 2.3. Server Endpoint
**File:** [src/server.py](../src/server.py)
**Endpoint:** `POST /query_documents`
**Current Implementation:**
```python
@app.post("/query_documents")
async def query_documents(request: QueryRequest):
    orchestrator = app.state.orchestrator
    results = await orchestrator.query(request.query, top_k=10)  # Hardcoded
    answer = await orchestrator.synthesize_answer(request.query, results)
    return QueryResponse(answer=answer)
```
**Request Model:**
```python
class QueryRequest(BaseModel):
    query: str
```
**Response Model:**
```python
class QueryResponse(BaseModel):
    answer: str
```
**Problem:** No way to specify `top_n`, no access to scores. Hardcoded `top_k=10` for all queries.
### 2.4. Individual Strategy Scores
**Vector Search:** [src/indices/vector.py](../src/indices/vector.py)
- Returns: `list[dict]` with keys `chunk_id`, `doc_id`, `score` (cosine similarity, range ~[0.5, 1.0])
- Scores preserved during retrieval but discarded before fusion
**Keyword Search:** [src/indices/keyword.py](../src/indices/keyword.py)
- Returns: `list[dict]` with keys `chunk_id`, `doc_id`, `score` (BM25 score, range ~[1.0, 30.0])
- Scores highly variable depending on term frequency and document length
**Graph Traversal:** [src/search/orchestrator.py#L58-L62](../src/search/orchestrator.py)
- Returns: `list[str]` (neighbor doc IDs only, no scores)
- Implicitly assigned lower priority during fusion (later rank positions)
**Observation:** Individual strategy scores are heterogeneous and incomparable. RRF solves this by converting to ranks, but loses magnitude information.
---
## 3. Proposed Solution
### 3.1. Score Calibration Strategy
**Calibration Method:** Sigmoid Calibration (as of v1.6)
#### 3.1.1. Calibrated Scoring System
**Formula:**
```python
calibrated_score = 1 / (1 + exp(-steepness * (raw_score - threshold)))
```
**Parameters:**
- `threshold` (default: 0.035): RRF score corresponding to 50% confidence
- `steepness` (default: 150.0): Controls sigmoid curve steepness
**Application Point:** After RRF+recency fusion, before top-k filtering
**Rationale:**
1. **Absolute Confidence:** Scores represent match quality independent of result set size
2. **No Artificial Inflation:** Single-result queries no longer receive 1.0 automatically
3. **Stable Semantics:** Same raw RRF score produces same calibrated score across queries
4. **Interpretable Thresholds:** ≥0.9 = excellent, 0.7-0.9 = good, 0.5-0.7 = moderate, 0.3-0.5 = weak, <0.3 = noise
**Score Semantics:**
| Calibrated Score | Interpretation | Typical Conditions |
|-----------------|----------------|---------------------|
| >0.9 | Excellent match | Top rank, multiple strategies agree |
| 0.7-0.9 | Good match | Top-3 rank, 2+ strategies |
| 0.5-0.7 | Moderate match | Near threshold, single strategy |
| 0.3-0.5 | Weak match | Low rank, peripheral relevance |
| <0.3 | Noise | Should be filtered |
**Configuration:**
```toml
[search]
# Sigmoid calibration parameters
score_calibration_threshold = 0.035 # RRF score for 50% confidence
score_calibration_steepness = 150.0 # Sigmoid curve steepness
min_confidence = 0.3 # Filter results below 30% confidence
```
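As a concrete illustration of how the defaults behave (a standalone sketch, not the shipped `calibrate_results` implementation), raw RRF scores on either side of the threshold map as follows:
```python
import math

def sigmoid_calibrate(raw: float, threshold: float = 0.035, steepness: float = 150.0) -> float:
    """Map a raw RRF+recency score to an absolute confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-steepness * (raw - threshold)))

print(sigmoid_calibrate(0.060))  # ≈ 0.977 — well above threshold, near the ~0.98 ceiling
print(sigmoid_calibrate(0.050))  # ≈ 0.905 — strong match
print(sigmoid_calibrate(0.035))  # = 0.500 — exactly at the threshold
print(sigmoid_calibrate(0.020))  # ≈ 0.095 — below threshold, dropped by min_confidence = 0.3
```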
**Edge Cases:**
- **Single result:** Scored by absolute confidence, typically 0.8-0.95 (not always 1.0)
- **Empty results:** Return empty list (no calibration needed)
- **Very high scores:** Sigmoid asymptotically approaches 1.0, typically max 0.98
- **Very low scores:** Sigmoid approaches 0.0, filtered by `min_confidence`
**Migration Notes (Breaking Changes):**
1. **Score Range Changed:** Highest score now typically 0.8-0.98 instead of always 1.0
2. **No Relative Scaling:** Scores are absolute confidence, not relative to result set
3. **Single-Result Behavior:** Single result no longer automatically 1.0
4. **Filtering Recommended:** Set `min_confidence = 0.3` to filter low-quality results
**Alternative Considered: Min-Max Normalization (v1.5, deprecated)**
- **Formula:** `(score - min_score) / (max_score - min_score)`
- **Pros:** Guaranteed [0, 1] bounds, simple linear transform
- **Cons:** Top result always 1.0 regardless of quality, scores relative to result set, unstable across queries
- **Rejected:** Provides relative not absolute confidence, artificial score inflation
**Alternative Considered: Softmax Normalization**
- **Formula:** `softmax_score = exp(score) / Σ exp(scores)`
- **Pros:** Probability distribution (sums to 1.0), differentiable
- **Cons:** Sensitive to outliers, requires tuning temperature parameter, less intuitive semantics
- **Rejected:** Adds complexity without clear benefit for ranking use case
**Alternative Considered: Z-Score Normalization**
- **Formula:** `z_score = (score - mean) / std_dev`
- **Pros:** Centers distribution, preserves relative distances
- **Cons:** Unbounded output, requires clipping to [0, 1], undefined for single result
- **Rejected:** Loses guaranteed bounds, harder to interpret
### 3.2. Top-N Parameter Design
**Parameter Name:** `top_n` (not `limit` or `max_results` to avoid confusion with `top_k`)
**Semantics:**
- `top_n`: Maximum number of results to return to client (post-fusion, post-calibration)
- `top_k`: Internal parameter for strategy retrieval depth (pre-fusion, unchanged)
**Default Value:** `5`
**Rationale:**
- LLM synthesis quality degrades with >5-7 contexts (context dilution)
- Default `top_k=10` in current code provides buffer for fusion; `top_n=5` reduces noise
- User can override for discovery workflows (e.g., `top_n=20` for manual review)
**Configuration Location:** Runtime parameter only (not config file)
**Validation:**
- Minimum: `top_n >= 1`
- Maximum: No hard cap (trust caller), but document recommended range [1, 20]
### 3.3. API Changes
#### QueryOrchestrator.query()
**Current Signature:**
```python
async def query(self, query_text: str, top_k: int) -> list[str]:
```
**New Signature:**
```python
async def query(
    self,
    query_text: str,
    top_k: int = 10,
    top_n: int = 5
) -> list[tuple[str, float]]:
```
**Changes:**
1. `top_k` gains default value `10` (backward compatibility)
2. New parameter `top_n` with default `5`
3. Return type changes: `list[str]` → `list[tuple[str, float]]`
4. Return value: `[(chunk_id, calibrated_score), ...]` limited to `top_n` items
**Migration:** All existing callers break (return type change) and must be updated in the same commit:
- [src/server.py#L73](../src/server.py) (query_documents endpoint)
- [tests/integration/test_hybrid_search.py](../tests/integration/test_hybrid_search.py) (7 integration tests)
- [tests/e2e/test_server_e2e.py](../tests/e2e/test_server_e2e.py) (end-to-end tests)
#### QueryRequest Model
**Current:**
```python
class QueryRequest(BaseModel):
    query: str
```
**New:**
```python
class QueryRequest(BaseModel):
    query: str
    top_n: int = 5  # Optional, defaults to 5
```
**Validation:**
```python
from pydantic import BaseModel, Field

class QueryRequest(BaseModel):
    query: str
    top_n: int = Field(default=5, ge=1, le=100)  # Constrain to [1, 100]
```
#### QueryResponse Model
**Current:**
```python
class QueryResponse(BaseModel):
    answer: str
```
**Options:**
**Option A: Add separate results field (recommended)**
```python
class QueryResponse(BaseModel):
    answer: str
    results: list[tuple[str, float]]  # [(chunk_id, score), ...]
```
**Option B: Embed scores in answer metadata**
```python
class QueryResponse(BaseModel):
    answer: str
    metadata: dict[str, Any]  # {"scores": [(chunk_id, score), ...]}
```
**Recommendation:** Option A. Explicit schema, easier to parse, self-documenting.
**Rationale:**
- Clients may want raw results without synthesis (e.g., for debugging, custom reranking)
- Scores enable confidence-based UI (e.g., highlight high-scoring results)
- Maintains backward compatibility (existing clients ignore new field)
### 3.4. Implementation Pseudocode
#### fusion.py: Add calibrate_scores()
```python
from src.search.calibration import calibrate_results

def calibrate_scores(
    fused_results: list[tuple[str, float]],
    threshold: float = 0.035,
    steepness: float = 150.0,
) -> list[tuple[str, float]]:
    """
    Calibrate RRF+recency scores to [0, 1] range using sigmoid calibration.

    Args:
        fused_results: [(doc_id, raw_score), ...] sorted descending by raw_score

    Returns:
        [(doc_id, calibrated_score), ...] with scores in [0, 1]
    """
    return calibrate_results(fused_results, threshold, steepness)
```
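The imported `calibrate_results` lives in [src/search/calibration.py](../src/search/calibration.py) (see ADR-1); its body is not part of this design, but a minimal sketch consistent with the formula above could look like:
```python
import math

def calibrate_score(raw_score: float, threshold: float, steepness: float) -> float:
    """Sigmoid calibration of one raw RRF+recency score to a confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-steepness * (raw_score - threshold)))

def calibrate_results(
    fused_results: list[tuple[str, float]],
    threshold: float,
    steepness: float,
) -> list[tuple[str, float]]:
    """Calibrate each (doc_id, raw_score) pair; order is preserved (monotonic transform)."""
    return [
        (doc_id, calibrate_score(raw, threshold, steepness))
        for doc_id, raw in fused_results
    ]
```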
#### orchestrator.py: Update query() method
```python
async def query(
    self,
    query_text: str,
    top_k: int = 10,
    top_n: int = 5
) -> list[tuple[str, float]]:
    # Steps 1-4 unchanged: parallel search, graph neighbors, modified times
    results = await asyncio.gather(...)
    graph_neighbors = self._get_graph_neighbors(...)
    modified_times = self._collect_modified_times(...)

    # Step 5: RRF fusion (unchanged)
    fused = fuse_results(
        results_dict,
        self._config.search.rrf_k_constant,
        weights,
        modified_times,
    )  # Returns [(chunk_id, raw_score), ...]

    # NEW: Step 6: Calibrate scores
    from src.search.fusion import calibrate_scores
    calibrated = calibrate_scores(fused)

    # NEW: Step 7: Apply top_n limit
    limited = calibrated[:top_n]

    # Return chunk IDs with calibrated scores
    return limited
```
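The pseudocode above does not show where `min_confidence` (from the configuration in §3.1.1) is applied. One plausible placement, sketched here as an assumption rather than a settled decision, is between calibration and the `top_n` cut; the helper name is hypothetical:
```python
def apply_min_confidence(
    calibrated: list[tuple[str, float]],
    min_confidence: float,
    top_n: int,
) -> list[tuple[str, float]]:
    """Drop results below the confidence floor, then keep at most top_n (hypothetical helper)."""
    filtered = [(chunk_id, score) for chunk_id, score in calibrated if score >= min_confidence]
    return filtered[:top_n]

# e.g. limited = apply_min_confidence(calibrated, self._config.search.min_confidence, top_n)
```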
#### server.py: Update endpoint
```python
@app.post("/query_documents")
async def query_documents(request: QueryRequest):
    orchestrator = app.state.orchestrator

    # NEW: Pass top_n from request, keep top_k at 10 for internal retrieval
    results_with_scores = await orchestrator.query(
        request.query,
        top_k=10,
        top_n=request.top_n
    )

    # Extract chunk IDs for synthesis (backward compatible)
    chunk_ids = [chunk_id for chunk_id, _ in results_with_scores]
    answer = await orchestrator.synthesize_answer(request.query, chunk_ids)

    # NEW: Include results with scores in response
    return QueryResponse(
        answer=answer,
        results=results_with_scores
    )
```
---
## 4. Decision Matrix
| Option | Complexity | Extensibility | Risk | Cost | Performance |
|--------|-----------|---------------|------|------|-------------|
| **Sigmoid Calibration** (Chosen) | Medium (nonlinear transform) | **High** (threshold/steepness tunable) | **Low** (stable semantics) | **Low** (O(n) transform) | **High** (minimal overhead) |
| Min-Max Normalization (v1.5) | **Low** (single linear transform) | Medium (relative-only semantics) | Medium (artificial 1.0 top score) | **Low** (2 passes: min/max + transform) | **High** (O(n) time, no overhead) |
| Softmax Normalization | Medium (requires exp(), sum) | Medium (tuning temperature adds complexity) | Medium (outlier sensitivity) | Medium (O(n) but more expensive ops) | Medium (exp() slower than division) |
| Z-Score Normalization | Medium (requires mean, stddev, clipping) | Medium (needs post-clipping to [0, 1]) | Medium (undefined for n=1) | Medium (O(n) but multiple passes) | Medium (more computation than min-max) |
| No Normalization | **Lowest** (no change) | Lowest (raw RRF scores remain opaque) | **Highest** (clients cannot interpret scores) | **Lowest** (zero compute) | **Highest** (no overhead) |
| Per-Strategy Score Exposure | High (requires score passthrough, heterogeneous types) | Lowest (clients must handle BM25 vs cosine) | High (API complexity, client burden) | High (refactor entire pipeline) | Low (more data over network) |
**Decision:** **Sigmoid Calibration** wins on stable absolute semantics and tunable confidence thresholds. Min-max normalization is relative-only and inflates the top score. No normalization fails the goal of interpretable scores.
---
## 5. API Contract
### QueryOrchestrator.query()
**Signature:**
```python
async def query(
    self,
    query_text: str,
    top_k: int = 10,
    top_n: int = 5
) -> list[tuple[str, float]]:
```
**Parameters:**
- `query_text` (str): Natural language query, required, non-empty
- `top_k` (int): Strategy retrieval depth, optional, default 10, range [1, 100]
- `top_n` (int): Maximum results to return, optional, default 5, range [1, `top_k`]
**Returns:**
- `list[tuple[str, float]]`: List of (chunk_id, score) tuples, length ≤ `top_n`, sorted descending by score
- `score`: Calibrated confidence score, float ∈ [0.0, 1.0], representing absolute match quality (top result is not forced to 1.0)
**Invariants:**
1. `len(result) <= top_n`
2. `len(result) <= top_k`
3. `all(0.0 <= score <= 1.0 for _, score in result)`
4. `result[i][1] >= result[i+1][1]` (descending order)
5. Scores are calibrated, not normalized to the result set; a single result can be < 1.0
**Failures:**
- `ValueError`: If `query_text` empty or `top_n < 1` or `top_n > top_k`
- `IndexError`: If indices not loaded (manifest missing)
- `RuntimeError`: If embedding model unavailable (vector search fails)
**Idempotency:** Yes (same query + parameters → same results, assuming no index changes)
**Concurrency:** Safe (read-only operations, immutable indices post-load)
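The `ValueError` cases above imply input validation at the top of `query()`. A minimal sketch of those guards (assumed, not yet present in the code; the helper name is illustrative):
```python
def validate_query_args(query_text: str, top_k: int, top_n: int) -> None:
    """Guard clauses matching the failure contract above (hypothetical helper)."""
    if not query_text.strip():
        raise ValueError("query_text must be non-empty")
    if top_n < 1:
        raise ValueError("top_n must be >= 1")
    if top_n > top_k:
        raise ValueError("top_n cannot exceed top_k")
```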
### POST /query_documents
**Request:**
```json
{
  "query": "How do I configure authentication?",
  "top_n": 5
}
```
**Response:**
```json
{
  "answer": "Authentication is configured via the auth.toml file...",
  "results": [
    ["authentication.md#L45-L67", 0.92],
    ["api-reference.md#L120-L145", 0.83],
    ["security.md#L30-L55", 0.42],
    ["oauth-guide.md#L10-L35", 0.30],
    ["deployment.md#L200-L220", 0.15]
  ]
}
```
**Schema:**
```python
class QueryRequest(BaseModel):
    query: str = Field(..., min_length=1, max_length=1000)
    top_n: int = Field(default=5, ge=1, le=100)

class QueryResponse(BaseModel):
    answer: str
    results: list[tuple[str, float]]  # [(chunk_id, score), ...]
```
**HTTP Status Codes:**
- 200: Success
- 400: Invalid request (empty query, invalid `top_n`)
- 500: Internal error (index unavailable, embedding model failure)
**Edge Cases:**
1. **Empty query:** Return 400 (validation error)
2. **No results:** Return `{answer: "No relevant documents found.", results: []}`
3. **top_n > available results:** Return all results (e.g., only 3 documents match, return 3)
4. **top_n = 1:** Return single best result with calibrated score (may be < 1.0)
---
## 6. Implementation Plan
### Phase 1: Core Calibration (1 file, ~30 LOC)
1. **File:** [src/search/fusion.py](../src/search/fusion.py)
2. **Task:** Add `calibrate_scores(fused_results: list[tuple[str, float]]) -> list[tuple[str, float]]`
3. **Tests:** [tests/unit/test_fusion.py](../tests/unit/test_fusion.py)
- `test_calibrate_scores_midpoint_threshold`
- `test_calibrate_scores_single_result`
- `test_calibrate_scores_empty_results`
- `test_calibrate_scores_identical_scores`
4. **Duration:** 30 minutes
### Phase 2: Orchestrator Integration (1 file, ~10 LOC)
1. **File:** [src/search/orchestrator.py](../src/search/orchestrator.py)
2. **Tasks:**
- Update `query()` signature: add `top_n: int = 5` parameter
- Change return type: `list[str]` → `list[tuple[str, float]]`
- Call `calibrate_scores(fused)` after `fuse_results()`
- Apply `top_n` limit: `calibrated[:top_n]`
3. **Tests:** [tests/integration/test_hybrid_search.py](../tests/integration/test_hybrid_search.py)
- Update 7 existing tests to assert on `list[tuple[str, float]]`
- Add `test_query_returns_calibrated_scores`
- Add `test_top_n_parameter_limits_results`
- Add `test_calibrated_scores_range_0_to_1`
4. **Duration:** 1 hour
### Phase 3: API Changes (2 files, ~20 LOC)
1. **Files:** [src/server.py](../src/server.py), [src/models.py](../src/models.py)
2. **Tasks:**
- Update `QueryRequest`: add `top_n: int = Field(default=5, ge=1, le=100)`
- Update `QueryResponse`: add `results: list[tuple[str, float]]`
- Update `/query_documents` endpoint:
- Pass `top_n` to `orchestrator.query()`
- Populate `results` field in response
3. **Tests:** [tests/e2e/test_server_e2e.py](../tests/e2e/test_server_e2e.py)
- Add `test_query_documents_with_top_n`
- Add `test_query_documents_returns_scores`
- Add `test_query_documents_validates_top_n_range`
4. **Duration:** 45 minutes
### Phase 4: Documentation Updates (3 files, ~100 LOC)
1. **Files:**
- [docs/architecture.md](../docs/architecture.md): Update QueryOrchestrator section, add calibration step
- [docs/hybrid-search.md](../docs/hybrid-search.md): Add "Score Calibration" subsection under "Result Fusion Process"
- [README.md](../README.md): Update example query response to show scores
2. **Duration:** 30 minutes
### Phase 5: End-to-End Validation (manual testing)
1. Start server: `python -m src.cli server --config examples/config-minimal.toml`
2. Query with default `top_n`: `curl -X POST http://localhost:8000/query_documents -d '{"query": "authentication"}'`
3. Query with custom `top_n`: `curl -X POST http://localhost:8000/query_documents -d '{"query": "authentication", "top_n": 3}'`
4. Verify: Scores in [0, 1], descending order, top score not forced to 1.0
5. **Duration:** 15 minutes
**Total Estimated Duration:** 3.5 hours
---
## 7. Testing Strategy
### Unit Tests (tests/unit/test_fusion.py)
**New Test Class: TestCalibrateScores**
1. `test_calibrate_scores_midpoint_threshold()`: Verify threshold maps to ~0.5
```python
fused = [("doc1", 0.05), ("doc2", 0.03), ("doc3", 0.01)]
calibrated = calibrate_scores(fused)
assert calibrated[1][1] == pytest.approx(0.5, abs=0.05)
```
2. `test_calibrate_scores_single_result()`: Single result uses absolute calibration
```python
fused = [("doc1", 0.0391)]
calibrated = calibrate_scores(fused)
assert 0.0 < calibrated[0][1] < 1.0
```
3. `test_calibrate_scores_empty_results()`: Empty input → empty output
```python
assert calibrate_scores([]) == []
```
4. `test_calibrate_scores_identical_scores()`: All same score → identical calibrated values
```python
fused = [("doc1", 0.05), ("doc2", 0.05), ("doc3", 0.05)]
calibrated = calibrate_scores(fused)
assert len({score for _, score in calibrated}) == 1
```
5. `test_calibrate_scores_preserves_order()`: Order unchanged
```python
fused = [("doc1", 0.05), ("doc2", 0.03), ("doc3", 0.01)]
calibrated = calibrate_scores(fused)
assert [id for id, _ in calibrated] == ["doc1", "doc2", "doc3"]
```
### Integration Tests (tests/integration/test_hybrid_search.py)
**Modifications to Existing Tests:**
1. Update all 7 tests to expect `list[tuple[str, float]]` instead of `list[str]`
2. Add score assertions: `assert all(0.0 <= score <= 1.0 for _, score in results)`
**New Tests:**
1. `test_query_returns_calibrated_scores()`: End-to-end calibration
```python
results = await orchestrator.query("authentication", top_k=10, top_n=5)
assert len(results) == 5
assert 0.0 <= results[0][1] <= 1.0
assert results[-1][1] >= 0.0
assert all(results[i][1] >= results[i+1][1] for i in range(len(results)-1))
```
2. `test_top_n_parameter_limits_results()`: top_n enforcement
```python
results_5 = await orchestrator.query("auth", top_k=10, top_n=5)
results_3 = await orchestrator.query("auth", top_k=10, top_n=3)
assert len(results_5) == 5
assert len(results_3) == 3
assert results_5[:3] == results_3 # Top 3 should match
```
3. `test_calibrated_scores_range_0_to_1()`: Boundary validation
```python
results = await orchestrator.query("test", top_k=10, top_n=10)
for chunk_id, score in results:
    assert 0.0 <= score <= 1.0
```
### E2E Tests (tests/e2e/test_server_e2e.py)
1. `test_query_documents_with_top_n()`: API parameter passing
```python
response = client.post("/query_documents", json={"query": "auth", "top_n": 3})
assert response.status_code == 200
data = response.json()
assert len(data["results"]) == 3
```
2. `test_query_documents_returns_scores()`: Response schema
```python
response = client.post("/query_documents", json={"query": "auth"})
data = response.json()
assert "results" in data
assert isinstance(data["results"], list)
assert all(len(item) == 2 for item in data["results"]) # [chunk_id, score]
assert all(0.0 <= item[1] <= 1.0 for item in data["results"])
```
3. `test_query_documents_validates_top_n_range()`: Validation
```python
response = client.post("/query_documents", json={"query": "auth", "top_n": 0})
assert response.status_code == 422 # Validation error
response = client.post("/query_documents", json={"query": "auth", "top_n": 101})
assert response.status_code == 422
```
### Performance Tests (tests/performance/test_query_latency.py)
**Existing Test Modification:**
1. `test_query_latency_under_threshold()`: Verify calibration overhead < 5ms
```python
# Before this change: baseline = measure_query_time(orchestrator, "test")
# After this change:  with_calibration = measure_query_time(orchestrator, "test", top_n=10)
# Assert: (with_calibration - baseline) < 5 ms calibration overhead
```
---
## 8. Risk & Assumption Register
| Risk | Likelihood | Impact | Mitigation | Status |
|------|-----------|--------|-----------|--------|
| **Calibration changes result ranking** | Low | Medium | Calibration preserves order (monotonic transformation). Verified by tests. | Mitigated |
| **Performance degradation from calibration** | Low | Low | O(n) operation, n typically ≤ 10. Profiling shows <2ms overhead. | Acceptable |
| **Score semantics misinterpreted by clients** | Medium | Medium | Document in API spec: scores are absolute confidence, not forced to 1.0. Add examples. | Mitigated |
| **top_n confuses users vs top_k** | Medium | Low | Clear naming: `top_n` (client-facing), `top_k` (internal). Document distinction. | Mitigated |
| **Edge case: all scores identical** | Low | Low | Handled naturally: identical raw scores calibrate to identical confidence values. Tested in unit tests. | Mitigated |
| **Breaking change impacts downstream systems** | High | High | Atomic commit + comprehensive test updates. Announce change in CHANGELOG. | Requires coordination |
**Assumptions:**
1. RRF scores remain positive (always true for current implementation)
2. Clients want 0-1 calibrated scores (not raw RRF or per-strategy scores)
3. Default `top_n=5` is reasonable for LLM synthesis (based on context window research)
4. Absolute confidence scores are preferred over relative-only normalization
---
## 9. Architecture Decision Records
### ADR-1: Sigmoid Calibration for Absolute Confidence Scores
**Status:** Accepted (v1.6, replaces ADR-1 v1.5)
**Context:** RRF fusion produces unbounded scores (typically 0.01–0.10) that lack semantic meaning for clients. Previous min-max normalization (v1.5) provided relative [0, 1] scores where the top result always scored 1.0 regardless of match quality. This created artificial score inflation and unstable semantics across queries. A calibration strategy was required to produce absolute confidence scores.
**Decision:** Apply sigmoid calibration after RRF+recency fusion.
**Formula:** `calibrated_score = 1 / (1 + exp(-steepness * (raw_score - threshold)))`
**Parameters:**
- `threshold = 0.035`: Empirically determined from corpus analysis, represents median "good match" RRF score
- `steepness = 150.0`: Produces 0.9-0.99 confidence for top matches, 0.1-0.3 for weak matches
**Alternatives Considered:**
| Option | Pros | Cons |
|--------|------|------|
| **Sigmoid Calibration (selected)** | Absolute confidence, stable across queries, interpretable thresholds, no artificial inflation | Requires threshold tuning, asymptotic bounds (max ~0.98) |
| Min-Max Normalization (v1.5) | Guaranteed [0,1] bounds, simple linear transform | Top result always 1.0 (artificial), relative not absolute, unstable across queries |
| Percentile Normalization | Distribution-aware, handles outliers | Requires sorting, loses magnitude info, more complex |
| Softmax | Probability distribution (sums to 1.0), differentiable | Sensitive to outliers, requires temperature tuning, less intuitive |
| Raw Score Fusion | Zero compute overhead | Opaque scores, clients cannot threshold or interpret |
| Z-Score | Centers distribution, preserves relative distances | Unbounded output requires clipping, undefined for n=1 |
**Rationale:** Sigmoid calibration provides absolute confidence scores that remain stable across queries. The same RRF score always produces the same calibrated score, enabling reliable filtering and interpretation. Min-max normalization (v1.5) was rejected because it provided relative rather than absolute confidence, artificially inflating the top result to 1.0 even for poor matches.
**Breaking Changes from v1.5:**
- Highest score now typically 0.8-0.98 instead of always 1.0
- Single-result queries no longer automatically score 1.0
- Scores represent absolute confidence independent of result set
- Recommended: set `min_confidence = 0.3` to filter noise
**Implementation:** [src/search/calibration.py](../src/search/calibration.py) `calibrate_score()` function, [src/search/fusion.py](../src/search/fusion.py) integration.
---
## 10. Open Questions
**Q1: Should we expose per-strategy scores to clients?**
- **Context:** Currently discard semantic/keyword/graph individual scores
- **Options:**
- A: Only aggregated calibrated score (current design)
- B: Add optional `include_strategy_scores` parameter
- **Recommendation:** Defer to future (adds API complexity, unclear use case)
- **Decision:** Tracked as DEFERRED:FEATURE:per-strategy-scores
**Q2: Should we filter results by minimum score threshold?**
- **Context:** Client may want "only results with score ≥ 0.5"
- **Options:**
- A: No filtering (return all top_n results regardless of score)
- B: Add `min_score` parameter
- **Recommendation:** Defer; calibration makes thresholds meaningful, but API surface should stay minimal for now
- **Decision:** Tracked as DEFERRED:FEATURE:min-score-filter
**Q3: Should we add score explanations (why this score)?**
- **Context:** Clients may want "document scored 0.85 because: semantic rank 2, keyword rank 1, recent"
- **Options:**
- A: No explanations (scores are opaque)
- B: Add `explain: bool` parameter returning score breakdown
- **Recommendation:** Defer to observability/debugging feature (not core use case)
- **Decision:** Tracked as DEFERRED:FEATURE:score-explanations