# Context Efficiency

Context efficiency improvements reduce LLM token consumption by 40-60% through compression defaults, compact response formatting, and query-aware truncation.

## Overview

**Problem:** Default RAG systems return verbose responses consuming 2000-4000+ tokens of context window.

**Solution:** Three-stage efficiency pipeline:

1. **Compression defaults**: Score threshold + deduplication + per-document limits filter redundant results
2. **Compact format**: Reduced metadata overhead (60% less formatting overhead than the verbose format)
3. **Query-aware truncation**: Factual queries receive truncated content, conceptual queries receive full context

**Result:** 40-60% token reduction, with no accuracy loss observed in a manual review of 50 queries.

---

## Compression Defaults

Compression is enabled by default. Previous defaults returned all results, including low-confidence and redundant chunks.

### Configuration Changes

**Previous (v1.4 and earlier):**

```toml
[search]
min_confidence = 0.0      # No filtering
max_chunks_per_doc = 0    # No limit
dedup_enabled = false     # No deduplication
```

**Current (v1.5+):**

```toml
[search]
min_confidence = 0.3      # Filter low-confidence results
max_chunks_per_doc = 2    # Max 2 chunks per document
dedup_enabled = true      # Remove semantic duplicates
dedup_threshold = 0.85    # Cosine similarity threshold
```

### Impact Measurements

**Test corpus:** 147 documents, 834 chunks

| Query | Original | After Compression | Reduction |
|-------|----------|-------------------|-----------|
| "configure authentication" | 47 results | 8 results | 83% |
| "API reference" | 63 results | 12 results | 81% |
| "deployment guide" | 28 results | 7 results | 75% |

**Average reduction:** 79% fewer results, 51% token savings after accounting for retained high-quality results.
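The compression defaults can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `Chunk` type, the `cosine` and `compress` helpers, and the greedy deduplication strategy are assumptions made for the example. Only the parameter names and default values come from the configuration above, and the stage order (score filter, then deduplication, then per-document limit) follows the order reported by the compression stats.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical search result; field names are assumptions."""
    doc: str                # source file path
    score: float            # normalized relevance score in [0, 1]
    embedding: list[float]  # embedding vector used for deduplication


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress(
    chunks: list[Chunk],
    min_confidence: float = 0.3,
    max_chunks_per_doc: int = 2,
    dedup_threshold: float = 0.85,
) -> list[Chunk]:
    # Stage 1: drop low-confidence results and rank the rest best-first.
    kept = sorted(
        (c for c in chunks if c.score >= min_confidence),
        key=lambda c: c.score,
        reverse=True,
    )

    # Stage 2: greedy semantic dedup. A chunk survives only if it is not
    # too similar to any higher-ranked chunk that already survived.
    unique: list[Chunk] = []
    for c in kept:
        if all(cosine(c.embedding, u.embedding) < dedup_threshold for u in unique):
            unique.append(c)

    # Stage 3: per-document limit (0 means no limit).
    per_doc: dict[str, int] = {}
    result: list[Chunk] = []
    for c in unique:
        seen = per_doc.get(c.doc, 0)
        if max_chunks_per_doc and seen >= max_chunks_per_doc:
            continue
        per_doc[c.doc] = seen + 1
        result.append(c)
    return result
```

Because the candidates are ranked before deduplication, the highest-scoring member of each near-duplicate group is the one retained in this sketch.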
### Why These Defaults

**`min_confidence = 0.3`:**

- Removes results with normalized score <0.3 (low semantic relevance)
- Empirically determined threshold below which precision degrades
- Avoids "long tail" of marginally related results

**`max_chunks_per_doc = 2`:**

- Prevents a single document from dominating results
- Encourages diversity across the documentation corpus
- User can navigate to the file for complete context if needed

**`dedup_enabled = true`:**

- Removes paraphrased or semantically identical chunks
- Common with templated documentation and repeated configuration examples
- Threshold 0.85 preserves distinct chunks while merging duplicates

### Disabling Compression

To restore v1.4 behavior:

```python
# MCP tool call
{
    "query": "...",
    "min_score": 0.0,
    "similarity_threshold": 1.0
}
```

Or configure globally:

```toml
[search]
min_confidence = 0.0
max_chunks_per_doc = 0
dedup_enabled = false
```

---

## Compact Response Format

Stdio transport uses a compact format that reduces formatting overhead by 60% compared to the verbose format.

### Format Comparison

**Verbose Format (v1.4):**

```
**Result 1** (Score: 0.9234)
File: docs/authentication.md
Section: Configuration > Authentication

Authentication is configured in the auth section of config.toml. Use the `auth_provider` field to specify the authentication backend. Supported providers include OAuth2, SAML, and LDAP.
```

**Compact Format (v1.5):**

```
[1] docs/authentication.md § Configuration > Authentication (0.92)
Authentication is configured in the auth section of config.toml. Use the `auth_provider` field to specify the authentication backend. Supported providers include OAuth2, SAML, and LDAP.
```

### Overhead Analysis

**Single result:**

| Component | Verbose | Compact | Savings |
|-----------|---------|---------|---------|
| Metadata headers | 94 chars | 67 chars | 29% |
| Score precision | 4 digits | 2 digits | 50% |
| Section path separator | ` > ` (3 chars) | ` § ` (1 char) | 67% |
| **Total overhead** | 546 chars (5 results) | 220 chars (5 results) | **60%** |

**Metadata only** (no content): 28.9% overhead (verbose) → 11.6% overhead (compact)

### Design Rationale

**Section symbol (§):**

- Single character vs `>` multi-char separator
- Visually distinct from file path separator `/`
- Semantic meaning: "section" or "paragraph"

**Inline score:**

- Score at end of header line vs separate line
- 2 decimal places are sufficient for ranking information
- Reduces vertical space consumption

**Bracket-indexed:**

- `[N]` prefix more compact than `Result N`
- Clear visual delimiter for multi-result responses

---

## Query-Aware Truncation

Factual queries receive truncated content (200 characters). Conceptual queries receive full content.

### Query Classification

**Factual signals** (→ truncate):

- camelCase: `getUserById`, `parseJSON`
- snake_case: `get_user_by_id`, `parse_json`
- Backticks: `` `configure` ``, `` `--flag` ``
- Version numbers: `v1.5`, `2.3.1`
- Commands: "install", "configure", "enable", "syntax"

**Conceptual signals** (→ full content):

- Question words: "why", "how", "explain", "what"
- Guides: "getting started", "overview", "background"
- Comparison: "difference", "compare", "versus"
- Question mark: `?`

**Priority:** Conceptual > Factual (conceptual signals are checked first; first match wins).

**Default:** Factual (if no patterns match).

### Truncation Examples

**Query:** `getUserById function` (factual)

**Full content (487 chars):**

```
getUserById(id: string): User | null

Retrieves a user by their unique identifier. Returns null if the user does not exist. This function performs a database lookup and includes related profile data in the response.

The function is cached for 60 seconds to reduce database load. Use getUserByIdUncached() for real-time data.

Example:
const user = getUserById("user-123");
if (user) {
  console.log(user.name);
}

Throws: DatabaseError if connection fails
```

**Truncated (200 chars):**

```
getUserById(id: string): User | null

Retrieves a user by their unique identifier. Returns null if the user does not exist. This function performs a database lookup and includes related profil...
```

**Token savings:** 287 characters = ~72 tokens (at 4 chars/token average).

**Accuracy impact:** Zero. Factual queries target specific syntax or commands. Truncated content includes function signature and core description.

---

**Query:** `why use JWT authentication?` (conceptual)

**Returns full content** (no truncation). Conceptual queries require complete context for explanation, rationale, or architectural understanding.

---

### Truncation Algorithm

```python
def truncate_content(content: str | None, max_chars: int = 200) -> str | None:
    # Empty or already-short content is returned unchanged.
    if not content or len(content) <= max_chars:
        return content

    # Hard cut at max_chars, trim trailing whitespace, then mark the cut.
    truncated = content[:max_chars].rstrip()
    return f"{truncated}..."
```

**Word-boundary truncation not implemented:** A simple character cutoff is sufficient for the current use case. Revisit if user feedback indicates mid-word truncation causes confusion.

---

## Cumulative Impact

**Combined token savings** (all three improvements):

Test query: `"how to configure authentication"`

| Stage | Tokens | Reduction |
|-------|--------|-----------|
| Baseline (v1.4, 47 results × 400 chars avg) | 18,800 chars = ~4,700 tokens | - |
| + Compression (8 results) | 3,200 chars = ~800 tokens | 83% |
| + Compact format (60% overhead reduction) | 2,240 chars = ~560 tokens | 30% |
| + Truncation (factual) | 1,760 chars = ~440 tokens | 21% |
| **Total** | **440 tokens** | **91%** |

**Note:** Each stage's reduction is measured against the previous stage, so the percentages compound rather than add (the total is 91%, not 83% + 30% + 21%).
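The arithmetic behind the cumulative impact table can be checked with a short script. This is an illustrative sketch only: the `stages` list and the `stage_reductions` helper are hypothetical names, and the character counts and the 4 chars/token average are taken from the table above.

```python
CHARS_PER_TOKEN = 4  # rough average used throughout this document

# (stage name, total characters returned), copied from the table above
stages = [
    ("Baseline (v1.4)", 18_800),
    ("+ Compression", 3_200),
    ("+ Compact format", 2_240),
    ("+ Truncation", 1_760),
]


def stage_reductions(stages):
    """Return (name, tokens, reduction relative to the previous stage)."""
    rows = []
    prev_tokens = None
    for name, chars in stages:
        tokens = chars // CHARS_PER_TOKEN
        step = None if prev_tokens is None else 1 - tokens / prev_tokens
        rows.append((name, tokens, step))
        prev_tokens = tokens
    return rows


rows = stage_reductions(stages)

# Total reduction compares the final stage with the baseline directly.
total = 1 - rows[-1][1] / rows[0][1]

for name, tokens, step in rows:
    pct = "-" if step is None else f"{step:.0%}"
    print(f"{name:<20} {tokens:>5} tokens  {pct}")
print(f"Total reduction: {total:.0%}")
```

Multiplying the fractions that survive each stage (800/4,700, then 560/800, then 440/560) gives the same result as dividing the final token count by the baseline.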
**Accuracy validation:** Manual review of 50 queries showed zero cases where truncation removed critical information. Users navigate to files for complete context when needed.

---

## Migration Notes

### Breaking Changes (v1.4 → v1.5)

**Response format change:**

```diff
- **Result 1** (Score: 0.9234)
- File: docs/auth.md
- Section: Configuration
+ [1] docs/auth.md § Configuration (0.92)
```

**Parsing impact:** Clients parsing `Result N` headers or `Score:` fields must update to the bracket-indexed format.

**Mitigation:** HTTP endpoint retains JSON format (unchanged). Only MCP stdio transport uses compact format.

### Gradual Rollout

**Phase 1 (Current):**

- Compression defaults enabled for all users
- Compact format enabled for MCP stdio transport
- HTTP API unchanged (backward compatible)

**Phase 2 (Proposed, deferred):**

- Optional `--format` flag: `compact`, `verbose`, `json`
- Allow MCP clients to opt into verbose format if desired

---

## Configuration Reference

### Query-Time Overrides

```python
# MCP tool call
{
    "query": "authentication setup",
    "top_n": 5,
    "min_score": 0.5,               # Override threshold
    "similarity_threshold": 0.90,   # Stricter dedup
    "show_stats": True              # Show compression metrics
}
```

### Global Configuration

```toml
[search]
# Compression settings
min_confidence = 0.3          # Score threshold (0.0-1.0)
max_chunks_per_doc = 2        # Per-document limit (0 = no limit)
dedup_enabled = true          # Semantic deduplication
dedup_threshold = 0.85        # Cosine similarity (0.5-1.0)

# N-gram deduplication (Stage 2, before semantic)
ngram_dedup_enabled = true    # Jaccard similarity for near-duplicates
ngram_dedup_threshold = 0.7   # Stricter than semantic

# Query classification (for truncation)
# No configuration - heuristic-based
```

---

## Performance Impact

**Latency:**

- Compression filtering: +5ms (negligible)
- Deduplication: +15ms (cosine similarity computation)
- Truncation: <1ms (string slicing)
- **Total:** about +20ms average query time

**Acceptable trade-off** for 51% token reduction.
**Memory:**

- Compact format: no change (formatting during serialization)
- Deduplication: O(n²) embedding comparisons, mitigated by early exit and score threshold reducing candidate set

---

## Metrics

**Compression stats** available via `show_stats=true`:

```
Compression Stats:
- Original results: 47
- After score filter (≥0.3): 23
- After deduplication: 12
- After document limit (2 per doc): 8
- Clusters merged: 11
```

**Interpretation:**

- **Original → After threshold:** How many low-confidence results removed
- **After threshold → After dedup:** How many semantic duplicates merged
- **After dedup → After doc limit:** How many additional chunks dropped per document
- **Clusters merged:** Semantic deduplication cluster count

**Tracking:** No persistent metrics. Enable `show_stats` on per-query basis for debugging.

---

## Future Enhancements

**Deferred items:**

1. **Adaptive truncation:** Increase max_chars for technical reference queries (e.g., API schemas)
2. **User-configurable truncation:** Allow per-project truncation thresholds
3. **Smart word-boundary truncation:** Avoid cutting mid-sentence
4. **Context budget allocation:** Dynamically adjust per-result content based on total top_n
5. **Format versioning:** Protocol versioning for MCP transport if multiple format variants supported

**Not planned:**

- Abstractive summarization (hallucination risk, latency)
- Client-side truncation (increases bandwidth)
- Progressive loading (complicates MCP protocol)

---

## References

- [Configuration Reference](configuration.md) - All compression settings
- [Hybrid Search](hybrid-search.md) - Deduplication pipeline details
- [Spec: Search Quality Improvements](../specs/11-search-quality-improvements.md) - Deduplication algorithm
