# 12. Context Compression

## Executive Summary

**Purpose:** Design a second query tool (`query_documents_compressed`) that reduces context size to avoid LLM context window poisoning when retrieving multiple document chunks.

**Scope:** Evaluate 6 compression strategies; propose API surface for new tool; provide decision matrix for implementation prioritization.

**Decision Required:** ~~Select compression strategy (or combination) balancing token reduction, quality preservation, and implementation complexity.~~ **IMPLEMENTED:** Score thresholding + semantic deduplication. LLM-based summarization rejected.

---

## 1. Current State Analysis

### 1.1 Current `query_documents` API Surface

**Location:** [src/mcp_server.py](../src/mcp_server.py#L44-L54)

```python
Tool(
    name="query_documents",
    inputSchema={
        "properties": {
            "query": {"type": "string"},
            "top_n": {"type": "integer", "default": 5, "minimum": 1, "maximum": 100},
        },
        "required": ["query"],
    },
)
```

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `query` | string | (required) | Natural language query |
| `top_n` | integer | 5 | Maximum results to return |

### 1.2 Response Structure

**ChunkResult Model:** [src/models.py](../src/models.py#L17-L30)

```python
@dataclass
class ChunkResult:
    chunk_id: str      # Unique identifier
    doc_id: str        # Parent document ID
    score: float       # Normalized relevance [0.0, 1.0]
    header_path: str   # Section hierarchy
    file_path: str     # Source file
    content: str       # Full chunk text (200-1500 chars)
```

**MCP Response Format:** [src/mcp_server.py](../src/mcp_server.py#L70-L81)

```
# Answer

{synthesized_answer}

# Source Documents

**Result 1** (Score: 0.9500)
File: architecture.md
Section: Components > IndexManager

{full_chunk_content}
```

### 1.3 Response Size Analysis

| Scenario | Chunks | Chars/Chunk | Total Content (chars) | Est. Tokens |
|----------|--------|-------------|-----------------------|-------------|
| Default (top_n=5) | 5 | 750 (avg) | 3,750 | ~940 |
| Default (top_n=5) | 5 | 1,500 (max) | 7,500 | ~1,875 |
| Extended (top_n=10) | 10 | 750 (avg) | 7,500 | ~1,875 |
| Extended (top_n=10) | 10 | 1,500 (max) | 15,000 | ~3,750 |
| Discovery (top_n=25) | 25 | 750 (avg) | 18,750 | ~4,688 |

**Note:** Token estimates use ~4 chars/token approximation. Actual varies by content.

### 1.4 Problem Statement

**Context Bloat Scenarios:**

1. User requests many results (`top_n=20+`)
2. Chunks at maximum size (1500 chars each)
3. Redundant content across similar chunks
4. Low-relevance chunks included due to `top_n` requirement

**Consequences:**

- Consumes LLM context window budget
- Reduces space for conversation history
- May include irrelevant content that confuses LLM
- Higher API costs for token-based billing

---

## 1.5 Architecture Comparison

```mermaid
flowchart TD
    subgraph Current["Current: query_documents"]
        Q1[Query] --> S1[Hybrid Search]
        S1 --> F1[RRF Fusion]
        F1 --> R1["Top-N Chunks (full content)"]
        R1 --> SYN1[Synthesize Answer]
    end

    subgraph Proposed["Proposed: query_documents_compressed"]
        Q2[Query] --> S2[Hybrid Search]
        S2 --> F2[RRF Fusion]
        F2 --> R2["Top-N Chunks (full content)"]
        R2 --> COMP{Compression Strategy}
        COMP -->|budget| BUD["Token Budget Constraint"]
        COMP -->|extractive| EXT["Sentence Extraction"]
        COMP -->|combined| COMB["Budget + Extractive"]
        BUD --> OUT[Compressed Results]
        EXT --> OUT
        COMB --> OUT
        OUT --> SYN2[Synthesize Answer]
    end
```

**Legend:**

- Rectangles: Processing steps
- Diamond: Decision point (strategy selection)
- Arrows: Data flow

---

## 2. Proposed Compression Strategies

### 2.1 Token Budget Constraint ❌ DEFERRED

**Status:** Not implemented. Clipping approach less desirable than compression.

**Description:** Stop adding chunks when cumulative token count exceeds a configurable budget.

**Technical Approach:**

1. Accept `max_tokens` parameter (default: 2000)
2. Iterate through ranked chunks
3. Estimate token count per chunk (chars / 4)
4. Stop when next chunk would exceed budget
5. Return partial results with `truncated: true` flag

**Pros:**

- Guaranteed budget compliance
- No content modification (full chunks preserved)
- Zero latency overhead
- Works with existing chunk storage

**Cons:**

- May return fewer results than `top_n`
- Abrupt cutoff may exclude relevant content
- No optimization within budget (all-or-nothing per chunk)

**Dependencies:** None (uses existing infrastructure)

**Implementation Complexity:** Low

### 2.2 Extractive Summarization ❌ DEFERRED

**Status:** Not implemented. Sentence boundary detection adds complexity without clear benefit over semantic dedup.

**Description:** Select key sentences from each chunk using query-aware relevance scoring.

**Technical Approach:**

1. Tokenize chunk into sentences (spaCy or regex)
2. Score each sentence:
   - Embedding similarity to query
   - TF-IDF overlap with query terms
   - Position bias (first sentences often summarize)
3. Select top-k sentences per chunk (k = budget / chunk_count)
4. Preserve sentence order in output

**Pros:**

- Preserves exact source text (no hallucination)
- Query-aware selection (keeps most relevant content)
- Configurable compression ratio
- No LLM dependency

**Cons:**

- May fragment meaning if key sentences depend on context
- Sentence boundary detection imperfect for technical content
- Requires sentence scoring infrastructure

**Dependencies:**

- Sentence tokenizer (spaCy, NLTK, or regex)
- Embedding model for sentence scoring (already available)

**Implementation Complexity:** Medium

### 2.3 Abstractive Summarization (LLM-Based) ❌ REJECTED

**Status:** Rejected. See ADR-1 below.

**Description:** Use LLM to rewrite/condense each chunk into a shorter summary.

**Technical Approach:**

1. For each chunk, call LLM with prompt:

   ```
   Summarize the following content in {target_length} words,
   preserving key facts relevant to: {query}

   {chunk_content}
   ```

2. Cache summaries keyed by (chunk_id, query_hash, target_length)
3. Return compressed chunks with `summary: true` flag

**Pros:**

- High-quality, coherent summaries
- Significant token reduction (50-80%)
- Maintains semantic completeness
- Can highlight query-specific aspects

**Cons:**

- Adds LLM inference latency (100-500ms per chunk)
- Requires LLM availability (may fail)
- Risk of hallucination or information loss
- Additional API costs
- Caching complexity for query-specific summaries

**Dependencies:**

- LLM endpoint (already configured for synthesis)
- Summary cache (new infrastructure)

**Implementation Complexity:** Medium (code), High (operational)

### 2.4 Semantic Deduplication ✅ IMPLEMENTED

**Status:** Implemented in [src/compression/deduplication.py](../src/compression/deduplication.py).

**Description:** Cluster similar chunks and return one representative per cluster.

**Technical Approach:**

1. Compute pairwise cosine similarity between chunk embeddings
2. Cluster chunks using:
   - Threshold-based grouping (sim > 0.85 = same cluster)
   - Or k-means with k = min(top_n, unique_clusters)
3. Select highest-scoring chunk from each cluster
4. Return deduplicated set

**Pros:**

- Removes redundant content
- Maintains diversity in results
- Uses existing embeddings (no new computation)
- Improves information density

**Cons:**

- Clustering adds computational overhead
- May lose nuanced differences between similar chunks
- Threshold selection affects results significantly
- Less effective if content already diverse

**Dependencies:**

- Access to chunk embeddings (available in VectorIndex)
- Clustering algorithm (scikit-learn or numpy)

**Implementation Complexity:** Medium

### 2.5 Aggressive Score Thresholding ✅ IMPLEMENTED

**Status:** Implemented in [src/compression/thresholding.py](../src/compression/thresholding.py).

**Description:** Only return chunks above a minimum relevance score threshold.

**Technical Approach:**

1. Accept `min_score` parameter (default: 0.5)
2. Filter results: `[r for r in results if r.score >= min_score]`
3. Return filtered set (may be empty)

**Pros:**

- Extremely simple implementation
- Removes low-quality results
- No content modification
- Zero latency overhead

**Cons:**

- Score semantics are relative (best result always 1.0)
- May return empty results for sparse queries
- Threshold selection arbitrary without absolute scoring
- Does not address size of remaining chunks

**Dependencies:** None

**Implementation Complexity:** Very Low

### 2.6 Hierarchical Compression ❌ DEFERRED

**Status:** Not implemented. Two-tool interaction adds complexity.

**Description:** Return short summaries with drill-down capability for full content.

**Technical Approach:**

1. Generate one-line summary per chunk:
   - First sentence, or
   - Header path + word count
2. Return compressed list with `chunk_id` references
3. Provide `expand_chunk(chunk_id)` tool for full content
4. Client decides which chunks to expand

**Response Example:**

```json
{
  "answer": "...",
  "summaries": [
    {"chunk_id": "auth_chunk_0", "preview": "Authentication via JWT tokens", "score": 0.95},
    {"chunk_id": "config_chunk_2", "preview": "Configuration options for OAuth", "score": 0.82}
  ]
}
```

**Pros:**

- Minimal initial token usage (70-90% reduction)
- User controls detail level
- Good for browsing/exploration
- Preserves full content access

**Cons:**

- Requires follow-up queries for full content
- More complex API surface (two tools)
- Client must implement expansion logic
- Not suitable for single-turn interactions

**Dependencies:**

- New MCP tool (`expand_chunk`)
- Client-side expansion logic

**Implementation Complexity:** Medium

---

## 3. Decision Matrix

| Option | Complexity | Quality Preservation | Token Reduction | Latency Impact | LLM Dependency | Best For |
|--------|------------|---------------------|-----------------|----------------|----------------|----------|
| **Token Budget Constraint** | Low | High (full chunks) | Guaranteed | None | No | Strict token limits |
| **Extractive Summarization** | Medium | Medium-High | 30-60% | Low (+10ms) | No | Query-focused compression |
| **Abstractive Summarization** | Medium+ | Medium (may hallucinate) | 50-80% | High (+200ms/chunk) | Yes | High compression needs |
| **Semantic Deduplication** | Medium | High (removes redundancy) | 20-50% | Low (+20ms) | No | Redundant result sets |
| **Score Thresholding** | Very Low | High (no modification) | Variable | None | No | Filtering low-quality results |
| **Hierarchical Compression** | Medium | High (on-demand) | 70-90% initial | None | No | Exploratory browsing |

---

## 4. Decision (2025-12-23)

### 4.1 Selected Approach: Score Thresholding + Semantic Deduplication

**Rejected:**

- Token Budget Constraint - Clipping is less desirable than compression
- Extractive Summarization - Not selected for initial implementation
- Abstractive Summarization - LLM dependency unacceptable
- Hierarchical Compression - Two-tool interaction adds complexity

**Selected:**

1. **Score Thresholding** - Filter results below configurable relevance threshold
2. **Semantic Deduplication** - Cluster similar chunks across ALL results, return one representative per cluster

**Rationale:**

- Compression over clipping (dedup removes redundancy without truncating)
- No LLM dependency (uses existing embedding model for similarity)
- Improves information density (removes near-duplicate content)
- Score thresholding removes noise before deduplication

### 4.2 Tool Definition

```python
Tool(
    name="query_documents_compressed",
    description="Search documents with context compression. Filters low-relevance results and removes semantic duplicates.",
    inputSchema={
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language query",
            },
            "top_n": {
                "type": "integer",
                "description": "Maximum results to return (default: 5)",
                "default": 5,
                "minimum": 1,
                "maximum": 100,
            },
            "min_score": {
                "type": "number",
                "description": "Minimum relevance score threshold (default: 0.3)",
                "default": 0.3,
                "minimum": 0.0,
                "maximum": 1.0,
            },
            "similarity_threshold": {
                "type": "number",
                "description": "Cosine similarity threshold for deduplication (default: 0.85)",
                "default": 0.85,
                "minimum": 0.5,
                "maximum": 1.0,
            },
        },
        "required": ["query"],
    },
)
```

### 4.3 Response Structure

Response format matches `query_documents` (same `ChunkResult` structure).

Additional metadata in response:

```python
{
    "answer": str,                    # Synthesized answer
    "results": list[ChunkResult],     # Deduplicated results
    "compression_stats": {
        "original_count": int,        # Results before filtering
        "after_threshold": int,       # Results after score filter
        "after_dedup": int,           # Final results after dedup
        "clusters_merged": int,       # Number of duplicate clusters removed
    }
}
```

### 4.4 Implementation Order

| Phase | Scope | LOC Est. | Duration |
|-------|-------|----------|----------|
| 1 | Score thresholding in orchestrator | ~30 | 30 min |
| 2 | Semantic deduplication module | ~120 | 2 hours |
| 3 | New MCP tool registration | ~60 | 1 hour |
| 4 | Integration + compression stats | ~40 | 30 min |
| 5 | Tests + documentation | ~200 | 3 hours |

**Total:** ~450 LOC, ~7 hours

---

## 5. Open Questions (Resolved)

1. ~~**Default `max_tokens` value**~~ - **RESOLVED:** Token budget approach rejected in favor of deduplication.
2. ~~**Sentence boundary detection**~~ - **DEFERRED:** Not needed for selected approach (dedup uses embeddings, not sentences).
3. **Compression indicator in response:** **RESOLVED:** Include `compression_stats` object showing original_count, after_threshold, after_dedup, clusters_merged.
4. ~~**Score threshold integration**~~ - **RESOLVED:** Score thresholding applied before deduplication as first filter.
5. **Similarity threshold default:** 0.85 selected as reasonable balance. Higher (0.9+) may miss semantic duplicates; lower (0.7) may merge distinct content.

---

## 6. Architecture Decision Records

### ADR-1: Rejection of LLM-Based Summarization

**Status:** Rejected

**Context:** Context compression requires reducing token count while preserving semantic value. LLM-based abstractive summarization can achieve 50-80% compression with coherent output.

**Decision:** Reject LLM-based summarization in favor of score thresholding + semantic deduplication.

**Alternatives Considered:**

| Option | Pros | Cons |
|--------|------|------|
| LLM Summarization | High compression (50-80%), coherent output | Latency (+100-500ms/chunk), LLM dependency, hallucination risk, API costs, cache complexity |
| **Score Thresholding (selected)** | Zero latency, deterministic, no dependencies | May return fewer results |
| **Semantic Dedup (selected)** | Removes redundancy, uses existing embeddings | May lose nuanced differences |
| Extractive Summarization | Preserves source text | Sentence boundary issues for technical content |
| Token Budget | Guaranteed budget | Abrupt cutoff, no optimization within budget |

**Rationale:**

1. **Local-first constraint:** This project operates without external API keys. LLM summarization would require either an external API (violates constraint) or local LLM inference (adds significant resource requirements).
2. **Latency budget:** 100-500ms per chunk exceeds acceptable latency for interactive use.
3. **Hallucination risk:** Summarization may introduce inaccuracies into technical documentation.
4. **Determinism:** Score thresholding and deduplication produce reproducible results.

The selected approach (threshold + dedup) achieves compression through information density improvement rather than content rewriting.

**Implementation:**

- Score thresholding: [src/compression/thresholding.py](../src/compression/thresholding.py)
- Semantic deduplication: [src/compression/deduplication.py](../src/compression/deduplication.py)

---

## 7. Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| ~~Extractive summarization loses critical context~~ | ~~Medium~~ | ~~High~~ | N/A (approach not implemented) |
| ~~Token estimation inaccurate~~ | ~~Low~~ | ~~Low~~ | N/A (token budget not implemented) |
| Compression degrades synthesis quality | Medium | Medium | A/B test compressed vs full responses; measure answer accuracy |
| User confusion about two query tools | Low | Low | Clear documentation; consider single tool with compression flag |
| Semantic dedup removes relevant distinct chunks | Low | Medium | Conservative threshold (0.85), user-configurable |

---

## 8. Related Documents

- [spec 06: Hybrid Search Strategy](06-hybrid-search-strategy.md) - RRF fusion and scoring
- [spec 08: Document Chunking](08-document-chunking.md) - Chunk size constraints
- [spec 09: Aggregate Scoring System](09-aggregate-scoring-system.md) - Score normalization
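
---

## 9. Appendix: Illustrative Sketch of Threshold + Dedup

The pipeline selected in section 4.1 (score thresholding first, then threshold-based grouping as described in section 2.4) can be sketched as follows. This is a minimal illustration, not the code in `src/compression/`: the `Chunk` class, the `compress` function, and the greedy best-first grouping are assumptions made for the example, embeddings are passed as plain Python lists, and `clusters_merged` is interpreted here as the number of clusters that absorbed at least one duplicate.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for ChunkResult plus its embedding vector."""
    chunk_id: str
    score: float            # normalized relevance in [0.0, 1.0]
    embedding: list[float]  # assumed to be retrievable from the vector index


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress(chunks, min_score=0.3, similarity_threshold=0.85):
    """Filter by score, then keep one representative per similarity group."""
    # Step 1: score thresholding removes low-relevance chunks first.
    kept = [c for c in chunks if c.score >= min_score]
    after_threshold = len(kept)

    # Step 2: greedy grouping, visiting chunks best-first so that each
    # group's representative is its highest-scoring member.
    kept.sort(key=lambda c: c.score, reverse=True)
    representatives = []
    absorbed = []  # duplicates folded into each representative
    for chunk in kept:
        for i, rep in enumerate(representatives):
            if cosine(chunk.embedding, rep.embedding) > similarity_threshold:
                absorbed[i] += 1  # near-duplicate of an existing representative
                break
        else:
            representatives.append(chunk)
            absorbed.append(0)

    stats = {
        "original_count": len(chunks),
        "after_threshold": after_threshold,
        "after_dedup": len(representatives),
        # Assumed meaning: clusters that contained more than one chunk.
        "clusters_merged": sum(1 for n in absorbed if n > 0),
    }
    return representatives, stats
```

With four chunks where one falls below `min_score` and two are near-duplicates of each other, this sketch would return two representatives and report `after_threshold=3`, `after_dedup=2`. Greedy grouping compares each chunk only against already-kept representatives, so it avoids building the full pairwise matrix, at the cost of results that depend on score ordering.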
