# Implementation Plan: N-gram Fingerprinting for Deduplication
**Inspired by**: DeepSeek Engram's n-gram hash mapping for pattern detection
**Priority**: Medium value, Low effort
**Status**: Proposed
## Overview
Implement n-gram fingerprinting to detect and prevent storage of near-duplicate conversations and reflections. This applies Engram's deterministic hashing approach to the content storage pipeline rather than to retrieval.
## Problem Statement
Session Buddy currently has no deduplication:
```python
# Current: Every store operation creates a new record
await db.store_conversation(content, metadata) # Always inserts
await db.store_reflection(content, tags) # Always inserts
```
This leads to:
- Database bloat from repeated/similar content
- Search results polluted with near-duplicates
- Wasted embedding computation on duplicate content
- Degraded search relevance (duplicates dilute signal)
## Proposed Solution
### N-gram Fingerprinting Algorithm
Inspired by Engram's `NgramHashMapping`, but adapted for content deduplication:
```
┌─────────────────────────────────────────────────────────────┐
│ Content Storage Pipeline │
├─────────────────────────────────────────────────────────────┤
│ 1. Normalize content (lowercase, whitespace, unicode) │
│ 2. Extract n-grams (configurable n, default 3) │
│ 3. Hash each n-gram with xxhash │
│ 4. Create MinHash signature (k=128 hashes) │
│ 5. Compute Jaccard similarity with existing fingerprints │
│ 6. If similarity > threshold: skip/merge, else: store │
└─────────────────────────────────────────────────────────────┘
```
### Why MinHash over Simple Hash?
Simple content hashing (hashing the entire text) only catches **exact** duplicates.
MinHash catches **near-duplicates** at a configurable similarity threshold:
- "How do I fix this bug?" vs "How do I fix this bug" (trailing punctuation)
- Same content with different whitespace
- Minor typo corrections
- Content with timestamps or IDs that change
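To make the difference concrete, here is a small illustration using the `extract_ngrams` and `MinHashSignature` helpers defined in the components below. The two strings differ only in trailing punctuation, so their trigram sets overlap almost completely:

```python
import hashlib

a = "How do I fix this bug?"
b = "How do I fix this bug"

# Exact hashing: a single-character difference yields a different digest
assert hashlib.sha256(a.encode()).digest() != hashlib.sha256(b.encode()).digest()

# MinHash: the trigram sets share ~19 of 20 trigrams, so the estimated
# Jaccard similarity is ~0.95, comfortably above the 0.85 threshold
sig_a = MinHashSignature.from_ngrams(extract_ngrams(a))
sig_b = MinHashSignature.from_ngrams(extract_ngrams(b))
print(sig_a.jaccard_similarity(sig_b))  # ~0.95 expected
```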
### Components
#### 1. N-gram Extractor
```python
def extract_ngrams(text: str, n: int = 3) -> set[str]:
    """Extract character n-grams from normalized text.

    Character n-grams are more robust than word n-grams for:
    - Typo detection
    - Languages without clear word boundaries
    - Technical content with unusual tokenization

    Args:
        text: Input text (will be normalized internally)
        n: N-gram size (default 3, trigrams)

    Returns:
        Set of unique n-grams
    """
    # Normalize first so formatting differences don't change the n-grams
    normalized = normalize_for_fingerprint(text)
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i+n] for i in range(len(normalized) - n + 1)}
```
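`normalize_for_fingerprint()` is referenced above but only scheduled for Phase 1. A minimal sketch of what it might look like, assuming the normalization steps named in the pipeline diagram (lowercase, whitespace, unicode); the exact rules are an implementation choice:

```python
import re
import unicodedata

def normalize_for_fingerprint(text: str) -> str:
    """Normalize text before n-gram extraction."""
    # NFKC folds compatibility characters (e.g. full-width forms) together
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Collapse all whitespace runs to a single space
    return re.sub(r"\s+", " ", text).strip()
```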
#### 2. MinHash Signature Generator
```python
import struct
from dataclasses import dataclass

import xxhash

@dataclass
class MinHashSignature:
    """MinHash signature for approximate set similarity."""

    values: tuple[int, ...]  # k minimum hash values
    num_hashes: int = 128

    @classmethod
    def from_ngrams(cls, ngrams: set[str], num_hashes: int = 128) -> "MinHashSignature":
        """Generate MinHash signature from an n-gram set.

        Uses a different seed for each of the k hash functions (the same
        deterministic-variation idea as Engram's XOR-seeded hashing).
        """
        if not ngrams:
            return cls(values=tuple([0] * num_hashes), num_hashes=num_hashes)
        # For each seed, keep the minimum hash over all n-grams
        min_hashes = []
        for seed in range(num_hashes):
            min_hash = min(
                xxhash.xxh64(gram.encode(), seed=seed).intdigest()
                for gram in ngrams
            )
            min_hashes.append(min_hash)
        return cls(values=tuple(min_hashes), num_hashes=num_hashes)

    def jaccard_similarity(self, other: "MinHashSignature") -> float:
        """Estimate Jaccard similarity from MinHash signatures.

        Jaccard(A, B) ≈ |matching hash positions| / k
        """
        if self.num_hashes != other.num_hashes:
            raise ValueError("Signatures must have same num_hashes")
        matches = sum(a == b for a, b in zip(self.values, other.values))
        return matches / self.num_hashes

    def to_bytes(self) -> bytes:
        """Serialize for storage as big-endian unsigned 64-bit values."""
        return struct.pack(f">{self.num_hashes}Q", *self.values)

    @classmethod
    def from_bytes(cls, data: bytes, num_hashes: int = 128) -> "MinHashSignature":
        """Deserialize from storage."""
        values = struct.unpack(f">{num_hashes}Q", data)
        return cls(values=values, num_hashes=num_hashes)
```
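A quick illustrative round-trip, showing that serialization preserves the signature exactly:

```python
sig = MinHashSignature.from_ngrams(extract_ngrams("the quick brown fox"))
restored = MinHashSignature.from_bytes(sig.to_bytes())
assert restored.values == sig.values            # serialization round-trips
assert sig.jaccard_similarity(restored) == 1.0  # identical signatures
```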
#### 3. Fingerprint Storage Schema
```sql
-- Add to existing tables
ALTER TABLE conversations ADD COLUMN IF NOT EXISTS
    fingerprint BLOB;  -- MinHash signature bytes
ALTER TABLE reflections ADD COLUMN IF NOT EXISTS
    fingerprint BLOB;

-- Dedicated fingerprint index table for fast lookup
CREATE TABLE IF NOT EXISTS content_fingerprints (
    id TEXT PRIMARY KEY,
    content_type TEXT NOT NULL,  -- 'conversation' or 'reflection'
    content_id TEXT NOT NULL,
    fingerprint BLOB NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (content_type, content_id)
);

-- Index for filtering by type; DuckDB doesn't support custom index types
-- for similarity search, so candidate signatures are scanned in Python
CREATE INDEX IF NOT EXISTS idx_fingerprints_type ON content_fingerprints (content_type);
```
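DuckDB's Python API accepts `bytes` parameters for BLOB columns, so signatures round-trip without special handling. A minimal sketch (the database path and IDs are hypothetical):

```python
import duckdb

sig = MinHashSignature.from_ngrams(extract_ngrams("some conversation text"))

conn = duckdb.connect("session_buddy.db")  # hypothetical path
conn.execute(
    "INSERT INTO content_fingerprints (id, content_type, content_id, fingerprint) "
    "VALUES (?, ?, ?, ?)",
    ["fp-1", "conversation", "conv-1", sig.to_bytes()],
)
blob = conn.execute(
    "SELECT fingerprint FROM content_fingerprints WHERE content_id = ?",
    ["conv-1"],
).fetchone()[0]
restored = MinHashSignature.from_bytes(bytes(blob))
assert restored.values == sig.values
```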
#### 4. Deduplication Logic
```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class DeduplicationResult:
    """Result of a deduplication check."""

    is_duplicate: bool
    similar_id: str | None = None
    similarity: float = 0.0
    action: Literal["store", "skip", "merge"] = "store"

# Method on the reflection adapter
async def check_duplicate(
    self,
    content: str,
    content_type: Literal["conversation", "reflection"],
    threshold: float = 0.85,
) -> DeduplicationResult:
    """Check if content is a near-duplicate of existing content.

    Args:
        content: Content to check
        content_type: Type of content
        threshold: Jaccard similarity threshold (default 0.85 = 85% similar)

    Returns:
        DeduplicationResult with duplicate status and action
    """
    # Generate fingerprint for the new content
    ngrams = extract_ngrams(content, n=self.settings.fingerprint_ngram_size)
    new_sig = MinHashSignature.from_ngrams(ngrams, self.settings.fingerprint_num_hashes)

    # Query existing fingerprints, most recent first
    rows = self.conn.execute(
        """
        SELECT content_id, fingerprint
        FROM content_fingerprints
        WHERE content_type = ?
        ORDER BY created_at DESC
        LIMIT ?
        """,
        [content_type, self.settings.fingerprint_check_limit],
    ).fetchall()

    # Compare against recent content; return on first match above threshold
    for row in rows:
        existing_sig = MinHashSignature.from_bytes(
            row[1], num_hashes=self.settings.fingerprint_num_hashes
        )
        similarity = new_sig.jaccard_similarity(existing_sig)
        if similarity >= threshold:
            skip = similarity > self.settings.fingerprint_skip_threshold
            return DeduplicationResult(
                is_duplicate=True,
                similar_id=row[0],
                similarity=similarity,
                action="skip" if skip else "merge",
            )
    return DeduplicationResult(is_duplicate=False, action="store")
```
### Integration Points
#### File: `session_buddy/adapters/reflection_adapter_oneiric.py`
```python
async def store_conversation(
    self,
    content: str,
    metadata: dict[str, Any] | None = None,
    deduplicate: bool = True,  # NEW parameter
) -> str:
    """Store conversation with optional deduplication."""
    if deduplicate and self.settings.deduplication_enabled:
        dedup_result = await self.check_duplicate(content, "conversation")
        if dedup_result.action == "skip":
            logger.debug(
                f"Skipping duplicate conversation (similarity={dedup_result.similarity:.2%})"
            )
            return dedup_result.similar_id  # Return the existing record's ID
        if dedup_result.action == "merge":
            # Could merge metadata or update the timestamp instead
            logger.debug(
                f"Near-duplicate detected (similarity={dedup_result.similarity:.2%}), storing anyway"
            )

    # Generate fingerprint for the new content
    ngrams = extract_ngrams(content, n=self.settings.fingerprint_ngram_size)
    fingerprint = MinHashSignature.from_ngrams(
        ngrams, self.settings.fingerprint_num_hashes
    ).to_bytes()

    # Store with fingerprint
    conv_id = str(uuid.uuid4())
    # ... existing INSERT logic with fingerprint column ...

    # Also store in the fingerprint index table
    await self._store_fingerprint(conv_id, "conversation", fingerprint)
    return conv_id
```
### Configuration
Add to `session_buddy/adapters/settings.py`:
```python
class ReflectionAdapterSettings(BaseSettings):
    # Existing settings...

    # Deduplication settings
    deduplication_enabled: bool = True
    fingerprint_ngram_size: int = 3                  # Character trigrams
    fingerprint_num_hashes: int = 128                # MinHash signature size
    fingerprint_similarity_threshold: float = 0.85   # 85% similar = duplicate
    fingerprint_check_limit: int = 1000              # Check against last N items
    fingerprint_skip_threshold: float = 0.95         # >95% = skip entirely
```
### Merge Strategy for Near-Duplicates
When `action == "merge"` (similarity between 85-95%):
```python
async def merge_conversation(
    self,
    existing_id: str,
    new_content: str,
    new_metadata: dict[str, Any] | None,
) -> str:
    """Merge new content with an existing near-duplicate.

    Strategy:
    1. Keep the longer content (more information)
    2. Merge metadata (union of keys, newer values win)
    3. Update timestamp to most recent
    4. Increment a 'version' counter
    """
    existing = await self.get_conversation(existing_id)

    # Keep the longer content
    final_content = (
        new_content if len(new_content) > len(existing["content"])
        else existing["content"]
    )

    # Merge metadata; newer values overwrite older ones on key collisions
    final_metadata = {**existing.get("metadata", {}), **(new_metadata or {})}

    # Update the record in place
    self.conn.execute(
        """
        UPDATE conversations
        SET content = ?,
            metadata = ?,
            updated_at = CURRENT_TIMESTAMP,
            version = COALESCE(version, 1) + 1
        WHERE id = ?
        """,
        [final_content, json.dumps(final_metadata), existing_id],
    )
    return existing_id
```
### Metrics & Observability
```python
def get_deduplication_stats(self) -> dict[str, Any]:
    """Return deduplication metrics."""
    return {
        "total_checked": self._dedup_checks,
        "duplicates_found": self._dedup_found,
        "duplicates_skipped": self._dedup_skipped,
        "duplicates_merged": self._dedup_merged,
        "dedup_rate": self._dedup_found / self._dedup_checks if self._dedup_checks > 0 else 0,
        "fingerprints_stored": self._get_fingerprint_count(),
    }
```
## Implementation Steps
### Phase 1: Core Fingerprinting Module (2-3 hours)
1. [ ] Create `session_buddy/utils/fingerprint.py` with:
   - `normalize_for_fingerprint()` function
   - `extract_ngrams()` function
   - `MinHashSignature` dataclass
1. [ ] Add fingerprint settings to `ReflectionAdapterSettings`
1. [ ] Add unit tests for fingerprinting functions
### Phase 2: Database Schema (1 hour)
1. [ ] Add `fingerprint` column to `conversations` table
1. [ ] Add `fingerprint` column to `reflections` table
1. [ ] Create `content_fingerprints` index table
1. [ ] Add `version` column to `conversations` (used by the merge path)
1. [ ] Add migration script for existing databases (see the sketch below)
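A possible shape for the backfill migration, reusing the fingerprint helpers above after the schema changes have been applied (the `fp-` ID scheme is illustrative):

```python
def migrate_backfill_fingerprints(conn, num_hashes: int = 128, ngram_size: int = 3) -> None:
    """Compute fingerprints for rows created before deduplication existed."""
    rows = conn.execute(
        "SELECT id, content FROM conversations WHERE fingerprint IS NULL"
    ).fetchall()
    for conv_id, content in rows:
        sig = MinHashSignature.from_ngrams(
            extract_ngrams(content, n=ngram_size), num_hashes=num_hashes
        )
        # Write to both the table column and the fingerprint index
        conn.execute(
            "UPDATE conversations SET fingerprint = ? WHERE id = ?",
            [sig.to_bytes(), conv_id],
        )
        conn.execute(
            "INSERT INTO content_fingerprints (id, content_type, content_id, fingerprint) "
            "VALUES (?, 'conversation', ?, ?) "
            "ON CONFLICT (content_type, content_id) DO UPDATE SET fingerprint = excluded.fingerprint",
            [f"fp-{conv_id}", conv_id, sig.to_bytes()],
        )
```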
### Phase 3: Integration (2-3 hours)
1. [ ] Add `check_duplicate()` method to adapter
1. [ ] Modify `store_conversation()` to use deduplication
1. [ ] Modify `store_reflection()` to use deduplication
1. [ ] Implement `merge_conversation()` for near-duplicates
1. [ ] Add `_store_fingerprint()` helper method
### Phase 4: Testing (2-3 hours)
1. [ ] Unit tests for MinHash accuracy (known Jaccard values; see the sketch after this list)
1. [ ] Integration tests for duplicate detection
1. [ ] Edge case tests (empty content, very short content, unicode)
1. [ ] Performance benchmarks for fingerprint generation
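For the MinHash accuracy tests, one approach is to construct n-gram sets with a known exact Jaccard value and assert the estimate falls within the expected sampling error:

```python
import math

def test_minhash_estimates_known_jaccard():
    # Two sets with exact Jaccard = 50 / 150 ≈ 0.333
    a = {f"gram{i}" for i in range(100)}
    b = {f"gram{i}" for i in range(50, 150)}
    true_jaccard = len(a & b) / len(a | b)

    sig_a = MinHashSignature.from_ngrams(a, num_hashes=128)
    sig_b = MinHashSignature.from_ngrams(b, num_hashes=128)
    estimate = sig_a.jaccard_similarity(sig_b)

    # Standard error of the estimator is sqrt(p(1-p)/k); allow ~3 sigma
    tolerance = 3 * math.sqrt(true_jaccard * (1 - true_jaccard) / 128)
    assert abs(estimate - true_jaccard) <= tolerance
```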
### Phase 5: MCP Tool Exposure (1 hour)
1. [ ] Add `deduplication_stats` tool to expose metrics
1. [ ] Add `find_duplicates` tool to scan for existing duplicates
1. [ ] Update `reflection_stats` to include dedup stats
## Dependencies
**Optional dependency** (same as Feature 1):
```toml
[project.optional-dependencies]
performance = ["xxhash>=3.0"]
```
**Fallback**: Use `hashlib.blake2b` keyed per hash function (it has no `seed` argument, but its `key` parameter serves the same purpose).
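A minimal sketch of that fallback, deriving independent 64-bit hash functions from `blake2b`'s `key` parameter (the function name is illustrative):

```python
import hashlib

def seeded_hash64(data: bytes, seed: int) -> int:
    """64-bit seeded hash using only the standard library.

    blake2b accepts a key of up to 64 bytes; encoding the seed as the
    key yields an independent hash function per seed, mirroring the
    seed parameter of xxhash.xxh64 used above.
    """
    h = hashlib.blake2b(data, digest_size=8, key=seed.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big")
```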
## Expected Impact
| Metric | Before | After (Estimated) |
|--------|--------|-------------------|
| Database size growth | Unbounded | 15-30% reduction |
| Duplicate entries | Common | \<5% slip-through |
| Search result quality | Diluted by duplicates | Cleaner results |
| Embedding compute | Every store | Skip for duplicates |
| Store latency | ~10ms | ~12-15ms (fingerprint overhead) |
## Similarity Threshold Tuning
| Threshold | Effect |
|-----------|--------|
| 0.70 | Aggressive dedup, may merge distinct content |
| 0.80 | Balanced, catches most duplicates |
| **0.85** | **Recommended default** |
| 0.90 | Conservative, only clear duplicates |
| 0.95 | Very conservative, near-exact only |
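One caveat when tuning: with k = 128 hashes the MinHash estimate is itself noisy. Each of the k positions is a Bernoulli trial that matches with probability equal to the true Jaccard similarity p, so the estimator's standard error is:

$$\mathrm{SE} = \sqrt{\frac{p(1-p)}{k}} = \sqrt{\frac{0.85 \times 0.15}{128}} \approx 0.032$$

Thresholds closer together than roughly 0.03 are therefore not meaningfully distinguishable at k = 128; increase `fingerprint_num_hashes` if finer-grained thresholds are needed.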
## Risks & Mitigations
| Risk | Mitigation |
|------|------------|
| False positives (distinct content marked duplicate) | Conservative threshold (0.85), option to disable |
| Fingerprint computation overhead | O(n·k) for content length n and k=128 hash functions; small for typical content |
| Storage overhead for fingerprints | 1KB per item (128 × 8-byte hashes), minimal |
| Index scan performance | Limit check to recent N items, can add LSH for larger scale |
## Future Enhancements (Not in Scope)
1. **Locality-Sensitive Hashing (LSH)**: For O(1) approximate nearest neighbor lookup instead of scanning
1. **Hierarchical fingerprints**: Different n-gram sizes for coarse-to-fine matching
1. **Semantic deduplication**: Combine fingerprints with embedding similarity
1. **Cross-project deduplication**: Dedupe across all projects (currently per-project)
## Success Criteria
- [ ] >90% of exact duplicates detected
- [ ] >70% of near-duplicates (>85% similar) detected
- [ ] \<1% false positive rate (distinct content incorrectly merged)
- [ ] Store latency increase \<50% (target \<15ms)
- [ ] All existing tests pass
- [ ] New dedup tests achieve >90% coverage
## References
- MinHash algorithm: https://en.wikipedia.org/wiki/MinHash
- Jaccard similarity: https://en.wikipedia.org/wiki/Jaccard_index
- DeepSeek Engram: https://github.com/deepseek-ai/Engram
- xxhash Python: https://github.com/ifduyue/python-xxhash
- DuckDB BLOB handling: https://duckdb.org/docs/sql/data_types/blob