# Reddit MCP Server - Vector DB Integration Summary

## Quick Reference

**File**: `/Users/chrisivester/Documents/mbp-obsidian-vault/02-Projects/software-projects/MCP/reddit-mcp-poc/reddit-research-mcp/VECTOR_DB_ANALYSIS.md` (21 detailed sections)

---

## Current State

### What It Does

- Semantic search across 20,000+ indexed subreddits via HTTP proxy
- Discovers relevant communities based on natural language queries
- Ranks results using confidence scores (0.0-1.0)
- Supports batch queries and NSFW filtering
- Integrates with FastMCP context for progress reporting

### How It Works

```
Query → ChromaProxyClient → HTTP Proxy (Render) → ChromaDB Cloud
                        ↓
             Vector Search (Euclidean)
                        ↓
           Returns distances + metadata
                        ↓
    Process: Distance→Confidence, Filter, Sort
                        ↓
          Return ranked subreddit list
```

### Key Components

| Component | Location | Purpose |
|-----------|----------|---------|
| `ChromaProxyClient` | `chroma_client.py:16-84` | HTTP proxy client |
| `discover_subreddits()` | `discover.py:10-98` | Main entry point |
| `_search_vector_db()` | `discover.py:101-248` | Search logic |
| `validate_subreddit()` | `discover.py:251-310` | Exact match check |

---

## API Surface (What's Exposed)

### Parameters

```python
discover_subreddits(
    query: str = None,          # Single search term
    queries: List[str] = None,  # Batch queries (mutually exclusive)
    limit: int = 10,            # Results per query (1-50)
    include_nsfw: bool = False, # Include adult content
    ctx: Context = None         # Progress reporting
)
```

### Return Values

**Per subreddit**:

- `name` - Subreddit name
- `subscribers` - Subscriber count
- `confidence` - Match score (0.0-1.0)
- `url` - Reddit URL

**Per query**:

- `query` - Query echo
- `subreddits` - Array of results
- `summary.total_found` - Total matches
- `summary.returned` - Results shown
- `summary.has_more` - More available
- `next_actions` - Suggestions

---

## What Vector DB Provides vs What's Exposed

### Currently Used

- [x] Semantic similarity search (via distance)
- [x] Top-K retrieval (up to 100)
- [x] Distance scores
- [x] Metadata (name, subscribers, nsfw, url)
- [x] Collection counting

### Currently NOT Exposed

- [ ] Raw distance scores
- [ ] Match type classifications
- [ ] Metadata filtering (WHERE clauses)
- [ ] Embedding vectors
- [ ] Search latency/timing
- [ ] Collection statistics
- [ ] Advanced filtering options

---

## Confidence Score Deep Dive

### NOT Based On

- Statistical significance
- Cross-validation metrics
- Any formal ML evaluation

### IS Based On

- Heuristic distance→confidence mapping
- Observed distance distribution (0.8-1.6+ range)
- Business rules (penalties/boosts)

### Mapping Formula

```
# Piecewise linear conversion
distance < 0.8 → confidence 0.9-1.0 (excellent)
distance < 1.0 → confidence 0.7-0.9 (very good)
distance < 1.2 → confidence 0.5-0.7 (good)
distance < 1.4 → confidence 0.3-0.5 (fair)
distance ≥ 1.4 → confidence 0.1-0.3 (weak)

# Then apply adjustments:
- Generic subs penalty: ×0.3 (if 'funny', 'pics', etc.)
- Large subs boost:     ×1.1 (if >1M subscribers)
- Small subs penalty:   ×0.9 (if <10K subscribers)
```
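To make the mapping concrete, here is a minimal runnable Python sketch. The band boundaries and multipliers come from the table above; the linear interpolation inside each band, the weak-band slope, and the generic-subreddit list are illustrative assumptions — the authoritative logic lives in `discover.py:101-248` and may differ in detail.

```python
# Illustrative sketch only; the real implementation is in discover.py.
# Interpolation within each band and the weak-band slope are assumptions.

GENERIC_SUBS = {"funny", "pics", "videos", "askreddit"}  # hypothetical examples

def distance_to_confidence(distance: float) -> float:
    """Map a Euclidean distance to a 0.0-1.0 confidence score, piecewise linearly."""
    bands = [
        # (lower, upper, confidence at lower, confidence at upper)
        (0.0, 0.8, 1.0, 0.9),  # excellent
        (0.8, 1.0, 0.9, 0.7),  # very good
        (1.0, 1.2, 0.7, 0.5),  # good
        (1.2, 1.4, 0.5, 0.3),  # fair
    ]
    for lo, hi, c_hi, c_lo in bands:
        if distance < hi:
            frac = (distance - lo) / (hi - lo)
            return c_hi - frac * (c_hi - c_lo)
    # distance ≥ 1.4 → weak band; decay toward the 0.1 floor (slope assumed)
    return max(0.1, 0.3 - (distance - 1.4) * 0.5)

def adjust_confidence(confidence: float, name: str, subscribers: int) -> float:
    """Apply the business-rule penalties/boosts, capped at 1.0."""
    if name.lower() in GENERIC_SUBS:
        confidence *= 0.3   # generic-subreddit penalty
    if subscribers > 1_000_000:
        confidence *= 1.1   # large-subreddit boost
    elif subscribers < 10_000:
        confidence *= 0.9   # small-subreddit penalty
    return min(confidence, 1.0)
```

Worked example: a 1.5M-subscriber community at distance 0.9 maps to 0.8, and the large-subreddit boost lifts it to 0.88.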
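A matching unit-test sketch for the five ranges (pytest assumed, reusing `distance_to_confidence` from the sketch above); the Testing Strategy section below lists this conversion as a needed test:

```python
import pytest  # assumed test framework

# One spot-check per band; expected ranges come from the mapping table above.
@pytest.mark.parametrize(
    "distance, low, high",
    [
        (0.4, 0.9, 1.0),  # excellent
        (0.9, 0.7, 0.9),  # very good
        (1.1, 0.5, 0.7),  # good
        (1.3, 0.3, 0.5),  # fair
        (1.5, 0.1, 0.3),  # weak
    ],
)
def test_distance_to_confidence_bands(distance, low, high):
    assert low <= distance_to_confidence(distance) <= high
```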
---

## Environment Setup

```bash
# Required
REDDIT_CLIENT_ID=<your-id>
REDDIT_CLIENT_SECRET=<your-secret>

# Optional (defaults provided)
CHROMA_PROXY_URL=https://reddit-mcp-vector-db.onrender.com
CHROMA_PROXY_API_KEY=<optional>
REDDIT_USER_AGENT=RedditMCP/1.0
```

---

## Enhancement Roadmap

### Phase 2a: Quick Wins (1-2 hours each)

1. Expose raw distance scores
2. Return match tier labels (exact/strong/partial/weak)
3. Include NSFW filter count in summary
4. Add confidence statistics (mean/median)

### Phase 2b: Medium Features (3-4 hours each)

5. Add `min_confidence` filter parameter (see the sketch at the end of this document)
6. Add subscriber range filters
7. Add result diversity modes (focused/balanced/diverse)

### Phase 2c: Advanced Features (6+ hours each)

8. Similar subreddits operation
9. Batch query overlap analysis
10. Collection coverage analysis

---

## Code Quality Checklist

- [x] ChromaProxyClient is minimal (~70 lines of logic) and maintainable
- [x] Error handling with specific guidance
- [x] Batch mode support
- [x] Progress reporting integration
- [x] Metadata-driven filtering (no extra queries)
- [ ] Distance scores exposed to users
- [ ] No architectural debt
- [ ] Test coverage for distance→confidence
- [ ] Test coverage for filtering logic

---

## Testing Strategy

**Unit tests needed**:

- Distance→confidence conversion (all 5 ranges; see the sketch above)
- Generic subreddit penalty application
- Subscriber-based adjustments
- NSFW filtering behavior

**Integration tests needed**:

- Single query end-to-end
- Batch query execution
- Error recovery paths
- Exact match validation

---

## Known Limitations

1. **Distance not exposed** - Users can't set custom thresholds
2. **No WHERE clause support** - Can't filter by subscriber range in the DB
3. **Static confidence algorithm** - No learning from feedback
4. **Collection not updatable** - Would need a separate pipeline
5. **Batch limit untested** - Probably ~10-20 queries max

---

## Performance Notes

- Typical response: <2 seconds (network + search + processing)
- Search limit: 100 results max
- Batch overhead: minimal (sequential calls, not parallel)
- Bottleneck: network latency, not compute

---

## Context Integration Status

**Current** (Phase 1):

- [x] Accepts `ctx: Context` parameter
- [x] Uses `ctx.report_progress()` for streaming
- [x] Reports progress during filtering

**Not yet used for**:

- Filtering/ranking decisions
- Caching of results
- Request tracking
- Analytics collection

---

## Key Files

```
reddit-research-mcp/src/
├── chroma_client.py   ← Vector DB proxy (164 lines)
├── tools/discover.py  ← Discovery operations (310 lines)
├── server.py          ← MCP operations dispatcher
├── models.py          ← Data structures
└── config.py          ← Reddit API setup
```

**All vector DB integration lives in two files:**

- `chroma_client.py` (~70 lines of actual logic)
- `discover.py` (~150 lines for search + validation)

---

## For Phase 2a/2b/2c Development

### Starting Point

1. Read VECTOR_DB_ANALYSIS.md (full details)
2. Review Section 18 (Enhancement Roadmap)
3. Check the file/line references under Key Components for where to modify
4. Run the existing tests to establish a baseline

### Making Changes

1. Add parameters to the `discover_subreddits()` signature
2. Update the schema in `server.py:get_operation_schema()`
3. Implement the filtering logic in `_search_vector_db()`
4. Add tests for the new features
5. Update docstrings

### Testing

- Mock ChromaDB responses
- Test confidence calculations
- Test filter combinations
- Test error cases

---

## Questions This Analysis Answers

✓ How is the vector DB integrated?
✓ What parameters does `discover_subreddits` accept?
✓ What data is currently returned?
✓ How are distance scores converted to confidence?
✓ What vector DB capabilities exist but aren't exposed?
✓ Where should new features be added?
✓ What's the performance impact?
✓ How does context integration work?
✓ What are good next features to build?
✓ How is error handling implemented?
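To make the Making Changes recipe concrete, here is a minimal, hedged sketch of the Phase 2b `min_confidence` filter. Only the parameter name comes from the roadmap; the helper name, result shape, and placement are assumptions for illustration.

```python
from typing import Any

def filter_by_confidence(
    results: list[dict[str, Any]], min_confidence: float = 0.0
) -> list[dict[str, Any]]:
    """Drop results below a caller-supplied confidence floor.

    Sketch only: in the real code this would sit in _search_vector_db()
    after the distance→confidence conversion, with min_confidence added to
    the discover_subreddits() signature and the server.py schema.
    """
    return [r for r in results if r.get("confidence", 0.0) >= min_confidence]

# Usage with the per-subreddit fields documented in the API Surface section:
candidates = [
    {"name": "MechanicalKeyboards", "subscribers": 1_200_000, "confidence": 0.88},
    {"name": "keyboards", "subscribers": 45_000, "confidence": 0.41},
]
print(filter_by_confidence(candidates, min_confidence=0.5))
# → keeps only the 0.88-confidence result
```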
