# Vector Database Integration - Documentation Index

Complete analysis of the Reddit MCP server's vector database integration with ChromaDB.

## Quick Navigation

### For Different Audiences

**Project Managers / Stakeholders**
- Start: [VECTOR_DB_SUMMARY.md](VECTOR_DB_SUMMARY.md) - 5 min overview
- Contains: Key features, current state, enhancement roadmap

**Developers - Starting Implementation**
- Start: [VECTOR_DB_QUICK_REF.txt](VECTOR_DB_QUICK_REF.txt) - 1 page reference
- Contains: File locations, API parameters, code modification checklist

**Architects / Technical Leads**
- Start: [VECTOR_DB_ANALYSIS.md](VECTOR_DB_ANALYSIS.md) - Comprehensive deep dive
- Contains: All 7 analysis tasks fully answered with exact line numbers

**Code Reviewers**
- Section 9: VECTOR_DB_ANALYSIS.md - Current Capabilities
- Section 10: VECTOR_DB_ANALYSIS.md - Unavailable Capabilities

**Phase 2a/2b/2c Implementers**
- Section 18: VECTOR_DB_ANALYSIS.md - Enhancement Roadmap
- Section 15: VECTOR_DB_ANALYSIS.md - Code Locations

---

## Document Overview

### VECTOR_DB_ANALYSIS.md

Comprehensive technical analysis (885 lines, 21 sections)

| Section | Content | Purpose |
|---------|---------|---------|
| 1 | Architecture Overview | System design and components |
| 2 | Vector DB Client Implementation | ChromaProxyClient class details |
| 3 | discover_subreddits Complete Flow | Entry point and parameters |
| 4 | Vector DB Query Characteristics | Search behavior and distance handling |
| 5 | Available Metadata | What ChromaDB contains |
| 6 | validate_subreddit Helper | Validation operation details |
| 7 | Vector DB Integration Points | How other operations use vector DB |
| 8 | MCP Server Integration | Three-layer architecture |
| 9 | Current Capabilities | What's exposed ✓ and what isn't ✗ |
| 10 | Capabilities NOT Exposed | Opportunities for enhancement |
| 11 | Confidence Calculation Deep Dive | Complete algorithm explanation |
| 12 | Error Recovery & Guidance | Error handling patterns |
| 13 | Collection Schema | Data structure and metadata |
| 14 | Performance Characteristics | Timing and limits |
| 15 | Code Locations Reference | Exact file and line numbers |
| 16 | Environment Configuration | Setup and defaults |
| 17 | Phase 1 Context Integration Status | Current and future usage |
| 18 | Enhancement Recommendations | Phase 2a/2b/2c features |
| 19 | Architecture Diagrams | Visual flows |
| 20 | Testing & Validation Points | Test strategy |
| 21 | Key Takeaways | Summary and outlook |

### VECTOR_DB_SUMMARY.md

Executive summary (276 lines)

- Current state overview
- API surface (parameters and returns)
- Confidence score deep dive
- Enhancement roadmap (quick version)
- Phase 1 context integration status
- Known limitations

### VECTOR_DB_QUICK_REF.txt

Single-page reference card (220 lines)

- Architecture flow diagram
- Parameter tables
- Response structure examples
- Confidence calculation formula
- File locations with line numbers
- Performance characteristics
- Error handling guide
- Testing checklist

---

## Key Files Analyzed

### Primary Vector DB Integration (220 lines total logic)

**src/chroma_client.py** (164 lines)
- HTTP proxy client abstraction
- Connection management
- Error handling for auth/rate limits
- Collection interface wrapper

**src/tools/discover.py** (310 lines)
- Main discovery entry point
- Batch query handling
- Vector search implementation
- NSFW filtering
- Confidence calculation
- Result sorting and limiting

### Secondary Integration Points

**src/server.py** (607 lines)
- Three-layer MCP architecture
- Operation dispatcher
- Schema definitions

**src/models.py** (60 lines)
- Data structures for results

**src/config.py** (46 lines)
- Reddit API configuration

**src/resources.py** (212 lines)
- Server info endpoint

---

## Exact Code Locations

### Functions to Know

| Function | File | Lines | Purpose |
|----------|------|-------|---------|
| `discover_subreddits()` | discover.py | 10-98 | Main entry point (async) |
| `_search_vector_db()` | discover.py | 101-248 | Search implementation (async) |
| `validate_subreddit()` | discover.py | 251-310 | Exact match validation |
| `get_chroma_client()` | chroma_client.py | 89-104 | Client initialization |
| `get_collection()` | chroma_client.py | 113-130 | Collection access |
| `ChromaProxyClient.query()` | chroma_client.py | 31-51 | HTTP query method |
| `execute_operation()` | server.py | 378-428 | Operation dispatcher (async) |
| `get_operation_schema()` | server.py | 174-372 | Schema definitions |

### Making Changes (Modification Points)

| Feature | File | Lines | Effort |
|---------|------|-------|--------|
| Add parameter | discover.py | 10-20 | Low |
| Update schema | server.py | 189-223 | Low |
| Implement logic | discover.py | 101-248 | Medium |
| Add filtering | discover.py | 150-203 | Low-Medium |
| Add response data | discover.py | 192-224 | Low |

---

## API Reference

### Current Parameters

```python
discover_subreddits(
    query: str = None,           # Single search
    queries: List[str] = None,   # Batch (preferred)
    limit: int = 10,             # Results per query (1-50)
    include_nsfw: bool = False,  # Adult content
    ctx: Context = None          # Progress reporting
)
```

### Current Response

```json
{
  "query": "search term",
  "subreddits": [
    {
      "name": "subreddit",
      "subscribers": 1000000,
      "confidence": 0.95,
      "url": "https://reddit.com/r/subreddit"
    }
  ],
  "summary": {
    "total_found": 142,
    "returned": 10,
    "has_more": true
  },
  "next_actions": ["suggestions"]
}
```

---

## Enhancement Roadmap

### Phase 2a: Quick Wins (1-2h each)

1. Expose raw distance scores
2. Add match tier labels
3. Include NSFW filter count
4. Add confidence statistics

### Phase 2b: Medium Features (3-4h each)

5. Add min_confidence filter
6. Add subscriber range filters
7. Add diversity modes

### Phase 2c: Advanced (6+h each)

8. Similar subreddits operation
9. Batch query analysis
10. Collection introspection

---

## Confidence Scoring Formula

### Distance → Base Confidence (Piecewise Linear)

```
distance < 0.8  → 0.9-1.0 (excellent)
0.8-1.0         → 0.7-0.9 (very good)
1.0-1.2         → 0.5-0.7 (good)
1.2-1.4         → 0.3-0.5 (fair)
>= 1.4          → 0.1-0.3 (weak)
```

### Post-Processing Rules

```
if generic_sub AND not_directly_searched:
    confidence *= 0.3
if subscribers > 1_000_000:
    confidence = min(1.0, confidence * 1.1)
if subscribers < 10_000:
    confidence *= 0.9
```

---

## Vector DB Details

### Collection: reddit_subreddits

- **Size**: 20,000+ subreddits
- **Metric**: Euclidean distance
- **Metadata Fields**: name, subscribers, nsfw, url, (description?, active?)
- **Query Limit**: 100 results max
- **Update Frequency**: Unknown (static for MVP)

### Query Process

1. User provides search term
2. ChromaProxyClient sends HTTP POST to `/query`
3. ChromaDB performs vector similarity search
4. Returns metadata + distance scores
5. Server filters, scores, sorts results
6. Returns to user with confidence scores

### Performance

- Typical response: <2 seconds
- Bottleneck: Network latency (not compute)
- Batch overhead: Minimal (sequential calls)

---

## Testing Checklist

### Unit Tests

- [ ] Distance→Confidence conversion (all 5 ranges)
- [ ] Generic subreddit penalty
- [ ] Subscriber adjustments
- [ ] NSFW filtering

### Integration Tests

- [ ] Single query end-to-end
- [ ] Batch query execution
- [ ] Error recovery
- [ ] Exact match validation

---

## Environment Variables

```bash
# Required
REDDIT_CLIENT_ID=<app-id>
REDDIT_CLIENT_SECRET=<app-secret>

# Optional (defaults provided)
CHROMA_PROXY_URL=https://reddit-mcp-vector-db.onrender.com
CHROMA_PROXY_API_KEY=<api-key>
REDDIT_USER_AGENT=RedditMCP/1.0
```

---

## Error Handling Guide

| HTTP Status | Message | Guidance |
|------------|---------|----------|
| 401 | Auth failed: API key required | Set CHROMA_PROXY_API_KEY |
| 403 | Auth failed: Invalid API key | Verify API key |
| 429 | Rate limit exceeded | Wait before retry |
| Timeout | Failed to query | Reduce limit parameter |

---

## Questions This Analysis Answers

1. How is the vector DB integrated? (Section 1)
2. What parameters does discover_subreddits accept? (Section 3)
3. What data does it return? (Section 3)
4. Are distance scores/embeddings exposed? (Section 9)
5. How is confidence calculated? (Section 11)
6. What vector DB capabilities aren't exposed? (Section 10)
7. What's the operation naming/structure pattern? (Section 8)
8. How are responses structured? (Section 3)
9. What filtering/ranking logic exists? (Section 4)
10. Are there performance characteristics/limitations? (Section 14)

---

## Getting Started

1. **For Overview**: Read VECTOR_DB_SUMMARY.md (5 min)
2. **For Reference**: Print VECTOR_DB_QUICK_REF.txt
3. **For Deep Dive**: Read VECTOR_DB_ANALYSIS.md (30 min)
4. **For Implementation**: Use Section 18 (Enhancement Recommendations)
5. **For Code Changes**: Reference Section 15 (Code Locations)

---

## Context Integration Status

**Current** (Phase 1):
- Accepts `ctx: Context` parameter
- Uses `ctx.report_progress()` for streaming
- Reports progress during result filtering

**Available for Phase 2+**:
- Filtering/ranking decisions
- Result caching
- Request tracking
- Analytics collection

---

## Key Takeaway

The Reddit MCP server has a **clean, minimal** vector DB integration:

- Only 2 files contain vector DB logic (~220 lines actual code)
- Clear separation: client abstraction + search implementation
- Pragmatic confidence scoring (heuristic-based)
- Good error handling with user guidance
- Ready for incremental Phase 2 enhancements

All changes are **low-risk** due to isolated logic and comprehensive error handling.

---

Generated: 2025-10-29

Analysis Status: Complete (All 7 tasks answered)

Total Documentation: 1,381 lines across 3 files + 1 index
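
The confidence scoring formula listed in this index can be sketched in Python as follows. This is a minimal illustration, not the code in `discover.py`: the function names `base_confidence` and `adjust_confidence`, the `is_generic` and `directly_searched` flags, and the two outer breakpoints (distance 0.0 mapping to 1.0, and distance 2.0 mapping to 0.1) are assumptions made here so that the tiers interpolate linearly. Only the tier boundaries and the post-processing multipliers come from the tables in this document.

```python
from bisect import bisect_right

# (distance, confidence) breakpoints. The inner boundaries (0.8, 1.0, 1.2, 1.4)
# follow the tier table; the outer endpoints (0.0 -> 1.0 and 2.0 -> 0.1) are
# assumptions chosen to keep the curve continuous.
BREAKPOINTS = [(0.0, 1.0), (0.8, 0.9), (1.0, 0.7), (1.2, 0.5), (1.4, 0.3), (2.0, 0.1)]


def base_confidence(distance: float) -> float:
    """Map a vector distance to a base confidence by linear interpolation."""
    if distance <= BREAKPOINTS[0][0]:
        return BREAKPOINTS[0][1]
    if distance >= BREAKPOINTS[-1][0]:
        return BREAKPOINTS[-1][1]
    xs = [d for d, _ in BREAKPOINTS]
    i = bisect_right(xs, distance) - 1  # index of the segment's left endpoint
    (d0, c0), (d1, c1) = BREAKPOINTS[i], BREAKPOINTS[i + 1]
    t = (distance - d0) / (d1 - d0)
    return c0 + t * (c1 - c0)


def adjust_confidence(
    confidence: float,
    subscribers: int,
    is_generic: bool = False,
    directly_searched: bool = False,
) -> float:
    """Apply the post-processing rules in the order they are listed."""
    if is_generic and not directly_searched:
        confidence *= 0.3  # penalty for generic subreddits
    if subscribers > 1_000_000:
        confidence = min(1.0, confidence * 1.1)  # boost large communities, capped at 1.0
    if subscribers < 10_000:
        confidence *= 0.9  # slight penalty for small communities
    return confidence


if __name__ == "__main__":
    score = adjust_confidence(base_confidence(0.9), subscribers=250_000)
    print(f"confidence: {score:.2f}")
```

A table of breakpoints keeps the tier boundaries in one place, which makes the "all 5 ranges" unit test from the testing checklist straightforward to write against the boundary values.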
