================================================================================
REDDIT MCP - VECTOR DB QUICK REFERENCE
================================================================================
ARCHITECTURE FLOW:
─────────────────────────────────────────────────────────────────────────────
Query (user)
  └─> discover_subreddits()                  [src/tools/discover.py:10]
      └─> _search_vector_db()                [src/tools/discover.py:101]
          └─> get_chroma_client()            [src/chroma_client.py:89]
              └─> ChromaProxyClient.query()  [HTTP POST to Render]
                  └─> ChromaDB Cloud         [Vector search]
                      └─> Returns: {metadatas, distances}
          └─> Process results (filter, score, sort)
      └─> Return to user
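The "Process results (filter, score, sort)" step can be sketched as below. This is a minimal illustration, not the code in discover.py: it assumes the proxy returns the usual ChromaDB query shape (one inner list per query under "metadatas" and "distances"), that NSFW status lives in an "nsfw" metadata field, and that scoring is supplied by the caller. The function name process_results and its signature are hypothetical.

```python
from typing import Any, Callable


def process_results(
    result: dict[str, Any],
    score: Callable[[float], float],
    include_nsfw: bool = False,
) -> list[dict[str, Any]]:
    """Filter, score, and sort the hits for a single query."""
    # Assumption: Chroma-style results hold one inner list per query.
    metadatas = result["metadatas"][0]
    distances = result["distances"][0]

    hits = []
    for meta, distance in zip(metadatas, distances):
        # Filter: drop adult communities unless explicitly requested.
        if meta.get("nsfw", False) and not include_nsfw:
            continue
        # Score: convert the raw distance with the caller's function.
        hits.append({
            "name": meta["name"],
            "subscribers": meta.get("subscribers", 0),
            "confidence": round(score(distance), 3),
            "url": meta.get("url", f"https://reddit.com/r/{meta['name']}"),
        })

    # Sort: highest confidence first.
    hits.sort(key=lambda hit: hit["confidence"], reverse=True)
    return hits
```

Passing the scoring function in keeps this step independent of the confidence heuristic, so either can be tested on its own.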
PARAMETERS:
─────────────────────────────────────────────────────────────────────────────
Function: discover_subreddits(query, queries, limit, include_nsfw, ctx)

Parameter      Type          Default   Description
query          string        None      Single search term
queries        list|string   None      Multiple queries [preferred]
limit          integer       10        Results per query (1-50)
include_nsfw   boolean       False     Include adult content
ctx            Context       None      Progress reporting
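One way the query/queries/limit arguments could be normalized before searching is sketched below. The helper names (normalize_queries, clamp_limit) are hypothetical. The sketch assumes that a string passed as queries may be either a JSON-encoded list or a single term, and that queries takes priority over query because the table marks it as preferred. The real function may behave differently.

```python
import json
from typing import Optional, Union

MIN_LIMIT, MAX_LIMIT = 1, 50  # range stated in the parameter table


def normalize_queries(
    query: Optional[str] = None,
    queries: Union[list[str], str, None] = None,
) -> list[str]:
    """Return a clean list of search terms from either argument."""
    if queries is not None:
        if isinstance(queries, str):
            # Assumption: a string may hold a JSON array such as '["a", "b"]'.
            try:
                parsed = json.loads(queries)
            except json.JSONDecodeError:
                parsed = queries
            queries = parsed if isinstance(parsed, list) else [queries]
        cleaned = [str(q).strip() for q in queries if str(q).strip()]
        if cleaned:
            return cleaned
    if query and query.strip():
        return [query.strip()]
    raise ValueError("Provide either 'query' or 'queries'")


def clamp_limit(limit: int = 10) -> int:
    """Keep limit inside the documented 1-50 range."""
    return max(MIN_LIMIT, min(MAX_LIMIT, limit))
```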
RESPONSE STRUCTURE:
─────────────────────────────────────────────────────────────────────────────
{
  "query": "machine learning",
  "subreddits": [
    {
      "name": "MachineLearning",
      "subscribers": 1500000,
      "confidence": 0.95,        ← Distance→Confidence (heuristic)
      "url": "https://reddit.com/r/MachineLearning"
    },
    ... more results ...
  ],
  "summary": {
    "total_found": 142,          ← Total matches before limit
    "returned": 10,              ← Results shown
    "has_more": true             ← More available
  },
  "next_actions": [...]
}
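The summary block follows directly from the match list and the limit. The sketch below shows how those three fields could be derived. The function build_response is hypothetical, and next_actions is left empty because its contents are elided in the example above.

```python
from typing import Any


def build_response(
    query: str,
    matches: list[dict[str, Any]],
    limit: int,
) -> dict[str, Any]:
    """Assemble a response with the documented top-level keys."""
    returned = matches[:limit]  # apply the per-query limit
    return {
        "query": query,
        "subreddits": returned,
        "summary": {
            "total_found": len(matches),               # before the limit
            "returned": len(returned),                 # after the limit
            "has_more": len(matches) > len(returned),  # anything cut off?
        },
        "next_actions": [],  # placeholder; contents not specified in the source
    }
```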
CONFIDENCE CALCULATION:
─────────────────────────────────────────────────────────────────────────────
Step 1: Distance → Base Confidence (Piecewise Linear)
  distance < 0.8           → confidence 0.9-1.0
  0.8 <= distance < 1.0    → confidence 0.7-0.9
  1.0 <= distance < 1.2    → confidence 0.5-0.7
  1.2 <= distance < 1.4    → confidence 0.3-0.5
  distance >= 1.4          → confidence 0.1-0.3
Step 2: Apply Business Rules
  IF generic_sub (funny, pics, gifs, etc.) AND not_directly_searched:
      confidence *= 0.3                          (Heavy penalty)
  IF subscribers > 1_000_000:
      confidence = min(confidence * 1.1, 1.0)    (Small boost)
  IF subscribers < 10_000:
      confidence *= 0.9                          (Small penalty)
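The two steps can be sketched as a pair of small functions. Several details are assumptions rather than facts from the codebase: confidence is interpolated linearly inside each range, distance 0.0 maps to 1.0, the last range bottoms out at 0.1 at a hypothetical distance of 2.0, "not directly searched" is taken to mean the subreddit name does not appear in the query text, and the final rounding is added only for readability. The generic subreddit set is illustrative, since the source ends its list with "etc".

```python
# (distance, confidence) breakpoints from the table above.
# Assumed endpoints: 0.0 -> 1.0 at the top, 2.0 -> 0.1 at the bottom.
BREAKPOINTS = [(0.0, 1.0), (0.8, 0.9), (1.0, 0.7), (1.2, 0.5), (1.4, 0.3), (2.0, 0.1)]
GENERIC_SUBS = {"funny", "pics", "gifs"}  # illustrative subset


def base_confidence(distance: float) -> float:
    """Step 1: piecewise-linear mapping from distance to confidence."""
    if distance <= BREAKPOINTS[0][0]:
        return BREAKPOINTS[0][1]
    for (d0, c0), (d1, c1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if distance < d1:
            t = (distance - d0) / (d1 - d0)  # position inside the segment
            return c0 + t * (c1 - c0)
    return BREAKPOINTS[-1][1]  # beyond the last breakpoint


def adjusted_confidence(distance: float, name: str, subscribers: int, query: str) -> float:
    """Step 2: apply the generic-subreddit and subscriber rules."""
    confidence = base_confidence(distance)
    # Assumption: "directly searched" means the name appears in the query.
    if name.lower() in GENERIC_SUBS and name.lower() not in query.lower():
        confidence *= 0.3  # heavy penalty
    if subscribers > 1_000_000:
        confidence = min(confidence * 1.1, 1.0)  # small boost, capped
    elif subscribers < 10_000:
        confidence *= 0.9  # small penalty
    return round(confidence, 3)
```

Keeping the breakpoints in one list means a retuned threshold changes a single line, and the unit tests for the five ranges can iterate over the same data.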
FILE LOCATIONS:
─────────────────────────────────────────────────────────────────────────────
src/chroma_client.py              164 lines   Vector DB proxy client
  └─ ChromaProxyClient            16-84       HTTP client
  └─ ProxyCollection              72-83       Collection wrapper
  └─ get_chroma_client()          89-104      Singleton initialization
  └─ get_collection()             113-130     Collection access
  └─ test_connection()            133-164     Connection test
src/tools/discover.py             310 lines   Discovery operations
  └─ discover_subreddits()        10-98       Entry point (async)
  └─ _search_vector_db()          101-248     Search implementation (async)
  └─ validate_subreddit()         251-310     Exact match checker
src/server.py                     607 lines   MCP server
  └─ discover_operations()        142-171     Layer 1: See operations
  └─ get_operation_schema()       174-372     Layer 2: Get parameters
  └─ execute_operation()          378-428     Layer 3: Execute
CURRENT CAPABILITIES:
─────────────────────────────────────────────────────────────────────────────
EXPOSED:
✓ Semantic search (via distance)
✓ Top-K retrieval (1-100)
✓ Confidence scores (0.0-1.0)
✓ Batch queries
✓ NSFW filtering
✓ Progress reporting (ctx)
✓ Subscriber count
✓ Subreddit names/URLs
NOT EXPOSED:
✗ Raw distance scores
✗ Match type tiers
✗ Metadata filters (WHERE)
✗ Embedding vectors
✗ Search timing
✗ Collection statistics
✗ Filter counts
NEXT FEATURES (Phase 2):
─────────────────────────────────────────────────────────────────────────────
PHASE 2A (Quick, 1-2h each):
1. Expose raw distance scores
2. Add match_tier labels (exact/strong/partial/weak)
3. Include nsfw_filtered count
4. Add confidence statistics (mean/median)
PHASE 2B (Medium, 3-4h each):
5. Add min_confidence filter parameter
6. Add subscriber range filters
7. Add diversity modes (focused/balanced/diverse)
PHASE 2C (Advanced, 6+h each):
8. Similar subreddits (vector similarity)
9. Batch query overlap analysis
10. Collection coverage introspection
ENVIRONMENT VARIABLES:
─────────────────────────────────────────────────────────────────────────────
REQUIRED:
REDDIT_CLIENT_ID=<your-app-id>
REDDIT_CLIENT_SECRET=<your-app-secret>
OPTIONAL (defaults shown where one exists):
CHROMA_PROXY_URL=https://reddit-mcp-vector-db.onrender.com
CHROMA_PROXY_API_KEY=<your-api-key>
REDDIT_USER_AGENT=RedditMCP/1.0
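A minimal sketch of loading these settings is shown below. The helper load_config is hypothetical and is not taken from the codebase. It assumes the URL and user agent values listed above are the defaults, and that CHROMA_PROXY_API_KEY simply stays unset when not provided.

```python
import os
from typing import Mapping, Optional

REQUIRED = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
DEFAULTS = {
    "CHROMA_PROXY_URL": "https://reddit-mcp-vector-db.onrender.com",
    "REDDIT_USER_AGENT": "RedditMCP/1.0",
}


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env

    # Fail early with one message naming every missing required variable.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )

    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name, default)
    # Assumption: no default API key; None means "not configured".
    config["CHROMA_PROXY_API_KEY"] = env.get("CHROMA_PROXY_API_KEY")
    return config
```

Accepting a mapping argument lets tests pass a plain dict instead of mutating the process environment.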
VECTOR DB COLLECTION SCHEMA:
─────────────────────────────────────────────────────────────────────────────
Collection Name: reddit_subreddits
Index Size: ~20,000 subreddits
Embedding Type: Multi-field (name, description, purpose, activity)
Vector Metric: Euclidean distance
Metadata Fields:
- name (str) Subreddit name
- subscribers (int) Subscriber count
- nsfw (bool) Is adult content
- url (str) Reddit URL
- description (?) Community description (inferred)
- active (?) Active status (inferred)
DISTANCE RANGES OBSERVED:
0.0-0.8 Excellent matches
0.8-1.0 Very good matches
1.0-1.2 Good matches
1.2-1.4 Fair matches
1.4-1.6 Weak matches
1.6+ Very weak matches
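The observed ranges translate naturally into a lookup by bisection. The sketch below is illustrative only. The function distance_label is hypothetical, and treating each lower bound as inclusive is an assumption.

```python
from bisect import bisect_right

# Upper edges of each range, in ascending order.
BOUNDS = [0.8, 1.0, 1.2, 1.4, 1.6]
LABELS = ["Excellent", "Very good", "Good", "Fair", "Weak", "Very weak"]


def distance_label(distance: float) -> str:
    """Map a Euclidean distance to its observed quality label."""
    # bisect_right counts how many bounds are <= distance, which is
    # exactly the index of the matching label.
    return LABELS[bisect_right(BOUNDS, distance)]
```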
PERFORMANCE CHARACTERISTICS:
─────────────────────────────────────────────────────────────────────────────
Typical Response Time: <2 seconds
- Network latency: ~1s
- ChromaDB search: <100ms
- Confidence calculation: <50ms
- Sorting: <10ms
Scaling Limits:
- Max results per query: 100
- Max batch queries: ~10-20 (untested)
- Concurrent requests: Depends on proxy (Render free: ~10)
Bottlenecks:
- Network latency (primary)
- ChromaDB I/O (secondary)
- Confidence calculation (negligible)
ERROR HANDLING:
─────────────────────────────────────────────────────────────────────────────
HTTP 401 → "API key required"
HTTP 403 → "Invalid API key"
HTTP 429 → "Rate limit exceeded"
Other → "Failed to query: {error}"
Pattern matching for guidance:
"not found" → "Verify subreddit name spelling"
"rate" → "Rate limited - wait 60 seconds"
"timeout" → "Reduce limit parameter to 10"
else → "Try simpler search terms"
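Both mappings above can be expressed as small lookup helpers. The sketch below is an illustration under assumptions: the function names are hypothetical, matching is case-insensitive, and patterns are checked in the order listed.

```python
STATUS_MESSAGES = {
    401: "API key required",
    403: "Invalid API key",
    429: "Rate limit exceeded",
}

# (substring, guidance) pairs, checked in order.
GUIDANCE_PATTERNS = [
    ("not found", "Verify subreddit name spelling"),
    ("rate", "Rate limited - wait 60 seconds"),
    ("timeout", "Reduce limit parameter to 10"),
]
DEFAULT_GUIDANCE = "Try simpler search terms"


def message_for_status(status: int, error: str = "") -> str:
    """Translate an HTTP status code into the documented message."""
    return STATUS_MESSAGES.get(status, f"Failed to query: {error}")


def guidance_for(error_text: str) -> str:
    """Pick recovery guidance from the first matching pattern."""
    lowered = error_text.lower()
    for pattern, guidance in GUIDANCE_PATTERNS:
        if pattern in lowered:
            return guidance
    return DEFAULT_GUIDANCE
```

Plain substring matching is simple but broad: "rate" also matches words such as "generate", so a stricter pattern may be worth considering if false positives show up.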
TESTING REQUIREMENTS:
─────────────────────────────────────────────────────────────────────────────
Unit Tests:
□ Distance→Confidence (all 5 piecewise ranges)
□ Generic subreddit penalty (×0.3)
□ Subscriber boosts/penalties
□ NSFW filtering
Integration Tests:
□ Single query end-to-end
□ Batch query execution
□ Error recovery & guidance
□ Exact match validation
MAKING CHANGES:
─────────────────────────────────────────────────────────────────────────────
1. Modify discover_subreddits() signature [discover.py:10]
2. Update get_operation_schema() [server.py:189-223]
3. Update _search_vector_db() logic [discover.py:101-248]
4. Add tests for new behavior
5. Update docstrings
DOCUMENTED RESOURCES:
─────────────────────────────────────────────────────────────────────────────
VECTOR_DB_ANALYSIS.md 21 detailed sections (this directory)
VECTOR_DB_SUMMARY.md Quick reference guide (this directory)
specs/chroma-proxy-architecture.md Proxy design
specs/agentic-discovery-architecture.md Future agent patterns
================================================================================
KEY TAKEAWAY:
Two files contain all vector DB logic (chroma_client.py + discover.py).
Clean architecture enables incremental improvements with low risk.
================================================================================