Tea Rags MCP

performance-tuning.md•12 KiB

--- title: Performance Tuning sidebar_position: 5 --- import MermaidTeaRAGs from '@site/src/components/MermaidTeaRAGs'; # Performance Tuning ## Auto-Tuning Benchmark Don't guess — let the benchmark find optimal settings for your hardware: ```bash npm run tune ``` Creates `tuned_environment_variables.env` with optimal values in ~60-90 seconds. **For local setup**, just run `npm run tune` — defaults work out of the box: - `QDRANT_URL` defaults to `http://localhost:6333` - `EMBEDDING_BASE_URL` defaults to `http://localhost:11434` - `EMBEDDING_MODEL` defaults to `unclemusclez/jina-embeddings-v2-base-code:latest` **For remote setup**, configure via environment variables: ```bash QDRANT_URL=http://192.168.1.100:6333 \ EMBEDDING_BASE_URL=http://192.168.1.100:11434 \ npm run tune ``` The benchmark tests 7 parameters and shows estimated indexing times: ``` Phase 1: Embedding Batch Size ... Optimal: 128 Phase 2: Embedding Concurrency ... Optimal: 2 Phase 3: Qdrant Batch Size ... Optimal: 384 ``` Then add tuned values to your MCP config: ```bash claude mcp add tea-rags -s user -- node /path/to/tea-rags-mcp/build/index.js \ -e QDRANT_URL=http://localhost:6333 \ -e EMBEDDING_BATCH_SIZE=256 \ -e EMBEDDING_CONCURRENCY=2 \ -e QDRANT_UPSERT_BATCH_SIZE=384 ``` ## Embeddings-Only Benchmark For GPU-specific optimization without Qdrant (embedding calibration only): ```bash npm run benchmark-embeddings ``` This benchmark uses a **three-phase plateau detection algorithm** designed for production robustness. ### Three-Phase Calibration Algorithm #### **Phase 1: Batch Plateau Detection** (CONCURRENCY=1) - Tests batch sizes: [256, 512, 1024, 2048, 3072, 4096] - Uses adaptive chunk count: `min(batch×2, MAX_TOTAL_CHUNKS)` per batch - Stops when improvement < 3% (plateau detected) or timeout exceeded - Plateau timeout: calculated from previous throughput to detect degradation early #### **Phase 2: Concurrency Testing** (on plateau only) - Tests concurrency: [1, 2, 4] on plateau batches only - Plateau timeout: calculated from baseline (CONC=1) throughput - Stops testing higher concurrency if timeout exceeded (degradation) - Reuses CONC=1 results from Phase 1 (no redundant testing) #### **Phase 3: Robust Selection** - Selects from configurations within 2% of maximum throughput - **Prefers**: Lower concurrency → Lower batch size → Higher throughput - Avoids overfitting to noise, reduces tail-risk ### Key Principles The algorithm follows engineering best practices: 1. **Adaptive Workload**: Each batch size tests `min(batch×2, MAX_TOTAL_CHUNKS)` chunks 2. **Plateau Over Peak**: Seeks stable performance range, not theoretical maximum 3. **Noise Tolerance**: Differences < 2-3% considered measurement noise 4. **Robustness**: Prefers simpler configs (lower concurrency/batch) when performance is equivalent ### When to Use Use `npm run benchmark-embeddings` when you: - Want to understand GPU/CPU characteristics without Qdrant - Are comparing different embedding models - Need to diagnose embedding bottlenecks - Want to see the full calibration process with detailed output Use `npm run tune` (full benchmark) when you: - Need complete end-to-end optimization (embedding + Qdrant) - Want production-ready configuration file - Are setting up new deployment ### Example Output (Remote GPU) ``` Phase 1: Batch Plateau Detection (CONCURRENCY=1) 512 chunks @ batch 256 143 chunks/s (3.6s) +100.0% 1024 chunks @ batch 512 (max 11.1s) 145 chunks/s (7.1s) +1.4% → Plateau detected, stopping Plateau batches: [256, 512] Phase 2: Concurrency Effect Test BATCH=256 CONC=1 (from Phase 1) 143 chunks/s CONC=2 (1024 chunks, max 11.1s) STABLE 152 chunks/s +6.3% CONC=4 (2048 chunks, max 22.1s) STABLE 153 chunks/s +0.7% → Concurrency plateau, stopping BATCH=512 CONC=1 (from Phase 1) 145 chunks/s CONC=2 (2048 chunks, max 21.8s) STABLE 147 chunks/s +1.4% → Concurrency plateau, stopping Phase 3: Configuration Selection Acceptable configurations: 5/5 Max throughput: 153 chunks/s Recommended configurations: 🏠 Local GPU: BATCH_SIZE=512 CONCURRENCY=2 (153 chunks/s) 🌐 Remote GPU: BATCH_SIZE=256 CONCURRENCY=4 (153 chunks/s) Detected setup: 🌐 Remote Selected: BATCH_SIZE=256, CONCURRENCY=4 ``` **Hardware:** Remote AMD Radeon 7800M GPU via LAN, jina-embeddings-v2-base-code ### Testing Different Models ```bash # Test jina-embeddings-v2-base-code (768 dims) - default npm run benchmark-embeddings # Test different model EMBEDDING_MODEL=mxbai-embed-large:latest npm run benchmark-embeddings # Remote GPU EMBEDDING_BASE_URL=http://192.168.1.100:11434 npm run benchmark-embeddings ``` ## Indexing Benchmarks | Codebase | LoC | Local Setup | Remote GPU | |----------|-----|------------|------------| | Small CLI tool | 10K | ~2s | ~2s | | Medium library | 50K | ~11s | ~9s | | Large library | 100K | ~21s | ~17s | | Enterprise app | 500K | ~2 min | ~1.5 min | | Large codebase | 1M | ~3.5 min | ~3 min | | VS Code | 3.5M | ~12 min | ~10 min | | Kubernetes | 5M | ~18 min | ~15 min | | Linux kernel | 10M | ~36 min | ~29 min | ## Deployment Topologies ### 🏠 Fully Local Setup Everything runs on your machine — lowest latency, fully offline. <MermaidTeaRAGs> {` flowchart LR subgraph machine["💻 Your Machine"] claude["🤖 Coding Agent"] ollama["✨ Ollama GPU"] qdrant["🗄️ Qdrant Docker"] claude -->|embedding| ollama ollama -->|vectors| claude claude -->|storage| qdrant end `} </MermaidTeaRAGs> **Best for:** Powerful local GPU (M3 Pro or better), offline work, minimal setup. ### ⭐ Remote GPU + Local Qdrant (Recommended) Embedding on dedicated GPU server, Qdrant runs locally in Docker. <MermaidTeaRAGs> {` flowchart LR subgraph dev["💻 Development Machine"] claude["🤖 Coding Agent"] qdrant["🗄️ Qdrant Docker"] end subgraph gpu["🖥️ GPU Server LAN"] ollama["✨ Ollama GPU"] end claude -->|embedding| ollama ollama -.->|vectors| claude claude -->|storage| qdrant `} </MermaidTeaRAGs> **Best for:** Most users — fast GPU embedding + fast local storage. **Fastest overall:** ~7m 39s for VS Code (3.5M LoC). ### 🌐 Full Remote Setup Both Qdrant and Ollama on dedicated server. <MermaidTeaRAGs> {` flowchart LR claude["🤖 Coding Agent Your Machine"] subgraph gpu["🖥️ GPU Server LAN"] ollama["✨ Ollama GPU"] qdrant["🗄️ Qdrant"] end claude -->|embedding| ollama ollama -.->|vectors| claude claude -->|storage| qdrant `} </MermaidTeaRAGs> **Best for:** Cannot run Docker locally, indexing from multiple thin clients. **Trade-off:** Network latency reduces storage rate (~4x slower than local). ## Performance Comparison | Metric | 🏠 Fully Local | ⭐ Remote GPU + Local Qdrant | 🌐 Full Remote | |--------|---------------|------------------------------|----------------| | Optimal batch | 512 | 256 | 256 | | Optimal concurrency | 1 | 6 | 4 | | Embedding rate | 87 ch/s | **156 ch/s** | 154 ch/s | | Storage rate | 6273 ch/s | **6966 ch/s** | 1810 ch/s | | VS Code (3.5M LoC) | 13m 36s | **7m 39s** | 8m 13s | :::tip Why Remote GPU + Local Qdrant is Fastest Network latency crushes remote Qdrant storage: 6966 ch/s → 1810 ch/s (3.8x drop). Always run Qdrant locally if possible. Embedding benefits from dedicated GPU (1.8x faster than M3 Pro), and concurrency hides network latency. ::: ## Key Performance Insights ### Batch Size **Rule of thumb:** - **Local GPU**: Use large batches (512) + CONCURRENCY=1 to minimize per-batch overhead - **Remote GPU**: Use smaller batches (256) + CONCURRENCY=4-6 to hide network latency ```bash # Local GPU — GPU-bound, concurrency adds overhead export EMBEDDING_BATCH_SIZE=512 export EMBEDDING_CONCURRENCY=1 # Remote GPU — network latency hidden by parallel requests export EMBEDDING_BATCH_SIZE=256 export EMBEDDING_CONCURRENCY=4 ``` ### Concurrency **Critical insight:** Concurrency is **only beneficial for remote GPU**. Local GPU sees no improvement — it adds overhead without benefit. | Setup | Concurrency | Why | |-------|-------------|-----| | Local GPU | 1 | GPU-bound, parallel requests add overhead | | Remote GPU | 4-6 | Hide network latency with overlapping I/O | ### Qdrant Ordering | Mode | Use case | Performance | |------|----------|-------------| | `weak` | Local Qdrant, single indexer | Fastest | | `medium` | Balanced | Default | | `strong` | Remote Qdrant, multiple indexers | Safest, highest consistency | ```bash # Local setup export QDRANT_BATCH_ORDERING=weak # Remote GPU + Local Qdrant export QDRANT_BATCH_ORDERING=strong # Full remote export QDRANT_BATCH_ORDERING=weak ``` ### Bottleneck Analysis **Embedding is always the bottleneck:** - Embedding: 87-156 ch/s - Storage: 1810-6966 ch/s Even at 156 ch/s (remote GPU), embedding is **40x slower** than storage. Invest in GPU, not Qdrant infrastructure. ## Tuning Guidelines ### For Large Codebases (500K+ LOC) - ✅ Run `npm run tune` to find optimal batch sizes - ✅ Increase `MAX_IO_CONCURRENCY=100` for SSD - ✅ Increase Qdrant memory limits in docker-compose.yml - ✅ Use `.contextignore` to exclude node_modules, build artifacts ### For Slow Search - ✅ Use filters: `fileTypes`, `pathPattern` to narrow scope - ✅ Limit results to needed count - ✅ Enable hybrid search for technical queries - ✅ Verify Qdrant has payload indexes ### For Memory Issues - ⚠️ Reduce `CODE_CHUNK_SIZE` (default 2500) - ⚠️ Reduce `QDRANT_UPSERT_BATCH_SIZE` - ⚠️ Increase Qdrant memory in docker-compose.yml - ⚠️ Index subdirectories separately ## Monitoring & Debug Enable detailed timing logs: ```bash export DEBUG=1 ``` Logs are written to `~/.tea-rags-mcp/logs/`. Check index status: ```bash /mcp__qdrant__get_index_status /path/to/project ``` Returns: status, chunk count, last update time, collection statistics. ## Hardware Recommendations ### Minimum (Development) - 4GB RAM, SSD, CPU embedding (slow but works) ### Recommended (Production) - 8GB RAM, GPU with 8GB+ VRAM, SSD, Dedicated Qdrant ### Enterprise (Large Codebases) - 16GB+ RAM, GPU with 12GB+ VRAM, NVMe SSD, Clustered Qdrant --- <details> <summary>Pre-configured Settings by Setup</summary> ### 🏠 Local Setup (MacBook M3 Pro) ```bash EMBEDDING_BATCH_SIZE=512 EMBEDDING_CONCURRENCY=1 QDRANT_UPSERT_BATCH_SIZE=192 QDRANT_BATCH_ORDERING=weak QDRANT_FLUSH_INTERVAL_MS=100 QDRANT_DELETE_BATCH_SIZE=1500 QDRANT_DELETE_CONCURRENCY=12 # Embedding: 87 ch/s, Storage: 6273 ch/s ``` **Why these values:** - `BATCH_SIZE=512` + `CONCURRENCY=1`: GPU-bound workload, concurrency adds overhead - `QDRANT_BATCH_ORDERING=weak`: Local Qdrant doesn't need strong ordering - `FLUSH_INTERVAL_MS=100`: Fast flushes for local SSD ### ⭐ Remote GPU + Local Qdrant (AMD 7800M) ```bash EMBEDDING_BATCH_SIZE=256 EMBEDDING_CONCURRENCY=6 QDRANT_UPSERT_BATCH_SIZE=512 QDRANT_BATCH_ORDERING=strong QDRANT_FLUSH_INTERVAL_MS=250 QDRANT_DELETE_BATCH_SIZE=500 QDRANT_DELETE_CONCURRENCY=16 # Embedding: 156 ch/s, Storage: 6966 ch/s ``` **Why these values:** - `BATCH_SIZE=256` + `CONCURRENCY=6`: Network latency hidden by concurrent requests - `QDRANT_BATCH_ORDERING=strong`: Higher ordering ensures data consistency - Higher storage rate (6966 ch/s) because Qdrant is local ### 🌐 Full Remote Setup ```bash EMBEDDING_BATCH_SIZE=256 EMBEDDING_CONCURRENCY=4 QDRANT_UPSERT_BATCH_SIZE=256 QDRANT_BATCH_ORDERING=weak QDRANT_FLUSH_INTERVAL_MS=500 QDRANT_DELETE_BATCH_SIZE=1000 QDRANT_DELETE_CONCURRENCY=12 # Embedding: 154 ch/s, Storage: 1810 ch/s ``` **Why these values:** - `BATCH_SIZE=256` + `CONCURRENCY=4`: Balance between hiding latency and avoiding queue buildup - `QDRANT_BATCH_ORDERING=weak`: Reduce round-trips over network - `FLUSH_INTERVAL_MS=500`: Larger flush windows amortize network latency - Storage bottlenecked by network (1810 ch/s vs 6966 for local) </details>

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/artk0de/TeaRAGs-MCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

performance-tuning.md•12 KiB