fastcontext-hybrid-mcp
Allows the Hermes agent to use FastContext for codebase exploration.
FastContext Hybrid MCP Server
An MCP (Model Context Protocol) server that gathers context from codebases using FastContext-1.0-4B-RL — a 4B parameter model trained by Microsoft for repository exploration.
The server combines LLM-guided code exploration with fuzzy matching to find relevant code snippets for any question about a codebase.
How It Works
User question
↓
1. DECOMPOSE — break into sub-questions (code-focused + doc-focused)
↓
2. EXPLORE — FastContext 4B model searches the codebase via Grep/Glob/Read
↓
3. EXTRACT — fuzzy matching extracts only relevant lines from found files
↓
4. GAP-FILL — ripgrep + Levenshtein distance catches what the model missed
↓
Snippets (~5K tokens) → fed to larger LLM for synthesis
Performance Gains
Why use this pipeline instead of just asking the model directly?
APPROACH COMPARISON (tested on business-auditor, 1170 files)
═══════════════════════════════════════════════════════════════════════════
Method                          Concept    Answerable   Context/Question
                                Coverage
───────────────────────────────────────────────────────────────────────────
Raw FastContext (no pipeline)   50%        3/6          N/A (model output)
+ Path resolution fix           67%        4/6          N/A
+ Hybrid pipeline (unlimited)   97%        6/6          308K tokens
+ Hybrid pipeline (optimized)   92%        6/6          5K tokens  ← this
───────────────────────────────────────────────────────────────────────────
What each layer adds:
Layer                   What it does                             Gain
──────────────────────────────────────────────────────────────────────
FastContext 4B          Finds relevant files via tool calls      Baseline
Query decomposition     Breaks Q into doc + code sub-questions   +17%
Fuzzy snippet extract   camelCase split + Levenshtein matching   +15%
Gap-fill (ripgrep)      Catches what model missed                +25%
──────────────────────────────────────────────────────────────────────
Total: 50% → 92% concept coverage (+42 points, an 84% relative improvement)
Context efficiency:
Without optimization: 308K tokens/question (loads full files)
With optimization: 5K tokens/question (extracts relevant lines only)
Reduction: ~62x smaller context
What this means for the larger LLM:
Without pipeline: feed 308K tokens of raw files → exceeds most context windows, expensive
With pipeline: feed 5K tokens of targeted snippets → fits easily, cheap, higher quality
The 4B model handles the expensive exploration work (searching, reading, filtering). The larger LLM only sees the distilled evidence — no noise, no irrelevant code.
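The headline figures can be re-derived from the numbers quoted in the comparison. The following is only a quick arithmetic check using those published values; the variable names are illustrative.

```python
# Values quoted in the comparison above.
baseline_coverage = 50           # % concept coverage, raw FastContext
pipeline_coverage = 92           # % concept coverage, optimized hybrid pipeline
full_context_tokens = 308_000    # tokens per question when full files are loaded
snippet_context_tokens = 5_000   # tokens per question after snippet extraction

# Absolute gain in percentage points, and gain relative to the baseline.
point_gain = pipeline_coverage - baseline_coverage
relative_gain = point_gain / baseline_coverage * 100

# How many times smaller the distilled context is.
reduction = full_context_tokens / snippet_context_tokens

print(f"{point_gain} points, {relative_gain:.0f}% relative improvement")
print(f"{reduction:.1f}x smaller context")
```

The "+84%" figure is therefore a relative improvement over the 50% baseline, and the context reduction is 61.6x before rounding.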
Key Features
Smart search: 4B model decides WHERE to look (not just keyword matching)
Fuzzy matching: camelCase splitting, separator normalization, Levenshtein distance
Minimal context: extracts only relevant lines, not full files (~5K tokens vs ~300K)
Gap-fill: ripgrep safety net catches what the model misses
Q4 quantization: runs on 6GB+ VRAM, ~67 tok/s generation
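To make the fuzzy-matching feature concrete, here is a minimal sketch of camelCase splitting, separator normalization, and Levenshtein matching, followed by extraction of matching lines with surrounding context. It is not the server's actual code: the function names, the regular expressions, and the `max_distance` threshold are all assumptions chosen for illustration.

```python
import re


def split_identifier(name: str) -> list[str]:
    """Split camelCase, snake_case, and kebab-case names into lowercase words."""
    # Break at lower-to-upper boundaries: "registerUser" -> "register User".
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    # Break acronym runs before a capitalized word: "HTTPServer" -> "HTTP Server".
    spaced = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", spaced)
    # Treat whitespace, "_", "-", "." and "/" as equivalent separators.
    return [w.lower() for w in re.split(r"[\s_\-./]+", spaced) if w]


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def line_matches(concept: str, line: str, max_distance: int = 1) -> bool:
    """True if every word of the concept is within max_distance of a word on the line."""
    concept_words = split_identifier(concept)
    line_words = [w for ident in re.findall(r"[A-Za-z0-9_\-]+", line)
                  for w in split_identifier(ident)]
    return bool(concept_words) and all(
        any(levenshtein(cw, lw) <= max_distance for lw in line_words)
        for cw in concept_words
    )


def extract_snippet(text: str, concepts: list[str],
                    context_lines: int = 2) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for matching lines plus surrounding context."""
    lines = text.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if any(line_matches(c, line) for c in concepts):
            keep.update(range(max(0, i - context_lines),
                              min(len(lines), i + context_lines + 1)))
    return [(i + 1, lines[i]) for i in sorted(keep)]


# Hypothetical input file used only for this example.
source = "\n".join([
    "import os", "", "", "",
    "def registerUser(email):",
    "    return save(email)",
    "", "", "",
    "print('done')",
])
snippet = extract_snippet(source, ["register user"], context_lines=1)
for number, line in snippet:
    print(number, line)
```

A fixed distance threshold is crude: very short words such as "id" and "if" are within one edit of each other, so a real implementation would likely scale the threshold with word length.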
Quick Start (Docker)
The fastest way to get started. Docker handles all dependencies.
Note for macOS users: Docker on macOS runs in a Linux VM — it cannot access Metal GPU. For Metal GPU acceleration on Apple Silicon, use the native setup below.
Linux with Vulkan GPU (AMD/Intel/NVIDIA)
git clone https://github.com/LyuboslavLyubenov/fastcontext-hybrid-mcp
cd fastcontext-hybrid-mcp
# Start with your project directory
WORK_DIR=/path/to/your/project docker compose up fastcontext-vulkan
The model downloads automatically on first run (~2.4GB).
CPU-only (macOS Docker or Linux without GPU)
git clone https://github.com/LyuboslavLyubenov/fastcontext-hybrid-mcp
cd fastcontext-hybrid-mcp
WORK_DIR=/path/to/your/project docker compose up fastcontext-cpu
Docker run (manual)
# Build
docker build -f Dockerfile.vulkan -t fastcontext-mcp .
# Run
docker run -d \
  -v /path/to/project:/workspace \
  -v ./models:/models \
  -p 8080:8080 \
  --device /dev/dri:/dev/dri \
  fastcontext-mcp
Quick Start (No Docker — Recommended for macOS)
For native performance with Metal GPU on Apple Silicon, or Vulkan on Linux.
macOS with Metal GPU (Apple Silicon M1/M2/M3/M4)
git clone https://github.com/LyuboslavLyubenov/fastcontext-hybrid-mcp
cd fastcontext-hybrid-mcp
chmod +x setup-mac.sh start.sh
# One-command setup (installs everything, builds llama.cpp with Metal)
./setup-mac.sh
# Start with your project
./start.sh /path/to/your/project
This uses Metal GPU for ~67 tok/s generation. No Docker needed.
Linux with Vulkan GPU
git clone https://github.com/LyuboslavLyubenov/fastcontext-hybrid-mcp
cd fastcontext-hybrid-mcp
chmod +x setup.sh start.sh
# One-command setup
./setup.sh
# Start
./start.sh /path/to/your/project
Manual setup (Linux with Vulkan)
1. Install system dependencies
# Fedora
sudo dnf install cmake gcc-c++ glslc spirv-headers-devel spirv-tools-devel \
  vulkan-headers vulkan-loader-devel ripgrep
# Ubuntu/Debian
sudo apt install cmake build-essential glslc spirv-headers spirv-tools \
  libvulkan-dev ripgrep
2. Install Python dependencies
pip install fastmcp mcp huggingface_hub
3. Build llama.cpp with Vulkan
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
sudo cp build/bin/llama-server /usr/local/bin/
cd ..
4. Download the model
huggingface-cli download sdougbrown/FastContext-1.0-4B-RL-GGUF \
  FastContext-1.0-4B-RL-Q4_K_M.gguf --local-dir ./models
5. Start
# Start inference server (32K context, 1 slot, Vulkan GPU)
llama-server \
  -m models/FastContext-1.0-4B-RL-Q4_K_M.gguf \
  --ctx-size 32768 \
  --parallel 1 \
  -ngl 99 \
  --host 127.0.0.1 \
  --port 8080 \
  --reasoning off &
# Start MCP server
FASTCONTEXT_WORK_DIR=/path/to/project python3 mcp_server.py
Manual setup (macOS with Metal)
1. Install dependencies
brew install cmake git ripgrep python3
pip install fastmcp mcp huggingface_hub
2. Build llama.cpp with Metal
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(sysctl -n hw.ncpu)
sudo cp build/bin/llama-server /usr/local/bin/
cd ..
3. Download model and start
huggingface-cli download sdougbrown/FastContext-1.0-4B-RL-GGUF \
  FastContext-1.0-4B-RL-Q4_K_M.gguf --local-dir ./models
./start.sh /path/to/your/project
Configure in Hermes Agent
Add to ~/.hermes/config.yaml:
mcp_servers:
  fastcontext:
    command: "python3"
    args: ["/path/to/fastcontext-hybrid-mcp/mcp_server.py"]
    env:
      FASTCONTEXT_WORK_DIR: "/path/to/your/project"
      FASTCONTEXT_SERVER: "http://127.0.0.1:8080"
    timeout: 120
Restart Hermes Agent. Tools appear as mcp_fastcontext_*.
Tools
search_context
Main tool — searches a codebase for context relevant to a question.
Args:
question: str — The question (conceptual or code-specific)
work_dir: str — Path to codebase (optional, uses env var)
seed: int — Random seed (default: 42)
max_turns: int — Exploration turns per sub-question (default: 6)
enable_gap_fill: bool — Use fuzzy gap-fill (default: true)
Returns:
JSON with:
snippets: str — Extracted code snippets (~5K tokens)
files_read: int — Number of files explored
context_chars: int — Total context size
keywords: list — Extracted keywords
read_snippet
Extract relevant lines from a single file using fuzzy matching.
Args:
filepath: str — Absolute path to file
concepts: list[str] — Concepts to search for
context_lines: int — Surrounding lines (default: 2)
list_files
List files matching a glob pattern.
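The README does not document the arguments of this tool, so the following is only a sketch of what glob-based listing could look like. The parameter names `work_dir`, `pattern`, and `limit` are hypothetical and not taken from the server.

```python
from pathlib import Path


def list_files(work_dir: str, pattern: str = "**/*.py", limit: int = 200) -> list[str]:
    """Return paths under work_dir that match a glob pattern, relative to work_dir."""
    root = Path(work_dir)
    # "**" matches zero or more directories, so top-level files are included.
    matches = sorted(str(p.relative_to(root)) for p in root.glob(pattern) if p.is_file())
    # Cap the result so a broad pattern cannot flood the caller's context.
    return matches[:limit]
```

Returning paths relative to the project root keeps the output short, which matters when the result is fed back to a model.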
health_check
Check if the inference server is running.
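A check of this kind can be approximated outside the MCP server. The sketch below assumes the inference server is llama.cpp's `llama-server`, which exposes a `/health` endpoint; the function name and defaults are illustrative and are not the tool's actual implementation.

```python
import urllib.error
import urllib.request


def server_is_healthy(base_url: str = "http://127.0.0.1:8080", timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


if __name__ == "__main__":
    print("inference server up:", server_is_healthy())
```

Running this before starting the agent is a quick way to tell a stopped inference server apart from a misconfigured MCP entry.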
Environment Variables
Variable | Default | Description |
FASTCONTEXT_WORK_DIR | | Project directory to search |
FASTCONTEXT_SERVER | | llama-server URL |
 | | Model path |
 | auto-detected | llama-server binary path |
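As an illustration of how such settings are typically read, the sketch below resolves the two variables named in the Hermes configuration, `FASTCONTEXT_WORK_DIR` and `FASTCONTEXT_SERVER`. The fallback values (current directory and `http://127.0.0.1:8080`) are assumptions for this example, not the server's documented defaults.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve FastContext settings from an environment mapping."""
    env = os.environ if env is None else env
    return {
        # Assumed fallback: search the current working directory.
        "work_dir": env.get("FASTCONTEXT_WORK_DIR", os.getcwd()),
        # Assumed fallback: the address used in the setup commands.
        "server_url": env.get("FASTCONTEXT_SERVER", "http://127.0.0.1:8080").rstrip("/"),
    }


if __name__ == "__main__":
    print(load_settings())
```

Passing the mapping explicitly makes the function easy to test without touching the real process environment.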
Hardware Requirements
Backend | Min RAM | GPU | Platform | Notes |
Metal | 8 GB unified | Apple Silicon M1+ | macOS native | Best for macOS — requires native install, not Docker |
Vulkan | 6 GB | AMD/Intel/NVIDIA | Linux | Mesa or proprietary drivers |
CPU | 8 GB RAM | None | Any | Works in Docker on any platform, ~10x slower |
Performance
Metric | Value |
Model size (Q4_K_M) | 2.4 GB |
VRAM usage | ~6 GB (model + KV cache) |
Prompt eval | ~420 tokens/sec |
Generation | ~67 tokens/sec |
Context per question | ~5K tokens |
Time per question | ~20-40 seconds |
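The throughput figures give a rough way to reason about latency. The sketch below is a back-of-the-envelope estimate only: it assumes time is dominated by prompt evaluation and generation at the quoted rates, and the token counts passed in are hypothetical.

```python
PROMPT_TOKENS_PER_SEC = 420      # prompt eval rate quoted above
GENERATION_TOKENS_PER_SEC = 67   # generation rate quoted above


def estimate_seconds(prompt_tokens: int, generated_tokens: int) -> float:
    """Estimate model time as prompt-eval time plus generation time."""
    return (prompt_tokens / PROMPT_TOKENS_PER_SEC
            + generated_tokens / GENERATION_TOKENS_PER_SEC)


# Hypothetical workload: 4,200 prompt tokens read and 670 tokens generated.
print(f"{estimate_seconds(4200, 670):.1f} s")
```

Under these assumed token counts the estimate falls at the low end of the 20-40 second range; tool-call overhead such as ripgrep and file reads is not modelled.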
License
MIT