mcp-retrieve
An MCP server that exposes late-interaction
document retrieval (ColBERT-style MaxSim) over a local folder. Point it at a
directory of text/markdown/code, and any MCP client — Claude Desktop, an IDE
agent, your own host — can index it and search it with token-level
relevance.
It ships with a deterministic, model-free default embedder, so the server and its full test suite run offline with no model weights, no API key, no network. When you want production-grade semantics, drop in a real ColBERT / ColQwen encoder behind a small protocol — ranking code does not change.
What is MCP?
The Model Context Protocol is an open standard that lets LLM applications
connect to external tools and data through a uniform server interface. A host
(e.g. Claude Desktop) launches MCP servers and calls the tools they
advertise. mcp-retrieve is such a server; it advertises two tools:
| Tool | Purpose |
| --- | --- |
| index_folder | Read text files under the given folder and index them |
| search | Rank indexed chunks by MaxSim late interaction and return the top matches |
The retrieval approach: late interaction (ColBERT)
Most dense retrievers compress a passage into one vector and compare it to one query vector — cheap, but lossy. Late interaction, introduced by ColBERT (Khattab & Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020, arXiv:2004.12832), keeps one vector per token for both the query and the document and defers their interaction to scoring time via the MaxSim operator:
score(q, d) = Σ_i max_j sim(q_i, d_j)

Each query token q_i is matched to its single most similar document token
d_j, and those per-token maxima are summed. This preserves fine-grained term
matching (a query term can find its evidence anywhere in the passage) while
staying efficient. With L2-normalised vectors, sim is cosine similarity, so
MaxSim reduces to a dot product followed by a row-wise max and a sum — which is
exactly what mcp_retrieve.retrieval.maxsim computes.
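To make the scoring rule concrete, here is a minimal NumPy sketch of MaxSim under the assumptions stated above (one L2-normalised row per token). The names `l2_normalize` and `maxsim_score` are illustrative and are not taken from the package; the actual signature of mcp_retrieve.retrieval.maxsim may differ.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Illustrative MaxSim: sum over query tokens of the best document match.

    query_vecs has shape (num_query_tokens, dim) and doc_vecs has shape
    (num_doc_tokens, dim); both are assumed to have L2-normalised rows.
    """
    sim = query_vecs @ doc_vecs.T          # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())    # row-wise max, then sum


# Toy example: two query tokens, three document tokens, dim = 2.
query = l2_normalize(np.array([[1.0, 0.0], [0.0, 1.0]]))
doc = l2_normalize(np.array([[1.0, 0.0], [1.0, 1.0], [0.0, -1.0]]))
score = maxsim_score(query, doc)
```

In the toy example the first query token finds an exact match (similarity 1), while the second only finds the diagonal document token (similarity 1/√2), so the total is 1 + 1/√2.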
Install
pip install -e .          # core: mcp + numpy
pip install -e ".[dev]"   # plus pytest for the test suite

Register with an MCP client
For Claude Desktop, add the server to its mcpServers config
(claude_desktop_config.json):
{
  "mcpServers": {
    "mcp-retrieve": {
      "command": "mcp-retrieve"
    }
  }
}

If you installed into a virtual environment, use the absolute path to the
mcp-retrieve console script (or "command": "python", "args": ["-m", "mcp_retrieve"]). Restart the client, then ask it to index a folder and search
it — the model will call the index_folder and search tools.
Usage from Python
from mcp_retrieve import RetrievalIndex

index = RetrievalIndex()  # deterministic default embedder
index.index_folder("./docs")

for hit in index.search("late interaction maxsim", k=5):
    print(f"{hit.score:.3f} {hit.chunk.source} {hit.snippet}")

Plugging in a real late-interaction model
The default HashingEmbedder makes the project run anywhere, but it matches on
character n-grams, not meaning. For real semantics, implement the Embedder
protocol around a trained encoder and pass it in:
import numpy as np

from mcp_retrieve import RetrievalIndex
from mcp_retrieve.server import create_server


class ColbertEmbedder:
    """Wrap a ColBERT checkpoint as a multi-vector Embedder."""

    def __init__(self, checkpoint: str) -> None:
        from colbert.modeling.checkpoint import Checkpoint
        from colbert.infra import ColBERTConfig

        self._ckpt = Checkpoint(checkpoint, ColBERTConfig())

    @property
    def dim(self) -> int:
        return 128

    def embed(self, text: str) -> "np.ndarray":
        vecs = self._ckpt.docFromText([text])[0]  # (num_tokens, 128)
        return np.asarray(vecs, dtype=np.float32)


# Use it from Python …
index = RetrievalIndex(embedder=ColbertEmbedder("colbert-ir/colbertv2.0"))

# … or run the MCP server with it.
server = create_server(embedder=ColbertEmbedder("colbert-ir/colbertv2.0"))
server.run()

Any object exposing dim: int and embed(text) -> ndarray[num_tokens, dim]
with L2-normalised rows satisfies the protocol — ColBERT, ColQwen, ColPali, or
your own. The ranking and chunking code is encoder-agnostic.
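As a rough illustration of that contract, the sketch below declares the protocol with typing.Protocol and satisfies it with a toy character n-gram hasher. Both are assumptions for illustration: the package's own Embedder declaration and its HashingEmbedder are not shown here and may be written differently, and `ToyNgramEmbedder` is a hypothetical name.

```python
import hashlib
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class Embedder(Protocol):
    """Assumed shape of the protocol: a dimension plus a per-token embed()."""

    @property
    def dim(self) -> int: ...

    def embed(self, text: str) -> np.ndarray: ...


class ToyNgramEmbedder:
    """Hypothetical stand-in: one vector per whitespace token, built by
    hashing the token's character n-grams into `dim` buckets."""

    def __init__(self, dim: int = 32, n: int = 3) -> None:
        self._dim = dim
        self._n = n

    @property
    def dim(self) -> int:
        return self._dim

    def embed(self, text: str) -> np.ndarray:
        tokens = text.lower().split() or [""]
        rows = np.zeros((len(tokens), self._dim), dtype=np.float32)
        for i, tok in enumerate(tokens):
            padded = f"#{tok}#"
            count = max(1, len(padded) - self._n + 1)
            for j in range(count):
                gram = padded[j : j + self._n]
                # hashlib keeps the bucket stable across runs, unlike hash().
                digest = hashlib.md5(gram.encode("utf-8")).digest()
                rows[i, int.from_bytes(digest[:4], "little") % self._dim] += 1.0
        # L2-normalise each row, as the protocol requires.
        norms = np.linalg.norm(rows, axis=1, keepdims=True)
        return rows / np.clip(norms, 1e-12, None)


emb = ToyNgramEmbedder(dim=16)
vecs = emb.embed("late interaction")
```

Because the protocol is structural, `ToyNgramEmbedder` never inherits from `Embedder`; exposing `dim` and `embed` is enough.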
Architecture
src/mcp_retrieve/
  embedder.py    # Embedder protocol + deterministic HashingEmbedder default
  retrieval.py   # chunking, MaxSim, RetrievalIndex (pure — no MCP dependency)
  server.py      # FastMCP server exposing index_folder + search (only MCP import)

The retrieval and embedding cores import nothing MCP-related, so they are
testable and reusable on their own; the SDK is isolated to server.py and
imported lazily.
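One common way to get this kind of isolation is to defer the SDK import until a server is actually requested. The sketch below shows the general pattern only; `try_import` and `build_server` are hypothetical names, and the real server.py may structure its lazy import differently.

```python
import importlib
from types import ModuleType
from typing import Optional


def try_import(module_name: str) -> Optional[ModuleType]:
    """Import a module on demand; return None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def build_server(name: str = "mcp-retrieve"):
    """Hypothetical factory: the MCP SDK is only touched when this runs."""
    sdk = try_import("mcp.server.fastmcp")
    if sdk is None:
        raise RuntimeError("the 'mcp' package is required to run the server")
    return sdk.FastMCP(name)
```

With this layout, importing the retrieval code never pulls in the SDK, and a missing `mcp` package only surfaces as an error when a server is built.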
Testing
python -m pytest

All retrieval and embedder tests run offline with the default embedder. The
end-to-end FastMCP tool test is skipped automatically when the mcp package is
not installed.
License
MIT © 2026 Max Baluev