mcp-retrieve
An MCP server that exposes late-interaction
document retrieval (ColBERT-style MaxSim) over a local folder. Point it at a
directory of text/markdown/code, and any MCP client — Claude Desktop, an IDE
agent, your own host — can index it and search it with token-level
relevance.
It ships with a deterministic, model-free default embedder, so the server and its full test suite run offline with no model weights, no API key, no network. When you want production-grade semantics, drop in a real ColBERT / ColQwen encoder behind a small protocol — ranking code does not change.
What is MCP?
The Model Context Protocol is an open standard that lets LLM applications
connect to external tools and data through a uniform server interface. A host
(e.g. Claude Desktop) launches MCP servers and calls the tools they
advertise. mcp-retrieve is such a server; it advertises two tools:
| Tool | Purpose |
| --- | --- |
| index_folder | Read text files under … |
| search | Rank indexed chunks by MaxSim late interaction and return the top … |
The retrieval approach: late interaction (ColBERT)
Most dense retrievers compress a passage into one vector and compare it to one query vector — cheap, but lossy. Late interaction, introduced by ColBERT (Khattab & Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020, arXiv:2004.12832), keeps one vector per token for both the query and the document and defers their interaction to scoring time via the MaxSim operator:
```
score(q, d) = Σ_i max_j sim(q_i, d_j)
```

Each query token q_i is matched to its single most similar document token
d_j, and those per-token maxima are summed. This preserves fine-grained term
matching (a query term can find its evidence anywhere in the passage) while
staying efficient. With L2-normalised vectors, sim is cosine similarity, so
MaxSim reduces to a dot product followed by a row-wise max and a sum — which is
exactly what mcp_retrieve.retrieval.maxsim computes.
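
As a rough illustration of the formula above, here is a minimal NumPy sketch of MaxSim over L2-normalised token vectors. The names `l2_normalise` and `maxsim_sketch` are hypothetical and chosen for this example; the project's own `mcp_retrieve.retrieval.maxsim` may differ in signature and details.

```python
import numpy as np


def l2_normalise(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (all-zero rows stay zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def maxsim_sketch(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """score(q, d) = sum_i max_j <q_i, d_j>, assuming L2-normalised rows."""
    # Dot products between every query token and every document token:
    # shape (num_query_tokens, num_doc_tokens).
    sim = query_vecs @ doc_vecs.T
    # Row-wise max picks the best document token per query token; then sum.
    return float(sim.max(axis=1).sum())


# Toy 2-dimensional "token vectors", invented for illustration only.
query = l2_normalise(np.array([[1.0, 0.0], [0.0, 1.0]]))
doc_a = l2_normalise(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
doc_b = l2_normalise(np.array([[1.0, 1.0]]))

score_a = maxsim_sketch(query, doc_a)  # both query tokens find an exact match
score_b = maxsim_sketch(query, doc_b)  # both query tokens match at cos 45°
print(f"{score_a:.3f} {score_b:.3f}")
```

In this toy case `doc_a` outranks `doc_b` because each query token finds its own best-matching document token, which is the behaviour a single pooled vector would blur.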
Install
```
pip install -e .          # core: mcp + numpy
pip install -e ".[dev]"   # plus pytest for the test suite
```

Register with an MCP client
For Claude Desktop, add the server to its mcpServers config
(claude_desktop_config.json):
```json
{
  "mcpServers": {
    "mcp-retrieve": {
      "command": "mcp-retrieve"
    }
  }
}
```

If you installed into a virtual environment, use the absolute path to the
mcp-retrieve console script (or "command": "python", "args": ["-m", "mcp_retrieve"]). Restart the client, then ask it to index a folder and search
it; the model will call the index_folder and search tools.
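
For the virtual-environment case, one way to avoid typing the absolute path by hand is to generate the config entry from the interpreter that has mcp-retrieve installed. The sketch below is only an illustration: `build_claude_config` is a hypothetical helper, not part of the project, and it assumes the `python -m mcp_retrieve` invocation mentioned above.

```python
import json
import sys


def build_claude_config(python_path: str) -> dict:
    """Return an mcpServers entry that launches the server via `python -m`."""
    return {
        "mcpServers": {
            "mcp-retrieve": {
                "command": python_path,  # absolute path to the venv interpreter
                "args": ["-m", "mcp_retrieve"],
            }
        }
    }


# Run this with the virtual environment's interpreter so that sys.executable
# points at the Python that can import mcp_retrieve.
print(json.dumps(build_claude_config(sys.executable), indent=2))
```

Paste the printed JSON into claude_desktop_config.json, merging it with any existing mcpServers entries.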
Usage from Python
```python
from mcp_retrieve import RetrievalIndex

index = RetrievalIndex()  # deterministic default embedder
index.index_folder("./docs")

for hit in index.search("late interaction maxsim", k=5):
    print(f"{hit.score:.3f} {hit.chunk.source} {hit.snippet}")
```

Plugging in a real late-interaction model
The default HashingEmbedder makes the project run anywhere, but it matches on
character n-grams, not meaning. For real semantics, implement the Embedder
protocol around a trained encoder and pass it in:
```python
import numpy as np

from mcp_retrieve import RetrievalIndex
from mcp_retrieve.server import create_server


class ColbertEmbedder:
    """Wrap a ColBERT checkpoint as a multi-vector Embedder."""

    def __init__(self, checkpoint: str) -> None:
        # Imported here so ColBERT is only required when this embedder is used.
        from colbert.modeling.checkpoint import Checkpoint
        from colbert.infra import ColBERTConfig

        self._ckpt = Checkpoint(checkpoint, ColBERTConfig())

    @property
    def dim(self) -> int:
        return 128

    def embed(self, text: str) -> "np.ndarray":
        vecs = self._ckpt.docFromText([text])[0]  # (num_tokens, 128)
        return np.asarray(vecs, dtype=np.float32)


# Use it from Python …
index = RetrievalIndex(embedder=ColbertEmbedder("colbert-ir/colbertv2.0"))

# … or run the MCP server with it.
server = create_server(embedder=ColbertEmbedder("colbert-ir/colbertv2.0"))
server.run()
```

Any object exposing dim: int and embed(text) -> ndarray[num_tokens, dim]
with L2-normalised rows satisfies the protocol, whether ColBERT, ColQwen, ColPali, or
your own. The ranking and chunking code is encoder-agnostic.
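
To make the contract concrete, the sketch below spells it out as a `typing.Protocol` and checks a toy embedder against it. Everything here is an assumption for illustration: `EmbedderLike`, `ToyHashingEmbedder`, and `satisfies_contract` are hypothetical names, and the toy character n-gram hashing only mimics the idea described for HashingEmbedder rather than reproducing the project's implementation.

```python
import hashlib
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class EmbedderLike(Protocol):
    """Hypothetical stand-in for the Embedder protocol described above."""

    @property
    def dim(self) -> int: ...

    def embed(self, text: str) -> np.ndarray: ...


class ToyHashingEmbedder:
    """One vector per whitespace token, built from hashed character n-grams."""

    def __init__(self, dim: int = 64, ngram: int = 3) -> None:
        self._dim = dim
        self._ngram = ngram

    @property
    def dim(self) -> int:
        return self._dim

    def _embed_token(self, token: str) -> np.ndarray:
        vec = np.zeros(self._dim, dtype=np.float32)
        padded = f"^{token}$"  # mark token boundaries
        for i in range(max(1, len(padded) - self._ngram + 1)):
            gram = padded[i : i + self._ngram]
            # hashlib is stable across runs, unlike the built-in hash().
            digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
            vec[int.from_bytes(digest, "big") % self._dim] += 1.0
        return vec / np.linalg.norm(vec)  # L2-normalise the row

    def embed(self, text: str) -> np.ndarray:
        tokens = text.lower().split()
        if not tokens:
            return np.zeros((0, self._dim), dtype=np.float32)
        return np.stack([self._embed_token(t) for t in tokens])


def satisfies_contract(embedder: EmbedderLike, sample: str = "late interaction") -> bool:
    """Check the output shape and unit-length rows on a sample string."""
    vecs = embedder.embed(sample)
    if vecs.ndim != 2 or vecs.shape[1] != embedder.dim:
        return False
    return bool(np.allclose(np.linalg.norm(vecs, axis=1), 1.0, atol=1e-5))


toy = ToyHashingEmbedder(dim=32)
print(isinstance(toy, EmbedderLike), satisfies_contract(toy))
```

A check along these lines can be handy before wiring a new encoder into the index, since unnormalised rows would silently change what the MaxSim scores mean.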
Architecture
```
src/mcp_retrieve/
  embedder.py    # Embedder protocol + deterministic HashingEmbedder default
  retrieval.py   # chunking, MaxSim, RetrievalIndex (pure — no MCP dependency)
  server.py      # FastMCP server exposing index_folder + search (only MCP import)
```

The retrieval and embedding cores import nothing MCP-related, so they are
testable and reusable on their own; the SDK is isolated to server.py and
imported lazily.
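
The lazy-import pattern can be sketched roughly as follows. This is not the project's server.py: `create_server_sketch` and its placeholder `echo` tool are hypothetical, and the only SDK pieces assumed are `FastMCP` from `mcp.server.fastmcp` and its `tool()` decorator.

```python
from typing import Any


def create_server_sketch(name: str = "mcp-retrieve-sketch") -> Any:
    """Build a FastMCP server, importing the MCP SDK only when called."""
    try:
        # Deferred import: merely importing this module never needs `mcp`.
        from mcp.server.fastmcp import FastMCP
    except ImportError as exc:
        raise RuntimeError("Install the 'mcp' package to run the server") from exc

    server = FastMCP(name)

    @server.tool()
    def echo(text: str) -> str:
        """Placeholder tool; the real server exposes index_folder and search."""
        return text

    return server
```

Because the import sits inside the function, code that only needs retrieval can import the package without the SDK present, and the missing dependency surfaces as a clear error only when a server is actually requested.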
Testing
```
python -m pytest
```

All retrieval and embedder tests run offline with the default embedder. The
end-to-end FastMCP tool test is skipped automatically when the mcp package is
not installed.
License
MIT © 2026 Max Baluev