Skip to main content
Glama
maxbaluev

mcp-retrieve

by maxbaluev

mcp-retrieve

An MCP server that exposes late-interaction document retrieval (ColBERT-style MaxSim) over a local folder. Point it at a directory of text/markdown/code, and any MCP client — Claude Desktop, an IDE agent, your own host — can index it and search it with token-level relevance.

It ships with a deterministic, model-free default embedder, so the server and its full test suite run offline with no model weights, no API key, no network. When you want production-grade semantics, drop in a real ColBERT / ColQwen encoder behind a small protocol — ranking code does not change.

What is MCP?

The Model Context Protocol is an open standard that lets LLM applications connect to external tools and data through a uniform server interface. A host (e.g. Claude Desktop) launches MCP servers and calls the tools they advertise. mcp-retrieve is such a server; it advertises two tools:

Tool

Purpose

index_folder(folder)

Read text files under folder, chunk them, embed each chunk into a multi-vector representation, and build an in-memory index.

search(query, k=5)

Rank indexed chunks by MaxSim late interaction and return the top k with source path, score, and snippet.

Related MCP server: mcp-context

The retrieval approach: late interaction (ColBERT)

Most dense retrievers compress a passage into one vector and compare it to one query vector — cheap, but lossy. Late interaction, introduced by ColBERT (Khattab & Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020, arXiv:2004.12832), keeps one vector per token for both the query and the document and defers their interaction to scoring time via the MaxSim operator:

score(q, d) = Σ_i  max_j  sim(q_i, d_j)

Each query token q_i is matched to its single most similar document token d_j, and those per-token maxima are summed. This preserves fine-grained term matching (a query term can find its evidence anywhere in the passage) while staying efficient. With L2-normalised vectors, sim is cosine similarity, so MaxSim reduces to a dot product followed by a row-wise max and a sum — which is exactly what mcp_retrieve.retrieval.maxsim computes.

Install

pip install -e .          # core: mcp + numpy
pip install -e ".[dev]"   # plus pytest for the test suite

Register with an MCP client

For Claude Desktop, add the server to its mcpServers config (claude_desktop_config.json):

{
  "mcpServers": {
    "mcp-retrieve": {
      "command": "mcp-retrieve"
    }
  }
}

If you installed into a virtual environment, use the absolute path to the mcp-retrieve console script (or "command": "python", "args": ["-m", "mcp_retrieve"]). Restart the client, then ask it to index a folder and search it — the model will call the index_folder and search tools.

Usage from Python

from mcp_retrieve import RetrievalIndex

index = RetrievalIndex()                 # deterministic default embedder
index.index_folder("./docs")
for hit in index.search("late interaction maxsim", k=5):
    print(f"{hit.score:.3f}  {hit.chunk.source}  {hit.snippet}")

Plugging in a real late-interaction model

The default HashingEmbedder makes the project run anywhere, but it matches on character n-grams, not meaning. For real semantics, implement the Embedder protocol around a trained encoder and pass it in:

import numpy as np
from mcp_retrieve import RetrievalIndex
from mcp_retrieve.server import create_server

class ColbertEmbedder:
    """Wrap a ColBERT checkpoint as a multi-vector Embedder."""
    def __init__(self, checkpoint: str) -> None:
        from colbert.modeling.checkpoint import Checkpoint
        from colbert.infra import ColBERTConfig
        self._ckpt = Checkpoint(checkpoint, ColBERTConfig())

    @property
    def dim(self) -> int:
        return 128

    def embed(self, text: str) -> "np.ndarray":
        vecs = self._ckpt.docFromText([text])[0]   # (num_tokens, 128)
        return np.asarray(vecs, dtype=np.float32)

# Use it from Python …
index = RetrievalIndex(embedder=ColbertEmbedder("colbert-ir/colbertv2.0"))

# … or run the MCP server with it.
server = create_server(embedder=ColbertEmbedder("colbert-ir/colbertv2.0"))
server.run()

Any object exposing dim: int and embed(text) -> ndarray[num_tokens, dim] with L2-normalised rows satisfies the protocol — ColBERT, ColQwen, ColPali, or your own. The ranking and chunking code is encoder-agnostic.

Architecture

src/mcp_retrieve/
  embedder.py    # Embedder protocol + deterministic HashingEmbedder default
  retrieval.py   # chunking, MaxSim, RetrievalIndex (pure — no MCP dependency)
  server.py      # FastMCP server exposing index_folder + search (only MCP import)

The retrieval and embedding cores import nothing MCP-related, so they are testable and reusable on their own; the SDK is isolated to server.py and imported lazily.

Testing

python -m pytest

All retrieval and embedder tests run offline with the default embedder. The end-to-end FastMCP tool test is skipped automatically when the mcp package is not installed.

License

MIT © 2026 Max Baluev

A
license - permissive license
-
quality - not tested
-
maintenance - not tested

Resources

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/maxbaluev/mcp-retrieve'

If you have feedback or need assistance with the MCP directory API, please join our Discord server