MCP-Knowledge-Toolbox

by lhhub10086

MCP-Knowledge-Toolbox is a local knowledge-base MCP toolbox built on top of the Project 1 DocuPilot-RAG baseline. This project (Project 2) leaves the Project 1 core code unmodified and packages local document ingest, retrieval, context reading, citation checking, and evaluation-report reading as MCP-callable tools.

This repository is currently an engineering MVP, not a production multi-tenant RAG platform.

Architecture

flowchart LR
    A[Local Documents] --> B[Parser]
    B --> C[Chunker]
    C --> D[SQLite Metadata Store]
    C --> E[Vector Index]
    C --> F[BM25 Index]
    E --> G[Hybrid Retriever]
    F --> G
    G --> H[Lightweight Reranker]
    H --> I[MCP Tools]
    I --> J[MCP stdio Client]
    I --> K[Citation Verifier]
    I --> L[Eval Report Reader]
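
The flowchart shows the vector and BM25 indexes feeding a single hybrid retriever, but this README does not say how the two result lists are merged. As a minimal sketch, assuming reciprocal rank fusion (a common choice, not confirmed as this project's method), the merge step could look like the following. The function name, the constant `k=60`, and the chunk ids are all illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

vector_hits = ["c3", "c1", "c7"]  # hypothetical ids from the vector index
bm25_hits = ["c1", "c9", "c3"]    # hypothetical ids from the BM25 index
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Rank-based fusion avoids having to normalize BM25 scores and cosine similarities onto a common scale, which is why it is a frequent default for small hybrid setups.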

Tech Stack

  • Python 3.10/3.11 compatible code path

  • SQLite metadata store

  • MCP stdio JSON-RPC compatible MVP transport

  • Optional official MCP Python SDK when installed

  • sentence-transformers with BAAI/bge-small-zh-v1.5 as the default embedding model

  • hashing vector fallback when the embedding model is unavailable

  • PyMuPDF for PDF, python-docx for docx, native readers for Markdown/txt

  • pytest integration tests
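
The hashing vector fallback listed above is not described in detail here. A minimal sketch of the general technique (the hashing trick), assuming a 256-dimensional signed bag-of-tokens vector, might look like this. The function names, the dimension, and the `\w+` tokenizer are assumptions; text without whitespace between words, such as Chinese, would need a different tokenizer (for example character n-grams).

```python
import hashlib
import math
import re

def hashing_vector(text, dim=256):
    """Map text to a fixed-size unit vector using the hashing trick."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        # MD5 is used only as a stable, process-independent hash, not for security.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # Leave an all-zero vector untouched (for example, for empty text).
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

query_vec = hashing_vector("local knowledge base")
chunk_vec = hashing_vector("a local knowledge base toolbox")
print(round(cosine(query_vec, chunk_vec), 3))
```

Such vectors capture only token overlap, not meaning, which fits their role as a fallback when the embedding model is unavailable.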

Tools

The server exposes 11 tools:

ingest_file, ingest_folder, search_knowledge, read_chunk_neighbors, summarize_document, query_table, verify_citation, get_eval_report, list_documents, delete_document, server_status.

MCP Compatibility

The current implementation is an MCP stdio JSON-RPC compatible MVP. It uses the official MCP Python SDK when that package is installed and otherwise falls back to the built-in stdio JSON-RPC transport.

| MCP capability | Status | Notes |
| --- | --- | --- |
| stdio transport | Supported | Used by scripts/run_mcp_server.py. |
| initialize | Supported | Returns protocol version, server info, and tool capability. |
| tools/list | Supported | Returns all registered tool schemas. |
| tools/call | Supported | Returns text content and structuredContent. |
| notifications/initialized | Accepted | Notification is ignored safely. |
| resources | Not implemented | No MCP resources are exposed yet. |
| prompts | Not implemented | No MCP prompts are exposed yet. |
| sampling | Not implemented | No LLM sampling bridge. |
| streaming progress | Not verified | Tool calls are request/response only. |
| official SDK mode | Optional | Depends on mcp package availability. |
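
To make the supported methods concrete, here is a sketch of the JSON-RPC 2.0 messages a stdio client might send, in order. It assumes newline-delimited JSON framing as in the MCP stdio transport; whether the built-in transport frames messages exactly this way is not stated here. The protocol version string, the client name, and the `search_knowledge` argument names (`query`, `collection`, `top_k`) are hypothetical; the real schemas come from `tools/list`.

```python
import json

def make_request(request_id, method, params=None):
    """Build a JSON-RPC 2.0 request (a response with the same id is expected)."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message

def make_notification(method, params=None):
    """Build a JSON-RPC 2.0 notification (no id, so no response is expected)."""
    message = {"jsonrpc": "2.0", "method": method}
    if params is not None:
        message["params"] = params
    return message

def encode_line(message):
    """Serialize one message as a single newline-terminated JSON line."""
    return json.dumps(message, ensure_ascii=False) + "\n"

session = [
    make_request(1, "initialize", {
        "protocolVersion": "2024-11-05",  # assumed version string
        "capabilities": {},
        "clientInfo": {"name": "demo-client", "version": "0.1.0"},  # hypothetical
    }),
    make_notification("notifications/initialized"),
    make_request(2, "tools/list"),
    make_request(3, "tools/call", {
        "name": "search_knowledge",
        # Argument names below are assumptions, not the tool's documented schema.
        "arguments": {"query": "hybrid retrieval", "collection": "demo", "top_k": 5},
    }),
]

wire = "".join(encode_line(m) for m in session)
print(wire, end="")
```

In a real client these lines would be written to the stdin of the server process started from scripts/run_mcp_server.py, with one JSON response read back from stdout per request.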

Reproduce From Scratch

From a fresh clone:

pip install -r requirements.txt
python scripts/ingest_demo_docs.py --input data/raw --collection demo
python scripts/build_index.py --collection demo
python scripts/run_mcp_stdio_client_demo.py
pytest tests

Expected scale after ingest:

ingested files: 20
success: 20
failed: 0
chunks: 1201
documents: 20
collections: demo
embedding_provider: sentence-transformers

End-to-End Demo

Generate the full E2E MCP log:

python scripts/run_e2e_demo.py --collection e2e --input data/raw --output docs/e2e_demo_log.md

The log records:

  • MCP server startup through stdio subprocess

  • stdio client initialize

  • tools/list

  • tools/call ingest_folder

  • tools/call list_documents

  • tools/call search_knowledge

  • tools/call read_chunk_neighbors

  • tools/call verify_citation

  • final answer with citations

See docs/e2e_demo_log.md.

Retrieval Evaluation

Generate 50 QA samples and evaluate four retrieval strategies:

python scripts/run_retrieval_eval.py --collection demo

Outputs:

  • data/eval/demo_qa.jsonl

  • docs/retrieval_eval_report.md

Current measured metrics:

| Strategy | Hit@3 | Hit@5 | MRR | Avg Latency (ms) |
| --- | --- | --- | --- | --- |
| bm25 | 0.400 | 0.400 | 0.400 | 193.55 |
| vector | 0.340 | 0.340 | 0.340 | 82.97 |
| hybrid | 0.460 | 0.460 | 0.460 | 84.71 |
| hybrid_rerank | 0.460 | 0.460 | 0.460 | 80.97 |

Hybrid retrieval outperformed both single-mode strategies on this demo set: 0.460 on Hit@3, Hit@5, and MRR, versus 0.400 for BM25 and 0.340 for vector. Adding the reranker did not change hybrid's accuracy; the report explains that the corpus is synthetic and repetitive, so first-stage retrieval already ranks many expected documents at the top.
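
For readers unfamiliar with the metrics, the following sketch computes Hit@k and MRR under their standard definitions. It is not the code in scripts/run_retrieval_eval.py, and it assumes one expected document id per question; the sample data is made up.

```python
def hit_at_k(ranked_ids, expected_id, k):
    """1.0 if the expected id appears in the top k results, else 0.0."""
    return 1.0 if expected_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, expected_id):
    """1 / rank of the expected id, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == expected_id:
            return 1.0 / rank
    return 0.0

def evaluate(samples, k_values=(3, 5)):
    """Average Hit@k and MRR over (ranked_ids, expected_id) pairs."""
    n = len(samples)
    report = {
        f"hit@{k}": sum(hit_at_k(r, e, k) for r, e in samples) / n
        for k in k_values
    }
    report["mrr"] = sum(reciprocal_rank(r, e) for r, e in samples) / n
    return report

# Hypothetical results: hits at rank 1, rank 3, rank 5, and one miss.
samples = [
    (["d1", "d2", "d3"], "d1"),
    (["d4", "d5", "d6"], "d6"),
    (["d2", "d1", "d3", "d4", "d5"], "d5"),
    (["d7", "d8", "d9"], "d0"),
]
report = evaluate(samples)
print(report)
```

Under these definitions, MRR equals Hit@k whenever every hit lands at rank 1 and every miss scores zero, which is one possible reading of the identical Hit@3, Hit@5, and MRR columns in the table above.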

Final Acceptance Artifacts

  • docs/e2e_demo_log.md

  • docs/retrieval_eval_report.md

  • docs/final_acceptance.md

  • data/eval/demo_qa.jsonl

Limitations

  • hashing vector is only a fallback when the sentence-transformers model is unavailable.

  • verify_citation is a lightweight keyword/similarity check, not an LLM judge.

  • query_table is Markdown table caption/content matching, not complex table reasoning.

  • rerank is lightweight token-overlap reranking, not a cross-encoder reranker.

  • summarize_document uses extractive summarization when no LLM is configured.

  • current storage is local SQLite and local JSON indexes, not a distributed vector database.

  • current MCP support covers tools over stdio, not resources/prompts/sampling.

  • this is not a production-grade multi-tenant platform.
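
To illustrate what "lightweight" means for the reranker and the citation check above, here is a sketch built on a single token-overlap score. It assumes overlap is measured as the fraction of query (or claim) tokens found in the chunk; the function names, the `\w+` tokenizer, and the 0.6 threshold are illustrative and not taken from the project.

```python
import re

def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def overlap_score(query, text):
    """Fraction of query tokens that also appear in text."""
    q = tokens(query)
    return len(q & tokens(text)) / len(q) if q else 0.0

def rerank(query, candidates):
    """Reorder (chunk_id, text) pairs by overlap; ties keep their input order."""
    return sorted(candidates, key=lambda c: -overlap_score(query, c[1]))

def check_citation(claim, chunk_text, threshold=0.6):
    """Flag a claim as supported when enough of its tokens occur in the chunk."""
    score = overlap_score(claim, chunk_text)
    return {"supported": score >= threshold, "score": round(score, 3)}

candidates = [
    ("c1", "SQLite stores document metadata"),
    ("c2", "BM25 index uses token statistics"),
    ("c3", "The BM25 index and vector index feed the hybrid retriever"),
]
ranked = rerank("hybrid retriever BM25", candidates)
print([chunk_id for chunk_id, _ in ranked])
print(check_citation("The hybrid retriever uses BM25", candidates[2][1]))
```

A check like this cannot detect paraphrase or negation, which is the gap an LLM judge or cross-encoder reranker would close.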

Resume Wording

MCP-Knowledge-Toolbox: a local knowledge-base MCP toolbox for Agent workflows. Built an MCP stdio JSON-RPC compatible server exposing 11 tools for document ingest, SQLite metadata management, sentence-transformers vector retrieval, BM25, hybrid retrieval, context reading, citation verification, document deletion sync, and evaluation report reading. Added an end-to-end stdio client demo, 50-sample retrieval evaluation, and 37 pytest tests. Demo acceptance reached 20 documents and 1201 chunks across Markdown, txt, docx, and PDF.
