
📄 MarkIndex MCP

Enterprise Document Intelligence Server

Python 3.11+ | MCP Protocol | License: MIT

MarkIndex is a production-ready Model Context Protocol server that empowers LLMs to accurately navigate and retrieve information from complex documents using Page Index RAG methodology.

It is built on Microsoft MarkItDown for universal document conversion, combined with a custom hierarchical section parser and TF-IDF search ranking.


✨ Features

| Capability | Description |
| --- | --- |
| 📥 Universal Ingestion | PDF, Word, Excel, PowerPoint, HTML, TXT, Markdown, URLs |
| 🎬 YouTube Transcripts | Auto-download and index video transcripts with time-chunking |
| 📂 Batch Directory Scan | Ingest all supported files from a directory in one call |
| 🌳 Hierarchical Parsing | Detects #, SECTION, CHAPTER, APPENDIX, numbered, Roman, and timestamp headers |
| 🔍 TF-IDF Search | Relevance-ranked full-text search with regex support and context snippets |
| 📖 Paginated Reading | Character-level pagination for reading large sections without overflow |
| 🧭 Tree Navigation | Parent, previous, next sibling traversal for sequential reading |
| 📝 Extractive Summaries | Term-frequency sentence scoring for quick section overviews |
| 💾 Persistent Cache | Markdown files with YAML frontmatter: human-readable, git-friendly |



โš™๏ธ How It Works: The 3-Folder Secret System

MarkIndex utilizes an organized, self-updating knowledge architecture:

  1. raw/: Drop your source materials here (PDFs, Word documents, HTML, etc.). The server reads these files but never alters them.

  2. wiki/: The server processes the raw files and structures them into cross-linked Markdown pages (one per document). It also generates a master index.md file that acts as a crawlable map, allowing the LLM to efficiently fetch context without wasting tokens.

  3. outputs/: Whenever you ask the LLM to write something based on your knowledge base, the resulting report, plan, or other output is saved to this folder automatically.

By implementing this architecture, you essentially build a self-updating, personal consultation engine tailored to your exact data and files.


โš–๏ธ Vector RAG vs. MarkIndex

How does our MarkIndex methodology compare to traditional Vector Database RAG?

| Feature | Vector RAG | MarkIndex RAG |
| --- | --- | --- |
| Context Preservation | 4/10 | 10/10 |
| Setup Complexity | 3/10 | 9/10 |
| Cost to Run | 5/10 | 10/10 |
| Sequential Reading | 2/10 | 10/10 |
| Token Efficiency | 3/10 | 9/10 |
| Fuzzy Semantic Match | 9/10 | 6/10 |
| Total Score | 26/60 | 54/60 |

MarkIndex excels by preserving the original document hierarchy and allowing the LLM to paginate through full, unbroken sections, rather than receiving fragmented, out-of-context vector chunks.
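Character-level pagination of a section can be sketched as follows. This is a minimal illustration, not MarkIndex's actual implementation: the `Page` dataclass, the `paginate` function, and the 4000-character default page size are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str          # the slice of the section for this page
    page: int          # 1-based page number
    total_pages: int   # number of pages at this page size
    has_more: bool     # True when a later page exists


def paginate(section_text: str, page: int = 1, page_size: int = 4000) -> Page:
    """Return one character-level page of a section (hypothetical helper)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Ceiling division; an empty section still counts as one (empty) page.
    total = max(1, -(-len(section_text) // page_size))
    if not 1 <= page <= total:
        raise ValueError(f"page must be between 1 and {total}")
    start = (page - 1) * page_size
    chunk = section_text[start:start + page_size]
    return Page(text=chunk, page=page, total_pages=total, has_more=page < total)
```

A client would keep requesting `page + 1` while `has_more` is true, so a long section is read in order without any part being dropped.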

Why MarkIndex RAG is Different:

  1. Hierarchy vs. Chunks: Traditional Vector RAG chops documents into arbitrary 500-token chunks, destroying the author's intended structure. MarkIndex parses the actual headers (#, Chapter 1, etc.) to create a navigable tree with stable, unique section IDs.

  2. Full Context: When an LLM asks MarkIndex for a section, it gets the entire section, exactly as it was written, rather than a few stitched-together vector matches that lack surrounding context.

  3. No Expensive Embeddings: Vector RAG requires passing every document through an embedding model (like OpenAI text-embedding-ada-002), which costs time and API credits. MarkIndex uses an ultra-fast, local, pure-Python N-Gram TF-IDF engine for advanced multi-word lexical search.

  4. Stable IDs & Context: MarkIndex derives section IDs deterministically from the section's path in the document (for example, chapter-1-summary-2), so the LLM can tell duplicate subheadings apart and request exactly the section it wants by ID.

  5. Token Efficiency: Vector RAG blindly dumps 5 to 10 disjointed chunks (2,500+ tokens) into the prompt. MarkIndex feeds the LLM a tiny structural map (index.md), and the LLM only fetches the specific, highly-relevant section it needs, drastically reducing token waste and API costs.

  6. LLM Agency: With MarkIndex, the LLM acts like a human reader. It can read the Table of Contents, search for keywords, jump to a specific section, and then navigate to the "next" or "previous" sections.
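The idea of a header-derived tree with stable, unique section IDs (points 1 and 4 above) can be sketched as follows. This is an assumption-laden illustration, not the project's parser: it handles only `#`-style headers, and `Section`, `slugify`, and `parse_sections` are hypothetical names. The exact ID scheme MarkIndex uses may differ.

```python
import re
from dataclasses import dataclass, field

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "section"


@dataclass
class Section:
    section_id: str
    title: str
    level: int
    lines: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


def parse_sections(markdown: str) -> list[Section]:
    """Build a section tree from '#' headers, assigning path-based unique IDs."""
    roots: list[Section] = []
    stack: list[Section] = []   # open ancestors, outermost first
    used: set[str] = set()      # IDs already handed out
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            if stack:           # body text belongs to the innermost open section
                stack[-1].lines.append(line)
            continue
        level, title = len(match.group(1)), match.group(2)
        while stack and stack[-1].level >= level:
            stack.pop()         # close siblings and deeper sections
        parent_id = stack[-1].section_id if stack else ""
        base = f"{parent_id}-{slugify(title)}" if parent_id else slugify(title)
        section_id, n = base, 1
        while section_id in used:   # duplicate heading: append -2, -3, ...
            n += 1
            section_id = f"{base}-{n}"
        used.add(section_id)
        node = Section(section_id, title, level)
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots
```

Because the ID depends only on the heading path and the order of duplicates, re-parsing an unchanged document yields the same IDs, which is what lets an LLM refer back to a section across calls.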

Architecture

MarkIndex uses a robust "3-Folder Secret System" for enterprise knowledge management:

  • raw/: Your original, untouched source documents (PDFs, Word docs, etc.).

  • wiki/: The LLM's internal representation, stored as hierarchical Markdown files with JSON frontmatter.

  • outputs/: Where the LLM automatically saves the persistent reports and answers it generates for you.

Note: You can strictly control whether the LLM is allowed to access files outside the raw/ directory via the MARKINDEX_ALLOW_EXTERNAL_FILES=true/false setting.
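A setting like this is commonly enforced with a path-containment check. The sketch below shows one way that could look; `resolve_source` is a hypothetical helper, and both the choice to resolve relative paths against raw/ and the exact parsing of the variable are assumptions rather than documented MarkIndex behavior.

```python
import os
from pathlib import Path


def resolve_source(filepath: str, raw_dir: str = "./raw") -> Path:
    """Resolve filepath, rejecting anything outside raw_dir unless allowed."""
    # Assumption: only the literal value "true" (any case) enables external access.
    allow_external = os.environ.get("MARKINDEX_ALLOW_EXTERNAL_FILES", "false").lower() == "true"
    root = Path(raw_dir).resolve()
    candidate = Path(filepath)
    if not candidate.is_absolute():
        candidate = root / candidate      # assumption: relative paths are relative to raw/
    candidate = candidate.resolve()       # collapses ".." and follows symlinks
    if not allow_external and not candidate.is_relative_to(root):
        raise PermissionError(f"{candidate} is outside {root}")
    return candidate
```

Resolving before comparing matters: a path such as `../secret.txt` only reveals that it escapes raw/ once the `..` segment has been collapsed.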

markindex-mcp/
├── markindex/                       # Python package
│   ├── __init__.py                  # Version & metadata
│   ├── __main__.py                  # python -m markindex
│   ├── config.py                    # Centralized Settings dataclass
│   ├── logger.py                    # Structured logging
│   ├── exceptions.py                # Custom exception hierarchy
│   ├── server.py                    # FastMCP server & lifecycle
│   ├── core/                        # Business logic
│   │   ├── parser.py                # Hierarchical document parser
│   │   ├── search.py                # TF-IDF ranking engine
│   │   ├── summarizer.py            # Extractive summarization
│   │   └── storage.py               # Frontmatter serialization & I/O
│   └── tools/                       # MCP tool definitions
│       ├── ingest.py                # Ingestion tools
│       ├── query.py                 # Querying tools
│       ├── navigate.py              # Navigation tools
│       └── manage.py                # Management tools
├── tests/                           # Test suite
├── pyproject.toml                   # PEP 621 packaging
├── requirements.txt                 # Dependencies
├── raw/                             # [NEW] Drop your source files here
├── wiki/                            # [NEW] Auto-generated markdown & master index.md
└── outputs/                         # [NEW] Claude's generated reports and summaries

🚀 Quick Start

Prerequisites

  • Python 3.11+

  • pip

Installation

# Clone the repository
git clone https://github.com/rajfazulhussain2008/markindex-mcp.git
cd markindex-mcp

# Create a virtual environment
python -m venv venv
venv\Scripts\activate       # Windows
# source venv/bin/activate  # macOS/Linux

# Install dependencies
pip install -r requirements.txt

# Optional: YouTube transcript support
pip install youtube-transcript-api

Running the Server

# Run as a module
python -m markindex

# Or use the CLI entry point (after pip install -e .)
markindex

MCP Client Configuration

Add to your MCP client config (e.g., Claude Desktop):

{
  "mcpServers": {
    "markindex": {
      "command": "python",
      "args": ["-m", "markindex"],
      "cwd": "/path/to/markindex-mcp"
    }
  }
}

🔧 Configuration

All settings are managed via environment variables (prefix: MARKINDEX_):

| Variable | Default | Description |
| --- | --- | --- |
| MARKINDEX_RAW_DIR | ./raw | Source materials directory |
| MARKINDEX_WIKI_DIR | ./wiki | Processed markdown & master index directory |
| MARKINDEX_OUTPUTS_DIR | ./outputs | AI-generated reports directory |
| MARKINDEX_LOG_LEVEL | INFO | Log verbosity: DEBUG, INFO, WARNING, ERROR |
| MARKINDEX_ALLOW_EXTERNAL_FILES | false | Enable access outside raw/ directory |

Copy .env.example → .env and customize as needed.
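For orientation, here is a sketch of how the variables above might be loaded into a settings dataclass. The variable names and defaults come from the table; the `Settings` fields, `from_env`, and `_env_bool` are hypothetical, and the real config.py may be structured differently. The sketch reads only the process environment and does not load a .env file itself.

```python
import os
from dataclasses import dataclass
from pathlib import Path

VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean variable; accepting 1/yes/on is an assumption of this sketch."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    raw_dir: Path
    wiki_dir: Path
    outputs_dir: Path
    log_level: str
    allow_external_files: bool

    @classmethod
    def from_env(cls) -> "Settings":
        """Build settings from MARKINDEX_* variables, using the documented defaults."""
        level = os.environ.get("MARKINDEX_LOG_LEVEL", "INFO").upper()
        if level not in VALID_LOG_LEVELS:
            raise ValueError(f"Unsupported MARKINDEX_LOG_LEVEL: {level}")
        return cls(
            raw_dir=Path(os.environ.get("MARKINDEX_RAW_DIR", "./raw")),
            wiki_dir=Path(os.environ.get("MARKINDEX_WIKI_DIR", "./wiki")),
            outputs_dir=Path(os.environ.get("MARKINDEX_OUTPUTS_DIR", "./outputs")),
            log_level=level,
            allow_external_files=_env_bool("MARKINDEX_ALLOW_EXTERNAL_FILES", False),
        )
```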


📚 Tool Reference

Core Tools

All tools return a dictionary with the same shape: {"success": true/false, "data": ..., "error": null, "code": null}

  1. ingest_document(filepath): Download a URL (with strict size/type safety constraints) or ingest a local file.

  2. ingest_directory(dir_path): Recursively ingest a whole folder.

  3. list_documents(): View all ingested docs.

  4. delete_document(doc_id): Completely purge a document from memory and disk.
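The shared return shape makes client-side handling uniform. The helpers below sketch how such an envelope could be built and consumed; `ok`, `fail`, and `unwrap` are hypothetical names, and the string error code used in the usage note is invented for illustration, since the source does not list the actual codes.

```python
from typing import Any


def ok(data: Any) -> dict[str, Any]:
    """Build a success envelope with the four documented keys."""
    return {"success": True, "data": data, "error": None, "code": None}


def fail(message: str, code: str) -> dict[str, Any]:
    """Build a failure envelope; the type of 'code' is an assumption."""
    return {"success": False, "data": None, "error": message, "code": code}


def unwrap(result: dict[str, Any]) -> Any:
    """Return the payload of a successful result, or raise with the error details."""
    if result.get("success"):
        return result.get("data")
    raise RuntimeError(f"[{result.get('code')}] {result.get('error')}")
```

For example, `unwrap(ok(["report.pdf"]))` returns the list, while `unwrap(fail("no such document", "DOC_NOT_FOUND"))` raises a `RuntimeError` carrying both the code and the message.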

LLM Exploration Tools

  1. get_document_outline(doc_id): View the document's structure, titles, stable IDs, and sizes.

  2. search_sections(doc_id, query): Find specific keywords or regex patterns using the built-in N-Gram TF-IDF engine.

  3. read_section(doc_id, section_id): Fetch the full markdown content of a section.

  4. get_adjacent_sections(doc_id, section_id): Read the parent, previous, or next section.

  5. summarize_section(doc_id, section_id): Generate an extractive summary of a huge section without filling up the context window.
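To make the N-Gram TF-IDF idea behind search_sections concrete, here is a small pure-Python sketch that ranks sections by unigram and bigram overlap with a query. It is not the project's search engine: `ngrams` and `rank_sections` are hypothetical names, the smoothed IDF formula is one common variant chosen for the example, and regex queries and context snippets are left out.

```python
import math
import re
from collections import Counter


def ngrams(text: str, max_n: int = 2) -> list[str]:
    """Lowercased word n-grams for n = 1..max_n."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def rank_sections(sections: dict[str, str], query: str, max_n: int = 2) -> list[tuple[str, float]]:
    """Score each section against the query with TF-IDF over word n-grams."""
    term_counts = {sid: Counter(ngrams(text, max_n)) for sid, text in sections.items()}
    # Document frequency: in how many sections each term occurs at least once.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    total = len(sections)
    query_terms = set(ngrams(query, max_n))
    scores = []
    for sid, counts in term_counts.items():
        length = sum(counts.values()) or 1
        score = 0.0
        for term in query_terms:
            if term in counts:
                idf = math.log((1 + total) / (1 + doc_freq[term])) + 1.0  # smoothed IDF
                score += (counts[term] / length) * idf
        if score > 0:
            scores.append((sid, score))
    # Highest-scoring sections first; sections with no matching term are omitted.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Including bigrams means a multi-word query such as "persistent cache" rewards sections where the words appear together, not just sections that mention each word somewhere.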

Management Tools

| Tool | Description |
| --- | --- |
| list_documents() | List all ingested documents |
| delete_document(doc_id) | Delete a document from index and cache |
| save_to_outputs(filename, content) | Save AI-generated reports to the outputs/ folder |


🧪 Testing

# Run the test suite
python -m pytest tests/ -v

# With coverage
python -m pytest tests/ --cov=markindex --cov-report=term-missing

📄 License

This project is licensed under the MIT License.


Built with ❤️ by Rajmohamed H
