tika-mcp

by Edgaras0x4E

tika-mcp is an MCP (Model Context Protocol) server that wraps Apache Tika and exposes document parsing tools over streamable-http.

It accepts local files and, optionally, remote URLs (with SSRF protections). It provides plain-text, metadata, and HTML extraction, MIME type detection, and recursive extraction of archive contents.

Features

  • Async direct Tika Server HTTP integration (/tika, /meta, /detect/stream, /rmeta)

  • MCP tools: extract_text, extract_metadata, detect_mime_type, extract_html, extract_documents

  • Local file controls (TIKA_ALLOW_LOCAL_FILES, TIKA_ALLOWED_LOCAL_ROOTS)

  • Optional remote URL ingestion with blocked private/internal targets by default

  • Bounded input and output sizes with clear tool-facing errors

  • Optional bearer-token protection for exposed MCP HTTP deployments
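The "blocked private/internal targets" feature can be pictured with a short sketch. This is not tika-mcp's actual code: the function names `is_blocked_address` and `check_remote_url` and the injectable `resolve` parameter are hypothetical. It only assumes that the check amounts to resolving the host and refusing private, loopback, link-local, and similar addresses.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # mirrors the TIKA_ALLOWED_URL_SCHEMES default


def is_blocked_address(host_ip: str) -> bool:
    """Return True for addresses a default-deny SSRF policy would refuse."""
    addr = ipaddress.ip_address(host_ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def check_remote_url(url: str, resolve=socket.getaddrinfo) -> None:
    """Raise ValueError if the scheme or any resolved address is not allowed.

    `resolve` is injectable so the check can be exercised without real DNS.
    """
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    # Check every address the name resolves to, not just the first one.
    for info in resolve(parts.hostname, parts.port or 0):
        ip = info[4][0]
        if is_blocked_address(ip):
            raise ValueError(f"blocked target address: {ip}")
```

A production check would also need to repeat this for every redirect hop and connect to the address it validated, otherwise a DNS answer that changes between check and fetch could bypass it.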

Requirements

  • Python 3.11+

  • Apache Tika Server reachable at TIKA_URL (default http://localhost:9998)
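To make the Tika Server dependency concrete, here is a minimal sketch of how requests to the endpoints named under Features could be built. The operation names and the operation-to-endpoint mapping are assumptions for illustration, and tika-mcp itself is described as async, whereas this sketch uses the synchronous standard library. The endpoints and the PUT convention belong to Tika Server.

```python
import urllib.request

TIKA_URL = "http://localhost:9998"  # default from the Requirements list

# (path, Accept header) per operation. The operation names and this mapping
# are illustrative assumptions, not taken from tika-mcp's source.
ENDPOINTS = {
    "text": ("/tika", "text/plain"),
    "html": ("/tika", "text/html"),
    "metadata": ("/meta", "application/json"),
    "mime": ("/detect/stream", "text/plain"),
    "recursive": ("/rmeta", "application/json"),
}


def build_tika_request(
    operation: str, body: bytes, base_url: str = TIKA_URL
) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the document bytes."""
    path, accept = ENDPOINTS[operation]
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        method="PUT",
        headers={
            "Accept": accept,
            # Set explicitly so urllib does not default to a form content type.
            "Content-Type": "application/octet-stream",
        },
    )
```

With a Tika Server running, `urllib.request.urlopen(build_tika_request("text", data), timeout=30)` would send the request; the timeout value here mirrors the TIKA_TIMEOUT_SECONDS default.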

Installation

pip install tika-mcp

Or from source:

pip install .

Quick Start

  1. Install tika-mcp:

    pip install tika-mcp

  2. Start Apache Tika Server:

    docker run --rm -p 9998:9998 apache/tika:3.3.1.0

  3. Start tika-mcp:

    export TIKA_URL=http://localhost:9998
    tika-mcp

    The server now listens at http://127.0.0.1:8000/mcp (transport: streamable-http).

Tools

Every tool takes a single argument, source, which is either a local file path or an http(s):// URL. Remote URLs are accepted only when TIKA_ALLOW_REMOTE_URLS=true. All tools are read-only.

| Tool | What it does |
| --- | --- |
| extract_text | Extract plain text from a document (PDF, Office, HTML, and other Tika-supported formats). |
| extract_metadata | Return document metadata (author, title, content type, page count, …) without the body text. |
| detect_mime_type | Detect the file's MIME/content type from its bytes, without full parsing. |
| extract_html | Extract structured XHTML with headings, tables, and links preserved. |
| extract_documents | Unpack a container/archive/compound file (zip, email, compound doc) and return each embedded document. |
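Because every tool listed above shares the same single-argument shape, one helper can build the call for any of them. The sketch below produces the JSON-RPC `tools/call` message defined by the MCP specification; the helper name `tool_call` and the id counter are illustrative only.

```python
import json
from itertools import count

TOOLS = {
    "extract_text",
    "extract_metadata",
    "detect_mime_type",
    "extract_html",
    "extract_documents",
}
_ids = count(1)  # simple per-process request id generator


def tool_call(tool: str, source: str) -> str:
    """Return the JSON-RPC body for calling one tika-mcp tool on `source`."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_ids),
            "method": "tools/call",
            "params": {"name": tool, "arguments": {"source": source}},
        }
    )
```

A client sends this body to the MCP endpoint only after the `initialize` handshake has completed; most MCP client libraries handle both steps for you.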

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| TIKA_MCP_HOST | 127.0.0.1 | MCP bind host |
| TIKA_MCP_PORT | 8000 | MCP bind port |
| TIKA_MCP_PATH | /mcp | Streamable HTTP endpoint path |
| TIKA_MCP_BEARER_TOKEN | unset | Optional bearer token required for MCP HTTP requests |
| TIKA_URL | http://localhost:9998 | Tika Server base URL |
| TIKA_TIMEOUT_SECONDS | 30 | Request timeout for Tika and remote downloads |
| TIKA_MAX_FILE_SIZE_MB | 25 | Maximum local/remote input file size |
| TIKA_MAX_OUTPUT_SIZE_MB | 10 | Max MCP tool response size; also caps the streamed Tika response body before it is buffered |
| TIKA_ALLOW_LOCAL_FILES | true | Enable local file sources |
| TIKA_ALLOWED_LOCAL_ROOTS | unset | Comma-separated allowed local path roots |
| TIKA_ALLOW_REMOTE_URLS | false | Enable remote URL sources |
| TIKA_ALLOWED_URL_SCHEMES | http,https | Allowed remote URL schemes |
| TIKA_BLOCK_PRIVATE_IPS | true | Block private/loopback/link-local/internal targets |
| TIKA_MAX_REDIRECTS | 5 | Max remote URL redirects |
| TIKA_RECURSIVE_MAX_DEPTH | unset | Max recursive archive depth |
| TIKA_RECURSIVE_MAX_FILES | unset | Max number of recursive extracted files |
| TIKA_RECURSIVE_MAX_TOTAL_SIZE_MB | unset | Max total recursive expanded text size |
| TIKA_PDF_EXTRACT_MARKED_CONTENT | unset | Send X-Tika-PDFextractMarkedContent; true preserves paragraph structure for tagged PDFs |
| TIKA_PDF_EXTRACT_ANNOTATION_TEXT | unset | Send X-Tika-PDFextractAnnotationText; false avoids duplicate hyperlink URLs |
| TIKA_PDF_SORT_BY_POSITION | unset | Send X-Tika-PDFsortByPosition; true orders text by visual position (untagged PDFs) |
| TIKA_COLLAPSE_BLANK_LINES | false | Collapse runs of blank lines in extracted text (like cat -s) |
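As a rough picture of how a subset of the variables listed above could be read, here is a sketch of an environment loader. The `Settings` class, the `load_settings` function, and the set of strings treated as true are all assumptions; only the variable names and defaults come from the Configuration section.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean variable; the accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    path: str
    tika_url: str
    timeout_seconds: float
    max_file_size_bytes: int
    allow_local_files: bool
    allow_remote_urls: bool
    allowed_local_roots: tuple[str, ...]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, falling back to documented defaults."""
    roots = env.get("TIKA_ALLOWED_LOCAL_ROOTS", "")
    return Settings(
        host=env.get("TIKA_MCP_HOST", "127.0.0.1"),
        port=int(env.get("TIKA_MCP_PORT", "8000")),
        path=env.get("TIKA_MCP_PATH", "/mcp"),
        tika_url=env.get("TIKA_URL", "http://localhost:9998"),
        timeout_seconds=float(env.get("TIKA_TIMEOUT_SECONDS", "30")),
        max_file_size_bytes=int(float(env.get("TIKA_MAX_FILE_SIZE_MB", "25")) * 1024 * 1024),
        allow_local_files=_env_bool(env, "TIKA_ALLOW_LOCAL_FILES", True),
        allow_remote_urls=_env_bool(env, "TIKA_ALLOW_REMOTE_URLS", False),
        # Comma-separated list; empty entries are ignored.
        allowed_local_roots=tuple(r.strip() for r in roots.split(",") if r.strip()),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to inspect: `load_settings({})` returns the documented defaults for the fields shown.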

Example

Set any of the above in your shell, then run tika-mcp. It starts a streamable-HTTP server on TIKA_MCP_HOST:TIKA_MCP_PORT at TIKA_MCP_PATH (default http://127.0.0.1:8000/mcp):

export TIKA_URL=http://localhost:9998
export TIKA_MCP_PORT=8000
export TIKA_PDF_EXTRACT_MARKED_CONTENT=true
export TIKA_PDF_EXTRACT_ANNOTATION_TEXT=false
export TIKA_COLLAPSE_BLANK_LINES=true
tika-mcp

PDF structure preservation

By default Tika extracts PDF text line-by-line, which breaks paragraphs mid-sentence. For tagged PDFs (Google Docs / Word "Save as PDF"), rebuild real paragraphs with:

export TIKA_PDF_EXTRACT_MARKED_CONTENT=true    # use the PDF's structure tree for paragraphs
export TIKA_PDF_EXTRACT_ANNOTATION_TEXT=false  # drop duplicate hyperlink URLs
export TIKA_COLLAPSE_BLANK_LINES=true          # squeeze blank lines (like `cat -s`)
tika-mcp

For untagged PDFs (no structure tree), use TIKA_PDF_SORT_BY_POSITION=true instead, which orders text by visual position.
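The two mechanisms in this section can be sketched in a few lines: forwarding the PDF variables as Tika request headers, and squeezing blank lines in the result. The header names come from the Configuration section. The function names, the rule that unset variables send no header, and the choice to treat whitespace-only lines as blank are assumptions.

```python
from typing import Mapping

PDF_HEADER_BY_ENV = {
    "TIKA_PDF_EXTRACT_MARKED_CONTENT": "X-Tika-PDFextractMarkedContent",
    "TIKA_PDF_EXTRACT_ANNOTATION_TEXT": "X-Tika-PDFextractAnnotationText",
    "TIKA_PDF_SORT_BY_POSITION": "X-Tika-PDFsortByPosition",
}


def pdf_headers(env: Mapping[str, str]) -> dict[str, str]:
    """Map set PDF variables to Tika headers; unset ones send nothing."""
    return {
        header: env[name]
        for name, header in PDF_HEADER_BY_ENV.items()
        if env.get(name)
    }


def collapse_blank_lines(text: str) -> str:
    """Keep at most one blank line in a row, similar to `cat -s`."""
    out = []
    previous_blank = False
    for line in text.splitlines(keepends=True):
        blank = line.strip() == ""  # whitespace-only lines count as blank here
        if blank and previous_blank:
            continue  # drop the second and later blank lines of a run
        out.append(line)
        previous_blank = blank
    return "".join(out)
```

For example, `collapse_blank_lines("a\n\n\n\nb\n")` yields `"a\n\nb\n"`, leaving a single blank line between the two paragraphs.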

Streamable HTTP Client Configuration

Endpoint format:

  • URL: http://<TIKA_MCP_HOST>:<TIKA_MCP_PORT><TIKA_MCP_PATH>

  • transport: streamable-http

Example MCP client config (generic):

{
  "mcpServers": {
    "tika": {
      "transport": {
        "type": "streamable-http",
        "url": "http://127.0.0.1:8000/mcp"
      }
    }
  }
}
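If the bind settings differ from the defaults, the client config has to change with them. The following sketch derives the endpoint URL from the three bind settings and emits a config in the generic shape shown above; `endpoint_url` and `client_config` are hypothetical helper names, and the exact config schema depends on the MCP client in use.

```python
import json


def endpoint_url(host: str = "127.0.0.1", port: int = 8000, path: str = "/mcp") -> str:
    """Compose http://<TIKA_MCP_HOST>:<TIKA_MCP_PORT><TIKA_MCP_PATH>."""
    if not path.startswith("/"):
        path = "/" + path  # tolerate a path given without the leading slash
    return f"http://{host}:{port}{path}"


def client_config(name: str, url: str) -> str:
    """Render a generic streamable-http client entry as pretty-printed JSON."""
    config = {
        "mcpServers": {
            name: {"transport": {"type": "streamable-http", "url": url}}
        }
    }
    return json.dumps(config, indent=2)
```

Calling `client_config("tika", endpoint_url())` reproduces the default example. When the server binds to 0.0.0.0, the client URL should use an address the client can actually reach rather than the bind address.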

Bearer Token Protection

Set TIKA_MCP_BEARER_TOKEN to require authenticated MCP requests:

TIKA_MCP_BEARER_TOKEN=super-secret tika-mcp

Request example:

curl -X POST "http://127.0.0.1:8000/mcp" \
  -H "Authorization: Bearer super-secret" \
  -H "Accept: application/json, text/event-stream" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"1","method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"example","version":"1.0.0"}}}'
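The same request can be built from Python with only the standard library. This sketch mirrors the curl example header for header; `initialize_request` is an illustrative name, and the function only constructs the request so that it can be inspected before anything is sent.

```python
import json
import urllib.request


def initialize_request(url: str, token: str) -> urllib.request.Request:
    """Build the MCP initialize POST with a bearer token attached."""
    payload = {
        "jsonrpc": "2.0",
        "id": "1",
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": "example", "version": "1.0.0"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json, text/event-stream",
            "Content-Type": "application/json",
        },
    )
```

With the server running, `urllib.request.urlopen(initialize_request("http://127.0.0.1:8000/mcp", "super-secret"))` sends it. In real deployments, read the token from the environment or a secret store instead of hard-coding it.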

License

MIT
