tika-mcp

by Edgaras0x4E

tika-mcp is an MCP (Model Context Protocol) server that wraps Apache Tika and exposes document parsing tools over streamable-http.

It supports local files and optional remote URL ingestion with SSRF protections, plus extraction for plain text, metadata, HTML, MIME detection, and recursive archive content.

Features

Async direct Tika Server HTTP integration (/tika, /meta, /detect/stream, /rmeta)
MCP tools: extract_text, extract_metadata, detect_mime_type, extract_html, extract_documents
Local file controls (TIKA_ALLOW_LOCAL_FILES, TIKA_ALLOWED_LOCAL_ROOTS)
Optional remote URL ingestion with blocked private/internal targets by default
Bounded input and output sizes with clear tool-facing errors
Optional bearer-token protection for exposed MCP HTTP deployments
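
The Tika Server endpoints named in the feature list can be exercised directly, which helps when checking that the server at TIKA_URL is reachable before involving MCP. The following is a minimal synchronous sketch that only builds the requests. The endpoint paths come from the list above, while the `Accept` headers, the `kind` labels, and the `build_tika_request` helper are assumptions for illustration and are not taken from tika-mcp's source (which is described as async).

```python
import urllib.request

# Paths are from the feature list; the Accept values are assumptions based on
# common Tika Server usage, not on tika-mcp's implementation.
TIKA_ENDPOINTS = {
    "text": ("/tika", "text/plain"),
    "metadata": ("/meta", "application/json"),
    "mime": ("/detect/stream", "text/plain"),
    "recursive": ("/rmeta", "application/json"),
}


def build_tika_request(base_url: str, kind: str, data: bytes) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the document bytes."""
    path, accept = TIKA_ENDPOINTS[kind]
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


req = build_tika_request("http://localhost:9998", "text", b"placeholder document bytes")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it, which requires a running Tika Server.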


Requirements

Python 3.11+
Apache Tika Server reachable at TIKA_URL (default http://localhost:9998)

Installation

```
pip install tika-mcp
```

Or from source:

```
pip install .
```

Quick Start

Install tika-mcp:
```
pip install tika-mcp
```

Start Apache Tika Server:

```
docker run --rm -p 9998:9998 apache/tika:3.3.1.0
```

Start tika-mcp:
```
export TIKA_URL=http://localhost:9998
tika-mcp
```
It now listens at http://127.0.0.1:8000/mcp (transport: streamable-http).

Tools

Every tool takes a single argument, `source`, which is either a local file path or an http(s):// URL. Remote URLs are accepted only when TIKA_ALLOW_REMOTE_URLS is enabled. All tools are read-only.

| Tool | What it does |
| --- | --- |
| `extract_text` | Extract plain text from a document (PDF, Office, HTML, and other Tika-supported formats). |
| `extract_metadata` | Return document metadata (author, title, content type, page count, …) without the body text. |
| `detect_mime_type` | Detect the file's MIME/content type from its bytes, without full parsing. |
| `extract_html` | Extract structured XHTML with headings, tables, and links preserved. |
| `extract_documents` | Unpack a container/archive/compound file (zip, email, compound doc) and return each embedded document. |
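
Since every tool takes the same single `source` argument, a call to any of them has the same shape on the wire. The sketch below builds the JSON-RPC `tools/call` message defined by the MCP specification. The `tool_call` helper and the file path are hypothetical, and a real client would normally let an MCP SDK construct and send this after the `initialize` handshake.

```python
import json
from itertools import count

_request_ids = count(1)  # simple incrementing JSON-RPC ids


def tool_call(name: str, source: str) -> dict:
    """Return a JSON-RPC 2.0 `tools/call` message for one of the tools above."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": {"source": source}},
    }


# The path is a made-up example; it must be readable by the server process.
payload = tool_call("extract_text", "/data/MarchReport.pdf")
print(json.dumps(payload, indent=2))
```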

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `TIKA_MCP_HOST` | `127.0.0.1` | MCP bind host |
| `TIKA_MCP_PORT` | `8000` | MCP bind port |
| `TIKA_MCP_PATH` | `/mcp` | Streamable HTTP endpoint path |
| `TIKA_MCP_BEARER_TOKEN` | unset | Optional bearer token required for MCP HTTP requests |
| `TIKA_URL` | `http://localhost:9998` | Tika Server base URL |
| `TIKA_TIMEOUT_SECONDS` | `30` | Request timeout for Tika and remote downloads |
| `TIKA_MAX_FILE_SIZE_MB` | `25` | Maximum local/remote input file size |
| `TIKA_MAX_OUTPUT_SIZE_MB` | `10` | Max MCP tool response size; also caps the streamed Tika response body before it is buffered |
| `TIKA_ALLOW_LOCAL_FILES` | `true` | Enable local file sources |
| `TIKA_ALLOWED_LOCAL_ROOTS` | unset | Comma-separated allowed local path roots |
| `TIKA_ALLOW_REMOTE_URLS` | `false` | Enable remote URL sources |
| `TIKA_ALLOWED_URL_SCHEMES` | `http,https` | Allowed remote URL schemes |
| `TIKA_BLOCK_PRIVATE_IPS` | `true` | Block private/loopback/link-local/internal targets |
| `TIKA_MAX_REDIRECTS` | `5` | Max remote URL redirects |
| `TIKA_RECURSIVE_MAX_DEPTH` | unset | Max recursive archive depth |
| `TIKA_RECURSIVE_MAX_FILES` | unset | Max number of recursive extracted files |
| `TIKA_RECURSIVE_MAX_TOTAL_SIZE_MB` | unset | Max total recursive expanded text size |
| `TIKA_PDF_EXTRACT_MARKED_CONTENT` | unset | Send `X-Tika-PDFextractMarkedContent`; `true` preserves paragraph structure for tagged PDFs |
| `TIKA_PDF_EXTRACT_ANNOTATION_TEXT` | unset | Send `X-Tika-PDFextractAnnotationText`; `false` avoids duplicate hyperlink URLs |
| `TIKA_PDF_SORT_BY_POSITION` | unset | Send `X-Tika-PDFsortByPosition`; `true` orders text by visual position (untagged PDFs) |
| `TIKA_COLLAPSE_BLANK_LINES` | `false` | Collapse runs of blank lines in extracted text (like `cat -s`) |
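
To make the intent of TIKA_ALLOWED_URL_SCHEMES and TIKA_BLOCK_PRIVATE_IPS concrete, here is a rough sketch of the kind of check such settings usually imply, written with the standard `ipaddress` module. It is not tika-mcp's actual code: the function names and the exact set of blocked ranges are assumptions, and the sketch only handles IP literals in the URL.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # mirrors the TIKA_ALLOWED_URL_SCHEMES default


def is_blocked_ip(address: str) -> bool:
    """True for private, loopback, link-local and other non-public addresses."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def check_url(url: str) -> str:
    """Validate scheme and host; return the host if the URL may be fetched."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: the caller still has to resolve it and run
        # is_blocked_ip on every resolved address before connecting.
        return host
    if is_blocked_ip(host):
        raise ValueError(f"blocked target: {host}")
    return host


print(check_url("https://8.8.8.8/report.pdf"))
```

A complete guard would also repeat the check on each redirect hop, which is presumably why a redirect limit such as TIKA_MAX_REDIRECTS sits alongside these settings.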

Example

Set any of the above in your shell, then run tika-mcp. It starts a streamable-HTTP server on TIKA_MCP_HOST:TIKA_MCP_PORT at TIKA_MCP_PATH (default http://127.0.0.1:8000/mcp):

```
export TIKA_URL=http://localhost:9998
export TIKA_MCP_PORT=8000
export TIKA_PDF_EXTRACT_MARKED_CONTENT=true
export TIKA_PDF_EXTRACT_ANNOTATION_TEXT=false
export TIKA_COLLAPSE_BLANK_LINES=true
tika-mcp
```
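
For readers scripting around the server, the way the host, port, and path variables combine into the endpoint URL can be sketched as follows. The `Settings` class, the `load_settings` helper, and the accepted spellings of boolean values are assumptions for illustration. Only the variable names and defaults come from the configuration table.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings, not confirmed by the README


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    path: str
    tika_url: str
    allow_remote_urls: bool

    @property
    def endpoint(self) -> str:
        """URL an MCP client should connect to."""
        return f"http://{self.host}:{self.port}{self.path}"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read a subset of the documented variables, applying documented defaults."""
    return Settings(
        host=env.get("TIKA_MCP_HOST", "127.0.0.1"),
        port=int(env.get("TIKA_MCP_PORT", "8000")),
        path=env.get("TIKA_MCP_PATH", "/mcp"),
        tika_url=env.get("TIKA_URL", "http://localhost:9998"),
        allow_remote_urls=env.get("TIKA_ALLOW_REMOTE_URLS", "false").strip().lower() in _TRUTHY,
    )


print(load_settings({}).endpoint)
```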

PDF structure preservation

By default Tika extracts PDF text line-by-line, which breaks paragraphs mid-sentence. For tagged PDFs (Google Docs / Word "Save as PDF"), rebuild real paragraphs with:

```
export TIKA_PDF_EXTRACT_MARKED_CONTENT=true    # use the PDF's structure tree for paragraphs
export TIKA_PDF_EXTRACT_ANNOTATION_TEXT=false  # drop duplicate hyperlink URLs
export TIKA_COLLAPSE_BLANK_LINES=true          # squeeze blank lines (like `cat -s`)
tika-mcp
```

For untagged PDFs (no structure tree), use TIKA_PDF_SORT_BY_POSITION=true instead, which orders text by visual position.
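
The two mechanisms in this section, forwarding PDF options as `X-Tika-PDF...` request headers and squeezing blank lines afterwards, can be sketched as below. The header names are the ones listed in the configuration table. The helper names and the treatment of whitespace-only lines as blank are assumptions, and the real server may differ in detail.

```python
from typing import Mapping

# Environment variable -> Tika request header, as listed in the configuration table.
PDF_HEADER_VARS = {
    "TIKA_PDF_EXTRACT_MARKED_CONTENT": "X-Tika-PDFextractMarkedContent",
    "TIKA_PDF_EXTRACT_ANNOTATION_TEXT": "X-Tika-PDFextractAnnotationText",
    "TIKA_PDF_SORT_BY_POSITION": "X-Tika-PDFsortByPosition",
}


def pdf_headers(env: Mapping[str, str]) -> dict:
    """Emit a header only for variables that are set; unset ones send nothing."""
    return {header: env[var] for var, header in PDF_HEADER_VARS.items() if var in env}


def collapse_blank_lines(text: str) -> str:
    """Reduce each run of blank lines to a single blank line, similar to `cat -s`."""
    kept = []
    previous_blank = False
    for line in text.splitlines():
        blank = not line.strip()
        if blank and previous_blank:
            continue  # drop the second and later blank lines in a run
        kept.append(line)
        previous_blank = blank
    return "\n".join(kept)


print(pdf_headers({"TIKA_PDF_EXTRACT_MARKED_CONTENT": "true"}))
print(collapse_blank_lines("First paragraph.\n\n\n\nSecond paragraph."))
```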

Streamable HTTP Client Configuration

Endpoint format:

```
URL: http://<TIKA_MCP_HOST>:<TIKA_MCP_PORT><TIKA_MCP_PATH>
transport: streamable-http
```

Example MCP client config (generic):

```
{
  "mcpServers": {
    "tika": {
      "transport": {
        "type": "streamable-http",
        "url": "http://127.0.0.1:8000/mcp"
      }
    }
  }
}
```

Bearer Token Protection

Set TIKA_MCP_BEARER_TOKEN to require authenticated MCP requests:

```
TIKA_MCP_BEARER_TOKEN=super-secret tika-mcp
```

Request example:

```
curl -X POST "http://127.0.0.1:8000/mcp" \
  -H "Authorization: Bearer super-secret" \
  -H "Accept: application/json, text/event-stream" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"1","method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"example","version":"1.0.0"}}}'
```
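
The same authenticated `initialize` request can be assembled from Python with only the standard library. This sketch mirrors the curl example's headers and body. The `build_initialize_request` helper is hypothetical, and the block stops short of sending the request so that it runs without a server.

```python
import json
import urllib.request


def build_initialize_request(url: str, token: str) -> urllib.request.Request:
    """Build the MCP `initialize` POST with a bearer token, without sending it."""
    body = {
        "jsonrpc": "2.0",
        "id": "1",
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": "example", "version": "1.0.0"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json, text/event-stream",
            "Content-Type": "application/json",
        },
    )


req = build_initialize_request("http://127.0.0.1:8000/mcp", "super-secret")
print(req.get_method(), req.full_url)
```

With the server running, `urllib.request.urlopen(req)` would send it. How the server responds to a missing or wrong token is not documented here.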

License

MIT
