tika-mcp
Provides tools for document parsing and content extraction using Apache Tika, supporting plain text, metadata, HTML, MIME detection, and recursive archive content.
tika-mcp is an MCP (Model Context Protocol) server that wraps Apache Tika and
exposes document parsing tools over streamable-http.
It supports local files and optional remote URL ingestion with SSRF protections, plus extraction for plain text, metadata, HTML, MIME detection, and recursive archive content.
Features
Async direct Tika Server HTTP integration (/tika, /meta, /detect/stream, /rmeta)
MCP tools: extract_text, extract_metadata, detect_mime_type, extract_html, extract_documents
Local file controls (TIKA_ALLOW_LOCAL_FILES, TIKA_ALLOWED_LOCAL_ROOTS)
Optional remote URL ingestion with blocked private/internal targets by default
Bounded input and output sizes with clear tool-facing errors
Optional bearer-token protection for exposed MCP HTTP deployments
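The local file controls listed above restrict which paths a tool may read. The following is a minimal Python sketch of that kind of check, assuming TIKA_ALLOWED_LOCAL_ROOTS holds a comma-separated list of directories. The function names `parse_roots` and `is_under_allowed_roots` are hypothetical and the code is not taken from the tika-mcp source.

```python
from pathlib import Path


def parse_roots(raw: str) -> list[str]:
    """Split a comma-separated root list, dropping empty entries and whitespace."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def is_under_allowed_roots(source: str, allowed_roots: list[str]) -> bool:
    """Return True if `source` resolves to a location inside one of the roots."""
    # resolve() normalizes ".." segments and follows symlinks, so the comparison
    # uses the location the path really points to, not its spelling.
    candidate = Path(source).resolve()
    for root in allowed_roots:
        if candidate.is_relative_to(Path(root).resolve()):
            return True
    return False
```

Resolving before comparing is the important design choice here: a plain string-prefix test would accept a path such as `/data/../etc/passwd` for the root `/data`.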
Requirements
Python 3.11+
Apache Tika Server reachable at TIKA_URL (default http://localhost:9998)
Installation
pip install tika-mcp
Or from source:
pip install .
Quick Start
Install tika-mcp:
pip install tika-mcp
Start Apache Tika Server:
docker run --rm -p 9998:9998 apache/tika:3.3.1.0
Start tika-mcp:
export TIKA_URL=http://localhost:9998
tika-mcp
It now listens at http://127.0.0.1:8000/mcp (transport: streamable-http).
Tools
Every tool takes a single argument, source - a local file path, or an http(s)://
URL (remote URLs must be enabled). All tools are read-only.
Tool | What it does |
extract_text | Extract plain text from a document (PDF, Office, HTML, and other Tika-supported formats). |
extract_metadata | Return document metadata (author, title, content type, page count, …) without the body text. |
detect_mime_type | Detect the file's MIME/content type from its bytes, without full parsing. |
extract_html | Extract structured XHTML with headings, tables, and links preserved. |
extract_documents | Unpack a container/archive/compound file (zip, email, compound doc) and return each embedded document. |
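Because every tool takes the same single `source` argument, the request body for a tool call has one shape. The sketch below builds the JSON-RPC `tools/call` message defined by the MCP specification; the helper name `build_tool_call` is hypothetical. It only constructs the payload: a real client would first complete the `initialize` handshake, and in practice an MCP client library does all of this for you.

```python
import json

# Tool names as listed in this README.
TOOL_NAMES = {
    "extract_text",
    "extract_metadata",
    "detect_mime_type",
    "extract_html",
    "extract_documents",
}


def build_tool_call(tool: str, source: str, request_id: int = 1) -> str:
    """Return the JSON-RPC body for calling `tool` on `source`."""
    if tool not in TOOL_NAMES:
        raise ValueError(f"unknown tool: {tool}")
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": tool,
            # `source` is a local file path or an http(s):// URL.
            "arguments": {"source": source},
        },
    }
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_tool_call("extract_text", "/data/MarchReport.pdf"))
```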
Configuration
Variable | Default | Description |
TIKA_MCP_HOST | 127.0.0.1 | MCP bind host |
TIKA_MCP_PORT | 8000 | MCP bind port |
TIKA_MCP_PATH | /mcp | Streamable HTTP endpoint path |
TIKA_MCP_BEARER_TOKEN | unset | Optional bearer token required for MCP HTTP requests |
TIKA_URL | http://localhost:9998 | Tika Server base URL |
 | | Request timeout for Tika and remote downloads |
 | | Maximum local/remote input file size |
 | | Max MCP tool response size; also caps the streamed Tika response body before it is buffered |
TIKA_ALLOW_LOCAL_FILES | | Enable local file sources |
TIKA_ALLOWED_LOCAL_ROOTS | unset | Comma-separated allowed local path roots |
 | | Enable remote URL sources |
 | | Allowed remote URL schemes |
 | | Block private/loopback/link-local/internal targets |
 | | Max remote URL redirects |
 | unset | Max recursive archive depth |
 | unset | Max number of recursive extracted files |
 | unset | Max total recursive expanded text size |
 | unset | Send |
 | unset | Send |
 | unset | Send |
TIKA_COLLAPSE_BLANK_LINES | | Collapse runs of blank lines in extracted text (like cat -s) |
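The "Block private/loopback/link-local/internal targets" setting describes a common SSRF defense. The following is a rough Python sketch of how such a check can work, using only the standard library. It is an illustration under assumptions, not the tika-mcp implementation: the function names and the default scheme tuple are hypothetical.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_internal_address(ip_text: str) -> bool:
    """Return True for addresses that should not be reachable from a fetcher."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def check_remote_url(url: str, allowed_schemes=("http", "https")) -> None:
    """Raise ValueError if `url` uses a disallowed scheme or an internal target."""
    parts = urlsplit(url)
    if parts.scheme not in allowed_schemes:
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # Check every address the host resolves to, not only the first one.
    for info in socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM):
        address = info[4][0]
        if is_internal_address(address):
            raise ValueError(f"blocked internal target: {address}")
```

A check like this has to be repeated for every redirect hop, and a production fetcher would also connect to the address it validated rather than resolving the name a second time.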
Example
Set any of the above in your shell, then run tika-mcp. It starts a streamable-HTTP
server on TIKA_MCP_HOST:TIKA_MCP_PORT at TIKA_MCP_PATH (default
http://127.0.0.1:8000/mcp):
export TIKA_URL=http://localhost:9998
export TIKA_MCP_PORT=8000
export TIKA_PDF_EXTRACT_MARKED_CONTENT=true
export TIKA_PDF_EXTRACT_ANNOTATION_TEXT=false
export TIKA_COLLAPSE_BLANK_LINES=true
tika-mcp
PDF structure preservation
By default Tika extracts PDF text line-by-line, which breaks paragraphs mid-sentence. For tagged PDFs (Google Docs / Word "Save as PDF"), rebuild real paragraphs with:
export TIKA_PDF_EXTRACT_MARKED_CONTENT=true # use the PDF's structure tree for paragraphs
export TIKA_PDF_EXTRACT_ANNOTATION_TEXT=false # drop duplicate hyperlink URLs
export TIKA_COLLAPSE_BLANK_LINES=true # squeeze blank lines (like `cat -s`)
tika-mcp
For untagged PDFs (no structure tree), use TIKA_PDF_SORT_BY_POSITION=true instead, which
orders text by visual position.
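To make the effect of TIKA_COLLAPSE_BLANK_LINES concrete, here is a small Python sketch of squeezing runs of blank lines into one. The function name is hypothetical and the server's own logic may differ. One assumption to note: this version treats whitespace-only lines as blank, whereas `cat -s` only squeezes lines that are completely empty.

```python
def collapse_blank_lines(text: str) -> str:
    """Replace each run of blank lines with a single empty line."""
    kept = []
    previous_blank = False
    for line in text.splitlines():
        blank = line.strip() == ""
        if blank and previous_blank:
            # Second or later blank line in a run: drop it.
            continue
        kept.append("" if blank else line)
        previous_blank = blank
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "First paragraph.\n\n\n\nSecond paragraph."
    print(collapse_blank_lines(sample))
```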
Streamable HTTP Client Configuration
Endpoint format:
URL: http://<TIKA_MCP_HOST>:<TIKA_MCP_PORT><TIKA_MCP_PATH>
transport: streamable-http
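A client script can derive the same URL from the environment. This short Python sketch assumes the defaults stated in the Example section (127.0.0.1, 8000, /mcp); the helper name `mcp_endpoint` is hypothetical.

```python
import os


def mcp_endpoint(env=None) -> str:
    """Build the MCP endpoint URL from TIKA_MCP_HOST, TIKA_MCP_PORT and TIKA_MCP_PATH."""
    env = os.environ if env is None else env
    host = env.get("TIKA_MCP_HOST", "127.0.0.1")
    port = env.get("TIKA_MCP_PORT", "8000")
    path = env.get("TIKA_MCP_PATH", "/mcp")
    # Tolerate a path configured without its leading slash.
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{host}:{port}{path}"


if __name__ == "__main__":
    print(mcp_endpoint())
```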
Example MCP client config (generic):
{
  "mcpServers": {
    "tika": {
      "transport": {
        "type": "streamable-http",
        "url": "http://127.0.0.1:8000/mcp"
      }
    }
  }
}
Bearer Token Protection
Set TIKA_MCP_BEARER_TOKEN to require authenticated MCP requests:
TIKA_MCP_BEARER_TOKEN=super-secret tika-mcp
Request example:
curl -X POST "http://127.0.0.1:8000/mcp" \
  -H "Authorization: Bearer super-secret" \
  -H "Accept: application/json, text/event-stream" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"1","method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"example","version":"1.0.0"}}}'
License
MIT