306,570 tools. Last updated 2026-07-25 14:25

"A tool for extracting text from PDF files" matching MCP tools:

parse_pdf_to_text
Nordic Financial MCP
Download a PDF from a URL and extract all text content, page by page. Use this to read the full text of a specific document — for example, an annual report PDF linked from a search_filings result. Best combined with search_filings: use search_filings to locate the document, then parse_pdf_to_text for the full text. Do not use for PDFs that are already well-represented in the database — search_filings is faster and returns pre-ranked, relevant excerpts. Not suitable for scanned (image-only) PDFs without embedded text; those pages will be returned as "(no extractable text)". Args: pdf_url: Direct HTTPS URL to the PDF file, e.g. https://example.com/report.pdf. Must be publicly accessible; authentication-protected URLs will fail. Returns: All text from the PDF with "--- Page N ---" separators between pages. Returns an error string if the download fails, the URL does not point to a valid PDF, or the document exceeds the 60-second download timeout.
Connector
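The parse_pdf_to_text description says the returned text carries "--- Page N ---" separators and that image-only pages come back as "(no extractable text)". A minimal sketch of splitting such output into pages follows. It assumes each separator sits on its own line, which the listing does not confirm; `split_pages` and `pages_without_text` are hypothetical helper names, and the sample string is made up for illustration.

```python
import re

# Assumption: each separator occupies a full line, e.g. "--- Page 3 ---".
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---[ \t]*$", re.MULTILINE)


def split_pages(output: str) -> dict[int, str]:
    """Map page number -> page text for output using '--- Page N ---' separators."""
    pages: dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(output))
    for i, match in enumerate(matches):
        # A page's text runs from the end of its marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(output)
        pages[int(match.group(1))] = output[match.end():end].strip()
    return pages


def pages_without_text(pages: dict[int, str]) -> list[int]:
    """Page numbers reported as image-only (no embedded text)."""
    return sorted(n for n, text in pages.items() if text == "(no extractable text)")


# Made-up sample output, for illustration only.
sample = "--- Page 1 ---\nAnnual report 2024\n--- Page 2 ---\n(no extractable text)\n"
pages = split_pages(sample)
```

Checking `pages_without_text(pages)` before further processing is one way to notice that a document needs OCR from a different tool.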
summarize_document
DocImprint
Summarize document text into a prose summary and key points with citations. Use after extract_text or extract_url when you need a condensed understanding of a long document. For single-sentence Q&A, use qa_url instead. For extracting specific fields, use extract_structured. Typical workflow: extract_text/extract_url → summarize_document. Returns: { summary: string, key_points: string[], summary_cited: { value, confidence, citations[] }, key_points_cited: [{ text, citations[] }], truncated: boolean, strategy: "full"|"truncated"|"chunked" } Example prompts: - "Summarize this financial report and give me the key points." - "What are the main takeaways from this document?" - "Give me a concise summary of this 50-page report."
Connector
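The return shape listed for summarize_document can be written down as types, which makes it easier to check a response before relying on it. This is a sketch only: the listing does not give the types of `value`, `confidence`, or the entries of `citations[]`, so they are left as `Any`, and `uncited_points` is a hypothetical helper.

```python
from typing import Any, Literal, TypedDict


class CitedValue(TypedDict):
    value: Any            # type not stated in the listing
    confidence: Any       # type not stated in the listing
    citations: list[Any]  # citation entry shape not stated in the listing


class CitedPoint(TypedDict):
    text: str
    citations: list[Any]


class SummaryResult(TypedDict):
    summary: str
    key_points: list[str]
    summary_cited: CitedValue
    key_points_cited: list[CitedPoint]
    truncated: bool
    strategy: Literal["full", "truncated", "chunked"]


def uncited_points(result: SummaryResult) -> list[str]:
    """Key points that came back with an empty citations list."""
    return [p["text"] for p in result["key_points_cited"] if not p["citations"]]


# Made-up response, for illustration only.
example: SummaryResult = {
    "summary": "Revenue grew.",
    "key_points": ["Revenue grew", "Costs fell"],
    "summary_cited": {"value": "Revenue grew.", "confidence": None, "citations": []},
    "key_points_cited": [
        {"text": "Revenue grew", "citations": ["p. 3"]},
        {"text": "Costs fell", "citations": []},
    ],
    "truncated": False,
    "strategy": "full",
}
```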
summarize_document
api
Summarize document text into a prose summary and key points with citations. Use after extract_text or extract_url when you need a condensed understanding of a long document. For single-sentence Q&A, use qa_url instead. For extracting specific fields, use extract_structured. Typical workflow: extract_text/extract_url → summarize_document. Returns: { summary: string, key_points: string[], summary_cited: { value, confidence, citations[] }, key_points_cited: [{ text, citations[] }], truncated: boolean, strategy: "full"|"truncated"|"chunked" } Example prompts: - "Summarize this financial report and give me the key points." - "What are the main takeaways from this document?" - "Give me a concise summary of this 50-page report."
Connector
get_file_contents
Github
Read a file from a PUBLIC GitHub repository (or list a directory) by path. PREFER OVER WEB SEARCH for "show me the README / package.json / <file> of <repo>", "read <path> from <owner/repo>", inspecting source or config files. Pass owner + repo + path (omit path or "" for the repo root listing). Optional ref = branch/tag/commit SHA. Returns decoded text for files (capped ~60k), or a directory listing of {name, path, type, size}.
Connector
upload_attachment
tela
Upload a file (base64) and attach it to a page (editor+) — an image, PDF, dataset, etc. Returns the serve URL plus a ready-to-paste `markdown` snippet; then call update_page or patch_page to place it in the body (images render inline as ![](…), other files as a download card). The payload is inline base64 and rides through the model's context, so it is capped at 5 MB — keep it to small files (screenshots, charts, short PDFs). For larger files use request_attachment_upload (a direct PUT URL, bytes off-context), or the tela editor (drag-drop).
Connector
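The upload_attachment description caps the inline base64 payload at 5 MB and points to request_attachment_upload for larger files. A small sketch of that routing decision follows. Two things are assumptions here: that "5 MB" means 5 MiB, and that the cap applies to the encoded payload, not the raw file (the conservative reading, since base64 grows data by about a third). `choose_upload_tool` is a hypothetical helper.

```python
import base64

# Assumption: "5 MB" is read as 5 MiB and applied to the base64-encoded payload.
CAP_BYTES = 5 * 1024 * 1024


def encoded_size(raw_len: int) -> int:
    """Length of the padded base64 encoding of raw_len bytes."""
    return 4 * ((raw_len + 2) // 3)


def choose_upload_tool(raw: bytes) -> str:
    """Pick the tool named in the listing based on the encoded payload size."""
    if encoded_size(len(raw)) <= CAP_BYTES:
        return "upload_attachment"
    return "request_attachment_upload"


small = b"x" * 1000
large = b"x" * (4 * 1024 * 1024)  # 4 MiB raw, more than 5 MiB once encoded
```

Under these assumptions a 4 MiB file already exceeds the cap after encoding, so the practical raw-size limit is closer to 3.75 MiB.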
ris_get_document
ris-austria-mcp-server
Fetch one RIS document’s full text or its rendition URLs, with explicit binding status and the amtssigniert authentic PDF surfaced wherever it exists. Address the document exactly one of two ways: document_number plus application (both copied verbatim from a ris_search_* or ris_lookup_citation result), or a document_url from a result’s content_urls. format: markdown (default — the HTML rendition converted to markdown), html (raw HTML rendition), xml (the RIS Nutzdaten XML), or urls_only (no fetch — every rendition URL, including the Authentisch PDF). Format availability varies by application and the tool degrades explicitly, never silently: consolidated law, gazettes, case law, drafts, and most sectoral collections carry full text; district and municipal promulgations and court rules (Bvb, GrA, KmGer) publish only the signed authentic PDF; party-transparency decisions and council minutes (Upts, Mrp) are PDF-only; the 1848–1940 imperial gazettes (BgblAlt) are metadata-only — for these a text-format request returns a format_unavailable notice with the usable URL, not an error. Every result carries binding_status; only authentic (amtssigniert) publications are legally binding. This tool returns content, not fresh metadata — the metadata rides the search/lookup step that produced the document number. When the markdown text overflows the byte budget the tool returns a §/Artikel/Anlage section outline (kind: outline) instead of truncating; re-call with sections:[…] naming outline entries to retrieve just those. Raw html/xml renditions, which carry no such headings, return in full.
Connector

Matching MCP Servers

PDF to Text MCP Server
File Systems Research & Data
xxx87
License: A · Quality: - · Maintenance: D
Converts PDF files to text for use with MCP-compatible applications like Cursor IDE.
Last updated 2025-09-19
MIT
files-mcp-ts
File Systems Developer Tools
LiviuBirjega
License: F · Quality: B · Maintenance: D
A lightweight MCP server for basic file operations, enabling reading, writing, and listing files securely via the Model Context Protocol.
Last updated 2025-09-25
3
1

Matching MCP Connectors

mifactory-pdf
Send transactional PDFs for AI agents via SMTP. Templates included.
pdf-generator
Generate PDFs from Markdown or HTML. Zero-auth, agent-native. Returns base64-encoded PDF.

check_text
Writing Style Checker
Analyze text for writing style issues: weasel words, passive voice, duplicate words, long sentences, nominalizations, hedging, filler adverbs, and research-cited AI tells. Read-only and stateless — text is analyzed in memory on the hosted server and never stored. Returns a plain-text report with each issue's line and column, the matched text, surrounding context, and the reason for AI tells; texts over 100,000 characters return an error message. This hosted server has no filesystem access — the wsc-mcp npm package adds a check_file tool for local files. It only reports issues — to auto-remove duplicate words, follow up with fix_duplicates.
Connector
files_read
DialogBrain
Read **text content** of an attached file. Works for: .txt, .md, .json, code files, and PDFs (after files.ingest extracts text). DO NOT call on binary files: for IMAGES use `files.get_base64`; AUDIO/VIDEO cannot be transcribed via this tool; for non-PDF DOCUMENTS run `files.ingest` first, THEN files.read. Calling on a binary mime-type returns an error, so reading this routing hint before deciding saves you a turn.
Connector
files_read
dialogbrain
Read **text content** of an attached file. Works for: .txt, .md, .json, code files, and PDFs (after files.ingest extracts text). DO NOT call on binary files: for IMAGES use `files.get_base64`; AUDIO/VIDEO cannot be transcribed via this tool; for non-PDF DOCUMENTS run `files.ingest` first, THEN files.read. Calling on a binary mime-type returns an error, so reading this routing hint before deciding saves you a turn.
Connector
get_filing_extract_meta
ca-rate-filings
Lists **what's in** each extracted artefact for a filing (section counts, item names, and the page each item came from) without returning any of the bulky factor tables, descriptions, or rate rows themselves. **Call this FIRST**, before `get_filing_extracts`, for any "what does this filing contain" question. It costs a fraction of the tokens and tells you which file + which section you need to pull in detail. `get_filing_extracts` is then the targeted second call once you know the SERFF + file + section that actually answer the user's question.
Use this when the user asks:
- "What forms does this filing include?" / "List the form numbers in TSIS-134726605."
- "How many exclusions does it carry? What are they called?"
- "What rate tables are in this filing, and which PDF page are they on?"
- "List the discounts / endorsements / coverages this filing offers."
- "Where in the source PDF is the territory rate table?"
- Any "how many", "what are the names of", or "which page is X on" question about a filing's extracted artefacts.
Wrong surface for:
- Anything that needs the actual numeric content (factor values, full rate rows, full exclusion text). Call `get_filing_extracts` instead, narrowing `files` to just the one(s) you discovered here.
Whitelist (same as `get_filing_extracts`):
- `calculations.json`: example rate-calculation walk-throughs.
- `coverages.json`: coverage definitions (perils, limits, applicability).
- `deductibles.json`: deductible options + factors.
- `discounts.json`: discount / surcharge schedules.
- `endorsements.json`: optional endorsements / riders.
- `examples.json`: worked policyholder rating examples.
- `exclusions.json`: coverage exclusions + the conditions they apply to.
- `extraction_summary.json`: structured filing-overview fields.
- `final_rating_calculation.json`: canonical rating expression.
- `forms.json`: policy form numbers + types.
- `rates_data.json`: base rates + rate-table headers.
- `underwriting_guidelines.json`: eligibility / UW rules.
Per item the tool returns `{ name, source_page? }`. The item name is picked from whichever identifying field exists (`name` → `form_number` → `id` → `key` → `code` → `coverage` → `label` → `title`). `source_page` is the page in the source PDF where the item was extracted from, when the pipeline recorded one. `rates_data.json` items additionally carry `source_file` (the source PDF the rate table lives in) when the filing has a single source PDF. Multi-source filings get `source_file_note` flagging the limit (per-item `source_file` on non-rate extracts needs a pipeline-side change, deferred).
Args: `serff` (required), `files` (optional; pass a subset of the whitelist to narrow, omit for all 12).
Returns: `{ serff, files: { "<name>": { file_name, filing_ref?, confidence?, sections: { "<key>": { count, items: [...] } }, total_items } }, count, skipped }`.
Connector
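The name fallback described for get_filing_extract_meta (`name` → `form_number` → `id` → `key` → `code` → `coverage` → `label` → `title`) amounts to a first-match lookup over an ordered field list. The sketch below shows that idea; it is not the server's code. Treating a field as "existing" only when it is present and not `None` is an assumption, `pick_item_name` is a hypothetical helper, and the sample items are made up.

```python
from typing import Any, Optional

# Priority order as given in the tool description.
NAME_FIELDS = ("name", "form_number", "id", "key", "code", "coverage", "label", "title")


def pick_item_name(item: dict[str, Any]) -> Optional[Any]:
    """Return the value of the first identifying field that is present.

    Assumption: a field counts as existing when its value is not None.
    """
    for field in NAME_FIELDS:
        if item.get(field) is not None:
            return item[field]
    return None


# Made-up items, for illustration only.
form_item = {"form_number": "HO-3", "title": "Homeowners form"}
named_item = {"label": "fallback label", "name": "Wind exclusion"}
```

Because the order is fixed, an item carrying both `label` and `name` resolves to `name` regardless of key order in the dict.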
get_site_files
Just Publish
Read the files of a site you already published, so you can make a targeted edit instead of rebuilding the whole site from memory. Returns a complete manifest (every file's path, size, content-type, sha256) plus the contents of the text files (HTML/CSS/JS/etc). Also returns the site's current `version` — pass it back to update_site_file so you don't overwrite a newer change. Pass `paths` to fetch only specific files; omit it to get all text files. Requires site_id + edit_token.
Connector
firecrawl_parse
firecrawl-mcp
Parse a file using Firecrawl's /v2/parse endpoint. In local/non-cloud MCP mode, this tool reads filePath from the MCP server filesystem and posts multipart data to the configured self-hosted FIRECRAWL_API_URL, preserving the existing direct-read behavior. In hosted CLOUD_SERVICE mode, this tool is a two-call flow because hosted MCP cannot read your local filesystem:
1. Call with filePath, contentType, parse options, and optional declaredSizeBytes. The hosted server mints a short-lived upload URL and returns a safe local curl PUT command plus nextToolCall.
2. Run the returned curl command locally, then call firecrawl_parse again with uploadRef and the desired parse options. The hosted server calls /v2/parse server-side with your session credential.
**Best for:** Extracting content from a local document (PDF, Word, Excel, HTML, etc.); pulling structured data out of a file with JSON format; converting binary documents into markdown for downstream reasoning.
**Not recommended for:** Remote URLs (use firecrawl_scrape); multiple files at once (call parse multiple times); documents that require interactive actions, screenshots, or change tracking (those aren't supported by the parse endpoint).
**Common mistakes:** In hosted mode, do not pass both filePath and uploadRef. Phase 1 uses filePath only to generate upload instructions; phase 2 uses uploadRef only to parse server-side.
**Supported file types:** .html, .htm, .xhtml, .pdf, .docx, .doc, .odt, .rtf, .xlsx, .xls
**Unsupported options:** actions, screenshot/branding/changeTracking formats, waitFor > 0, location, mobile, proxy values other than "auto" or "basic".
**Privacy:** Set `redactPII: true` to return content with personally identifiable information redacted.
**CRITICAL - Format Selection (same rules as firecrawl_scrape):** When the user asks for SPECIFIC data points from a document, you MUST use JSON format with a schema. Only use markdown when the user needs the ENTIRE document content.
**Handling PDFs:** Add `"parsers": ["pdf"]` (optionally with `pdfOptions.maxPages`) when parsing a PDF so the PDF engine is invoked explicitly. For very long documents, cap `maxPages` to keep the response within token limits.
**Hosted phase 1 example:**
```json
{
  "name": "firecrawl_parse",
  "arguments": {
    "filePath": "/absolute/path/to/document.pdf",
    "contentType": "application/pdf",
    "formats": ["markdown"],
    "parsers": ["pdf"],
    "zeroDataRetention": true
  }
}
```
**Hosted phase 2 example:**
```json
{
  "name": "firecrawl_parse",
  "arguments": {
    "uploadRef": "upload-ref-from-phase-1",
    "formats": ["markdown"],
    "parsers": ["pdf"],
    "zeroDataRetention": true
  }
}
```
**Returns:** Phase 1 hosted upload instructions or a parsed document with markdown, html, links, summary, json, or query results depending on the requested formats.
Connector
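The hosted two-call flow for firecrawl_parse hinges on one rule from the description: phase 1 passes filePath only, phase 2 passes uploadRef only, never both. The sketch below builds the call payload for a PDF and enforces that rule on the client side. `build_parse_call` is a hypothetical helper, and placing `pdfOptions` at the top level of `arguments` is an assumption; the listing mentions `pdfOptions.maxPages` without showing where it sits.

```python
from typing import Any, Optional


def build_parse_call(
    *,
    file_path: Optional[str] = None,
    upload_ref: Optional[str] = None,
    formats: tuple[str, ...] = ("markdown",),
    max_pages: Optional[int] = None,
) -> dict[str, Any]:
    """Build a firecrawl_parse tool call for a PDF in hosted mode.

    Phase 1: pass file_path only. Phase 2: pass upload_ref only.
    """
    if (file_path is None) == (upload_ref is None):
        raise ValueError("pass exactly one of file_path or upload_ref")

    arguments: dict[str, Any] = {"formats": list(formats), "parsers": ["pdf"]}
    if file_path is not None:
        arguments["filePath"] = file_path
        arguments["contentType"] = "application/pdf"
    else:
        arguments["uploadRef"] = upload_ref
    if max_pages is not None:
        # Assumption: pdfOptions is a top-level key inside arguments.
        arguments["pdfOptions"] = {"maxPages": max_pages}
    return {"name": "firecrawl_parse", "arguments": arguments}


phase1 = build_parse_call(file_path="/absolute/path/to/document.pdf")
phase2 = build_parse_call(upload_ref="upload-ref-from-phase-1", max_pages=20)
```

Rejecting the mixed case locally avoids spending a tool call on a request the description already flags as a common mistake.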
microsoft_docs_fetch
xpay✦ DevTools Collection
Fetch and convert a Microsoft Learn documentation webpage to markdown format. This tool retrieves the latest complete content of Microsoft documentation webpages including Azure, .NET, Microsoft 365, and other Microsoft technologies.
## When to Use This Tool
- When search results provide incomplete information or truncated content
- When you need complete step-by-step procedures or tutorials
- When you need troubleshooting sections, prerequisites, or detailed explanations
- When search results reference a specific page that seems highly relevant
- For comprehensive guides that require full context
## Usage Pattern
Use this tool AFTER microsoft_docs_search when you identify specific high-value pages that need complete content. The search tool gives you an overview; this tool gives you the complete picture.
## URL Requirements
- The URL must be a valid HTML documentation webpage from the microsoft.com domain
- Binary files (PDF, DOCX, images, etc.) are not supported
## Output Format
markdown with headings, code blocks, tables, and links preserved.
Connector
sieve_dataroom_add
Sieve
Add a document to a deal's data room. Creates the deal if needed. This is the primary way to get documents into Sieve for screening. Upload a pitch deck, financials, or any document -- then call sieve_screen to analyze everything in the data room. Provide company_name to create a new deal (or find existing), or deal_id to add to an existing deal. Provide exactly one content source: file_path (local file), text (raw text/markdown), or url (fetch from URL). Args: title: Document title (e.g. "Pitch Deck Q1 2026"). company_name: Company name -- creates deal if new, finds existing if not. deal_id: Add to an existing deal (from sieve_deals or previous sieve_dataroom_add). website_url: Company website URL (used when creating a new deal). document_type: Type: 'pitch_deck', 'financials', 'legal', or 'other'. file_path: Path to a local file (PDF, DOCX, XLSX). The tool reads and uploads it. text: Raw text or markdown content (alternative to file). url: URL to fetch document from (alternative to file).
Connector
atlas_download_report
CareerProof MCP
Download a completed report as PDF. Returns base64-encoded PDF content. Confirm report status='completed' via atlas_get_report(report_id) first. report_id from atlas_start_report response or atlas_list_reports. Free.
Connector
list_docs
Pine Script
USE WHEN discovering what Pine Script v6 documentation is available. Returns a categorised list of doc file paths with one-line descriptions. AFTER calling this tool, call get_doc(path) for small files or list_sections(path) then get_section(path, header) for large files (ta.md, strategy.md, collections.md, drawing.md, general.md). Data sourced from bundled Pine Script v6 documentation.
Connector
create_powersource_docs
Heista
Build a complete creative intelligence profile from internal brand documents — creative briefs, brand guidelines, product specs, customer research, competitive analysis. Takes any mix of file_ids (from a previous upload), document_urls (public PDF/DOCX/TXT/MD links, up to 10), or documents_inline (base64-encoded files with filename), plus an optional context_url for layering live brand context (colors, fonts, current messaging) and optional idempotency_key. Returns a job_id; poll with get_powersource. Output shape is identical to create_powersource_url: identity, offer, selling points, voice, buyer profile, tensions, angles, emotional arcs, ctas, narrative. Use this when the user says "I have a brief", "here's my brand guidelines", "use this document", drops a PDF / DOCX / strategy deck, or when the truth lives in internal materials rather than the public website. The pipeline reads text only — convert PDFs to markdown before submitting via documents_inline when possible. Costs 100 credits. Do NOT use for URL-only scans — use create_powersource_url. For URL + docs combined (highest fidelity, triangulates public messaging against internal strategy), use create_powersource_full.
Connector
render_html_to_pdf
pdfzen
Prepare a paid PDF render from arbitrary Handlebars-flavoured HTML. Use only when no starter fits (one-off layouts, custom branding). Prefer render_template_to_pdf when a starter matches. Validates your HTML and returns the exact, ready-to-execute HTTP request to run against pdfzen's render endpoint — POST /v402/render/pdf (x402, $0.006 USDC on Base, no API key) or POST /v1/render/pdf (pdfzen API key). pdfzen renders are executed over HTTP, not streamed in-band over MCP; this tool is the bridge.
Connector
report_pdf_generate
AurelianFlo
Return an inline PDF artifact from supplied report_meta, tables, metrics, and summary content; this read-only renderer does not persist hosted files. Use this only when a structured report payload already exists; use report_docx_generate for editable Word output or compliance_edd_report to build the memo first.
Connector
extract_tables
DocImprint
Extract tables and forms as Markdown from a PDF or image (base64-encoded). Use when the document contains structured tabular data such as financial statements, data sheets, or forms. For plain prose documents, use extract_text instead. Returns: { pages: number, text: string } — text contains Markdown-formatted tables. Example prompts: - "Extract the tables from this financial statement." - "Pull the data table from this PDF into Markdown format." - "Get the tabular data from this form document."
Connector