370,206 tools. Last updated 2026-08-02 11:35

"Tools and methods for extracting data from PDF files" matching MCP tools:

prefetch_datasheets
sheetsdata-mcp
Trigger background datasheet extraction for multiple parts at once (up to 20). Non-blocking: returns immediately with the status of each part. Use this to warm up datasheets for a BOM before calling read_datasheet.
Example: prefetch_datasheets(['TPS54302', 'ADS1115', 'LP5907'])
If a part comes back 'no_source' on the first call, retry prefetch for that MPN once after 10-30s; the URL resolver is retriable and often finds a source on the second pass. If still 'no_source', use request_datasheet_upload + confirm_datasheet_upload to attach your own PDF (org-private).
Part numbers must be specific MPNs (e.g. 'STM32F446RCT6', 'TPS54302DDCR') or LCSC numbers (e.g. 'C2837938'). Do NOT pass bare values ('100nF', '10K'), descriptions, BOM reference designators, test points, or board/module names; see the server instructions for the full rule set. When a BOM has values-only rows, use search_parts first to resolve each to an MPN.
DATASHEET STATUS VALUES:
- 'ready': extracted and indexed; call read_datasheet, search_datasheets, or analyze_image.
- 'extracting' / 'in_progress' / 'queued' / 'pending': extraction running or scheduled. Poll check_extraction_status every 5-10s until 'ready' or 'failed'. Typical time: 30s-2min.
- 'not_extracted': known part but datasheet hasn't been fetched yet. Trigger it via prefetch_datasheets (cheapest) or by calling read_datasheet (auto-triggers on first read).
- 'no_source': we couldn't find a public datasheet URL for this MPN. First, retry prefetch_datasheets in 10-30s (the URL resolver re-runs and often finds a source on the second pass). If still 'no_source', the agent can upload the PDF manually via request_datasheet_upload + confirm_datasheet_upload (see those tools). Org-uploaded datasheets are private to the org.
- 'unsupported': PDF exists but can't be extracted (scanned image-only, encrypted, or corrupted). Upload a clean text-based PDF via request_datasheet_upload to override.
- 'failed' / 'error': extraction errored. The response includes the error reason. Retry via prefetch_datasheets or escalate to support.
- 'rejected': input wasn't a real MPN (bare value like '100nF', description, or reference designator). Fix the input and re-call.
- 'deduplicated': another part in the family already has this datasheet; same content is returned under the primary MPN.
Connector
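Since prefetch_datasheets accepts at most 20 parts per call, a longer BOM has to be split before it is sent. The following is a minimal sketch of that batching step only; the helper name `batches` and the choice to drop blank and duplicate entries are assumptions for illustration, not part of sheetsdata-mcp.

```python
from typing import Iterable, Iterator, List

MAX_PARTS_PER_CALL = 20  # limit stated in the prefetch_datasheets description


def batches(part_numbers: Iterable[str],
            size: int = MAX_PARTS_PER_CALL) -> Iterator[List[str]]:
    """Yield part numbers in lists of at most `size`.

    Blank entries and repeats are skipped (an assumption: a BOM often
    lists the same MPN on several rows).
    """
    seen = set()
    batch: List[str] = []
    for pn in part_numbers:
        pn = pn.strip()
        if not pn or pn in seen:
            continue
        seen.add(pn)
        batch.append(pn)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch
```

Each yielded list would then be passed to one prefetch_datasheets call by whatever MCP client is in use.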
find_examples
Senzing
Find working SOURCE CODE examples from 37 indexed Senzing GitHub repositories. REQUIRED: either `query` (string, for search) or `repo` with `file_path` or `list_files=true` — the call WILL FAIL without one. Three modes: (1) Search: pass `query` to find examples across all repos, (2) File listing: pass `repo` + `list_files=true`, (3) File retrieval: pass `repo` + `file_path`. Indexes source code (.py, .java, .cs, .rs) and READMEs — NOT build/data files. For sample data, use get_sample_data. Covers Python, Java, C#, Rust SDK patterns: initialization, ingestion, search, redo, configuration, message queues, REST APIs. Use max_lines to limit large files. Returns GitHub raw URLs for file retrieval.
Connector
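The three find_examples modes can be told apart from the arguments alone. The sketch below mirrors the rule in the description (a `query`, or a `repo` with `list_files=true` or `file_path`); the function name and the precedence used when several arguments are supplied together are assumptions, since the description does not say how the server resolves that case.

```python
from typing import Optional


def find_examples_mode(query: Optional[str] = None,
                       repo: Optional[str] = None,
                       file_path: Optional[str] = None,
                       list_files: bool = False) -> str:
    """Return which documented mode a set of arguments would select.

    Raises ValueError for argument sets the description says will fail.
    Precedence (query first, then listing, then retrieval) is assumed.
    """
    if query:
        return "search"
    if repo and list_files:
        return "file_listing"
    if repo and file_path:
        return "file_retrieval"
    raise ValueError(
        "pass `query`, or `repo` with `file_path` or `list_files=True`"
    )
```

Checking arguments client-side like this avoids spending a call on a request that is documented to fail.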
create_upload_url
ibo
Get a short-lived presigned URL to upload one brand-asset file to IBO's private storage. Requires order_token from get_order; the storage location is bound to the order server-side. PUT the raw file bytes to url, then reference key in submit_brief files[]. Allowed: jpg png webp pdf svg mp4 mov zip ai psd; 250MB/file, 1GB per order.
Connector
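Before requesting presigned URLs from create_upload_url, the stated limits can be checked locally. This is a sketch under assumptions: the description gives "250MB/file, 1GB per order" without saying whether those are decimal or binary units, so decimal is assumed here, and `check_uploads` is a hypothetical helper, not an IBO API.

```python
import os
from typing import Dict, List

# Extensions listed in the create_upload_url description.
ALLOWED_EXTENSIONS = {"jpg", "png", "webp", "pdf", "svg",
                      "mp4", "mov", "zip", "ai", "psd"}
MAX_FILE_BYTES = 250 * 1000 * 1000    # assumption: decimal megabytes
MAX_ORDER_BYTES = 1000 * 1000 * 1000  # assumption: decimal gigabyte


def check_uploads(files: Dict[str, int]) -> List[str]:
    """Return human-readable problems for a {filename: size_in_bytes} map.

    An empty list means every file passes the documented limits.
    """
    problems: List[str] = []
    for name, size in files.items():
        ext = os.path.splitext(name)[1].lower().lstrip(".")
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: extension '{ext}' is not allowed")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: exceeds the per-file limit")
    if sum(files.values()) > MAX_ORDER_BYTES:
        problems.append("order: total size exceeds the per-order limit")
    return problems
```

Files that pass would then each get their own create_upload_url call, followed by an HTTP PUT of the raw bytes to the returned url.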
check_extraction_status
sheetsdata-mcp
Check the extraction status of one or more parts. Free. Each entry includes the current extraction step, elapsed seconds, and document ID. Use after prefetch_datasheets or after read_datasheet triggers a new extraction. Recommended polling cadence: every 5-10 seconds. Extraction typically takes 30s-2min for new parts, so polling faster than every 5s wastes calls. Stop polling once status is 'ready', 'failed', 'no_source', or 'unsupported'.
Connector
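The polling guidance for check_extraction_status (poll every 5-10s, stop on 'ready', 'failed', 'no_source', or 'unsupported') can be sketched as a small loop. The `check_status` callable is a hypothetical wrapper around the MCP call that returns a {part: status} mapping; the 180-second budget and the 'timeout' label are assumptions added for the sketch, not values returned by the server.

```python
import time
from typing import Callable, Dict, Iterable, List

# Statuses at which the description says to stop polling.
TERMINAL = {"ready", "failed", "no_source", "unsupported"}


def poll_until_done(
    parts: Iterable[str],
    check_status: Callable[[List[str]], Dict[str, str]],  # hypothetical wrapper
    interval_s: float = 5.0,    # description recommends 5-10 seconds
    timeout_s: float = 180.0,   # assumption: a bit above the 30s-2min typical time
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until every part reaches a terminal status or the budget runs out."""
    pending = list(parts)
    result: Dict[str, str] = {}
    waited = 0.0
    while pending:
        statuses = check_status(pending)
        for part in list(pending):
            status = statuses.get(part, "pending")
            if status in TERMINAL:
                result[part] = status
                pending.remove(part)
        if not pending:
            break
        if waited >= timeout_s:
            for part in pending:
                result[part] = "timeout"  # local label, not a server status
            break
        sleep(interval_s)
        waited += interval_s
    return result
```

Only parts that are still pending are re-queried on each pass, which keeps the number of status lookups down for large batches.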
get_part_details
sheetsdata-mcp
Get full details for a specific electronic component by manufacturer part number (MPN) or LCSC number. Returns specs, pricing, and stock from all configured providers, plus the cached datasheet summary if available. Includes datasheet_status and available_sections when ready. Set prefetch_datasheet=true to trigger background extraction at no extra charge. Use after search_parts to drill into a specific result. The part_number must be a specific manufacturer part number (e.g. 'TPS54302DDCR', 'STM32F446RCT6') or LCSC number (e.g. 'C2837938'). Do NOT pass bare component values ('100nF', '10K'), descriptions ('buck converter'), or reference designators ('R1', 'U3').
Connector
compare_parts
sheetsdata-mcp
Compare 2-5 electronic components side by side in a single call. For each part, returns merged provider data (pricing, stock, structured parameters, package) plus the cached datasheet summary if one exists, plus datasheet_status ('ready', 'extracting', or 'not_extracted'). Use this instead of calling get_part_details in a loop; it fans out provider queries in parallel and merges by MPN. For *discovering* candidates, use search_parts or find_alternative first; compare_parts assumes you already know which MPNs you want to compare. Behavior: - Uses only cached datasheet summaries and does not trigger extraction. Call prefetch_datasheets first if you need summaries for parts that haven't been extracted yet. - Validates every MPN upfront. If *any* input is not a real part number (value, description, reference designator), the whole call is rejected with a 'rejected' map listing the offenders, and the other parts are not compared. Filter your list before calling. - If a valid MPN is not found at any provider, that part still appears in the response with an 'error' field; the other parts are compared normally. IMPORTANT: part_numbers must be specific manufacturer part numbers (e.g. 'TPS54302DDCR', 'STM32F446RCT6') or LCSC numbers (e.g. 'C2837938'). Do NOT pass component values ('100nF', '10K'), descriptions ('buck converter'), or reference designators ('U3', 'R1'). Example: compare_parts(['TPS54302', 'LM2596', 'MP2359'])
Connector

Matching MCP Servers

PDF Tools
File Systems Search Text Summarization
Open-Document-Alliance
license: A, quality: -, maintenance: A
The local PDF workflow for Claude Desktop and MCP hosts: fill, sign, merge, split, extract, and analyze PDFs without sending files to a web app.
Last updated 2026-07-31
6
147
MIT
methods-mcp
Bioinformatics Research & Data AI & Machine Learning
webwebb56
license: A, quality: -, maintenance: D
Provides MCP tool adapters for Bioconductor methods like limma, DESeq2, and fgsea, enabling statistical analysis of omics data through containerized R execution. It serves as a bridge between MCP clients and bioinformatics tools for reproducible research workflows.
Last updated 2026-04-18
Apache 2.0

Matching MCP Connectors

Tools for Agents
Decision Layer for AI Agents — 58+ tools, Advisor, MCP. Free key: POST /v1/register {}.
pdf
Markdown to PDF: headings, bold, code, lists, rules. A4/Letter/Legal. Free 30/hr. MCP + REST.

create_powersource_docs
Heista
Build a complete creative intelligence profile from internal brand documents — creative briefs, brand guidelines, product specs, customer research, competitive analysis. Takes any mix of file_ids (from a previous upload), document_urls (public PDF/DOCX/TXT/MD links, up to 10), or documents_inline (base64-encoded files with filename), plus an optional context_url for layering live brand context (colors, fonts, current messaging) and optional idempotency_key. Returns a job_id; poll with get_powersource. Output shape is identical to create_powersource_url: identity, offer, selling points, voice, buyer profile, tensions, angles, emotional arcs, ctas, narrative. Use this when the user says "I have a brief", "here's my brand guidelines", "use this document", drops a PDF / DOCX / strategy deck, or when the truth lives in internal materials rather than the public website. The pipeline reads text only — convert PDFs to markdown before submitting via documents_inline when possible. Costs 100 credits. Do NOT use for URL-only scans — use create_powersource_url. For URL + docs combined (highest fidelity, triangulates public messaging against internal strategy), use create_powersource_full.
Connector
convert_document
Carbone MCP
Convert any document to another format without storing a template. Supports 100+ input/output format combinations: Office documents, PDFs, images, web pages, spreadsheets, and more. The source file can be a local path, a URL, or a base64 string. Use render_document instead when you need data injection ({d.field} tags), translations, or batch generation. Common conversions: DOCX → PDF (file: "report.docx", convertTo: "pdf"), XLSX → PDF (file: "data.xlsx", convertTo: "pdf"), PPTX → PDF (file: "slides.pptx", convertTo: "pdf", converter: "O" for best fidelity), HTML → PDF (file: "page.html", convertTo: "pdf", converter: "C" for full CSS/JS rendering), DOCX → HTML (file: "doc.docx", convertTo: "html"), XLSX → CSV (file: "sheet.xlsx", convertTo: "csv"), PDF → PNG (file: "doc.pdf", convertTo: "png"), PPTX → PNG (first slide as image), MD → PDF (file: "readme.md", convertTo: "pdf").
Connector
parse_pdf_to_text
Nordic Financial MCP
Download a PDF from a URL and extract all text content, page by page. Use this to read the full text of a specific document — for example, an annual report PDF linked from a search_filings result. Best combined with search_filings: use search_filings to locate the document, then parse_pdf_to_text for the full text. Do not use for PDFs that are already well-represented in the database — search_filings is faster and returns pre-ranked, relevant excerpts. Not suitable for scanned (image-only) PDFs without embedded text; those pages will be returned as "(no extractable text)". Args: pdf_url: Direct HTTPS URL to the PDF file, e.g. https://example.com/report.pdf. Must be publicly accessible; authentication-protected URLs will fail. Returns: All text from the PDF with "--- Page N ---" separators between pages. Returns an error string if the download fails, the URL does not point to a valid PDF, or the document exceeds the 60-second download timeout.
Connector
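Because parse_pdf_to_text returns one string with "--- Page N ---" separators, a caller will usually want to split it back into pages and spot the pages reported as "(no extractable text)". The sketch below assumes each separator sits on its own line, which the description implies but does not state; `split_pages` and `pages_without_text` are illustrative names.

```python
import re
from typing import Dict, List

# Assumption: each separator occupies a full line of the returned text.
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---[ \t]*$", re.MULTILINE)
NO_TEXT = "(no extractable text)"  # placeholder quoted in the description


def split_pages(text: str) -> Dict[int, str]:
    """Map page number -> page text for a parse_pdf_to_text result."""
    # With a capture group, re.split returns
    # [prefix, page_no, body, page_no, body, ...].
    parts = PAGE_MARKER.split(text)
    pages: Dict[int, str] = {}
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1].strip()
    return pages


def pages_without_text(pages: Dict[int, str]) -> List[int]:
    """List page numbers that came back as image-only placeholders."""
    return [n for n, body in pages.items() if body == NO_TEXT]
```

A non-empty result from `pages_without_text` is a hint that the PDF is partly scanned and that those pages need a different extraction route.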
convert_html_to_pdf
Sats4AI - Bitcoin-Powered AI Tools
Convert HTML or Markdown to a pixel-perfect PDF. Returns JSON: { url } — a temporary download URL (valid ~1 hour). Great for generating invoices, reports, receipts, or formatted documents programmatically. Supports full HTML/CSS including tables, images (base64 or URL), and inline styles. For Markdown input, set format='markdown'. 50 sats per conversion. Use convert_file instead for converting existing files between formats (e.g., DOCX→PDF). Pay per request with Bitcoin Lightning — no API key or signup needed. Requires create_payment with toolName='convert_html_to_pdf'.
Connector
search_parts
sheetsdata-mcp
Search for electronic components by part number, description, or keyword. Start here: this is the best entry point for finding components. Queries all configured providers in parallel. Results are merged by MPN with indicative pricing and stock from each source. Each result includes datasheet_status so you know which parts have datasheets available for read_datasheet. Best with specific part numbers or keywords (e.g. 'STM32F103', 'buck converter 3A'). For spec-based discovery in natural language, use search_datasheets instead. When the calling org has a private parts library, matching org-uploaded parts are appended to the results with source='private_library' and any tags the team has applied, including private parts whose MPN, manufacturer, description, type, category, or tag matches the query.
Connector
firecrawl_parse
firecrawl-mcp
Parse one supported document into markdown, HTML, links, summary, targeted answers, or JSON matching a schema. Supported inputs include common HTML, PDF, Word, RTF, OpenDocument, and spreadsheet files; PDF parsing can be bounded with `pdfOptions.maxPages`. Local MCP reads `filePath` from the server filesystem. Hosted MCP uses two calls: first provide `filePath` to receive upload instructions, upload locally, then call again with the returned `uploadRef`; do not send both fields together. Remote web URLs belong in `firecrawl_scrape`. Set `redactPII` to request redaction of personally identifiable information in the returned content. `zeroDataRetention` requires an eligible authenticated account; omit it for anonymous keyless use. Returns upload instructions for hosted phase one or parsed document content for the final call.
Connector
pictify_render_multi_page_pdf
mcp
Generate a multi-page PDF from a template by providing multiple sets of variables. Each variable set produces one page in the final document. Supports 1-100 pages per PDF. Common use cases: bulk invoice generation, certificate batches for events/courses, multi-page reports, product catalogs, and employee ID cards. WORKFLOW: Call pictify_get_template_variables first to discover available variables, then provide an array of variable sets (one per page). Returns a single combined PDF URL. For generating separate image files per set, use pictify_batch_render instead.
Connector
get_filing_extract_meta
ca-rate-filings
Lists **what's in** each extracted artefact for a filing (section counts, item names, and the page each item came from) without returning any of the bulky factor tables, descriptions, or rate rows themselves. **Call this FIRST**, before `get_filing_extracts`, for any "what does this filing contain" question. It costs a fraction of the tokens and tells you which file + which section you need to pull in detail. `get_filing_extracts` is then the targeted second call once you know the SERFF + file + section that actually answer the user's question.
Use this when the user asks:
- "What forms does this filing include?" / "List the form numbers in TSIS-134726605."
- "How many exclusions does it carry? What are they called?"
- "What rate tables are in this filing, and which PDF page are they on?"
- "List the discounts / endorsements / coverages this filing offers."
- "Where in the source PDF is the territory rate table?"
- Any "how many", "what are the names of", or "which page is X on" question about a filing's extracted artefacts.
Wrong surface for:
- Anything that needs the actual numeric content (factor values, full rate rows, full exclusion text). Call `get_filing_extracts` instead, narrowing `files` to just the one(s) you discovered here.
Whitelist (same as `get_filing_extracts`):
- `calculations.json`: example rate-calculation walk-throughs.
- `coverages.json`: coverage definitions (perils, limits, applicability).
- `deductibles.json`: deductible options + factors.
- `discounts.json`: discount / surcharge schedules.
- `endorsements.json`: optional endorsements / riders.
- `examples.json`: worked policyholder rating examples.
- `exclusions.json`: coverage exclusions + the conditions they apply to.
- `extraction_summary.json`: structured filing-overview fields.
- `final_rating_calculation.json`: canonical rating expression.
- `forms.json`: policy form numbers + types.
- `rates_data.json`: base rates + rate-table headers.
- `underwriting_guidelines.json`: eligibility / UW rules.
Per item the tool returns `{ name, source_page? }`. The item name is picked from whichever identifying field exists (`name` → `form_number` → `id` → `key` → `code` → `coverage` → `label` → `title`). `source_page` is the page in the source PDF where the item was extracted from, when the pipeline recorded one. `rates_data.json` items additionally carry `source_file` (the source PDF the rate table lives in) when the filing has a single source PDF. Multi-source filings get `source_file_note` flagging the limit (per-item `source_file` on non-rate extracts needs a pipeline-side change, deferred).
Args: `serff` (required), `files` (optional; pass a subset of the whitelist to narrow, or omit for all 12).
Returns: `{ serff, files: { "<name>": { file_name, filing_ref?, confidence?, sections: { "<key>": { count, items: [...] } }, total_items } }, count, skipped }`.
Connector
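The item-name fallback described for get_filing_extract_meta (`name` → `form_number` → `id` → `key` → `code` → `coverage` → `label` → `title`) amounts to a first-match lookup over an ordered field list. The sketch below is an illustration of that rule, not the server's code; treating empty values as missing is an assumption.

```python
from typing import Any, Dict

# Field order quoted in the get_filing_extract_meta description.
NAME_FIELDS = ("name", "form_number", "id", "key",
               "code", "coverage", "label", "title")


def item_summary(item: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce an extracted item to the `{ name, source_page? }` shape."""
    # First identifying field that is present and non-empty wins
    # (assumption: empty strings and None count as absent).
    name = next((item[f] for f in NAME_FIELDS if item.get(f)), None)
    summary: Dict[str, Any] = {"name": name}
    if item.get("source_page") is not None:
        summary["source_page"] = item["source_page"]
    return summary
```

For example, an item carrying only `form_number` and `source_page` keeps both, while an item with `label` and `title` is named by its `label` because that field comes earlier in the order.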
nonprofit_get_filings
nonprofit-explorer-mcp-server
All Form 990 filings for a tax-exempt org by EIN: year-by-year revenue, expenses, assets, program-expense ratio (with inputs shown), executive compensation, and source PDF/XML links. Use for trend analysis, due diligence, and accessing primary 990 documents. The filing year (tax_prd_yr) is the fiscal year of the return — data lags 1–2 years; always cite the year. Program expense ratio is computed as (total_expenses − officer comp − other wages − fundraising) / total_expenses for 990/990-EZ; not available for 990-PF. Also returns filings_pdf_only — older filings with a PDF but no extracted financial data. Data from ProPublica Nonprofit Explorer, sourced from IRS Form 990 filings.
Connector
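The program expense ratio quoted for nonprofit_get_filings is simple arithmetic and can be reproduced from the returned inputs. The function below is a sketch of the stated formula for 990/990-EZ filings; the parameter names and the choice to return None when total expenses are not positive are assumptions.

```python
from typing import Optional


def program_expense_ratio(total_expenses: float,
                          officer_comp: float,
                          other_wages: float,
                          fundraising: float) -> Optional[float]:
    """(total_expenses - officer comp - other wages - fundraising) / total_expenses.

    Returns None when the ratio is undefined (assumption: non-positive
    total expenses are treated as "not available").
    """
    if total_expenses <= 0:
        return None
    program = total_expenses - officer_comp - other_wages - fundraising
    return program / total_expenses
```

With total expenses of 1,000,000, officer compensation of 100,000, other wages of 200,000, and fundraising of 50,000, the ratio works out to 650,000 / 1,000,000 = 0.65.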
render_template_to_pdf
pdfzen
Prepare a paid PDF render from one of the 45 starter templates. Validates your template + data and returns the exact, ready-to-execute HTTP request to run against pdfzen's render endpoint — POST /v402/render/pdf (x402, $0.006 USDC on Base, no API key) or POST /v1/render/pdf (pdfzen API key, credit-billed). pdfzen renders are executed over HTTP, not streamed in-band over MCP; this tool is the bridge. Your `data` is merged ON TOP of the starter sampleData at render time, so partial payloads inherit demo defaults (e.g. ship just the customer name + total).
Connector
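The statement that `data` is merged on top of the starter sampleData means a partial payload overrides only the keys it supplies. The sketch below shows that behaviour with a shallow merge; whether pdfzen merges nested objects key by key is not stated, so the shallow merge, the helper name, and the field names in the usage note are all assumptions.

```python
from typing import Any, Dict


def merge_render_data(sample_data: Dict[str, Any],
                      data: Dict[str, Any]) -> Dict[str, Any]:
    """Return sample_data overlaid with data, leaving both inputs unchanged.

    Shallow merge (assumption): a key present in `data` replaces the
    whole value from `sample_data`, including nested objects.
    """
    return {**sample_data, **data}
```

Given hypothetical sample data with `customer`, `total`, and `currency` keys, sending only `customer` and `total` would leave the demo `currency` value in place in the rendered PDF.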
upload_data_source
clariBI.com
Create a new data source from an inline base64-encoded file (CSV, TSV, JSON, Excel, TXT, PDF). The file goes through the same validation and preprocessing as a web upload. Returns the data_source_id you can pass to run_analysis as soon as preprocessing completes (poll get_data_source_schema for readiness or pass wait_seconds to block here).
Connector
resolve_topic
Pine Script
USE WHEN looking up an exact Pine Script API term or known concept keyword. Returns the best-matching doc paths with matched keywords and a retrieval suggestion (get_doc or list_sections + get_section). AFTER calling this tool, follow the suggestion: call get_doc() for small files or list_sections() + get_section() for large files. For natural language questions use search_docs() instead. Data sourced from bundled TOPIC_MAP and doc file content scan.
Connector
t54_list_operations
Agentic Swarm Marketplace (T54 x402 MCP)
Returns operationIds, HTTP methods, paths, and query parameter names from the bundled OpenAPI spec (no network). Use before t54_x402_request or per-SKU tools.
Connector
get_full_text
Pubmed
Fetch the FULL TEXT of a biomedical paper from PubMed Central (the open-access subset) by PubMed ID. PREFER OVER get_abstract when you need methods/results/discussion, not just the abstract — "read the full paper", "what methods did <PMID> use", "extract details from the paper". Resolves the PMID to its PMC id and returns the article body text (capped ~40k chars). Only open-access articles are in PMC — returns has_full_text:false (use get_abstract) otherwise.
Connector