226,022 tools. Last updated 2026-06-23 00:05

"A tool for document extraction using MinerU" matching MCP tools:

check_document_package
Document Integrity Validator
Validates a package of 2-20 related trade finance documents for cross-document consistency. Call this BEFORE approving any multi-document trade finance transaction or cross-border shipment, at the moment the package arrives from an external party and before funds are released. Use this when your agent has received a full trade finance package (such as an invoice, bill of lading, and certificate of origin together) and must verify that the documents agree with each other. Cross-checks numeric values, party names, reference numbers, dates, and commodity descriptions across all documents. Returns a PASS/FLAG/FAIL verdict per document with mismatch details. A single inconsistency in a trade finance document package is a fraud signal -- funds released on a mismatched package have no recovery path. Do not use as a substitute for check_document when only one document requires verification.
Connector
verify_receipt
TunnelMind Data API
Tamper-detection verification for TunnelMind surveillance receipts. Submit the receipt ID, the SHA-256 content hash, and the Ed25519 signature from the receipt document. The registry compares these against what was recorded at issuance time. Returns VALID if both match exactly, INVALID with a specific mismatch reason otherwise. Use this tool when: - You received a surveillance receipt document and want to verify it hasn't been altered. - You are programmatically checking receipt authenticity in an agent workflow. - You want to prove to a third party that a receipt is genuine. Do NOT use this tool when: - You only want to check existence — use `get_receipt` instead (no body required). Inputs: - `receipt_id` (body, required): The receipt's ID field from the document. - `content_hash` (body, required): SHA-256 hex hash of the receipt JSON. Max 256 chars. - `signature` (body, required): Ed25519 signature from the receipt document. Max 512 chars. Returns: - `valid`: boolean. True only if both hash and signature match exactly. - `status`: `VALID` or `INVALID`. - `message`: human-readable explanation. On INVALID, specifies whether the hash mismatched, the signature mismatched, or both. Cost: - Free. No API key required. Latency: - Typical: <100ms, p99: <300ms.
Connector
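A minimal sketch of how an agent might assemble the three `verify_receipt` inputs described above. The listing does not say how the receipt JSON is canonicalised before hashing, so the sorted-key, compact-separator serialisation here is an assumption, and `receipt_content_hash` and `build_verify_body` are hypothetical helper names, not part of the TunnelMind API.

```python
import hashlib
import json


def receipt_content_hash(receipt: dict) -> str:
    """Return the SHA-256 hex digest of a receipt's JSON form.

    Assumption: the registry hashes JSON with sorted keys and compact
    separators. The real canonical form may differ.
    """
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_verify_body(receipt_id: str, receipt: dict, signature: str) -> dict:
    """Assemble the request body fields listed in the tool description."""
    content_hash = receipt_content_hash(receipt)
    # Length limits taken from the description: 256 and 512 characters.
    if len(content_hash) > 256:
        raise ValueError("content_hash exceeds 256 characters")
    if len(signature) > 512:
        raise ValueError("signature exceeds 512 characters")
    return {
        "receipt_id": receipt_id,
        "content_hash": content_hash,
        "signature": signature,
    }
```

A SHA-256 hex digest is always 64 characters, so the 256-character limit only matters if a caller passes something other than a hex digest.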
verify_claim
Ground Truth - First Call Activation
Check whether a factual claim is supported by a specific set of public evidence URLs that you already have. For each source, the tool performs a case-insensitive keyword match over the fetched page body, then marks that source as supporting the claim when at least half of the supplied keywords appear. Use this for evidence-backed claim checks on known pages, not for open-ended search, semantic reasoning, or contradiction extraction. The aggregate verdict is driven only by the per-page keyword support ratio. Fetched pages are cached for 5 minutes.
Connector
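The per-source rule in the `verify_claim` description (case-insensitive keyword match, support when at least half of the keywords appear) can be sketched as follows. This is an illustration of the stated rule only: `source_support` is a hypothetical name, plain substring matching is an assumption, and page fetching, caching, and the aggregate verdict are left out because the listing does not specify them.

```python
def source_support(page_body: str, keywords: list[str]) -> tuple[float, bool]:
    """Return (keyword support ratio, supported?) for one fetched page.

    Assumption: a keyword "appears" if it occurs as a substring of the
    lower-cased page body.
    """
    if not keywords:
        return 0.0, False
    body = page_body.lower()
    hits = sum(1 for kw in keywords if kw.lower() in body)
    ratio = hits / len(keywords)
    # "At least half" of the supplied keywords must appear.
    return ratio, hits * 2 >= len(keywords)
```

Because the check is purely lexical, a page that contradicts the claim while using the same words would still count as supporting, which matches the description's warning that the tool does not do semantic reasoning or contradiction extraction.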
get_receipt
TunnelMind Data API
Returns metadata for a TunnelMind surveillance receipt — a signed document proving that a specific user's surveillance exposure was observed, measured, and recorded at a specific time. Does NOT return the receipt's signature (anti-phishing protection). To verify a receipt's content integrity, use `verify_receipt` with the hash and signature from the receipt document itself. Use this tool when: - You have a receipt ID and want to confirm it was genuinely issued by TunnelMind. - You need the issuance timestamp and signing key ID for a receipt. - You want to check whether a receipt exists before attempting content verification. Do NOT use this tool when: - You have the full receipt document and want to verify it hasn't been tampered with — use `verify_receipt` instead. Inputs: - `receipt_id` (path, required): The receipt ID from the receipt document. Alphanumeric with hyphens, max 128 characters. Returns: - `status`: `FOUND` if the receipt is in the registry. - `generated_at`: ISO 8601 timestamp of receipt issuance. - `signing_key_id`: identifier of the Ed25519 key used to sign. - `schema_version`: receipt schema version. - `message`: human-readable summary with instructions for content verification. - 404 if the receipt ID is not in the registry. Cost: - Free. No API key required. Latency: - Typical: <100ms, p99: <300ms.
Connector
create_collection
DocImprint
Create a named document collection for cross-document search and Q&A. Free — no credits consumed. NOTE: Collections are empty after creation. Add evidence bundles with add_document_to_collection. Indexing is async — once complete, use search_collection or ask_collection. Returns: { collection_id: string (col_...), name: string }
Connector
create_collection
api
Create a named document collection for cross-document search and Q&A. Free — no credits consumed. NOTE: Collections are empty after creation. Add evidence bundles with add_document_to_collection. Indexing is async — once complete, use search_collection or ask_collection. Returns: { collection_id: string (col_...), name: string }
Connector

Matching MCP Servers

MinerU Document Explorer (official)
RAG Systems, Search
opendatalab
License: A | Quality: - | Maintenance: B
Enables AI agents to search, deep-read, and build knowledge bases from Markdown, PDF, DOCX, and PPTX documents via MCP tools for retrieval, document navigation, and ingestion.
Last updated 2026-04-26
64
587
MIT
MinerU MCP Server
Documentation Access, App Automation, Research & Data
linxule
License: A | Quality: A | Maintenance: C
Enables document parsing and extraction from PDFs and other formats using the MinerU API. Supports batch processing, page range selection, OCR in 109 languages, and VLM/pipeline models for high-accuracy content extraction.
Last updated 2026-05-07
4
108
6
MIT

Matching MCP Connectors

Document Integrity Validator
AI reasoning checks any document against known international standards before your agent acts on it.
Call For Me
Give your AI agent a phone. Place outbound calls to US businesses to ask, book, or confirm.

ask_collection
api
Answer a question using RAG over a document collection. Retrieves relevant chunks then synthesizes a cited answer. Use when you need a direct answer with source attribution; use search_collection for raw chunks. PREREQUISITE: Collection must be populated via REST API and indexed before results appear. Returns: { answer: string, sources: [{ bundle_id, chunk_id }], retrieval: [{ bundle_id, chunk_id, text, score }] }
Connector
classify_document
OpenWarrant — Document Verification Suite
Classify a FINANCIAL document's type and issuing country. Specialised in financial-services documents: payslip, tax_invoice, bank_statement, salary_certificate, payg_summary, receipt. USE THIS WHEN someone shares a document (or a link to one) and asks: what kind of document is this? is this a payslip / invoice / bank statement? route this document. Also use it as the FIRST step before verify_document, so the right checks run. Provide the document ONE way: `url` (a public http(s) link to a PDF or image — fetched server-side, the cheapest call) OR `bytes_b64` (inline base64, plus `filename` for PDF-vs-image routing). Returns `{document_type, country_code, confidence, is_financial_document, evidence, ...}`. HONEST SCOPE: type classification only — NOT an authenticity or fraud judgment (use verify_document for that). Below the confidence threshold it abstains with 'unknown' rather than guessing; non-financial documents classify as 'other'. The document is never stored.
Connector
get_guide
Canvs
⚠️ MANDATORY FIRST STEP - Call this tool BEFORE using any other Canvs tools! Returns comprehensive instructions for creating whiteboards: tool selection strategy, iterative workflow, and examples. Following these instructions ensures correct diagrams.
Connector
check_extraction_status
sheetsdata-mcp
Check the extraction status of one or more parts. Free. Each entry includes the current extraction step, elapsed seconds, and document ID. Use after prefetch_datasheets or after read_datasheet triggers a new extraction. Recommended polling cadence: every 5-10 seconds. Extraction typically takes 30s-2min for new parts, so polling faster than every 5s wastes calls. Stop polling once status is 'ready', 'failed', 'no_source', or 'unsupported'. DATASHEET STATUS VALUES: - 'ready' — extracted and indexed; call read_datasheet, search_datasheets, or analyze_image. - 'extracting' / 'in_progress' / 'queued' / 'pending' — extraction running or scheduled. Poll check_extraction_status every 5-10s until 'ready' or 'failed'. Typical time: 30s-2min. - 'not_extracted' — known part but datasheet hasn't been fetched yet. Trigger it via prefetch_datasheets (cheapest) or by calling read_datasheet (auto-triggers on first read). - 'no_source' — we couldn't find a public datasheet URL for this MPN. First, retry prefetch_datasheets in 10-30s (the URL resolver re-runs and often finds a source on the second pass). If still 'no_source', the agent can upload the PDF manually via request_datasheet_upload + confirm_datasheet_upload (see those tools). Org-uploaded datasheets are private to the org. - 'unsupported' — PDF exists but can't be extracted (scanned image-only, encrypted, or corrupted). Upload a clean text-based PDF via request_datasheet_upload to override. - 'failed' / 'error' — extraction errored. The response includes the error reason. Retry via prefetch_datasheets or escalate to support. - 'rejected' — input wasn't a real MPN (bare value like '100nF', description, or reference designator). Fix the input and re-call. - 'deduplicated' — another part in the family already has this datasheet; same content is returned under the primary MPN.
Connector
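The polling guidance for `check_extraction_status` (poll every 5-10 seconds, stop on 'ready', 'failed', 'no_source', or 'unsupported') might be wrapped in a loop like the one below. This is a sketch under assumptions: `get_status` is a hypothetical callable standing in for the actual tool call, and the overall timeout is an added safeguard that the listing does not mention.

```python
import time
from typing import Callable

# Stop conditions as listed in the tool description.
TERMINAL_STATUSES = {"ready", "failed", "no_source", "unsupported"}


def poll_until_terminal(
    get_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() every interval_s seconds until a terminal status.

    get_status is a hypothetical wrapper around check_extraction_status
    that returns the status string for one part.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"still '{status}' after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

The default interval sits at the low end of the recommended 5-10 second cadence. The description also lists 'error', 'rejected', and 'deduplicated'; a caller may want to treat those as terminal too, but they are not in the stated stop list.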
extract_contract_from_url
transaction-coordinator
Extract structured transaction data from a contract at a URL. Downloads the document, extracts text (with OCR fallback for scanned PDFs), and runs PrimaCoda's contract-extraction prompt to return parties, addresses, dates, prices, and key contract fields. Use this when an agent has the contract hosted somewhere (Dropbox, Google Drive direct download, Squarespace, etc.) and wants to skip the upload step. This tool handles ONE document; for multi-document deals (purchase + addenda + disclosures), use the PrimaCoda dashboard's batch upload. Args: pdf_url: Direct download URL for the contract (PDF, DOCX, TXT, or image). Must be reachable from the PrimaCoda server. Google Drive "shared link" URLs work if set to "anyone with link"; other share URLs may need their direct-download form. api_key: Your PrimaCoda MCP API key (starts 'pck_').
Connector
tag_file
Gemina FileTag
Run the FileTag pipeline against a previously uploaded slot. The ``file_id`` comes from a prior ``files_create_upload`` call. The server validates the uploaded blob (size, content-type, optional SHA-256), atomically consumes the slot, runs the FileTag extraction (renaming + metadata embedding), and returns the structured result with the extracted metadata, the suggested filename, the ``enriched_file_url`` (short-lived signed URL to the renamed copy with metadata embedded into document properties), and a ``next_action`` recipe (``http_get_and_save``) telling the agent to download that URL and save it as the suggested filename -- act on it unless the user explicitly asked for metadata only. Each slot is single-use; reserve a new slot with ``files_create_upload`` to retry.
Connector
taste
Pane
Read / write / clear the agent's freeform UI taste notes (a small markdown document of presentation preferences learned from human feedback — 'denser layout', 'no rounded corners'). ONE tool with an `action` enum: get | set | clear. Call `get` BEFORE generating a pane so prior feedback shapes the output; `set` does a whole-document replace (not append). Keep entries about UI/presentation only.
Connector
find_matches_in_index
RChilli MCP Hub
Match one source document against the user's ALREADY-INDEXED corpus and return the best-matching, ranked candidates (RChilli Search & Match Engine). Requires a populated index. Uses RChilli's purpose-built matching engine — more reliable than manually comparing documents. Use this when the user wants to: find the best/top matching resumes for a JD, find matching candidates from their pool, or rank their indexed resumes/JDs against a given document — e.g. "find the best candidates in my database for this job". Also phrased as: shortlist from my pool, top matches for this JD, rank my candidates. Do NOT use for: scoring a single resume against a single JD with no index (use ``search_one_match``); plain keyword lookup (use ``search_simple_search``). Supports all four match directions by combining ``index_type`` and ``doc_type``: - **JD to Resume** — ``index_type='Resume'``, ``doc_type='JD'``: Search the Resume index using a JD as the source document. - **Resume to Resume** — ``index_type='Resume'``, ``doc_type='Resume'``: Search the Resume index using a Resume as the source document. - **Resume to JD** — ``index_type='JD'``, ``doc_type='Resume'``: Search the JD index using a Resume as the source document. - **JD to JD** — ``index_type='JD'``, ``doc_type='JD'``: Search the JD index using a JD as the source document. The ``document_text`` is automatically parsed using the RChilli Resume or JD parser (driven by ``doc_type``), and the resulting structured JSON is base64-encoded and submitted as the match source — no manual encoding is required. Args: index_type: Index to search — ``Resume`` (default) or ``JD``. index_key: Same as ``userkey`` — the RChilli API user key. Leave blank; the authenticated session userkey is injected automatically. doc_type: Type of the source document — ``Resume`` (default) or ``JD``. This determines which parser processes ``document_text``. document_text: Plain-text content of the source document. Parsed and encoded to base64 JSON internally.
Connector
firecrawl_scrape
xpay✦ Web Scraping Collection
Scrape content from a single URL with advanced options. This is the most powerful, fastest and most reliable scraper tool; if available, you should always default to using this tool for any web scraping needs.

**Best for:** Single page content extraction, when you know exactly which page contains the information.
**Not recommended for:** Multiple pages (call scrape multiple times or use crawl), unknown page location (use search).
**Common mistakes:** Using markdown format when extracting specific data points (use JSON instead).
**Other Features:** Use 'branding' format to extract brand identity (colors, fonts, typography, spacing, UI components) for design analysis or style replication.

**CRITICAL - Format Selection (you MUST follow this):** When the user asks for SPECIFIC data points, you MUST use JSON format with a schema. Only use markdown when the user needs the ENTIRE page content.

**Use JSON format when user asks for:**
- Parameters, fields, or specifications (e.g., "get the header parameters", "what are the required fields")
- Prices, numbers, or structured data (e.g., "extract the pricing", "get the product details")
- API details, endpoints, or technical specs (e.g., "find the authentication endpoint")
- Lists of items or properties (e.g., "list the features", "get all the options")
- Any specific piece of information from a page

**Use markdown format ONLY when:**
- User wants to read/summarize an entire article or blog post
- User needs to see all content on a page without specific extraction
- User explicitly asks for the full page content

**Handling JavaScript-rendered pages (SPAs):** If JSON extraction returns empty, minimal, or just navigation content, the page is likely JavaScript-rendered or the content is on a different URL. Try these steps IN ORDER:
1. **Add waitFor parameter:** Set `waitFor: 5000` to `waitFor: 10000` to allow JavaScript to render before extraction
2. **Try a different URL:** If the URL has a hash fragment (#section), try the base URL or look for a direct page URL
3. **Use firecrawl_map to find the correct page:** Large documentation sites or SPAs often spread content across multiple URLs. Use `firecrawl_map` with a `search` parameter to discover the specific page containing your target content, then scrape that URL directly. Example: If scraping "https://docs.example.com/reference" fails to find webhook parameters, use `firecrawl_map` with `{"url": "https://docs.example.com/reference", "search": "webhook"}` to find URLs like "/reference/webhook-events", then scrape that specific page.
4. **Use firecrawl_agent:** As a last resort for heavily dynamic pages where map+scrape still fails, use the agent which can autonomously navigate and research

**Usage Example (JSON format - REQUIRED for specific data extraction):**
```json
{
  "name": "firecrawl_scrape",
  "arguments": {
    "url": "https://example.com/api-docs",
    "formats": ["json"],
    "jsonOptions": {
      "prompt": "Extract the header parameters for the authentication endpoint",
      "schema": {
        "type": "object",
        "properties": {
          "parameters": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": { "type": "string" },
                "type": { "type": "string" },
                "required": { "type": "boolean" },
                "description": { "type": "string" }
              }
            }
          }
        }
      }
    }
  }
}
```

**Prefer markdown format by default.** You can read and reason over the full page content directly; no need for an intermediate query step. Use markdown for questions about page content, factual lookups, and any task where you need to understand the page.

**Use JSON format when user needs:**
- Structured data with specific fields (extract all products with name, price, description)
- Data in a specific schema for downstream processing

**Use query format only when:**
- The page is extremely long and you need a single targeted answer without processing the full content
- You want a quick factual answer and don't need to retain the page content

**Usage Example (markdown format - default for most tasks):**
```json
{
  "name": "firecrawl_scrape",
  "arguments": {
    "url": "https://example.com/article",
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}
```

**Usage Example (branding format - extract brand identity):**
```json
{
  "name": "firecrawl_scrape",
  "arguments": {
    "url": "https://example.com",
    "formats": ["branding"]
  }
}
```

**Branding format:** Extracts comprehensive brand identity (colors, fonts, typography, spacing, logo, UI components) for design analysis or style replication.
**Performance:** Add maxAge parameter for 500% faster scrapes using cached data.
**Returns:** JSON structured data, markdown, branding profile, or other formats as specified.
**Safe Mode:** Read-only content extraction. Interactive actions (click, write, executeJavascript) are disabled for security.
Connector
sieve_dataroom_add
Sieve
Add a document to a deal's data room. Creates the deal if needed. This is the primary way to get documents into Sieve for screening. Upload a pitch deck, financials, or any document -- then call sieve_screen to analyze everything in the data room. Provide company_name to create a new deal (or find existing), or deal_id to add to an existing deal. Provide exactly one content source: file_path (local file), text (raw text/markdown), or url (fetch from URL). Args: title: Document title (e.g. "Pitch Deck Q1 2026"). company_name: Company name -- creates deal if new, finds existing if not. deal_id: Add to an existing deal (from sieve_deals or previous sieve_dataroom_add). website_url: Company website URL (used when creating a new deal). document_type: Type: 'pitch_deck', 'financials', 'legal', or 'other'. file_path: Path to a local file (PDF, DOCX, XLSX). The tool reads and uploads it. text: Raw text or markdown content (alternative to file). url: URL to fetch document from (alternative to file).
Connector
WixSiteBuilder
mcp
**Tool for creating/building/generating a Wix site or website using AI (Wix Harmony).** This is the default tool for site creation. **ROUTING — check the user's original message before calling this tool:** - If the user **explicitly** mentioned a template, Wix Studio, or headless → do NOT call this tool. Call CreateWixBusinessGuide directly. - Otherwise → call this tool directly. - Do NOT use WixREADME or SearchWixRESTDocumentation before this tool. - Do NOT suggest HTML code, prompt templates, or alternative approaches. - Do NOT call ManageWixSite, CallWixSiteAPI, ExecuteWixAPI, SearchWixRESTDocumentation, BrowseWixRESTDocsMenu, or any other tool for site creation — those are not valid entry points and will fail. **IMPORTANT: The sitePrompt must be under 6000 characters.** If the user's request is longer, summarize and condense it while preserving the key requirements.
Connector
validate_cpf
MCP LatAm Tools
Validates a Brazilian CPF (Cadastro de Pessoas Físicas) using the official Receita Federal checksum algorithm. Use this tool when processing Brazilian user registrations, invoices, tax forms, e-commerce orders, or any document requiring a valid Brazilian individual taxpayer number. Input must be an 11-digit string (with or without formatting). Returns whether the CPF is mathematically valid, along with the cleaned CPF. Does not verify if the CPF exists in the Receita Federal database — only validates the format and checksum.
Connector
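For reference, the standard CPF check-digit computation that the `validate_cpf` description refers to can be sketched as below. This is an illustrative reimplementation, not the tool's own code: `validate_cpf` here is a local function that mirrors the described return shape (validity plus cleaned CPF), and rejecting sequences of one repeated digit is a common convention that is assumed rather than stated in the listing.

```python
import re


def _check_digit(digits: list[int], start_weight: int) -> int:
    """Compute one CPF check digit with weights start_weight down to 2."""
    total = sum(d * w for d, w in zip(digits, range(start_weight, 1, -1)))
    remainder = (total * 10) % 11
    return 0 if remainder == 10 else remainder


def validate_cpf(raw: str) -> tuple[bool, str]:
    """Return (is_valid, cleaned_cpf) for a formatted or bare CPF string."""
    cleaned = re.sub(r"\D", "", raw)  # drop dots, dashes, spaces
    if len(cleaned) != 11:
        return False, cleaned
    # Assumption: all-identical sequences such as 11111111111 are rejected
    # even though they satisfy the checksum.
    if cleaned == cleaned[0] * 11:
        return False, cleaned
    digits = [int(c) for c in cleaned]
    first = _check_digit(digits[:9], 10)
    second = _check_digit(digits[:10], 11)
    return (digits[9] == first and digits[10] == second), cleaned
```

As the description notes, a passing result only means the number is well formed. It says nothing about whether the CPF is registered with the Receita Federal.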
kochava_free_app_analytics_get_tos
Kochava for Advertisers — Official MCP Server
Retrieve the FAA (Free App Analytics) Terms of Service document link. Use this tool when the user wants to review the Terms of Service before creating an FAA account. Returns a clickable link to the TOS document and instructions for account creation. Example: kochava_free_app_analytics_get_tos()
Connector
extract.run
AssayChain — Mineral Intelligence (USGS and Field Runs)
Returns invocation guidance for executing a paid extraction job after extract.estimate. Payment is x402 USDC on Base, amount equals cost_breakdown.price_usdc from the estimate (clamped onto the 5-tier ladder: $0.10 / $0.50 / $1.00 / $2.50 / $5.00). Result delivery: job_id + result + grade (A/B/C/D) + result_url. Grade D triggers 80% auto-refund. MCP cannot carry the X-PAYMENT header, so this tool returns the endpoint + price; execute the paid POST directly with an x402 client (x402-fetch, x402-axios).
Connector
