306,363 tools. Last updated 2026-07-26 19:03

"Crawling Websites to Extract Data" matching MCP tools:

document.extract_structured
api
Extract typed fields from document text using a caller-defined schema. Uses a quality AI model with retry logic. Use when you need specific data points from a document rather than full text. For invoices with known fields, document.parse_invoice (prebuilt schema) may be simpler. For general summarization, use document.summarize instead. Schema format: { "field_name": "type hint or description" } — e.g. { "contract_date": "ISO date", "party_a": "string", "penalty_usd": "number" }. Returns: { data: { <field>: value }, data_cited: { <field>: { value, confidence: "high"|"medium"|"low", citations: [{ quote, paragraphs[] }] } } } Example prompts: - "Extract the contract date, parties, and penalty amount from this agreement." - "Pull the vendor name, PO number, and total from this document." - "Get me all named fields from this form using my custom schema."
Connector
document.extract_structured
DocImprint
Extract typed fields from document text using a caller-defined schema. Uses a quality AI model with retry logic. Use when you need specific data points from a document rather than full text. For invoices with known fields, document.parse_invoice (prebuilt schema) may be simpler. For general summarization, use document.summarize instead. Schema format: { "field_name": "type hint or description" } — e.g. { "contract_date": "ISO date", "party_a": "string", "penalty_usd": "number" }. Returns: { data: { <field>: value }, data_cited: { <field>: { value, confidence: "high"|"medium"|"low", citations: [{ quote, paragraphs[] }] } } } Example prompts: - "Extract the contract date, parties, and penalty amount from this agreement." - "Pull the vendor name, PO number, and total from this document." - "Get me all named fields from this form using my custom schema."
Connector
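A minimal sketch of how a caller might consume the `data_cited` block described above, keeping only fields whose confidence is acceptable. The response shape is taken from the tool description; the helper name `fields_at_confidence` and the sample values are hypothetical and are not part of the tool itself.

```python
from typing import Any, Dict, Iterable


def fields_at_confidence(
    response: Dict[str, Any],
    accepted: Iterable[str] = ("high",),
) -> Dict[str, Any]:
    """Return {field: value} for fields whose confidence is in `accepted`.

    Assumes the response carries a `data_cited` mapping shaped like
    { field: { value, confidence, citations } }, as the description states.
    """
    accepted_set = set(accepted)
    cited = response.get("data_cited", {})
    return {
        name: entry.get("value")
        for name, entry in cited.items()
        if entry.get("confidence") in accepted_set
    }


# Hypothetical response, invented for illustration only.
sample_response = {
    "data": {"contract_date": "2024-01-15", "penalty_usd": 5000},
    "data_cited": {
        "contract_date": {
            "value": "2024-01-15",
            "confidence": "high",
            "citations": [{"quote": "dated 15 January 2024", "paragraphs": [1]}],
        },
        "penalty_usd": {
            "value": 5000,
            "confidence": "low",
            "citations": [],
        },
    },
}
```

Calling `fields_at_confidence(sample_response)` keeps only `contract_date`; passing `accepted=("high", "medium", "low")` keeps both fields.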
job.status
DocImprint
Poll the status of an async job (extract, indexing, batch). Free — no credits consumed. Use after collection.add_document or async extract to check when processing completes. Poll this endpoint in a loop until status is "complete" or "failed". Completed jobs include the bundle_id or result_json in the response. Jobs are created when you POST /v1/extract with a webhook, or when collection.add_document triggers async indexing. Returns: { id, type: "extract"|"extract_batch"|"index_collection", status: "queued"|"processing"|"complete"|"failed"|"cancelled", progress_pct: number (0–100), progress_message, bundle_id (when complete), result_json (when complete), error (when failed), created_at, completed_at } Example prompts: - "Check the status of my indexing job job_550e8400." - "Is my async extract job done yet?" - "Poll job [job_id] — what is the current progress?"
Connector
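The polling loop the description asks for could look roughly like the sketch below. The status fetcher is passed in as a callable, so nothing here assumes a particular client library; `poll_job`, `get_status`, the interval, and the poll limit are all hypothetical. Treating "cancelled" as a terminal state is also an assumption: the description only names "complete" and "failed" as stop conditions, but "cancelled" appears in the status enum.

```python
import time
from typing import Any, Callable, Dict

# "complete" and "failed" come from the description; "cancelled" is an
# assumption based on the listed status values.
TERMINAL_STATUSES = {"complete", "failed", "cancelled"}


def poll_job(
    get_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call get_status(job_id) until the job reaches a terminal status.

    `get_status` is a hypothetical wrapper around the job.status tool that
    returns the job object as a dict. Raises TimeoutError after `max_polls`
    attempts without a terminal status.
    """
    for _ in range(max_polls):
        job = get_status(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

Injecting `sleep` keeps the loop easy to test and lets a caller swap in a backoff strategy without changing the function.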
download_resource
Senzing
Download workflow resources by name. Pass `filename` (string) or `filenames` (array); calling with neither returns the list of available resources (it does not fail). Available: sz_json_analyzer.py, sz_schema_generator.py, sz_verbatim_check.py, sz_routing_report.py, senzing_entity_specification.md, senzing_mapping_examples.md, identifier_crosswalk.json. HTTP mode returns URLs; stdio mode returns `sz-mcp-coworker extract` commands. Supports batch via `filenames` array. Asset IDs are not stable across versions. If a previously-known ID fails to extract, call this tool again to obtain the current ID.
Connector
book_demo
gethal.ai
Submits a demo request. The prospect receives a confirmation email and must click the link in it before the request reaches a human at A Cloud Frontier. Use only when a real person has explicitly asked for a demo and provided their own working email address. Do NOT call this for testing, evaluation, or crawling purposes — automated and unconfirmable requests are rejected.
Connector
extract
Sofya
Fetch a webpage and extract specific information using AI. Use this when you need structured data from a page (e.g. pricing, specs, contact info) rather than the raw content. Costs 5 credits. If the page has no usable text (empty or JavaScript-rendered body), the model is NOT called: content comes back empty and usage.low_content is true, rather than a fabricated answer. Gate on usage.low_content (or usage.content_chars) to detect pages you cannot ground on. Returns: content (the extracted text), url, credits_used, credits_remaining, usage (input_tokens, output_tokens, content_chars, low_content). Args: url (the URL to extract from); prompt (what information to extract, e.g. "list all pricing tiers with features" or "extract the author name and publication date").
Connector
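The gating the description recommends can be sketched as a small check on the returned `usage` block. The field names (`content`, `usage.low_content`, `usage.content_chars`) come from the description; the helper name `grounded_content` and the `min_chars` threshold are hypothetical choices made for this example.

```python
from typing import Any, Dict, Optional


def grounded_content(result: Dict[str, Any], min_chars: int = 1) -> Optional[str]:
    """Return the extracted content, or None when the page was not groundable.

    A result is rejected when usage.low_content is true, or when
    usage.content_chars is present and below `min_chars` (an assumed,
    caller-chosen threshold).
    """
    usage = result.get("usage", {})
    if usage.get("low_content", False):
        return None
    chars = usage.get("content_chars")
    if chars is not None and chars < min_chars:
        return None
    return result.get("content")
```

A caller would treat `None` as "no usable page text" and fall back to another source instead of passing an empty answer downstream.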

Matching MCP Servers

Averra Extract MCP
Web Scraping, RAG Systems, Developer Tools
Swwyymm
License: A, Quality: A, Maintenance: D
MCP server for Averra Extract — lets AI agents like Claude, Cursor, and ChatGPT convert any webpage into clean, LLM-ready Markdown.
Last updated 2026-04-11
5
97
MIT
ocr-extract
Br0ski777
License: -, Quality: -, Maintenance: B
Enables OCR text extraction from images via URLs or base64, with pay-per-call using x402 micropayments (USDC on Base L2).
Last updated 2026-07-19

Matching MCP Connectors

extract
Web content extraction for AI agents. Pay per call with x402 (USDC on Base). No API key.
page-extract
URL to clean article markdown/text + metadata and links. Deterministic. $0.001/call via x402.

metadata
Fast.io
AI metadata templates & extraction (the unstructured-data automation pipeline): template CRUD/clone, assign/resolve, the AI pipeline (eligible -> preview-match -> suggest-fields -> template-create -> nodes-add/-list -> auto-match -> extract-all), saved views, and lexical metadata search. Call action='describe' for the full action/param reference. Node-level metadata (get/set/delete/extract on a single file) lives on the `storage` tool. Destructive: template-delete, view-delete, nodes-remove. AI/credit side-effects: preview-match, suggest-fields, auto-match, extract-all (each spends AI credits).
Connector
politics_methodology
officials
Get the full transparency methodology — how we collect, assess, extract, sign, and verify political data. Includes quality assessment algorithm (signals + weights), trust levels, data sources, cron schedule, and what we can and cannot prove. Open and auditable.
Connector
get_sample_data
Senzing
Get real sample data for entity resolution. Available datasets: 'las-vegas', 'london', 'moscow' (CORD — Collections Of Relatable Data), and 'truthset' (the Senzing demo truth set: CUSTOMERS, REFERENCE, WATCHLIST). Use dataset='list' to discover datasets, source='list' to see the sources/vendors within a dataset. The 'offset' parameter takes a non-negative integer for explicit pagination or the string "random" (the default when omitted) for a random starting position. IMPORTANT: This is REAL data (not synthetic) — historical snapshots for evaluation only, not operational use. Always inform the user of this. When records are returned, a 'download_url' in the citation provides a way to fetch the full dataset. In HTTP mode this is a URL the user (or an automation) can curl; in stdio mode it is a `sz-mcp-coworker extract` command the user runs locally to pull bytes from the embedded bundle. Always present the fetch instruction to the user. Do NOT download it yourself or dump raw records into the conversation — the inline records are a small preview of the data shape. Asset IDs are not stable across versions. If a previously-known ID fails to extract, call this tool again to obtain the current ID.
Connector
enrich_html
shopgraph
Extract product data from raw HTML you already have (no HTTP fetch needed). Ideal when using Bright Data, Firecrawl, or any scraping API — pipe the HTML through ShopGraph for structured product data. Uses schema.org + LLM fallback. Costs $0.02 per call (cached results are free). Each field carries verification metadata in _shopgraph: provenance (field_method — which source/tier produced it: schema_org, llm, or hybrid), freshness (field_freshness — recency + volatility_class, for volatile fields like price & availability), and abstain (a field is null when ShopGraph cannot verify it on the page). Rely on provenance, freshness, and the abstain signal to decide what to trust.
Connector
framefetch_extract
framefetch
Extract data from ONE public social-video URL (YouTube incl. Shorts, TikTok, Instagram Reels, Pinterest, Reddit): metadata/insights/transcript/frames/digest/comments/etc — see `fields`. When NOT to use: non-video pages, private/login-walled content, or bulk crawling (one URL per call). Returns one JSON object with only the requested fields + a `cost` block (micro-USD); shapes: https://framefetch.net/docs. Cost scales with what you request (frames/transcript cost more than metadata). No key? POST /v1/keys {email} -> instant key (~100 free calls); or x402 (USDC), no account. Example: {"url":"https://www.youtube.com/watch?v=...","fields":["metadata","transcript"]}.
Connector
extract_page_markdown
site-audit
FREE PREVIEW (1 request per day per IP, quota shared with audit_website_preview) of Santos Page-to-Markdown extraction. Fetches one public page and returns its main content as clean Markdown plus title, description, outbound links, and word count. Single page only — no crawling or JavaScript rendering. For unlimited extraction, use the machine-payable production endpoint: POST https://api.santosautomation.com/v1/extract with {"url": "…"} — $0.005 USDC per successful extraction on Base mainnet (eip155:8453) via x402 v2; no account or API key required.
Connector
document.parse_invoice
api
Parse a receipt or invoice document into structured fields. Uses a quality AI model for accuracy. Use when you need to extract line items, totals, and merchant info from financial documents. For general document text, use document.extract_text instead. Returns: { invoice: { merchant, date (YYYY-MM-DD), line_items[], subtotal, tax, total }, cited: { <field>: { value, confidence: "high"|"medium"|"low", citations: [{ quote, paragraphs[] }] } } } Example prompts: - "Parse this invoice and give me the line items and total." - "Extract the merchant, date, and amounts from this receipt." - "Read this scanned invoice and return structured data."
Connector
find_similar_sites
mcp
Find similar or competitor websites based on classification. Takes a URL, classifies it (or uses cached classification), and returns other websites from the same category and subcategory. Useful for competitive analysis and discovering related content. Rate limited to 1 request per minute per domain. Args: url (the website URL to find similar sites for); limit (maximum number of similar sites to return, 1-50, default 10). Returns a dictionary with: url (the input URL, normalized); classification (the URL's category and subcategory); similar_sites (list of similar URLs from the same category); total_in_category (total sites in this category/subcategory); cached (whether the classification was from cache).
Connector
get_metadata
microlink
Extract metadata from any URL to preview page content. Returns title, description, image, author, publisher, logo, and structured data—useful when you need to understand a webpage without visiting it directly.
Connector
extract_to_markdown
Frenchie
Extract structured documents (.docx, .xlsx, .csv, .tsv, .pptx) into Markdown through Frenchie. stdio mode auto-saves the result to .frenchie/<name>/result.md; HTTP mode returns inline Markdown.
Connector
request_scan
mailbox
Request document scanning (OCR + structured data extraction) for a package. The facility will scan the document and extract text, addresses, dates, and other structured data. Results are available via get_scan_results after processing.
Connector
list_user_sites
WebZum - The Hosting Layer for AI-Generated Web Content
List all websites created by the authenticated user. Returns an array of businessIds with names and URLs. Requires authentication via API key (Bearer token). Generate an API key at webzum.com/dashboard/account-settings.
Connector
site_directory
Lodi Kids Activities
Structured map of LKA's public URLs and content sections. Equivalent to llms.txt — gives an AI grounding agent the full topology of the site so it knows what's worth crawling/calling.
Connector
page_summary
Wikimedia Rest
"What is X" / "who is X" / "tell me about X" / "Wikipedia summary of X" / "biography of X" / "history of X" — fetches the Wikipedia article extract (title, description, thumbnail, lead-paragraph extract) for any topic, person, place, event, or concept. Use whenever an agent needs a quick encyclopedic reference. Defaults to en.wikipedia.org; pass project/lang for other Wikimedia projects or languages. Example: page_summary({ title: "Albert Einstein" }).
Connector