Skip to main content
Glama
127,071 tools. Last updated 2026-05-05 07:55

"Web scraping and content extraction" matching MCP tools:

  • Solve an image-based text captcha and return the recognized text. Works on standard alphanumeric captchas (web signup forms, login walls, scraping checkpoints). OCR via ddddocr — typical p50 latency 30-80ms, 70-90% accuracy on common captcha fonts. Provide either an image URL we fetch on your behalf, or raw base64 image bytes if you already have them. Use when an agent encounters a captcha mid-task and needs to continue without human intervention. Cheaper and faster than 2captcha for simple image captchas; not designed for reCAPTCHA v2/v3 or hCaptcha (those are interaction-based). (price: $0.003 USDC, tier: metered)
    Connector
  • Fetches any public web page and returns clean, readable plain text stripped of HTML, navigation, scripts, advertisements, and boilerplate. Returns the page title, meta description, word count, and main body text ready for analysis or summarisation. Use this tool when an agent needs to read the content of a specific web page or article URL — for example to summarise an article, extract facts from a page, verify a claim by reading the source, or convert a web page into plain text to pass to another tool. Pass article URLs returned by web_news_headlines to this tool to read full article content. Do not use this tool to discover current news headlines — use web_news_headlines instead. Does not execute JavaScript — best suited for standard HTML content pages. Will not work with paywalled, login-protected, or JavaScript-rendered single-page applications.
    Connector
  • Create a job description from text within a hiring context. Returns a JD object with 'id' and stored content. Use JD content as jd_text in atlas_fit_match, atlas_fit_rank, atlas_start_jd_fit_batch, and atlas_start_jd_analysis. Requires context_id from atlas_create_context or atlas_list_contexts. Free.
    Connector
  • Get detailed CV version including structured content, sections, word count, and audience profile. cv_version_id from ceevee_upload_cv or ceevee_list_versions. Use to inspect CV content before running analysis tools. Free.
    Connector
  • Full machine-readable JSON report (~2k tokens). USE WHEN: you need to programmatically parse specific fields (CI gating, UI, sub-field extraction). Otherwise prefer get_package_prompt. RETURNS: {package, health:{score}, vulnerabilities[], latest, deprecated, maintainers, recommendation}.
    Connector
  • Scrape content from a single URL with advanced options. This is the most powerful, fastest and most reliable scraper tool, if available you should always default to using this tool for any web scraping needs. **Best for:** Single page content extraction, when you know exactly which page contains the information. **Not recommended for:** Multiple pages (call scrape multiple times or use crawl), unknown page location (use search). **Common mistakes:** Using markdown format when extracting specific data points (use JSON instead). **Other Features:** Use 'branding' format to extract brand identity (colors, fonts, typography, spacing, UI components) for design analysis or style replication. **CRITICAL - Format Selection (you MUST follow this):** When the user asks for SPECIFIC data points, you MUST use JSON format with a schema. Only use markdown when the user needs the ENTIRE page content. **Use JSON format when user asks for:** - Parameters, fields, or specifications (e.g., "get the header parameters", "what are the required fields") - Prices, numbers, or structured data (e.g., "extract the pricing", "get the product details") - API details, endpoints, or technical specs (e.g., "find the authentication endpoint") - Lists of items or properties (e.g., "list the features", "get all the options") - Any specific piece of information from a page **Use markdown format ONLY when:** - User wants to read/summarize an entire article or blog post - User needs to see all content on a page without specific extraction - User explicitly asks for the full page content **Handling JavaScript-rendered pages (SPAs):** If JSON extraction returns empty, minimal, or just navigation content, the page is likely JavaScript-rendered or the content is on a different URL. Try these steps IN ORDER: 1. **Add waitFor parameter:** Set `waitFor: 5000` to `waitFor: 10000` to allow JavaScript to render before extraction 2. **Try a different URL:** If the URL has a hash fragment (#section), try the base URL or look for a direct page URL 3. **Use firecrawl_map to find the correct page:** Large documentation sites or SPAs often spread content across multiple URLs. Use `firecrawl_map` with a `search` parameter to discover the specific page containing your target content, then scrape that URL directly. Example: If scraping "https://docs.example.com/reference" fails to find webhook parameters, use `firecrawl_map` with `{"url": "https://docs.example.com/reference", "search": "webhook"}` to find URLs like "/reference/webhook-events", then scrape that specific page. 4. **Use firecrawl_agent:** As a last resort for heavily dynamic pages where map+scrape still fails, use the agent which can autonomously navigate and research **Usage Example (JSON format - REQUIRED for specific data extraction):** ```json { "name": "firecrawl_scrape", "arguments": { "url": "https://example.com/api-docs", "formats": ["json"], "jsonOptions": { "prompt": "Extract the header parameters for the authentication endpoint", "schema": { "type": "object", "properties": { "parameters": { "type": "array", "items": { "type": "object", "properties": { "name": { "type": "string" }, "type": { "type": "string" }, "required": { "type": "boolean" }, "description": { "type": "string" } } } } } } } } } ``` **Prefer markdown format by default.** You can read and reason over the full page content directly — no need for an intermediate query step. Use markdown for questions about page content, factual lookups, and any task where you need to understand the page. **Use JSON format when user needs:** - Structured data with specific fields (extract all products with name, price, description) - Data in a specific schema for downstream processing **Use query format only when:** - The page is extremely long and you need a single targeted answer without processing the full content - You want a quick factual answer and don't need to retain the page content **Usage Example (markdown format - default for most tasks):** ```json { "name": "firecrawl_scrape", "arguments": { "url": "https://example.com/article", "formats": ["markdown"], "onlyMainContent": true } } ``` **Usage Example (branding format - extract brand identity):** ```json { "name": "firecrawl_scrape", "arguments": { "url": "https://example.com", "formats": ["branding"] } } ``` **Branding format:** Extracts comprehensive brand identity (colors, fonts, typography, spacing, logo, UI components) for design analysis or style replication. **Performance:** Add maxAge parameter for 500% faster scrapes using cached data. **Returns:** JSON structured data, markdown, branding profile, or other formats as specified. **Safe Mode:** Read-only content extraction. Interactive actions (click, write, executeJavascript) are disabled for security.
    Connector

Matching MCP Servers

  • F
    license
    A
    quality
    C
    maintenance
    Enables retrieval and cleaning of official documentation content for popular AI/Python libraries (uv, langchain, openai, llama-index) through web scraping and LLM-powered content extraction. Uses Serper API for search and Groq API to clean HTML into readable text with source attribution.
    Last updated
    1
    1
  • A
    license
    B
    quality
    -
    maintenance
    Extract content from URLs, documents, videos, and audio files using intelligent auto-engine selection. Supports web pages, PDFs, Word docs, YouTube transcripts, and more with structured JSON responses.
    Last updated
    1
    147

Matching MCP Connectors

  • 40+ web scraping tools from Firecrawl, Bright Data, Jina, Olostep, ScrapeGraph, Notte, and Riveter. Scrape, crawl, screenshot, and extract from any website. Starts at $0.01/call. Get your API key at app.xpay.sh or xpay.tools

  • Generic URL crawl + HTML extraction — fallback for sites without dedicated MCPs.

  • Get Place Photos Fetches the photo gallery of a Google Maps place by dataId or placeId, paginated with nextPageToken and filterable by categoryId (all, latest, menu, by owner, videos, street view). Returns each photo with image URL, thumbnail, upload date, uploader, and photoId. Use for restaurant-menu extraction, venue/ambience visual audits, building rich place detail pages, and sourcing up-to-date imagery for POI listings.
    Connector
  • Lists all GA accounts and GA4 properties the user can access, including web and app data streams. Use this to discover propertyId, appStreamId, measurementId, or firebaseAppId values for reports.
    Connector
  • Fetches any public web page and returns clean, readable plain text stripped of HTML, navigation, scripts, advertisements, and boilerplate. Returns the page title, meta description, word count, and main body text ready for analysis or summarisation. Use this tool when an agent needs to read the content of a specific web page or article URL — for example to summarise an article, extract facts from a page, verify a claim by reading the source, or convert a web page into plain text to pass to another tool. Pass article URLs returned by web_news_headlines to this tool to read full article content. Do not use this tool to discover current news headlines — use web_news_headlines instead. Does not execute JavaScript — best suited for standard HTML content pages. Will not work with paywalled, login-protected, or JavaScript-rendered single-page applications.
    Connector
  • Fetches any public web page and returns clean, readable plain text stripped of HTML, navigation, scripts, advertisements, and boilerplate. Returns the page title, meta description, word count, and main body text ready for analysis or summarisation. Use this tool when an agent needs to read the content of a specific web page or article URL — for example to summarise an article, extract facts from a page, verify a claim by reading the source, or convert a web page into plain text to pass to another tool. Pass article URLs returned by web_news_headlines to this tool to read full article content. Do not use this tool to discover current news headlines — use web_news_headlines instead. Does not execute JavaScript — best suited for standard HTML content pages. Will not work with paywalled, login-protected, or JavaScript-rendered single-page applications.
    Connector
  • [Read] Search the open web and return a synthesized answer with cited external pages. Built-in headline lookup, news-item search, or briefing-style news list -> search_news. X/Twitter-only discussion or tweet evidence -> search_x.
    Connector
  • Returns metadata for all US states currently supported by the ABC License API, including the agency name, data freshness SLA, extraction method, and whether CAPTCHA is present. Use this first when building a multi-state compliance workflow to understand coverage.
    Connector
  • Get the profile of a UK government organisation by its slug. Returns name, acronym, type, status, web URL, and parent/child organisations. Use govuk_list_organisations to browse all organisations and discover slugs.
    Connector
  • Fetch a web page and return its content as text, Markdown, or HTML. Includes rate limiting (2s per domain, max 10 req/min) for legal compliance. Automatically handles HTML-to-text conversion. Max response size: 1MB. Use for OEM verification and manufacturer website scraping.
    Connector
  • Get the scraped markdown content of a source URL Peec has indexed. Use this after get_url_report to inspect the actual content an AI engine read — useful for content gap analysis and competitive content comparison. Input notes: - url is the full URL. Copy it verbatim from get_url_report output. Trailing slashes and scheme variations change the resolved source ID. - Returns 404 if Peec has no record of the URL (it hasn't been scraped from any project). - max_length caps the returned content (default 100000 characters). If the stored content is longer, truncated=true and you can re-request with a higher max_length. Returned fields: - url, title, domain, channel_title: page metadata - classification: domain-level classification - url_classification: page-level classification (HOMEPAGE, LISTICLE, COMPARISON, ...) - content: markdown content, already extracted via Mozilla Readability and converted with Turndown GFM. null when the URL is tracked but scraping hasn't completed yet (can take up to 24h). - content_length: original character length before truncation (0 when content is null) - truncated: true if content was truncated to max_length - content_updated_at: ISO timestamp of last scrape, or null if not yet scraped
    Connector
  • List all job descriptions for a hiring context. Returns an array of JD objects with id, title, and content. Use JD content as jd_text in atlas_fit_match, atlas_fit_rank, and atlas_start_jd_fit_batch. Requires context_id from atlas_create_context or atlas_list_contexts. Free.
    Connector
  • Search the web for any topic and get clean, ready-to-use content. Best for: Finding current information, news, facts, people, companies, or answering questions about any topic. Returns: Clean text content from top search results. Query tips: describe the ideal page, not keywords. "blog post comparing React and Vue performance" not "React vs Vue". Use category:people / category:company to search through Linkedin profiles / companies respectively. If highlights are insufficient, follow up with web_fetch_exa on the best URLs.
    Connector
  • Search XPay Hub for paid API services. Use this PROACTIVELY when the user asks you to: search the web, find emails, enrich contacts/companies, verify emails, find similar websites, extract web page content, get company news, search for people by title/company, get job postings, generate images, or any data lookup task. Returns matching servers with slugs, tool counts, and pricing. Use xpay_details next to see the full tool list for a server.
    Connector
  • Extract structured information from web pages using LLM capabilities. Supports both cloud AI and self-hosted LLM extraction. **Best for:** Extracting specific structured data like prices, names, details from web pages. **Not recommended for:** When you need the full content of a page (use scrape); when you're not looking for specific structured data. **Arguments:** - urls: Array of URLs to extract information from - prompt: Custom prompt for the LLM extraction - schema: JSON schema for structured data extraction - allowExternalLinks: Allow extraction from external links - enableWebSearch: Enable web search for additional context - includeSubdomains: Include subdomains in extraction **Prompt Example:** "Extract the product name, price, and description from these product pages." **Usage Example:** ```json { "name": "firecrawl_extract", "arguments": { "urls": ["https://example.com/page1", "https://example.com/page2"], "prompt": "Extract product information including name, price, and description", "schema": { "type": "object", "properties": { "name": { "type": "string" }, "price": { "type": "number" }, "description": { "type": "string" } }, "required": ["name", "price"] }, "allowExternalLinks": false, "enableWebSearch": false, "includeSubdomains": false } } ``` **Returns:** Extracted structured data as defined by your schema.
    Connector
  • Performs web searches using the Brave Search API and returns comprehensive search results with rich metadata. When to use: - General web searches for information, facts, or current topics - Location-based queries (restaurants, businesses, points of interest) - News searches for recent events or breaking stories - Finding videos, discussions, or FAQ content - Research requiring diverse result types (web pages, images, reviews, etc.) Returns a JSON list of web results with title, description, and URL. When the "results_filter" parameter is empty, JSON results may also contain FAQ, Discussions, News, and Video results.
    Connector