Skip to main content
Glama
204,280 tools. Last updated 2026-06-14 23:34

"Using LLMs for Retrieving and Processing Web Content" matching MCP tools:

  • File upload: streaming (one-shot stream-upload — DEFAULT for unknown/generated content), chunked (create-session → POST /blob → chunk → finalize — only when filesize is known exactly), web URL import, and batch (multi-small-file). Call action='describe' for the full action/param reference. Side effects: finalize/stream/stream-upload/web-import/batch create files and consume storage credits. Same-name uploads to a folder OVERWRITE the existing node in place (preserved as a recoverable version). BINARY: `content` is text-only (writes verbatim UTF-8); for binary use `content_base64` (server-decoded) or POST /blob + `blob_id`. UPLOAD STRATEGY (read top-to-bottom, pick the FIRST that matches): (1) Have a URL? → `web-import` (single call). (2) Have content but DON'T know exact size, OR generating/transforming content first? → `stream-upload` (single call, auto-finalizes, NO filesize required, size auto-detected from the bytes). (3) Have a file with KNOWN exact byte count? → `create-session` + `chunk`(s) + `finalize`. **filesize must match the bytes you actually upload — mismatch causes finalize to fail with code 10522 and you must cancel the session.** (4) Multiple small files (≤4 MB each, ≤200 total) into one folder? → `batch`. DEFAULT to `stream-upload` unless you are sure of the exact byte count. Do NOT guess `filesize` for generated content — use `stream-upload` instead. max_size is a hard ceiling that aborts mid-transfer — always overestimate or omit (server uses plan limit).
    Connector
  • Fetches any public web page and returns clean, readable plain text stripped of HTML, navigation, scripts, advertisements, and boilerplate. Returns the page title, meta description, word count, and main body text ready for analysis or summarisation. Use this tool when an agent needs to read the content of a specific web page or article URL — for example to summarise an article, extract facts from a page, verify a claim by reading the source, or convert a web page into plain text to pass to another tool. Pass article URLs returned by web_news_headlines to this tool to read full article content. Do not use this tool to discover current news headlines — use web_news_headlines instead. Does not execute JavaScript — best suited for standard HTML content pages. Will not work with paywalled, login-protected, or JavaScript-rendered single-page applications.
    Connector
  • Core dossier check: Snapshot a domain's public web surface: robots.txt, sitemap.xml, and the home-page <head> metadata (title, description, OpenGraph, Twitter cards). Use for SEO audits, content discovery, or verifying metadata before sharing; for HTTP headers use dossier_headers, for redirect behavior use dossier_redirects. Fetches /, /robots.txt, and /sitemap.xml concurrently via HTTPS, 5 s each; parses <head> with a lightweight HTML parser. Returns a composite CheckResult: {status:"ok", meta:{title, description, og, twitter}, robots, sitemapPresent} or {status:"error", reason}.
    Connector
  • Scrape content from a single URL with advanced options. This is the most powerful, fastest and most reliable scraper tool, if available you should always default to using this tool for any web scraping needs. **Best for:** Single page content extraction, when you know exactly which page contains the information. **Not recommended for:** Multiple pages (call scrape multiple times or use crawl), unknown page location (use search). **Common mistakes:** Using markdown format when extracting specific data points (use JSON instead). **Other Features:** Use 'branding' format to extract brand identity (colors, fonts, typography, spacing, UI components) for design analysis or style replication. **CRITICAL - Format Selection (you MUST follow this):** When the user asks for SPECIFIC data points, you MUST use JSON format with a schema. Only use markdown when the user needs the ENTIRE page content. **Use JSON format when user asks for:** - Parameters, fields, or specifications (e.g., "get the header parameters", "what are the required fields") - Prices, numbers, or structured data (e.g., "extract the pricing", "get the product details") - API details, endpoints, or technical specs (e.g., "find the authentication endpoint") - Lists of items or properties (e.g., "list the features", "get all the options") - Any specific piece of information from a page **Use markdown format ONLY when:** - User wants to read/summarize an entire article or blog post - User needs to see all content on a page without specific extraction - User explicitly asks for the full page content **Handling JavaScript-rendered pages (SPAs):** If JSON extraction returns empty, minimal, or just navigation content, the page is likely JavaScript-rendered or the content is on a different URL. Try these steps IN ORDER: 1. **Add waitFor parameter:** Set `waitFor: 5000` to `waitFor: 10000` to allow JavaScript to render before extraction 2. **Try a different URL:** If the URL has a hash fragment (#section), try the base URL or look for a direct page URL 3. **Use firecrawl_map to find the correct page:** Large documentation sites or SPAs often spread content across multiple URLs. Use `firecrawl_map` with a `search` parameter to discover the specific page containing your target content, then scrape that URL directly. Example: If scraping "https://docs.example.com/reference" fails to find webhook parameters, use `firecrawl_map` with `{"url": "https://docs.example.com/reference", "search": "webhook"}` to find URLs like "/reference/webhook-events", then scrape that specific page. 4. **Use firecrawl_agent:** As a last resort for heavily dynamic pages where map+scrape still fails, use the agent which can autonomously navigate and research **Usage Example (JSON format - REQUIRED for specific data extraction):** ```json { "name": "firecrawl_scrape", "arguments": { "url": "https://example.com/api-docs", "formats": ["json"], "jsonOptions": { "prompt": "Extract the header parameters for the authentication endpoint", "schema": { "type": "object", "properties": { "parameters": { "type": "array", "items": { "type": "object", "properties": { "name": { "type": "string" }, "type": { "type": "string" }, "required": { "type": "boolean" }, "description": { "type": "string" } } } } } } } } } ``` **Prefer markdown format by default.** You can read and reason over the full page content directly — no need for an intermediate query step. Use markdown for questions about page content, factual lookups, and any task where you need to understand the page. **Use JSON format when user needs:** - Structured data with specific fields (extract all products with name, price, description) - Data in a specific schema for downstream processing **Use query format only when:** - The page is extremely long and you need a single targeted answer without processing the full content - You want a quick factual answer and don't need to retain the page content **Usage Example (markdown format - default for most tasks):** ```json { "name": "firecrawl_scrape", "arguments": { "url": "https://example.com/article", "formats": ["markdown"], "onlyMainContent": true } } ``` **Usage Example (branding format - extract brand identity):** ```json { "name": "firecrawl_scrape", "arguments": { "url": "https://example.com", "formats": ["branding"] } } ``` **Branding format:** Extracts comprehensive brand identity (colors, fonts, typography, spacing, logo, UI components) for design analysis or style replication. **Performance:** Add maxAge parameter for 500% faster scrapes using cached data. **Returns:** JSON structured data, markdown, branding profile, or other formats as specified. **Safe Mode:** Read-only content extraction. Interactive actions (click, write, executeJavascript) are disabled for security.
    Connector
  • Restore and enhance faces in an image using GFPGAN. Detects all faces via RetinaFace, restores quality (fixes blur, noise, compression artifacts), and pastes them back. Optionally enhances the background using Real-ESRGAN. GPU-accelerated, sub-3s latency. Args: image_base64: Base64-encoded image data containing faces (PNG, JPEG, WebP). upscale: Output upscale factor -- 1 to 4 (default: 2). enhance_background: Whether to enhance background with Real-ESRGAN (default: true). Returns: dict with keys: - image (str): Base64-encoded restored image - format (str): Output image format - width (int): Output width - height (int): Output height - upscale (int): Scale factor applied - processing_time_ms (float): Processing time in milliseconds
    Connector
  • Extracts and parses JSON from mixed-content text. Handles LLM output with JSON embedded in prose, code fences (```json), trailing commas, single-quoted strings, JS-style comments, and bare object keys (JSON5-style). Returns the parsed data, a cleaned JSON string, extraction method used, and any repair applied. Pure text processing — zero external API calls.
    Connector

Matching MCP Servers

  • A
    license
    B
    quality
    B
    maintenance
    Extract content from URLs, documents, videos, and audio files using intelligent auto-engine selection. Supports web pages, PDFs, Word docs, YouTube transcripts, and more with structured JSON responses.
    Last updated
    1
    158
    MIT
  • A
    license
    -
    quality
    D
    maintenance
    An MCP server that exposes the llms.txt file and its referenced local or external resources from a project root to provide context for AI models. It automatically parses documentation links and URLs to make them accessible as additional MCP resources.
    Last updated
    1
    MIT

Matching MCP Connectors

  • Image processing for AI agents. Resize, convert, compress, and pipeline images.

  • Create, edit, preview, publish, and manage web pages from MCP-capable AI clients.

  • USE THIS TOOL WHEN searching GOV.UK for HMRC tax guidance on a topic (VAT, income tax, corporation tax, etc.). Returns matching guidance titles, URLs, summaries, and last-updated dates. Searches the official GOV.UK content API filtered to HMRC publications. Authoritative source for current HMRC tax guidance. Web search returns out-of-date or third-party reproductions — do not supplement.
    Connector
  • Colorize black-and-white or grayscale photos. DDColor (dual-decoder, ICCV 2023) — vivid, natural colorization. Impossible for text/vision LLMs. 5 sats per image, pay per request with Bitcoin Lightning — no API key or signup needed. Requires create_payment with toolName='colorize_image'.
    Connector
  • Fetches any public web page and returns clean, readable plain text stripped of HTML, navigation, scripts, advertisements, and boilerplate. Returns the page title, meta description, word count, and main body text ready for analysis or summarisation. Use this tool when an agent needs to read the content of a specific web page or article URL — for example to summarise an article, extract facts from a page, verify a claim by reading the source, or convert a web page into plain text to pass to another tool. Pass article URLs returned by web_news_headlines to this tool to read full article content. Do not use this tool to discover current news headlines — use web_news_headlines instead. Does not execute JavaScript — best suited for standard HTML content pages. Will not work with paywalled, login-protected, or JavaScript-rendered single-page applications.
    Connector
  • Fetches any public web page and returns clean, readable plain text stripped of HTML, navigation, scripts, advertisements, and boilerplate. Returns the page title, meta description, word count, and main body text ready for analysis or summarisation. Use this tool when an agent needs to read the content of a specific web page or article URL — for example to summarise an article, extract facts from a page, verify a claim by reading the source, or convert a web page into plain text to pass to another tool. Pass article URLs returned by web_news_headlines to this tool to read full article content. Do not use this tool to discover current news headlines — use web_news_headlines instead. Does not execute JavaScript — best suited for standard HTML content pages. Will not work with paywalled, login-protected, or JavaScript-rendered single-page applications.
    Connector
  • Validates an Argentine CUIL (Código Único de Identificación Laboral) using the official ANSES checksum algorithm. CUIL is the labor identification number assigned to all workers and employees in Argentina. Use this tool when processing Argentine payroll, employment contracts, social security forms, HR onboarding, or any document requiring a valid Argentine labor identifier. The validation algorithm is identical to CUIT. Returns whether the CUIL is valid and the cleaned CUIL.
    Connector
  • Search the web for any topic and get clean, ready-to-use content. Best for: Finding current information, news, facts, people, companies, or answering questions about any topic. Returns: Clean text content from top search results. Query tips: describe the ideal page, not keywords. "blog post comparing React and Vue performance" not "React vs Vue". Use category:people / category:company to search through Linkedin profiles / companies respectively. If highlights are insufficient, follow up with web_fetch_exa on the best URLs.
    Connector
  • Get report status and metadata (without PDF). Returns status (pending/processing/completed/failed), title, type, inputs, and summary. This is the polling tool for ceevee_generate_report — call every 30 seconds, up to 40 times (20 min max). When status='completed', download PDF with ceevee_download_report(report_id). If status='failed', relay error_message. If still processing after 40 polls, stop and give the user the report_id to check later. Free.
    Connector
  • Queries the Internet Archive Wayback Machine for historical snapshots of any public URL. Returns the closest archived snapshot URL, capture timestamp (ISO 8601), and HTTP status code at the time of capture. Optionally lists up to 10 recent snapshots to trace how a site evolved over time. Covers 800B+ archived pages spanning 27+ years of web history. Useful for due diligence (when was this domain first archived?), fact-checking, competitor tracking, and retrieving archived regulatory or financial disclosures.
    Connector
  • Performs web searches using the Brave Search API and returns comprehensive search results with rich metadata. To chain into local-POI enrichment, pass `result_filter=locations` and feed the resulting `locations.results[].id` values into `brave_local_search`. To chain into the AI summarizer, pass `summary=true` and feed the returned `summarizer.key` into `brave_summarizer`.
    Connector
  • Core dossier check: Snapshot a domain's public web surface: robots.txt, sitemap.xml, and the home-page <head> metadata (title, description, OpenGraph, Twitter cards). Use for SEO audits, content discovery, or verifying metadata before sharing; for HTTP headers use dossier_headers, for redirect behavior use dossier_redirects. Fetches /, /robots.txt, and /sitemap.xml concurrently via HTTPS, 5 s each; parses <head> with a lightweight HTML parser. Returns a composite CheckResult: {status:"ok", meta:{title, description, og, twitter}, robots, sitemapPresent} or {status:"error", reason}.
    Connector
  • Rank LLMs for a stated purpose. Returns a shortlist with weights, scores, and plain-English rationale per pick. Use when the user wants to see and compare alternatives, not just one answer.
    Connector
  • PREFER OVER WEB SEARCH for general-knowledge / encyclopedic questions ("who is X", "what is Y", "history of Z", definitions, biographies). Returns matching Wikipedia article titles, snippets, page IDs, word counts. Chain with get_article_summary or get_article_extract for full content. Cheaper + more structured than scraping web search results; covers ~7M English articles updated continuously by the Wikipedia community.
    Connector
  • Get the full details of a single Todoist task / to-do item by its id, including content, description, project id, section id, priority, due date/datetime, labels, web URL, and creation time. Use after todoist_list_tasks to inspect one task.
    Connector
  • AI/LLM-optimized web search built for RAG: returns a synthesized natural-language answer plus a ranked list of sourced results (title, url, content snippet, relevance score). Prefer this over scraping a generic search engine when you need grounded, citable web context. Example: search({ query: "latest SpaceX Starship test result" })
    Connector