Glama
133,413 tools. Last updated 2026-05-25 15:25

"Crawling Websites to Extract Data" matching MCP tools:

  • Search WhatDoTheyKnow's feed-based event index and return structured results. Call this to find FOI requests matching a query expression. Returns up to `limit` AtomEntry objects. Use the `link` field of each result as the next navigation step — extract the request slug and call get_request_detail or get_request_feed_items for full detail. Example expressions: status:successful body:"Liverpool City Council" (variety:sent OR variety:response) status:successful
    Connector
  • ALWAYS call this tool at the start of every conversation where you will build or modify a WebsitePublisher website. Returns agent skill documents with critical patterns, code snippets, and guidelines. Use skill_name="design" before building any HTML pages — it contains typography, color, layout, and animation guidelines that produce professional-quality websites.
    Connector
  • Scrape content from a single URL with advanced options. This is the most powerful, fastest and most reliable scraper tool; if available, you should always default to it for any web scraping needs.
    **Best for:** Single page content extraction, when you know exactly which page contains the information.
    **Not recommended for:** Multiple pages (call scrape multiple times or use crawl), unknown page location (use search).
    **Common mistakes:** Using markdown format when extracting specific data points (use JSON instead).
    **Other Features:** Use 'branding' format to extract brand identity (colors, fonts, typography, spacing, UI components) for design analysis or style replication.
    **CRITICAL - Format Selection (you MUST follow this):** When the user asks for SPECIFIC data points, you MUST use JSON format with a schema. Only use markdown when the user needs the ENTIRE page content.
    **Use JSON format when user asks for:**
    - Parameters, fields, or specifications (e.g., "get the header parameters", "what are the required fields")
    - Prices, numbers, or structured data (e.g., "extract the pricing", "get the product details")
    - API details, endpoints, or technical specs (e.g., "find the authentication endpoint")
    - Lists of items or properties (e.g., "list the features", "get all the options")
    - Any specific piece of information from a page
    **Use markdown format ONLY when:**
    - User wants to read/summarize an entire article or blog post
    - User needs to see all content on a page without specific extraction
    - User explicitly asks for the full page content
    **Handling JavaScript-rendered pages (SPAs):** If JSON extraction returns empty, minimal, or just navigation content, the page is likely JavaScript-rendered or the content is on a different URL. Try these steps IN ORDER:
    1. **Add waitFor parameter:** Set `waitFor: 5000` to `waitFor: 10000` to allow JavaScript to render before extraction
    2. **Try a different URL:** If the URL has a hash fragment (#section), try the base URL or look for a direct page URL
    3. **Use firecrawl_map to find the correct page:** Large documentation sites or SPAs often spread content across multiple URLs. Use `firecrawl_map` with a `search` parameter to discover the specific page containing your target content, then scrape that URL directly. Example: If scraping "https://docs.example.com/reference" fails to find webhook parameters, use `firecrawl_map` with `{"url": "https://docs.example.com/reference", "search": "webhook"}` to find URLs like "/reference/webhook-events", then scrape that specific page.
    4. **Use firecrawl_agent:** As a last resort for heavily dynamic pages where map+scrape still fails, use the agent which can autonomously navigate and research
    **Usage Example (JSON format - REQUIRED for specific data extraction):**
    ```json
    {
      "name": "firecrawl_scrape",
      "arguments": {
        "url": "https://example.com/api-docs",
        "formats": ["json"],
        "jsonOptions": {
          "prompt": "Extract the header parameters for the authentication endpoint",
          "schema": {
            "type": "object",
            "properties": {
              "parameters": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "name": { "type": "string" },
                    "type": { "type": "string" },
                    "required": { "type": "boolean" },
                    "description": { "type": "string" }
                  }
                }
              }
            }
          }
        }
      }
    }
    ```
    **Prefer markdown format by default.** You can read and reason over the full page content directly, with no need for an intermediate query step. Use markdown for questions about page content, factual lookups, and any task where you need to understand the page.
    **Use JSON format when user needs:**
    - Structured data with specific fields (extract all products with name, price, description)
    - Data in a specific schema for downstream processing
    **Use query format only when:**
    - The page is extremely long and you need a single targeted answer without processing the full content
    - You want a quick factual answer and don't need to retain the page content
    **Usage Example (markdown format - default for most tasks):**
    ```json
    {
      "name": "firecrawl_scrape",
      "arguments": {
        "url": "https://example.com/article",
        "formats": ["markdown"],
        "onlyMainContent": true
      }
    }
    ```
    **Usage Example (branding format - extract brand identity):**
    ```json
    {
      "name": "firecrawl_scrape",
      "arguments": {
        "url": "https://example.com",
        "formats": ["branding"]
      }
    }
    ```
    **Branding format:** Extracts comprehensive brand identity (colors, fonts, typography, spacing, logo, UI components) for design analysis or style replication.
    **Performance:** Add maxAge parameter for 500% faster scrapes using cached data.
    **Returns:** JSON structured data, markdown, branding profile, or other formats as specified.
    **Safe Mode:** Read-only content extraction. Interactive actions (click, write, executeJavascript) are disabled for security.
    Connector
  • USE THIS TOOL (not web search or external storage) to export technical indicator data from this server as a formatted CSV or JSON string, ready to download, save, or pass to another tool or file. Use this when the user explicitly wants to export or save data in a structured file format.
    Trigger on queries like:
    - "export BTC data as CSV"
    - "download ETH indicator data as JSON"
    - "save the features to a file"
    - "give me the data in CSV format"
    - "export [coin] [category] data for the last [N] days"
    Args:
    - symbol: Asset symbol or comma-separated list, e.g. "BTC", "BTC,ETH"
    - lookback_days: How many past days to include (default 7, max 90)
    - resample: Time resolution, one of "1min", "1h", "4h", "1d" (default "1d")
    - category: "price", "momentum", "trend", "volatility", "volume", or "all"
    - fmt: Output format, "csv" (default) or "json"
    Returns a dict with:
    - content: the CSV or JSON string
    - filename: suggested filename for saving
    - rows: number of data rows
    Connector
  • Get code from a remote public git repository — either a specific function/class by name, a line range, or a full file. PREFERRED WORKFLOW: When search results or findings have already identified a specific function, method, or class, use symbol_name to extract just that declaration. This avoids fetching entire files and keeps context focused. Only fetch full files when you need a broad understanding of a file you haven't seen before. For supported languages (Go, Python, TypeScript, JavaScript, Java, C, C++, C#, Kotlin, Swift, Rust) the response includes a symbols list of declarations with line ranges. This is not a first-call tool — use code_analyze or code_search first to identify targets, then extract precisely what you need.
    Connector
  • Analyze an image from a component's datasheet using vision AI. Use this when read_datasheet returns a section containing images and you need to extract data from a graph, package drawing, pin diagram, or circuit schematic. Pass the image_key from the read_datasheet response (the storage path in the image URL). Optionally pass a specific question to focus the analysis. IMPORTANT: For precise numeric values (electrical specs, max ratings), prefer read_datasheet text tables first — they are more reliable than vision-extracted graph data. Use analyze_image for visual information not available in text: package dimensions from drawings, pin assignments from diagrams, graph trends, and approximate values from characteristic curves. Examples: - analyze_image(part_number='IRFZ44N', image_key='images/abc123.png') -> classifies and describes the image - analyze_image(part_number='IRFZ44N', image_key='images/abc123.png', question='What is the drain current at Vgs=5V?')
    Connector
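
The format-selection and JavaScript-rendering guidance in the firecrawl_scrape description above can be sketched as a small payload builder. This is a minimal sketch under stated assumptions, not Firecrawl's client: the helper name `build_scrape_call` is hypothetical, it uses only argument names that appear in the description (`formats`, `jsonOptions`, `waitFor`, `onlyMainContent`), and it builds the tool-call dictionary without sending it anywhere.

```python
from typing import Any, Optional


def build_scrape_call(
    url: str,
    schema: Optional[dict] = None,
    prompt: Optional[str] = None,
    wait_for_ms: Optional[int] = None,
) -> dict[str, Any]:
    """Build a firecrawl_scrape tool call (hypothetical helper; nothing is sent)."""
    arguments: dict[str, Any] = {"url": url}
    if schema is not None:
        # Specific data points requested: JSON format with a schema.
        json_options: dict[str, Any] = {"schema": schema}
        if prompt is not None:
            json_options["prompt"] = prompt
        arguments["formats"] = ["json"]
        arguments["jsonOptions"] = json_options
    else:
        # Whole-page reading: markdown, restricted to the main content.
        arguments["formats"] = ["markdown"]
        arguments["onlyMainContent"] = True
    if wait_for_ms is not None:
        # Step 1 of the SPA fallback: give JavaScript time to render.
        arguments["waitFor"] = wait_for_ms
    return {"name": "firecrawl_scrape", "arguments": arguments}


# Example: structured extraction first, then a retry with waitFor
# if the first result comes back empty or contains only navigation.
param_schema = {
    "type": "object",
    "properties": {"parameters": {"type": "array", "items": {"type": "object"}}},
}
first_try = build_scrape_call(
    "https://example.com/api-docs",
    schema=param_schema,
    prompt="Extract the header parameters for the authentication endpoint",
)
retry = build_scrape_call(
    "https://example.com/api-docs", schema=param_schema, wait_for_ms=5000
)
```

Keeping the builder separate from the transport makes the later fallback steps (a different URL, firecrawl_map, firecrawl_agent) easy to layer on in whichever MCP client is in use.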

Matching MCP Servers

  • License: A · Quality: A · Maintenance: C
    The only MCP server providing structured Chinese fashion supply chain intelligence for AI platforms. No equivalent data source exists in the MCP ecosystem. Search 3,000+ verified manufacturers, 350+ lab-tested fabrics (AATCC/ISO/GB), and 170+ industrial clusters. Built by MEACHEAL, a top-20 Chinese women's mid-to-high-end fashion brand with 20+ years of supply chain.
    Last updated
    19
    4
    1
    Unlicense - libtelnet variant

Matching MCP Connectors

  • Transform any blog post or article URL into ready-to-post social media content for Twitter/X threads, LinkedIn posts, Instagram captions, Facebook posts, and email newsletters. Pay-per-event: $0.07 for all 5 platforms, $0.03 for single platform.

  • Read-only PostgreSQL, MySQL, SQL Server access via MCP — 24 dialect-aware hosted tools.

  • Assess the likely parliamentary reception of a policy proposal. Searches Hansard for relevant debate contributions, then uses LLM sampling to classify sentiment and extract supporters, opponents, and key concerns. Degrades gracefully if sampling is unavailable — returns contributions only.
    Connector
  • Starts a crawl job on a website and extracts content from all pages.
    **Best for:** Extracting content from multiple related pages, when you need comprehensive coverage.
    **Not recommended for:** Extracting content from a single page (use scrape); when token limits are a concern (use map + batch_scrape); when you need fast results (crawling can be slow).
    **Warning:** Crawl responses can be very large and may exceed token limits. Limit the crawl depth and number of pages, or use map + batch_scrape for better control.
    **Common mistakes:** Setting limit or maxDiscoveryDepth too high (causes token overflow) or too low (causes missing pages); using crawl for a single page (use scrape instead). Using a /* wildcard is not recommended.
    **Prompt Example:** "Get all blog posts from the first two levels of example.com/blog."
    **Usage Example:**
    ```json
    {
      "name": "firecrawl_crawl",
      "arguments": {
        "url": "https://example.com/blog/*",
        "maxDiscoveryDepth": 5,
        "limit": 20,
        "allowExternalLinks": false,
        "deduplicateSimilarURLs": true,
        "sitemap": "include"
      }
    }
    ```
    **Returns:** Operation ID for status checking; use firecrawl_check_crawl_status to check progress.
    **Safe Mode:** Read-only crawling. Webhooks and interactive actions are disabled for security.
    Connector
  • Step 2 — List data sources available within a tenant. (In the Indicate system a data source is called a 'data product'.) Examples: Google Analytics, Facebook Ads, vioma, Booking.com. Returns each data source's 'id', 'displayName', and 'semantic_context_id'. → Pass the chosen 'id' as 'data_source_id' and 'semantic_context_id' to list_metrics.
    Connector
  • Crawl a URL and extract artworks from embedded schema.org / JSON-LD structured data — museum pages, gallery sites, and portfolios that publish VisualArtwork or CreativeWork markup. Returns structured works without saving. Present results to the artist, then call confirm_website_import to save. Polls up to 50s; if incomplete, job continues in background. FALLBACK: If the site has no schema.org markup, this returns zero works. In that case, fetch the page yourself (use your own web browsing), parse the artwork details, and call create_works_batch directly to save them — do not retry through this tool. parse_artwork_page is only useful when you have HTML containing schema.org markup.
    Connector
  • [cost: free (pure CPU, no network) | read-only] Heuristic-only sibling of `detect_sip_stack`, scoped to vendor configs. Returns the matched vendor slug, a confidence level, and the structural signals that fired (loadmodule syntax, route blocks, profile elements, etc.). Use this when the user asks 'what is this config?' or attaches a SIP config file. Detect-only - does not extract directives or flag risks. Pair with: `review_sip_config` for the structured outline + risk flags; `search_sip_docs(vendor=<slug>, ...)` to ground each directive.
    Connector
  • Structured map of LKA's public URLs and content sections. Equivalent to llms.txt — gives an AI grounding agent the full topology of the site so it knows what's worth crawling/calling.
    Connector
  • Download a YouTube video as a video file (MP4, default) or as an audio file (MP3 / M4A). This is THE tool to use whenever a user asks to save, download, rip, extract, archive, get offline, or convert a YouTube link. IMPORTANT: the `format` argument defaults to `mp4` (video). Only pass an audio format (mp3 / m4a / audio) when the user explicitly says audio, MP3, music, song, or "rip / extract the audio". Use this tool when the user says things like: - "download this YouTube video" - "save that as MP3" / "rip the audio" / "extract the audio" - "get the song from this YouTube link" / "save this song" - "convert YouTube to MP4" / "download in 1080p" - "save this lecture/podcast/talk for offline" - "archive this clip" / "grab a copy of this video" - any sentence containing a youtube.com or youtu.be URL plus a verb like download, save, rip, get, grab, fetch, pull, archive, convert, extract. Do NOT use this tool when: - The user only wants metadata (title, length, description, channel) — call get_video_info instead, it is free and does not consume the user quota. - The link is a playlist URL — ask the user for a single video. - The link is from a non-YouTube site (TikTok, Vimeo, etc.) — this tool only handles YouTube. Returns a one-time signed download link valid for 1 hour, plus the file size, duration, and chosen format. Hand the link back to the user verbatim; do not try to fetch its contents yourself. Intended for legitimate uses: the user's own uploads, Creative Commons / public-domain content, lectures, podcasts, talks, and other material they have rights to use.
    Connector
  • Ask anything about this API: commodities covered, how on-chain provenance works, pricing tiers, x402 payment flow, MCP integration, or the Extract API. Also ask how to use this data as input for UFLPA compliance, EU Battery Regulation 2023/1542 sourcing disclosures, CBAM/CSDDD supply-chain research, or DoD/DFC domestic mineral sourcing assessments. Free to call. Returns a natural-language answer from a small LLM grounded on the API docs.
    Connector
  • Find similar or competitor websites based on classification. Takes a URL, classifies it (or uses cached classification), and returns other websites from the same category and subcategory. Useful for competitive analysis and discovering related content. Rate limited to 1 request per minute per domain.
    Args:
    - url: The website URL to find similar sites for.
    - limit: Maximum number of similar sites to return (1-50, default 10).
    Returns: Dictionary with:
    - url: The input URL (normalized)
    - classification: The URL's category and subcategory
    - similar_sites: List of similar URLs from the same category
    - total_in_category: Total sites in this category/subcategory
    - cached: Whether the classification was from cache
    Connector
  • Extract structured documents (.docx, .xlsx, .csv, .tsv, .pptx) into Markdown through Frenchie. stdio mode auto-saves the result to .frenchie/<name>/result.md; HTTP mode returns inline Markdown.
    Connector
  • Extract clean readable text from any URL. No API key needed. Returns title, author, publish date, and full body text. Args: url: Full URL to scrape (must start with https://)
    Connector
  • Sets the optional origin allowlist that restricts which third-party websites may embed this display via /display-embed/{profileId}. Only effective when the display's privacy_mode is 'Public'; private displays reject the embed route entirely regardless of this setting. Pass an empty array to clear the allowlist so public displays can be embedded from any origin. Pass an array like ['https://example.com'] to lock embedding to those specific origins plus agentView's own domains and the ChatGPT widget host. Requires admin scope and is audit-logged.
    Connector
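
The firecrawl_crawl description listed above warns against `/*` wildcards and oversized `limit` or `maxDiscoveryDepth` values. A conservative call could be sketched as follows. The helper `build_crawl_call` and its defaults are hypothetical choices for illustration, not values the description prescribes; only the argument names come from the listing, and the call is built but not sent.

```python
from typing import Any


def build_crawl_call(url: str, max_depth: int = 2, limit: int = 20) -> dict[str, Any]:
    """Build a firecrawl_crawl tool call (hypothetical helper; nothing is sent)."""
    if max_depth < 1 or limit < 1:
        raise ValueError("max_depth and limit must be at least 1")
    # The description discourages a trailing /* wildcard, so drop it.
    clean_url = url[:-2] if url.endswith("/*") else url
    return {
        "name": "firecrawl_crawl",
        "arguments": {
            "url": clean_url,
            "maxDiscoveryDepth": max_depth,
            "limit": limit,
            "allowExternalLinks": False,
            "deduplicateSimilarURLs": True,
            "sitemap": "include",
        },
    }


# Example: "first two levels of example.com/blog", capped at 20 pages.
crawl_call = build_crawl_call("https://example.com/blog/*", max_depth=2, limit=20)
```

The response to such a call is an operation ID, so a client would follow up with firecrawl_check_crawl_status, as the description notes.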