
Open Search MCP

by flyanima

crawl_url_content

Extract content from a single web page, including the main text, links, and images, with a configurable cap on extracted content length, using the Open Search MCP server.

Instructions

Crawl and extract content from a single web page

Input Schema

| Name | Required | Description | Default |
| ---- | -------- | ----------- | ------- |
| extractImages | No | Extract all images from the page | false |
| extractLinks | No | Extract all links from the page | false |
| extractText | No | Extract main text content | true |
| maxContentLength | No | Maximum length of extracted content (100–20000) | 5000 |
| url | Yes | URL of the web page to crawl | (none) |
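
For illustration, a minimal sketch of calling this tool from a TypeScript MCP client. The transport command and arguments are assumptions about how the server is launched locally; only the tool name and argument names come from the schema above.

    import { Client } from "@modelcontextprotocol/sdk/client/index.js";
    import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

    // Assumption: the server runs as a local stdio process; adjust the
    // command to match your installation of open-search-mcp.
    const transport = new StdioClientTransport({
      command: "node",
      args: ["dist/index.js"],
    });

    const client = new Client({ name: "example-client", version: "1.0.0" });
    await client.connect(transport);

    // Arguments mirror the input schema above: only `url` is required.
    const result = await client.callTool({
      name: "crawl_url_content",
      arguments: {
        url: "https://example.com",
        extractLinks: true,       // default: false
        maxContentLength: 10000,  // allowed range 100–20000, default 5000
      },
    });

    console.log(result);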

Implementation Reference

  • Core handler function for the 'crawl_url_content' tool. It fetches the page with axios using custom headers, extracts structured content (title, description, text, links, images) with cheerio, handles errors, and returns a formatted result with metadata. A sketch of the helper interfaces this handler assumes follows this list.
    execute: async (args: any) => {
      const {
        url,
        extractText = true,
        extractLinks = false,
        extractImages = false,
        maxContentLength = 5000
      } = args;
      try {
        const startTime = Date.now();
        // Validate URL format
        try {
          new URL(url);
        } catch {
          return { success: false, error: 'Invalid URL format' };
        }
        const pageData = await client.fetchPage(url);
        const content = client.extractContent(pageData.html, {
          extractText,
          extractLinks,
          extractImages,
          maxContentLength,
          maxLinks: 50,
          maxImages: 20
        });
        const crawlTime = Date.now() - startTime;
        return {
          success: true,
          data: {
            source: 'Web Crawler',
            url,
            finalUrl: pageData.url,
            status: pageData.status,
            crawlTime,
            content: { ...content, extractedAt: new Date().toISOString() },
            metadata: {
              extractText,
              extractLinks,
              extractImages,
              contentLength: content.contentLength || 0,
              linksCount: content.links?.length || 0,
              imagesCount: content.images?.length || 0
            },
            timestamp: Date.now()
          }
        };
      } catch (error) {
        return {
          success: false,
          error: `Web crawling failed: ${error instanceof Error ? error.message : String(error)}`,
          data: {
            source: 'Web Crawler',
            url,
            suggestions: [
              'Check if the URL is accessible',
              'Verify the website allows crawling',
              'Try again in a few moments',
              'Check your internet connection'
            ]
          }
        };
      }
    }
  • Zod validation schema for 'crawl_url_content' inputs in the security input validator, ensuring a safe URL and boolean flags. Note that it covers only url, extractText, and extractLinks; extractImages and maxContentLength are not validated here. A usage sketch follows this list.
    urlCrawl: z.object({
      url: CommonSchemas.url,
      extractText: z.boolean().optional().default(true),
      extractLinks: z.boolean().optional().default(false),
    }),
  • Primary input schema definition for the 'crawl_url_content' tool, used in MCP tool listing and execution.
    inputSchema: {
      type: 'object',
      properties: {
        url: { type: 'string', description: 'URL of the web page to crawl' },
        extractText: { type: 'boolean', description: 'Extract main text content', default: true },
        extractLinks: { type: 'boolean', description: 'Extract all links from the page', default: false },
        extractImages: { type: 'boolean', description: 'Extract all images from the page', default: false },
        maxContentLength: {
          type: 'number',
          description: 'Maximum length of extracted content',
          default: 5000,
          minimum: 100,
          maximum: 20000
        }
      },
      required: ['url']
    },
  • Tool registration within the registerWebCrawlerTools function, defining name, description, schema, and handler. The inputSchema and execute bodies are identical to the snippets shown above and are elided here.
    registry.registerTool({
      name: 'crawl_url_content',
      description: 'Crawl and extract content from a single web page',
      category: 'utility',
      source: 'Web Crawler',
      inputSchema: { /* identical to the input schema shown above */ },
      execute: async (args: any) => { /* identical to the handler shown above */ }
    });
  • src/index.ts:248 (registration)
    Top-level registration of the web crawler tools (including crawl_url_content) during main MCP server initialization; a minimal registry sketch follows this list.
    registerWebCrawlerTools(this.toolRegistry); // 2 tools: crawl_url_content, batch_crawl_urls
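
The handler above depends on a web-crawler client whose definitions are not shown on this page. The shapes below are inferred from the fields the handler reads (pageData.html, pageData.url, pageData.status, content.contentLength, content.links, content.images) and should be read as assumptions, not the project's actual types.

    // Hypothetical shapes inferred from handler usage; the real project
    // (which fetches with axios and parses with cheerio) may differ.
    interface PageData {
      html: string;    // raw page HTML
      url: string;     // final URL after any redirects
      status: number;  // HTTP status code
    }

    interface ExtractOptions {
      extractText: boolean;
      extractLinks: boolean;
      extractImages: boolean;
      maxContentLength: number;
      maxLinks: number;   // the handler passes 50
      maxImages: number;  // the handler passes 20
    }

    interface ExtractedContent {
      title?: string;
      description?: string;
      text?: string;
      contentLength?: number;
      links?: string[];
      images?: string[];
    }

    interface WebCrawlerClient {
      fetchPage(url: string): Promise<PageData>;
      extractContent(html: string, options: ExtractOptions): ExtractedContent;
    }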
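
As a usage illustration for the Zod validation schema above, here is a minimal, self-contained sketch. CommonSchemas.url is stood in with z.string().url(), which is an assumption about its behavior.

    import { z } from 'zod';

    // Assumed stand-in for the project's CommonSchemas.url.
    const CommonSchemas = { url: z.string().url() };

    const urlCrawl = z.object({
      url: CommonSchemas.url,
      extractText: z.boolean().optional().default(true),
      extractLinks: z.boolean().optional().default(false),
    });

    // safeParse reports failure instead of throwing, and fills defaults on success.
    const parsed = urlCrawl.safeParse({ url: 'https://example.com' });
    if (parsed.success) {
      console.log(parsed.data); // { url: 'https://example.com', extractText: true, extractLinks: false }
    } else {
      console.error(parsed.error.issues);
    }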
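
For context on registerTool and toolRegistry, a hypothetical minimal registry is sketched below; the project's real ToolRegistry is not shown on this page and likely tracks more metadata.

    // Hypothetical minimal registry, not the project's actual implementation.
    interface ToolDefinition {
      name: string;
      description: string;
      category: string;
      source: string;
      inputSchema: object;
      execute: (args: any) => Promise<unknown>;
    }

    class ToolRegistry {
      private tools = new Map<string, ToolDefinition>();

      registerTool(tool: ToolDefinition): void {
        this.tools.set(tool.name, tool);
      }

      // Look up a registered tool by name and run its handler.
      async callTool(name: string, args: any): Promise<unknown> {
        const tool = this.tools.get(name);
        if (!tool) throw new Error(`Unknown tool: ${name}`);
        return tool.execute(args);
      }
    }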


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/flyanima/open-search-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.