crawl_url_content
Extract content from a single web page, including its main text, links, and images, with a configurable maximum content length, using the Open Search MCP server.
Instructions
Crawl and extract content from a single web page
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| extractImages | No | Extract all images from the page | false |
| extractLinks | No | Extract all links from the page | false |
| extractText | No | Extract main text content | true |
| maxContentLength | No | Maximum length of extracted content (100 to 20000) | 5000 |
| url | Yes | URL of the web page to crawl | |
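
For example, a request that enables link extraction and otherwise keeps the defaults could pass arguments like the following (the URL is a placeholder):

```typescript
// Example arguments conforming to the input schema above.
// Only `url` is required; omitted fields fall back to their defaults.
const args = {
  url: 'https://example.com/article', // placeholder URL
  extractText: true,                  // default: true
  extractLinks: true,                 // default: false
  extractImages: false,               // default: false
  maxContentLength: 5000              // default: 5000; allowed range 100 to 20000
};
```
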
Implementation Reference
- src/tools/utility/web-crawler-tools.ts (handler): Core handler function for the 'crawl_url_content' tool. Fetches the page with axios using custom headers, extracts structured content (title, description, text, links, images) with cheerio, handles errors, and returns a formatted result with metadata.

  ```typescript
  execute: async (args: any) => {
    const { url, extractText = true, extractLinks = false, extractImages = false, maxContentLength = 5000 } = args;

    try {
      const startTime = Date.now();

      // Validate URL format
      try {
        new URL(url);
      } catch {
        return { success: false, error: 'Invalid URL format' };
      }

      const pageData = await client.fetchPage(url);
      const content = client.extractContent(pageData.html, {
        extractText,
        extractLinks,
        extractImages,
        maxContentLength,
        maxLinks: 50,
        maxImages: 20
      });

      const crawlTime = Date.now() - startTime;

      return {
        success: true,
        data: {
          source: 'Web Crawler',
          url,
          finalUrl: pageData.url,
          status: pageData.status,
          crawlTime,
          content: {
            ...content,
            extractedAt: new Date().toISOString()
          },
          metadata: {
            extractText,
            extractLinks,
            extractImages,
            contentLength: content.contentLength || 0,
            linksCount: content.links?.length || 0,
            imagesCount: content.images?.length || 0
          },
          timestamp: Date.now()
        }
      };
    } catch (error) {
      return {
        success: false,
        error: `Web crawling failed: ${error instanceof Error ? error.message : String(error)}`,
        data: {
          source: 'Web Crawler',
          url,
          suggestions: [
            'Check if the URL is accessible',
            'Verify the website allows crawling',
            'Try again in a few moments',
            'Check your internet connection'
          ]
        }
      };
    }
  }
  ```
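
  Based on the handler above, callers can expect results shaped roughly as follows. This is an inferred sketch derived from the return statements, not a type exported by the project; the exact fields inside `content` depend on what `client.extractContent` produces.

  ```typescript
  // Inferred result shape for crawl_url_content, reconstructed from the handler above.
  interface CrawlSuccess {
    success: true;
    data: {
      source: 'Web Crawler';
      url: string;        // the requested URL
      finalUrl: string;   // final URL after redirects (pageData.url)
      status: number;     // HTTP status reported by the fetch
      crawlTime: number;  // elapsed milliseconds
      content: Record<string, unknown> & { extractedAt: string };
      metadata: {
        extractText: boolean;
        extractLinks: boolean;
        extractImages: boolean;
        contentLength: number;
        linksCount: number;
        imagesCount: number;
      };
      timestamp: number;
    };
  }

  interface CrawlFailure {
    success: false;
    error: string;
    // Present for crawl failures; the invalid-URL branch returns only success and error.
    data?: {
      source: 'Web Crawler';
      url: string;
      suggestions: string[];
    };
  }

  type CrawlResult = CrawlSuccess | CrawlFailure;
  ```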
- src/utils/input-validator.ts:128-132 (schema): Zod validation schema for 'crawl_url_content' inputs in the security input validator, ensuring a safe URL and boolean flags.

  ```typescript
  urlCrawl: z.object({
    url: CommonSchemas.url,
    extractText: z.boolean().optional().default(true),
    extractLinks: z.boolean().optional().default(false),
  }),
  ```
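
  As a rough illustration of how such a schema behaves (not taken from the project's validation pipeline), the snippet below applies an equivalent standalone schema to incoming arguments. `CommonSchemas.url` is project-internal, so a plain `z.string().url()` stands in for it here.

  ```typescript
  import { z } from 'zod';

  // Stand-in for the project's CommonSchemas.url (assumption).
  const urlSchema = z.string().url();

  const urlCrawlSchema = z.object({
    url: urlSchema,
    extractText: z.boolean().optional().default(true),
    extractLinks: z.boolean().optional().default(false),
  });

  // safeParse returns a discriminated union instead of throwing on invalid input.
  const parsed = urlCrawlSchema.safeParse({ url: 'https://example.com' });
  if (parsed.success) {
    // Defaults are filled in: { url: 'https://example.com', extractText: true, extractLinks: false }
    console.log(parsed.data);
  } else {
    console.error(parsed.error.issues);
  }
  ```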
- src/tools/utility/web-crawler-tools.ts (schema): Primary input schema definition for the 'crawl_url_content' tool, used in MCP tool listing and execution.

  ```typescript
  inputSchema: {
    type: 'object',
    properties: {
      url: { type: 'string', description: 'URL of the web page to crawl' },
      extractText: { type: 'boolean', description: 'Extract main text content', default: true },
      extractLinks: { type: 'boolean', description: 'Extract all links from the page', default: false },
      extractImages: { type: 'boolean', description: 'Extract all images from the page', default: false },
      maxContentLength: {
        type: 'number',
        description: 'Maximum length of extracted content',
        default: 5000,
        minimum: 100,
        maximum: 20000
      }
    },
    required: ['url']
  },
  ```
- src/tools/utility/web-crawler-tools.ts:184-289 (registration): Tool registration within the registerWebCrawlerTools function, defining the name, description, category, source, input schema, and handler. The input schema and execute handler are reproduced in full in the entries above.

  ```typescript
  registry.registerTool({
    name: 'crawl_url_content',
    description: 'Crawl and extract content from a single web page',
    category: 'utility',
    source: 'Web Crawler',
    inputSchema: { /* input schema as shown above */ },
    execute: async (args: any) => { /* handler as shown above */ }
  });
  ```
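
  Once registered, the tool is reachable through a standard MCP `tools/call` request. The sketch below shows one way a client might invoke it, assuming the official TypeScript MCP SDK and a stdio transport; the server launch command, file path, and URL are placeholders.

  ```typescript
  import { Client } from '@modelcontextprotocol/sdk/client/index.js';
  import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

  async function main() {
    // Placeholder launch command; point this at however the Open Search MCP server is started.
    const transport = new StdioClientTransport({ command: 'node', args: ['dist/index.js'] });
    const client = new Client({ name: 'example-client', version: '1.0.0' });
    await client.connect(transport);

    // Invoke crawl_url_content with arguments matching the input schema above.
    const result = await client.callTool({
      name: 'crawl_url_content',
      arguments: { url: 'https://example.com', extractLinks: true },
    });

    console.log(result);
    await client.close();
  }

  main().catch(console.error);
  ```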
- src/index.ts:248 (registration): Top-level registration of the web crawler tools (including crawl_url_content) in the main MCP server initialization.

  ```typescript
  registerWebCrawlerTools(this.toolRegistry); // 2 tools: crawl_url_content, batch_crawl_urls
  ```