fetch_url
Retrieve web page content from a specified URL using a headless browser, with options to extract main content, disable media, and manage timeouts for efficient data collection.
Instructions
Retrieve web page content from a specified URL
Input Schema
TableJSON Schema
| Name | Required | Description | Default |
|---|---|---|---|
| debug | No | Whether to enable debug mode (showing browser window), overrides the --debug command line flag if specified | |
| disableMedia | No | Whether to disable media resources (images, stylesheets, fonts, media), default is true | |
| extractContent | No | Whether to intelligently extract the main content, default is true | |
| maxLength | No | Maximum length of returned content (in characters), default is no limit | |
| navigationTimeout | No | Maximum time to wait for additional navigation in milliseconds, default is 10000 (10 seconds) | |
| returnHtml | No | Whether to return HTML content instead of Markdown, default is false | |
| timeout | No | Page loading timeout in milliseconds, default is 30000 (30 seconds) | |
| url | Yes | URL to fetch | |
| waitForNavigation | No | Whether to wait for additional navigation after initial page load (useful for sites with anti-bot verification), default is false | |
| waitUntil | No | Specifies when navigation is considered complete, options: 'load', 'domcontentloaded', 'networkidle', 'commit', default is 'load' |
Implementation Reference
- src/tools/fetchUrl.ts:74-136 (handler)The main handler function for the fetch_url tool. It launches a Chromium browser using Playwright, navigates to the URL with specified options, processes the content using WebContentProcessor, and returns the content.export async function fetchUrl(args: any) { const url = String(args?.url || ""); if (!url) { console.error(`[Error] URL parameter missing`); throw new Error("URL parameter is required"); } const options: FetchOptions = { timeout: Number(args?.timeout) || 30000, waitUntil: String(args?.waitUntil || "load") as 'load' | 'domcontentloaded' | 'networkidle' | 'commit', extractContent: args?.extractContent !== false, maxLength: Number(args?.maxLength) || 0, returnHtml: args?.returnHtml === true, waitForNavigation: args?.waitForNavigation === true, navigationTimeout: Number(args?.navigationTimeout) || 10000, disableMedia: args?.disableMedia !== false, debug: args?.debug }; // 确定是否启用调试模式(优先使用参数指定的值,否则使用命令行标志) const useDebugMode = options.debug !== undefined ? options.debug : isDebugMode; if (useDebugMode) { console.log(`[Debug] Debug mode enabled for URL: ${url}`); } const processor = new WebContentProcessor(options, '[FetchURL]'); let browser = null; let page = null; try { browser = await chromium.launch({ headless: !useDebugMode }); const context = await browser.newContext({ javaScriptEnabled: true, ignoreHTTPSErrors: true, userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36' }); await context.route('**/*', async (route) => { const resourceType = route.request().resourceType(); if (options.disableMedia && ['image', 'stylesheet', 'font', 'media'].includes(resourceType)) { await route.abort(); } else { await route.continue(); } }); page = await context.newPage(); const result = await processor.processPageContent(page, url); return { content: [{ type: "text", text: result.content }] }; } finally { if (!useDebugMode) { if (page) await page.close().catch(e => console.error(`[Error] Failed to close page: ${e.message}`)); if (browser) await browser.close().catch(e => console.error(`[Error] Failed to close browser: ${e.message}`)); } else { console.log(`[Debug] Browser and page kept open for debugging. URL: ${url}`); } } }
- src/tools/fetchUrl.ts:11-69 (schema)The schema definition for the fetch_url tool, including inputSchema with parameters like url, timeout, etc.export const fetchUrlTool = { name: "fetch_url", description: "Retrieve web page content from a specified URL", inputSchema: { type: "object", properties: { url: { type: "string", description: "URL to fetch", }, timeout: { type: "number", description: "Page loading timeout in milliseconds, default is 30000 (30 seconds)", }, waitUntil: { type: "string", description: "Specifies when navigation is considered complete, options: 'load', 'domcontentloaded', 'networkidle', 'commit', default is 'load'", }, extractContent: { type: "boolean", description: "Whether to intelligently extract the main content, default is true", }, maxLength: { type: "number", description: "Maximum length of returned content (in characters), default is no limit", }, returnHtml: { type: "boolean", description: "Whether to return HTML content instead of Markdown, default is false", }, waitForNavigation: { type: "boolean", description: "Whether to wait for additional navigation after initial page load (useful for sites with anti-bot verification), default is false", }, navigationTimeout: { type: "number", description: "Maximum time to wait for additional navigation in milliseconds, default is 10000 (10 seconds)", }, disableMedia: { type: "boolean", description: "Whether to disable media resources (images, stylesheets, fonts, media), default is true", }, debug: { type: "boolean", description: "Whether to enable debug mode (showing browser window), overrides the --debug command line flag if specified", }, }, required: ["url"], } };
- src/tools/index.ts:1-14 (registration)Registration of the fetch_url tool: exported in tools array and mapped in toolHandlers object by name to its handler function.import { fetchUrlTool, fetchUrl } from './fetchUrl.js'; import { fetchUrlsTool, fetchUrls } from './fetchUrls.js'; // Export tool definitions export const tools = [ fetchUrlTool, fetchUrlsTool ]; // Export tool implementations export const toolHandlers = { [fetchUrlTool.name]: fetchUrl, [fetchUrlsTool.name]: fetchUrls };