visit_page
Extract webpage content by visiting a specified URL on the MCP Web Research Server, enabling real-time data retrieval for web research.
Instructions
Visit a webpage and extract its content
Input Schema
TableJSON Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to visit |
Implementation Reference
- src/index.ts:337-374 (handler)Main handler for the 'visit_page' tool: validates URL, launches browser page if needed, navigates safely, extracts title and markdown content, returns JSON result.case 'visit_page': { const args = request.params.arguments as unknown as VisitPageArgs; if (!args?.url) { throw new McpError(ErrorCode.InvalidParams, 'URL is required'); } if (!isValidUrl(args.url)) { throw new McpError( ErrorCode.InvalidParams, `Invalid URL: ${args.url}. Only http and https protocols are supported.` ); } const page = await ensureBrowser(); try { await safePageNavigation(page, args.url); const title = await page.title(); const content = await extractContentAsMarkdown(page); return { content: [ { type: 'text', text: JSON.stringify({ url: args.url, title, content }, null, 2) } ] }; } catch (error) { throw new McpError( ErrorCode.InternalError, `Failed to visit page: ${(error as Error).message}` ); } }
- src/index.ts:149-161 (registration)Registers the 'visit_page' tool in the listTools response, including name, description, and input schema.name: 'visit_page', description: 'Visit a webpage and extract its content', inputSchema: { type: 'object', properties: { url: { type: 'string', description: 'URL to visit' } }, required: ['url'] } }
- src/index.ts:28-30 (schema)TypeScript interface defining the input shape for visit_page arguments, used for type casting in handler.interface VisitPageArgs { url: string; }
- src/index.ts:217-271 (helper)Helper function to extract main content from the webpage and convert it to clean Markdown using TurndownService.async function extractContentAsMarkdown(page: Page): Promise<string> { const html = await page.evaluate(() => { // Try standard content containers first const contentSelectors = [ 'main', 'article', '[role="main"]', '#content', '.content', '.main', '.post', '.article' ]; for (const selector of contentSelectors) { const element = document.querySelector(selector); if (element) { return element.outerHTML; } } // Fallback to cleaning full body content const body = document.body; const elementsToRemove = [ 'header', 'footer', 'nav', '[role="navigation"]', 'aside', '.sidebar', '[role="complementary"]', '.nav', '.menu', '.header', '.footer', '.advertisement', '.ads', '.cookie-notice' ]; elementsToRemove.forEach(sel => { body.querySelectorAll(sel).forEach(el => el.remove()); }); return body.outerHTML; }); if (!html) { return ''; } try { const markdown = turndownService.turndown(html); return markdown .replace(/\n{3,}/g, '\n\n') .replace(/^- $/gm, '') .replace(/^\s+$/gm, '') .trim(); } catch (error) { console.error('Error converting HTML to Markdown:', error); return html; } }
- src/index.ts:176-214 (helper)Helper function for safe navigation to the URL with timeout, checks for bot protection and suspicious indicators.async function safePageNavigation(page: Page, url: string): Promise<void> { await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 10000 // 10 second timeout }); // Quick check for bot protection or security challenges const validation = await page.evaluate(() => { const botProtectionExists = [ '#challenge-running', '#cf-challenge-running', '#px-captcha', '#ddos-protection', '#waf-challenge-html' ].some(selector => document.querySelector(selector)); const suspiciousTitle = [ 'security check', 'ddos protection', 'please wait', 'just a moment', 'attention required' ].some(phrase => document.title.toLowerCase().includes(phrase)); return { botProtection: botProtectionExists, suspiciousTitle, title: document.title }; }); if (validation.botProtection) { throw new Error('Bot protection detected'); } if (validation.suspiciousTitle) { throw new Error(`Suspicious page title detected: "${validation.title}"`); } }