Skip to main content
Glama

visit_page

Extract and analyze web page content, read articles, and verify information directly from specific URLs. Includes optional screenshot capture for detailed research and documentation.

Instructions

Navigates to a specific URL and extracts the page content in readable format, with option to capture a screenshot. Use this tool to deeply analyze specific web pages, read articles, examine documentation, or verify information directly from the source. Especially useful for in-depth research after identifying relevant pages via search.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
takeScreenshotNoWhether to take a screenshot
urlYesURL to visit

Implementation Reference

  • The main execution logic for the 'visit_page' tool. Validates the input URL, navigates to the page using Playwright, extracts content as Markdown using Turndown, optionally captures and saves a screenshot, stores the result in the research session, and returns structured JSON output.
    case "visit_page": { // Extract URL and screenshot flag from request const { url, takeScreenshot } = request.params.arguments as { url: string; // Target URL to visit takeScreenshot?: boolean; // Optional screenshot flag }; // Step 1: Validate URL format and security if (!isValidUrl(url)) { return { content: [{ type: "text" as const, text: `Invalid URL: ${url}. Only http and https protocols are supported.` }], isError: true }; } try { // Step 2: Visit page and extract content with retry mechanism const result = await withRetry(async () => { // Navigate to target URL safely await safePageNavigation(page, url); const title = await page.title(); // Step 3: Extract and process page content const content = await withRetry(async () => { // Convert page content to markdown const extractedContent = await extractContentAsMarkdown(page); // If no content is extracted, throw an error if (!extractedContent) { throw new Error('Failed to extract content'); } // Return the extracted content return extractedContent; }); // Step 4: Create result object with page data const pageResult: ResearchResult = { url, // Original URL title, // Page title content, // Markdown content timestamp: new Date().toISOString(), // Capture time }; // Step 5: Take screenshot if requested let screenshotUri: string | undefined; if (takeScreenshot) { // Capture and process screenshot const screenshot = await takeScreenshotWithSizeLimit(page); pageResult.screenshotPath = await saveScreenshot(screenshot, title); // Get the index for the resource URI const resultIndex = currentSession ? currentSession.results.length : 0; screenshotUri = `research://screenshots/${resultIndex}`; // Notify clients about new screenshot resource server.notification({ method: "notifications/resources/list_changed" }); } // Step 6: Store result in session addResult(pageResult); return { pageResult, screenshotUri }; }); // Step 7: Return formatted result with screenshot URI if taken const response: ToolResult = { content: [{ type: "text" as const, text: JSON.stringify({ url: result.pageResult.url, title: result.pageResult.title, content: result.pageResult.content, timestamp: result.pageResult.timestamp, screenshot: result.screenshotUri ? `View screenshot via *MCP Resources* (Paperclip icon) @ URI: ${result.screenshotUri}` : undefined }, null, 2) }] }; return response; } catch (error) { // Handle and format page visit errors return { content: [{ type: "text" as const, text: `Failed to visit page: ${(error as Error).message}` }], isError: true }; } }
  • Input schema definition for the 'visit_page' tool, specifying required 'url' parameter and optional 'takeScreenshot' boolean.
    inputSchema: { type: "object", properties: { url: { type: "string", description: "URL to visit" }, takeScreenshot: { type: "boolean", description: "Whether to take a screenshot" }, }, required: ["url"], },
  • index.ts:155-166 (registration)
    Registration of the 'visit_page' tool in the TOOLS array, which is returned by ListToolsRequestSchema handler.
    { name: "visit_page", description: "Navigates to a specific URL and extracts the page content in readable format, with option to capture a screenshot. Use this tool to deeply analyze specific web pages, read articles, examine documentation, or verify information directly from the source. Especially useful for in-depth research after identifying relevant pages via search.", inputSchema: { type: "object", properties: { url: { type: "string", description: "URL to visit" }, takeScreenshot: { type: "boolean", description: "Whether to take a screenshot" }, }, required: ["url"], }, },
  • Helper function for safe page navigation with anti-bot measures, cookie setting, status checks, and bot protection detection.
    async function safePageNavigation(page: Page, url: string): Promise<void> { try { // Step 1: Set cookies to bypass consent banner and simulate returning user await page.context().addCookies([ { name: 'CONSENT', value: 'YES+cb.20210720-07-p0.en+FX+410', domain: '.google.com', path: '/' }, { name: 'NID', value: `511=${Math.random().toString(36).substring(2)}`, domain: '.google.com', path: '/', expires: Math.floor(Date.now() / 1000) + 15552000 // 180 days }, { name: '1P_JAR', value: new Date().toISOString().slice(0, 10), domain: '.google.com', path: '/', expires: Math.floor(Date.now() / 1000) + 2592000 // 30 days }, { name: 'AEC', value: `AUEFqZ${Math.random().toString(36).substring(2, 15)}`, domain: '.google.com', path: '/', expires: Math.floor(Date.now() / 1000) + 15552000 // 180 days } ]); // Step 2: Initial navigation const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 15000 }); // Step 3: Basic response validation if (!response) { throw new Error('Navigation failed: no response received'); } // Check HTTP status code; if 400 or higher, throw an error const status = response.status(); if (status >= 400) { throw new Error(`HTTP ${status}: ${response.statusText()}`); } // Step 4: Wait for network to become idle or timeout await Promise.race([ page.waitForLoadState('networkidle', { timeout: 5000 }) .catch(() => {/* ignore timeout */ }), // Fallback timeout in case networkidle never occurs new Promise(resolve => setTimeout(resolve, 5000)) ]); // Step 5: Security and content validation const validation = await page.evaluate(() => { const botProtectionExists = [ '#challenge-running', // Cloudflare '#cf-challenge-running', // Cloudflare '#px-captcha', // PerimeterX '#ddos-protection', // Various '#waf-challenge-html' // Various WAFs ].some(selector => document.querySelector(selector)); // Check for suspicious page titles const suspiciousTitle = [ 'security check', 'ddos protection', 'please wait', 'just a moment', 'attention required' ].some(phrase => document.title.toLowerCase().includes(phrase)); // Count words in the page content const bodyText = document.body.innerText || ''; const words = bodyText.trim().split(/\s+/).length; // Return validation results return { wordCount: words, botProtection: botProtectionExists, suspiciousTitle, title: document.title }; }); // If bot protection is detected, throw an error if (validation.botProtection) { throw new Error('Bot protection detected'); } // If the page title is suspicious, throw an error if (validation.suspiciousTitle) { throw new Error(`Suspicious page title detected: "${validation.title}"`); } // If the page contains insufficient content, throw an error if (validation.wordCount < 1) { throw new Error('Page contains insufficient content'); } } catch (error) { // If an error occurs during navigation, throw an error with the URL and the error message throw new Error(`Navigation to ${url} failed: ${(error as Error).message}`); } }
  • Helper function to extract main page content as clean Markdown, targeting semantic elements and filtering out navigation/UI elements.
    async function extractContentAsMarkdown( page: Page, // Puppeteer page to extract from selector?: string // Optional CSS selector to target specific content ): Promise<string> { // Step 1: Execute content extraction in browser context const html = await page.evaluate((sel) => { // Handle case where specific selector is provided if (sel) { const element = document.querySelector(sel); // Return element content or empty string if not found return element ? element.outerHTML : ''; } // Step 2: Try standard content containers first const contentSelectors = [ 'main', // HTML5 semantic main content 'article', // HTML5 semantic article content '[role="main"]', // ARIA main content role '#content', // Common content ID '.content', // Common content class '.main', // Alternative main class '.post', // Blog post content '.article', // Article content container ]; // Try each selector in priority order for (const contentSelector of contentSelectors) { const element = document.querySelector(contentSelector); if (element) { return element.outerHTML; // Return first matching content } } // Step 3: Fallback to cleaning full body content const body = document.body; // Define elements to remove for cleaner content const elementsToRemove = [ // Navigation elements 'header', // Page header 'footer', // Page footer 'nav', // Navigation sections '[role="navigation"]', // ARIA navigation elements // Sidebars and complementary content 'aside', // Sidebar content '.sidebar', // Sidebar by class '[role="complementary"]', // ARIA complementary content // Navigation-related elements '.nav', // Navigation classes '.menu', // Menu elements // Page structure elements '.header', // Header classes '.footer', // Footer classes // Advertising and notices '.advertisement', // Advertisement containers '.ads', // Ad containers '.cookie-notice', // Cookie consent notices ]; // Remove each unwanted element from content elementsToRemove.forEach(sel => { body.querySelectorAll(sel).forEach(el => el.remove()); }); // Return cleaned body content return body.outerHTML; }, selector); // Step 4: Handle empty content case if (!html) { return ''; } try { // Step 5: Convert HTML to Markdown const markdown = turndownService.turndown(html); // Step 6: Clean up and format markdown return markdown .replace(/\n{3,}/g, '\n\n') // Replace excessive newlines with double .replace(/^- $/gm, '') // Remove empty list items .replace(/^\s+$/gm, '') // Remove whitespace-only lines .trim(); // Remove leading/trailing whitespace } catch (error) { // Log conversion errors and return original HTML as fallback console.error('Error converting HTML to Markdown:', error); return html; } }

Other Tools

Related Tools

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/PhialsBasement/mcp-webresearch-stealthified'

If you have feedback or need assistance with the MCP directory API, please join our Discord server