PhialsBasement

MCP Web Research Server

visit_page

Extract readable content from web pages to analyze articles, verify information, or examine documentation directly from the source for research purposes.

Instructions

Navigates to a specific URL and extracts the page content in readable format, with option to capture a screenshot. Use this tool to deeply analyze specific web pages, read articles, examine documentation, or verify information directly from the source. Especially useful for in-depth research after identifying relevant pages via search.
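As a rough illustration of how a client might invoke this tool, the sketch below builds an MCP `tools/call` JSON-RPC request. The `buildVisitPageRequest` helper, the request `id`, and the example URL are assumptions made for illustration; only the tool name and its two arguments come from this page.

```typescript
// Argument shape taken from the tool description: a URL plus an optional screenshot flag.
interface VisitPageArgs {
    url: string;
    takeScreenshot?: boolean;
}

// JSON-RPC 2.0 envelope for an MCP tool call.
interface ToolCallRequest {
    jsonrpc: "2.0";
    id: number;
    method: "tools/call";
    params: { name: "visit_page"; arguments: VisitPageArgs };
}

// Hypothetical helper (not part of the server): wraps the arguments in a tools/call request.
function buildVisitPageRequest(id: number, args: VisitPageArgs): ToolCallRequest {
    return {
        jsonrpc: "2.0",
        id,
        method: "tools/call",
        params: { name: "visit_page", arguments: args },
    };
}

// Example URL is a placeholder.
const request = buildVisitPageRequest(1, {
    url: "https://example.com/docs",
    takeScreenshot: true,
});
console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client library would normally build this envelope for you; the sketch only shows which fields carry the tool name and its arguments.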

Input Schema

Name            Required  Description                    Default
url             Yes       URL to visit
takeScreenshot  No        Whether to take a screenshot
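The schema above can be mirrored by a small runtime check on the caller's side. This is a minimal sketch, not code from the server: the `VisitPageInput` type and the `isVisitPageInput` guard are hypothetical names introduced here.

```typescript
// Hypothetical type mirroring the input schema:
// `url` is a required string, `takeScreenshot` is an optional boolean.
interface VisitPageInput {
    url: string;
    takeScreenshot?: boolean;
}

// Type guard that narrows an unknown value to VisitPageInput.
function isVisitPageInput(value: unknown): value is VisitPageInput {
    if (typeof value !== "object" || value === null) {
        return false;
    }
    const args = value as Record<string, unknown>;
    if (typeof args.url !== "string") {
        return false; // `url` is required
    }
    // `takeScreenshot` may be omitted, but must be a boolean when present
    return args.takeScreenshot === undefined || typeof args.takeScreenshot === "boolean";
}

console.log(isVisitPageInput({ url: "https://example.com" }));
console.log(isVisitPageInput({ takeScreenshot: true }));
```

The guard only checks types; whether the URL uses an allowed protocol is a separate concern handled by the server's own validation.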

Implementation Reference

  • Main handler logic for the 'visit_page' tool, inside the CallToolRequestSchema handler. It validates the URL, navigates safely, extracts the content as Markdown, optionally takes and saves a screenshot, stores the result in the session, and returns a structured JSON response.
    case "visit_page": {
        // Extract URL and screenshot flag from request
        const { url, takeScreenshot } = request.params.arguments as {
            url: string;                    // Target URL to visit
            takeScreenshot?: boolean;       // Optional screenshot flag
        };
    
        // Step 1: Validate URL format and security
        if (!isValidUrl(url)) {
            return {
                content: [{
                    type: "text" as const,
                    text: `Invalid URL: ${url}. Only http and https protocols are supported.`
                }],
                isError: true
            };
        }
    
        try {
            // Step 2: Visit page and extract content with retry mechanism
            const result = await withRetry(async () => {
                // Navigate to target URL safely
                await safePageNavigation(page, url);
                const title = await page.title();
    
                // Step 3: Extract and process page content
                const content = await withRetry(async () => {
                    // Convert page content to markdown
                    const extractedContent = await extractContentAsMarkdown(page);
    
                    // If no content is extracted, throw an error
                    if (!extractedContent) {
                        throw new Error('Failed to extract content');
                    }
    
                    // Return the extracted content
                    return extractedContent;
                });
    
                // Step 4: Create result object with page data
                const pageResult: ResearchResult = {
                    url,      // Original URL
                    title,    // Page title
                    content,  // Markdown content
                    timestamp: new Date().toISOString(),  // Capture time
                };
    
                // Step 5: Take screenshot if requested
                let screenshotUri: string | undefined;
                if (takeScreenshot) {
                    // Capture and process screenshot
                    const screenshot = await takeScreenshotWithSizeLimit(page);
                    pageResult.screenshotPath = await saveScreenshot(screenshot, title);
    
                    // Get the index for the resource URI
                    const resultIndex = currentSession ? currentSession.results.length : 0;
                    screenshotUri = `research://screenshots/${resultIndex}`;
    
                    // Notify clients about new screenshot resource
                    server.notification({
                        method: "notifications/resources/list_changed"
                    });
                }
    
                // Step 6: Store result in session
                addResult(pageResult);
                return { pageResult, screenshotUri };
            });
    
            // Step 7: Return formatted result with screenshot URI if taken
            const response: ToolResult = {
                content: [{
                    type: "text" as const,
                    text: JSON.stringify({
                        url: result.pageResult.url,
                        title: result.pageResult.title,
                        content: result.pageResult.content,
                        timestamp: result.pageResult.timestamp,
                        screenshot: result.screenshotUri ? `View screenshot via *MCP Resources* (Paperclip icon) @ URI: ${result.screenshotUri}` : undefined
                    }, null, 2)
                }]
            };
    
            return response;
        } catch (error) {
            // Handle and format page visit errors
            return {
                content: [{
                    type: "text" as const,
                    text: `Failed to visit page: ${(error as Error).message}`
                }],
                isError: true
            };
        }
    }
  • index.ts:155-166 (registration)
    Tool registration in TOOLS array, listing 'visit_page' with description and input schema for ListToolsRequestSchema.
    {
        name: "visit_page",
        description: "Navigates to a specific URL and extracts the page content in readable format, with option to capture a screenshot. Use this tool to deeply analyze specific web pages, read articles, examine documentation, or verify information directly from the source. Especially useful for in-depth research after identifying relevant pages via search.",
        inputSchema: {
            type: "object",
            properties: {
                url: { type: "string", description: "URL to visit" },
                takeScreenshot: { type: "boolean", description: "Whether to take a screenshot" },
            },
            required: ["url"],
        },
    },
  • Input schema definition for 'visit_page' tool: requires 'url' string, optional 'takeScreenshot' boolean.
    inputSchema: {
        type: "object",
        properties: {
            url: { type: "string", description: "URL to visit" },
            takeScreenshot: { type: "boolean", description: "Whether to take a screenshot" },
        },
        required: ["url"],
    },
  • Helper for safe page navigation used by visit_page: sets anti-bot cookies, navigates, validates response, checks for bot protection and content sufficiency.
    async function safePageNavigation(page: Page, url: string): Promise<void> {
        try {
            // Step 1: Set cookies to bypass consent banner and simulate returning user
            await page.context().addCookies([
                {
                    name: 'CONSENT',
                    value: 'YES+cb.20210720-07-p0.en+FX+410',
                    domain: '.google.com',
                    path: '/'
                },
                {
                    name: 'NID',
                    value: `511=${Math.random().toString(36).substring(2)}`,
                    domain: '.google.com',
                    path: '/',
                    expires: Math.floor(Date.now() / 1000) + 15552000 // 180 days
                },
                {
                    name: '1P_JAR',
                    value: new Date().toISOString().slice(0, 10),
                    domain: '.google.com',
                    path: '/',
                    expires: Math.floor(Date.now() / 1000) + 2592000 // 30 days
                },
                {
                    name: 'AEC',
                    value: `AUEFqZ${Math.random().toString(36).substring(2, 15)}`,
                    domain: '.google.com',
                    path: '/',
                    expires: Math.floor(Date.now() / 1000) + 15552000 // 180 days
                }
            ]);
    
            // Step 2: Initial navigation
            const response = await page.goto(url, {
                waitUntil: 'domcontentloaded',
                timeout: 15000
            });
    
            // Step 3: Basic response validation
            if (!response) {
                throw new Error('Navigation failed: no response received');
            }
    
            // Check HTTP status code; if 400 or higher, throw an error
            const status = response.status();
            if (status >= 400) {
                throw new Error(`HTTP ${status}: ${response.statusText()}`);
            }
    
            // Step 4: Wait for network to become idle or timeout
            await Promise.race([
                page.waitForLoadState('networkidle', { timeout: 5000 })
                    .catch(() => { /* ignore timeout */ }),
                // Fallback timeout in case networkidle never occurs
                new Promise(resolve => setTimeout(resolve, 5000))
            ]);
    
            // Step 5: Security and content validation
            const validation = await page.evaluate(() => {
                const botProtectionExists = [
                    '#challenge-running',     // Cloudflare
                    '#cf-challenge-running',  // Cloudflare
                    '#px-captcha',           // PerimeterX
                    '#ddos-protection',       // Various
                    '#waf-challenge-html'     // Various WAFs
                ].some(selector => document.querySelector(selector));
    
                // Check for suspicious page titles
                const suspiciousTitle = [
                    'security check',
                    'ddos protection',
                    'please wait',
                    'just a moment',
                    'attention required'
                ].some(phrase => document.title.toLowerCase().includes(phrase));
    
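                // Count words in the page content; drop empty tokens, because
                // ''.split(/\s+/) yields [''] (length 1) for an empty page
                const bodyText = document.body.innerText || '';
                const words = bodyText.trim().split(/\s+/).filter(Boolean).length;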
    
                // Return validation results
                return {
                    wordCount: words,
                    botProtection: botProtectionExists,
                    suspiciousTitle,
                    title: document.title
                };
            });
    
            // If bot protection is detected, throw an error
            if (validation.botProtection) {
                throw new Error('Bot protection detected');
            }
    
            // If the page title is suspicious, throw an error
            if (validation.suspiciousTitle) {
                throw new Error(`Suspicious page title detected: "${validation.title}"`);
            }
    
            // If the page contains insufficient content, throw an error
            if (validation.wordCount < 1) {
                throw new Error('Page contains insufficient content');
            }
    
        } catch (error) {
            // If an error occurs during navigation, throw an error with the URL and the error message
            throw new Error(`Navigation to ${url} failed: ${(error as Error).message}`);
        }
    }
  • Core helper for extracting clean markdown content from page: targets main/article, removes nav/ads, converts HTML to MD with link/image preservation.
    async function extractContentAsMarkdown(
        page: Page,        // Playwright page to extract from
        selector?: string  // Optional CSS selector to target specific content
    ): Promise<string> {
        // Step 1: Execute content extraction in browser context
        const html = await page.evaluate((sel) => {
            // Handle case where specific selector is provided
            if (sel) {
                const element = document.querySelector(sel);
                // Return element content or empty string if not found
                return element ? element.outerHTML : '';
            }
    
            // Step 2: Try standard content containers first
            const contentSelectors = [
                'main',           // HTML5 semantic main content
                'article',        // HTML5 semantic article content
                '[role="main"]',  // ARIA main content role
                '#content',       // Common content ID
                '.content',       // Common content class
                '.main',          // Alternative main class
                '.post',          // Blog post content
                '.article',       // Article content container
            ];
    
            // Try each selector in priority order
            for (const contentSelector of contentSelectors) {
                const element = document.querySelector(contentSelector);
                if (element) {
                    return element.outerHTML;  // Return first matching content
                }
            }
    
            // Step 3: Fallback to cleaning full body content
            const body = document.body;
    
            // Define elements to remove for cleaner content
            const elementsToRemove = [
                // Navigation elements
                'header',                    // Page header
                'footer',                    // Page footer
                'nav',                       // Navigation sections
                '[role="navigation"]',       // ARIA navigation elements
    
                // Sidebars and complementary content
                'aside',                     // Sidebar content
                '.sidebar',                  // Sidebar by class
                '[role="complementary"]',    // ARIA complementary content
    
                // Navigation-related elements
                '.nav',                      // Navigation classes
                '.menu',                     // Menu elements
    
                // Page structure elements
                '.header',                   // Header classes
                '.footer',                   // Footer classes
    
                // Advertising and notices
                '.advertisement',            // Advertisement containers
                '.ads',                      // Ad containers
                '.cookie-notice',            // Cookie consent notices
            ];
    
            // Remove each unwanted element from content
            elementsToRemove.forEach(sel => {
                body.querySelectorAll(sel).forEach(el => el.remove());
            });
    
            // Return cleaned body content
            return body.outerHTML;
        }, selector);
    
        // Step 4: Handle empty content case
        if (!html) {
            return '';
        }
    
        try {
            // Step 5: Convert HTML to Markdown
            const markdown = turndownService.turndown(html);
    
            // Step 6: Clean up and format markdown
            return markdown
                .replace(/\n{3,}/g, '\n\n')  // Collapse runs of 3+ newlines into a double newline
                .replace(/^- $/gm, '')       // Remove empty list items
                .replace(/^\s+$/gm, '')      // Remove whitespace-only lines
                .trim();                     // Remove leading/trailing whitespace
    
        } catch (error) {
            // Log conversion errors and return original HTML as fallback
            console.error('Error converting HTML to Markdown:', error);
            return html;
        }
    }
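The handler calls `isValidUrl` and `withRetry`, but neither is shown on this page. The sketch below is one plausible shape for such helpers, not the repository's implementation: the names `isValidUrlSketch` and `withRetrySketch`, the default of three attempts, and the one-second delay are all assumptions.

```typescript
// Hypothetical stand-in for isValidUrl: accepts only http: and https: URLs,
// matching the error message returned by the handler.
function isValidUrlSketch(candidate: string): boolean {
    try {
        const { protocol } = new URL(candidate);
        return protocol === "http:" || protocol === "https:";
    } catch {
        return false; // new URL() throws on input it cannot parse
    }
}

// Hypothetical stand-in for withRetry: reruns an async operation until it
// succeeds or the attempt budget is exhausted, then rethrows the last error.
async function withRetrySketch<T>(
    operation: () => Promise<T>,
    maxAttempts = 3,  // assumed default
    delayMs = 1000    // assumed default
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
            if (attempt < maxAttempts) {
                // Wait before the next attempt
                await new Promise<void>(resolve => setTimeout(resolve, delayMs));
            }
        }
    }
    throw lastError;
}

console.log(isValidUrlSketch("https://example.com"));
console.log(isValidUrlSketch("ftp://example.com"));
```

One design point worth noting: the handler nests one retry wrapper inside another, so with a helper shaped like this a persistently failing extraction would be attempted up to `maxAttempts` squared times before the error surfaces.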

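The Markdown clean-up at the end of `extractContentAsMarkdown` can be exercised on its own. The sketch below copies the same three replacements and the final trim into a standalone function; the name `tidyMarkdown` and the sample input are assumptions made for illustration.

```typescript
// Hypothetical standalone copy of the clean-up chain used after the
// HTML-to-Markdown conversion.
function tidyMarkdown(markdown: string): string {
    return markdown
        .replace(/\n{3,}/g, "\n\n")  // Collapse runs of 3+ newlines into a double newline
        .replace(/^- $/gm, "")       // Remove empty list items
        .replace(/^\s+$/gm, "")      // Remove whitespace-only lines
        .trim();                     // Remove leading/trailing whitespace
}

// Sample input with extra blank lines, an empty list item, and a whitespace-only line.
const messy = "# Title\n\n\n\nText\n- \n   \nMore";
console.log(JSON.stringify(tidyMarkdown(messy)));
```

Printing through `JSON.stringify` makes the remaining newlines visible, which helps when checking how much vertical whitespace survives the clean-up.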
MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/PhialsBasement/mcp-webresearch-stealthified'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.