read_website

Extract web content and convert it to clean Markdown for reading documentation, analyzing content, and gathering information from websites while preserving links and structure.

Instructions

Fast, token-efficient web content extraction - ideal for reading documentation, analyzing content, and gathering information from websites. Converts to clean Markdown while preserving links and structure.

Input Schema

| Name        | Required | Description                                          | Default |
| ----------- | -------- | ---------------------------------------------------- | ------- |
| url         | Yes      | HTTP/HTTPS URL to fetch and convert to markdown      |         |
| pages       | No       | Maximum number of pages to crawl (1–100)             | 1       |
| cookiesFile | No       | Path to Netscape cookie file for authenticated pages |         |
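
For illustration, a hypothetical set of arguments conforming to this schema; the URLs and cookie path are placeholders, and only url is required:

    // Minimal call: fetch a single page
    const minimalArgs = {
        url: 'https://example.com/docs',
    };

    // Fuller call: crawl up to 5 same-origin pages using saved cookies
    const fullArgs = {
        url: 'https://example.com/docs',
        pages: 5,
        cookiesFile: '/path/to/cookies.txt',
    };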

Implementation Reference

  • Defines the MCP Tool object for 'read_website' including name, description, input schema (url required, optional pages and cookiesFile), and annotations indicating behavior.
    const READ_WEBSITE_TOOL: Tool = {
        name: 'read_website',
        description:
            'Fast, token-efficient web content extraction - ideal for reading documentation, analyzing content, and gathering information from websites. Converts to clean Markdown while preserving links and structure.',
        inputSchema: {
            type: 'object',
            properties: {
                url: {
                    type: 'string',
                    description: 'HTTP/HTTPS URL to fetch and convert to markdown',
                },
                pages: {
                    type: 'number',
                    description: 'Maximum number of pages to crawl (default: 1)',
                    default: 1,
                    minimum: 1,
                    maximum: 100,
                },
                cookiesFile: {
                    type: 'string',
                    description: 'Path to Netscape cookie file for authenticated pages',
                    optional: true, // nonstandard key: JSON Schema marks optionality via the 'required' array
                },
            },
            required: ['url'],
        },
        annotations: {
            title: 'Read Website',
            readOnlyHint: true, // Only reads content
            destructiveHint: false,
            idempotentHint: true, // Same URL returns same content (with cache)
            openWorldHint: true, // Interacts with external websites
        },
    };
  • src/serve.ts:104-114 (registration)
    Registers the 'read_website' tool by including it in the ListTools response.
    server.setRequestHandler(ListToolsRequestSchema, async () => {
        logger.debug('Received ListTools request');
        const response = {
            tools: [READ_WEBSITE_TOOL],
        };
        logger.debug(
            'Returning tools:',
            response.tools.map(t => t.name)
        );
        return response;
    });
  • MCP server handler for tool calls: dispatches 'read_website' requests, lazy loads core module, calls fetchMarkdown with parameters, formats response as MCP content.
    server.setRequestHandler(CallToolRequestSchema, async request => {
        logger.info('Received CallTool request:', request.params.name);
        logger.debug('Request params:', JSON.stringify(request.params, null, 2));
    
        if (request.params.name !== 'read_website') {
            const error = `Unknown tool: ${request.params.name}`;
            logger.error(error);
            throw new Error(error);
        }
    
        try {
            // Lazy load the module on first use
            if (!fetchMarkdownModule) {
                logger.debug('Lazy loading fetchMarkdown module...');
                fetchMarkdownModule = await import('./internal/fetchMarkdown.js');
                logger.info('fetchMarkdown module loaded successfully');
            }
    
            const args = request.params.arguments as any;
    
            // Validate URL
            if (!args.url || typeof args.url !== 'string') {
                throw new Error('URL parameter is required and must be a string');
            }
    
            logger.info(`Processing read request for URL: ${args.url}`);
            logger.debug('Read parameters:', {
                url: args.url,
                pages: args.pages,
                cookiesFile: args.cookiesFile,
            });
    
            logger.debug('Calling fetchMarkdown...');
            
            // Convert pages to depth (pages - 1 = depth)
            // pages: 1 = depth: 0 (single page)
            // pages: 2+ = depth: 1 (crawl one level to get multiple pages)
            const depth = args.pages > 1 ? 1 : 0;
            
            const result = await fetchMarkdownModule.fetchMarkdown(args.url, {
                depth: depth,
                respectRobots: false,  // Default to not respecting robots.txt
                maxPages: args.pages ?? 1,
                cookiesFile: args.cookiesFile,
            });
            logger.info('Content fetched successfully');
    
            // If there's an error but we still have some content, return it with a note
            if (result.error && result.markdown) {
                return {
                    content: [
                        {
                            type: 'text',
                            text: `${result.markdown}\n\n---\n*Note: ${result.error}*`,
                        },
                    ],
                };
            }
    
            // If there's an error and no content, throw it
            if (result.error && !result.markdown) {
                throw new Error(result.error);
            }
    
            return {
                content: [{ type: 'text', text: result.markdown }],
            };
        } catch (error: any) {
            logger.error('Error fetching content:', error.message);
            logger.debug('Error stack:', error.stack);
            logger.debug('Error details:', {
                name: error.name,
                code: error.code,
                ...error,
            });
    
            // Re-throw with more context
            throw new Error(
                `Failed to fetch content: ${error instanceof Error ? error.message : 'Unknown error'}`
            );
        }
    });
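    A minimal client-side sketch of invoking this handler over stdio. This assumes the @modelcontextprotocol/sdk client API; the server entry point (dist/serve.js) and the URL are assumptions for illustration.

    import { Client } from '@modelcontextprotocol/sdk/client/index.js';
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

    // Spawn the MCP server as a child process over stdio (entry point assumed)
    const transport = new StdioClientTransport({
        command: 'node',
        args: ['dist/serve.js'],
    });
    const client = new Client({ name: 'example-client', version: '1.0.0' }, { capabilities: {} });
    await client.connect(transport);

    // This request is routed to the CallTool handler shown above
    const result = await client.callTool({
        name: 'read_website',
        arguments: { url: 'https://example.com/docs', pages: 2 },
    });
    console.log(result.content);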
  • Core tool handler: iteratively crawls up to maxPages pages using @just-every/crawl (each fetch at depth 0), extracts same-origin links from each page to queue further pages, and combines the Markdown from all pages with their source URLs indicated.
    export async function fetchMarkdown(
        url: string,
        options: FetchMarkdownOptions = {}
    ): Promise<FetchMarkdownResult> {
        try {
            const maxPages = options.maxPages ?? 1;
            const visited = new Set<string>();
            const toVisit = [url];
            const allResults: any[] = [];
            
            // If we want multiple pages, we need to crawl iteratively
            while (toVisit.length > 0 && allResults.length < maxPages) {
                const currentUrl = toVisit.shift()!;
                
                // Skip if already visited
                if (visited.has(currentUrl)) continue;
                visited.add(currentUrl);
                
                // Fetch single page
                const crawlOptions: CrawlOptions = {
                    depth: 0, // Always single page
                    maxConcurrency: options.maxConcurrency ?? 3,
                    respectRobots: options.respectRobots ?? true,
                    sameOriginOnly: options.sameOriginOnly ?? true,
                    userAgent: options.userAgent,
                    cacheDir: options.cacheDir ?? '.cache',
                    timeout: options.timeout ?? 30000,
                };
                if (options.cookiesFile) {
                    (crawlOptions as any).cookiesFile = options.cookiesFile;
                }
    
                // 'fetch' here is the crawler from @just-every/crawl, not the global fetch API
                const results = await fetch(currentUrl, crawlOptions);
                
                if (results && results.length > 0) {
                    const result = results[0];
                    allResults.push(result);
                    
                    // Extract links from markdown if we need more pages
                    if (allResults.length < maxPages && result.markdown) {
                        const links = extractMarkdownLinks(result.markdown, currentUrl);
                        const filteredLinks = options.sameOriginOnly !== false 
                            ? filterSameOriginLinks(links, currentUrl)
                            : links;
                        
                        // Add new links to visit queue
                        for (const link of filteredLinks) {
                            if (!visited.has(link) && !toVisit.includes(link)) {
                                toVisit.push(link);
                            }
                        }
                    }
                }
            }
            
            if (allResults.length === 0) {
                return {
                    markdown: '',
                    error: 'No results returned',
                };
            }
    
            // Combine all fetched pages into a single result below
            const pagesToReturn = allResults;
    
            // Combine all pages into a single markdown document
            const combinedMarkdown = pagesToReturn
                .map((result, index) => {
                    if (result.error) {
                        return `<!-- Error fetching ${result.url}: ${result.error} -->`;
                    }
                    
                    let pageContent = '';
                    
                    // Add page separator for multiple pages
                    if (pagesToReturn.length > 1 && index > 0) {
                        pageContent += '\n\n---\n\n';
                    }
                    
                    // Add source URL as a comment
                    pageContent += `<!-- Source: ${result.url} -->\n`;
                    
                    // Add the content
                    pageContent += result.markdown || '';
                    
                    return pageContent;
                })
                .join('\n');
    
            // Return combined results
            return {
                markdown: combinedMarkdown,
                title: pagesToReturn[0].title,
                links: pagesToReturn.flatMap(r => r.links || []),
                error: pagesToReturn.some(r => r.error) 
                    ? `Some pages had errors: ${pagesToReturn.filter(r => r.error).map(r => r.url).join(', ')}`
                    : undefined,
            };
        } catch (error) {
            return {
                markdown: '',
                error: error instanceof Error ? error.message : 'Unknown error',
            };
        }
    }
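    A hedged usage sketch of calling fetchMarkdown directly; the import path mirrors the lazy-load path in the handler above, and the URL is a placeholder.

    import { fetchMarkdown } from './internal/fetchMarkdown.js';

    // Crawl up to three same-origin pages starting from the given URL
    const result = await fetchMarkdown('https://example.com/docs', {
        depth: 1,
        maxPages: 3,
    });

    if (result.error) {
        // Partial failures still return markdown; total failures leave it empty
        console.error(result.error);
    }
    console.log(result.title);
    console.log(result.markdown);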
  • Utility functions used by fetchMarkdown to extract links from markdown for crawling and filter to same-origin only.
    /**
     * Extract all HTTP/HTTPS links from markdown content
     * @param markdown The markdown content to extract links from
     * @param baseUrl The base URL to resolve relative links against
     * @returns Array of absolute URLs found in the markdown
     */
    export function extractMarkdownLinks(markdown: string, baseUrl: string): string[] {
        const links: string[] = [];
        
        // Match markdown links: [text](url)
        const markdownLinkRegex = /\[([^\]]+)\]\(([^)]+)\)/g;
        
        // Match bare URLs
        const bareUrlRegex = /https?:\/\/[^\s<>)\]]+/g;
        
        // Extract markdown links
        let match;
        while ((match = markdownLinkRegex.exec(markdown)) !== null) {
            const url = match[2];
            if (url && !url.startsWith('#') && !url.startsWith('mailto:') && !url.startsWith('tel:')) {
                links.push(url);
            }
        }
        
        // Extract bare URLs
        while ((match = bareUrlRegex.exec(markdown)) !== null) {
            links.push(match[0]);
        }
        
        // Convert relative URLs to absolute
        const absoluteLinks = links.map(link => {
            try {
                // If it's already absolute, return as-is
                if (link.startsWith('http://') || link.startsWith('https://')) {
                    return link;
                }
                // Otherwise, resolve relative to base URL
                return new URL(link, baseUrl).href;
            } catch {
                // If URL parsing fails, skip this link
                return null;
            }
        }).filter(Boolean) as string[];
        
        // Remove duplicates and return
        return [...new Set(absoluteLinks)];
    }
    
    /**
     * Filter links to only include those from the same origin
     * @param links Array of URLs to filter
     * @param baseUrl The base URL to compare against
     * @returns Filtered array of URLs from the same origin
     */
    export function filterSameOriginLinks(links: string[], baseUrl: string): string[] {
        try {
            const baseOrigin = new URL(baseUrl).origin;
            return links.filter(link => {
                try {
                    return new URL(link).origin === baseOrigin;
                } catch {
                    return false;
                }
            });
        } catch {
            return [];
        }
    }
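    For illustration, a small sketch of how these helpers compose during crawling; the sample markdown and URLs are invented.

    const markdown = [
        'See the [guide](/guide) and the [API](https://example.com/api).',
        'External: https://other.example.org/page',
    ].join('\n');

    const base = 'https://example.com/docs';

    const links = extractMarkdownLinks(markdown, base);
    // ['https://example.com/guide', 'https://example.com/api',
    //  'https://other.example.org/page']

    const sameOrigin = filterSameOriginLinks(links, base);
    // ['https://example.com/guide', 'https://example.com/api']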