
read_website

Extract web content and convert it to clean Markdown for reading documentation, analyzing content, and gathering information from websites while preserving links and structure.

Instructions

Fast, token-efficient web content extraction - ideal for reading documentation, analyzing content, and gathering information from websites. Converts to clean Markdown while preserving links and structure.

Input Schema

| Name | Required | Description | Default |
|------|----------|-------------|---------|
| url | Yes | HTTP/HTTPS URL to fetch and convert to markdown | |
| pages | No | Maximum number of pages to crawl | 1 |
| cookiesFile | No | Path to Netscape cookie file for authenticated pages | |
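
For reference, a call that exercises all three parameters could pass arguments like the following sketch; only `url` is required, and the URL and cookie-file path are placeholders:

```typescript
// Illustrative arguments for a read_website call; values are placeholders.
const readWebsiteArgs = {
  url: 'https://example.com/docs/getting-started',
  pages: 3, // crawl the start page plus up to two linked same-origin pages
  cookiesFile: '/home/user/cookies.txt', // Netscape-format cookie file for authenticated pages
};
```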

Implementation Reference

  • Defines the MCP Tool object for 'read_website' including name, description, input schema (url required, optional pages and cookiesFile), and annotations indicating behavior.
```typescript
const READ_WEBSITE_TOOL: Tool = {
  name: 'read_website',
  description:
    'Fast, token-efficient web content extraction - ideal for reading documentation, analyzing content, and gathering information from websites. Converts to clean Markdown while preserving links and structure.',
  inputSchema: {
    type: 'object',
    properties: {
      url: {
        type: 'string',
        description: 'HTTP/HTTPS URL to fetch and convert to markdown',
      },
      pages: {
        type: 'number',
        description: 'Maximum number of pages to crawl (default: 1)',
        default: 1,
        minimum: 1,
        maximum: 100,
      },
      cookiesFile: {
        type: 'string',
        description: 'Path to Netscape cookie file for authenticated pages',
        optional: true,
      },
    },
    required: ['url'],
  },
  annotations: {
    title: 'Read Website',
    readOnlyHint: true, // Only reads content
    destructiveHint: false,
    idempotentHint: true, // Same URL returns same content (with cache)
    openWorldHint: true, // Interacts with external websites
  },
};
```
  • src/serve.ts:104-114 (registration)
    Registers the 'read_website' tool by including it in the ListTools response.
```typescript
server.setRequestHandler(ListToolsRequestSchema, async () => {
  logger.debug('Received ListTools request');
  const response = {
    tools: [READ_WEBSITE_TOOL],
  };
  logger.debug(
    'Returning tools:',
    response.tools.map(t => t.name)
  );
  return response;
});
```
  • MCP server handler for tool calls: dispatches 'read_website' requests, lazy loads core module, calls fetchMarkdown with parameters, formats response as MCP content.
```typescript
server.setRequestHandler(CallToolRequestSchema, async request => {
  logger.info('Received CallTool request:', request.params.name);
  logger.debug('Request params:', JSON.stringify(request.params, null, 2));

  if (request.params.name !== 'read_website') {
    const error = `Unknown tool: ${request.params.name}`;
    logger.error(error);
    throw new Error(error);
  }

  try {
    // Lazy load the module on first use
    if (!fetchMarkdownModule) {
      logger.debug('Lazy loading fetchMarkdown module...');
      fetchMarkdownModule = await import('./internal/fetchMarkdown.js');
      logger.info('fetchMarkdown module loaded successfully');
    }

    const args = request.params.arguments as any;

    // Validate URL
    if (!args.url || typeof args.url !== 'string') {
      throw new Error('URL parameter is required and must be a string');
    }

    logger.info(`Processing read request for URL: ${args.url}`);
    logger.debug('Read parameters:', {
      url: args.url,
      pages: args.pages,
      cookiesFile: args.cookiesFile,
    });

    logger.debug('Calling fetchMarkdown...');
    // Convert pages to depth (pages - 1 = depth)
    // pages: 1 = depth: 0 (single page)
    // pages: 2+ = depth: 1 (crawl one level to get multiple pages)
    const depth = args.pages > 1 ? 1 : 0;

    const result = await fetchMarkdownModule.fetchMarkdown(args.url, {
      depth: depth,
      respectRobots: false, // Default to not respecting robots.txt
      maxPages: args.pages ?? 1,
      cookiesFile: args.cookiesFile,
    });

    logger.info('Content fetched successfully');

    // If there's an error but we still have some content, return it with a note
    if (result.error && result.markdown) {
      return {
        content: [
          {
            type: 'text',
            text: `${result.markdown}\n\n---\n*Note: ${result.error}*`,
          },
        ],
      };
    }

    // If there's an error and no content, throw it
    if (result.error && !result.markdown) {
      throw new Error(result.error);
    }

    return {
      content: [{ type: 'text', text: result.markdown }],
    };
  } catch (error: any) {
    logger.error('Error fetching content:', error.message);
    logger.debug('Error stack:', error.stack);
    logger.debug('Error details:', {
      name: error.name,
      code: error.code,
      ...error,
    });

    // Re-throw with more context
    throw new Error(
      `Failed to fetch content: ${error instanceof Error ? error.message : 'Unknown error'}`
    );
  }
});
```
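
For context, a minimal client-side sketch that would reach this handler might look like the following. It assumes the server is published as @just-every/mcp-read-website-fast and is launched over stdio via npx; the package name, launch command, and URL are assumptions, so adjust them to however the server is actually started.

```typescript
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

// Assumed launch command; substitute your own if the package name differs.
const transport = new StdioClientTransport({
  command: 'npx',
  args: ['-y', '@just-every/mcp-read-website-fast'],
});

const client = new Client(
  { name: 'example-client', version: '1.0.0' },
  { capabilities: {} }
);

await client.connect(transport);

// Invoke the tool; the handler above returns a single text content item
// containing the combined Markdown (placeholder URL).
const result = await client.callTool({
  name: 'read_website',
  arguments: { url: 'https://example.com/docs', pages: 2 },
});

for (const item of result.content as Array<{ type: string; text?: string }>) {
  if (item.type === 'text') console.log(item.text);
}

await client.close();
```
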
  • Core tool handler: iteratively crawls up to maxPages using @just-every/crawl (single depth), extracts same-origin links, combines markdown from multiple pages with sources indicated.
```typescript
export async function fetchMarkdown(
  url: string,
  options: FetchMarkdownOptions = {}
): Promise<FetchMarkdownResult> {
  try {
    const maxPages = options.maxPages ?? 1;
    const visited = new Set<string>();
    const toVisit = [url];
    const allResults: any[] = [];

    // If we want multiple pages, we need to crawl iteratively
    while (toVisit.length > 0 && allResults.length < maxPages) {
      const currentUrl = toVisit.shift()!;

      // Skip if already visited
      if (visited.has(currentUrl)) continue;
      visited.add(currentUrl);

      // Fetch single page
      const crawlOptions: CrawlOptions = {
        depth: 0, // Always single page
        maxConcurrency: options.maxConcurrency ?? 3,
        respectRobots: options.respectRobots ?? true,
        sameOriginOnly: options.sameOriginOnly ?? true,
        userAgent: options.userAgent,
        cacheDir: options.cacheDir ?? '.cache',
        timeout: options.timeout ?? 30000,
      };

      if (options.cookiesFile) {
        (crawlOptions as any).cookiesFile = options.cookiesFile;
      }

      const results = await fetch(currentUrl, crawlOptions);

      if (results && results.length > 0) {
        const result = results[0];
        allResults.push(result);

        // Extract links from markdown if we need more pages
        if (allResults.length < maxPages && result.markdown) {
          const links = extractMarkdownLinks(result.markdown, currentUrl);
          const filteredLinks =
            options.sameOriginOnly !== false
              ? filterSameOriginLinks(links, currentUrl)
              : links;

          // Add new links to visit queue
          for (const link of filteredLinks) {
            if (!visited.has(link) && !toVisit.includes(link)) {
              toVisit.push(link);
            }
          }
        }
      }
    }

    if (allResults.length === 0) {
      return {
        markdown: '',
        error: 'No results returned',
      };
    }

    // Process results as before
    const pagesToReturn = allResults;

    // Combine all pages into a single markdown document
    const combinedMarkdown = pagesToReturn
      .map((result, index) => {
        if (result.error) {
          return `<!-- Error fetching ${result.url}: ${result.error} -->`;
        }

        let pageContent = '';

        // Add page separator for multiple pages
        if (pagesToReturn.length > 1 && index > 0) {
          pageContent += '\n\n---\n\n';
        }

        // Add source URL as a comment
        pageContent += `<!-- Source: ${result.url} -->\n`;

        // Add the content
        pageContent += result.markdown || '';

        return pageContent;
      })
      .join('\n');

    // Return combined results
    return {
      markdown: combinedMarkdown,
      title: pagesToReturn[0].title,
      links: pagesToReturn.flatMap(r => r.links || []),
      error: pagesToReturn.some(r => r.error)
        ? `Some pages had errors: ${pagesToReturn
            .filter(r => r.error)
            .map(r => r.url)
            .join(', ')}`
        : undefined,
    };
  } catch (error) {
    return {
      markdown: '',
      error: error instanceof Error ? error.message : 'Unknown error',
    };
  }
}
```
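
Outside of the MCP server, the same module can be exercised directly. A minimal sketch, assuming it is imported from within this project (the docs URL is a placeholder):

```typescript
import { fetchMarkdown } from './internal/fetchMarkdown.js';

// respectRobots: false mirrors what the CallTool handler passes through;
// maxPages: 3 crawls the start page plus up to two same-origin links.
const result = await fetchMarkdown('https://example.com/docs/', {
  maxPages: 3,
  respectRobots: false,
});

if (result.error) {
  console.error('Partial or failed fetch:', result.error);
}
console.log(result.title);
console.log(result.markdown.slice(0, 500)); // preview the combined Markdown
```
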
  • Utility functions used by fetchMarkdown to extract links from markdown for crawling and filter to same-origin only.
```typescript
/**
 * Extract all HTTP/HTTPS links from markdown content
 * @param markdown The markdown content to extract links from
 * @param baseUrl The base URL to resolve relative links against
 * @returns Array of absolute URLs found in the markdown
 */
export function extractMarkdownLinks(markdown: string, baseUrl: string): string[] {
  const links: string[] = [];

  // Match markdown links: [text](url)
  const markdownLinkRegex = /\[([^\]]+)\]\(([^)]+)\)/g;
  // Match bare URLs
  const bareUrlRegex = /https?:\/\/[^\s<>)\]]+/g;

  // Extract markdown links
  let match;
  while ((match = markdownLinkRegex.exec(markdown)) !== null) {
    const url = match[2];
    if (url && !url.startsWith('#') && !url.startsWith('mailto:') && !url.startsWith('tel:')) {
      links.push(url);
    }
  }

  // Extract bare URLs
  while ((match = bareUrlRegex.exec(markdown)) !== null) {
    links.push(match[0]);
  }

  // Convert relative URLs to absolute
  const absoluteLinks = links
    .map(link => {
      try {
        // If it's already absolute, return as-is
        if (link.startsWith('http://') || link.startsWith('https://')) {
          return link;
        }
        // Otherwise, resolve relative to base URL
        return new URL(link, baseUrl).href;
      } catch {
        // If URL parsing fails, skip this link
        return null;
      }
    })
    .filter(Boolean) as string[];

  // Remove duplicates and return
  return [...new Set(absoluteLinks)];
}

/**
 * Filter links to only include those from the same origin
 * @param links Array of URLs to filter
 * @param baseUrl The base URL to compare against
 * @returns Filtered array of URLs from the same origin
 */
export function filterSameOriginLinks(links: string[], baseUrl: string): string[] {
  try {
    const baseOrigin = new URL(baseUrl).origin;
    return links.filter(link => {
      try {
        return new URL(link).origin === baseOrigin;
      } catch {
        return false;
      }
    });
  } catch {
    return [];
  }
}
```
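
As a quick illustration of how these two helpers interact during crawling, a sketch with made-up markdown and URLs (the import path is hypothetical, since the module that exports the helpers isn't named in the excerpt above):

```typescript
// Import path is illustrative only.
import { extractMarkdownLinks, filterSameOriginLinks } from './internal/links.js';

const markdown = [
  'Read the [guide](/docs/guide) or the [API reference](https://example.com/api).',
  'An unrelated site: https://other.example.org/changelog',
].join('\n');

const base = 'https://example.com/docs/';

const all = extractMarkdownLinks(markdown, base);
// The relative link is resolved against `base`, so `all` contains:
//   https://example.com/docs/guide, https://example.com/api,
//   https://other.example.org/changelog

const sameOrigin = filterSameOriginLinks(all, base);
// Only the https://example.com URLs survive; the other origin is dropped.
console.log(sameOrigin);
```
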

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/just-every/mcp-read-website-fast'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.