fetch_doc_content

Retrieves web page content and explores linked pages up to a specified depth, extracting documentation text in a form suited to LLM consumption.

Instructions

Fetch web page content with the ability to explore linked pages up to a specified depth

Input Schema

Name  | Required | Description                                 | Default
----- | -------- | ------------------------------------------- | -------
depth | No       | Maximum depth of directory/link exploration | 1
url   | Yes      | URL of the web page to fetch                | -

Input Schema (JSON Schema)

{ "properties": { "depth": { "description": "Maximum depth of directory/link exploration (default: 1)", "maximum": 5, "minimum": 1, "type": "number" }, "url": { "description": "URL of the web page to fetch", "type": "string" } }, "required": [ "url" ], "type": "object" }

Implementation Reference

  • Dispatch and execution wrapper for the fetch_doc_content tool: extracts parameters, implements timeout handling, calls the core fetchContent method, formats response as MCP content block, and handles errors.
    if (request.params.name === 'fetch_doc_content') {
      const { url, depth = 1 } = request.params.arguments as {
        url: string;
        depth?: number;
      };

      // Reset visited URLs for each new request
      this.visitedUrls.clear();

      // Setup global timeout
      let timeoutError: Error | null = null;
      const timeoutPromise = new Promise<never>((_, reject) => {
        this.globalTimeout = setTimeout(() => {
          timeoutError = new Error('Operation timed out');
          reject(timeoutError);
        }, this.timeoutDuration);
      });

      try {
        // Race the content fetch against our timeout
        const result = await Promise.race([
          this.fetchContent(url, depth),
          timeoutPromise
        ]);

        // Clear timeout if we didn't hit it
        if (this.globalTimeout) {
          clearTimeout(this.globalTimeout);
          this.globalTimeout = null;
        }

        return {
          content: [
            {
              type: 'text',
              text: JSON.stringify(result, null, 2),
            },
          ],
        };
      } catch (error: unknown) {
        // Clear timeout if we hit an error
        if (this.globalTimeout) {
          clearTimeout(this.globalTimeout);
          this.globalTimeout = null;
        }

        let errorMessage = "Unknown error occurred";
        if (error === timeoutError) {
          errorMessage = "Operation timed out before completion";
        } else if (error instanceof Error) {
          errorMessage = error.message;
        } else if (typeof error === 'string') {
          errorMessage = error;
        }

        return {
          content: [
            {
              type: 'text',
              text: `Error fetching content: ${errorMessage}`,
            },
          ],
          isError: true,
        };
      }
    }
  • Core implementation of the tool logic: fetches page content, preferring axios for speed and falling back to Puppeteer for JS-heavy pages; extracts clean text and same-domain links; supports recursive exploration up to maxDepth; and tracks visited URLs to prevent cycles. The shape of the object it returns is sketched after this list.
    private async fetchContent(url: string, maxDepth: number): Promise<any> {
      // Track visited URLs to avoid cycles
      this.visitedUrls.add(url);

      try {
        // First try a simple fetch with axios as a faster alternative
        try {
          const response = await axios.get(url, {
            timeout: 10000,
            headers: {
              'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
          });

          // If we got HTML content, use it
          if (response.status === 200 && response.data && typeof response.data === 'string') {
            const mainContent = this.extractTextContent(response.data);
            const links = this.extractLinks(response.data, url);

            // For depth=1, just return the main content
            if (maxDepth <= 1) {
              return {
                rootUrl: url,
                explorationDepth: maxDepth,
                pagesExplored: 1,
                content: [{
                  url,
                  content: mainContent,
                  links: links.slice(0, 10) // Limit to top 10 links
                }]
              };
            }

            // For depth > 1, explore child links
            const childResults = [];
            const limit = Math.min(5, links.length); // Limit to 5 links for performance
            for (let i = 0; i < limit; i++) {
              const link = links[i];
              if (!this.visitedUrls.has(link.url)) {
                try {
                  const childContent = await this.fetchSimpleContent(link.url);
                  if (childContent) {
                    childResults.push({
                      url: link.url,
                      content: childContent,
                      links: [] // Don't include links for child pages
                    });
                    this.visitedUrls.add(link.url);
                  }
                } catch (e) {
                  // Ignore errors on child pages
                }
              }
            }

            return {
              rootUrl: url,
              explorationDepth: maxDepth,
              pagesExplored: 1 + childResults.length,
              content: [{
                url,
                content: mainContent,
                links: links.slice(0, 10)
              }, ...childResults]
            };
          }
        } catch (e) {
          // Fall back to puppeteer if axios fails
          console.error('Axios fetch failed, falling back to puppeteer:', e);
        }

        // Fall back to puppeteer for more complex pages
        return await this.fetchWithPuppeteer(url, maxDepth);
      } catch (error) {
        console.error('Error in content fetch:', error);

        // Return partial results if we have visited any URLs
        if (this.visitedUrls.size > 0) {
          return {
            rootUrl: url,
            explorationDepth: maxDepth,
            pagesExplored: this.visitedUrls.size,
            content: [],
            error: error instanceof Error ? error.message : String(error)
          };
        }
        throw error;
      }
    }
  • Input schema definition for the fetch_doc_content tool, registered in the ListToolsRequestSchema handler. Defines the required 'url' string and an optional 'depth' number (1-5). A sketch of the handler registration follows this list.
    {
      name: 'fetch_doc_content',
      description: 'Fetch web page content with the ability to explore linked pages up to a specified depth',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'URL of the web page to fetch',
          },
          depth: {
            type: 'number',
            description: 'Maximum depth of directory/link exploration (default: 1)',
            minimum: 1,
            maximum: 5,
          },
        },
        required: ['url'],
      },
    },
  • Helper function to extract plain-text content from HTML by stripping tags, scripts, and styles, normalizing whitespace, and truncating overly long content.
    private extractTextContent(html: string): string {
      // Very basic HTML to text conversion
      let text = html
        .replace(/<head>[\s\S]*?<\/head>/i, '')
        .replace(/<script[\s\S]*?<\/script>/gi, '')
        .replace(/<style[\s\S]*?<\/style>/gi, '')
        .replace(/<[^>]*>/g, ' ')
        .replace(/\s+/g, ' ')
        .trim();

      // Truncate if too long
      if (text.length > 10000) {
        text = text.substring(0, 10000) + '... (content truncated)';
      }

      return text;
    }
  • Helper function to parse HTML for anchor tags, resolve relative hrefs to absolute same-domain URLs, filter out invalid, fragment, mailto, and tel links, and associate each URL with its link text. A usage demo for both helpers follows this list.
    private extractLinks(html: string, baseUrl: string): Array<{url: string, text: string}> {
      const links: Array<{url: string, text: string}> = [];
      const linkRegex = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"(?:\s+[^>]*?)?>([^<]*)<\/a>/gi;
      let match;

      while ((match = linkRegex.exec(html)) !== null) {
        const href = match[1].trim();
        const text = match[2].trim();

        // Skip empty links or special protocols
        if (!href || href.startsWith('#') || href.startsWith('javascript:') ||
            href.startsWith('mailto:') || href.startsWith('tel:')) {
          continue;
        }

        try {
          // Resolve relative URLs
          const fullUrl = new URL(href, baseUrl).href;

          // Only include links from same domain
          if (new URL(fullUrl).hostname === new URL(baseUrl).hostname) {
            links.push({ url: fullUrl, text: text || fullUrl });
          }
        } catch (e) {
          // Skip invalid URLs
        }
      }

      return links;
    }
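
For reference, the object assembled by fetchContent (and serialized into the tool's text response, as shown in the first snippet) has roughly the following shape. The interface names below are illustrative only; the source returns plain object literals typed as Promise<any>:

interface PageContent {
  url: string;
  // Extracted plain text, truncated at 10,000 characters
  content: string;
  // Same-domain links (top 10 for the root page, empty for child pages)
  links: Array<{ url: string; text: string }>;
}

interface FetchResult {
  rootUrl: string;           // the URL originally requested
  explorationDepth: number;  // the maxDepth that was requested
  pagesExplored: number;     // 1 at depth <= 1; otherwise 1 + fetched child pages
  content: PageContent[];    // root page first, then any child pages
  error?: string;            // set only when partial results follow a failure
}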
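
The input schema above is advertised from the server's list-tools handler. The wiring follows the standard MCP SDK pattern and looks roughly like the sketch below; this is not a verbatim excerpt from the server, and fetchDocContentTool stands in for the tool definition object shown earlier:

import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import {
  ListToolsRequestSchema,
  CallToolRequestSchema,
  Tool,
} from '@modelcontextprotocol/sdk/types.js';

// Stand-in for the tool definition object shown earlier in this list
declare const fetchDocContentTool: Tool;

const server = new Server(
  { name: 'docs-fetch-mcp', version: '0.1.0' },  // name/version are placeholders
  { capabilities: { tools: {} } }
);

// Advertise fetch_doc_content to clients
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [fetchDocContentTool],
}));

// Route incoming calls into the dispatch wrapper shown in the first snippet
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'fetch_doc_content') {
    // ... the dispatch wrapper shown in the first snippet runs here
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});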
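
To make the two helper functions concrete, here is a hypothetical input and the values they would produce. The helpers are private methods in the source; they are shown as directly callable here purely for illustration:

const html = `
  <html>
    <head><title>Docs</title></head>
    <body>
      <script>trackPageView();</script>
      <h1>Getting Started</h1>
      <p>See the <a href="/guide">guide</a> or
      <a href="https://other.example.org/x">an external site</a></p>
      <a href="#top">Back to top</a>
    </body>
  </html>`;

extractTextContent(html);
// => "Getting Started See the guide or an external site Back to top"
//    (head, script, and all tags stripped; whitespace normalized)

extractLinks(html, 'https://example.com/docs');
// => [{ url: 'https://example.com/guide', text: 'guide' }]
//    The external link is dropped (different hostname) and '#top' is skipped.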

MCP directory API

We provide all the information about MCP servers via our MCP directory API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/wolfyy970/docs-fetch-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.