RAG Documentation MCP Server

by jumasheff

extract_urls

Extract and analyze all URLs from a webpage to discover related documentation pages, API references, or build a documentation graph. Handles various URL formats and validates links during extraction.

Instructions

Extract and analyze all URLs from a given web page. This tool crawls the specified webpage, identifies all hyperlinks, and optionally adds them to the processing queue. Useful for discovering related documentation pages, API references, or building a documentation graph. Handles various URL formats and validates links before extraction.

Input Schema

  • url (required): The complete URL of the webpage to analyze (must include protocol, e.g., https://). The page must be publicly accessible.
  • add_to_queue (optional, default: false): If true, automatically add extracted URLs to the processing queue for later indexing. This enables recursive documentation discovery. Use with caution on large sites to avoid excessive queuing.
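
For orientation, here is a minimal sketch of calling this tool from an MCP TypeScript client. It assumes the @modelcontextprotocol/sdk client connected to this server over stdio; the launch command and the documentation URL are illustrative placeholders, not taken from this repository.

    import { Client } from '@modelcontextprotocol/sdk/client/index.js';
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

    // Start the server as a child process over stdio (the build path is a placeholder).
    const transport = new StdioClientTransport({ command: 'node', args: ['build/index.js'] });
    const client = new Client({ name: 'example-client', version: '0.1.0' });
    await client.connect(transport);

    // Call extract_urls with arguments matching the input schema above.
    const result = await client.callTool({
      name: 'extract_urls',
      arguments: {
        url: 'https://docs.python.org/3/library/index.html',
        add_to_queue: false,
      },
    });

    // With add_to_queue false the response text is a newline-separated URL list;
    // with add_to_queue true it instead reports how many URLs were queued.
    console.log(result.content);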

Implementation Reference

  • The ExtractUrlsHandler class implements the core execution logic for the 'extract_urls' tool. It uses browser automation to load the page and cheerio to parse the HTML, extracts every hyperlink, filters the links to the same host and base path (e.g., a single documentation section), and optionally appends them to a queue file.
    export class ExtractUrlsHandler extends BaseHandler {
      async handle(args: any): Promise<McpToolResponse> {
        if (!args.url || typeof args.url !== 'string') {
          throw new McpError(ErrorCode.InvalidParams, 'URL is required');
        }
    
        await this.apiClient.initBrowser();
        const page = await this.apiClient.browser.newPage();
    
        try {
          const baseUrl = new URL(args.url);
          const basePath = baseUrl.pathname.split('/').slice(0, 3).join('/'); // Get the base path (e.g., /3/ for Python docs)
    
          await page.goto(args.url, { waitUntil: 'networkidle' });
          const content = await page.content();
          const $ = cheerio.load(content);
          const urls = new Set<string>();
    
          $('a[href]').each((_, element) => {
            const href = $(element).attr('href');
            if (href) {
              try {
                const url = new URL(href, args.url);
                // Only include URLs from the same documentation section
                if (url.hostname === baseUrl.hostname && 
                    url.pathname.startsWith(basePath) && 
                    !url.hash && 
                    !url.href.endsWith('#')) {
                  urls.add(url.href);
                }
              } catch (e) {
                // Ignore invalid URLs
              }
            }
          });
    
          const urlArray = Array.from(urls);
    
          if (args.add_to_queue) {
            try {
              // Ensure queue file exists
              try {
                await fs.access(QUEUE_FILE);
              } catch {
                await fs.writeFile(QUEUE_FILE, '');
              }
    
              // Append URLs to queue
              const urlsToAdd = urlArray.join('\n') + (urlArray.length > 0 ? '\n' : '');
              await fs.appendFile(QUEUE_FILE, urlsToAdd);
    
              return {
                content: [
                  {
                    type: 'text',
                    text: `Successfully added ${urlArray.length} URLs to the queue`,
                  },
                ],
              };
            } catch (error) {
              return {
                content: [
                  {
                    type: 'text',
                    text: `Failed to add URLs to queue: ${error}`,
                  },
                ],
                isError: true,
              };
            }
          }
    
          return {
            content: [
              {
                type: 'text',
                text: urlArray.join('\n') || 'No URLs found on this page.',
              },
            ],
          };
        } catch (error) {
          return {
            content: [
              {
                type: 'text',
                text: `Failed to extract URLs: ${error}`,
              },
            ],
            isError: true,
          };
        } finally {
          await page.close();
        }
      }
    }
  • The tool definition, including name, description, and input schema (url: required string; add_to_queue: optional boolean, default false), returned when listing tools via the MCP protocol.
    {
      name: 'extract_urls',
      description: 'Extract and analyze all URLs from a given web page. This tool crawls the specified webpage, identifies all hyperlinks, and optionally adds them to the processing queue. Useful for discovering related documentation pages, API references, or building a documentation graph. Handles various URL formats and validates links before extraction.',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'The complete URL of the webpage to analyze (must include protocol, e.g., https://). The page must be publicly accessible.',
          },
          add_to_queue: {
            type: 'boolean',
            description: 'If true, automatically add extracted URLs to the processing queue for later indexing. This enables recursive documentation discovery. Use with caution on large sites to avoid excessive queuing.',
            default: false,
          },
        },
        required: ['url'],
      },
    } as ToolDefinition,
  • Registers the ExtractUrlsHandler instance in the handlers Map under the key 'extract_urls' during setup.
    this.handlers.set('extract_urls', new ExtractUrlsHandler(this.server, this.apiClient));
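
To show how the registration above is typically consumed, here is a hedged sketch of the request wiring in an MCP server built with @modelcontextprotocol/sdk. The identifiers server, toolDefinitions, and handlers are assumptions based on the snippets above rather than code copied from this repository.

    // Hedged sketch of the surrounding wiring; `server`, `toolDefinitions`, and `handlers`
    // are assumed from the registration shown above, not copied from this repository.
    import {
      CallToolRequestSchema,
      ListToolsRequestSchema,
      ErrorCode,
      McpError,
    } from '@modelcontextprotocol/sdk/types.js';

    // Expose every registered tool definition (including extract_urls) to tools/list.
    server.setRequestHandler(ListToolsRequestSchema, async () => ({
      tools: toolDefinitions,
    }));

    // Route tools/call requests to the matching handler in the Map; 'extract_urls'
    // resolves to the ExtractUrlsHandler registered above.
    server.setRequestHandler(CallToolRequestSchema, async (request) => {
      const handler = handlers.get(request.params.name);
      if (!handler) {
        throw new McpError(ErrorCode.MethodNotFound, `Unknown tool: ${request.params.name}`);
      }
      return handler.handle(request.params.arguments);
    });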

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/jumasheff/mcp-ragdoc-fork'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.