RAG Documentation MCP Server

by jumasheff

extract_urls

Extract and analyze all URLs from a webpage to discover related documentation pages, API references, or build a documentation graph. Handles various URL formats and validates links during extraction.

Instructions

Extract and analyze all URLs from a given web page. This tool crawls the specified webpage, identifies all hyperlinks, and optionally adds them to the processing queue. Useful for discovering related documentation pages, API references, or building a documentation graph. Handles various URL formats and validates links before extraction.

Input Schema

  • url (required): The complete URL of the webpage to analyze (must include protocol, e.g., https://). The page must be publicly accessible.
  • add_to_queue (optional, default: false): If true, automatically add extracted URLs to the processing queue for later indexing. This enables recursive documentation discovery. Use with caution on large sites to avoid excessive queuing.
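
For orientation, here is a minimal sketch of calling this tool from an MCP TypeScript client. It assumes the @modelcontextprotocol/sdk client connected to this server over stdio; the launch command and the documentation URL are illustrative placeholders, not taken from this repository.

    import { Client } from '@modelcontextprotocol/sdk/client/index.js';
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

    // Start the server as a child process over stdio (the build path is a placeholder).
    const transport = new StdioClientTransport({ command: 'node', args: ['build/index.js'] });
    const client = new Client({ name: 'example-client', version: '0.1.0' });
    await client.connect(transport);

    // Call extract_urls with arguments matching the input schema above.
    const result = await client.callTool({
      name: 'extract_urls',
      arguments: {
        url: 'https://docs.python.org/3/library/index.html',
        add_to_queue: false,
      },
    });

    // With add_to_queue false the response text is a newline-separated URL list;
    // with add_to_queue true it instead reports how many URLs were queued.
    console.log(result.content);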

Implementation Reference

  • The ExtractUrlsHandler class implements the core execution logic for the 'extract_urls' tool. It uses browser automation to load the page and cheerio to parse the HTML, extracts every hyperlink, filters the links to the same host and base path (e.g., a single documentation section), and optionally appends them to a queue file.
    export class ExtractUrlsHandler extends BaseHandler {
      async handle(args: any): Promise<McpToolResponse> {
        if (!args.url || typeof args.url !== 'string') {
          throw new McpError(ErrorCode.InvalidParams, 'URL is required');
        }
    
        await this.apiClient.initBrowser();
        const page = await this.apiClient.browser.newPage();
    
        try {
          const baseUrl = new URL(args.url);
          const basePath = baseUrl.pathname.split('/').slice(0, 3).join('/'); // Get the base path (e.g., /3/ for Python docs)
    
          await page.goto(args.url, { waitUntil: 'networkidle' });
          const content = await page.content();
          const $ = cheerio.load(content);
          const urls = new Set<string>();
    
          $('a[href]').each((_, element) => {
            const href = $(element).attr('href');
            if (href) {
              try {
                const url = new URL(href, args.url);
                // Only include URLs from the same documentation section
                if (url.hostname === baseUrl.hostname && 
                    url.pathname.startsWith(basePath) && 
                    !url.hash && 
                    !url.href.endsWith('#')) {
                  urls.add(url.href);
                }
              } catch (e) {
                // Ignore invalid URLs
              }
            }
          });
    
          const urlArray = Array.from(urls);
    
          if (args.add_to_queue) {
            try {
              // Ensure queue file exists
              try {
                await fs.access(QUEUE_FILE);
              } catch {
                await fs.writeFile(QUEUE_FILE, '');
              }
    
              // Append URLs to queue
              const urlsToAdd = urlArray.join('\n') + (urlArray.length > 0 ? '\n' : '');
              await fs.appendFile(QUEUE_FILE, urlsToAdd);
    
              return {
                content: [
                  {
                    type: 'text',
                    text: `Successfully added ${urlArray.length} URLs to the queue`,
                  },
                ],
              };
            } catch (error) {
              return {
                content: [
                  {
                    type: 'text',
                    text: `Failed to add URLs to queue: ${error}`,
                  },
                ],
                isError: true,
              };
            }
          }
    
          return {
            content: [
              {
                type: 'text',
                text: urlArray.join('\n') || 'No URLs found on this page.',
              },
            ],
          };
        } catch (error) {
          return {
            content: [
              {
                type: 'text',
                text: `Failed to extract URLs: ${error}`,
              },
            ],
            isError: true,
          };
        } finally {
          await page.close();
        }
      }
    }
  • The tool definition, including name, description, and input schema (url: required string; add_to_queue: optional boolean, default false), returned when listing tools via the MCP protocol.
    {
      name: 'extract_urls',
      description: 'Extract and analyze all URLs from a given web page. This tool crawls the specified webpage, identifies all hyperlinks, and optionally adds them to the processing queue. Useful for discovering related documentation pages, API references, or building a documentation graph. Handles various URL formats and validates links before extraction.',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'The complete URL of the webpage to analyze (must include protocol, e.g., https://). The page must be publicly accessible.',
          },
          add_to_queue: {
            type: 'boolean',
            description: 'If true, automatically add extracted URLs to the processing queue for later indexing. This enables recursive documentation discovery. Use with caution on large sites to avoid excessive queuing.',
            default: false,
          },
        },
        required: ['url'],
      },
    } as ToolDefinition,
  • Registers the ExtractUrlsHandler instance in the handlers Map under the key 'extract_urls' during setup.
    this.handlers.set('extract_urls', new ExtractUrlsHandler(this.server, this.apiClient));
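
To show how the registration above is typically consumed, here is a hedged sketch of the request wiring in an MCP server built with @modelcontextprotocol/sdk. The identifiers server, toolDefinitions, and handlers are assumptions based on the snippets above rather than code copied from this repository.

    // Hedged sketch of the surrounding wiring; `server`, `toolDefinitions`, and `handlers`
    // are assumed from the registration shown above, not copied from this repository.
    import {
      CallToolRequestSchema,
      ListToolsRequestSchema,
      ErrorCode,
      McpError,
    } from '@modelcontextprotocol/sdk/types.js';

    // Expose every registered tool definition (including extract_urls) to tools/list.
    server.setRequestHandler(ListToolsRequestSchema, async () => ({
      tools: toolDefinitions,
    }));

    // Route tools/call requests to the matching handler in the Map; 'extract_urls'
    // resolves to the ExtractUrlsHandler registered above.
    server.setRequestHandler(CallToolRequestSchema, async (request) => {
      const handler = handlers.get(request.params.name);
      if (!handler) {
        throw new McpError(ErrorCode.MethodNotFound, `Unknown tool: ${request.params.name}`);
      }
      return handler.handle(request.params.arguments);
    });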

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/jumasheff/mcp-ragdoc-fork'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.