extract_urls

Extract all URLs from a webpage to discover related documentation pages and API references, or to build documentation graphs. Validates links and optionally queues them for processing.

Instructions

Extract and analyze all URLs from a given web page. This tool crawls the specified webpage, identifies all hyperlinks, and optionally adds them to the processing queue. Useful for discovering related documentation pages, API references, or building a documentation graph. Handles various URL formats and validates links before extraction.
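
As a quick orientation, calling this tool from an MCP client might look like the sketch below. It assumes the @modelcontextprotocol/sdk client API and a locally built server started over stdio; the command, file path, client metadata, and example URL are illustrative and not taken from this page.

    import { Client } from '@modelcontextprotocol/sdk/client/index.js';
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

    // Launch the server over stdio; command and path are placeholders for however
    // you run this MCP server locally.
    const transport = new StdioClientTransport({
      command: 'node',
      args: ['build/index.js'],
    });

    const client = new Client({ name: 'example-client', version: '1.0.0' }, { capabilities: {} });
    await client.connect(transport);

    // Ask the server to extract URLs from a documentation page.
    const result = await client.callTool({
      name: 'extract_urls',
      arguments: {
        url: 'https://docs.python.org/3/library/index.html', // example page
        add_to_queue: false, // set to true to queue the discovered URLs for indexing
      },
    });

    console.log(result.content);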

Input Schema

Name | Required | Description | Default
url | Yes | The complete URL of the webpage to analyze (must include protocol, e.g., https://). The page must be publicly accessible. | —
add_to_queue | No | If true, automatically add extracted URLs to the processing queue for later indexing. This enables recursive documentation discovery. Use with caution on large sites to avoid excessive queuing. | false
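
Both modes return plain-text content, as implemented in the handler shown under Implementation Reference below. The values here are illustrative, reconstructed from the handler's return statements rather than captured output:

    // add_to_queue omitted or false: a newline-separated list of discovered URLs,
    // or 'No URLs found on this page.' when nothing under the base path matches.
    const discoverResult = {
      content: [
        {
          type: 'text',
          text: 'https://docs.python.org/3/library/index.html\nhttps://docs.python.org/3/library/functions.html',
        },
      ],
    };

    // add_to_queue: true — the URLs are appended to the queue file and a summary is returned.
    const queueResult = {
      content: [
        {
          type: 'text',
          text: 'Successfully added 2 URLs to the queue',
        },
      ],
    };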

Implementation Reference

  • The ExtractUrlsHandler class implements the core logic of the 'extract_urls' tool. It loads the page in a headless browser, parses the HTML with cheerio, resolves relative URLs, keeps only anchor hrefs that fall under the base path of the documentation section, and optionally appends the results to a queue file.
    export class ExtractUrlsHandler extends BaseHandler {
      async handle(args: any): Promise<McpToolResponse> {
        if (!args.url || typeof args.url !== 'string') {
          throw new McpError(ErrorCode.InvalidParams, 'URL is required');
        }
        await this.apiClient.initBrowser();
        const page = await this.apiClient.browser.newPage();
        try {
          const baseUrl = new URL(args.url);
          const basePath = baseUrl.pathname.split('/').slice(0, 3).join('/'); // Get the base path (e.g., /3/ for Python docs)
          await page.goto(args.url, { waitUntil: 'networkidle' });
          const content = await page.content();
          const $ = cheerio.load(content);
          const urls = new Set<string>();
          $('a[href]').each((_, element) => {
            const href = $(element).attr('href');
            if (href) {
              try {
                const url = new URL(href, args.url);
                // Only include URLs from the same documentation section
                if (
                  url.hostname === baseUrl.hostname &&
                  url.pathname.startsWith(basePath) &&
                  !url.hash &&
                  !url.href.endsWith('#')
                ) {
                  urls.add(url.href);
                }
              } catch (e) {
                // Ignore invalid URLs
              }
            }
          });
          const urlArray = Array.from(urls);
          if (args.add_to_queue) {
            try {
              // Ensure queue file exists
              try {
                await fs.access(QUEUE_FILE);
              } catch {
                await fs.writeFile(QUEUE_FILE, '');
              }
              // Append URLs to queue
              const urlsToAdd = urlArray.join('\n') + (urlArray.length > 0 ? '\n' : '');
              await fs.appendFile(QUEUE_FILE, urlsToAdd);
              return {
                content: [
                  {
                    type: 'text',
                    text: `Successfully added ${urlArray.length} URLs to the queue`,
                  },
                ],
              };
            } catch (error) {
              return {
                content: [
                  {
                    type: 'text',
                    text: `Failed to add URLs to queue: ${error}`,
                  },
                ],
                isError: true,
              };
            }
          }
          return {
            content: [
              {
                type: 'text',
                text: urlArray.join('\n') || 'No URLs found on this page.',
              },
            ],
          };
        } catch (error) {
          return {
            content: [
              {
                type: 'text',
                text: `Failed to extract URLs: ${error}`,
              },
            ],
            isError: true,
          };
        } finally {
          await page.close();
        }
      }
    }
  • The HandlerRegistry sets up and registers the ExtractUrlsHandler instance for the 'extract_urls' tool name in its handlers Map; a short dispatch sketch follows this reference list.
    private setupHandlers() {
      this.handlers.set('add_documentation', new AddDocumentationHandler(this.server, this.apiClient));
      this.handlers.set('search_documentation', new SearchDocumentationHandler(this.server, this.apiClient));
      this.handlers.set('list_sources', new ListSourcesHandler(this.server, this.apiClient));
      this.handlers.set('remove_documentation', new RemoveDocumentationHandler(this.server, this.apiClient));
      this.handlers.set('extract_urls', new ExtractUrlsHandler(this.server, this.apiClient));
      this.handlers.set('list_queue', new ListQueueHandler(this.server, this.apiClient));
      this.handlers.set('run_queue', new RunQueueHandler(this.server, this.apiClient));
      this.handlers.set('clear_queue', new ClearQueueHandler(this.server, this.apiClient));
    }
  • The input schema and description for the 'extract_urls' tool, provided in the ListTools response.
    {
      name: 'extract_urls',
      description: 'Extract and analyze all URLs from a given web page. This tool crawls the specified webpage, identifies all hyperlinks, and optionally adds them to the processing queue. Useful for discovering related documentation pages, API references, or building a documentation graph. Handles various URL formats and validates links before extraction.',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'The complete URL of the webpage to analyze (must include protocol, e.g., https://). The page must be publicly accessible.',
          },
          add_to_queue: {
            type: 'boolean',
            description: 'If true, automatically add extracted URLs to the processing queue for later indexing. This enables recursive documentation discovery. Use with caution on large sites to avoid excessive queuing.',
            default: false,
          },
        },
        required: ['url'],
      },
    } as ToolDefinition,
  • Import of ExtractUrlsHandler from handlers/index.js
    ExtractUrlsHandler,
  • Alternative tool definition/schema in the tools directory, similar but possibly unused.
    get definition(): ToolDefinition {
      return {
        name: 'extract_urls',
        description: 'Extract all URLs from a given web page',
        inputSchema: {
          type: 'object',
          properties: {
            url: {
              type: 'string',
              description: 'URL of the page to extract URLs from',
            },
            add_to_queue: {
              type: 'boolean',
              description: 'If true, automatically add extracted URLs to the queue',
              default: false,
            },
          },
          required: ['url'],
        },
      };
    }
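
For context on how the handlers Map above is consumed, a server built on the MCP SDK typically routes tools/call requests to the registered handlers along these lines. This is a hedged sketch: the wireToolDispatch helper and its exact shape are assumptions for illustration, not code quoted from this repository.

    import { Server } from '@modelcontextprotocol/sdk/server/index.js';
    import { CallToolRequestSchema, McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';

    // Hypothetical dispatch: look up the handler registered under the requested tool
    // name (e.g. 'extract_urls' -> ExtractUrlsHandler) and delegate to its handle() method.
    function wireToolDispatch(
      server: Server,
      handlers: Map<string, { handle(args: any): Promise<any> }>,
    ) {
      server.setRequestHandler(CallToolRequestSchema, async (request) => {
        const handler = handlers.get(request.params.name);
        if (!handler) {
          throw new McpError(ErrorCode.MethodNotFound, `Unknown tool: ${request.params.name}`);
        }
        return handler.handle(request.params.arguments ?? {});
      });
    }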
