Tavily MCP Load Balancer

by yatotm

tavily-crawl

Initiates a structured web crawl from a specified URL, following internal links to explore site content with configurable depth, breadth, and filtering options for targeted data extraction.

Instructions

A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site. For example, with max_depth set to 2 and max_breadth set to 3, the crawler visits the base URL, up to 3 links from that page, and up to 3 links from each of those pages (at most 13 pages in total), subject to the overall limit.

Input Schema

| Name | Required | Description | Default |
|------|----------|-------------|---------|
| allow_external | No | Whether to allow following links that go to external domains | false |
| categories | No | Filter URLs using predefined categories (allowed values: Careers, Blog, Documentation, About, Pricing, Community, Developers, Contact, Media) | [] |
| extract_depth | No | Advanced extraction retrieves more data, including tables and embedded content, with higher success, but may increase latency | basic |
| instructions | No | Natural language instructions for the crawler | |
| limit | No | Total number of links the crawler will process before stopping | 50 |
| max_breadth | No | Max number of links to follow per level of the tree (i.e., per page) | 20 |
| max_depth | No | Max depth of the crawl; defines how far from the base URL the crawler can explore | 1 |
| select_domains | No | Regex patterns to restrict crawling to specific domains or subdomains (e.g., ^docs\.example\.com$) | [] |
| select_paths | No | Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*) | [] |
| url | Yes | The root URL to begin the crawl | |
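As a concrete illustration, here is a minimal sketch of calling this tool from a TypeScript MCP client. It assumes the @modelcontextprotocol/sdk package and a stdio-launched server; the launch command, URL, and regex patterns are hypothetical, not values from this project.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  // Spawn the server over stdio; the command/args are placeholders for
  // however this load balancer is actually started in your setup.
  const transport = new StdioClientTransport({
    command: "node",
    args: ["dist/index.js"],
  });

  const client = new Client({ name: "example-client", version: "1.0.0" });
  await client.connect(transport);

  // Crawl a docs subdomain two levels deep, capped at 30 pages.
  // The URL and regex patterns are illustrative, not from this repository.
  const result = await client.callTool({
    name: "tavily-crawl",
    arguments: {
      url: "https://docs.example.com",
      max_depth: 2,
      max_breadth: 10,
      limit: 30,
      select_paths: ["/docs/.*"],
      select_domains: ["^docs\\.example\\.com$"],
      extract_depth: "basic",
    },
  });
  console.log(result.content);
}

main().catch(console.error);
```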

Implementation Reference

  • Main handler logic for executing the 'tavily-crawl' tool: constructs parameters from tool arguments, invokes tavilyClient.crawl(), formats results with formatCrawlResults, and returns MCP content response.
```typescript
case 'tavily-crawl': {
  // Build crawl parameters from the raw tool arguments. The `||` fallbacks
  // apply the schema defaults (and treat falsy values such as 0 as unset).
  const crawlParams: TavilyCrawlParams = {
    url: args?.url as string,
    max_depth: (args?.max_depth as number) || 1,
    max_breadth: (args?.max_breadth as number) || 20,
    limit: (args?.limit as number) || 50,
    instructions: args?.instructions as string,
    select_paths: Array.isArray(args?.select_paths) ? args.select_paths : [],
    select_domains: Array.isArray(args?.select_domains) ? args.select_domains : [],
    allow_external: (args?.allow_external as boolean) || false,
    categories: Array.isArray(args?.categories) ? args.categories : [],
    extract_depth: (args?.extract_depth as 'basic' | 'advanced') || 'basic',
    format: (args?.format as 'markdown' | 'text') || 'markdown',
    include_favicon: (args?.include_favicon as boolean) || false,
  };
  const crawlResult = await this.tavilyClient.crawl(crawlParams);
  return {
    content: [
      {
        type: 'text',
        text: formatCrawlResults(crawlResult),
      },
    ],
  };
}
```
  • src/index.ts:489-556: Registration of the 'tavily-crawl' tool in the tools list, including name, description, and complete inputSchema for MCP tool listing.
    { name: "tavily-crawl", description: "A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site.", inputSchema: { type: "object", properties: { url: { type: "string", description: "The root URL to begin the crawl" }, max_depth: { type: "integer", description: "Max depth of the crawl. Defines how far from the base URL the crawler can explore.", default: 1, minimum: 1 }, max_breadth: { type: "integer", description: "Max number of links to follow per level of the tree (i.e., per page)", default: 20, minimum: 1 }, limit: { type: "integer", description: "Total number of links the crawler will process before stopping", default: 50, minimum: 1 }, instructions: { type: "string", description: "Natural language instructions for the crawler" }, select_paths: { type: "array", items: { type: "string" }, description: "Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*)", default: [] }, select_domains: { type: "array", items: { type: "string" }, description: "Regex patterns to select crawling to specific domains or subdomains (e.g., ^docs\\.example\\.com$)", default: [] }, allow_external: { type: "boolean", description: "Whether to allow following links that go to external domains", default: false }, categories: { type: "array", items: { type: "string", enum: ["Careers", "Blog", "Documentation", "About", "Pricing", "Community", "Developers", "Contact", "Media"] }, description: "Filter URLs using predefined categories like documentation, blog, api, etc", default: [] }, extract_depth: { type: "string", enum: ["basic", "advanced"], description: "Advanced extraction retrieves more data, including tables and embedded content, with higher success but may increase latency", default: "basic" } }, required: ["url"] } } as Tool,
  • Core handler in TavilyClient that executes the crawl by making a POST request to the Tavily API endpoint https://api.tavily.com/crawl with the provided parameters (a hedged sketch of the makeRequest helper it delegates to appears after this list).
```typescript
async crawl(params: TavilyCrawlParams): Promise<any> {
  return this.makeRequest(this.baseURLs.crawl, params);
}
```
  • TypeScript interface defining the input parameters for the tavily-crawl tool. It mirrors the registered JSON schema above, plus two fields (format and include_favicon) that the handler defaults but the schema does not expose.
```typescript
export interface TavilyCrawlParams {
  url: string;
  max_depth?: number;
  max_breadth?: number;
  limit?: number;
  instructions?: string;
  select_paths?: string[];
  select_domains?: string[];
  allow_external?: boolean;
  categories?: string[];
  extract_depth?: 'basic' | 'advanced';
  format?: 'markdown' | 'text';
  include_favicon?: boolean;
}
```
  • Helper function that formats the raw crawl results from the Tavily API into a string for the tool response (a sketch of the sanitizeText helper it relies on appears after this list).
```typescript
export function formatCrawlResults(response: any): string {
  try {
    if (!response || typeof response !== 'object') {
      return 'Error: Invalid crawl response format';
    }
    const output: string[] = [];
    output.push(`Crawl Results:`);
    output.push(`Base URL: ${sanitizeText(response.base_url, 500)}`);
    output.push('\nCrawled Pages:');
    if (Array.isArray(response.results)) {
      // Only the first 20 pages are included in the formatted output.
      response.results.slice(0, 20).forEach((page: any, index: number) => {
        output.push(`\n[${index + 1}] URL: ${sanitizeText(page.url, 500)}`);
        if (page.raw_content) {
          const cleanContent = sanitizeText(page.raw_content, 1000);
          const contentPreview = cleanContent.length > 300
            ? cleanContent.substring(0, 300) + "... [content truncated]"
            : cleanContent;
          output.push(`Content: ${contentPreview}`);
        }
        if (page.title) {
          output.push(`Title: ${sanitizeText(page.title, 200)}`);
        }
      });
    }
    const result = output.join('\n');
    // Cap the total output size
    if (result.length > 30000) {
      return result.substring(0, 30000) + '\n\n... [crawl results too long, truncated]';
    }
    return result;
  } catch (error) {
    console.error('Error formatting crawl results:', error);
    return 'Error: Failed to format crawl results';
  }
}
```
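The makeRequest helper that crawl() delegates to is not included in this excerpt. As a rough sketch only: assuming it POSTs JSON to the given endpoint with Tavily's Bearer-token authentication, it might look like the following (the key sourcing and error handling are assumptions, not this project's actual code).

```typescript
// Hypothetical sketch of the makeRequest helper; the project's real
// implementation is not shown in this excerpt and may differ.
async function makeRequest(url: string, params: object): Promise<any> {
  const response = await fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Tavily's API accepts a Bearer token; reading it from an
      // environment variable here is an assumption.
      Authorization: `Bearer ${process.env.TAVILY_API_KEY}`,
    },
    body: JSON.stringify(params),
  });
  if (!response.ok) {
    throw new Error(`Tavily API error: ${response.status} ${response.statusText}`);
  }
  return response.json();
}
```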
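Likewise, sanitizeText is used throughout formatCrawlResults but not defined above. Its call sites suggest it coerces input to a string, strips problematic characters, and truncates to a maximum length; here is a minimal sketch consistent with that usage (the exact cleaning rules are an assumption).

```typescript
// Hypothetical sketch of sanitizeText, inferred from its call sites above;
// the project's actual implementation is not shown and may differ.
function sanitizeText(text: unknown, maxLength: number): string {
  if (typeof text !== "string") return "";
  return text
    // Drop control characters, keeping \t, \n, and \r.
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, "")
    .slice(0, maxLength);
}
```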
