tavily-crawl

Crawl websites systematically from a base URL, following internal links with configurable depth, breadth, and filtering options to extract structured content.

Instructions

A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | The root URL to begin the crawl | |
| max_depth | No | Max depth of the crawl; defines how far from the base URL the crawler can explore | 1 |
| max_breadth | No | Max number of links to follow per level of the tree (i.e., per page) | 20 |
| limit | No | Total number of links the crawler will process before stopping | 50 |
| instructions | No | Natural language instructions for the crawler | |
| select_paths | No | Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*) | [] |
| select_domains | No | Regex patterns to restrict crawling to specific domains or subdomains (e.g., ^docs\.example\.com$) | [] |
| allow_external | No | Whether to allow following links that go to external domains | false |
| categories | No | Filter URLs to predefined categories (Careers, Blog, Documentation, About, Pricing, Community, Developers, Contact, Media) | [] |
| extract_depth | No | Advanced extraction retrieves more data, including tables and embedded content, with higher success, but may increase latency | basic |
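
For example, an agent that wants to crawl only the documentation section of a site, two levels deep, might pass arguments like these (a minimal sketch; the URL, patterns, and values are illustrative, not prescriptive):

    const exampleArgs = {
      url: "https://example.com",    // root of the crawl
      max_depth: 2,                  // follow links up to two hops from the base URL
      max_breadth: 10,               // expand at most 10 links per page
      limit: 40,                     // stop after 40 pages in total
      select_paths: ["/docs/.*"],    // only URLs whose path matches /docs/...
      extract_depth: "basic"         // default extraction mode
    };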

Implementation Reference

  • The core handler function that executes the tavily-crawl tool logic by sending a POST request to Tavily's crawl API endpoint with the input parameters and the API key.
    async crawl(params: any): Promise<TavilyCrawlResponse> {
      try {
        const response = await this.axiosInstance.post(this.baseURLs.crawl, {
          ...params,
          api_key: API_KEY
        });
        return response.data;
      } catch (error: any) {
        if (error.response?.status === 401) {
          throw new Error('Invalid API key');
        } else if (error.response?.status === 429) {
          throw new Error('Usage limit exceeded');
        }
        throw error;
      }
    }
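  • The handler above references this.axiosInstance, this.baseURLs, and API_KEY, which are defined elsewhere in the file. A minimal sketch of that surrounding setup, assuming an environment-variable key and Tavily's public crawl endpoint (neither is confirmed by the excerpt):
    import axios, { AxiosInstance } from "axios";

    // Assumption: the key is read from the environment; the exact variable
    // name is not shown in the excerpt above.
    const API_KEY = process.env.TAVILY_API_KEY ?? "";

    class TavilyClient {
      private axiosInstance: AxiosInstance = axios.create({
        headers: { "Content-Type": "application/json" },
        timeout: 60_000,
      });

      // Assumption: endpoint inferred from Tavily's public API, not shown in the excerpt.
      private baseURLs = { crawl: "https://api.tavily.com/crawl" };

      // ... crawl(params) as shown above
    }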
  • Input schema definition for the tavily-crawl tool, including all parameters, types, descriptions, defaults, and required fields, used in the ListTools response.
    {
      name: "tavily-crawl",
      description: "A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site.",
      inputSchema: {
        type: "object",
        properties: {
          url: {
            type: "string",
            description: "The root URL to begin the crawl"
          },
          max_depth: {
            type: "integer",
            description: "Max depth of the crawl. Defines how far from the base URL the crawler can explore.",
            default: 1,
            minimum: 1
          },
          max_breadth: {
            type: "integer",
            description: "Max number of links to follow per level of the tree (i.e., per page)",
            default: 20,
            minimum: 1
          },
          limit: {
            type: "integer",
            description: "Total number of links the crawler will process before stopping",
            default: 50,
            minimum: 1
          },
          instructions: {
            type: "string",
            description: "Natural language instructions for the crawler"
          },
          select_paths: {
            type: "array",
            items: { type: "string" },
            description: "Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*)",
            default: []
          },
          select_domains: {
            type: "array",
            items: { type: "string" },
            description: "Regex patterns to select crawling to specific domains or subdomains (e.g., ^docs\\.example\\.com$)",
            default: []
          },
          allow_external: {
            type: "boolean",
            description: "Whether to allow following links that go to external domains",
            default: false
          },
          categories: {
            type: "array",
            items: { 
              type: "string",
              enum: ["Careers", "Blog", "Documentation", "About", "Pricing", "Community", "Developers", "Contact", "Media"]
            },
            description: "Filter URLs using predefined categories like documentation, blog, api, etc",
            default: []
          },
          extract_depth: {
            type: "string",
            enum: ["basic", "advanced"],
            description: "Advanced extraction retrieves more data, including tables and embedded content, with higher success but may increase latency",
            default: "basic"
          }
        },
        required: ["url"]
      }
    }
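  • Note the doubled backslashes in the select_domains example: inside a JSON or TypeScript string literal, \\. encodes the single regex escape \., so the pattern matches a literal dot. A quick illustration:
    // The string literal "^docs\\.example\\.com$" encodes the regex ^docs\.example\.com$
    const pattern = new RegExp("^docs\\.example\\.com$");
    console.log(pattern.test("docs.example.com"));  // true
    console.log(pattern.test("docsXexample.com"));  // false: the dot is matched literally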
  • src/index.ts:381-399: Registration and dispatch logic in the CallToolRequestHandler switch statement that calls the crawl handler with parsed arguments and formats the response using formatCrawlResults.
    case "tavily-crawl":
      const crawlResponse = await this.crawl({
        url: args.url,
        max_depth: args.max_depth,
        max_breadth: args.max_breadth,
        limit: args.limit,
        instructions: args.instructions,
        select_paths: Array.isArray(args.select_paths) ? args.select_paths : [],
        select_domains: Array.isArray(args.select_domains) ? args.select_domains : [],
        allow_external: args.allow_external,
        categories: Array.isArray(args.categories) ? args.categories : [],
        extract_depth: args.extract_depth
      });
      return {
        content: [{
          type: "text",
          text: formatCrawlResults(crawlResponse)
        }]
      };
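  • Errors thrown by crawl (for example the 401/429 messages above) propagate out of this case. A common MCP pattern, sketched here as a hypothetical helper not present in the source, is to catch them and return an error result instead of letting the request handler crash:
    // Hypothetical wrapper: converts a thrown error into an MCP-style error result.
    async function safeCall<T>(fn: () => Promise<T>, format: (result: T) => string) {
      try {
        const result = await fn();
        return { content: [{ type: "text" as const, text: format(result) }] };
      } catch (error: any) {
        return {
          content: [{ type: "text" as const, text: `Error: ${error?.message ?? error}` }],
          isError: true,
        };
      }
    }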
  • Helper function that formats the TavilyCrawlResponse into a human-readable string output for the tool response.
    function formatCrawlResults(response: TavilyCrawlResponse): string {
      const output: string[] = [];
      
      output.push(`Crawl Results:`);
      output.push(`Base URL: ${response.base_url}`);
      
      output.push('\nCrawled Pages:');
      response.results.forEach((page, index) => {
        output.push(`\n[${index + 1}] URL: ${page.url}`);
        if (page.raw_content) {
          // Truncate content if it's too long
          const contentPreview = page.raw_content.length > 200 
            ? page.raw_content.substring(0, 200) + "..." 
            : page.raw_content;
          output.push(`Content: ${contentPreview}`);
        }
      });
      
      return output.join('\n');
    }
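  • The TavilyCrawlResponse interface is not shown in the excerpt; judging from the field accesses above, it includes at least base_url and a results array of pages with a url and an optional raw_content. A sketch under that assumption, with the output the formatter would produce:
    // Shape inferred from formatCrawlResults; the real interface may carry more fields.
    interface TavilyCrawlResponse {
      base_url: string;
      results: { url: string; raw_content?: string }[];
    }

    const sample: TavilyCrawlResponse = {
      base_url: "https://example.com",
      results: [{ url: "https://example.com/docs", raw_content: "Getting started..." }],
    };

    console.log(formatCrawlResults(sample));
    // Crawl Results:
    // Base URL: https://example.com
    //
    // Crawled Pages:
    //
    // [1] URL: https://example.com/docs
    // Content: Getting started...
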
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It describes the crawling mechanism ('expands like a tree, following internal links') and control aspects (depth, breadth, guidance), but does not disclose critical behavioral traits such as rate limits, authentication needs, potential impacts on target sites, or output format. While it adds useful context, it misses key operational details for a tool with 10 parameters.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded, starting with the core purpose and key features in a single, efficient sentence. It avoids redundancy and wastes no words, though it could be slightly more structured by explicitly separating purpose from parameter guidance. Every sentence earns its place by adding value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (10 parameters, no annotations, no output schema), the description is incomplete. It covers the crawling mechanism and parameter roles but lacks details on output (what data is returned), error handling, performance implications, or integration with siblings. While it provides a good foundation, it does not fully address the needs for a tool of this scope without structured support.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description adds meaning beyond the input schema by explaining the crawling behavior ('expands from that point like a tree') and how parameters like 'max_depth' and 'max_breadth' relate to this process ('control how deep and wide it goes'). However, with 100% schema description coverage, the schema already documents all parameters thoroughly, so the description provides supplementary context rather than essential semantics, warranting a score above baseline but not maximal.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose as a 'powerful web crawler that initiates a structured web crawl starting from a specified base URL,' specifying the verb (crawl), resource (web pages), and mechanism (tree-like expansion following internal links). It distinguishes from siblings like 'tavily-extract' (likely for extraction), 'tavily-map' (likely for site mapping), and 'tavily-search' (likely for search queries) by focusing on crawling behavior.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for web crawling tasks with control over depth, breadth, and focus, but does not explicitly state when to use this tool versus its siblings (e.g., 'tavily-extract' for extraction vs. crawling). It provides general context ('guide it to focus on specific sections') but lacks explicit alternatives or exclusions, leaving usage somewhat open to interpretation.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
