
Open Search MCP

by flyanima

crawl_url_content

Extract content from web pages by crawling URLs to retrieve text, links, and images for research and analysis.

Instructions

Crawl and extract content from a single web page

Input Schema

Name              Required  Description                          Default
url               Yes       URL of the web page to crawl         (none)
extractText       No        Extract main text content            true
extractLinks      No        Extract all links from the page      false
extractImages     No        Extract all images from the page     false
maxContentLength  No        Maximum length of extracted content  5000
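The parameter table can be exercised with a minimal sketch. The argument shape below mirrors the input schema; the `isValidUrl` helper is hypothetical, but it reproduces the handler's up-front `new URL(url)` check.

```typescript
// Hypothetical example arguments for crawl_url_content; field names and
// defaults mirror the input schema documented above.
const args = {
  url: "https://example.com/article",
  extractText: true,       // default: true
  extractLinks: true,      // default: false
  extractImages: false,    // default: false
  maxContentLength: 8000   // default: 5000, valid range 100-20000
};

// Reproduces the handler's up-front URL validation: `new URL(url)` throws on
// malformed input, which the handler turns into an 'Invalid URL format' error.
function isValidUrl(url: string): boolean {
  try {
    new URL(url);
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUrl(args.url));    // true
console.log(isValidUrl("not-a-url")); // false
```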

Implementation Reference

  • The core handler function that executes the tool: it validates the URL, fetches the page with axios via WebCrawlerClient, extracts content with cheerio, and returns structured data with metadata.
    execute: async (args: any) => {
      const { url, extractText = true, extractLinks = false, extractImages = false, maxContentLength = 5000 } = args;
    
      try {
        const startTime = Date.now();
        
        // Validate URL format
        try {
          new URL(url);
        } catch {
          return {
            success: false,
            error: 'Invalid URL format'
          };
        }
    
        const pageData = await client.fetchPage(url);
        const content = client.extractContent(pageData.html, {
          extractText,
          extractLinks,
          extractImages,
          maxContentLength,
          maxLinks: 50,
          maxImages: 20
        });
    
        const crawlTime = Date.now() - startTime;
    
        return {
          success: true,
          data: {
            source: 'Web Crawler',
            url,
            finalUrl: pageData.url,
            status: pageData.status,
            crawlTime,
            content: {
              ...content,
              extractedAt: new Date().toISOString()
            },
            metadata: {
              extractText,
              extractLinks,
              extractImages,
              contentLength: content.contentLength || 0,
              linksCount: content.links?.length || 0,
              imagesCount: content.images?.length || 0
            },
            timestamp: Date.now()
          }
        };
      } catch (error) {
        return {
          success: false,
          error: `Web crawling failed: ${error instanceof Error ? error.message : String(error)}`,
          data: {
            source: 'Web Crawler',
            url,
            suggestions: [
              'Check if the URL is accessible',
              'Verify the website allows crawling',
              'Try again in a few moments',
              'Check your internet connection'
            ]
          }
        };
      }
    }
  • Input schema defining parameters for the tool: url (required), extractText, extractLinks, extractImages, maxContentLength.
    inputSchema: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'URL of the web page to crawl'
        },
        extractText: {
          type: 'boolean',
          description: 'Extract main text content',
          default: true
        },
        extractLinks: {
          type: 'boolean',
          description: 'Extract all links from the page',
          default: false
        },
        extractImages: {
          type: 'boolean',
          description: 'Extract all images from the page',
          default: false
        },
        maxContentLength: {
          type: 'number',
          description: 'Maximum length of extracted content',
          default: 5000,
          minimum: 100,
          maximum: 20000
        }
      },
      required: ['url']
    },
  • Local registration of the 'crawl_url_content' tool within the registerWebCrawlerTools function, including name, description, schema, and handler.
    registry.registerTool({
      name: 'crawl_url_content',
      description: 'Crawl and extract content from a single web page',
      category: 'utility',
      source: 'Web Crawler',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'URL of the web page to crawl'
          },
          extractText: {
            type: 'boolean',
            description: 'Extract main text content',
            default: true
          },
          extractLinks: {
            type: 'boolean',
            description: 'Extract all links from the page',
            default: false
          },
          extractImages: {
            type: 'boolean',
            description: 'Extract all images from the page',
            default: false
          },
          maxContentLength: {
            type: 'number',
            description: 'Maximum length of extracted content',
            default: 5000,
            minimum: 100,
            maximum: 20000
          }
        },
        required: ['url']
      },
      execute: async (args: any) => {
        const { url, extractText = true, extractLinks = false, extractImages = false, maxContentLength = 5000 } = args;
    
        try {
          const startTime = Date.now();
          
          // Validate URL format
          try {
            new URL(url);
          } catch {
            return {
              success: false,
              error: 'Invalid URL format'
            };
          }
    
          const pageData = await client.fetchPage(url);
          const content = client.extractContent(pageData.html, {
            extractText,
            extractLinks,
            extractImages,
            maxContentLength,
            maxLinks: 50,
            maxImages: 20
          });
    
          const crawlTime = Date.now() - startTime;
    
          return {
            success: true,
            data: {
              source: 'Web Crawler',
              url,
              finalUrl: pageData.url,
              status: pageData.status,
              crawlTime,
              content: {
                ...content,
                extractedAt: new Date().toISOString()
              },
              metadata: {
                extractText,
                extractLinks,
                extractImages,
                contentLength: content.contentLength || 0,
                linksCount: content.links?.length || 0,
                imagesCount: content.images?.length || 0
              },
              timestamp: Date.now()
            }
          };
        } catch (error) {
          return {
            success: false,
            error: `Web crawling failed: ${error instanceof Error ? error.message : String(error)}`,
            data: {
              source: 'Web Crawler',
              url,
              suggestions: [
                'Check if the URL is accessible',
                'Verify the website allows crawling',
                'Try again in a few moments',
                'Check your internet connection'
              ]
            }
          };
        }
      }
    });
  • src/index.ts:248 (registration)
    Top-level registration call in the main server initialization that registers the web crawler tools including 'crawl_url_content'.
    registerWebCrawlerTools(this.toolRegistry);         // 2 tools: crawl_url_content, batch_crawl_urls
  • WebCrawlerClient class providing helper methods: fetchPage (axios fetch), extractContent (cheerio parsing for title, description, content, links, and images), and crawlMultiplePages.
    class WebCrawlerClient {
      private userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36';
    
      async fetchPage(url: string, options: any = {}) {
        // No try/catch here: errors propagate to the caller, which formats them.
        const response = await axios.get(url, {
          timeout: options.timeout || 15000,
          headers: {
            'User-Agent': this.userAgent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1'
          },
          maxRedirects: 5,
          validateStatus: (status) => status < 400
        });

        return {
          html: response.data,
          status: response.status,
          headers: response.headers,
          url: response.config.url || url
        };
      }
    
      extractContent(html: string, options: any = {}) {
        const $ = cheerio.load(html);
        
        // Remove script, style, and other non-content tags
        $('script, style, nav, footer, aside, .advertisement, .ads').remove();
        
        const result: any = {
          title: $('title').text().trim() || $('h1').first().text().trim() || 'No title found',
          description: $('meta[name="description"]').attr('content') || 
                       $('meta[property="og:description"]').attr('content') || 
                       'No description found',
          keywords: $('meta[name="keywords"]').attr('content') || '',
          author: $('meta[name="author"]').attr('content') || 
                  $('meta[property="article:author"]').attr('content') || '',
          publishDate: $('meta[property="article:published_time"]').attr('content') || 
                       $('meta[name="date"]').attr('content') || '',
          url: $('meta[property="og:url"]').attr('content') || '',
          image: $('meta[property="og:image"]').attr('content') || 
                 $('img').first().attr('src') || ''
        };
    
        // Extract the main content
        if (options.extractText) {
          // Try multiple content selectors
          const contentSelectors = [
            'article',
            '.content',
            '.post-content',
            '.entry-content',
            '.article-content',
            'main',
            '.main-content',
            '#content',
            '.container'
          ];
    
          let mainContent = '';
          for (const selector of contentSelectors) {
            const content = $(selector).text().trim();
            if (content && content.length > mainContent.length) {
              mainContent = content;
            }
          }
    
          // If no main content was found, fall back to the body text
          if (!mainContent) {
            mainContent = $('body').text().trim();
          }
    
          // Clean up the text
          mainContent = mainContent
            .replace(/\s+/g, ' ')
            .replace(/\n\s*\n/g, '\n')
            .trim();
    
          result.content = mainContent.substring(0, options.maxContentLength || 5000);
          result.contentLength = mainContent.length;
        }
    
        // Extract links (guard: an empty og:url would make `new URL` throw)
        if (options.extractLinks) {
          let baseHostname = '';
          try {
            baseHostname = new URL(result.url).hostname;
          } catch {
            // no usable base URL; absolute links are treated as external
          }
          const links: any[] = [];
          $('a[href]').each((_, element) => {
            const href = $(element).attr('href');
            const text = $(element).text().trim();
            if (href && text) {
              links.push({
                url: href,
                text: text,
                isExternal: href.startsWith('http') &&
                            (baseHostname === '' || !href.includes(baseHostname))
              });
            }
          });
          result.links = links.slice(0, options.maxLinks || 50);
        }
    
        // Extract images
        if (options.extractImages) {
          const images: any[] = [];
          $('img[src]').each((_, element) => {
            const src = $(element).attr('src');
            const alt = $(element).attr('alt') || '';
            if (src) {
              images.push({
                url: src,
                alt: alt,
                title: $(element).attr('title') || ''
              });
            }
          });
          result.images = images.slice(0, options.maxImages || 20);
        }
    
        return result;
      }
    
      async crawlMultiplePages(urls: string[], options: any = {}) {
        const results = [];
        const maxConcurrent = options.maxConcurrent || 3;
        
        for (let i = 0; i < urls.length; i += maxConcurrent) {
          const batch = urls.slice(i, i + maxConcurrent);
          const batchPromises = batch.map(async (url) => {
            try {
              const pageData = await this.fetchPage(url, options);
              const content = this.extractContent(pageData.html, options);
              return {
                url,
                success: true,
                data: {
                  ...content,
                  status: pageData.status,
                  finalUrl: pageData.url
                }
              };
            } catch (error) {
              return {
                url,
                success: false,
                error: error instanceof Error ? error.message : String(error)
              };
            }
          });
    
          const batchResults = await Promise.all(batchPromises);
          results.push(...batchResults);
    
          // Add a delay between batches to avoid overly frequent requests
          if (i + maxConcurrent < urls.length) {
            await new Promise(resolve => setTimeout(resolve, options.delay || 1000));
          }
        }
    
        return results;
      }
    }
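The text-cleaning step inside `extractContent` can be isolated as a small, dependency-free sketch (the `cleanAndTruncate` name is made up for illustration; the regexes are the ones used above). Note that `contentLength` reports the full pre-truncation length, which is why it can exceed `maxContentLength` in the tool's metadata.

```typescript
// Standalone sketch of extractContent's cleanup/truncation step; no cheerio needed.
function cleanAndTruncate(raw: string, maxContentLength = 5000) {
  const cleaned = raw
    .replace(/\s+/g, " ")       // collapse runs of whitespace to single spaces
    .replace(/\n\s*\n/g, "\n")  // redundant after the first replace, kept for fidelity
    .trim();
  return {
    content: cleaned.substring(0, maxContentLength),  // truncated text
    contentLength: cleaned.length                     // full pre-truncation length
  };
}

const sample = "  Hello\n\n  world   from\tthe page  ";
console.log(cleanAndTruncate(sample, 11));
// { content: 'Hello world', contentLength: 25 }
```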
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral disclosure. While 'crawl and extract' implies a read operation, it doesn't specify potential limitations like rate limits, authentication needs, network timeouts, or what happens with inaccessible URLs. The description is too vague about the extraction process and output format, leaving key behavioral traits undefined.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single, efficient sentence that front-loads the core functionality without unnecessary words. Every part of the sentence ('crawl and extract content from a single web page') directly contributes to understanding the tool's purpose, making it appropriately concise and well-structured.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity of web crawling (with potential network issues, content variability, and no output schema), the description is incomplete. It lacks details on what the extracted content looks like, error handling, performance characteristics, or how it integrates with sibling tools. Without annotations or output schema, this leaves significant gaps for an AI agent to use the tool effectively.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all parameters thoroughly, which meets the baseline. The description, however, adds no meaningful semantic context beyond the schema: it doesn't explain relationships between parameters (e.g., how 'extractText' interacts with 'maxContentLength') or provide usage examples.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('crawl and extract') and resource ('content from a single web page'), making it easy to understand what the tool does. However, it doesn't explicitly differentiate from sibling tools like 'batch_crawl_urls' or 'search_*' tools, which handle multiple URLs or different types of content extraction.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It doesn't mention that this is for single URLs (versus 'batch_crawl_urls' for multiple) or how it differs from search tools like 'search_brave' that might also extract content. There's no context about prerequisites, limitations, or typical use cases.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
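One way to address the gaps flagged above is sketched below as a revised registration. The description text is invented for illustration, but the timeout, redirect limit, and sibling-tool name are taken from the code excerpts; the annotation fields follow the MCP tool-annotations convention (readOnlyHint, destructiveHint, idempotentHint, openWorldHint).

```typescript
// Hypothetical revision of the crawl_url_content registration, with behavioral
// disclosure in the description plus MCP tool annotations.
const improvedRegistration = {
  name: "crawl_url_content",
  description:
    "Crawl a single web page and extract its text, links, and/or images. " +
    "Read-only: makes one outbound HTTP GET (15s timeout, up to 5 redirects), " +
    "performs no authentication, and fails on unreachable or crawler-blocking " +
    "sites. For multiple URLs, use batch_crawl_urls instead.",
  annotations: {
    readOnlyHint: true,    // modifies no state
    destructiveHint: false,
    idempotentHint: true,  // re-crawling the same URL is safe
    openWorldHint: true    // reaches arbitrary external hosts
  }
};

console.log(improvedRegistration.annotations.openWorldHint); // true
```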
