
Web Analysis MCP

by himly0302

web_search

Search the web using SearXNG, crawl pages with Creeper, and get LLM-summarized results to access current online information while avoiding token limits.

Instructions

Searches the web with SearXNG, crawls pages with Creeper, and returns LLM-summarized results. Suitable for scenarios that require up-to-date online information.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| query | Yes | Search query (must not be empty or whitespace-only) | |
| max_results | No | Maximum number of results (1-20) | 10 |
| language | No | Search language | zh |
| time_range | No | Time range | |
| include_domains | No | Only search these domains (whitelist) | |
| exclude_domains | No | Exclude these domains (blacklist) | |
| save_to_file | No | Whether to save crawled content to a local file (requires the SAVE_CONTENT_ENABLED config) | false |
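Based on the schema above, a call might pass arguments such as the following (the values are purely illustrative):

```typescript
// Illustrative web_search arguments; field names and constraints follow the input schema.
const args = {
  query: "TypeScript 5 release notes", // required, must not be blank
  max_results: 5,                      // 1-20, defaults to 10
  language: "en",                      // defaults to "zh"
  time_range: "month",                 // one of: day, week, month, year
  include_domains: ["github.com"],     // whitelist
  exclude_domains: ["pinterest.com"],  // blacklist
  save_to_file: false,                 // requires SAVE_CONTENT_ENABLED
};

console.log(JSON.stringify(args));
```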

Implementation Reference

  • Main handler function that executes the web_search tool. Uses SearXNG for search, Creeper for web crawling, and LLM for summarization. Implements multi-stage filtering (rule-based + LLM), content crawling, and intelligent summarization (single or Map-Reduce based on content size).
    export async function executeWebSearch(
      input: WebSearchInput,
      searxngService: SearXNGService,
      creeperService: CreeperService,
      summarizerService: SummarizerService,
      config: {
        filter: any;
        filterLlm: any;
        mapReduce: any;
        maxContentLength: number;
      }
    ): Promise<string> {
      const startTime = Date.now();
    
      logger.info('Executing web_search', {
        query: input.query,
        max_results: input.max_results,
        language: input.language,
        time_range: input.time_range
      });
    
      try {
        // 1. SearXNG search (fetch extra results to allow for filtering)
        const searchResults = await searxngService.search(input.query, {
          language: input.language,
          timeRange: input.time_range,
          maxResults: input.max_results * 2,  // fetch double to leave headroom for filtering
        });
    
        if (searchResults.length === 0) {
          return `未找到与 "${input.query}" 相关的搜索结果。`;
        }
    
        logger.info('SearXNG search completed', {
          initialResultCount: searchResults.length,
          requestedResults: input.max_results
        });
    
        // 2. Rule-based filtering
        const filter = new ResultFilter({
          maxResults: input.max_results,
          domainWhitelist: input.include_domains,
          domainBlacklist: input.exclude_domains,
          ...config.filter
        });
    
        const filteredResults = filter.filter(searchResults, input.query);
    
        logger.info('Rule filtering completed', {
          filteredResultCount: filteredResults.length,
          whitelistCount: input.include_domains?.length || 0,
          blacklistCount: input.exclude_domains?.length || 0
        });
    
        // 2.5. LLM-based smart filtering (if enabled)
        let topic: TopicCategory = 'other';
        let finalResults = filteredResults;
    
        if (config.filterLlm.enabled && filteredResults.length > 0) {
          const filterLlmService = new FilterLlmService(config.filterLlm);
    
          const { filtered: llmFiltered, topic: detectedTopic } = await filter.filterWithLlm(
            filteredResults,
            input.query,
            (query, results) => filterLlmService.filter(query, results)
          );
    
          topic = detectedTopic;
          finalResults = llmFiltered;
    
          logger.info('LLM filtering completed', {
            topic,
            keptCount: finalResults.length,
            filteredCount: filteredResults.length - finalResults.length
          });
        }
    
        if (finalResults.length === 0) {
          return `搜索结果已被过滤器全部排除。请尝试调整域名过滤设置。`;
        }
    
        // 3. Crawl page content with Creeper
        const urls = finalResults.map(r => r.url);
        const crawledPages = await creeperService.crawl(urls, input.save_to_file, input.query);
    
        // Tally successes/failures
        const successPages = crawledPages.filter(p => p.success);
        const failedPages = crawledPages.filter(p => !p.success);
    
        logger.info('Crawling completed', {
          totalUrls: urls.length,
          successCount: successPages.length,
          failureCount: failedPages.length
        });
    
        if (successPages.length === 0) {
          return `所有网页爬取失败。请检查 Creeper 配置或网页可访问性。`;
        }
    
        // 4. Compute total content size
        const totalLength = successPages.reduce((sum, p) => sum + p.content.length, 0);
    
        logger.info('Content size analysis', {
          totalLength,
          avgLength: Math.round(totalLength / successPages.length),
          threshold: config.mapReduce.threshold
        });
    
        // 5. Choose summarization strategy (topic-driven)
        let summary: string;
        logger.info('Using topic-driven summarization', { topic });
    
        if (totalLength < config.mapReduce.threshold) {
          // Single-pass summarization
          logger.info('Using single summarization');
          const combinedContent = formatForSingleSummary(successPages);
          const summaryPrompt = getSearchSummaryPromptByTopic(topic);
          const summaryResponse = await summarizerService.summarize({
            content: combinedContent,
            prompt: summaryPrompt
          });
          summary = summaryResponse.summary;
        } else {
          // Map-Reduce summarization
          logger.info('Using Map-Reduce summarization');
          summary = await mapReduceSummarize(
            successPages,
            summarizerService,
            {
              chunkSize: config.mapReduce.chunkSize,
              maxConcurrency: config.mapReduce.maxConcurrency
            }
          );
        }
    
        // 6. Format output
        const output = formatOutput(summary, successPages, failedPages);
    
        const duration = Date.now() - startTime;
        logger.info('Web search completed', {
          query: input.query,
          duration: `${duration}ms`,
          topic,
          initialResults: searchResults.length,
          filteredResults: filteredResults.length,
          finalResults: finalResults.length,
          pagesCrawled: successPages.length
        });
    
        return output;
    
      } catch (error) {
        const errorMessage = error instanceof Error ? error.message : 'Unknown error';
        logger.error('Web search failed', {
          query: input.query,
          error: errorMessage
        });
    
        return `搜索执行失败: ${errorMessage}`;
      }
    }
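The Map-Reduce branch delegates to `mapReduceSummarize`, whose body is not shown. A minimal sketch of how such a summarizer might work is below; the `Page` and `Summarizer` shapes are assumptions inferred from how the handler uses `successPages` and `summarizerService`, not the actual implementation:

```typescript
// Hypothetical Map-Reduce summarizer sketch: summarize each page (map),
// then summarize the concatenation of the partial summaries (reduce).
interface Page {
  url: string;
  content: string;
}

interface Summarizer {
  summarize(input: { content: string; prompt?: string }): Promise<{ summary: string }>;
}

async function mapReduceSummarizeSketch(
  pages: Page[],
  summarizer: Summarizer,
  opts: { chunkSize: number; maxConcurrency: number }
): Promise<string> {
  const partials: string[] = [];

  // Map phase: process pages in batches bounded by maxConcurrency,
  // truncating each page to chunkSize characters.
  for (let i = 0; i < pages.length; i += opts.maxConcurrency) {
    const batch = pages.slice(i, i + opts.maxConcurrency);
    const results = await Promise.all(
      batch.map((p) =>
        summarizer.summarize({ content: p.content.slice(0, opts.chunkSize) })
      )
    );
    partials.push(...results.map((r) => r.summary));
  }

  // Reduce phase: summarize the combined partial summaries into one answer.
  const combined = partials.join("\n\n");
  const { summary } = await summarizer.summarize({ content: combined });
  return summary;
}
```

The batching loop keeps at most `maxConcurrency` summarization calls in flight, which is one common way to respect an LLM provider's rate limits.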
  • Input validation schema using Zod. Defines and validates parameters: query (required string), max_results (1-20, default 10), language, time_range, include/exclude domains, and save_to_file option.
    export const webSearchInputSchema = z.object({
      query: z.string()
        .min(1, '搜索查询不能为空')
        .transform(val => val.trim())
        .refine(val => val.length > 0, '搜索查询不能只包含空格字符')
        .describe('搜索查询'),
      max_results: z.number().min(1).max(20).default(10).describe('最大结果数量'),
      language: z.string().optional().default('zh').describe('搜索语言'),
      time_range: z.enum(['day', 'week', 'month', 'year']).optional().describe('时间范围'),
      include_domains: z.array(z.string()).optional().describe('只搜索这些域名'),
      exclude_domains: z.array(z.string()).optional().describe('排除这些域名'),
      save_to_file: z.boolean().default(false).describe('是否将爬取的内容保存到本地文件'),
    });
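The trim-then-reject behavior of the Zod schema can be mirrored in plain TypeScript; the helpers below are hypothetical illustrations, not part of the source:

```typescript
// Hypothetical plain-TypeScript equivalents of the Zod refinements above:
// trim the query and reject blank input; validate max_results against [1, 20].
function normalizeQuery(query: string): string {
  const trimmed = query.trim();
  if (trimmed.length === 0) {
    throw new Error("Query must not be empty or whitespace-only");
  }
  return trimmed;
}

function validateMaxResults(n: number | undefined, fallback = 10): number {
  if (n === undefined) return fallback; // schema default
  if (n < 1 || n > 20) {
    throw new RangeError("max_results must be between 1 and 20");
  }
  return n;
}
```

Note that, like the Zod schema, out-of-range values are rejected rather than silently clamped, so the caller gets an explicit error on first use.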
  • src/server.ts:86-132 (registration)
    Tool registration in MCP server's ListToolsRequestSchema handler. Defines the web_search tool with name, description, and complete JSON Schema input specification including all parameters and validation constraints.
    {
      name: 'web_search',
      description:
        '使用 SearXNG 搜索网络内容,通过 Creeper 爬取网页,并返回经过 LLM 总结的结果。适用于需要获取最新网络信息的场景。',
      inputSchema: {
        type: 'object',
        properties: {
          query: {
            type: 'string',
            description: '搜索查询(不能为空或仅包含空格字符)',
          },
          max_results: {
            type: 'number',
            description: '最大结果数量 (1-20)',
            default: 10,
            minimum: 1,
            maximum: 20,
          },
          language: {
            type: 'string',
            description: '搜索语言',
            default: 'zh',
          },
          time_range: {
            type: 'string',
            enum: ['day', 'week', 'month', 'year'],
            description: '时间范围',
          },
          include_domains: {
            type: 'array',
            items: { type: 'string' },
            description: '只搜索这些域名(白名单)',
          },
          exclude_domains: {
            type: 'array',
            items: { type: 'string' },
            description: '排除这些域名(黑名单)',
          },
          save_to_file: {
            type: 'boolean',
            description: '是否将爬取的内容保存到本地文件(需要启用 SAVE_CONTENT_ENABLED 配置)',
            default: false,
          },
        },
        required: ['query'],
      },
    },
  • src/server.ts:148-163 (registration)
    Tool execution handler in CallToolRequestSchema. Routes 'web_search' requests to executeWebSearch function with validated input and configured services (SearXNG, Creeper, Summarizer).
    case 'web_search': {
      const input = webSearchInputSchema.parse(args);
      result = await executeWebSearch(
        input,
        searxngService,
        creeperService,
        summarizerService,
        {
          filter: config.filter,
          filterLlm: config.filterLlm,
          mapReduce: config.mapReduce,
          maxContentLength: config.service.maxContentLength,
        }
      );
      break;
    }
Behavior 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden. It mentions the multi-step process (search, crawl, LLM summarization) and hints at recency ('最新网络信息'), but fails to disclose critical behavioral traits such as rate limits, authentication needs, potential costs, error handling, or what 'LLM 总结' entails. For a complex tool with no annotations, this leaves significant gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise (two sentences) and front-loaded with the core functionality. Every sentence contributes: the first explains what the tool does, and the second provides usage context. There's no wasted text, though it could be slightly more structured.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (multi-step process with 7 parameters), lack of annotations, and no output schema, the description is incomplete. It doesn't explain the return format, error conditions, or behavioral constraints. The agent would lack sufficient context to use this tool effectively without trial and error.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema fully documents all 7 parameters. The description adds no parameter-specific information beyond what's in the schema. According to the rules, with high schema coverage (>80%), the baseline is 3 even with no param info in the description.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: '使用 SearXNG 搜索网络内容,通过 Creeper 爬取网页,并返回经过 LLM 总结的结果' (search web content using SearXNG, crawl pages via Creeper, and return LLM-summarized results). It specifies the verb (search/crawl/summarize) and resource (web content), but since there are no sibling tools, it cannot demonstrate differentiation from alternatives.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides some usage context: '适用于需要获取最新网络信息的场景' (suitable for scenarios requiring up-to-date web information). This implies when to use it (for current information), but lacks explicit guidance on when not to use it or comparisons to alternatives. No prerequisites or exclusions are mentioned.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
