
TrendRadar

by funinii

trigger_crawl

Manually initiate a web scraping task to collect trending news from specified platforms, with options to save data locally and include URLs.

Instructions

Manually trigger a one-off crawl task (optionally persisted)

Args:
  platforms: list of platform IDs, e.g. ['zhihu', 'weibo', 'douyin']
    - When omitted, every platform configured in config.yaml is used
    - Supported platforms come from the platforms section of config/config.yaml
    - Each platform has a name field (e.g. "知乎", "微博") so an AI can identify it
    - Note: platforms that fail are listed in the failed_platforms field of the result
  save_to_local: whether to save to the local output directory; default False
  include_url: whether to include URL links; default False (saves tokens)

Returns: task status information as JSON, containing:
  - platforms: list of platforms crawled successfully
  - failed_platforms: list of platforms that failed (if any)
  - total_news: total number of news items crawled
  - data: the news data

Examples:
  - One-off crawl: trigger_crawl(platforms=['zhihu'])
  - Crawl and save: trigger_crawl(platforms=['weibo'], save_to_local=True)
  - Default platforms: trigger_crawl()  # crawls every platform configured in config.yaml
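A caller can inspect the returned JSON to detect partial failures. The sketch below assumes a result shaped like the documented Returns fields; the `result` value and the `summarize` helper are illustrative, not part of the tool.

```python
# Illustrative result, shaped per the documented Returns fields
# (platforms, failed_platforms, total_news, data); values are made up.
result = {
    "platforms": ["zhihu", "weibo"],
    "failed_platforms": ["douyin"],
    "total_news": 2,
    "data": [
        {"platform_id": "zhihu", "platform_name": "知乎", "title": "Example A", "ranks": [1]},
        {"platform_id": "weibo", "platform_name": "微博", "title": "Example B", "ranks": [3, 7]},
    ],
}

def summarize(result: dict) -> str:
    """Summarize a trigger_crawl result, flagging any failed platforms."""
    ok = ", ".join(result["platforms"])
    failed = result.get("failed_platforms") or []
    msg = f"{result['total_news']} items from {ok}"
    if failed:
        msg += f"; failed: {', '.join(failed)}"
    return msg

print(summarize(result))
```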

Input Schema

Name           Required   Description                                          Default
platforms      No         List of platform IDs; all configured when omitted    (all configured)
save_to_local  No         Save results to the local output directory           False
include_url    No         Include URL links in the returned data               False
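All three parameters are optional. The implementation below calls a `validate_platforms` helper that is referenced but not defined on this page; a minimal sketch of what such a helper might do (its real rules may differ):

```python
from typing import List, Optional

def validate_platforms(platforms: Optional[List[str]]) -> Optional[List[str]]:
    """Hypothetical sketch: normalize and validate the `platforms` argument.

    Returns None to mean "use every platform configured in config.yaml".
    """
    if platforms is None:
        return None
    if not isinstance(platforms, list) or not all(isinstance(p, str) for p in platforms):
        raise ValueError("platforms must be a list of platform ID strings, e.g. ['zhihu', 'weibo']")
    # Drop duplicates while preserving order
    seen = set()
    unique = []
    for p in platforms:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique

print(validate_platforms(["zhihu", "weibo", "zhihu"]))
```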

Output Schema

Name     Required
result   Yes

Implementation Reference

  • The 'trigger_crawl' method in the 'SystemManagementTools' class implements the crawl trigger: it resolves the target platforms from configuration, fetches each platform with retries, and optionally saves the results.
    def trigger_crawl(self, platforms: Optional[List[str]] = None, save_to_local: bool = False, include_url: bool = False) -> Dict:
        """
        手动触发一次临时爬取任务(可选持久化)
    
        Args:
            platforms: 指定平台列表,为空则爬取所有平台
            save_to_local: 是否保存到本地 output 目录,默认 False
            include_url: 是否包含URL链接,默认False(节省token)
    
        Returns:
            爬取结果字典,包含新闻数据和保存路径(如果保存)
    
        Example:
            >>> tools = SystemManagementTools()
            >>> # 临时爬取,不保存
            >>> result = tools.trigger_crawl(platforms=['zhihu', 'weibo'])
            >>> print(result['data'])
            >>> # 爬取并保存到本地
            >>> result = tools.trigger_crawl(platforms=['zhihu'], save_to_local=True)
            >>> print(result['saved_files'])
        """
        try:
            import json
            import time
            import random
            import requests
            from datetime import datetime
            import pytz
            import yaml
    
            # Validate arguments
            platforms = validate_platforms(platforms)
    
            # Load the configuration file
            config_path = self.project_root / "config" / "config.yaml"
            if not config_path.exists():
                raise CrawlTaskError(
                    "Configuration file not found",
                    suggestion=f"Make sure the configuration file exists: {config_path}"
                )
    
            # Read the configuration
            with open(config_path, "r", encoding="utf-8") as f:
                config_data = yaml.safe_load(f)
    
            # Get the platform configuration
            all_platforms = config_data.get("platforms", [])
            if not all_platforms:
                raise CrawlTaskError(
                    "No platforms configured",
                    suggestion="Check the platforms section of config/config.yaml"
                )
    
            # Filter platforms
            if platforms:
                target_platforms = [p for p in all_platforms if p["id"] in platforms]
                if not target_platforms:
                    raise CrawlTaskError(
                        f"Unknown platforms: {platforms}",
                        suggestion=f"Available platforms: {[p['id'] for p in all_platforms]}"
                    )
            else:
                target_platforms = all_platforms
    
            # Get the request interval
            request_interval = config_data.get("crawler", {}).get("request_interval", 100)
    
            # Build the list of platform IDs
            ids = []
            for platform in target_platforms:
                if "name" in platform:
                    ids.append((platform["id"], platform["name"]))
                else:
                    ids.append(platform["id"])
    
            print(f"Starting one-off crawl, platforms: {[p.get('name', p['id']) for p in target_platforms]}")
    
            # Crawl the data
            results = {}
            id_to_name = {}
            failed_ids = []
    
            for i, id_info in enumerate(ids):
                if isinstance(id_info, tuple):
                    id_value, name = id_info
                else:
                    id_value = id_info
                    name = id_value
    
                id_to_name[id_value] = name
    
                # Build the request URL
                url = f"https://newsnow.busiyi.world/api/s?id={id_value}&latest"
    
                headers = {
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
                    "Accept": "application/json, text/plain, */*",
                    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
                    "Connection": "keep-alive",
                    "Cache-Control": "no-cache",
                }
    
                # Retry logic
                max_retries = 2
                retries = 0
                success = False
    
                while retries <= max_retries and not success:
                    try:
                        response = requests.get(url, headers=headers, timeout=10)
                        response.raise_for_status()
    
                        data_text = response.text
                        data_json = json.loads(data_text)
    
                        status = data_json.get("status", "unknown")
                        if status not in ["success", "cache"]:
                            raise ValueError(f"Unexpected response status: {status}")
    
                        status_info = "fresh data" if status == "success" else "cached data"
                        print(f"Fetched {id_value} successfully ({status_info})")
    
                        # Parse the data
                        results[id_value] = {}
                        for index, item in enumerate(data_json.get("items", []), 1):
                            title = item["title"]
                            url_link = item.get("url", "")
                            mobile_url = item.get("mobileUrl", "")
    
                            if title in results[id_value]:
                                results[id_value][title]["ranks"].append(index)
                            else:
                                results[id_value][title] = {
                                    "ranks": [index],
                                    "url": url_link,
                                    "mobileUrl": mobile_url,
                                }
    
                        success = True
    
                    except Exception as e:
                        retries += 1
                        if retries <= max_retries:
                            wait_time = random.uniform(3, 5)
                            print(f"Request for {id_value} failed: {e}. Retrying in {wait_time:.2f}s...")
                            time.sleep(wait_time)
                        else:
                            print(f"Request for {id_value} failed: {e}")
                            failed_ids.append(id_value)
    
                # Delay between requests
                if i < len(ids) - 1:
                    actual_interval = request_interval + random.randint(-10, 20)
                    actual_interval = max(50, actual_interval)
                    time.sleep(actual_interval / 1000)
    
            # Format the data for the response
            news_data = []
            for platform_id, titles_data in results.items():
                platform_name = id_to_name.get(platform_id, platform_id)
                for title, info in titles_data.items():
                    news_item = {
                        "platform_id": platform_id,
                        "platform_name": platform_name,
                        "title": title,
                        "ranks": info["ranks"]
                    }
                    if include_url:
                        news_item["url"] = info["url"]
                        news_item["mobileUrl"] = info["mobileUrl"]
                    news_data.append(news_item)

            # The remainder of the method (saving to the output directory when
            # save_to_local is set, and building the final return value) is
            # truncated in this excerpt.
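The parsing loop above merges duplicate titles by accumulating the ranks at which each title appears. That aggregation step, extracted as a standalone function for clarity (the item fields mirror those used in the excerpt; the sample data is made up):

```python
def aggregate_titles(items: list) -> dict:
    """Group API items by title, collecting every rank a title appears at."""
    results = {}
    for index, item in enumerate(items, 1):
        title = item["title"]
        if title in results:
            results[title]["ranks"].append(index)
        else:
            results[title] = {
                "ranks": [index],
                "url": item.get("url", ""),
                "mobileUrl": item.get("mobileUrl", ""),
            }
    return results

items = [
    {"title": "A", "url": "https://example.com/a"},
    {"title": "B"},
    {"title": "A"},  # duplicate title at rank 3
]
print(aggregate_titles(items))
```

A title trending at several positions thus yields one entry with `ranks` like `[1, 3]` rather than two separate entries.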
Behavior 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It does well by explaining the optional persistence ('可选持久化'), what happens with failed platforms, token-saving considerations for include_url, and the JSON return structure. However, it doesn't mention potential rate limits, authentication needs, or error handling beyond failed platforms.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with clear sections (Args, Returns, Examples) and front-loaded purpose statement. While comprehensive, some information could be more concise (e.g., the platforms explanation has some redundancy). Every sentence adds value, but the overall length is slightly above optimal.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (3 parameters, 0% schema coverage, no annotations, but has output schema), the description is remarkably complete. It covers purpose, parameters, return values, usage examples, and behavioral context. The output schema existence means the description doesn't need to detail return structure, which it appropriately references without over-explaining.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage for 3 parameters, the description fully compensates by providing rich semantic information. It explains platforms parameter options (list of IDs, null behavior, supported platforms from config), save_to_local's effect (save to output directory), and include_url's purpose (save tokens). The examples further clarify parameter usage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: '手动触发一次爬取任务' (manually trigger a crawling task). It specifies the verb ('触发' - trigger) and resource ('爬取任务' - crawling task), and distinguishes itself from sibling tools like get_latest_news or search_news by focusing on initiating a crawl rather than retrieving existing data.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use this tool versus alternatives. It explains that not specifying platforms uses all configured platforms from config.yaml, and gives concrete examples for different scenarios (temporary crawl, crawl with save, default platforms). This clearly defines the tool's context and how it differs from data retrieval siblings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
