scrape_webpage

Extract webpage content for analysis or storage. Specify a URL to scrape and optionally save the output as a TXT file using the FullScope-MCP server.

Instructions

Scrape webpage content, optionally saving it as a .txt file.

Args:
    url: The URL of the webpage to scrape
    save_to_file: Whether to save the content to a .txt file

Returns:
    The scraped result, plus the file path if saved

Input Schema

Name          Required  Description  Default
save_to_file  No
url           Yes

Output Schema

Name    Required  Description  Default
result  Yes

Implementation Reference

  • The primary handler function for the 'scrape_webpage' tool. Decorated with @mcp.tool() for automatic registration in FastMCP. Scrapes the webpage URL using the WebScraper instance, extracts title and content with BeautifulSoup, optionally saves to a temp file, and returns formatted result.
    @mcp.tool()
    async def scrape_webpage(url: str, ctx: Context, save_to_file: bool = False) -> str:
        """
        Scrape webpage content, optionally saving it as a .txt file.
        
        Args:
            url: The URL of the webpage to scrape
            save_to_file: Whether to save the content to a .txt file
        
        Returns:
            The scraped result, plus the file path if saved
        """
        try:
            ctx.info(f"Starting to scrape webpage: {url}")
            
            title, content = await scraper.scrape_url(url)
            
            result = f"Page title: {title}\n\nPage content:\n{content}"
            
            if save_to_file:
                # Build a filename from the host and the first 20 title characters
                parsed_url = urlparse(url)
                filename = f"{parsed_url.netloc}_{title[:20]}.txt".replace(" ", "_")
                # Strip characters that are not alphanumeric, '.', '_', or '-'
                filename = "".join(c for c in filename if c.isalnum() or c in "._-")
                
                filepath = await scraper.save_content_to_file(result, filename)
                result += f"\n\nSaved to file: {filepath}"
            
            ctx.info("Webpage scraping complete")
            return result
            
        except Exception as e:
            logger.error(f"Webpage scraping failed: {e}")
            return f"Webpage scraping failed: {str(e)}"
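The filename sanitization inside the handler can be exercised in isolation. A minimal sketch, where the `make_filename` helper is hypothetical and simply mirrors the handler's inline logic:

```python
from urllib.parse import urlparse

def make_filename(url: str, title: str) -> str:
    # Hypothetical helper mirroring the handler's inline logic:
    # host + first 20 title characters, spaces replaced by underscores,
    # then drop anything that is not alphanumeric, '.', '_', or '-'.
    parsed = urlparse(url)
    name = f"{parsed.netloc}_{title[:20]}.txt".replace(" ", "_")
    return "".join(c for c in name if c.isalnum() or c in "._-")

print(make_filename("https://example.com/a", "Hello World: A Test"))
# example.com_Hello_World_A_Test.txt
```

Note that the punctuation filter runs after truncation, so characters like ':' are removed rather than replaced.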
  • Key helper method in the WebScraper class: performs an HTTP GET request with httpx, parses the HTML with BeautifulSoup, extracts the title, and cleans the content by removing script/style tags and normalizing whitespace.
    async def scrape_url(self, url: str) -> tuple[str, str]:
        """
        Scrape webpage content.
        
        Args:
            url: The target URL
            
        Returns:
            (title, content): the page title and the cleaned text content
        """
        try:
            response = await self.session.get(url)
            response.raise_for_status()
            
            # Parse the HTML with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Extract the title
            title = soup.find('title')
            title = title.get_text().strip() if title else "Untitled"
            
            # Remove script and style tags
            for script in soup(["script", "style"]):
                script.decompose()
            
            # Extract the main text content
            content = soup.get_text()
            
            # Clean the text
            lines = (line.strip() for line in content.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            content = ' '.join(chunk for chunk in chunks if chunk)
            
            return title, content
            
        except Exception as e:
            logger.error(f"Webpage scraping failed {url}: {e}")
            raise Exception(f"Unable to scrape webpage: {str(e)}")
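The text-cleaning idiom at the end of scrape_url can be checked on its own: it strips each line, splits on runs of double spaces, and re-joins the non-empty chunks with single spaces. A standalone sketch with a made-up input string:

```python
raw = "  Line one  has   spaces  \n\n  Line two  \n"

# Same three-step pipeline as in scrape_url
lines = (line.strip() for line in raw.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
cleaned = " ".join(chunk for chunk in chunks if chunk)

print(cleaned)  # Line one has spaces Line two
```

A side effect worth noting: blank lines (paragraph breaks) collapse into single spaces, so the returned content loses paragraph structure.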
  • Helper method in the WebScraper class that saves scraped content to a temporary .txt file, generating a filename from a content hash if none is provided.
    async def save_content_to_file(self, content: str, filename: str = None) -> str:
        """
        Save content to a file.
        
        Args:
            content: The content to save
            filename: The filename (optional)
            
        Returns:
            The path of the saved file
        """
        if not filename:
            # Generate a filename based on a hash of the content
            content_hash = hashlib.md5(content.encode()).hexdigest()[:8]
            filename = f"scraped_content_{content_hash}.txt"
        
        # Ensure the file extension is .txt
        if not filename.endswith('.txt'):
            filename += '.txt'
        
        filepath = Path(tempfile.gettempdir()) / filename
        
        try:
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(content)
            return str(filepath)
        except Exception as e:
            logger.error(f"Failed to save file: {e}")
            raise Exception(f"Unable to save file: {str(e)}")
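The fallback filename generation can be sketched separately; the exact hash digits depend on the content (the sample string here is made up), so only the shape of the result is asserted:

```python
import hashlib

content = "example scraped text"
# First 8 hex characters of the MD5 digest, as in save_content_to_file
content_hash = hashlib.md5(content.encode()).hexdigest()[:8]
filename = f"scraped_content_{content_hash}.txt"
print(filename)
```

Because the hash is derived from the content, saving identical content twice produces the same filename and silently overwrites the earlier file.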
  • Global instance of the WebScraper class used by the scrape_webpage handler.
    scraper = WebScraper()
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral disclosure. It mentions optional file saving but lacks critical details: whether the tool performs authentication, respects robots.txt, handles rate limits, manages errors, or what format the scraped content takes (e.g., raw HTML, cleaned text). For a web scraping tool with zero annotation coverage, this leaves significant behavioral gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured and concise, using a clear header format with Args and Returns sections. Each sentence earns its place by defining purpose and parameters efficiently. It could be slightly more front-loaded by stating the core purpose first, but overall it avoids redundancy and waste.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (web scraping with file output), no annotations, and an output schema present (which handles return values), the description is minimally adequate. It covers the basic operation and parameters but lacks context on scraping behavior, error handling, or integration with siblings. The output schema relieves the description from detailing return values, but more behavioral context is needed for full completeness.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description adds basic semantics for both parameters: 'url: 要抓取的网页URL' (URL of the webpage to scrape) and 'save_to_file: 是否保存内容到txt文件' (whether to save content to txt file). With 0% schema description coverage, this compensates somewhat by explaining what each parameter does. However, it doesn't provide format details (e.g., URL validation) or file path behavior, keeping it at the baseline 3.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
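As a concrete illustration of the missing URL validation the critique points to, a hypothetical pre-flight check (not part of the tool) could reject non-absolute URLs before any request is made:

```python
from urllib.parse import urlparse

def looks_like_http_url(url: str) -> bool:
    # Hypothetical check: accept only absolute http/https URLs with a host
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(looks_like_http_url("https://example.com/page"))  # True
print(looks_like_http_url("example.com/page"))          # False
```

Documenting a constraint like this in the parameter description would tell the agent up front which url values will fail.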

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: '抓取网页内容' (scrape webpage content) with an optional '保存为txt文件' (save as txt file). It specifies the verb (scrape) and resource (webpage content), distinguishing it from sibling tools like summarize_webpage or topic_based_summary that process rather than extract content. However, it doesn't explicitly differentiate from hypothetical scraping alternatives, keeping it at 4 instead of 5.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It doesn't mention sibling tools like summarize_webpage for summarization tasks or other content extraction methods. There's no context about prerequisites, limitations, or typical use cases, leaving the agent with minimal usage direction.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
