scrape_webpage

Extract webpage content for analysis or storage. Specify a URL to scrape and optionally save the output as a TXT file using the FullScope-MCP server.
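For example, the tool can be invoked from a Python MCP client roughly as follows. This is a minimal sketch using the official `mcp` SDK; the launch command `uvx fullscope-mcp-server` is an assumption and should be replaced with whatever command starts your installation of the server.

    import asyncio
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def main() -> None:
        # Assumption: the server is started with `uvx fullscope-mcp-server`
        params = StdioServerParameters(command="uvx", args=["fullscope-mcp-server"])
        async with stdio_client(params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                # Call the tool with a URL and ask it to save the output
                result = await session.call_tool(
                    "scrape_webpage",
                    {"url": "https://example.com", "save_to_file": True},
                )
                print(result.content)

    asyncio.run(main())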

Instructions

Scrape webpage content and optionally save it as a txt file.

Args:
    url: The webpage URL to scrape
    save_to_file: Whether to save the content to a txt file

Returns:
    The scraped result, plus the file path if saved

Input Schema

Name          Required  Description                                 Default
url           Yes       The webpage URL to scrape                   (none)
save_to_file  No        Whether to save the content to a txt file   false

Output Schema

Name    Required  Description                                            Default
result  Yes       The formatted scrape result (and file path, if saved)  (none)
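The `result` field is a single formatted string. Based on the handler shown under Implementation Reference, a successful call with `save_to_file=True` returns text of roughly this shape (all values below are illustrative):

    # Illustrative only: title, content, and path depend on the scraped page
    example_result = (
        "Webpage title: Example Domain\n\n"
        "Webpage content:\n"
        "Example Domain This domain is for use in illustrative examples...\n\n"
        "Saved to file: /tmp/example.com_Example_Domain.txt"
    )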

Implementation Reference

  • The primary handler function for the 'scrape_webpage' tool. It is decorated with @mcp.tool() for automatic registration with FastMCP. It scrapes the given URL using the shared WebScraper instance, extracts the title and text content with BeautifulSoup, optionally saves the result to a temp file, and returns a formatted string.
    @mcp.tool()
    async def scrape_webpage(url: str, ctx: Context, save_to_file: bool = False) -> str:
        """
        抓取网页内容,可选保存为txt文件
        
        Args:
            url: 要抓取的网页URL
            save_to_file: 是否保存内容到txt文件
        
        Returns:
            抓取结果和文件路径(如果保存)
        """
        try:
            ctx.info(f"开始抓取网页: {url}")
            
            title, content = await scraper.scrape_url(url)
            
            result = f"网页标题: {title}\n\n网页内容:\n{content}"
            
            if save_to_file:
                # 生成文件名
                parsed_url = urlparse(url)
                filename = f"{parsed_url.netloc}_{title[:20]}.txt".replace(" ", "_")
                # 移除非法字符
                filename = "".join(c for c in filename if c.isalnum() or c in "._-")
                
                filepath = await scraper.save_content_to_file(result, filename)
                result += f"\n\n已保存到文件: {filepath}"
            
            ctx.info("网页抓取完成")
            return result
            
        except Exception as e:
            logger.error(f"网页抓取失败: {e}")
            return f"网页抓取失败: {str(e)}"
  • Key helper method in the WebScraper class. It performs an HTTP GET request with httpx, parses the HTML with BeautifulSoup, extracts the title, and cleans the content by removing script/style tags and normalizing whitespace.
    async def scrape_url(self, url: str) -> tuple[str, str]:
        """
        抓取网页内容
        
        Args:
            url: 目标URL
            
        Returns:
            (title, content): 网页标题和清理后的文本内容
        """
        try:
            response = await self.session.get(url)
            response.raise_for_status()
            
            # 使用BeautifulSoup解析HTML
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # 获取标题
            title = soup.find('title')
            title = title.get_text().strip() if title else "无标题"
            
            # 移除script和style标签
            for script in soup(["script", "style"]):
                script.decompose()
            
            # 提取主要内容
            content = soup.get_text()
            
            # 清理文本
            lines = (line.strip() for line in content.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            content = ' '.join(chunk for chunk in chunks if chunk)
            
            return title, content
            
        except Exception as e:
            logger.error(f"网页抓取失败 {url}: {e}")
            raise Exception(f"无法抓取网页: {str(e)}")
  • Helper method on the WebScraper class that saves scraped content to a temporary .txt file, generating a filename from a content hash when none is provided.
    async def save_content_to_file(self, content: str, filename: str = None) -> str:
        """
        保存内容到文件
        
        Args:
            content: 要保存的内容
            filename: 文件名(可选)
            
        Returns:
            保存的文件路径
        """
        if not filename:
            # 生成基于内容哈希的文件名
            content_hash = hashlib.md5(content.encode()).hexdigest()[:8]
            filename = f"scraped_content_{content_hash}.txt"
        
        # 确保文件扩展名为.txt
        if not filename.endswith('.txt'):
            filename += '.txt'
        
        filepath = Path(tempfile.gettempdir()) / filename
        
        try:
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(content)
            return str(filepath)
        except Exception as e:
            logger.error(f"保存文件失败: {e}")
            raise Exception(f"无法保存文件: {str(e)}")
  • Module-level singleton instances used by the tool handlers; the scrape_webpage handler relies on the WebScraper instance (the `scraper = WebScraper()` line is inferred from the handler's calls to scraper.scrape_url).
    scraper = WebScraper()
    file_processor = FileProcessor()
    rag_processor = RAGProcessor(summarizer)
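
For context, here is a minimal sketch of the WebScraper skeleton that the methods above hang off of, assuming an httpx.AsyncClient session (the constructor arguments are illustrative, not taken from the server source):

    import httpx

    class WebScraper:
        def __init__(self) -> None:
            # scrape_url awaits self.session.get(url), so the session is an
            # async HTTP client; httpx.AsyncClient matches the calls shown above.
            # The arguments below are assumptions, not the server's actual values.
            self.session = httpx.AsyncClient(follow_redirects=True, timeout=30.0)

        # async def scrape_url(self, url: str) -> tuple[str, str]: ...  (shown above)
        # async def save_content_to_file(self, content: str, filename: str = None) -> str: ...  (shown above)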

