scrape_webpage

Extract webpage content for analysis or storage. Specify a URL to scrape and optionally save the output as a TXT file using the FullScope-MCP server.

Instructions

Scrape webpage content, optionally saving it as a TXT file.

Args:
    url: The URL of the webpage to scrape
    save_to_file: Whether to save the content to a TXT file

Returns:
    The scrape result, and the file path if the content was saved

Input Schema

Name          Required  Description                                 Default
url           Yes       The URL of the webpage to scrape
save_to_file  No        Whether to save the content to a TXT file  false
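
For illustration, a client invoking this tool might pass an arguments object like the following (the URL is a placeholder):

    # Hypothetical arguments payload for a scrape_webpage call
    arguments = {
        "url": "https://example.com",  # required: the page to scrape
        "save_to_file": True,          # optional: defaults to False
    }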

Implementation Reference

  • The primary handler function for the 'scrape_webpage' tool. Decorated with @mcp.tool() for automatic registration with FastMCP. It scrapes the given URL using the WebScraper instance, which extracts the title and content with BeautifulSoup; it optionally saves the result to a temp file and returns the formatted output.
    @mcp.tool()
    async def scrape_webpage(url: str, ctx: Context, save_to_file: bool = False) -> str:
        """
        抓取网页内容,可选保存为txt文件
        
        Args:
            url: 要抓取的网页URL
            save_to_file: 是否保存内容到txt文件
        
        Returns:
            抓取结果和文件路径(如果保存)
        """
        try:
            ctx.info(f"开始抓取网页: {url}")
            
            title, content = await scraper.scrape_url(url)
            
            result = f"网页标题: {title}\n\n网页内容:\n{content}"
            
            if save_to_file:
                # 生成文件名
                parsed_url = urlparse(url)
                filename = f"{parsed_url.netloc}_{title[:20]}.txt".replace(" ", "_")
                # 移除非法字符
                filename = "".join(c for c in filename if c.isalnum() or c in "._-")
                
                filepath = await scraper.save_content_to_file(result, filename)
                result += f"\n\n已保存到文件: {filepath}"
            
            ctx.info("网页抓取完成")
            return result
            
        except Exception as e:
            logger.error(f"网页抓取失败: {e}")
            return f"网页抓取失败: {str(e)}"
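
    For illustration, the filename sanitization above produces names like the following. This is a standalone sketch that mirrors the handler's logic; the URL and title values are made up:

        from urllib.parse import urlparse
        
        url = "https://example.com/articles/1"
        title = "Example Domain"
        
        # Same two-step sanitization as in scrape_webpage above
        filename = f"{urlparse(url).netloc}_{title[:20]}.txt".replace(" ", "_")
        filename = "".join(c for c in filename if c.isalnum() or c in "._-")
        print(filename)  # example.com_Example_Domain.txt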
  • Key helper method in the WebScraper class that performs an HTTP GET request with httpx, parses the HTML with BeautifulSoup, extracts the title, and cleans the content by removing script/style tags and normalizing the text.
    async def scrape_url(self, url: str) -> tuple[str, str]:
        """
        抓取网页内容
        
        Args:
            url: 目标URL
            
        Returns:
            (title, content): 网页标题和清理后的文本内容
        """
        try:
            response = await self.session.get(url)
            response.raise_for_status()
            
            # 使用BeautifulSoup解析HTML
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # 获取标题
            title = soup.find('title')
            title = title.get_text().strip() if title else "无标题"
            
            # 移除script和style标签
            for script in soup(["script", "style"]):
                script.decompose()
            
            # 提取主要内容
            content = soup.get_text()
            
            # 清理文本
            lines = (line.strip() for line in content.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            content = ' '.join(chunk for chunk in chunks if chunk)
            
            return title, content
            
        except Exception as e:
            logger.error(f"网页抓取失败 {url}: {e}")
            raise Exception(f"无法抓取网页: {str(e)}")
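
    The self.session attribute is not shown in this excerpt. A minimal sketch of how such an httpx client might be initialized follows; the constructor signature and header values are assumptions, not the server's actual code:

        import httpx
        
        class WebScraper:
            def __init__(self, timeout: float = 30.0):
                # Assumed setup: an async client that follows redirects and times out
                self.session = httpx.AsyncClient(
                    timeout=timeout,
                    follow_redirects=True,
                    headers={"User-Agent": "Mozilla/5.0 (compatible; FullScope-MCP)"},
                )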
  • Helper method in the WebScraper class that saves scraped content to a temporary .txt file, generating a filename from a content hash when none is provided.
    async def save_content_to_file(self, content: str, filename: str | None = None) -> str:
        """
        Save content to a file.
        
        Args:
            content: The content to save
            filename: The filename (optional)
            
        Returns:
            The path of the saved file
        """
        if not filename:
            # Generate a filename from a hash of the content
            content_hash = hashlib.md5(content.encode()).hexdigest()[:8]
            filename = f"scraped_content_{content_hash}.txt"
        
        # Ensure the file extension is .txt
        if not filename.endswith('.txt'):
            filename += '.txt'
        
        filepath = Path(tempfile.gettempdir()) / filename
        
        try:
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(content)
            return str(filepath)
        except Exception as e:
            logger.error(f"Failed to save file: {e}")
            raise Exception(f"Unable to save file: {str(e)}") from e
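
    For example, the hash-based fallback above yields deterministic names of the form scraped_content_<hash>.txt:

        import hashlib
        
        content = "Webpage title: Example Domain\n\nWebpage content:\n..."
        content_hash = hashlib.md5(content.encode()).hexdigest()[:8]
        # Prints "scraped_content_" followed by the first 8 hex digits of the MD5
        print(f"scraped_content_{content_hash}.txt")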
  • Global instances used by the tool handlers. The scrape_webpage handler relies on a module-level WebScraper instance; its construction is not shown in this excerpt, so the first line below is an assumed form:
    scraper = WebScraper()  # assumed: global WebScraper used by scrape_webpage
    file_processor = FileProcessor()
    rag_processor = RAGProcessor(summarizer)