check_robots_txt

Fetch a website's robots.txt file to determine crawl permissions and support ethical web scraping. Returns the raw robots.txt rules, which show which paths are allowed or disallowed for crawling.

Instructions

Check the robots.txt file for a domain to understand crawling permissions.

This tool helps ensure ethical scraping by checking the robots.txt file of a website to see what crawling rules are in place.
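
For context, a robots.txt file expresses per-crawler rules with User-agent, Disallow, and Allow directives. A minimal illustrative sample (not taken from any real site):

    User-agent: *
    Disallow: /admin/
    Allow: /public/

    User-agent: BadBot
    Disallow: /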

Input Schema

Name    Required    Description                                                               Default
url     Yes         Website domain URL, including the protocol prefix (http:// or https://)  (none)

Implementation Reference

  • The handler function that implements the check_robots_txt tool logic. It fetches the site's robots.txt file and returns its raw content so crawling permissions can be reviewed.
    # Imports inferred from usage; `app` (the FastMCP instance),
    # `web_scraper`, and `logger` are defined elsewhere in the module.
    from typing import Annotated
    from urllib.parse import urlparse

    from pydantic import Field

    @app.tool()
    async def check_robots_txt(
        url: Annotated[
            str,
            Field(
                ...,
                description="""Website domain URL; must include the protocol prefix (http:// or https://). The robots.txt file of this domain will be checked.
                    Example: "https://example.com" checks "https://example.com/robots.txt". Used to ensure ethical scraping by following the site's crawler rules.""",
            ),
        ],
    ) -> RobotsResponse:
        """
        Check the robots.txt file for a domain to understand crawling permissions.
    
        This tool helps ensure ethical scraping by checking the robots.txt file
        of a website to see what crawling rules are in place.
    
        Returns:
            RobotsResponse containing the success status, the robots.txt URL and content,
            the user agent used, and an is_allowed flag indicating crawling permission.
        """
        try:
            # Validate inputs
            parsed = urlparse(url)
            if not parsed.scheme or not parsed.netloc:
                raise ValueError("Invalid URL format")
    
            logger.info(f"Checking robots.txt for: {url}")
    
            # Parse URL to get base domain
            robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    
            # Scrape robots.txt
            result = await web_scraper.simple_scraper.scrape(robots_url, extract_config={})
    
            if "error" in result:
                return RobotsResponse(
                    success=False,
                    url=url,
                    robots_txt_url=robots_url,
                    is_allowed=False,
                    user_agent="*",
                    error=f"Could not fetch robots.txt: {result['error']}",
                )
    
            robots_content = result.get("content", {}).get("text", "")
    
            return RobotsResponse(
                success=True,
                url=url,
                robots_txt_url=robots_url,
                robots_content=robots_content,
                is_allowed=True,  # Basic check, could be enhanced; see the robotparser sketch below
                user_agent="*",
            )
    
        except Exception as e:
            logger.error(f"Error checking robots.txt for {url}: {str(e)}")
            return RobotsResponse(
                success=False,
                url=url,
                robots_txt_url="",
                is_allowed=False,
                user_agent="*",
                error=str(e),
            )
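
  • The handler hard-codes is_allowed=True rather than evaluating the fetched rules. A minimal sketch of how the check could be computed with Python's standard urllib.robotparser (an illustration, not part of the server's code; is_url_allowed is a hypothetical helper):
    from urllib.robotparser import RobotFileParser

    def is_url_allowed(robots_content: str, robots_url: str, url: str, user_agent: str = "*") -> bool:
        """Evaluate fetched robots.txt rules for a specific URL and user agent."""
        parser = RobotFileParser()
        parser.set_url(robots_url)  # informational; the rules are supplied via parse()
        parser.parse(robots_content.splitlines())
        return parser.can_fetch(user_agent, url)
    The handler could then set is_allowed=is_url_allowed(robots_content, robots_url, url) before building the response.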
  • Pydantic model defining the output schema for the check_robots_txt tool, including fields for success status, robots.txt content, allowance, and errors.
    from typing import Optional

    from pydantic import BaseModel, Field

    class RobotsResponse(BaseModel):
        """Response model for robots.txt check."""

        success: bool = Field(..., description="Whether the operation succeeded")
        url: str = Field(..., description="The URL that was checked")
        robots_txt_url: str = Field(..., description="URL of the robots.txt file")
        robots_content: Optional[str] = Field(default=None, description="Content of robots.txt")
        is_allowed: bool = Field(..., description="Whether crawling is allowed")
        user_agent: str = Field(..., description="User-Agent used for the check")
        error: Optional[str] = Field(default=None, description="Error message, if any")
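
  • A minimal usage sketch for the model, assuming Pydantic v2 (model_dump_json; under v1 the equivalent call is .json()):
    resp = RobotsResponse(
        success=True,
        url="https://example.com",
        robots_txt_url="https://example.com/robots.txt",
        robots_content="User-agent: *\nDisallow: /admin/",
        is_allowed=True,
        user_agent="*",
    )
    print(resp.model_dump_json(indent=2))  # serialized tool result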
  • The @app.tool() decorator registers the check_robots_txt function as an MCP tool in the FastMCP application.
    @app.tool()
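
  • For context, a minimal sketch of the surrounding FastMCP wiring, assuming the MCP Python SDK's FastMCP class; the server's actual module layout and setup may differ:
    from mcp.server.fastmcp import FastMCP

    app = FastMCP("scrapy-mcp")  # assumed server name, taken from the repo slug

    @app.tool()
    async def check_robots_txt(url: str) -> RobotsResponse:
        ...  # handler body as shown above

    if __name__ == "__main__":
        app.run()  # serves over stdio by default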
