# check_robots_txt
Analyze a website's robots.txt file to determine crawl permissions and ensure compliance with ethical web scraping practices. Provides insights into allowed and disallowed paths for crawling.
## Instructions
Check the robots.txt file for a domain to understand crawling permissions.
This tool helps ensure ethical scraping by checking the robots.txt file of a website to see what crawling rules are in place.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Website domain URL including a protocol prefix (http:// or https://); the tool checks that domain's /robots.txt, e.g. "https://example.com" checks "https://example.com/robots.txt". | - |
## Input Schema (JSON Schema)
```json
{
  "properties": {
    "url": {
      "title": "Url",
      "type": "string"
    }
  },
  "required": [
    "url"
  ],
  "type": "object"
}
```
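For orientation, the sketch below shows how a client could call this tool with an argument object matching the schema above, using the MCP Python SDK. This is a minimal sketch: the launch command (`python -m extractor.server`) and the example URL are assumptions about a local setup, not details taken from this page.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Assumed launch command for the extractor server; adjust to your deployment.
    params = StdioServerParameters(command="python", args=["-m", "extractor.server"])

    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Arguments must match the input schema: a single required "url" string.
            result = await session.call_tool(
                "check_robots_txt",
                {"url": "https://example.com"},
            )
            print(result.content)


asyncio.run(main())
```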
## Implementation Reference
- extractor/server.py:612-678 (handler): The handler function that implements the check_robots_txt tool logic. It fetches the site's robots.txt file and returns its contents so crawling permissions can be determined. The per-path permission check is currently a placeholder; see the sketch after this list.

  ```python
  @app.tool()
  async def check_robots_txt(
      url: Annotated[
          str,
          Field(
              ...,
              description="""Website domain URL; must include a protocol prefix (http:// or https://). The domain's robots.txt file will be checked. Example: "https://example.com" checks "https://example.com/robots.txt". Used to ensure ethical scraping by following the site's crawler rules.""",
          ),
      ],
  ) -> RobotsResponse:
      """
      Check the robots.txt file for a domain to understand crawling permissions.

      This tool helps ensure ethical scraping by checking the robots.txt file
      of a website to see what crawling rules are in place.

      Returns:
          RobotsResponse object containing success status, robots.txt content,
          base domain, and content availability. Helps determine crawling
          permissions and restrictions for the specified domain.
      """
      try:
          # Validate inputs
          parsed = urlparse(url)
          if not parsed.scheme or not parsed.netloc:
              raise ValueError("Invalid URL format")

          logger.info(f"Checking robots.txt for: {url}")

          # Parse URL to get base domain
          robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

          # Scrape robots.txt
          result = await web_scraper.simple_scraper.scrape(robots_url, extract_config={})

          if "error" in result:
              return RobotsResponse(
                  success=False,
                  url=url,
                  robots_txt_url=robots_url,
                  is_allowed=False,
                  user_agent="*",
                  error=f"Could not fetch robots.txt: {result['error']}",
              )

          robots_content = result.get("content", {}).get("text", "")

          return RobotsResponse(
              success=True,
              url=url,
              robots_txt_url=robots_url,
              robots_content=robots_content,
              is_allowed=True,  # Basic check, could be enhanced
              user_agent="*",
          )

      except Exception as e:
          logger.error(f"Error checking robots.txt for {url}: {str(e)}")
          return RobotsResponse(
              success=False,
              url=url,
              robots_txt_url="",
              is_allowed=False,
              user_agent="*",
              error=str(e),
          )
  ```
- extractor/server.py:119-129 (schema): Pydantic model defining the output schema for the check_robots_txt tool, including fields for success status, robots.txt content, allowance, and errors. An example instance is shown after this list.

  ```python
  class RobotsResponse(BaseModel):
      """Response model for robots.txt check."""

      success: bool = Field(..., description="Whether the operation succeeded")
      url: str = Field(..., description="The URL that was checked")
      robots_txt_url: str = Field(..., description="URL of the robots.txt file")
      robots_content: Optional[str] = Field(default=None, description="Contents of robots.txt")
      is_allowed: bool = Field(..., description="Whether crawling is allowed")
      user_agent: str = Field(..., description="User-Agent used for the check")
      error: Optional[str] = Field(default=None, description="Error message, if any")
  ```
- extractor/server.py:612-612 (registration): The @app.tool() decorator registers the check_robots_txt function as an MCP tool in the FastMCP application.

  ```python
  @app.tool()
  ```
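The handler above returns `is_allowed=True` as a placeholder (its inline comment notes the check is basic and could be enhanced). One way to evaluate a concrete path against the fetched rules is the standard library's `urllib.robotparser`; the sketch below is illustrative rather than part of the tool, and the helper name `path_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def path_allowed(
    robots_content: str,
    robots_url: str,
    target_url: str,
    user_agent: str = "*",
) -> bool:
    """Check a specific URL against already-fetched robots.txt text."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    # Feed the text the tool already scraped instead of re-fetching it.
    parser.parse(robots_content.splitlines())
    return parser.can_fetch(user_agent, target_url)


# Example: with "Disallow: /private/" for all agents, this prints False.
print(path_allowed(
    "User-agent: *\nDisallow: /private/",
    "https://example.com/robots.txt",
    "https://example.com/private/page",
))
```

The returned boolean could then populate `is_allowed` for a specific target path instead of the current constant.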
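To illustrate the output schema, the snippet below constructs a `RobotsResponse` with made-up values and serializes it. It assumes Pydantic v2 (`model_dump_json`; on v1 use `.json()`), and the import path is inferred from the file location above; the field values are purely illustrative.

```python
# Import path assumes the package layout implied by extractor/server.py.
from extractor.server import RobotsResponse

response = RobotsResponse(
    success=True,
    url="https://example.com",
    robots_txt_url="https://example.com/robots.txt",
    robots_content="User-agent: *\nDisallow: /private/",
    is_allowed=True,
    user_agent="*",
)
print(response.model_dump_json(indent=2))
```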