check_robots_txt

Analyze a website's robots.txt file to determine crawl permissions and ensure compliance with ethical web scraping practices. Provides insights into allowed and disallowed paths for crawling.

Instructions

Check the robots.txt file for a domain to understand crawling permissions.

This tool helps ensure ethical scraping by checking the robots.txt file of a website to see what crawling rules are in place.
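For background, robots.txt maps URL paths to per-user-agent allow/deny rules. The short sketch below, which uses Python's standard-library urllib.robotparser and a made-up rule set (not taken from any real site or from this server's code), shows how such rules translate into crawl permissions:

from urllib.robotparser import RobotFileParser

# Example rule set: everything is allowed except paths under /private/
rules = """
User-agent: *
Disallow: /private/
Allow: /public/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/public/page"))   # expected: True
print(rp.can_fetch("*", "https://example.com/private/page"))  # expected: False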

Input Schema

Name | Required | Description | Default
url  | Yes      | Website domain URL including the protocol prefix; the tool checks that domain's /robots.txt | —

Input Schema (JSON Schema)

{ "properties": { "url": { "title": "Url", "type": "string" } }, "required": [ "url" ], "type": "object" }

Implementation Reference

  • The handler function that implements the check_robots_txt tool logic. It fetches the domain's robots.txt file and returns its contents so crawl permissions can be determined (a sketch of a stricter permission check follows this list).
    @app.tool()
    async def check_robots_txt(
        url: Annotated[
            str,
            Field(
                ...,
                description="""Website domain URL; must include a protocol prefix (http:// or https://).
                The tool checks that domain's robots.txt file.
                Example: "https://example.com" checks "https://example.com/robots.txt".
                Used to ensure ethical scraping by following the site's crawler rules.""",
            ),
        ],
    ) -> RobotsResponse:
        """
        Check the robots.txt file for a domain to understand crawling permissions.

        This tool helps ensure ethical scraping by checking the robots.txt file of a
        website to see what crawling rules are in place.

        Returns:
            RobotsResponse object containing success status, robots.txt content, base
            domain, and content availability. Helps determine crawling permissions and
            restrictions for the specified domain.
        """
        try:
            # Validate inputs
            parsed = urlparse(url)
            if not parsed.scheme or not parsed.netloc:
                raise ValueError("Invalid URL format")

            logger.info(f"Checking robots.txt for: {url}")

            # Parse URL to get base domain
            robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

            # Scrape robots.txt
            result = await web_scraper.simple_scraper.scrape(robots_url, extract_config={})

            if "error" in result:
                return RobotsResponse(
                    success=False,
                    url=url,
                    robots_txt_url=robots_url,
                    is_allowed=False,
                    user_agent="*",
                    error=f"Could not fetch robots.txt: {result['error']}",
                )

            robots_content = result.get("content", {}).get("text", "")

            return RobotsResponse(
                success=True,
                url=url,
                robots_txt_url=robots_url,
                robots_content=robots_content,
                is_allowed=True,  # Basic check, could be enhanced
                user_agent="*",
            )

        except Exception as e:
            logger.error(f"Error checking robots.txt for {url}: {str(e)}")
            return RobotsResponse(
                success=False,
                url=url,
                robots_txt_url="",
                is_allowed=False,
                user_agent="*",
                error=str(e),
            )
  • Pydantic model defining the output schema for the check_robots_txt tool, including fields for success status, robots.txt content, crawl permission, and errors.
    class RobotsResponse(BaseModel):
        """Response model for robots.txt check."""

        success: bool = Field(..., description="Whether the operation succeeded")
        url: str = Field(..., description="The URL that was checked")
        robots_txt_url: str = Field(..., description="URL of the robots.txt file")
        robots_content: Optional[str] = Field(default=None, description="Contents of robots.txt")
        is_allowed: bool = Field(..., description="Whether crawling is allowed")
        user_agent: str = Field(..., description="User-Agent used for the check")
        error: Optional[str] = Field(default=None, description="Error message, if any")
  • The @app.tool() decorator registers the check_robots_txt function as an MCP tool in the FastMCP application.
    @app.tool()
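The handler above hard-codes is_allowed=True with a note that the check could be enhanced. A minimal sketch of such an enhancement, assuming access to the fetched robots_content and the original url, is to evaluate the URL against the parsed rules with Python's standard urllib.robotparser. The helper below is hypothetical and not part of the server:

from urllib.robotparser import RobotFileParser

def is_crawl_allowed(robots_content: str, url: str, user_agent: str = "*") -> bool:
    # Parse the raw robots.txt text and ask whether user_agent may fetch url.
    parser = RobotFileParser()
    parser.parse(robots_content.splitlines())
    return parser.can_fetch(user_agent, url)

# The handler could then set is_allowed=is_crawl_allowed(robots_content, url)
# instead of returning the hard-coded True.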


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ThreeFish-AI/scrapy-mcp'
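If you prefer to call the API from code, a minimal sketch using the requests library is shown below; the response schema is not documented here, so the example simply prints whatever JSON the API returns:

import requests

# Fetch the directory entry for this server and print the parsed JSON response.
resp = requests.get("https://glama.ai/api/mcp/v1/servers/ThreeFish-AI/scrapy-mcp")
resp.raise_for_status()
print(resp.json())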

If you have feedback or need assistance with the MCP directory API, please join our Discord server.