# invoke_firecrawl_llmtxt
Crawl a website to generate a standardized markdown file (`llmfull.txt`) for LLM inference, extracting data with gpt-4o-mini. Results are uploaded to a specified S3 URI for downstream processing.
## Instructions
Start an asynchronous `llmfull.txt` generation job using Firecrawl. The resulting file is a standardized markdown file containing information to help LLMs use a website at inference time. The llmstxt endpoint leverages Firecrawl to crawl the website and extracts data using gpt-4o-mini.

Args:
- `url`: URL to crawl
- `s3_uri`: S3 URI where results will be uploaded
- `max_urls`: Maximum number of pages to crawl (1-100, default: 10)
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| max_urls | No | Maximum number of pages to crawl (1-100) | 10 |
| s3_uri | Yes | S3 URI where results will be uploaded | |
| url | Yes | URL to crawl | |
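The schema above can be checked client-side before invoking the tool. The sketch below is a hypothetical pre-flight validator mirroring the documented constraints; the helper name `validate_llmtxt_args` is an assumption, not part of the tool.

```python
def validate_llmtxt_args(url: str, s3_uri: str, max_urls: int = 10) -> None:
    """Hypothetical pre-flight check mirroring the documented input schema."""
    if not url:
        raise ValueError("url is required")
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with 's3://'")
    if not 1 <= max_urls <= 100:
        raise ValueError("max_urls must be between 1 and 100")


# Valid arguments pass silently; out-of-range max_urls raises ValueError.
validate_llmtxt_args("https://example.com", "s3://my-bucket/results/")
```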
## Implementation Reference
- The main handler function for the `invoke_firecrawl_llmtxt` MCP tool. It prepares job-specific parameters and delegates to the generic `_invoke_firecrawl_job` helper to start the Firecrawl llmfulltxt generation job.

```python
async def invoke_firecrawl_llmtxt(
    url: str,
    s3_uri: str,
    max_urls: int = 10,
) -> Dict[str, Any]:
    """Start an asynchronous llmfull.txt generation job using Firecrawl.

    This file is a standardized markdown file containing information to help
    LLMs use a website at inference time. The llmstxt endpoint leverages
    Firecrawl to crawl your website and extracts data using gpt-4o-mini.

    Args:
        url: URL to crawl
        s3_uri: S3 URI where results will be uploaded
        max_urls: Maximum number of pages to crawl (1-100, default: 10)

    Returns:
        Dictionary with job information including the job ID
    """
    # Call the generic invoke function with llmfull.txt-specific parameters
    params = {"maxUrls": max_urls, "showFullText": False}

    return await _invoke_firecrawl_job(
        url=url,
        s3_uri=s3_uri,
        job_type="llmfulltxt",
        job_params=params,
    )
```
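Note that the handler translates its snake_case arguments into Firecrawl's camelCase job parameters. A tiny standalone sketch of that translation (the helper name `build_llmtxt_params` is illustrative, not part of the source):

```python
def build_llmtxt_params(max_urls: int = 10) -> dict:
    # Mirrors the params dict built inside invoke_firecrawl_llmtxt.
    return {"maxUrls": max_urls, "showFullText": False}


params = build_llmtxt_params()  # default of 10 pages
```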
- Core helper function that initializes the Firecrawl client, starts the llmfulltxt job, creates a background task for completion handling, and returns the job status.

```python
async def _invoke_firecrawl_job(
    url: str,
    s3_uri: str,
    job_type: Firecrawl_JobType,
    job_params: Dict[str, Any],
) -> Dict[str, Any]:
    """Generic function to start a Firecrawl job (either HTML crawl or llmfull.txt generation).

    Args:
        url: URL to process
        s3_uri: S3 URI where results will be uploaded
        job_type: Type of job ('crawlhtml' or 'llmtxt')
        job_params: Parameters specific to the job type

    Returns:
        Dictionary with job information including the job ID
    """
    # Get configuration with API key
    config = _prepare_firecrawl_config()

    # Check if config contains an error
    if "error" in config:
        return {"error": config["error"]}

    # Validate and normalize S3 URI first -
    # doing this outside the try block to handle validation errors specifically
    try:
        validated_s3_uri = _ensure_valid_s3_uri(s3_uri)
    except ValueError as ve:
        return {"error": f"Invalid S3 URI: {str(ve)}"}

    try:
        # Initialize the Firecrawl client
        firecrawl = FirecrawlApp(api_key=config["api_key"])

        # Start the job based on job_type
        if job_type == "crawlhtml":
            job_status = firecrawl.async_crawl_url(url, params=job_params)
        elif job_type == "llmfulltxt":
            job_status = firecrawl.async_generate_llms_text(url, params=job_params)
        else:
            return {"error": f"Unknown job type: {job_type}"}

        # Handle the response
        if "id" in job_status:
            job_id = job_status["id"]

            # Start background task without waiting for it
            asyncio.create_task(wait_for_job_completion(job_id, validated_s3_uri, job_type))

            # Prepare and return the response
            response = {
                "id": job_id,
                "status": job_status.get("status", "started"),
                "s3_uri": f"{validated_s3_uri}{job_id}/",
                "message": f"Firecrawl {job_type} job started "
                f"and will be auto-processed when complete",
            }
            return response
        else:
            return {"error": f"Failed to start Firecrawl {job_type} job", "details": job_status}
    except Exception as e:
        return {"error": f"Error starting Firecrawl {job_type} job: {str(e)}"}
```
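The helper returns as soon as it schedules `wait_for_job_completion` via `asyncio.create_task`, rather than awaiting it. A minimal, self-contained sketch of this fire-and-forget pattern, with a stand-in coroutine in place of the real poller:

```python
import asyncio

results = []


async def wait_for_completion(job_id: str) -> None:
    # Stand-in for the real poller: pretend the job finishes after a short wait.
    await asyncio.sleep(0.01)
    results.append(job_id)


async def start_job(job_id: str) -> dict:
    # Schedule the poller without awaiting it, then return immediately.
    asyncio.create_task(wait_for_completion(job_id))
    return {"id": job_id, "status": "started"}


async def main() -> None:
    response = await start_job("job-123")
    assert response["status"] == "started"
    assert results == []        # poller has not run yet
    await asyncio.sleep(0.05)   # yield so the background task can finish
    assert results == ["job-123"]


asyncio.run(main())
```

Because the caller gets its response before the crawl finishes, the S3 prefix in the response points at where results *will* appear once the background task uploads them.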
- `uns_mcp/connectors/external/__init__.py:22-22` (registration): MCP tool registration for `invoke_firecrawl_llmtxt` using the FastMCP decorator.

```python
mcp.tool()(invoke_firecrawl_llmtxt)
```
- Helper to retrieve and validate the Firecrawl API key from the environment variable.

```python
def _prepare_firecrawl_config() -> Dict[str, str]:
    """Prepare the Firecrawl configuration by retrieving and validating the API key.

    Returns:
        A dictionary containing either an API key or an error message
    """
    api_key = os.getenv("FIRECRAWL_API_KEY")
    if not api_key:
        return {
            "error": "Firecrawl API key is required. Set FIRECRAWL_API_KEY environment variable.",
        }

    return {"api_key": api_key}
```
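The look-up-and-validate pattern above can be exercised directly. This sketch re-creates the helper's logic (under the hypothetical name `prepare_config`) so it can run against a scratch environment:

```python
import os


def prepare_config(var: str = "FIRECRAWL_API_KEY") -> dict:
    # Mirror of _prepare_firecrawl_config: error dict when unset, key dict when set.
    api_key = os.getenv(var)
    if not api_key:
        return {"error": f"Firecrawl API key is required. Set {var} environment variable."}
    return {"api_key": api_key}


os.environ.pop("FIRECRAWL_API_KEY", None)
missing = prepare_config()                      # error dict

os.environ["FIRECRAWL_API_KEY"] = "fc-test-key"
present = prepare_config()                      # api_key dict
```

Returning an error dict instead of raising keeps the MCP tool's response shape uniform: callers always receive a dictionary, whether the job started or configuration failed.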
- Helper to validate and normalize the S3 URI format.

```python
def _ensure_valid_s3_uri(s3_uri: str) -> str:
    """Ensure S3 URI is properly formatted.

    Args:
        s3_uri: S3 URI to validate

    Returns:
        Properly formatted S3 URI

    Raises:
        ValueError: If S3 URI doesn't start with 's3://'
    """
    if not s3_uri:
        raise ValueError("S3 URI is required")

    if not s3_uri.startswith("s3://"):
        raise ValueError("S3 URI must start with 's3://'")

    # Ensure URI ends with a slash
    if not s3_uri.endswith("/"):
        s3_uri += "/"

    return s3_uri
```
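The validator's normalization behavior can be demonstrated standalone; this snippet copies the same checks (as a plain function) so the examples run without the rest of the module:

```python
def ensure_valid_s3_uri(s3_uri: str) -> str:
    # Same checks as _ensure_valid_s3_uri above.
    if not s3_uri:
        raise ValueError("S3 URI is required")
    if not s3_uri.startswith("s3://"):
        raise ValueError("S3 URI must start with 's3://'")
    if not s3_uri.endswith("/"):
        s3_uri += "/"
    return s3_uri


normalized = ensure_valid_s3_uri("s3://my-bucket/results")   # trailing slash added
unchanged = ensure_valid_s3_uri("s3://my-bucket/results/")   # already valid
```

The guaranteed trailing slash matters downstream: `_invoke_firecrawl_job` builds the per-job prefix as `f"{validated_s3_uri}{job_id}/"`, which would produce a malformed key without it.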