Unstructured-IO

Unstructured API MCP Server

Official

invoke_firecrawl_llmtxt

Crawl websites to generate structured LLM-ready markdown files using Firecrawl, extracting data for AI inference and storing results in S3.

Instructions

Start an asynchronous llmfull.txt generation job using Firecrawl. The resulting llmfull.txt file is a standardized markdown file containing information to help LLMs use a website at inference time. The llmstxt endpoint leverages Firecrawl to crawl the website and extract data using gpt-4o-mini.

Args:
    url: URL to crawl
    s3_uri: S3 URI where results will be uploaded
    max_urls: Maximum number of pages to crawl (1-100, default: 10)

Returns:
    Dictionary with job information including the job ID

Input Schema

| Name | Required | Description | Default |
|----------|----------|----------------------------------------------|---------|
| url | Yes | URL to crawl | |
| s3_uri | Yes | S3 URI where results will be uploaded | |
| max_urls | No | Maximum number of pages to crawl (1-100) | 10 |
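As a sketch, the arguments a client would pass for this tool look like the dictionary below. The URL and bucket values are placeholders, not from the source; only the key names and constraints come from the schema above.

```python
# Arguments for an invoke_firecrawl_llmtxt call, matching the input schema.
# url and s3_uri are required; max_urls is optional and defaults to 10.
arguments = {
    "url": "https://example.com",        # placeholder URL
    "s3_uri": "s3://my-bucket/llmtxt/",  # hypothetical bucket
    "max_urls": 25,                      # must be within 1-100
}

# Client-side checks mirroring the schema's constraints.
assert arguments["s3_uri"].startswith("s3://")
assert 1 <= arguments.get("max_urls", 10) <= 100
```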

Implementation Reference

  • Main handler function for the invoke_firecrawl_llmtxt tool. Prepares parameters and delegates to the generic _invoke_firecrawl_job for starting the Firecrawl llmfulltxt job.
    async def invoke_firecrawl_llmtxt(
        url: str,
        s3_uri: str,
        max_urls: int = 10,
    ) -> Dict[str, Any]:
        """Start an asynchronous llmfull.txt generation job using Firecrawl.
        This file is a standardized markdown file containing information to help LLMs
        use a website at inference time.
        The llmstxt endpoint leverages Firecrawl to crawl your website and extract data
        using gpt-4o-mini.
        Args:
            url: URL to crawl
            s3_uri: S3 URI where results will be uploaded
            max_urls: Maximum number of pages to crawl (1-100, default: 10)
    
        Returns:
            Dictionary with job information including the job ID
        """
        # Call the generic invoke function with llmfull.txt-specific parameters
        params = {"maxUrls": max_urls, "showFullText": False}
    
        return await _invoke_firecrawl_job(
            url=url,
            s3_uri=s3_uri,
            job_type="llmfulltxt",
            job_params=params,
        )
  • Core helper function that implements the logic to initialize Firecrawl client, start the llmfulltxt job, and kick off background processing/uploading to S3.
    async def _invoke_firecrawl_job(
        url: str,
        s3_uri: str,
        job_type: Firecrawl_JobType,
        job_params: Dict[str, Any],
    ) -> Dict[str, Any]:
        """Generic function to start a Firecrawl job (either HTML crawl or llmfull.txt generation).
    
        Args:
            url: URL to process
            s3_uri: S3 URI where results will be uploaded
            job_type: Type of job ('crawlhtml' or 'llmtxt')
            job_params: Parameters specific to the job type
    
        Returns:
            Dictionary with job information including the job ID
        """
        # Get configuration with API key
        config = _prepare_firecrawl_config()
    
        # Check if config contains an error
        if "error" in config:
            return {"error": config["error"]}
    
        # Validate and normalize S3 URI first -
        # doing this outside the try block to handle validation errors specifically
        try:
            validated_s3_uri = _ensure_valid_s3_uri(s3_uri)
        except ValueError as ve:
            return {"error": f"Invalid S3 URI: {str(ve)}"}
    
        try:
            # Initialize the Firecrawl client
            firecrawl = FirecrawlApp(api_key=config["api_key"])
    
            # Start the job based on job_type
            if job_type == "crawlhtml":
                job_status = firecrawl.async_crawl_url(url, params=job_params)
    
            elif job_type == "llmfulltxt":
                job_status = firecrawl.async_generate_llms_text(url, params=job_params)
            else:
                return {"error": f"Unknown job type: {job_type}"}
    
            # Handle the response
            if "id" in job_status:
                job_id = job_status["id"]
    
                # Start background task without waiting for it
                asyncio.create_task(wait_for_job_completion(job_id, validated_s3_uri, job_type))
    
                # Prepare and return the response
                response = {
                    "id": job_id,
                    "status": job_status.get("status", "started"),
                    "s3_uri": f"{validated_s3_uri}{job_id}/",
                    "message": f"Firecrawl {job_type} job started "
                    f"and will be auto-processed when complete",
                }
    
                return response
            else:
                return {"error": f"Failed to start Firecrawl {job_type} job", "details": job_status}
    
        except Exception as e:
            return {"error": f"Error starting Firecrawl {job_type} job: {str(e)}"}
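On success, `_invoke_firecrawl_job` schedules the background upload task and returns a dictionary assembled as sketched below. The job ID and bucket values here are illustrative; the key names and the per-job S3 prefix construction follow the code above.

```python
# Illustrative success payload, built the same way as in _invoke_firecrawl_job.
job_id = "abc123"                     # hypothetical Firecrawl job ID
validated_s3_uri = "s3://my-bucket/"  # already normalized to end with "/"
job_type = "llmfulltxt"

response = {
    "id": job_id,
    "status": "started",
    "s3_uri": f"{validated_s3_uri}{job_id}/",  # per-job prefix under the bucket
    "message": f"Firecrawl {job_type} job started "
    f"and will be auto-processed when complete",
}
```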
  • Registration of the invoke_firecrawl_llmtxt tool (and related Firecrawl tools) with the MCP server using mcp.tool() decorator.
    from .firecrawl import (
        cancel_crawlhtml_job,
        check_crawlhtml_status,
        check_llmtxt_status,
        invoke_firecrawl_crawlhtml,
        invoke_firecrawl_llmtxt,
    )
    
    mcp.tool()(invoke_firecrawl_crawlhtml)
    mcp.tool()(check_crawlhtml_status)
    mcp.tool()(invoke_firecrawl_llmtxt)
    mcp.tool()(check_llmtxt_status)
    mcp.tool()(cancel_crawlhtml_job)
    # mcp.tool()(cancel_llmtxt_job)  # commented out until Firecrawl ships a cancel feature for llms.txt jobs
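The `mcp.tool()(fn)` calls above use the decorator-factory pattern: `mcp.tool()` returns a decorator that registers the function and hands it back. A minimal stand-in registry (not the real FastMCP API) illustrates the mechanics:

```python
from typing import Any, Callable, Dict

class ToyMCP:
    """Minimal stand-in for an MCP server's tool registry (illustration only)."""

    def __init__(self) -> None:
        self.tools: Dict[str, Callable[..., Any]] = {}

    def tool(self) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        # Returns a decorator, so both @mcp.tool() and mcp.tool()(fn) work.
        def register(fn: Callable[..., Any]) -> Callable[..., Any]:
            self.tools[fn.__name__] = fn
            return fn
        return register

mcp = ToyMCP()

def invoke_firecrawl_llmtxt(url: str, s3_uri: str, max_urls: int = 10) -> dict:
    # Stub body standing in for the real handler.
    return {"url": url, "s3_uri": s3_uri, "max_urls": max_urls}

mcp.tool()(invoke_firecrawl_llmtxt)  # same call shape as in the snippet above
```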
  • Type definition for job types, including llmfulltxt used in the tool implementation.
    Firecrawl_JobType = Literal["crawlhtml", "llmfulltxt"]
    
    
    def _prepare_firecrawl_config() -> Dict[str, str]:
        """Prepare the Firecrawl configuration by retrieving and validating the API key.
    
        Returns:
            A dictionary containing either an API key or an error message
        """
        api_key = os.getenv("FIRECRAWL_API_KEY")
    
        if not api_key:
            return {
                "error": "Firecrawl API key is required. Set FIRECRAWL_API_KEY environment variable.",
            }
    
        return {"api_key": api_key}
    
    
    def _ensure_valid_s3_uri(s3_uri: str) -> str:
        """Ensure S3 URI is properly formatted.
    
        Args:
            s3_uri: S3 URI to validate
    
        Returns:
            Properly formatted S3 URI
    
        Raises:
            ValueError: If S3 URI doesn't start with 's3://'
        """
        if not s3_uri:
            raise ValueError("S3 URI is required")
    
        if not s3_uri.startswith("s3://"):
            raise ValueError("S3 URI must start with 's3://'")
    
        # Ensure URI ends with a slash
        if not s3_uri.endswith("/"):
            s3_uri += "/"
    
        return s3_uri
    
    
    async def invoke_firecrawl_crawlhtml(
        url: str,
        s3_uri: str,
        limit: int = 100,
    ) -> Dict[str, Any]:
        """Start an asynchronous web crawl job using Firecrawl to retrieve HTML content.
    
        Args:
            url: URL to crawl
            s3_uri: S3 URI where results will be uploaded
            limit: Maximum number of pages to crawl (default: 100)
    
        Returns:
            Dictionary with crawl job information including the job ID
        """
        # Call the generic invoke function with crawl-specific parameters
        params = {
            "limit": limit,
            "scrapeOptions": {
                "formats": ["html"],  # Only use HTML format TODO: Bring in other features of this API
            },
        }
    
        return await _invoke_firecrawl_job(
            url=url,
            s3_uri=s3_uri,
            job_type="crawlhtml",
            job_params=params,
        )
    
    


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Unstructured-IO/UNS-MCP'
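For Python clients, the same GET request can be built with the standard library. This sketch only constructs the request; pass it to `urllib.request.urlopen` to actually send it.

```python
import urllib.request

# Build (but do not send) the same GET request as the curl command above.
req = urllib.request.Request(
    "https://glama.ai/api/mcp/v1/servers/Unstructured-IO/UNS-MCP",
    method="GET",
)
print(req.full_url)
```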

If you have feedback or need assistance with the MCP directory API, please join our Discord server