Unstructured API MCP Server

invoke_firecrawl_crawlhtml

Initiate an asynchronous web crawl to extract HTML content from a specified URL. Results are stored in an S3 bucket, with control over the maximum number of pages to crawl.

Instructions

Start an asynchronous web crawl job using Firecrawl to retrieve HTML content.

Args:
    url: URL to crawl
    s3_uri: S3 URI where results will be uploaded
    limit: Maximum number of pages to crawl (default: 100)

Returns:
    Dictionary with crawl job information, including the job ID

Input Schema

Name    Required  Description                            Default
limit   No        Maximum number of pages to crawl       100
s3_uri  Yes       S3 URI where results will be uploaded  —
url     Yes       URL to crawl                           —
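
For example, here is a minimal client-side sketch of calling this tool with the official mcp Python SDK; the server launch command and the S3 bucket are assumptions, so adjust them to your deployment.

    import asyncio

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def main():
        # Assumed stdio launch command for the UNS-MCP server; adjust to your setup
        server = StdioServerParameters(command="uv", args=["run", "uns_mcp/server.py"])
        async with stdio_client(server) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                result = await session.call_tool(
                    "invoke_firecrawl_crawlhtml",
                    arguments={
                        "url": "https://example.com",
                        "s3_uri": "s3://my-bucket/crawls/",  # hypothetical bucket
                        "limit": 25,
                    },
                )
                print(result)

    asyncio.run(main())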

Implementation Reference

  • Primary handler function for the invoke_firecrawl_crawlhtml tool; it sets crawl-specific parameters and delegates to the core _invoke_firecrawl_job helper (a usage example appears after this list).

    async def invoke_firecrawl_crawlhtml(
        url: str,
        s3_uri: str,
        limit: int = 100,
    ) -> Dict[str, Any]:
        """Start an asynchronous web crawl job using Firecrawl to retrieve HTML content.

        Args:
            url: URL to crawl
            s3_uri: S3 URI where results will be uploaded
            limit: Maximum number of pages to crawl (default: 100)

        Returns:
            Dictionary with crawl job information including the job ID
        """
        # Call the generic invoke function with crawl-specific parameters
        params = {
            "limit": limit,
            "scrapeOptions": {
                "formats": ["html"],  # Only use HTML format TODO: Bring in other features of this API
            },
        }

        return await _invoke_firecrawl_job(
            url=url,
            s3_uri=s3_uri,
            job_type="crawlhtml",
            job_params=params,
        )
  • Core implementation for invoking Firecrawl jobs: it loads the API-key configuration, validates the S3 URI, initializes the client, starts the job, and returns a response while a background task handles completion (a sketch of the S3-validation helper appears after this list).

    async def _invoke_firecrawl_job(
        url: str,
        s3_uri: str,
        job_type: Firecrawl_JobType,
        job_params: Dict[str, Any],
    ) -> Dict[str, Any]:
        """Generic function to start a Firecrawl job (either HTML crawl or llmfull.txt generation).

        Args:
            url: URL to process
            s3_uri: S3 URI where results will be uploaded
            job_type: Type of job ('crawlhtml' or 'llmtxt')
            job_params: Parameters specific to the job type

        Returns:
            Dictionary with job information including the job ID
        """
        # Get configuration with API key
        config = _prepare_firecrawl_config()

        # Check if config contains an error
        if "error" in config:
            return {"error": config["error"]}

        # Validate and normalize S3 URI first -
        # doing this outside the try block to handle validation errors specifically
        try:
            validated_s3_uri = _ensure_valid_s3_uri(s3_uri)
        except ValueError as ve:
            return {"error": f"Invalid S3 URI: {str(ve)}"}

        try:
            # Initialize the Firecrawl client
            firecrawl = FirecrawlApp(api_key=config["api_key"])

            # Start the job based on job_type
            if job_type == "crawlhtml":
                job_status = firecrawl.async_crawl_url(url, params=job_params)
            elif job_type == "llmfulltxt":
                job_status = firecrawl.async_generate_llms_text(url, params=job_params)
            else:
                return {"error": f"Unknown job type: {job_type}"}

            # Handle the response
            if "id" in job_status:
                job_id = job_status["id"]

                # Start background task without waiting for it
                asyncio.create_task(wait_for_job_completion(job_id, validated_s3_uri, job_type))

                # Prepare and return the response
                response = {
                    "id": job_id,
                    "status": job_status.get("status", "started"),
                    "s3_uri": f"{validated_s3_uri}{job_id}/",
                    "message": f"Firecrawl {job_type} job started "
                    f"and will be auto-processed when complete",
                }

                return response
            else:
                return {"error": f"Failed to start Firecrawl {job_type} job", "details": job_status}

        except Exception as e:
            return {"error": f"Error starting Firecrawl {job_type} job: {str(e)}"}
  • Registration of the invoke_firecrawl_crawlhtml tool with the MCP server using the mcp.tool() decorator.
    mcp.tool()(invoke_firecrawl_crawlhtml)
  • Import of the firecrawl functions, including invoke_firecrawl_crawlhtml, and its subsequent registration in the register_external_connectors function.

    from .firecrawl import (
        cancel_crawlhtml_job,
        check_crawlhtml_status,
        check_llmtxt_status,
        invoke_firecrawl_crawlhtml,
        invoke_firecrawl_llmtxt,
    )

    mcp.tool()(invoke_firecrawl_crawlhtml)
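
For illustration, awaiting the handler directly (for example, in a test with FIRECRAWL_API_KEY configured) returns the dictionary assembled in _invoke_firecrawl_job; the URL and bucket below are placeholders.

    # Hypothetical direct call; the URL and bucket are placeholders
    result = await invoke_firecrawl_crawlhtml(
        url="https://docs.example.com",
        s3_uri="s3://my-bucket/html/",
        limit=10,
    )
    # On success, `result` has the shape:
    # {
    #     "id": "<job-id>",
    #     "status": "started",
    #     "s3_uri": "s3://my-bucket/html/<job-id>/",
    #     "message": "Firecrawl crawlhtml job started and will be auto-processed when complete",
    # }

The _ensure_valid_s3_uri helper is referenced above but not shown on this page. A plausible sketch, inferred from the trailing-slash concatenation f"{validated_s3_uri}{job_id}/", might look like the following; the actual implementation may differ.

    # Plausible reconstruction of _ensure_valid_s3_uri; the real helper may differ
    def _ensure_valid_s3_uri(s3_uri: str) -> str:
        """Validate an S3 URI and normalize it to end with a trailing slash."""
        if not s3_uri.startswith("s3://"):
            raise ValueError("S3 URI must start with 's3://'")
        if not s3_uri[len("s3://"):].strip("/"):
            raise ValueError("S3 URI must include a bucket name")
        return s3_uri if s3_uri.endswith("/") else s3_uri + "/"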

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Unstructured-IO/UNS-MCP'
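
The same request in Python, using the requests library:

    import requests

    # Fetch directory metadata for the Unstructured-IO/UNS-MCP server
    resp = requests.get("https://glama.ai/api/mcp/v1/servers/Unstructured-IO/UNS-MCP")
    resp.raise_for_status()
    print(resp.json())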

If you have feedback or need assistance with the MCP directory API, please join our Discord server.