# extract_from_url
Extracts structured data from web content by downloading text from a URL and processing it with Large Language Models. Use it to analyze articles, documents, or other HTTP/HTTPS-accessible text with customizable prompts and parameters.
## Instructions
Extract structured information from text content at a URL.
Downloads text from the specified URL and extracts structured information using Large Language Models. Ideal for processing web articles, documents, or any text content accessible via HTTP/HTTPS.
Args:
- `url`: URL to download text from (must start with `http://` or `https://`)
- `prompt_description`: Clear instructions for what to extract
- `examples`: List of example extractions to guide the model
- `model_id`: LLM model to use (default: `"gemini-2.5-flash"`)
- `max_char_buffer`: Max characters per chunk (default: 1000)
- `temperature`: Sampling temperature, 0.0-1.0 (default: 0.5)
- `extraction_passes`: Number of extraction passes for better recall (default: 1)
- `max_workers`: Max parallel workers (default: 10)

Returns: Dictionary containing extracted entities with source locations and metadata.

Raises: `ToolError` if the URL is invalid, the download fails, or extraction fails.
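To make the parameter shapes concrete, here is a hypothetical tool call. The argument names match the signature above, but the internal structure of each `examples` entry (the `text`, `extractions`, `extraction_class`, `extraction_text`, and `attributes` keys) is an assumption based on langextract's example format, not something this page specifies.

```python
# Hypothetical arguments for an extract_from_url call. Argument names come from
# the tool signature; the layout of each item in `examples` is assumed, and the
# URL is purely illustrative.
arguments = {
    "url": "https://example.com/press-release.html",
    "prompt_description": "Extract company names and the funding amounts they raised.",
    "examples": [
        {
            "text": "Acme Corp raised $12M in Series A funding.",
            "extractions": [
                {
                    "extraction_class": "funding_event",
                    "extraction_text": "Acme Corp raised $12M",
                    "attributes": {"company": "Acme Corp", "amount": "$12M"},
                }
            ],
        }
    ],
    "model_id": "gemini-2.5-flash",  # default
    "temperature": 0.5,              # default
    "extraction_passes": 1,          # default
}
```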
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| examples | Yes | List of example extractions to guide the model | |
| extraction_passes | No | Number of extraction passes for better recall | 1 |
| max_char_buffer | No | Max characters per chunk | 1000 |
| max_workers | No | Max parallel workers | 10 |
| model_id | No | LLM model to use | gemini-2.5-flash |
| prompt_description | Yes | Clear instructions for what to extract | |
| temperature | No | Sampling temperature (0.0-1.0) | 0.5 |
| url | Yes | URL to download text from (must start with http:// or https://) | |
## Input Schema (JSON Schema)
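The generated schema itself is not reproduced on this page. A rough reconstruction from the function signature and defaults above would look like the following; the property descriptions, the `examples` item schema, and field ordering are assumptions.

```json
{
  "type": "object",
  "properties": {
    "url": { "type": "string" },
    "prompt_description": { "type": "string" },
    "examples": { "type": "array", "items": { "type": "object" } },
    "model_id": { "type": "string", "default": "gemini-2.5-flash" },
    "max_char_buffer": { "type": "integer", "default": 1000 },
    "temperature": { "type": "number", "default": 0.5 },
    "extraction_passes": { "type": "integer", "default": 1 },
    "max_workers": { "type": "integer", "default": 10 }
  },
  "required": ["url", "prompt_description", "examples"]
}
```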
## Implementation Reference
- src/langextract_mcp/server.py:319-401 (handler): The main handler function for the `extract_from_url` tool, decorated with `@mcp.tool` for registration. It validates inputs, builds the extraction configuration, calls the shared URL extraction logic, and formats the output.

```python
@mcp.tool
def extract_from_url(
    url: str,
    prompt_description: str,
    examples: list[dict[str, Any]],
    model_id: str = "gemini-2.5-flash",
    max_char_buffer: int = 1000,
    temperature: float = 0.5,
    extraction_passes: int = 1,
    max_workers: int = 10
) -> dict[str, Any]:
    """
    Extract structured information from text content at a URL.

    Downloads text from the specified URL and extracts structured information
    using Large Language Models. Ideal for processing web articles, documents,
    or any text content accessible via HTTP/HTTPS.

    Args:
        url: URL to download text from (must start with http:// or https://)
        prompt_description: Clear instructions for what to extract
        examples: List of example extractions to guide the model
        model_id: LLM model to use (default: "gemini-2.5-flash")
        max_char_buffer: Max characters per chunk (default: 1000)
        temperature: Sampling temperature 0.0-1.0 (default: 0.5)
        extraction_passes: Number of extraction passes for better recall (default: 1)
        max_workers: Max parallel workers (default: 10)

    Returns:
        Dictionary containing extracted entities with source locations and metadata

    Raises:
        ToolError: If URL is invalid, download fails, or extraction fails
    """
    try:
        if not url.startswith(('http://', 'https://')):
            raise ToolError("URL must start with http:// or https://")

        if not examples:
            raise ToolError("At least one example is required for reliable extraction")

        if not prompt_description.strip():
            raise ToolError("Prompt description cannot be empty")

        # Validate that only Gemini models are supported
        if not model_id.startswith('gemini'):
            raise ToolError(
                f"Only Google Gemini models are supported. Got: {model_id}. "
                f"Use 'list_supported_models' tool to see available options."
            )

        # Create config object from individual parameters
        config = ExtractionConfig(
            model_id=model_id,
            max_char_buffer=max_char_buffer,
            temperature=temperature,
            extraction_passes=extraction_passes,
            max_workers=max_workers
        )

        # Get API key (server-side only for security)
        api_key = _get_api_key()
        if not api_key:
            raise ToolError(
                "API key required. Server administrator must set LANGEXTRACT_API_KEY environment variable."
            )

        # Perform optimized extraction using cached client
        result = _langextract_client.extract(
            text_or_url=url,
            prompt_description=prompt_description,
            examples=examples,
            config=config,
            api_key=api_key
        )

        return _format_extraction_result(result, config, source_url=url)

    except ValueError as e:
        raise ToolError(f"Invalid parameters: {str(e)}")
    except Exception as e:
        raise ToolError(f"URL extraction failed: {str(e)}")
```
- src/langextract_mcp/server.py:21-27 (schema): Pydantic BaseModel defining the input configuration schema used by the extraction tools, including model selection and processing parameters.

```python
class ExtractionConfig(BaseModel):
    """Configuration for extraction parameters."""
    model_id: str = Field(default="gemini-2.5-flash", description="LLM model to use")
    max_char_buffer: int = Field(default=1000, description="Max characters per chunk")
    temperature: float = Field(default=0.5, description="Sampling temperature (0.0-1.0)")
    extraction_passes: int = Field(default=1, description="Number of extraction passes for better recall")
    max_workers: int = Field(default=10, description="Max parallel workers")
```
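For illustration, constructing the config the way the handler does, with all default values spelled out, looks like this. A value of the wrong type raises a pydantic `ValidationError`, which subclasses `ValueError` and is therefore surfaced by the handler's `except ValueError` branch as an "Invalid parameters" `ToolError`.

```python
# Config construction mirroring the handler, using the declared defaults.
config = ExtractionConfig(
    model_id="gemini-2.5-flash",
    max_char_buffer=1000,
    temperature=0.5,
    extraction_passes=1,
    max_workers=10,
)
```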
- Shared helper method in LangExtractClient that performs the actual extraction, handling URL download via `langextract.io.download_text_from_url` and annotation, with caching for efficiency.

```python
def extract(
    self,
    text_or_url: str,
    prompt_description: str,
    examples: list[dict[str, Any]],
    config: ExtractionConfig,
    api_key: str
) -> lx.data.AnnotatedDocument:
    """Optimized extraction using cached components."""
    # Get or generate schema first
    schema, examples_hash = self._get_schema(examples, config.model_id)

    # Get cached components with schema-aware caching
    language_model = self._get_language_model(config, api_key, schema, examples_hash)
    resolver = self._get_resolver("JSON")

    # Convert examples
    langextract_examples = self._create_langextract_examples(examples)

    # Create prompt template
    prompt_template = lx.prompting.PromptTemplateStructured(
        description=prompt_description
    )
    prompt_template.examples.extend(langextract_examples)

    # Create annotator
    annotator = lx.annotation.Annotator(
        language_model=language_model,
        prompt_template=prompt_template,
        format_type=lx.data.FormatType.JSON,
        fence_output=False,
    )

    # Perform extraction
    if text_or_url.startswith(('http://', 'https://')):
        # Download text first
        text = lx.io.download_text_from_url(text_or_url)
    else:
        text = text_or_url

    return annotator.annotate_text(
        text=text,
        resolver=resolver,
        max_char_buffer=config.max_char_buffer,
        batch_length=10,
        additional_context=None,
        debug=False,  # Disable debug for cleaner MCP output
        extraction_passes=config.extraction_passes,
    )
```
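`_create_langextract_examples` is referenced above but not shown on this page. A minimal sketch of the conversion it presumably performs, assuming each example dict mirrors langextract's `ExampleData` and `Extraction` fields, would be:

```python
# Hypothetical sketch: the real helper is not shown here, and the expected
# dict keys ("text", "extractions", etc.) are assumptions based on langextract's
# public data model.
def _create_langextract_examples(self, examples: list[dict[str, Any]]) -> list[lx.data.ExampleData]:
    converted = []
    for example in examples:
        extractions = [
            lx.data.Extraction(
                extraction_class=item["extraction_class"],
                extraction_text=item["extraction_text"],
                attributes=item.get("attributes"),
            )
            for item in example.get("extractions", [])
        ]
        converted.append(lx.data.ExampleData(text=example["text"], extractions=extractions))
    return converted
```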
- Helper function that formats the raw langextract result into the standardized MCP tool response dictionary, including the source URL if provided.

```python
def _format_extraction_result(
    result: lx.data.AnnotatedDocument,
    config: ExtractionConfig,
    source_url: str | None = None
) -> dict[str, Any]:
    """Format langextract result for MCP response."""
    extractions = []
    for extraction in result.extractions or []:
        extractions.append({
            "extraction_class": extraction.extraction_class,
            "extraction_text": extraction.extraction_text,
            "attributes": extraction.attributes,
            "start_char": getattr(extraction, 'start_char', None),
            "end_char": getattr(extraction, 'end_char', None),
        })

    response = {
        "document_id": result.document_id if result.document_id else "anonymous",
        "total_extractions": len(extractions),
        "extractions": extractions,
        "metadata": {
            "model_id": config.model_id,
            "extraction_passes": config.extraction_passes,
            "max_char_buffer": config.max_char_buffer,
            "temperature": config.temperature,
        }
    }

    if source_url:
        response["source_url"] = source_url

    return response
```
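Putting the formatter's output in concrete terms, a successful response would resemble the following; the values are invented, but every key follows from the code above.

```python
# Illustrative response shape from _format_extraction_result; values are made up.
example_response = {
    "document_id": "anonymous",
    "total_extractions": 1,
    "extractions": [
        {
            "extraction_class": "funding_event",
            "extraction_text": "Acme Corp raised $12M",
            "attributes": {"company": "Acme Corp", "amount": "$12M"},
            "start_char": 0,
            "end_char": 21,
        }
    ],
    "metadata": {
        "model_id": "gemini-2.5-flash",
        "extraction_passes": 1,
        "max_char_buffer": 1000,
        "temperature": 0.5,
    },
    "source_url": "https://example.com/press-release.html",
}
```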
- src/langextract_mcp/server.py:319-319 (registration): The `@mcp.tool` decorator on the `extract_from_url` function, which registers it as an MCP tool with FastMCP.

```python
@mcp.tool
```