
LangExtract MCP Server

by larsenweigle

extract_from_url

Extracts structured data from web content by downloading text from a URL and processing it with Large Language Models. Use it to analyze articles, documents, or other HTTP/HTTPS-accessible text with customizable prompts and parameters.

Instructions

Extract structured information from text content at a URL.

Downloads text from the specified URL and extracts structured information using Large Language Models. Ideal for processing web articles, documents, or any text content accessible via HTTP/HTTPS.

Args:
    url: URL to download text from (must start with http:// or https://)
    prompt_description: Clear instructions for what to extract
    examples: List of example extractions to guide the model
    model_id: LLM model to use (default: "gemini-2.5-flash")
    max_char_buffer: Max characters per chunk (default: 1000)
    temperature: Sampling temperature 0.0-1.0 (default: 0.5)
    extraction_passes: Number of extraction passes for better recall (default: 1)
    max_workers: Max parallel workers (default: 10)

Returns:
    Dictionary containing extracted entities with source locations and metadata

Raises:
    ToolError: If URL is invalid, download fails, or extraction fails
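
For example, a client might invoke the tool like this. The call is illustrative: the schema accepts any JSON object inside examples, so the text-plus-extractions layout below is an assumption modeled on langextract's example format.

    extract_from_url(
        url="https://example.com/press-release.html",
        prompt_description="Extract every person mentioned, with their role.",
        examples=[
            {
                # Assumed entry shape, mirroring langextract's ExampleData
                "text": "CEO Jane Doe announced the merger.",
                "extractions": [
                    {
                        "extraction_class": "person",
                        "extraction_text": "Jane Doe",
                        "attributes": {"role": "CEO"},
                    }
                ],
            }
        ],
        extraction_passes=2,  # a second pass can improve recall on long pages
    )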

Input Schema

Name               | Required | Description                                     | Default
------------------ | -------- | ----------------------------------------------- | ----------------
url                | Yes      | URL to download text from (http:// or https://) |
prompt_description | Yes      | Clear instructions for what to extract          |
examples           | Yes      | List of example extractions to guide the model  |
model_id           | No       | LLM model to use                                | gemini-2.5-flash
max_char_buffer    | No       | Max characters per chunk                        | 1000
temperature        | No       | Sampling temperature (0.0-1.0)                  | 0.5
extraction_passes  | No       | Number of extraction passes for better recall   | 1
max_workers        | No       | Max parallel workers                            | 10
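
A minimal request body needs only the three required fields. The placeholder values below reuse the assumed example shape from above:

    {
      "url": "https://example.com/article.html",
      "prompt_description": "Extract product names and prices",
      "examples": [
        {
          "text": "The Widget Pro costs $49.",
          "extractions": [
            {
              "extraction_class": "product",
              "extraction_text": "Widget Pro",
              "attributes": {"price": "$49"}
            }
          ]
        }
      ]
    }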

Input Schema (JSON Schema)

{ "properties": { "examples": { "items": { "additionalProperties": true, "type": "object" }, "title": "Examples", "type": "array" }, "extraction_passes": { "default": 1, "title": "Extraction Passes", "type": "integer" }, "max_char_buffer": { "default": 1000, "title": "Max Char Buffer", "type": "integer" }, "max_workers": { "default": 10, "title": "Max Workers", "type": "integer" }, "model_id": { "default": "gemini-2.5-flash", "title": "Model Id", "type": "string" }, "prompt_description": { "title": "Prompt Description", "type": "string" }, "temperature": { "default": 0.5, "title": "Temperature", "type": "number" }, "url": { "title": "Url", "type": "string" } }, "required": [ "url", "prompt_description", "examples" ], "type": "object" }

Implementation Reference

  • The main handler function for the 'extract_from_url' tool, decorated with @mcp.tool for registration. It validates inputs, builds the configuration object, invokes the shared extraction logic for URLs, and formats the output.
    @mcp.tool
    def extract_from_url(
        url: str,
        prompt_description: str,
        examples: list[dict[str, Any]],
        model_id: str = "gemini-2.5-flash",
        max_char_buffer: int = 1000,
        temperature: float = 0.5,
        extraction_passes: int = 1,
        max_workers: int = 10
    ) -> dict[str, Any]:
        """
        Extract structured information from text content at a URL.

        Downloads text from the specified URL and extracts structured
        information using Large Language Models. Ideal for processing web
        articles, documents, or any text content accessible via HTTP/HTTPS.

        Args:
            url: URL to download text from (must start with http:// or https://)
            prompt_description: Clear instructions for what to extract
            examples: List of example extractions to guide the model
            model_id: LLM model to use (default: "gemini-2.5-flash")
            max_char_buffer: Max characters per chunk (default: 1000)
            temperature: Sampling temperature 0.0-1.0 (default: 0.5)
            extraction_passes: Number of extraction passes for better recall (default: 1)
            max_workers: Max parallel workers (default: 10)

        Returns:
            Dictionary containing extracted entities with source locations and metadata

        Raises:
            ToolError: If URL is invalid, download fails, or extraction fails
        """
        try:
            if not url.startswith(('http://', 'https://')):
                raise ToolError("URL must start with http:// or https://")
            if not examples:
                raise ToolError("At least one example is required for reliable extraction")
            if not prompt_description.strip():
                raise ToolError("Prompt description cannot be empty")

            # Validate that only Gemini models are supported
            if not model_id.startswith('gemini'):
                raise ToolError(
                    f"Only Google Gemini models are supported. Got: {model_id}. "
                    f"Use 'list_supported_models' tool to see available options."
                )

            # Create config object from individual parameters
            config = ExtractionConfig(
                model_id=model_id,
                max_char_buffer=max_char_buffer,
                temperature=temperature,
                extraction_passes=extraction_passes,
                max_workers=max_workers
            )

            # Get API key (server-side only for security)
            api_key = _get_api_key()
            if not api_key:
                raise ToolError(
                    "API key required. Server administrator must set "
                    "LANGEXTRACT_API_KEY environment variable."
                )

            # Perform optimized extraction using cached client
            result = _langextract_client.extract(
                text_or_url=url,
                prompt_description=prompt_description,
                examples=examples,
                config=config,
                api_key=api_key
            )

            return _format_extraction_result(result, config, source_url=url)

        except ToolError:
            # Re-raise tool errors unchanged so the validation messages above
            # reach the client instead of being re-wrapped below
            raise
        except ValueError as e:
            raise ToolError(f"Invalid parameters: {str(e)}")
        except Exception as e:
            raise ToolError(f"URL extraction failed: {str(e)}")
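    Note: _get_api_key is referenced above but not shown on this page. Given the error message, a minimal sketch that simply reads the named environment variable would look like this (hypothetical; the real helper may do more):

        import os

        def _get_api_key() -> str | None:
            """Sketch: read the Gemini API key from the server environment.

            Only the LANGEXTRACT_API_KEY variable name is confirmed by the
            error message in extract_from_url; everything else is assumed.
            """
            return os.environ.get("LANGEXTRACT_API_KEY")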
  • Pydantic BaseModel defining the input configuration schema used by the extraction tools, including model selection and processing parameters.
    class ExtractionConfig(BaseModel):
        """Configuration for extraction parameters."""

        model_id: str = Field(default="gemini-2.5-flash", description="LLM model to use")
        max_char_buffer: int = Field(default=1000, description="Max characters per chunk")
        temperature: float = Field(default=0.5, description="Sampling temperature (0.0-1.0)")
        extraction_passes: int = Field(default=1, description="Number of extraction passes for better recall")
        max_workers: int = Field(default=10, description="Max parallel workers")
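    Note: the fields declare defaults and descriptions but no numeric bounds, so Pydantic will not itself reject an out-of-range temperature; the 0.0-1.0 range is documentation only. A quick illustration:

        # Unspecified fields fall back to their declared defaults.
        config = ExtractionConfig(temperature=0.2)
        assert config.model_id == "gemini-2.5-flash"
        assert config.extraction_passes == 1

        # No Field(ge=..., le=...) constraints are declared, so this does NOT raise.
        ExtractionConfig(temperature=7.0)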
  • Shared helper method on LangExtractClient that performs the actual extraction: it downloads URL content via langextract.io.download_text_from_url, then annotates the text, caching components for efficiency.
    def extract(
        self,
        text_or_url: str,
        prompt_description: str,
        examples: list[dict[str, Any]],
        config: ExtractionConfig,
        api_key: str
    ) -> lx.data.AnnotatedDocument:
        """Optimized extraction using cached components."""
        # Get or generate schema first
        schema, examples_hash = self._get_schema(examples, config.model_id)

        # Get cached components with schema-aware caching
        language_model = self._get_language_model(config, api_key, schema, examples_hash)
        resolver = self._get_resolver("JSON")

        # Convert examples
        langextract_examples = self._create_langextract_examples(examples)

        # Create prompt template
        prompt_template = lx.prompting.PromptTemplateStructured(
            description=prompt_description
        )
        prompt_template.examples.extend(langextract_examples)

        # Create annotator
        annotator = lx.annotation.Annotator(
            language_model=language_model,
            prompt_template=prompt_template,
            format_type=lx.data.FormatType.JSON,
            fence_output=False,
        )

        # Perform extraction; URLs are downloaded first, raw text is used as-is
        if text_or_url.startswith(('http://', 'https://')):
            text = lx.io.download_text_from_url(text_or_url)
        else:
            text = text_or_url

        return annotator.annotate_text(
            text=text,
            resolver=resolver,
            max_char_buffer=config.max_char_buffer,
            batch_length=10,
            additional_context=None,
            debug=False,  # Disable debug for cleaner MCP output
            extraction_passes=config.extraction_passes,
        )
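    Note: with the caching stripped away, this method does roughly what langextract's high-level entry point does in one call. A standalone sketch using the library's public API (lx.extract accepts a URL directly, per the langextract README; values are illustrative):

        import langextract as lx

        result = lx.extract(
            text_or_documents="https://example.com/article.html",
            prompt_description="Extract product names and prices",
            examples=[
                lx.data.ExampleData(
                    text="The Widget Pro costs $49.",
                    extractions=[
                        lx.data.Extraction(
                            extraction_class="product",
                            extraction_text="Widget Pro",
                            attributes={"price": "$49"},
                        )
                    ],
                )
            ],
            model_id="gemini-2.5-flash",
        )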
  • Helper function that formats the raw langextract results into the standardized MCP tool response dictionary, including source URL if provided.
    def _format_extraction_result(
        result: lx.data.AnnotatedDocument,
        config: ExtractionConfig,
        source_url: str | None = None
    ) -> dict[str, Any]:
        """Format langextract result for MCP response."""
        extractions = []
        for extraction in result.extractions or []:
            extractions.append({
                "extraction_class": extraction.extraction_class,
                "extraction_text": extraction.extraction_text,
                "attributes": extraction.attributes,
                "start_char": getattr(extraction, 'start_char', None),
                "end_char": getattr(extraction, 'end_char', None),
            })

        response = {
            "document_id": result.document_id if result.document_id else "anonymous",
            "total_extractions": len(extractions),
            "extractions": extractions,
            "metadata": {
                "model_id": config.model_id,
                "extraction_passes": config.extraction_passes,
                "max_char_buffer": config.max_char_buffer,
                "temperature": config.temperature,
            }
        }

        if source_url:
            response["source_url"] = source_url

        return response
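    Note: putting the pieces together, a successful tool response has this shape (values illustrative):

        {
          "document_id": "anonymous",
          "total_extractions": 1,
          "extractions": [
            {
              "extraction_class": "product",
              "extraction_text": "Widget Pro",
              "attributes": {"price": "$49"},
              "start_char": 4,
              "end_char": 14
            }
          ],
          "metadata": {
            "model_id": "gemini-2.5-flash",
            "extraction_passes": 1,
            "max_char_buffer": 1000,
            "temperature": 0.5
          },
          "source_url": "https://example.com/article.html"
        }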
  • The @mcp.tool decorator on the extract_from_url function, which registers it as an MCP tool with FastMCP.
    @mcp.tool

