extract_from_url
Extract structured data from web content using AI. Specify a URL and extraction instructions to retrieve organized information from articles, documents, or text-based web pages.
Instructions
Extract structured information from text content at a URL.
Downloads text from the specified URL and extracts structured information using Large Language Models. Ideal for processing web articles, documents, or any text content accessible via HTTP/HTTPS.
Args:
- url: URL to download text from (must start with http:// or https://)
- prompt_description: Clear instructions for what to extract
- examples: List of example extractions to guide the model
- model_id: LLM model to use (default: "gemini-2.5-flash")
- max_char_buffer: Max characters per chunk (default: 1000)
- temperature: Sampling temperature 0.0-1.0 (default: 0.5)
- extraction_passes: Number of extraction passes for better recall (default: 1)
- max_workers: Max parallel workers (default: 10)
Returns: Dictionary containing extracted entities with source locations and metadata
Raises: ToolError: If URL is invalid, download fails, or extraction fails
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to download text from (must start with http:// or https://) | |
| prompt_description | Yes | Clear instructions for what to extract | |
| examples | Yes | List of example extractions to guide the model | |
| model_id | No | LLM model to use | gemini-2.5-flash |
| max_char_buffer | No | Max characters per chunk | 1000 |
| temperature | No | Sampling temperature 0.0-1.0 | 0.5 |
| extraction_passes | No | Number of extraction passes for better recall | 1 |
| max_workers | No | Max parallel workers | 10 |
Implementation Reference
- src/langextract_mcp/server.py:319-402 (handler): The primary handler function for the 'extract_from_url' MCP tool. Decorated with @mcp.tool for automatic registration. Validates URL and parameters, creates ExtractionConfig, retrieves the API key, calls _langextract_client.extract(url), and returns formatted results using _format_extraction_result.
@mcp.tool def extract_from_url( url: str, prompt_description: str, examples: list[dict[str, Any]], model_id: str = "gemini-2.5-flash", max_char_buffer: int = 1000, temperature: float = 0.5, extraction_passes: int = 1, max_workers: int = 10 ) -> dict[str, Any]: """ Extract structured information from text content at a URL. Downloads text from the specified URL and extracts structured information using Large Language Models. Ideal for processing web articles, documents, or any text content accessible via HTTP/HTTPS. Args: url: URL to download text from (must start with http:// or https://) prompt_description: Clear instructions for what to extract examples: List of example extractions to guide the model model_id: LLM model to use (default: "gemini-2.5-flash") max_char_buffer: Max characters per chunk (default: 1000) temperature: Sampling temperature 0.0-1.0 (default: 0.5) extraction_passes: Number of extraction passes for better recall (default: 1) max_workers: Max parallel workers (default: 10) Returns: Dictionary containing extracted entities with source locations and metadata Raises: ToolError: If URL is invalid, download fails, or extraction fails """ try: if not url.startswith(('http://', 'https://')): raise ToolError("URL must start with http:// or https://") if not examples: raise ToolError("At least one example is required for reliable extraction") if not prompt_description.strip(): raise ToolError("Prompt description cannot be empty") # Validate that only Gemini models are supported if not model_id.startswith('gemini'): raise ToolError( f"Only Google Gemini models are supported. Got: {model_id}. " f"Use 'list_supported_models' tool to see available options." 
) # Create config object from individual parameters config = ExtractionConfig( model_id=model_id, max_char_buffer=max_char_buffer, temperature=temperature, extraction_passes=extraction_passes, max_workers=max_workers ) # Get API key (server-side only for security) api_key = _get_api_key() if not api_key: raise ToolError( "API key required. Server administrator must set LANGEXTRACT_API_KEY environment variable." ) # Perform optimized extraction using cached client result = _langextract_client.extract( text_or_url=url, prompt_description=prompt_description, examples=examples, config=config, api_key=api_key ) return _format_extraction_result(result, config, source_url=url) except ValueError as e: raise ToolError(f"Invalid parameters: {str(e)}") except Exception as e: raise ToolError(f"URL extraction failed: {str(e)}") - src/langextract_mcp/server.py:21-28 (schema)Pydantic ExtractionConfig model defining input parameters for the extraction process, used by both extract_from_text and extract_from_url tools.
class ExtractionConfig(BaseModel):
    """Configuration for extraction parameters."""
    # Defaults here mirror the keyword defaults exposed by the tool handlers.
    model_id: str = Field(default="gemini-2.5-flash", description="LLM model to use")
    max_char_buffer: int = Field(default=1000, description="Max characters per chunk")
    temperature: float = Field(default=0.5, description="Sampling temperature (0.0-1.0)")
    extraction_passes: int = Field(default=1, description="Number of extraction passes for better recall")
    max_workers: int = Field(default=10, description="Max parallel workers")
- src/langextract_mcp/server.py:38-190 (helper): LangExtractClient class implementing the core extraction logic delegated to by the tool handlers. Provides caching for language models, schemas, resolvers, and performs the actual annotation using langextract library.
class LangExtractClient:
    """Optimized langextract client for MCP server usage.

    This client maintains persistent connections and caches expensive operations
    like schema generation and prompt templates for better performance in a
    long-running MCP server context.
    """

    def __init__(self):
        # Process-lifetime caches; keys encode the config/examples they were
        # built from (see the individual _get_* helpers).
        self._language_models: dict[str, Any] = {}
        self._schema_cache: dict[str, Any] = {}
        self._prompt_template_cache: dict[str, Any] = {}
        self._resolver_cache: dict[str, Any] = {}

    def _get_examples_hash(self, examples: list[dict[str, Any]]) -> str:
        """Generate a hash for caching based on examples."""
        # MD5 is used purely as a cache fingerprint, not for security.
        # sort_keys makes the hash stable across dict ordering.
        examples_str = json.dumps(examples, sort_keys=True)
        return hashlib.md5(examples_str.encode()).hexdigest()

    def _get_language_model(self, config: ExtractionConfig, api_key: str, schema: Any | None = None, schema_hash: str | None = None) -> Any:
        """Get or create a cached language model instance."""
        # Include schema hash in cache key to prevent schema mutation conflicts
        model_key = f"{config.model_id}_{config.temperature}_{config.max_workers}_{schema_hash or 'no_schema'}"

        if model_key not in self._language_models:
            # Validate that only Gemini models are supported
            if not config.model_id.startswith('gemini'):
                raise ValueError(f"Only Gemini models are supported. Got: {config.model_id}")

            language_model = lx.inference.GeminiLanguageModel(
                model_id=config.model_id,
                api_key=api_key,
                temperature=config.temperature,
                max_workers=config.max_workers,
                gemini_schema=schema
            )
            self._language_models[model_key] = language_model

        return self._language_models[model_key]

    def _get_schema(self, examples: list[dict[str, Any]], model_id: str) -> tuple[Any, str]:
        """Get or create a cached schema for the examples.

        Returns:
            Tuple of (schema, examples_hash) for use in caching language models
        """
        # Schema generation only applies to Gemini models; other model ids get
        # no schema and an empty hash.
        if not model_id.startswith('gemini'):
            return None, ""

        examples_hash = self._get_examples_hash(examples)
        schema_key = f"{model_id}_{examples_hash}"

        if schema_key not in self._schema_cache:
            # Convert examples to langextract format
            langextract_examples = self._create_langextract_examples(examples)

            # Create prompt template to generate schema
            prompt_template = lx.prompting.PromptTemplateStructured(description="Schema generation")
            prompt_template.examples.extend(langextract_examples)

            # Generate schema
            schema = lx.schema.GeminiSchema.from_examples(prompt_template.examples)
            self._schema_cache[schema_key] = schema

        return self._schema_cache[schema_key], examples_hash

    def _get_resolver(self, format_type: str = "JSON") -> Any:
        """Get or create a cached resolver."""
        # Any value other than "JSON" is treated as YAML.
        if format_type not in self._resolver_cache:
            resolver = lx.resolver.Resolver(
                fence_output=False,
                format_type=lx.data.FormatType.JSON if format_type == "JSON" else lx.data.FormatType.YAML,
                extraction_attributes_suffix="_attributes",
                extraction_index_suffix=None,
            )
            self._resolver_cache[format_type] = resolver

        return self._resolver_cache[format_type]

    def _create_langextract_examples(self, examples: list[dict[str, Any]]) -> list[lx.data.ExampleData]:
        """Convert dictionary examples to langextract ExampleData objects.

        Each input dict is expected to have "text" and "extractions" keys;
        each extraction dict needs "extraction_class" and "extraction_text",
        with optional "attributes".
        """
        langextract_examples = []
        for example in examples:
            extractions = []
            for extraction_data in example["extractions"]:
                extractions.append(
                    lx.data.Extraction(
                        extraction_class=extraction_data["extraction_class"],
                        extraction_text=extraction_data["extraction_text"],
                        attributes=extraction_data.get("attributes", {})
                    )
                )
            langextract_examples.append(
                lx.data.ExampleData(
                    text=example["text"],
                    extractions=extractions
                )
            )
        return langextract_examples

    def extract(
        self,
        text_or_url: str,
        prompt_description: str,
        examples: list[dict[str, Any]],
        config: ExtractionConfig,
        api_key: str
    ) -> lx.data.AnnotatedDocument:
        """Optimized extraction using cached components."""
        # Get or generate schema first
        schema, examples_hash = self._get_schema(examples, config.model_id)

        # Get cached components with schema-aware caching
        language_model = self._get_language_model(config, api_key, schema, examples_hash)
        resolver = self._get_resolver("JSON")

        # Convert examples
        langextract_examples = self._create_langextract_examples(examples)

        # Create prompt template
        prompt_template = lx.prompting.PromptTemplateStructured(
            description=prompt_description
        )
        prompt_template.examples.extend(langextract_examples)

        # Create annotator
        annotator = lx.annotation.Annotator(
            language_model=language_model,
            prompt_template=prompt_template,
            format_type=lx.data.FormatType.JSON,
            fence_output=False,
        )

        # Perform extraction
        if text_or_url.startswith(('http://', 'https://')):
            # Download text first
            text = lx.io.download_text_from_url(text_or_url)
        else:
            text = text_or_url

        return annotator.annotate_text(
            text=text,
            resolver=resolver,
            max_char_buffer=config.max_char_buffer,
            batch_length=10,
            additional_context=None,
            debug=False,  # Disable debug for cleaner MCP output
            extraction_passes=config.extraction_passes,
        )
- _format_extraction_result helper function that converts the langextract AnnotatedDocument result into the dictionary format returned by the tool.
def _format_extraction_result(result: lx.data.AnnotatedDocument, config: ExtractionConfig, source_url: str | None = None) -> dict[str, Any]: """Format langextract result for MCP response.""" extractions = [] for extraction in result.extractions or []: extractions.append({ "extraction_class": extraction.extraction_class, "extraction_text": extraction.extraction_text, "attributes": extraction.attributes, "start_char": getattr(extraction, 'start_char', None), "end_char": getattr(extraction, 'end_char', None), }) response = { "document_id": result.document_id if result.document_id else "anonymous", "total_extractions": len(extractions), "extractions": extractions, "metadata": { "model_id": config.model_id, "extraction_passes": config.extraction_passes, "max_char_buffer": config.max_char_buffer, "temperature": config.temperature, } } if source_url: response["source_url"] = source_url return response - _get_api_key helper function that retrieves the required Google Gemini API key from environment variable.
def _get_api_key() -> str | None: """Get API key from environment (server-side only for security).""" return os.environ.get("LANGEXTRACT_API_KEY")