LangExtract MCP Server

by larsenweigle

extract_from_url

Extract structured data from web content using AI. Specify a URL and extraction instructions to retrieve organized information from articles, documents, or text-based web pages.

Instructions

Extract structured information from text content at a URL.

Downloads text from the specified URL and extracts structured information using Large Language Models. Ideal for processing web articles, documents, or any text content accessible via HTTP/HTTPS.

Args:
    url: URL to download text from (must start with http:// or https://)
    prompt_description: Clear instructions for what to extract
    examples: List of example extractions to guide the model
    model_id: LLM model to use (default: "gemini-2.5-flash")
    max_char_buffer: Max characters per chunk (default: 1000)
    temperature: Sampling temperature 0.0-1.0 (default: 0.5)
    extraction_passes: Number of extraction passes for better recall (default: 1)
    max_workers: Max parallel workers (default: 10)

Returns:
    Dictionary containing extracted entities with source locations and metadata

Raises:
    ToolError: If the URL is invalid, the download fails, or extraction fails
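
As a sketch of a well-formed call, the arguments below follow the signature above; the URL, prompt, and example values are purely illustrative, and the shape of each entry in examples matches what the server's _create_langextract_examples helper expects (see Implementation Reference):

    # Illustrative arguments for extract_from_url; all values are made up.
    arguments = {
        "url": "https://example.com/article.html",
        "prompt_description": "Extract every person mentioned, with their role.",
        "examples": [
            {
                "text": "Dr. Ada Lovelace presented the keynote.",
                "extractions": [
                    {
                        "extraction_class": "person",
                        "extraction_text": "Dr. Ada Lovelace",
                        "attributes": {"role": "keynote speaker"},
                    }
                ],
            }
        ],
        # Optional tuning knobs shown at their defaults.
        "model_id": "gemini-2.5-flash",
        "max_char_buffer": 1000,
        "temperature": 0.5,
        "extraction_passes": 1,
        "max_workers": 10,
    }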

Input Schema

Name                Required  Description  Default
url                 Yes
prompt_description  Yes
examples            Yes
model_id            No                     gemini-2.5-flash
max_char_buffer     No
temperature         No
extraction_passes   No
max_workers         No

Output Schema

No fields defined.

Implementation Reference

  • The primary handler for the 'extract_from_url' MCP tool, registered automatically via the @mcp.tool decorator. It validates the URL and parameters, builds an ExtractionConfig, retrieves the API key, delegates to _langextract_client.extract(), and returns the formatted result from _format_extraction_result.
    @mcp.tool
    def extract_from_url(
        url: str,
        prompt_description: str,
        examples: list[dict[str, Any]],
        model_id: str = "gemini-2.5-flash",
        max_char_buffer: int = 1000,
        temperature: float = 0.5,
        extraction_passes: int = 1,
        max_workers: int = 10
    ) -> dict[str, Any]:
        """
        Extract structured information from text content at a URL.
        
        Downloads text from the specified URL and extracts structured information
        using Large Language Models. Ideal for processing web articles, documents,
        or any text content accessible via HTTP/HTTPS.
        
        Args:
            url: URL to download text from (must start with http:// or https://)
            prompt_description: Clear instructions for what to extract
            examples: List of example extractions to guide the model
            model_id: LLM model to use (default: "gemini-2.5-flash")
            max_char_buffer: Max characters per chunk (default: 1000)
            temperature: Sampling temperature 0.0-1.0 (default: 0.5)
            extraction_passes: Number of extraction passes for better recall (default: 1)
            max_workers: Max parallel workers (default: 10)
            
        Returns:
            Dictionary containing extracted entities with source locations and metadata
            
        Raises:
            ToolError: If URL is invalid, download fails, or extraction fails
        """
        try:
            if not url.startswith(('http://', 'https://')):
                raise ToolError("URL must start with http:// or https://")
                
            if not examples:
                raise ToolError("At least one example is required for reliable extraction")
            
            if not prompt_description.strip():
                raise ToolError("Prompt description cannot be empty")
            
            # Validate that only Gemini models are supported
            if not model_id.startswith('gemini'):
                raise ToolError(
                    f"Only Google Gemini models are supported. Got: {model_id}. "
                    f"Use 'list_supported_models' tool to see available options."
                )
            
            # Create config object from individual parameters
            config = ExtractionConfig(
                model_id=model_id,
                max_char_buffer=max_char_buffer,
                temperature=temperature,
                extraction_passes=extraction_passes,
                max_workers=max_workers
            )
            
            # Get API key (server-side only for security)
            api_key = _get_api_key()
            if not api_key:
                raise ToolError(
                    "API key required. Server administrator must set LANGEXTRACT_API_KEY environment variable."
                )
            
            # Perform optimized extraction using cached client
            result = _langextract_client.extract(
                text_or_url=url,
                prompt_description=prompt_description,
                examples=examples,
                config=config,
                api_key=api_key
            )
            
            return _format_extraction_result(result, config, source_url=url)
            
        except ToolError:
            # Re-raise tool errors unchanged so validation messages are
            # not re-wrapped by the generic handler below.
            raise
        except ValueError as e:
            raise ToolError(f"Invalid parameters: {e}") from e
        except Exception as e:
            raise ToolError(f"URL extraction failed: {e}") from e
  • Pydantic ExtractionConfig model defining input parameters for the extraction process, used by both extract_from_text and extract_from_url tools.
    class ExtractionConfig(BaseModel):
        """Configuration for extraction parameters."""
        model_id: str = Field(default="gemini-2.5-flash", description="LLM model to use")
        max_char_buffer: int = Field(default=1000, description="Max characters per chunk")
        temperature: float = Field(default=0.5, description="Sampling temperature (0.0-1.0)")
        extraction_passes: int = Field(default=1, description="Number of extraction passes for better recall")
        max_workers: int = Field(default=10, description="Max parallel workers")
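    Because every field carries a default, a caller can override only the knobs it needs; a minimal sketch (values illustrative):

        # Override only the chunk size and the pass count; the remaining
        # fields keep their Field defaults, and Pydantic validates the
        # types on construction.
        config = ExtractionConfig(max_char_buffer=2000, extraction_passes=2)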
  • LangExtractClient class implementing the core extraction logic that the tool handlers delegate to. It caches language models, schemas, and resolvers, and performs the actual annotation using the langextract library.
    class LangExtractClient:
        """Optimized langextract client for MCP server usage.
        
        This client maintains persistent connections and caches expensive operations
        like schema generation and prompt templates for better performance in a
        long-running MCP server context.
        """
        
        def __init__(self):
            self._language_models: dict[str, Any] = {}
            self._schema_cache: dict[str, Any] = {}
            self._prompt_template_cache: dict[str, Any] = {}
            self._resolver_cache: dict[str, Any] = {}
            
        def _get_examples_hash(self, examples: list[dict[str, Any]]) -> str:
            """Generate a hash for caching based on examples."""
            examples_str = json.dumps(examples, sort_keys=True)
            return hashlib.md5(examples_str.encode()).hexdigest()
        
        def _get_language_model(self, config: ExtractionConfig, api_key: str, schema: Any | None = None, schema_hash: str | None = None) -> Any:
            """Get or create a cached language model instance."""
            # Include schema hash in cache key to prevent schema mutation conflicts
            model_key = f"{config.model_id}_{config.temperature}_{config.max_workers}_{schema_hash or 'no_schema'}"
            
            if model_key not in self._language_models:
                # Validate that only Gemini models are supported
                if not config.model_id.startswith('gemini'):
                    raise ValueError(f"Only Gemini models are supported. Got: {config.model_id}")
                    
                language_model = lx.inference.GeminiLanguageModel(
                    model_id=config.model_id,
                    api_key=api_key,
                    temperature=config.temperature,
                    max_workers=config.max_workers,
                    gemini_schema=schema
                )
                self._language_models[model_key] = language_model
                
            return self._language_models[model_key]
        
        def _get_schema(self, examples: list[dict[str, Any]], model_id: str) -> tuple[Any, str]:
            """Get or create a cached schema for the examples.
            
            Returns:
                Tuple of (schema, examples_hash) for use in caching language models
            """
            if not model_id.startswith('gemini'):
                return None, ""
                
            examples_hash = self._get_examples_hash(examples)
            schema_key = f"{model_id}_{examples_hash}"
            
            if schema_key not in self._schema_cache:
                # Convert examples to langextract format
                langextract_examples = self._create_langextract_examples(examples)
                
                # Create prompt template to generate schema
                prompt_template = lx.prompting.PromptTemplateStructured(description="Schema generation")
                prompt_template.examples.extend(langextract_examples)
                
                # Generate schema
                schema = lx.schema.GeminiSchema.from_examples(prompt_template.examples)
                self._schema_cache[schema_key] = schema
                
            return self._schema_cache[schema_key], examples_hash
        
        def _get_resolver(self, format_type: str = "JSON") -> Any:
            """Get or create a cached resolver."""
            if format_type not in self._resolver_cache:
                resolver = lx.resolver.Resolver(
                    fence_output=False,
                    format_type=lx.data.FormatType.JSON if format_type == "JSON" else lx.data.FormatType.YAML,
                    extraction_attributes_suffix="_attributes",
                    extraction_index_suffix=None,
                )
                self._resolver_cache[format_type] = resolver
                
            return self._resolver_cache[format_type]
        
        def _create_langextract_examples(self, examples: list[dict[str, Any]]) -> list[lx.data.ExampleData]:
            """Convert dictionary examples to langextract ExampleData objects."""
            langextract_examples = []
            
            for example in examples:
                extractions = []
                for extraction_data in example["extractions"]:
                    extractions.append(
                        lx.data.Extraction(
                            extraction_class=extraction_data["extraction_class"],
                            extraction_text=extraction_data["extraction_text"],
                            attributes=extraction_data.get("attributes", {})
                        )
                    )
                
                langextract_examples.append(
                    lx.data.ExampleData(
                        text=example["text"],
                        extractions=extractions
                    )
                )
            
            return langextract_examples
        
        def extract(
            self, 
            text_or_url: str,
            prompt_description: str,
            examples: list[dict[str, Any]],
            config: ExtractionConfig,
            api_key: str
        ) -> lx.data.AnnotatedDocument:
            """Optimized extraction using cached components."""
            # Get or generate schema first
            schema, examples_hash = self._get_schema(examples, config.model_id)
            
            # Get cached components with schema-aware caching
            language_model = self._get_language_model(config, api_key, schema, examples_hash)
            resolver = self._get_resolver("JSON")
            
            # Convert examples
            langextract_examples = self._create_langextract_examples(examples)
            
            # Create prompt template
            prompt_template = lx.prompting.PromptTemplateStructured(
                description=prompt_description
            )
            prompt_template.examples.extend(langextract_examples)
            
            # Create annotator
            annotator = lx.annotation.Annotator(
                language_model=language_model,
                prompt_template=prompt_template,
                format_type=lx.data.FormatType.JSON,
                fence_output=False,
            )
            
            # Perform extraction
            if text_or_url.startswith(('http://', 'https://')):
                # Download text first
                text = lx.io.download_text_from_url(text_or_url)
            else:
                text = text_or_url
                
            return annotator.annotate_text(
                text=text,
                resolver=resolver,
                max_char_buffer=config.max_char_buffer,
                batch_length=10,
                additional_context=None,
                debug=False,  # Disable debug for cleaner MCP output
                extraction_passes=config.extraction_passes,
            )
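    Assuming a valid Gemini API key (GEMINI_KEY below is hypothetical) and the illustrative arguments dict from earlier, the client can be driven directly; a second call with the same examples reuses the cached schema and language model rather than rebuilding them:

        # Sketch of direct client use; `arguments` is the illustrative
        # dict shown earlier and GEMINI_KEY is an assumed, valid API key.
        client = LangExtractClient()
        annotated = client.extract(
            text_or_url=arguments["url"],
            prompt_description=arguments["prompt_description"],
            examples=arguments["examples"],
            config=ExtractionConfig(),  # all defaults
            api_key=GEMINI_KEY,
        )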
  • _format_extraction_result helper function that converts the langextract AnnotatedDocument result into the dictionary format returned by the tool.
    def _format_extraction_result(result: lx.data.AnnotatedDocument, config: ExtractionConfig, source_url: str | None = None) -> dict[str, Any]:
        """Format langextract result for MCP response."""
        extractions = []
        
        for extraction in result.extractions or []:
            extractions.append({
                "extraction_class": extraction.extraction_class,
                "extraction_text": extraction.extraction_text,
                "attributes": extraction.attributes,
                "start_char": getattr(extraction, 'start_char', None),
                "end_char": getattr(extraction, 'end_char', None),
            })
        
        response = {
            "document_id": result.document_id if result.document_id else "anonymous",
            "total_extractions": len(extractions),
            "extractions": extractions,
            "metadata": {
                "model_id": config.model_id,
                "extraction_passes": config.extraction_passes,
                "max_char_buffer": config.max_char_buffer,
                "temperature": config.temperature,
            }
        }
        
        if source_url:
            response["source_url"] = source_url
            
        return response
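    For a single match, the formatted response looks roughly like this (field names taken from the code above, values invented):

        # Illustrative return value only. start_char and end_char fall
        # back to None when the Extraction object lacks those attributes.
        {
            "document_id": "anonymous",
            "total_extractions": 1,
            "extractions": [
                {
                    "extraction_class": "person",
                    "extraction_text": "Dr. Ada Lovelace",
                    "attributes": {"role": "keynote speaker"},
                    "start_char": None,
                    "end_char": None,
                }
            ],
            "metadata": {
                "model_id": "gemini-2.5-flash",
                "extraction_passes": 1,
                "max_char_buffer": 1000,
                "temperature": 0.5,
            },
            "source_url": "https://example.com/article.html",
        }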
  • _get_api_key helper function that retrieves the required Google Gemini API key from the LANGEXTRACT_API_KEY environment variable.
    def _get_api_key() -> str | None:
        """Get API key from environment (server-side only for security)."""
        return os.environ.get("LANGEXTRACT_API_KEY")

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden and does well by disclosing key behavioral traits: it downloads text from URLs, uses LLMs for extraction, mentions error conditions (invalid URL, download failure, extraction failure), describes the return format (dictionary with entities, source locations, metadata), and mentions parallel processing capability (max_workers). It doesn't cover rate limits or authentication requirements, but provides substantial behavioral context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with clear sections (purpose, ideal use cases, Args, Returns, Raises) and front-loads the core functionality. While comprehensive, some sentences could be more concise (e.g., the second sentence could be merged with the first). Overall, it's appropriately sized for an 8-parameter tool with no annotations.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (8 parameters, no annotations, but has output schema), the description is remarkably complete. It covers purpose, usage context, all parameter semantics, return format, error conditions, and behavioral details. The presence of an output schema means the description doesn't need to exhaustively document return values, and it provides everything else needed for effective use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage for 8 parameters, the description fully compensates by providing detailed semantic explanations for every parameter in the Args section. Each parameter gets clear meaning beyond just the schema's type information, explaining what 'prompt_description', 'examples', 'model_id', 'max_char_buffer', 'temperature', 'extraction_passes', and 'max_workers' actually do in the extraction context.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('extract structured information from text content at a URL') and distinguishes it from sibling tools by specifying it works with URLs rather than raw text (vs extract_from_text). It identifies the resource (text content at a URL) and method (using LLMs).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for when to use this tool ('ideal for processing web articles, documents, or any text content accessible via HTTP/HTTPS'), which implicitly distinguishes it from extract_from_text that works with raw text. However, it doesn't explicitly state when NOT to use it or name specific alternatives beyond the sibling tool names provided.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
