LangExtract MCP Server

by larsenweigle

extract_from_text

Extract structured information from unstructured text using Large Language Models with user-defined instructions and examples, mapping each extraction to its exact source location for precise grounding.

Instructions

Extract structured information from text using langextract.

Uses Large Language Models to extract structured information from unstructured text based on user-defined instructions and examples. Each extraction is mapped to its exact location in the source text for precise source grounding.

Args:
    text: The text to extract information from
    prompt_description: Clear instructions for what to extract
    examples: List of example extractions to guide the model
    model_id: LLM model to use (default: "gemini-2.5-flash")
    max_char_buffer: Max characters per chunk (default: 1000)
    temperature: Sampling temperature 0.0-1.0 (default: 0.5)
    extraction_passes: Number of extraction passes for better recall (default: 1)
    max_workers: Max parallel workers (default: 10)

Returns:
    Dictionary containing extracted entities with source locations and metadata

Raises:
    ToolError: If extraction fails due to invalid parameters or API issues
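
For orientation, a minimal tool-call payload might look like the following. The schema does not spell out the structure of each entry in examples; this sketch assumes the langextract-style shape (text plus a list of extractions with extraction_class, extraction_text, and optional attributes), and all sample values are hypothetical.

    arguments = {
        "text": "Patient was given 250 mg amoxicillin twice daily.",
        "prompt_description": "Extract medication names with their dosages.",
        "examples": [
            {
                "text": "Take 500 mg ibuprofen as needed for pain.",
                "extractions": [
                    {
                        "extraction_class": "medication",
                        "extraction_text": "ibuprofen",
                        "attributes": {"dosage": "500 mg"},
                    }
                ],
            }
        ],
        # Optional parameters fall back to the defaults listed above when omitted.
        "model_id": "gemini-2.5-flash",
    }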

Input Schema

Name                Required  Default
text                Yes       -
prompt_description  Yes       -
examples            Yes       -
model_id            No        gemini-2.5-flash
max_char_buffer     No        -
temperature         No        -
extraction_passes   No        -
max_workers         No        -

(The schema defines no per-parameter descriptions; parameter semantics are documented in the Instructions above.)

Output Schema

No fields documented.
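
As an illustration only, a response assembled by the _format_extraction_result helper shown under Implementation Reference would look roughly like the dictionary below; all values are hypothetical.

    {
        "document_id": "anonymous",
        "total_extractions": 1,
        "extractions": [
            {
                "extraction_class": "medication",
                "extraction_text": "amoxicillin",
                "attributes": {"dosage": "250 mg"},
                "start_char": 25,
                "end_char": 36,
            }
        ],
        "metadata": {
            "model_id": "gemini-2.5-flash",
            "extraction_passes": 1,
            "max_char_buffer": 1000,
            "temperature": 0.5,
        },
    }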

Implementation Reference

  • The @mcp.tool decorated handler function that implements the core logic for the 'extract_from_text' tool. It validates inputs, retrieves the API key, calls the cached LangExtractClient for extraction, formats results, and handles errors. (A hedged sketch of the _get_api_key helper it relies on appears after this list.)
    @mcp.tool
    def extract_from_text(
        text: str,
        prompt_description: str,
        examples: list[dict[str, Any]],
        model_id: str = "gemini-2.5-flash",
        max_char_buffer: int = 1000,
        temperature: float = 0.5,
        extraction_passes: int = 1,
        max_workers: int = 10
    ) -> dict[str, Any]:
        """
        Extract structured information from text using langextract.
        
        Uses Large Language Models to extract structured information from unstructured text
        based on user-defined instructions and examples. Each extraction is mapped to its
        exact location in the source text for precise source grounding.
        
        Args:
            text: The text to extract information from
            prompt_description: Clear instructions for what to extract
            examples: List of example extractions to guide the model
            model_id: LLM model to use (default: "gemini-2.5-flash")
            max_char_buffer: Max characters per chunk (default: 1000)
            temperature: Sampling temperature 0.0-1.0 (default: 0.5)
            extraction_passes: Number of extraction passes for better recall (default: 1)
            max_workers: Max parallel workers (default: 10)
            
        Returns:
            Dictionary containing extracted entities with source locations and metadata
            
        Raises:
            ToolError: If extraction fails due to invalid parameters or API issues
        """
        try:
            if not examples:
                raise ToolError("At least one example is required for reliable extraction")
            
            if not prompt_description.strip():
                raise ToolError("Prompt description cannot be empty")
                
            if not text.strip():
                raise ToolError("Input text cannot be empty")
            
            # Validate that only Gemini models are supported
            if not model_id.startswith('gemini'):
                raise ToolError(
                    f"Only Google Gemini models are supported. Got: {model_id}. "
                    f"Use 'list_supported_models' tool to see available options."
                )
            
            # Create config object from individual parameters
            config = ExtractionConfig(
                model_id=model_id,
                max_char_buffer=max_char_buffer,
                temperature=temperature,
                extraction_passes=extraction_passes,
                max_workers=max_workers
            )
            
            # Get API key (server-side only for security)
            api_key = _get_api_key()
            if not api_key:
                raise ToolError(
                    "API key required. Server administrator must set LANGEXTRACT_API_KEY environment variable."
                )
            
            # Perform optimized extraction using cached client
            result = _langextract_client.extract(
                text_or_url=text,
                prompt_description=prompt_description,
                examples=examples,
                config=config,
                api_key=api_key
            )
            
            return _format_extraction_result(result, config)
            
        except ValueError as e:
            raise ToolError(f"Invalid parameters: {str(e)}")
        except Exception as e:
            raise ToolError(f"Extraction failed: {str(e)}")
  • Pydantic BaseModel defining ExtractionConfig, which structures the tool's configurable parameters like model_id, temperature, etc., used internally in the handler.
    class ExtractionConfig(BaseModel):
        """Configuration for extraction parameters."""
        model_id: str = Field(default="gemini-2.5-flash", description="LLM model to use")
        max_char_buffer: int = Field(default=1000, description="Max characters per chunk")
        temperature: float = Field(default=0.5, description="Sampling temperature (0.0-1.0)")
        extraction_passes: int = Field(default=1, description="Number of extraction passes for better recall")
        max_workers: int = Field(default=10, description="Max parallel workers")
  • The extract method of the global LangExtractClient instance, containing the core langextract library integration logic delegated from the tool handler, including schema caching, model instantiation, and annotation. (A hedged sketch of the _create_langextract_examples conversion step it calls appears after this list.)
    def extract(
        self, 
        text_or_url: str,
        prompt_description: str,
        examples: list[dict[str, Any]],
        config: ExtractionConfig,
        api_key: str
    ) -> lx.data.AnnotatedDocument:
        """Optimized extraction using cached components."""
        # Get or generate schema first
        schema, examples_hash = self._get_schema(examples, config.model_id)
        
        # Get cached components with schema-aware caching
        language_model = self._get_language_model(config, api_key, schema, examples_hash)
        resolver = self._get_resolver("JSON")
        
        # Convert examples
        langextract_examples = self._create_langextract_examples(examples)
        
        # Create prompt template
        prompt_template = lx.prompting.PromptTemplateStructured(
            description=prompt_description
        )
        prompt_template.examples.extend(langextract_examples)
        
        # Create annotator
        annotator = lx.annotation.Annotator(
            language_model=language_model,
            prompt_template=prompt_template,
            format_type=lx.data.FormatType.JSON,
            fence_output=False,
        )
        
        # Perform extraction
        if text_or_url.startswith(('http://', 'https://')):
            # Download text first
            text = lx.io.download_text_from_url(text_or_url)
        else:
            text = text_or_url
            
        return annotator.annotate_text(
            text=text,
            resolver=resolver,
            max_char_buffer=config.max_char_buffer,
            batch_length=10,
            additional_context=None,
            debug=False,  # Disable debug for cleaner MCP output
            extraction_passes=config.extraction_passes,
        )
  • Helper function _format_extraction_result that converts the langextract AnnotatedDocument to the standardized dictionary response format returned by the tool.
    def _format_extraction_result(result: lx.data.AnnotatedDocument, config: ExtractionConfig, source_url: str | None = None) -> dict[str, Any]:
        """Format langextract result for MCP response."""
        extractions = []
        
        for extraction in result.extractions or []:
            extractions.append({
                "extraction_class": extraction.extraction_class,
                "extraction_text": extraction.extraction_text,
                "attributes": extraction.attributes,
                "start_char": getattr(extraction, 'start_char', None),
                "end_char": getattr(extraction, 'end_char', None),
            })
        
        response = {
            "document_id": result.document_id if result.document_id else "anonymous",
            "total_extractions": len(extractions),
            "extractions": extractions,
            "metadata": {
                "model_id": config.model_id,
                "extraction_passes": config.extraction_passes,
                "max_char_buffer": config.max_char_buffer,
                "temperature": config.temperature,
            }
        }
        
        if source_url:
            response["source_url"] = source_url
            
        return response
  • The @mcp.tool decorator line that registers the extract_from_text function as an MCP tool with the FastMCP server.
    @mcp.tool
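
Two helpers referenced above are not included in the reference. The handler's error message names the LANGEXTRACT_API_KEY environment variable, so a minimal sketch of _get_api_key, assuming it does nothing more than read that variable, could look like this:

    import os

    def _get_api_key() -> str | None:
        # Sketch (assumption): read the server-side key from the environment,
        # matching the LANGEXTRACT_API_KEY variable named in the handler's error message.
        return os.environ.get("LANGEXTRACT_API_KEY")

Similarly, the extract method delegates example conversion to _create_langextract_examples. A minimal sketch, written here as a standalone function and assuming each example dict carries a text field plus an extractions list mirroring lx.data.Extraction's fields (the exact keys the server accepts are not shown here):

    from typing import Any

    import langextract as lx

    def _create_langextract_examples(examples: list[dict[str, Any]]) -> list[lx.data.ExampleData]:
        """Sketch (assumption): map plain example dicts onto langextract's example objects."""
        converted: list[lx.data.ExampleData] = []
        for example in examples:
            extractions = [
                lx.data.Extraction(
                    extraction_class=item["extraction_class"],
                    extraction_text=item["extraction_text"],
                    attributes=item.get("attributes"),
                )
                for item in example.get("extractions", [])
            ]
            converted.append(lx.data.ExampleData(text=example["text"], extractions=extractions))
        return converted
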
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It adds useful context such as mapping extractions to source locations for grounding, default values for parameters, and error handling (Raises: ToolError). However, it does not cover aspects like rate limits, authentication needs, or performance characteristics, leaving some behavioral gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with a clear purpose statement, parameter explanations, and return/error details. It is appropriately sized and front-loaded, with most critical information (purpose and key parameters) presented early. Some minor verbosity exists in parameter descriptions, but overall it earns its place efficiently.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity (8 parameters, no annotations, but has output schema), the description is largely complete. It covers purpose, parameters, returns, and errors. With an output schema present, it need not explain return values in detail, but it could improve by addressing sibling tool differentiation more explicitly. The gaps are minor relative to the tool's scope.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate. It provides detailed semantics for all 8 parameters, explaining their purposes (e.g., 'text: The text to extract information from'), default values, and ranges (e.g., 'temperature: Sampling temperature 0.0-1.0'). This adds significant meaning beyond the basic schema, though it could be more explicit about parameter interactions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose as extracting structured information from unstructured text using Large Language Models, specifying the method (langextract), and distinguishing it from sibling tools like extract_from_url by focusing on text input rather than URLs. It provides a specific verb ('extract') and resource ('structured information from text').

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage context by mentioning 'user-defined instructions and examples' and 'unstructured text,' but does not explicitly state when to use this tool versus alternatives like extract_from_url or generate_visualization. It provides clear context for extraction tasks but lacks explicit exclusions or comparisons with siblings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
