LangExtract MCP Server

by larsenweigle

extract_from_text

Extract structured information from unstructured text using Large Language Models with user-defined instructions and examples, mapping each extraction to its exact source location for precise grounding.

Instructions

Extract structured information from text using langextract.

Uses Large Language Models to extract structured information from unstructured text based on user-defined instructions and examples. Each extraction is mapped to its exact location in the source text for precise source grounding.

Args:
    text: The text to extract information from
    prompt_description: Clear instructions for what to extract
    examples: List of example extractions to guide the model
    model_id: LLM model to use (default: "gemini-2.5-flash")
    max_char_buffer: Max characters per chunk (default: 1000)
    temperature: Sampling temperature 0.0-1.0 (default: 0.5)
    extraction_passes: Number of extraction passes for better recall (default: 1)
    max_workers: Max parallel workers (default: 10)

Returns:
    Dictionary containing extracted entities with source locations and metadata

Raises:
    ToolError: If extraction fails due to invalid parameters or API issues
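
For orientation, a minimal tool-call payload might look like the following. The schema does not spell out the structure of each entry in examples; this sketch assumes the langextract-style shape (text plus a list of extractions with extraction_class, extraction_text, and optional attributes), and all sample values are hypothetical.

    arguments = {
        "text": "Patient was given 250 mg amoxicillin twice daily.",
        "prompt_description": "Extract medication names with their dosages.",
        "examples": [
            {
                "text": "Take 500 mg ibuprofen as needed for pain.",
                "extractions": [
                    {
                        "extraction_class": "medication",
                        "extraction_text": "ibuprofen",
                        "attributes": {"dosage": "500 mg"},
                    }
                ],
            }
        ],
        # Optional parameters fall back to the defaults listed above when omitted.
        "model_id": "gemini-2.5-flash",
    }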

Input Schema

Name                Required  Default
text                Yes       -
prompt_description  Yes       -
examples            Yes       -
model_id            No        gemini-2.5-flash
max_char_buffer     No        -
temperature         No        -
extraction_passes   No        -
max_workers         No        -

(The schema defines no per-parameter descriptions; parameter semantics are documented in the Instructions above.)

Output Schema

No fields documented.
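
As an illustration only, a response assembled by the _format_extraction_result helper shown under Implementation Reference would look roughly like the dictionary below; all values are hypothetical.

    {
        "document_id": "anonymous",
        "total_extractions": 1,
        "extractions": [
            {
                "extraction_class": "medication",
                "extraction_text": "amoxicillin",
                "attributes": {"dosage": "250 mg"},
                "start_char": 25,
                "end_char": 36,
            }
        ],
        "metadata": {
            "model_id": "gemini-2.5-flash",
            "extraction_passes": 1,
            "max_char_buffer": 1000,
            "temperature": 0.5,
        },
    }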

Implementation Reference

  • The @mcp.tool decorated handler function that implements the core logic for the 'extract_from_text' tool. It validates inputs, retrieves the API key, calls the cached LangExtractClient for extraction, formats results, and handles errors. (A hedged sketch of the _get_api_key helper it relies on appears after this list.)
    @mcp.tool
    def extract_from_text(
        text: str,
        prompt_description: str,
        examples: list[dict[str, Any]],
        model_id: str = "gemini-2.5-flash",
        max_char_buffer: int = 1000,
        temperature: float = 0.5,
        extraction_passes: int = 1,
        max_workers: int = 10
    ) -> dict[str, Any]:
        """
        Extract structured information from text using langextract.
        
        Uses Large Language Models to extract structured information from unstructured text
        based on user-defined instructions and examples. Each extraction is mapped to its
        exact location in the source text for precise source grounding.
        
        Args:
            text: The text to extract information from
            prompt_description: Clear instructions for what to extract
            examples: List of example extractions to guide the model
            model_id: LLM model to use (default: "gemini-2.5-flash")
            max_char_buffer: Max characters per chunk (default: 1000)
            temperature: Sampling temperature 0.0-1.0 (default: 0.5)
            extraction_passes: Number of extraction passes for better recall (default: 1)
            max_workers: Max parallel workers (default: 10)
            
        Returns:
            Dictionary containing extracted entities with source locations and metadata
            
        Raises:
            ToolError: If extraction fails due to invalid parameters or API issues
        """
        try:
            if not examples:
                raise ToolError("At least one example is required for reliable extraction")
            
            if not prompt_description.strip():
                raise ToolError("Prompt description cannot be empty")
                
            if not text.strip():
                raise ToolError("Input text cannot be empty")
            
            # Validate that only Gemini models are supported
            if not model_id.startswith('gemini'):
                raise ToolError(
                    f"Only Google Gemini models are supported. Got: {model_id}. "
                    f"Use 'list_supported_models' tool to see available options."
                )
            
            # Create config object from individual parameters
            config = ExtractionConfig(
                model_id=model_id,
                max_char_buffer=max_char_buffer,
                temperature=temperature,
                extraction_passes=extraction_passes,
                max_workers=max_workers
            )
            
            # Get API key (server-side only for security)
            api_key = _get_api_key()
            if not api_key:
                raise ToolError(
                    "API key required. Server administrator must set LANGEXTRACT_API_KEY environment variable."
                )
            
            # Perform optimized extraction using cached client
            result = _langextract_client.extract(
                text_or_url=text,
                prompt_description=prompt_description,
                examples=examples,
                config=config,
                api_key=api_key
            )
            
            return _format_extraction_result(result, config)
            
        except ValueError as e:
            raise ToolError(f"Invalid parameters: {str(e)}")
        except Exception as e:
            raise ToolError(f"Extraction failed: {str(e)}")
  • Pydantic BaseModel defining ExtractionConfig, which structures the tool's configurable parameters like model_id, temperature, etc., used internally in the handler.
    class ExtractionConfig(BaseModel):
        """Configuration for extraction parameters."""
        model_id: str = Field(default="gemini-2.5-flash", description="LLM model to use")
        max_char_buffer: int = Field(default=1000, description="Max characters per chunk")
        temperature: float = Field(default=0.5, description="Sampling temperature (0.0-1.0)")
        extraction_passes: int = Field(default=1, description="Number of extraction passes for better recall")
        max_workers: int = Field(default=10, description="Max parallel workers")
  • The extract method of the global LangExtractClient instance, containing the core langextract library integration logic delegated from the tool handler, including schema caching, model instantiation, and annotation. (A hedged sketch of the _create_langextract_examples conversion step it calls appears after this list.)
    def extract(
        self, 
        text_or_url: str,
        prompt_description: str,
        examples: list[dict[str, Any]],
        config: ExtractionConfig,
        api_key: str
    ) -> lx.data.AnnotatedDocument:
        """Optimized extraction using cached components."""
        # Get or generate schema first
        schema, examples_hash = self._get_schema(examples, config.model_id)
        
        # Get cached components with schema-aware caching
        language_model = self._get_language_model(config, api_key, schema, examples_hash)
        resolver = self._get_resolver("JSON")
        
        # Convert examples
        langextract_examples = self._create_langextract_examples(examples)
        
        # Create prompt template
        prompt_template = lx.prompting.PromptTemplateStructured(
            description=prompt_description
        )
        prompt_template.examples.extend(langextract_examples)
        
        # Create annotator
        annotator = lx.annotation.Annotator(
            language_model=language_model,
            prompt_template=prompt_template,
            format_type=lx.data.FormatType.JSON,
            fence_output=False,
        )
        
        # Perform extraction
        if text_or_url.startswith(('http://', 'https://')):
            # Download text first
            text = lx.io.download_text_from_url(text_or_url)
        else:
            text = text_or_url
            
        return annotator.annotate_text(
            text=text,
            resolver=resolver,
            max_char_buffer=config.max_char_buffer,
            batch_length=10,
            additional_context=None,
            debug=False,  # Disable debug for cleaner MCP output
            extraction_passes=config.extraction_passes,
        )
  • Helper function _format_extraction_result that converts the langextract AnnotatedDocument to the standardized dictionary response format returned by the tool.
    def _format_extraction_result(result: lx.data.AnnotatedDocument, config: ExtractionConfig, source_url: str | None = None) -> dict[str, Any]:
        """Format langextract result for MCP response."""
        extractions = []
        
        for extraction in result.extractions or []:
            extractions.append({
                "extraction_class": extraction.extraction_class,
                "extraction_text": extraction.extraction_text,
                "attributes": extraction.attributes,
                "start_char": getattr(extraction, 'start_char', None),
                "end_char": getattr(extraction, 'end_char', None),
            })
        
        response = {
            "document_id": result.document_id if result.document_id else "anonymous",
            "total_extractions": len(extractions),
            "extractions": extractions,
            "metadata": {
                "model_id": config.model_id,
                "extraction_passes": config.extraction_passes,
                "max_char_buffer": config.max_char_buffer,
                "temperature": config.temperature,
            }
        }
        
        if source_url:
            response["source_url"] = source_url
            
        return response
  • The @mcp.tool decorator line that registers the extract_from_text function as an MCP tool with the FastMCP server.
    @mcp.tool
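
Two helpers referenced above are not included in the reference. The handler's error message names the LANGEXTRACT_API_KEY environment variable, so a minimal sketch of _get_api_key, assuming it does nothing more than read that variable, could look like this:

    import os

    def _get_api_key() -> str | None:
        # Sketch (assumption): read the server-side key from the environment,
        # matching the LANGEXTRACT_API_KEY variable named in the handler's error message.
        return os.environ.get("LANGEXTRACT_API_KEY")

Similarly, the extract method delegates example conversion to _create_langextract_examples. A minimal sketch, written here as a standalone function and assuming each example dict carries a text field plus an extractions list mirroring lx.data.Extraction's fields (the exact keys the server accepts are not shown here):

    from typing import Any

    import langextract as lx

    def _create_langextract_examples(examples: list[dict[str, Any]]) -> list[lx.data.ExampleData]:
        """Sketch (assumption): map plain example dicts onto langextract's example objects."""
        converted: list[lx.data.ExampleData] = []
        for example in examples:
            extractions = [
                lx.data.Extraction(
                    extraction_class=item["extraction_class"],
                    extraction_text=item["extraction_text"],
                    attributes=item.get("attributes"),
                )
                for item in example.get("extractions", [])
            ]
            converted.append(lx.data.ExampleData(text=example["text"], extractions=extractions))
        return converted
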
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It adds useful context such as mapping extractions to source locations for grounding, default values for parameters, and error handling (Raises: ToolError). However, it does not cover aspects like rate limits, authentication needs, or performance characteristics, leaving some behavioral gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with a clear purpose statement, parameter explanations, and return/error details. It is appropriately sized and front-loaded, with most critical information (purpose and key parameters) presented early. Some minor verbosity exists in parameter descriptions, but overall it earns its place efficiently.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity (8 parameters, no annotations, but has output schema), the description is largely complete. It covers purpose, parameters, returns, and errors. With an output schema present, it need not explain return values in detail, but it could improve by addressing sibling tool differentiation more explicitly. The gaps are minor relative to the tool's scope.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate. It provides detailed semantics for all 8 parameters, explaining their purposes (e.g., 'text: The text to extract information from'), default values, and ranges (e.g., 'temperature: Sampling temperature 0.0-1.0'). This adds significant meaning beyond the basic schema, though it could be more explicit about parameter interactions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose as extracting structured information from unstructured text using Large Language Models, specifying the method (langextract), and distinguishing it from sibling tools like extract_from_url by focusing on text input rather than URLs. It provides a specific verb ('extract') and resource ('structured information from text').

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage context by mentioning 'user-defined instructions and examples' and 'unstructured text,' but does not explicitly state when to use this tool versus alternatives like extract_from_url or generate_visualization. It provides clear context for extraction tasks but lacks explicit exclusions or comparisons with siblings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
