extract_from_text
Extract structured data from unstructured text using Large Language Models. Define extraction instructions and examples to identify entities, map them to source locations, and retrieve precise metadata for accurate grounding.
Instructions
Extract structured information from text using langextract.
Uses Large Language Models to extract structured information from unstructured text based on user-defined instructions and examples. Each extraction is mapped to its exact location in the source text for precise source grounding.
Args:
- `text`: The text to extract information from
- `prompt_description`: Clear instructions for what to extract
- `examples`: List of example extractions to guide the model
- `model_id`: LLM model to use (default: `"gemini-2.5-flash"`)
- `max_char_buffer`: Max characters per chunk (default: 1000)
- `temperature`: Sampling temperature, 0.0-1.0 (default: 0.5)
- `extraction_passes`: Number of extraction passes for better recall (default: 1)
- `max_workers`: Max parallel workers (default: 10)
Returns: A dictionary containing the extracted entities with source locations and metadata.
Raises: `ToolError` if extraction fails due to invalid parameters or API issues.
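The sketch below shows what a call's arguments might look like. It is illustrative only: the shape of each `examples` entry (a `text` field plus a list of extractions with `extraction_class`, `extraction_text`, and `attributes`) is an assumption based on langextract's example data model, not a documented contract of this tool.

```python
# Illustrative arguments for an extract_from_text call (hypothetical payload;
# the exact shape of each "examples" entry is an assumption, see note above).
arguments = {
    "text": "Dr. Ada Lovelace prescribed 200 mg of ibuprofen twice daily.",
    "prompt_description": "Extract medications with their dosage and frequency.",
    "examples": [
        {
            "text": "The patient was given 500 mg of amoxicillin every 8 hours.",
            "extractions": [
                {
                    "extraction_class": "medication",
                    "extraction_text": "amoxicillin",
                    "attributes": {"dosage": "500 mg", "frequency": "every 8 hours"},
                }
            ],
        }
    ],
    "model_id": "gemini-2.5-flash",   # default
    "extraction_passes": 2,           # optional: trade latency for better recall
}
```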
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| examples | Yes | List of example extractions to guide the model | |
| extraction_passes | No | Number of extraction passes for better recall | 1 |
| max_char_buffer | No | Max characters per chunk | 1000 |
| max_workers | No | Max parallel workers | 10 |
| model_id | No | LLM model to use | gemini-2.5-flash |
| prompt_description | Yes | Clear instructions for what to extract | |
| temperature | No | Sampling temperature (0.0-1.0) | 0.5 |
| text | Yes | The text to extract information from | |
Input Schema (JSON Schema)
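The generated JSON Schema is not reproduced here. The following is a hedged reconstruction from the function signature, written as a Python dict; the server's auto-generated schema may differ in descriptions, titles, and ordering.

```python
# Reconstructed sketch of the input schema, derived from the tool's signature
# and defaults. Not the literal auto-generated schema.
INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string"},
        "prompt_description": {"type": "string"},
        "examples": {"type": "array", "items": {"type": "object"}},
        "model_id": {"type": "string", "default": "gemini-2.5-flash"},
        "max_char_buffer": {"type": "integer", "default": 1000},
        "temperature": {"type": "number", "default": 0.5},
        "extraction_passes": {"type": "integer", "default": 1},
        "max_workers": {"type": "integer", "default": 10},
    },
    "required": ["text", "prompt_description", "examples"],
}
```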
Implementation Reference
- src/langextract_mcp/server.py:235-317 (handler): The `@mcp.tool`-decorated function that implements the core extraction logic from text using the LangExtractClient and the langextract library. It handles input validation, configuration, API key retrieval, extraction, and result formatting.

```python
@mcp.tool
def extract_from_text(
    text: str,
    prompt_description: str,
    examples: list[dict[str, Any]],
    model_id: str = "gemini-2.5-flash",
    max_char_buffer: int = 1000,
    temperature: float = 0.5,
    extraction_passes: int = 1,
    max_workers: int = 10
) -> dict[str, Any]:
    """
    Extract structured information from text using langextract.

    Uses Large Language Models to extract structured information from
    unstructured text based on user-defined instructions and examples.
    Each extraction is mapped to its exact location in the source text
    for precise source grounding.

    Args:
        text: The text to extract information from
        prompt_description: Clear instructions for what to extract
        examples: List of example extractions to guide the model
        model_id: LLM model to use (default: "gemini-2.5-flash")
        max_char_buffer: Max characters per chunk (default: 1000)
        temperature: Sampling temperature 0.0-1.0 (default: 0.5)
        extraction_passes: Number of extraction passes for better recall (default: 1)
        max_workers: Max parallel workers (default: 10)

    Returns:
        Dictionary containing extracted entities with source locations and metadata

    Raises:
        ToolError: If extraction fails due to invalid parameters or API issues
    """
    try:
        if not examples:
            raise ToolError("At least one example is required for reliable extraction")
        if not prompt_description.strip():
            raise ToolError("Prompt description cannot be empty")
        if not text.strip():
            raise ToolError("Input text cannot be empty")

        # Validate that only Gemini models are supported
        if not model_id.startswith('gemini'):
            raise ToolError(
                f"Only Google Gemini models are supported. Got: {model_id}. "
                f"Use 'list_supported_models' tool to see available options."
            )

        # Create config object from individual parameters
        config = ExtractionConfig(
            model_id=model_id,
            max_char_buffer=max_char_buffer,
            temperature=temperature,
            extraction_passes=extraction_passes,
            max_workers=max_workers
        )

        # Get API key (server-side only for security)
        api_key = _get_api_key()
        if not api_key:
            raise ToolError(
                "API key required. Server administrator must set LANGEXTRACT_API_KEY environment variable."
            )

        # Perform optimized extraction using cached client
        result = _langextract_client.extract(
            text_or_url=text,
            prompt_description=prompt_description,
            examples=examples,
            config=config,
            api_key=api_key
        )

        return _format_extraction_result(result, config)

    except ValueError as e:
        raise ToolError(f"Invalid parameters: {str(e)}")
    except Exception as e:
        raise ToolError(f"Extraction failed: {str(e)}")
```
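`_get_api_key` is referenced above but not shown in this section. A minimal sketch, assuming it simply reads the environment variable named in the error message:

```python
import os

def _get_api_key() -> str | None:
    """Hypothetical sketch: the error message above implies the key is read
    server-side from the LANGEXTRACT_API_KEY environment variable."""
    return os.environ.get("LANGEXTRACT_API_KEY")
```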
- src/langextract_mcp/server.py:21-27 (schema): Pydantic `BaseModel` defining the configuration schema for extraction parameters, used within the tool handler.

```python
class ExtractionConfig(BaseModel):
    """Configuration for extraction parameters."""
    model_id: str = Field(default="gemini-2.5-flash", description="LLM model to use")
    max_char_buffer: int = Field(default=1000, description="Max characters per chunk")
    temperature: float = Field(default=0.5, description="Sampling temperature (0.0-1.0)")
    extraction_passes: int = Field(default=1, description="Number of extraction passes for better recall")
    max_workers: int = Field(default=10, description="Max parallel workers")
```
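For illustration only, overriding a couple of fields follows standard Pydantic usage; unspecified fields keep their declared defaults:

```python
# Illustrative only: construct the config with two overrides.
config = ExtractionConfig(temperature=0.2, extraction_passes=3)
assert config.model_id == "gemini-2.5-flash"
assert config.max_workers == 10
```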
- Helper function that formats the raw langextract `AnnotatedDocument` result into the structured dictionary returned by the tool.

```python
def _format_extraction_result(
    result: lx.data.AnnotatedDocument,
    config: ExtractionConfig,
    source_url: str | None = None
) -> dict[str, Any]:
    """Format langextract result for MCP response."""
    extractions = []
    for extraction in result.extractions or []:
        extractions.append({
            "extraction_class": extraction.extraction_class,
            "extraction_text": extraction.extraction_text,
            "attributes": extraction.attributes,
            "start_char": getattr(extraction, 'start_char', None),
            "end_char": getattr(extraction, 'end_char', None),
        })

    response = {
        "document_id": result.document_id if result.document_id else "anonymous",
        "total_extractions": len(extractions),
        "extractions": extractions,
        "metadata": {
            "model_id": config.model_id,
            "extraction_passes": config.extraction_passes,
            "max_char_buffer": config.max_char_buffer,
            "temperature": config.temperature,
        }
    }

    if source_url:
        response["source_url"] = source_url

    return response
```
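Based on the formatting code above, a successful response plausibly has the following shape; the concrete values are invented for illustration.

```python
# Shape of the tool's return value as implied by _format_extraction_result.
example_response = {
    "document_id": "anonymous",
    "total_extractions": 1,
    "extractions": [
        {
            "extraction_class": "medication",
            "extraction_text": "ibuprofen",
            "attributes": {"dosage": "200 mg"},
            "start_char": 34,
            "end_char": 43,
        }
    ],
    "metadata": {
        "model_id": "gemini-2.5-flash",
        "extraction_passes": 1,
        "max_char_buffer": 1000,
        "temperature": 0.5,
    },
}
```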
- Key method of the `LangExtractClient` class that orchestrates the actual langextract annotation process using cached components for efficiency.

```python
def extract(
    self,
    text_or_url: str,
    prompt_description: str,
    examples: list[dict[str, Any]],
    config: ExtractionConfig,
    api_key: str
) -> lx.data.AnnotatedDocument:
    """Optimized extraction using cached components."""
    # Get or generate schema first
    schema, examples_hash = self._get_schema(examples, config.model_id)

    # Get cached components with schema-aware caching
    language_model = self._get_language_model(config, api_key, schema, examples_hash)
    resolver = self._get_resolver("JSON")

    # Convert examples
    langextract_examples = self._create_langextract_examples(examples)

    # Create prompt template
    prompt_template = lx.prompting.PromptTemplateStructured(
        description=prompt_description
    )
    prompt_template.examples.extend(langextract_examples)

    # Create annotator
    annotator = lx.annotation.Annotator(
        language_model=language_model,
        prompt_template=prompt_template,
        format_type=lx.data.FormatType.JSON,
        fence_output=False,
    )

    # Perform extraction
    if text_or_url.startswith(('http://', 'https://')):
        # Download text first
        text = lx.io.download_text_from_url(text_or_url)
    else:
        text = text_or_url

    return annotator.annotate_text(
        text=text,
        resolver=resolver,
        max_char_buffer=config.max_char_buffer,
        batch_length=10,
        additional_context=None,
        debug=False,  # Disable debug for cleaner MCP output
        extraction_passes=config.extraction_passes,
    )
```
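The `_get_schema` and `_get_language_model` helpers are not shown here. A hypothetical sketch of the schema-aware caching idea, keyed on the model id plus a hash of the serialized examples, might look like this:

```python
import hashlib
import json
from collections.abc import Callable
from typing import Any

# Hypothetical sketch only: cache components on (model_id, examples_hash) so
# repeated calls with the same prompt setup reuse the same instance. The real
# helpers in the repository may differ.
def _examples_hash(examples: list[dict[str, Any]]) -> str:
    canonical = json.dumps(examples, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

_model_cache: dict[tuple[str, str], Any] = {}

def get_cached_model(model_id: str, examples: list[dict[str, Any]],
                     factory: Callable[[], Any]) -> Any:
    key = (model_id, _examples_hash(examples))
    if key not in _model_cache:
        _model_cache[key] = factory()
    return _model_cache[key]
```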
- src/langextract_mcp/server.py:235-236 (registration): The `@mcp.tool` decorator registers the `extract_from_text` function as an MCP tool.

```python
@mcp.tool
def extract_from_text(
```
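For context, a typical FastMCP setup around this registration might look like the sketch below. Only the bare `@mcp.tool` decorator usage is confirmed by the source; the import path, server name, and `run()` entry point are assumptions.

```python
# Assumed setup; only the @mcp.tool decorator usage is taken from the source.
from fastmcp import FastMCP

mcp = FastMCP("langextract-mcp")

@mcp.tool
def extract_from_text(text: str, prompt_description: str, examples: list) -> dict:
    """Trimmed signature; the full handler is shown in the reference above."""
    ...

if __name__ == "__main__":
    mcp.run()
```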