partition_local_file
Convert local files into structured JSON or Markdown data using AI-powered transformation strategies for documents, PDFs, and images.
Instructions
Transform a local file into structured data using the Unstructured API.
Args:
    input_file_path: The absolute path to the file.
    output_file_dir: The absolute path to the directory where the output file should be saved.
    strategy: The strategy for transformation. Available strategies:
        VLM - most advanced transformation suitable for difficult PDFs and Images
        hi_res - high resolution transformation suitable for most document types
        fast - fast transformation suitable for PDFs with extractable text
        auto - automatically choose the best strategy based on the input file
    vlm_model: The VLM model to use for the transformation.
    vlm_model_provider: The VLM model provider to use for the transformation.
    output_type: The type of output to generate. Options: 'json' for json or 'md' for markdown.
Returns:
    A string containing the structured data or a message indicating the output file path with the structured data.
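The output file named in the return message follows from the handler listed under Implementation Reference: the input file name is kept and only its last suffix is swapped for the output type. A minimal sketch of that naming rule follows. The function name `derive_output_file` is hypothetical, and `PurePosixPath` is used here instead of the handler's `Path` only so that the example touches no files and behaves the same on any operating system.

```python
from pathlib import PurePosixPath


def derive_output_file(
    input_file_path: str, output_file_dir: str, output_type: str = "json"
) -> PurePosixPath:
    """Hypothetical helper mirroring the handler's output naming rule."""
    if output_type not in ("json", "md"):
        raise ValueError(f"Invalid output type '{output_type}'. Must be 'json' or 'md'.")
    input_path = PurePosixPath(input_file_path)
    # with_suffix replaces only the final suffix, so "archive.tar.gz" -> "archive.tar.md"
    return PurePosixPath(output_file_dir) / input_path.with_suffix(f".{output_type}").name


print(derive_output_file("/data/reports/q3.pdf", "/data/out"))
print(derive_output_file("/data/reports/archive.tar.gz", "/data/out", "md"))
```

Because only the final suffix changes, two inputs that share a stem (for example `q3.pdf` and `q3.docx`) would map to the same output file in the same directory.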
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| input_file_path | Yes | | |
| output_file_dir | Yes | | |
| strategy | No | | vlm |
| vlm_model | No | | claude-3-5-sonnet-20241022 |
| vlm_model_provider | No | | anthropic |
| output_type | No | | json |
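As a rough illustration of how the schema's required fields and defaults combine, the sketch below builds an arguments dictionary for the tool. The helper `build_arguments` is hypothetical and not part of the server; the absolute-path check assumes POSIX-style paths, and the `output_type` check mirrors the validation in the handler.

```python
from pathlib import PurePosixPath

# Defaults taken from the Input Schema table.
DEFAULTS = {
    "strategy": "vlm",
    "vlm_model": "claude-3-5-sonnet-20241022",
    "vlm_model_provider": "anthropic",
    "output_type": "json",
}
REQUIRED = ("input_file_path", "output_file_dir")


def build_arguments(**supplied: str) -> dict[str, str]:
    """Hypothetical helper: merge caller values over the schema defaults."""
    missing = [name for name in REQUIRED if name not in supplied]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    unknown = set(supplied) - set(REQUIRED) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown arguments: {sorted(unknown)}")

    arguments = {**DEFAULTS, **supplied}
    # Both paths are documented as absolute (POSIX-style check, an assumption).
    for name in REQUIRED:
        if not PurePosixPath(arguments[name]).is_absolute():
            raise ValueError(f"{name} must be an absolute path")
    if arguments["output_type"] not in ("json", "md"):
        raise ValueError("output_type must be 'json' or 'md'")
    return arguments


args = build_arguments(
    input_file_path="/docs/contract.pdf",
    output_file_dir="/docs/out",
    output_type="md",
)
print(args)
```

Only the two paths have to be supplied; every other field falls back to the default listed in the table.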
Implementation Reference
- The main handler function for the 'partition_local_file' tool. It processes a local file using the Unstructured API, handles parameters like strategy and model, calls the API, and saves the output as JSON or Markdown.

      async def partition_local_file(
          input_file_path: str,
          output_file_dir: str,
          strategy: Strategy = Strategy.VLM,
          vlm_model: VLMModel = VLMModel.CLAUDE_3_5_SONNET_20241022,
          vlm_model_provider: VLMModelProvider = VLMModelProvider.ANTHROPIC,
          output_type: Literal["json", "md"] = "json",
      ) -> str:
          """
          Transform a local file into structured data using the Unstructured API.

          Args:
              input_file_path: The absolute path to the file.
              output_file_dir: The absolute path to the directory where the output file should be saved.
              strategy: The strategy for transformation.
                  Available strategies:
                      VLM - most advanced transformation suitable for difficult PDFs and Images
                      hi_res - high resolution transformation suitable for most document types
                      fast - fast transformation suitable for PDFs with extractable text
                      auto - automatically choose the best strategy based on the input file
              vlm_model: The VLM model to use for the transformation.
              vlm_model_provider: The VLM model provider to use for the transformation.
              output_type: The type of output to generate. Options: 'json' for json
                  or 'md' for markdown.

          Returns:
              A string containing the structured data or a message indicating the output file
              path with the structured data.
          """
          input_path = Path(input_file_path)
          output_dir_path = Path(output_file_dir)

          # Reject unsupported output types before calling the API.
          if output_type not in ["json", "md"]:
              return f"Invalid output type '{output_type}'. Must be 'json' or 'md'."

          try:
              with input_path.open("rb") as content:
                  partition_params = PartitionParameters(
                      files=Files(
                          content=content,
                          file_name=input_path.name,
                      ),
                      strategy=strategy,
                      vlm_model=vlm_model,
                      vlm_model_provider=vlm_model_provider,
                  )
                  response = await call_api(partition_params)
          except Exception as e:
              # Errors are reported as a message string instead of being raised.
              return f"Failed to partition file: {e}"

          # Write <input name with new suffix> into the output directory.
          output_dir_path.mkdir(parents=True, exist_ok=True)
          output_file = output_dir_path / input_path.with_suffix(f".{output_type}").name

          if output_type == "json":
              json_elements_as_str = json.dumps(response, indent=2)
              output_file.write_text(json_elements_as_str, encoding="utf-8")
          else:
              markdown = construct_markdown(response, input_path.name)
              output_file.write_text(markdown, encoding="utf-8")

          return f"Partitioned file {input_file_path} to {output_file} successfully."

- uns_mcp/connectors/unstructured_api/__init__.py:6-7 (registration) Registers the partition_local_file tool with the MCP server using the FastMCP tool decorator.

      def register_unstructured_api_tools(mcp: FastMCP):
          mcp.tool()(partition_local_file)

- Helper function that prepares parameters and calls the Unstructured API's partition_async endpoint.

      async def call_api(partition_params: PartitionParameters) -> list[dict]:
          # Split PDFs into pages client-side, tolerate failed pages,
          # and send up to 15 page requests concurrently.
          partition_params.split_pdf_page = True
          partition_params.split_pdf_allow_failed = True
          partition_params.split_pdf_concurrency_level = 15

          request = PartitionRequest(partition_parameters=partition_params)
          res = await client.general.partition_async(request=request)
          return res.elements

- Helper function to convert the API response elements into a Markdown formatted string.

      def construct_markdown(elements_list: list[dict[str, Any]], file_name: str) -> str:
          """
          Constructs a markdown representation from the response data.

          Args:
              elements_list: The response data from the API call as a list of elements.
              file_name: The name of the input file.

          Returns:
              A markdown string.
          """
          markdown = f"# {file_name}\n\n"

          for element in elements_list:
              element_type = element.get("type", "")
              text = element.get("text", "")

              if element_type == "Table":
                  text_as_html = element.get("metadata", {}).get("text_as_html", "<></>")
                  markdown += f"{text_as_html}\n\n"
              elif element_type == "Header":
                  markdown += f"## {text}\n\n"
              else:
                  markdown += f"{text}\n\n"

          return markdown
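
To make the Markdown conversion concrete, the sketch below restates the `construct_markdown` logic in a self-contained form and runs it on a few hand-written elements. The sample elements and the file name `report.pdf` are invented for illustration and are not real API output; the mapping rules (tables as HTML, `Header` elements as `##` headings, everything else as plain paragraphs) follow the helper above.

```python
from typing import Any


def construct_markdown(elements_list: list[dict[str, Any]], file_name: str) -> str:
    """Restatement of the helper above, built with a list of parts and one join."""
    parts = [f"# {file_name}\n\n"]
    for element in elements_list:
        element_type = element.get("type", "")
        text = element.get("text", "")
        if element_type == "Table":
            # Tables are emitted as the HTML stored in the element metadata.
            html = element.get("metadata", {}).get("text_as_html", "<></>")
            parts.append(f"{html}\n\n")
        elif element_type == "Header":
            parts.append(f"## {text}\n\n")
        else:
            parts.append(f"{text}\n\n")
    return "".join(parts)


# Hypothetical elements shaped like the dictionaries the helper reads.
sample_elements = [
    {"type": "Header", "text": "Quarterly Report"},
    {"type": "NarrativeText", "text": "Revenue grew."},
    {
        "type": "Table",
        "text": "a",
        "metadata": {"text_as_html": "<table><tr><td>a</td></tr></table>"},
    },
]

markdown = construct_markdown(sample_elements, "report.pdf")
print(markdown)
```

Note that a `Table` element without `text_as_html` in its metadata falls back to the literal placeholder `<></>` and its plain `text` is not used.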