
Unstructured Document Processor MCP

by MKhalusova

process_document

Extract and use content from unstructured documents across various file formats by processing them with Unstructured.

Instructions

Sends a document to Unstructured for processing and returns the content of the document.

Args:
    filepath: path to the document

Input Schema

Name      Required  Description  Default
filepath  Yes
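
As a rough illustration of how the single required `filepath` argument reaches the tool, the sketch below builds the JSON-RPC `tools/call` message that an MCP client would send. The helper name `build_tool_call` and the example path are hypothetical; in practice an MCP client library normally constructs this message for you.

```python
import json


def build_tool_call(filepath: str, request_id: int = 1) -> dict:
    """Build an MCP `tools/call` request for process_document (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "process_document",
            # `filepath` is the only argument and it is required
            "arguments": {"filepath": filepath},
        },
    }


# The path below is a placeholder, not a file referenced by this page.
request = build_tool_call("/path/to/report.pdf")
print(json.dumps(request, indent=2))
```

The path is resolved by the server process, so it must point at a file that the server itself can read.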

Implementation Reference

  • Handler function for the 'process_document' MCP tool, registered with the @mcp.tool() decorator. It checks that the file exists and has a supported extension, partitions the document with UnstructuredClient, saves the returned elements as JSON, converts them to formatted text with the json_to_text helper, and returns any exception as an error message.
    @mcp.tool()
    async def process_document(ctx: Context, filepath: str) -> str:
        """
        Sends a document to Unstructured for processing and returns the content of the document.

        Args:
            filepath: path to the document
        """
        if not os.path.isfile(filepath):
            return "File does not exist"

        # Check if the file extension is supported
        _, ext = os.path.splitext(filepath)
        supported_extensions = {
            ".abw", ".bmp", ".csv", ".cwk", ".dbf", ".dif", ".doc", ".docm", ".docx", ".dot",
            ".dotm", ".eml", ".epub", ".et", ".eth", ".fods", ".gif", ".heic", ".htm", ".html",
            ".hwp", ".jpeg", ".jpg", ".md", ".mcw", ".mw", ".odt", ".org", ".p7s", ".pages",
            ".pbd", ".pdf", ".png", ".pot", ".potm", ".ppt", ".pptm", ".pptx", ".prn", ".rst",
            ".rtf", ".sdp", ".sgl", ".svg", ".sxg", ".tiff", ".txt", ".tsv", ".uof", ".uos1",
            ".uos2", ".web", ".webp", ".wk2", ".xls", ".xlsb", ".xlsm", ".xlsx", ".xlw", ".xml",
            ".zabw",
        }
        if ext.lower() not in supported_extensions:
            return "File extension not supported by Unstructured"

        client = ctx.request_context.lifespan_context.client
        file_basename = os.path.basename(filepath)

        # Read the file up front so the handle is closed before the request is sent
        with open(filepath, "rb") as f:
            file_content = f.read()

        req = operations.PartitionRequest(
            partition_parameters=shared.PartitionParameters(
                files=shared.Files(
                    content=file_content,
                    file_name=filepath,
                ),
                strategy=shared.Strategy.AUTO,
            ),
        )

        os.makedirs(PROCESSED_FILES_FOLDER, exist_ok=True)
        try:
            res = client.general.partition(request=req)
            element_dicts = [element for element in res.elements]
            json_elements = json.dumps(element_dicts, indent=2)

            # Save the raw elements next to other processed files, then render them as text
            output_json_file_path = os.path.join(PROCESSED_FILES_FOLDER, f"{file_basename}.json")
            with open(output_json_file_path, "w") as file:
                file.write(json_elements)

            return json_to_text(output_json_file_path)
        except Exception as e:
            return f"The following exception happened during file processing: {e}"
  • Helper function used by process_document to convert Unstructured JSON output to formatted HTML-like text based on element types.
    def json_to_text(file_path) -> str:
        with open(file_path, 'r') as file:
            elements = json.load(file)

        doc_texts = []

        for element in elements:
            text = element.get("text", "").strip()
            element_type = element.get("type", "")
            metadata = element.get("metadata", {})

            if element_type == "Title":
                doc_texts.append(f"<h1> {text}</h1><br>")
            elif element_type == "Header":
                doc_texts.append(f"<h2> {text}</h2><br/>")
            elif element_type == "NarrativeText" or element_type == "UncategorizedText":
                doc_texts.append(f"<p>{text}</p>")
            elif element_type == "ListItem":
                doc_texts.append(f"<li>{text}</li>")
            elif element_type == "PageNumber":
                doc_texts.append(f"Page number: {text}")
            elif element_type == "Table":
                table_html = metadata.get("text_as_html", "")
                doc_texts.append(table_html)  # Keep the table as HTML
            else:
                doc_texts.append(text)

        return " ".join(doc_texts)
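
To make the element-type mapping concrete, here is a minimal sketch that applies the same rules to an in-memory list instead of a JSON file. The function name `render_elements` and the sample elements are invented for illustration; real elements come from Unstructured's partition response and may carry more fields.

```python
def render_elements(elements):
    """In-memory variant of the json_to_text mapping (hypothetical helper name)."""
    parts = []
    for element in elements:
        text = element.get("text", "").strip()
        element_type = element.get("type", "")
        metadata = element.get("metadata", {})

        if element_type == "Title":
            parts.append(f"<h1> {text}</h1><br>")
        elif element_type == "Header":
            parts.append(f"<h2> {text}</h2><br/>")
        elif element_type in ("NarrativeText", "UncategorizedText"):
            parts.append(f"<p>{text}</p>")
        elif element_type == "ListItem":
            parts.append(f"<li>{text}</li>")
        elif element_type == "PageNumber":
            parts.append(f"Page number: {text}")
        elif element_type == "Table":
            # Tables keep the HTML that Unstructured stores in the metadata
            parts.append(metadata.get("text_as_html", ""))
        else:
            parts.append(text)
    return " ".join(parts)


# Sample data made up for this example, not taken from a real document.
sample_elements = [
    {"type": "Title", "text": "Quarterly Report", "metadata": {}},
    {"type": "NarrativeText", "text": "Revenue grew.", "metadata": {}},
    {"type": "ListItem", "text": "First item", "metadata": {}},
    {
        "type": "Table",
        "text": "a b",
        "metadata": {"text_as_html": "<table><tr><td>a</td><td>b</td></tr></table>"},
    },
]

rendered = render_elements(sample_elements)
print(rendered)
```

Because the pieces are joined with a single space, the result is one line of HTML-like text rather than a complete HTML document.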

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/MKhalusova/unstructured-mcp'
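
For callers working in Python, the same GET request can be prepared with the standard library. This is only a sketch: the helper names are hypothetical, the `Accept` header is an assumption, and the shape of the response body is not described on this page, so the body is returned as plain text.

```python
import urllib.request

API_URL = "https://glama.ai/api/mcp/v1/servers/MKhalusova/unstructured-mcp"


def build_request(url: str = API_URL) -> urllib.request.Request:
    """Prepare the GET request without sending it (hypothetical helper)."""
    # The Accept header is an assumption; the curl example sends none explicitly.
    return urllib.request.Request(url, method="GET", headers={"Accept": "application/json"})


def fetch_server_info(url: str = API_URL, timeout: float = 10.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_server_info()` performs the network request; `build_request()` alone is useful for inspecting the URL and method first.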

If you have feedback or need assistance with the MCP directory API, please join our Discord server.