Unstructured API MCP Server

Official

Server Configuration

Describes the environment variables required to run the server.

Name                  Required  Default  Description
FIRECRAWL_API_KEY     No        —        API key for using Firecrawl web crawling API features
UNSTRUCTURED_API_KEY  Yes       —        Your Unstructured API key, fetched from https://platform.unstructured.io/app/account/api-keys
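For illustration, a minimal sketch of an MCP client configuration that supplies these variables. The mcpServers/command/env layout follows the common MCP client convention; the launch command is a placeholder for however you run this server:

  {
    "mcpServers": {
      "unstructured": {
        "command": "<command-to-launch-the-server>",
        "args": [],
        "env": {
          "UNSTRUCTURED_API_KEY": "<your-unstructured-api-key>",
          "FIRECRAWL_API_KEY": "<your-firecrawl-api-key>"
        }
      }
    }
  }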

Schema

Prompts

Interactive templates invoked by user choice

No prompts

Resources

Contextual data attached and managed by the client

No resources

Tools

Functions exposed to the LLM to take actions

create_s3_source

Create an S3 source connector.

Args:
  name: A unique name for this connector
  remote_url: The S3 URI to the bucket or folder (e.g., s3://my-bucket/)
  recursive: Whether to access subfolders within the bucket
Returns:
  String containing the created source connector information
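For illustration, a hypothetical argument payload (bucket name and prefix are placeholders):

  {
    "name": "my-s3-source",
    "remote_url": "s3://my-bucket/documents/",
    "recursive": true
  }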
update_s3_source

Update an S3 source connector.

Args:
  source_id: ID of the source connector to update
  remote_url: The S3 URI to the bucket or folder
  recursive: Whether to access subfolders within the bucket
Returns:
  String containing the updated source connector information
delete_s3_source

Delete an S3 source connector.

Args:
  source_id: ID of the source connector to delete
Returns:
  String containing the result of the deletion
create_azure_source

Create an Azure source connector.

Args:
  name: A unique name for this connector
  remote_url: The Azure Storage remote URL, with the format az://<container-name>/<path/to/file/or/folder/in/container/as/needed>
  recursive: Whether to access subfolders within the bucket
Returns:
  String containing the created source connector information
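A sketch of the arguments, with a placeholder container and path:

  {
    "name": "my-azure-source",
    "remote_url": "az://my-container/invoices/",
    "recursive": true
  }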
update_azure_source

Update an Azure source connector.

Args:
  source_id: ID of the source connector to update
  remote_url: The Azure Storage remote URL, with the format az://<container-name>/<path/to/file/or/folder/in/container/as/needed>
  recursive: Whether to access subfolders within the bucket
Returns:
  String containing the updated source connector information
delete_azure_source

Delete an Azure source connector.

Args:
  source_id: ID of the source connector to delete
Returns:
  String containing the result of the deletion
create_gdrive_source

Create a Google Drive source connector.

Args:
  name: A unique name for this connector
  remote_url: The gdrive URI to the bucket or folder (e.g., gdrive://my-bucket/)
  recursive: Whether to access subfolders within the bucket
Returns:
  String containing the created source connector information
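For example (the folder name is a placeholder):

  {
    "name": "my-gdrive-source",
    "remote_url": "gdrive://my-drive-folder/",
    "recursive": false
  }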
update_gdrive_source

Update a Google Drive source connector.

Args:
  source_id: ID of the source connector to update
  remote_url: The gdrive URI to the bucket or folder
  recursive: Whether to access subfolders within the bucket
Returns:
  String containing the updated source connector information
delete_gdrive_source

Delete a Google Drive source connector.

Args:
  source_id: ID of the source connector to delete
Returns:
  String containing the result of the deletion
create_s3_destination

Create an S3 destination connector.

Args:
  name: A unique name for this connector
  remote_url: The S3 URI to the bucket or folder
  key: The AWS access key ID
  secret: The AWS secret access key
  token: The AWS STS session token for temporary access (optional)
  endpoint_url: Custom URL if connecting to a non-AWS S3 bucket
Returns:
  String containing the created destination connector information
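A hypothetical payload; the credentials are placeholders, and the optional token and endpoint_url are omitted:

  {
    "name": "my-s3-destination",
    "remote_url": "s3://my-output-bucket/processed/",
    "key": "<aws-access-key-id>",
    "secret": "<aws-secret-access-key>"
  }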
update_s3_destination

Update an S3 destination connector.

Args:
  destination_id: ID of the destination connector to update
  remote_url: The S3 URI to the bucket or folder
Returns:
  String containing the updated destination connector information
delete_s3_destination

Delete an S3 destination connector.

Args:
  destination_id: ID of the destination connector to delete
Returns:
  String containing the result of the deletion
create_weaviate_destination

Create a Weaviate vector database destination connector.

Args:
  cluster_url: URL of the Weaviate cluster
  collection: Name of the collection to use in the Weaviate cluster
Note:
  A collection is the Weaviate equivalent of a table. The platform has dedicated code to generate collections for users; to keep this server simple, it does not generate them, so create the collection yourself before creating the connector.
Returns:
  String containing the created destination connector information
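For illustration, assuming a Weaviate Cloud cluster (URL and collection name are placeholders):

  {
    "cluster_url": "https://my-cluster.weaviate.cloud",
    "collection": "Documents"
  }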
update_weaviate_destination

Update a Weaviate destination connector.

Args:
  destination_id: ID of the destination connector to update
  cluster_url: URL of the Weaviate cluster (optional)
  collection: Name of the collection to use in the Weaviate cluster (optional)
Returns:
  String containing the updated destination connector information
delete_weaviate_destination

Delete a Weaviate destination connector.

Args:
  destination_id: ID of the destination connector to delete
Returns:
  String containing the result of the deletion
create_astradb_destination

Create an AstraDB destination connector.

Args:
  name: A unique name for this connector
  collection_name: The name of the collection to use
  keyspace: The AstraDB keyspace
  batch_size: The batch size for inserting documents; must be positive (default: 20)
Note:
  A collection in AstraDB is a schemaless document store optimized for NoSQL workloads, equivalent to a table in traditional databases. A keyspace is the top-level namespace in AstraDB that groups multiple collections. Users must create their own collection and keyspace before creating the connector.
Returns:
  String containing the created destination connector information
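A sketch with placeholder names, using the documented default batch size:

  {
    "name": "my-astradb-destination",
    "collection_name": "documents",
    "keyspace": "my_keyspace",
    "batch_size": 20
  }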
update_astradb_destination

Update an AstraDB destination connector.

Args:
  destination_id: ID of the destination connector to update
  collection_name: The name of the collection to use (optional)
  keyspace: The AstraDB keyspace (optional)
  batch_size: The batch size for inserting documents (optional)
Note:
  Users must create their own collection and keyspace before creating the connector.
Returns:
  String containing the updated destination connector information
delete_astradb_destination

Delete an AstraDB destination connector.

Args:
  destination_id: ID of the destination connector to delete
Returns:
  String containing the result of the deletion
create_neo4j_destination

Create a Neo4j destination connector.

Args:
  name: A unique name for this connector
  database: The Neo4j database, e.g. "neo4j"
  uri: The Neo4j URI, e.g. neo4j+s://<neo4j_instance_id>.databases.neo4j.io
  username: The Neo4j username
Returns:
  String containing the created destination connector information
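For example (the instance ID stays a placeholder):

  {
    "name": "my-neo4j-destination",
    "database": "neo4j",
    "uri": "neo4j+s://<neo4j_instance_id>.databases.neo4j.io",
    "username": "neo4j"
  }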
update_neo4j_destination

Update a Neo4j destination connector.

Args:
  destination_id: ID of the destination connector to update
  database: The Neo4j database, e.g. "neo4j"
  uri: The Neo4j URI, e.g. neo4j+s://<neo4j_instance_id>.databases.neo4j.io
  username: The Neo4j username
Returns:
  String containing the updated destination connector information
delete_neo4j_destination

Delete a Neo4j destination connector.

Args:
  destination_id: ID of the destination connector to delete
Returns:
  String containing the result of the deletion
invoke_firecrawl_crawlhtml

Start an asynchronous web crawl job using Firecrawl to retrieve HTML content.

Args:
  url: URL to crawl
  s3_uri: S3 URI where results will be uploaded
  limit: Maximum number of pages to crawl (default: 100)
Returns:
  Dictionary with crawl job information including the job ID
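A hypothetical invocation (URL and S3 URI are placeholders):

  {
    "url": "https://docs.example.com",
    "s3_uri": "s3://my-bucket/crawls/",
    "limit": 50
  }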
check_crawlhtml_status

Check the status of an existing Firecrawl HTML crawl job.

Args:
  crawl_id: ID of the crawl job to check
Returns:
  Dictionary containing the current status of the crawl job
invoke_firecrawl_llmtxt

Start an asynchronous llmfull.txt generation job using Firecrawl. The generated file is a standardized Markdown file containing information to help LLMs use a website at inference time. The llmstxt endpoint leverages Firecrawl to crawl the website and extracts data using gpt-4o-mini.

Args:
  url: URL to crawl
  s3_uri: S3 URI where results will be uploaded
  max_urls: Maximum number of pages to crawl (1-100, default: 10)
Returns:
  Dictionary with job information including the job ID
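For example, with placeholder URL and S3 URI:

  {
    "url": "https://docs.example.com",
    "s3_uri": "s3://my-bucket/llmtxt/",
    "max_urls": 10
  }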
check_llmtxt_status

Check the status of an existing llmfull.txt generation job.

Args:
  job_id: ID of the llmfull.txt generation job to check
Returns:
  Dictionary containing the current status of the job and text content if completed
cancel_crawlhtml_job

Cancel an in-progress Firecrawl HTML crawl job.

Args:
  crawl_id: ID of the crawl job to cancel
Returns:
  Dictionary containing the result of the cancellation
list_sources

List available sources from the Unstructured API.

Args:
  source_type: Optional source connector type to filter by
Returns:
  String containing the list of sources
get_source_info

Get detailed information about a specific source connector.

Args:
  source_id: ID of the source connector to get information for; should be a valid UUID
Returns:
  String containing the source connector information
list_destinations

List available destinations from the Unstructured API.

Args:
  destination_type: Optional destination connector type to filter by
Returns:
  String containing the list of destinations
get_destination_info

Get detailed information about a specific destination connector.

Args:
  destination_id: ID of the destination connector to get information for
Returns:
  String containing the destination connector information
list_workflows

List workflows from the Unstructured API.

Args:
  destination_id: Optional destination connector ID to filter by
  source_id: Optional source connector ID to filter by
  status: Optional workflow status to filter by
Returns:
  String containing the list of workflows
get_workflow_info

Get detailed information about a specific workflow.

Args:
  workflow_id: ID of the workflow to get information for
Returns:
  String containing the workflow information
create_workflow

Create a new workflow.

Args:
  workflow_config: A typed dictionary containing required fields (destination_id - should be a valid UUID, name, source_id - should be a valid UUID, workflow_type) and optional fields (schedule and workflow_nodes). Note that workflow_nodes is only enabled when workflow_type is `custom`, and is a list of WorkflowNodeTypedDict entries: partition, prompter, chunk, embed.

    Below is an example of a partition workflow node:

    {
      "name": "vlm-partition",
      "type": "partition",
      "sub_type": "vlm",
      "settings": {
        "provider": "your favorite provider",
        "model": "your favorite model"
      }
    }

Returns:
  String containing the created workflow information

A complete workflow_config example appears after the node reference below.

Custom workflow DAG nodes

  • If WorkflowType is set to custom, you must also specify the settings for the workflow’s directed acyclic graph (DAG) nodes. These nodes’ settings are specified in the workflow_nodes array.
  • A Source node is automatically created when you specify the source_id value outside of the workflow_nodes array.
  • A Destination node is automatically created when you specify the destination_id value outside of the workflow_nodes array.
  • You can specify Partitioner, Chunker, Prompter, and Embedder nodes.
  • The order of the nodes in the workflow_nodes array will be the same order that these nodes appear in the DAG, with the first node in the array added directly after the Source node. The Destination node follows the last node in the array.
  • Be sure to specify nodes in the allowed order. The following DAG placements are all allowed:
    • Source -> Partitioner -> Destination,
    • Source -> Partitioner -> Chunker -> Destination,
    • Source -> Partitioner -> Chunker -> Embedder -> Destination,
    • Source -> Partitioner -> Prompter -> Chunker -> Destination,
    • Source -> Partitioner -> Prompter -> Chunker -> Embedder -> Destination

Partitioner node

A Partitioner node has a type of partition and a subtype of auto, vlm, hi_res, or fast.

Examples:

  • auto strategy:

    {
      "name": "Partitioner",
      "type": "partition",
      "subtype": "vlm",
      "settings": {
        "provider": "anthropic",                (required)
        "model": "claude-3-5-sonnet-20241022",  (required)
        "output_format": "text/html",
        "user_prompt": null,
        "format_html": true,
        "unique_element_ids": true,
        "is_dynamic": true,
        "allow_fast": true
      }
    }

  • vlm strategy: Allowed values are provider and model. Examples:
    - "provider": "anthropic", "model": "claude-3-5-sonnet-20241022"
    - "provider": "openai", "model": "gpt-4o"

  • hi_res strategy:

    {
      "name": "Partitioner",
      "type": "partition",
      "subtype": "unstructured_api",
      "settings": {
        "strategy": "hi_res",
        "include_page_breaks": <true|false>,
        "pdf_infer_table_structure": <true|false>,
        "exclude_elements": [ "<element-name>", "<element-name>" ],
        "xml_keep_tags": <true|false>,
        "encoding": "<encoding>",
        "ocr_languages": [ "<language>", "<language>" ],
        "extract_image_block_types": [ "image", "table" ],
        "infer_table_structure": <true|false>
      }
    }

  • fast strategy:

    {
      "name": "Partitioner",
      "type": "partition",
      "subtype": "unstructured_api",
      "settings": {
        "strategy": "fast",
        "include_page_breaks": <true|false>,
        "pdf_infer_table_structure": <true|false>,
        "exclude_elements": [ "<element-name>", "<element-name>" ],
        "xml_keep_tags": <true|false>,
        "encoding": "<encoding>",
        "ocr_languages": [ "<language-code>", "<language-code>" ],
        "extract_image_block_types": [ "image", "table" ],
        "infer_table_structure": <true|false>
      }
    }

Chunker node

A Chunker node has a type of chunk and a subtype of chunk_by_character or chunk_by_title.

  • chunk_by_character:

    {
      "name": "Chunker",
      "type": "chunk",
      "subtype": "chunk_by_character",
      "settings": {
        "include_orig_elements": <true|false>,
        "new_after_n_chars": <new-after-n-chars>,  (required; if not provided, set same as max_characters)
        "max_characters": <max-characters>,        (required)
        "overlap": <overlap>,                      (required; if not provided, set default to 0)
        "overlap_all": <true|false>,
        "contextual_chunking_strategy": "v1"
      }
    }

  • chunk_by_title:

    {
      "name": "Chunker",
      "type": "chunk",
      "subtype": "chunk_by_title",
      "settings": {
        "multipage_sections": <true|false>,
        "combine_text_under_n_chars": <combine-text-under-n-chars>,
        "include_orig_elements": <true|false>,
        "new_after_n_chars": <new-after-n-chars>,  (required; if not provided, set same as max_characters)
        "max_characters": <max-characters>,        (required)
        "overlap": <overlap>,                      (required; if not provided, set default to 0)
        "overlap_all": <true|false>,
        "contextual_chunking_strategy": "v1"
      }
    }

Prompter node

A Prompter node has a type of prompter and a subtype of:

  • openai_image_description,
  • anthropic_image_description,
  • bedrock_image_description,
  • vertexai_image_description,
  • openai_table_description,
  • anthropic_table_description,
  • bedrock_table_description,
  • vertexai_table_description,
  • openai_table2html,
  • openai_ner

Example: { "name": "Prompter", "type": "prompter", "subtype": "<subtype>", "settings": {} }

Embedder node

An Embedder node has a type of embed.

Allowed values for subtype and model_name include:

  • "subtype": "azure_openai"
    • "model_name": "text-embedding-3-small"
    • "model_name": "text-embedding-3-large"
    • "model_name": "text-embedding-ada-002"
  • "subtype": "bedrock"
    • "model_name": "amazon.titan-embed-text-v2:0"
    • "model_name": "amazon.titan-embed-text-v1"
    • "model_name": "amazon.titan-embed-image-v1"
    • "model_name": "cohere.embed-english-v3"
    • "model_name": "cohere.embed-multilingual-v3"
  • "subtype": "togetherai":
    • "model_name": "togethercomputer/m2-bert-80M-2k-retrieval"
    • "model_name": "togethercomputer/m2-bert-80M-8k-retrieval"
    • "model_name": "togethercomputer/m2-bert-80M-32k-retrieval"

Example: { "name": "Embedder", "type": "embed", "subtype": "<subtype>", "settings": { "model_name": "<model-name>" } }

run_workflow

Run a specific workflow.

Args:
  workflow_id: ID of the workflow to run
Returns:
  String containing the response from the workflow execution
update_workflow

Update an existing workflow.

Args:
  workflow_id: ID of the workflow to update
  workflow_config: A typed dictionary containing required fields (destination_id, name, source_id, workflow_type) and optional fields (schedule and workflow_nodes)
Returns:
  String containing the updated workflow information

Custom workflow DAG nodes: the node types, allowed ordering, and per-node settings are identical to those documented under create_workflow above; see the Custom workflow DAG nodes reference there.

delete_workflow

Delete a specific workflow.

Args:
  workflow_id: ID of the workflow to delete
Returns:
  String containing the response from the workflow deletion
list_jobs

List jobs via the Unstructured API.

Args:
  workflow_id: Optional workflow ID to filter by
  status: Optional job status to filter by
Returns:
  String containing the list of jobs
get_job_info

Get detailed information about a specific job.

Args:
  job_id: ID of the job to get information for
Returns:
  String containing the job information
cancel_job

Cancel a specific job.

Args:
  job_id: ID of the job to cancel
Returns:
  String containing the response from the job cancellation