| create_source_connector | Create a source connector based on type.
Args:
ctx: Context object with the request and lifespan context
name: A unique name for this connector
source_type: The type of source being created (e.g., 'azure', 'onedrive',
'salesforce', 'gdrive', 's3', 'sharepoint')
type_specific_config:
azure:
remote_url: The Azure Storage remote URL with the format
az://<container-name>/<path/to/file/or/folder/in/container/as/needed>
recursive: (Optional[bool]) Whether to access subfolders
gdrive:
drive_id: The Drive ID for the Google Drive source
recursive: (Optional[bool]) Whether to access subfolders
extensions: (Optional[list[str]]) File extensions to filter
onedrive:
path: The path to the target folder in the OneDrive account
user_pname: The User Principal Name (UPN) for the OneDrive user account
recursive: (Optional[bool]) Whether to access subfolders
authority_url: (Optional[str]) The authentication token provider URL
s3:
remote_url: The S3 URI to the bucket or folder (e.g., s3://my-bucket/)
recursive: (Optional[bool]) Whether to access subfolders
salesforce:
username: The Salesforce username
categories: (Optional[list[str]]) The names of the Salesforce categories
(objects) that you want to access, specified as a comma-separated list.
Available categories include Account, Campaign, Case, EmailMessage, and Lead.
sharepoint:
site: The SharePoint site to connect to
user_pname: The username for the SharePoint site
path: (Optional[str]) The path within the SharePoint site
recursive: (Optional[bool]) Whether to access subfolders
authority_url: (Optional[str]) The authority URL for authentication
Returns:
String containing the created source connector information
|
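For reference, here is a minimal sketch of the arguments an MCP client could pass to create_source_connector for an S3 source; the connector name and bucket URI are placeholders, not values taken from this reference.

```python
# Hypothetical create_source_connector arguments for an S3 source.
# The name and remote_url are placeholders.
create_source_args = {
    "name": "my-s3-source",
    "source_type": "s3",
    "type_specific_config": {
        "remote_url": "s3://my-bucket/documents/",
        "recursive": True,
    },
}
```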
| update_source_connector | Update a source connector based on type. Args:
ctx: Context object with the request and lifespan context
source_id: ID of the source connector to update
source_type: The type of source being updated (e.g., 'azure', 'onedrive',
'salesforce', 'gdrive', 's3', 'sharepoint')
type_specific_config:
azure:
remote_url: (Optional[str]) The Azure Storage remote URL with the format
az://<container-name>/<path/to/file/or/folder/in/container/as/needed>
recursive: (Optional[bool]) Whether to access subfolders
gdrive:
drive_id: (Optional[str]) The Drive ID for the Google Drive source
recursive: (Optional[bool]) Whether to access subfolders
extensions: (Optional[list[str]]) File extensions to filter
onedrive:
path: (Optional[str]) The path to the target folder in the OneDrive account
user_pname: (Optional[str]) The User Principal Name (UPN) for the OneDrive
user account
recursive: (Optional[bool]) Whether to access subfolders
authority_url: (Optional[str]) The authentication token provider URL
s3:
remote_url: (Optional[str]) The S3 URI to the bucket or folder
(e.g., s3://my-bucket/)
recursive: (Optional[bool]) Whether to access subfolders
salesforce:
username: (Optional[str]) The Salesforce username
categories: (Optional[list[str]]) The names of the Salesforce categories
(objects) that you want to access, specified as a comma-separated list.
Available categories include Account, Campaign, Case, EmailMessage, and Lead.
sharepoint:
site: (Optional[str]) The SharePoint site to connect to
user_pname: (Optional[str]) The username for the SharePoint site
path: (Optional[str]) The path within the SharePoint site
recursive: (Optional[bool]) Whether to access subfolders
authority_url: (Optional[str]) The authority URL for authentication
Returns:
String containing the updated source connector information
|
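Because every type_specific_config field is optional here, an update only needs the fields being changed. A sketch of a partial update follows; the source_id is a placeholder UUID.

```python
# Hypothetical update_source_connector arguments: switch off recursive
# traversal for an existing S3 source. The source_id is a placeholder.
update_source_args = {
    "source_id": "00000000-0000-0000-0000-000000000000",
    "source_type": "s3",
    "type_specific_config": {
        "recursive": False,
    },
}
```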
| delete_source_connector | Delete a source connector. Args:
source_id: ID of the source connector to delete
Returns:
String containing the result of the deletion
|
| create_destination_connector | Create a destination connector based on type. Args:
ctx: Context object with the request and lifespan context
name: A unique name for this connector
destination_type: The type of destination being created
type_specific_config:
astradb:
collection_name: The AstraDB collection name
keyspace: The AstraDB keyspace
batch_size: (Optional[int]) The batch size for inserting documents
databricks_delta_table:
catalog: Name of the catalog in Databricks Unity Catalog
database: The database in Unity Catalog
http_path: The cluster’s or SQL warehouse’s HTTP Path value
server_hostname: The Databricks cluster’s or SQL warehouse’s Server Hostname value
table_name: The name of the table in the schema
volume: Name of the volume associated with the schema.
schema: (Optional[str]) Name of the schema associated with the volume
volume_path: (Optional[str]) Any target folder path within the volume, starting
from the root of the volume.
databricks_volumes:
catalog: Name of the catalog in Databricks
host: The Databricks host URL
volume: Name of the volume associated with the schema
schema: (Optional[str]) Name of the schema associated with the volume. The default
value is "default".
volume_path: (Optional[str]) Any target folder path within the volume,
starting from the root of the volume.
mongodb:
database: The name of the MongoDB database
collection: The name of the MongoDB collection
neo4j:
database: The Neo4j database, e.g. "neo4j"
uri: The Neo4j URI, e.g. neo4j+s://<neo4j_instance_id>.databases.neo4j.io
batch_size: (Optional[int]) The batch size for the connector
pinecone:
index_name: The Pinecone index name
namespace: (Optional[str]) The Pinecone namespace, a folder inside the
Pinecone index
batch_size: (Optional[int]) The batch size
s3:
remote_url: The S3 URI to the bucket or folder
weaviate:
cluster_url: URL of the Weaviate cluster
collection: Name of the collection in the Weaviate cluster
Note: A minimal schema is required for the collection, e.g. record_id: Text
Returns:
String containing the created destination connector information
|
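The following is a minimal sketch of create_destination_connector arguments for a Weaviate destination; the cluster URL and collection name are illustrative, and the collection is assumed to already have the minimal schema noted above.

```python
# Hypothetical create_destination_connector arguments for a Weaviate destination.
# cluster_url and collection are placeholders.
create_destination_args = {
    "name": "my-weaviate-destination",
    "destination_type": "weaviate",
    "type_specific_config": {
        "cluster_url": "https://example-cluster.weaviate.network",
        "collection": "Documents",
    },
}
```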
| update_destination_connector | Update a destination connector based on type. Args:
ctx: Context object with the request and lifespan context
destination_id: ID of the destination connector to update
destination_type: The type of destination being updated
type_specific_config:
astradb:
collection_name: (Optional[str]) The AstraDB collection name
keyspace: (Optional[str]) The AstraDB keyspace
batch_size: (Optional[int]) The batch size for inserting documents
databricks_delta_table:
catalog: (Optional[str]) Name of the catalog in Databricks Unity Catalog
database: (Optional[str]) The database in Unity Catalog
http_path: (Optional[str]) The cluster’s or SQL warehouse’s HTTP Path value
server_hostname: (Optional[str]) The Databricks cluster’s or SQL warehouse’s
Server Hostname value
table_name: (Optional[str]) The name of the table in the schema
volume: (Optional[str]) Name of the volume associated with the schema.
schema: (Optional[str]) Name of the schema associated with the volume
volume_path: (Optional[str]) Any target folder path within the volume, starting
from the root of the volume.
databricks_volumes:
catalog: (Optional[str]) Name of the catalog in Databricks
host: (Optional[str]) The Databricks host URL
volume: (Optional[str]) Name of the volume associated with the schema
schema: (Optional[str]) Name of the schema associated with the volume. The default
value is "default".
volume_path: (Optional[str]) Any target folder path within the volume,
starting from the root of the volume.
mongodb:
database: (Optional[str]) The name of the MongoDB database
collection: (Optional[str]) The name of the MongoDB collection
neo4j:
database: (Optional[str]) The Neo4j database, e.g. "neo4j"
uri: (Optional[str]) The Neo4j URI,
e.g. neo4j+s://<neo4j_instance_id>.databases.neo4j.io
batch_size: (Optional[int]) The batch size for the connector
pinecone:
index_name: (Optional[str]) The Pinecone index name
namespace: (Optional[str]) The Pinecone namespace, a folder inside the
Pinecone index
batch_size: (Optional[int]) The batch size
s3:
remote_url: (Optional[str]) The S3 URI to the bucket or folder
weaviate:
cluster_url: (Optional[str]) URL of the Weaviate cluster
collection: (Optional[str]) Name of the collection in the Weaviate cluster
Note: A minimal schema is required for the collection, e.g. record_id: Text
Returns:
String containing the updated destination connector information
|
| delete_destination_connector | Delete a destination connector. Args:
destination_id: ID of the destination connector to delete
Returns:
String containing the result of the deletion
|
| invoke_firecrawl_crawlhtml | Start an asynchronous web crawl job using Firecrawl to retrieve HTML content. Args:
url: URL to crawl
s3_uri: S3 URI where results will be uploaded
limit: Maximum number of pages to crawl (default: 100)
Returns:
Dictionary with crawl job information including the job ID
|
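A hedged sketch of the invoke_firecrawl_crawlhtml arguments; the site URL, S3 output location, and page limit are illustrative only.

```python
# Hypothetical invoke_firecrawl_crawlhtml arguments. The returned job ID can be
# passed to check_crawlhtml_status or cancel_crawlhtml_job.
crawl_args = {
    "url": "https://docs.example.com",
    "s3_uri": "s3://my-bucket/crawl-output/",
    "limit": 50,
}
```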
| check_crawlhtml_status | Check the status of an existing Firecrawl HTML crawl job. Args:
crawl_id: ID of the crawl job to check
Returns:
Dictionary containing the current status of the crawl job
|
| invoke_firecrawl_llmtxt | Start an asynchronous llmfull.txt generation job using Firecrawl.
This file is a standardized markdown file containing information to help LLMs
use a website at inference time.
The llmstxt endpoint leverages Firecrawl to crawl your website and extract data
using gpt-4o-mini.
Args:
url: URL to crawl
s3_uri: S3 URI where results will be uploaded
max_urls: Maximum number of pages to crawl (1-100, default: 10)
Returns:
Dictionary with job information including the job ID
|
| check_llmtxt_status | Check the status of an existing llmfull.txt generation job. Args:
job_id: ID of the llmfull.txt generation job to check
Returns:
Dictionary containing the current status of the job and text content if completed
|
| cancel_crawlhtml_job | Cancel an in-progress Firecrawl HTML crawl job. Args:
crawl_id: ID of the crawl job to cancel
Returns:
Dictionary containing the result of the cancellation
|
| partition_local_file | Transform a local file into structured data using the Unstructured API.
Args:
input_file_path: The absolute path to the file.
output_file_dir: The absolute path to the directory where the output file should be saved.
strategy: The strategy for transformation.
Available strategies:
VLM - the most advanced transformation, suitable for difficult PDFs and images
hi_res - high resolution transformation suitable for most document types
fast - fast transformation suitable for PDFs with extractable text
auto - automatically choose the best strategy based on the input file
vlm_model: The VLM model to use for the transformation.
vlm_model_provider: The VLM model provider to use for the transformation.
output_type: The type of output to generate. Options: 'json' for json
or 'md' for markdown.
Returns:
A string containing the structured data or a message indicating the output file
path with the structured data.
|
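For illustration, a sketch of partition_local_file arguments using the hi_res strategy; the paths are placeholders, and vlm_model/vlm_model_provider are assumed to apply only when the VLM strategy is selected.

```python
# Hypothetical partition_local_file arguments using the hi_res strategy.
# Paths are placeholders; vlm_model and vlm_model_provider are omitted here
# on the assumption that they matter only for the VLM strategy.
partition_args = {
    "input_file_path": "/home/user/reports/q1-report.pdf",
    "output_file_dir": "/home/user/reports/output",
    "strategy": "hi_res",
    "output_type": "json",
}
```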
| list_sources | List available sources from the Unstructured API.
Args:
source_type: Optional source connector type to filter by
Returns:
String containing the list of sources
|
| get_source_info | Get detailed information about a specific source connector. Args:
source_id: ID of the source connector to get information for, should be a valid UUID
Returns:
String containing the source connector information
|
| list_destinations | List available destinations from the Unstructured API. Args:
destination_type: Optional destination connector type to filter by
Returns:
String containing the list of destinations
|
| get_destination_info | Get detailed information about a specific destination connector. Args:
destination_id: ID of the destination connector to get information for
Returns:
String containing the destination connector information
|
| list_workflows | List workflows from the Unstructured API.
Args:
destination_id: Optional destination connector ID to filter by
source_id: Optional source connector ID to filter by
status: Optional workflow status to filter by
Returns:
String containing the list of workflows
|
| get_workflow_info | Get detailed information about a specific workflow. Args:
workflow_id: ID of the workflow to get information for
Returns:
String containing the workflow information
|
| create_workflow | Create a new workflow. Args:
workflow_config: A Typed Dictionary containing required fields (destination_id - should be a
valid UUID, name, source_id - should be a valid UUID, workflow_type) and non-required fields
(schedule and workflow_nodes). Note that workflow_nodes is only enabled when workflow_type
is `custom` and is a list of WorkflowNodeTypedDict: partition, prompter, chunk, embed.
Below is an example of a partition workflow node:
{
"name": "vlm-partition",
"type": "partition",
"sub_type": "vlm",
"settings": {
"provider": "your favorite provider",
"model": "your favorite model"
}
}
Returns:
String containing the created workflow information
Custom workflow DAG nodes
If WorkflowType is set to custom, you must also specify the settings for the workflow’s
directed acyclic graph (DAG) nodes. These nodes’ settings are specified in the workflow_nodes array.
A Source node is automatically created when you specify the source_id value outside of the
workflow_nodes array. A Destination node is automatically created when you specify the
destination_id value outside of the workflow_nodes array.
You can specify Partitioner, Chunker, Prompter, and Embedder nodes. The order of the nodes in the
workflow_nodes array will be the same order that these nodes appear in the DAG, with the first node
in the array added directly after the Source node. The Destination node follows the last node in
the array. Be sure to specify nodes in the allowed order. The following DAG placements are all allowed:
Source -> Partitioner -> Destination
Source -> Partitioner -> Chunker -> Destination
Source -> Partitioner -> Chunker -> Embedder -> Destination
Source -> Partitioner -> Prompter -> Chunker -> Destination
Source -> Partitioner -> Prompter -> Chunker -> Embedder -> Destination
Partitioner node
A Partitioner node has a type of partition and a subtype of auto, vlm, hi_res, or fast.
Examples:
auto strategy:
{
"name": "Partitioner",
"type": "partition",
"subtype": "vlm",
"settings": {
"provider": "anthropic", (required)
"model": "claude-sonnet-4-20250514", (required)
"output_format": "text/html",
"user_prompt": null,
"format_html": true,
"unique_element_ids": true,
"is_dynamic": true,
"allow_fast": true
}
}
vlm strategy:
Allowed values are provider and model. Below are examples:
- "provider": "anthropic", "model": "claude-sonnet-4-20250514"
- "provider": "openai", "model": "gpt-4o"
hi_res strategy:
{
"name": "Partitioner",
"type": "partition",
"subtype": "unstructured_api",
"settings": {
"strategy": "hi_res",
"include_page_breaks": <true|false>,
"pdf_infer_table_structure": <true|false>,
"exclude_elements": [
"",
""
],
"xml_keep_tags": <true|false>,
"encoding": "",
"ocr_languages": [
"",
""
],
"extract_image_block_types": [
"image",
"table"
],
"infer_table_structure": <true|false>
}
}
fast strategy:
{
"name": "Partitioner",
"type": "partition",
"subtype": "unstructured_api",
"settings": {
"strategy": "fast",
"include_page_breaks": <true|false>,
"pdf_infer_table_structure": <true|false>,
"exclude_elements": [
"",
""
],
"xml_keep_tags": <true|false>,
"encoding": "",
"ocr_languages": [
"",
""
],
"extract_image_block_types": [
"image",
"table"
],
"infer_table_structure": <true|false>
}
}
Chunker node
A Chunker node has a type of chunk and a subtype of chunk_by_character or chunk_by_title.
chunk_by_character
{
"name": "Chunker",
"type": "chunk",
"subtype": "chunk_by_character",
"settings": {
"include_orig_elements": <true|false>,
"new_after_n_chars": , (required, if not provided
set same as max_characters)
"max_characters": , (required)
"overlap": , (required, if not provided set default to 0)
"overlap_all": <true|false>,
"contextual_chunking_strategy": "v1"
}
} chunk_by_title
{
"name": "Chunker",
"type": "chunk",
"subtype": "chunk_by_title",
"settings": {
"multipage_sections": <true|false>,
"combine_text_under_n_chars": ,
"include_orig_elements": <true|false>,
"new_after_n_chars": , (required, if not provided
set same as max_characters)
"max_characters": , (required)
"overlap": , (required, if not provided set default to 0)
"overlap_all": <true|false>,
"contextual_chunking_strategy": "v1"
}
}
Prompter node
A Prompter node has a type of prompter and a subtype of: openai_image_description,
anthropic_image_description, bedrock_image_description, vertexai_image_description,
openai_table_description, anthropic_table_description, bedrock_table_description,
vertexai_table_description, openai_table2html, openai_ner
Example:
{
"name": "Prompter",
"type": "prompter",
"subtype": "",
"settings": {}
}
Embedder node
An Embedder node has a type of embed. The subtype identifies the embedding provider and
model_name identifies the embedding model; both must be values supported by the Unstructured API.
Example:
{
"name": "Embedder",
"type": "embed",
"subtype": "",
"settings": {
"model_name": ""
}
} |
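Putting the pieces above together, here is a hedged sketch of a complete workflow_config for create_workflow with a custom Partitioner -> Chunker -> Embedder DAG; the UUIDs are placeholders, and the embedder subtype and model_name are illustrative values that should be checked against the providers your account supports.

```python
# Hypothetical workflow_config for create_workflow (custom DAG:
# Source -> Partitioner -> Chunker -> Embedder -> Destination).
# UUIDs, provider, and model names are placeholders.
workflow_config = {
    "name": "pdf-to-vectors",
    "source_id": "11111111-1111-1111-1111-111111111111",
    "destination_id": "22222222-2222-2222-2222-222222222222",
    "workflow_type": "custom",
    "workflow_nodes": [
        {
            "name": "Partitioner",
            "type": "partition",
            "subtype": "vlm",
            "settings": {
                "provider": "anthropic",
                "model": "claude-sonnet-4-20250514",
            },
        },
        {
            "name": "Chunker",
            "type": "chunk",
            "subtype": "chunk_by_title",
            "settings": {
                "max_characters": 1500,
                "new_after_n_chars": 1500,
                "overlap": 0,
            },
        },
        {
            "name": "Embedder",
            "type": "embed",
            "subtype": "azure_openai",  # assumed provider; verify supported values
            "settings": {"model_name": "text-embedding-3-small"},
        },
    ],
}
```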
| run_workflow | Run a specific workflow. Args:
workflow_id: ID of the workflow to run
Returns:
String containing the response from the workflow execution
|
| update_workflow | Update an existing workflow. Args:
workflow_id: ID of the workflow to update
workflow_config: A Typed Dictionary containing required fields (destination_id,
name, source_id, workflow_type) and non-required fields (schedule and workflow_nodes)
Returns:
String containing the updated workflow information
|
| delete_workflow | Delete a specific workflow. Args:
workflow_id: ID of the workflow to delete
Returns:
String containing the response from the workflow deletion
|
| list_jobs | List jobs via the Unstructured API.
Args:
workflow_id: Optional workflow ID to filter by
status: Optional job status to filter by
Returns:
String containing the list of jobs
|
| get_job_info | Get detailed information about a specific job. Args:
job_id: ID of the job to get information for
Returns:
String containing the job information
|
| cancel_job | Cancel a specific job. Args:
job_id: ID of the job to cancel
Returns:
String containing the response from the job cancellation
|
| list_workflows_with_finished_jobs | List workflows with finished jobs via the Unstructured API.
Args:
source_type: Optional source connector type to filter by
destination_type: Optional destination connector type to filter by
Returns:
String containing the list of workflows with finished jobs and source and destination
details
|
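All of the tools above are invoked through an MCP client. A rough sketch using the official MCP Python SDK follows; the server launch command, environment variable, and tool arguments are assumptions for illustration, not part of this reference.

```python
# Rough sketch: calling list_sources on this MCP server with the official
# `mcp` Python SDK. The launch command and API key variable are assumptions.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    server = StdioServerParameters(
        command="uv",
        args=["run", "uns_mcp/server.py"],           # hypothetical launch command
        env={"UNSTRUCTURED_API_KEY": "<your-key>"},   # placeholder credential
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("list_sources", {"source_type": "s3"})
            print(result)


asyncio.run(main())
```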