extract_information
Extract structured data from documents using a custom or auto-generated schema. Handles a range of file formats, including PDF, images, and Office documents.
Instructions
Extract structured information from documents using Upstage Universal Information Extraction.
This tool can extract key information from any document type without pre-training. You can either provide a schema defining what information to extract, or let the system automatically generate an appropriate schema based on the document content.
Supported file formats: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX
Max file size: 50MB
Max pages: 100
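The format and size limits above can be checked locally before the tool is called. The following is a minimal sketch, not part of the tool itself: `check_limits` is a hypothetical helper, the `.jpg` and `.tif` extensions are assumed aliases for JPEG and TIFF, and 50MB is assumed to mean 50 * 1024 * 1024 bytes. The page limit is not checked, since counting pages depends on the file format.

```python
from pathlib import PurePath
from typing import List

# Extensions for the formats listed above. ".jpg" and ".tif" are assumed
# aliases for JPEG and TIFF; the source names only the formats.
SUPPORTED_EXTENSIONS = {
    ".jpeg", ".jpg", ".png", ".bmp", ".pdf", ".tiff", ".tif",
    ".heic", ".docx", ".pptx", ".xlsx",
}

# Assumes 1MB = 1024 * 1024 bytes; the source only says "50MB".
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def check_limits(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    suffix = PurePath(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file extension: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the 50MB limit")
    return problems
```

For a file on disk, the size can be read with `os.path.getsize(path)` and passed in as `size_bytes`.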
SCHEMA FORMAT: When auto_generate_schema is false, provide schema in this exact format:

    {
      "type": "json_schema",
      "json_schema": {
        "name": "document_schema",
        "schema": {
          "type": "object",
          "properties": {
            "field_name": {
              "type": "string|number|array|object",
              "description": "What to extract"
            }
          }
        }
      }
    }

Example schema_json: {"type":"json_schema","json_schema":{"name":"document_schema","schema":{"type":"object","properties":{"company_name":{"type":"string","description":"Company name"},"invoice_number":{"type":"string","description":"Invoice number"},"total_amount":{"type":"number","description":"Total amount"}}}}}
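Writing the wrapper by hand is error-prone, so a schema_json string can also be generated from a list of fields. This is a minimal sketch under stated assumptions: `build_schema_json` is a hypothetical helper, it handles flat fields only, and it accepts just the four types named in the format above. Nested `array` or `object` fields would likely need extra keys that this sketch does not produce.

```python
import json
from typing import Dict, Tuple

# The four field types named in the schema format above.
ALLOWED_TYPES = {"string", "number", "array", "object"}


def build_schema_json(fields: Dict[str, Tuple[str, str]]) -> str:
    """Build a compact schema_json string from {name: (type, description)}."""
    properties = {}
    for name, (field_type, description) in fields.items():
        if field_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type for {name!r}: {field_type}")
        properties[name] = {"type": field_type, "description": description}

    wrapper = {
        "type": "json_schema",
        "json_schema": {
            "name": "document_schema",
            "schema": {"type": "object", "properties": properties},
        },
    }
    # Compact separators give a single-line string like the example above.
    return json.dumps(wrapper, separators=(",", ":"))


schema_json = build_schema_json({
    "company_name": ("string", "Company name"),
    "invoice_number": ("string", "Invoice number"),
    "total_amount": ("number", "Total amount"),
})
```

With these three fields, the helper reproduces the example schema_json string character for character.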
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | Yes | | |
| schema_path | No | | |
| schema_json | No | | |
| auto_generate_schema | No | | |
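The parameters above combine into a single arguments object for a tool call. The sketch below assembles and sanity-checks that object; `build_arguments` is a hypothetical helper, and how the arguments are sent depends on the client in use. Two behaviours are assumptions of this sketch rather than documented rules: the default of auto_generate_schema is not stated here, so the caller must pass it explicitly, and supplying both schema_json and schema_path is rejected because the source does not say which one would take precedence.

```python
from typing import Any, Dict, Optional


def build_arguments(
    file_path: str,
    auto_generate_schema: bool,
    schema_json: Optional[str] = None,
    schema_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble the arguments for an extract_information call."""
    if not file_path:
        raise ValueError("file_path is required")

    has_schema = schema_json is not None or schema_path is not None
    # From the instructions: a schema is needed when auto-generation is off.
    if not auto_generate_schema and not has_schema:
        raise ValueError("provide schema_json or schema_path when "
                         "auto_generate_schema is false")
    # Assumption of this sketch: reject ambiguous input instead of guessing.
    if schema_json is not None and schema_path is not None:
        raise ValueError("pass only one of schema_json and schema_path")

    arguments: Dict[str, Any] = {
        "file_path": file_path,
        "auto_generate_schema": auto_generate_schema,
    }
    if schema_json is not None:
        arguments["schema_json"] = schema_json
    if schema_path is not None:
        arguments["schema_path"] = schema_path
    return arguments
```

For example, `build_arguments("invoice.pdf", auto_generate_schema=True)` yields only the two keys `file_path` and `auto_generate_schema`, leaving schema selection to the tool.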