Extract Data from Document
talonic_extractExtract structured, schema-validated data from invoices, contracts, and other documents. Returns requested fields with confidence scores.
Instructions
Extract structured, schema-validated JSON from a document. Returns the requested fields with per-field confidence scores.
USE WHEN: the user wants specific fields pulled from an invoice, contract, certificate, statement, form, scan, or PDF.
NOT FOR: full text (use talonic_to_markdown) · finding documents (use talonic_search / talonic_filter).
BY NAME: if the user names a file, call talonic_search first to get its document_id, then call this.
ARGS: a schema is REQUIRED — pass inline schema (JSON Schema, e.g. {type:'object',properties:{vendor_name:{type:'string'}}}) OR a saved schema_id, not both. Provide EXACTLY ONE document source: document_id (cheapest, a workspace doc), file_url (public URL), or file_data+filename (small local files only).
RETURNS: data (the JSON), confidence.overall and confidence.fields (treat <0.7 as needs review), document metadata, extraction_id.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_data | No | Base64-encoded file bytes. Recommended path when the agent already has the file in memory (e.g., the user attached a PDF to the conversation). Pair with `filename` so MIME type can be inferred. Works regardless of where the file lives on disk. | |
| filename | No | Original filename including extension, e.g. 'invoice.pdf'. Used to infer MIME type when uploading via `file_data`. Required when `file_data` is provided. | |
| file_path | No | Local path to a document file. Only works if the MCP server has read access to that path. In sandboxed chat clients (Claude Desktop, Cowork) where uploads land in a host-owned directory, use `file_data` instead. | |
| file_url | No | URL to a document file. The Talonic API fetches it server-side. Use this for documents already on the public web. | |
| document_id | No | ID of a document already in the workspace, to re-extract with a new schema. | |
| schema | No | Inline schema definition. REQUIRED unless `schema_id` is provided. Recommended: full JSON Schema {type:'object', properties:{...}}. Also accepted: flat key-type map {field_name:'string', amount:'number'}. Mutually exclusive with `schema_id`. | |
| schema_id | No | ID of a saved schema. REQUIRED unless `schema` is provided. Accepts UUID or SCH-XXXXXXXX short id from talonic_list_schemas. Mutually exclusive with `schema`. | |
| instructions | No | Natural-language guidance for the extractor, e.g. 'Focus on the billing section. Amounts are in EUR.' | |
| include_markdown | No | Include OCR-converted markdown in the response alongside structured data. | |
| include_provenance | No | Include per-field provenance (source_text, section, page) showing where each value was found in the document. |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| extraction_id | Yes | Stable identifier for this extraction. | |
| request_id | No | Server-assigned request ID for support and debugging. | |
| status | Yes | Extraction status (e.g. 'complete'). | |
| document | Yes | Metadata about the ingested document. | |
| data | Yes | The extracted structured data, shape determined by the schema. | |
| schema | No | Schema metadata: which schema was used and how it can be saved. | |
| confidence | No | Extraction confidence. Treat fields below ~0.7 as needing human review. | |
| provenance | No | Per-field source evidence (source_text, section, page). Present only when `include_provenance: true`. | |
| processing | No | Processing metadata: duration, pages processed, region. | |
| links | No | URLs for self, document, and human-readable dashboard view. | |
| markdown | No | OCR-converted markdown. Present only when `include_markdown: true`. | |
| cost | No | Per-call cost and post-call balance, parsed from the X-Talonic-* response headers. `null` for non-extract calls; not always present on legacy clients. |