Extract plain text from PDF
extract_text
Extracts plain text from non-encrypted PDFs by parsing page content operators. Returns a boolean indicating whether text was extractable, with a reason if it was not.
Instructions
Best-effort plain-text extraction from a non-encrypted PDF. The tool walks each page's content stream and collects the operands of the Tj, ', " and TJ text operators. The result.extractable boolean is false when one or more pages have non-empty content but yielded no text. This is expected for PDFs that use subset fonts without /ToUnicode CMaps and is not an error; the accompanying extractableReason field explains why. Encrypted PDFs are rejected with EXTRACTION_UNSUPPORTED. Tagged-mode structure-tree extraction, which would give cleaner output for tagged PDFs, is tracked on the roadmap.
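To make the operator walk concrete, here is a minimal sketch of collecting string operands from Tj, ', " and TJ in a content stream. It is not the tool's actual implementation. It assumes the stream is already decompressed, handles only literal `(...)` strings (no hex strings, octal escapes, or nested unescaped parentheses), and decodes bytes as Latin-1, which is only meaningful for simple font encodings. The names `TOKEN`, `unescape`, and `show_text_operands` are hypothetical.

```python
import re

# Literal string "(...)" with backslash escapes, array brackets, or any other token.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|\[|\]|[^\s\[\]()]+", re.DOTALL)
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")
TEXT_OPERATORS = {b"Tj", b"'", b'"', b"TJ"}
SIMPLE_ESCAPES = {b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f"}


def unescape(raw: bytes) -> str:
    """Resolve single-character backslash escapes; octal escapes are NOT handled."""
    def repl(match):
        ch = match.group(1)
        return SIMPLE_ESCAPES.get(ch, ch)
    # Assumption: Latin-1 decoding; real fonts may need an encoding or /ToUnicode map.
    return re.sub(rb"\\(.)", repl, raw, flags=re.DOTALL).decode("latin-1")


def show_text_operands(stream: bytes) -> list:
    """Return one string per text-showing operator found in the stream."""
    shown = []
    pending = []  # string operands seen since the last operator
    for match in TOKEN.finditer(stream):
        tok = match.group()
        if tok.startswith(b"("):
            pending.append(unescape(tok[1:-1]))
        elif tok in (b"[", b"]") or tok.startswith(b"/") or NUMBER.fullmatch(tok):
            continue  # array brackets, names, and numbers (e.g. TJ kerning) are skipped
        elif tok in TEXT_OPERATORS:
            shown.append("".join(pending))
            pending = []
        else:
            pending = []  # some other operator: discard any stray strings
    return shown


sample = b'BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ 0 0 (a\\(b\\)) " ET'
print(show_text_operands(sample))  # ['Hello', 'World', 'a(b)']
```

With a subset font that lacks a /ToUnicode CMap, the operand bytes are glyph codes rather than characters, so a decode like the one above produces nothing meaningful. That is the situation the extractable flag reports.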
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| pdfBase64 | Yes | Base64-encoded PDF bytes. | |
| pages | No | Optional 0-based page indices to extract. When omitted, every page is extracted. | |
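As an illustration of the input shape, the following sketch builds an arguments object from raw PDF bytes. It assumes `pages` is passed as a list of integers; the schema above says only "0-based page indices". The helper name `build_extract_text_args` is hypothetical, and how the arguments are sent to the tool depends on the client and is not shown.

```python
import base64


def build_extract_text_args(pdf_bytes: bytes, pages=None) -> dict:
    """Build the extract_text arguments; omit `pages` to request every page."""
    args = {"pdfBase64": base64.b64encode(pdf_bytes).decode("ascii")}
    if pages is not None:
        # Assumption: indices are 0-based integers, so the first page is 0.
        args["pages"] = [int(p) for p in pages]
    return args


# Placeholder bytes standing in for a real PDF file's contents.
example_args = build_extract_text_args(b"%PDF-1.7\n...", pages=[0, 2])
print(sorted(example_args))  # ['pages', 'pdfBase64']
```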
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| pageCount | Yes | | |
| extractedPageCount | Yes | | |
| extractable | Yes | False when one or more requested pages had a non-empty content stream but yielded no extractable text (likely subset fonts without /ToUnicode). | |
| extractableReason | No | Human-readable explanation when extractable=false. Absent when extractable=true. | |
| pages | Yes | | |
| fullText | Yes | | |
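A caller will typically want to treat extractable=false as a warning rather than a failure. The sketch below shows one way to do that. It assumes the result arrives as a dictionary with the field names from the table above and that fullText is a string; the schema does not state field types. The helper name `interpret_result` and the sample values are made up for illustration.

```python
from typing import Optional, Tuple


def interpret_result(result: dict) -> Tuple[str, Optional[str]]:
    """Return (text, warning); warning is None when every page yielded text."""
    text = result.get("fullText", "")
    if result["extractable"]:
        return text, None
    # extractableReason is only present when extractable is false.
    reason = result.get("extractableReason", "no reason provided")
    return text, f"some pages yielded no text: {reason}"


# Illustrative result only; the values are invented for this example.
sample_result = {
    "pageCount": 2,
    "extractedPageCount": 2,
    "extractable": False,
    "extractableReason": "subset font without /ToUnicode",
    "pages": [],
    "fullText": "",
}
text, warning = interpret_result(sample_result)
print(warning)  # some pages yielded no text: subset font without /ToUnicode
```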