get_page_text
Extract OCR or TEI text from specific pages in Gallica digital library documents using ARK identifiers and page numbers. Returns text when available, null if not.
Instructions
Retrieve OCR or TEI text for a specific page when available. Returns null if text is not available.
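A minimal sketch of how a caller might consume the tool's result. The result shape is taken from the handler shown under Implementation Reference; the `textOrFallback` helper and the ARK value are hypothetical and exist only for illustration.

```typescript
// Result shape assumed from the handler in the Implementation Reference.
interface PageTextResult {
  ark: string;
  page: number;
  format: 'plain' | 'alto' | 'tei';
  text: string | null;
  available: boolean;
}

// Hypothetical helper: returns the page text, or a fallback when none exists.
function textOrFallback(result: PageTextResult, fallback: string): string {
  return result.available && result.text !== null ? result.text : fallback;
}

// Placeholder result for a page without text (the ARK is not a real document).
const missing: PageTextResult = {
  ark: 'ark:/12148/example',
  page: 3,
  format: 'plain',
  text: null,
  available: false,
};

console.log(textOrFallback(missing, '[no text for this page]'));
```

Checking `available` (or `text !== null`) before using `text` keeps the "no text" case explicit instead of relying on an exception.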
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| ark | Yes | ARK identifier | |
| page | Yes | Page number (positive integer) | |
| format | No | Text format: plain, alto, or tei | plain |
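To make the schema concrete, here is a hedged sketch of arguments that satisfy it, checked by a hand-written validator. The handler listed under Implementation Reference validates with zod; `parseArgs` below is a hypothetical, dependency-free stand-in, and the ARK value is a placeholder.

```typescript
type TextFormat = 'plain' | 'alto' | 'tei';

interface GetPageTextArgs {
  ark: string;
  page: number;
  format?: TextFormat;
}

const FORMATS: readonly string[] = ['plain', 'alto', 'tei'];

// Hypothetical stand-in for schema validation; throws on invalid input.
function parseArgs(input: unknown): GetPageTextArgs {
  if (typeof input !== 'object' || input === null) {
    throw new Error('arguments must be an object');
  }
  const { ark, page, format } = input as Record<string, unknown>;
  if (typeof ark !== 'string') {
    throw new Error('ark must be a string');
  }
  // page must be a positive integer.
  if (typeof page !== 'number' || !isFinite(page) || Math.floor(page) !== page || page <= 0) {
    throw new Error('page must be a positive integer');
  }
  const args: GetPageTextArgs = { ark, page };
  if (format !== undefined) {
    if (typeof format !== 'string' || FORMATS.indexOf(format) === -1) {
      throw new Error('format must be plain, alto, or tei');
    }
    args.format = format as TextFormat;
  }
  return args;
}

// Example call: format is optional and is treated as plain when omitted.
const example = parseArgs({ ark: 'ark:/12148/example', page: 12 });
console.log(example.format ?? 'plain');
```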
Implementation Reference
- src/gallica/text.ts:26-43 (handler): Core implementation of the getPageText method. It dispatches to format-specific handlers (alto, tei, plain) and handles errors gracefully by returning null when text is unavailable.
```typescript
async getPageText(
  ark: string,
  page: number,
  format: 'plain' | 'alto' | 'tei' = 'plain'
): Promise<string | null> {
  try {
    if (format === 'alto') {
      return await this.getAltoText(ark, page);
    } else if (format === 'tei') {
      return await this.getTeiText(ark, page);
    } else {
      return await this.getPlainText(ark, page);
    }
  } catch (error) {
    logger.debug(`Text not available for ${ark}, page ${page}: ${error instanceof Error ? error.message : String(error)}`);
    return null;
  }
}
```
- src/tools/items.ts:150-190 (handler): Tool definition with the input schema, zod validation of the arguments, and the MCP handler wrapper that calls textClient.getPageText and returns an object with ark, page, format, text, and available fields.
```typescript
export function createGetPageTextTool(textClient: TextClient) {
  return {
    name: 'get_page_text',
    description: 'Retrieve OCR or TEI text for a specific page when available. Returns null if text is not available.',
    inputSchema: {
      type: 'object',
      properties: {
        ark: {
          type: 'string',
          description: 'ARK identifier',
        },
        page: {
          type: 'number',
          description: 'Page number',
        },
        format: {
          type: 'string',
          enum: ['plain', 'alto', 'tei'],
          description: 'Text format (default: plain)',
        },
      },
      required: ['ark', 'page'],
    },
    handler: async (args: unknown) => {
      // Validate the raw arguments before touching the text client
      const parsed = z.object({
        ark: z.string(),
        page: z.number().int().positive(),
        format: z.enum(['plain', 'alto', 'tei']).optional(),
      }).parse(args);

      const text = await textClient.getPageText(parsed.ark, parsed.page, parsed.format || 'plain');

      return {
        ark: parsed.ark,
        page: parsed.page,
        format: parsed.format || 'plain',
        text: text,
        available: text !== null,
      };
    },
  };
}
```
- src/gallica/text.ts:48-66 (helper): Helper method getAltoText that fetches ALTO XML from the Gallica API and parses it to extract OCR text.
```typescript
private async getAltoText(ark: string, page: number): Promise<string | null> {
  try {
    // Extract ARK identifier
    const arkId = ark.replace(/^ark:\/12148\//, '').replace(/^\/ark:\/12148\//, '');
    const url = `${this.baseUrl}/RequestDigitalElement`;
    const params = {
      O: `ark:/12148/${arkId}`,
      E: 'ALTO',
      Deb: String(page),
    };

    const xmlBody = await this.httpClient.getXml(url, params);
    return this.parseAltoXml(xmlBody);
  } catch (error) {
    // ALTO not available, return null (not an error)
    return null;
  }
}
```
- src/gallica/text.ts:141-160 (helper): Helper method getPlainText that fetches plain text from Gallica's texteBrut endpoint as a fallback when ALTO is unavailable. The page argument is currently unused, so the full document text is returned rather than a single page.
```typescript
private async getPlainText(ark: string, _page: number): Promise<string | null> {
  try {
    // Extract ARK identifier
    const arkId = ark.replace(/^ark:\/12148\//, '').replace(/^\/ark:\/12148\//, '');

    // Try plain text endpoint
    const url = `${this.baseUrl}/ark:/12148/${arkId}.texteBrut`;
    const text = await this.httpClient.get(url);

    if (text.statusCode === 200 && text.body.trim().length > 0) {
      // If we have page-specific text, extract relevant portion
      // For now, return full text (page extraction would require parsing)
      return text.body;
    }

    return null;
  } catch (error) {
    return null;
  }
}
```
- src/mcpServer.ts:88-88 (registration): Registration of the get_page_text tool by calling createGetPageTextTool with the textClient instance.
```typescript
const getPageText = createGetPageTextTool(textClient);
```
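The getAltoText helper delegates to parseAltoXml, which this reference does not show. The sketch below is a hypothetical stand-in, not the project's implementation: it assumes ALTO `TextLine` elements that contain `String` elements carrying a `CONTENT` attribute, and it uses regular expressions rather than a real XML parser, which is only adequate for simple, well-formed input. The names `extractAltoText` and `decodeEntities` are made up for illustration.

```typescript
// Decode the five predefined XML entities; &amp; goes last to avoid double decoding.
function decodeEntities(s: string): string {
  return s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&');
}

// Hypothetical stand-in for parseAltoXml: joins words per TextLine with spaces
// and lines with newlines; returns null when no text is found.
function extractAltoText(xml: string): string | null {
  const lines: string[] = [];
  const lineRe = /<TextLine\b[^>]*>([\s\S]*?)<\/TextLine>/g;
  let lineMatch: RegExpExecArray | null;
  while ((lineMatch = lineRe.exec(xml)) !== null) {
    const words: string[] = [];
    // A fresh regex per line so lastIndex starts at 0.
    const wordRe = /<String\b[^>]*?\bCONTENT="([^"]*)"/g;
    let wordMatch: RegExpExecArray | null;
    while ((wordMatch = wordRe.exec(lineMatch[1] ?? '')) !== null) {
      words.push(decodeEntities(wordMatch[1] ?? ''));
    }
    if (words.length > 0) {
      lines.push(words.join(' '));
    }
  }
  return lines.length > 0 ? lines.join('\n') : null;
}

// Tiny hand-written ALTO fragment, not real Gallica output.
const sampleAlto =
  '<alto><TextLine><String CONTENT="Le"/><SP/><String CONTENT="chat"/></TextLine>' +
  '<TextLine><String CONTENT="R&amp;D"/></TextLine></alto>';

console.log(extractAltoText(sampleAlto));
```

Returning null for input without any text mirrors the convention used throughout this tool, where missing text is a normal outcome and not an error.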