Gallica/BnF MCP Server

by ukicar

get_page_text

Extract OCR or TEI text from specific pages in Gallica digital library documents using ARK identifiers and page numbers. Returns text when available, null if not.

Instructions

Retrieve OCR or TEI text for a specific page when available. Returns null if text is not available.

Input Schema

Name     Required   Description                     Default
ark      Yes        ARK identifier
page     Yes        Page number
format   No         Text format (default: plain)
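
As a rough illustration of how a client might use this schema, the sketch below builds an MCP `tools/call` request for `get_page_text` and summarizes a result. The ARK value is a placeholder, the helper names `buildGetPageTextCall` and `describeResult` are hypothetical, and the result shape is copied from the handler in the implementation reference; how the server wraps that object for transport is not shown on this page.

```typescript
// Hypothetical client-side helpers; not part of the server's code.
type TextFormat = 'plain' | 'alto' | 'tei';

interface GetPageTextArgs {
  ark: string;
  page: number;
  format?: TextFormat;
}

// Object returned by the tool handler (fields taken from the implementation reference).
interface GetPageTextResult {
  ark: string;
  page: number;
  format: TextFormat;
  text: string | null;
  available: boolean;
}

// Build a JSON-RPC 2.0 "tools/call" request as used by MCP clients.
function buildGetPageTextCall(id: number, args: GetPageTextArgs) {
  // Mirror the server-side check: page must be a positive integer.
  if (!Number.isInteger(args.page) || args.page <= 0) {
    throw new RangeError(`page must be a positive integer, got ${args.page}`);
  }
  return {
    jsonrpc: '2.0' as const,
    id,
    method: 'tools/call' as const,
    params: { name: 'get_page_text', arguments: args },
  };
}

// Turn a result into a short human-readable summary, treating null text as "unavailable".
function describeResult(result: GetPageTextResult): string {
  if (!result.available || result.text === null) {
    return `No ${result.format} text for ${result.ark}, page ${result.page}`;
  }
  return `${result.text.length} characters of ${result.format} text`;
}

// Placeholder ARK for illustration only.
const request = buildGetPageTextCall(1, { ark: 'ark:/12148/EXAMPLE', page: 3, format: 'alto' });
console.log(JSON.stringify(request));
```

Checking `available` (or `text === null`) before using `text` matters because a missing page is reported as a normal result rather than as an error.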

Implementation Reference

  • Core implementation of getPageText method that dispatches to format-specific handlers (alto, tei, plain) and handles errors gracefully by returning null when text is unavailable
    async getPageText(
      ark: string,
      page: number,
      format: 'plain' | 'alto' | 'tei' = 'plain'
    ): Promise<string | null> {
      try {
        if (format === 'alto') {
          return await this.getAltoText(ark, page);
        } else if (format === 'tei') {
          return await this.getTeiText(ark, page);
        } else {
          return await this.getPlainText(ark, page);
        }
      } catch (error) {
        logger.debug(`Text not available for ${ark}, page ${page}: ${error instanceof Error ? error.message : String(error)}`);
        return null;
      }
    }
  • Tool definition with input schema validation using zod and the MCP handler wrapper that calls textClient.getPageText and formats the response with ark, page, format, text, and available fields
    export function createGetPageTextTool(textClient: TextClient) {
      return {
        name: 'get_page_text',
        description: 'Retrieve OCR or TEI text for a specific page when available. Returns null if text is not available.',
        inputSchema: {
          type: 'object',
          properties: {
            ark: {
              type: 'string',
              description: 'ARK identifier',
            },
            page: {
              type: 'number',
              description: 'Page number',
            },
            format: {
              type: 'string',
              enum: ['plain', 'alto', 'tei'],
              description: 'Text format (default: plain)',
            },
          },
          required: ['ark', 'page'],
        },
        handler: async (args: unknown) => {
          const parsed = z.object({
            ark: z.string(),
            page: z.number().int().positive(),
            format: z.enum(['plain', 'alto', 'tei']).optional(),
          }).parse(args);
    
          const text = await textClient.getPageText(parsed.ark, parsed.page, parsed.format || 'plain');
    
          return {
            ark: parsed.ark,
            page: parsed.page,
            format: parsed.format || 'plain',
            text: text,
            available: text !== null,
          };
        },
      };
    }
  • Helper method getAltoText that fetches ALTO XML from Gallica API and parses it to extract OCR text
    private async getAltoText(ark: string, page: number): Promise<string | null> {
      try {
        // Extract ARK identifier
        const arkId = ark.replace(/^ark:\/12148\//, '').replace(/^\/ark:\/12148\//, '');
        
        const url = `${this.baseUrl}/RequestDigitalElement`;
        const params = {
          O: `ark:/12148/${arkId}`,
          E: 'ALTO',
          Deb: String(page),
        };
    
        const xmlBody = await this.httpClient.getXml(url, params);
        return this.parseAltoXml(xmlBody);
      } catch (error) {
        // ALTO not available, return null (not an error)
        return null;
      }
    }
  • Helper method getPlainText that fetches plain text from Gallica's texteBrut endpoint for the default plain format. The page argument is currently unused, so the full document text is returned rather than a single page
    private async getPlainText(ark: string, _page: number): Promise<string | null> {
      try {
        // Extract ARK identifier
        const arkId = ark.replace(/^ark:\/12148\//, '').replace(/^\/ark:\/12148\//, '');
        
        // Try plain text endpoint
        const url = `${this.baseUrl}/ark:/12148/${arkId}.texteBrut`;
        const text = await this.httpClient.get(url);
        
        if (text.statusCode === 200 && text.body.trim().length > 0) {
          // If we have page-specific text, extract relevant portion
          // For now, return full text (page extraction would require parsing)
          return text.body;
        }
        
        return null;
      } catch (error) {
        return null;
      }
    }
  • src/mcpServer.ts:88-88 (registration)
    Registration of get_page_text tool by calling createGetPageTextTool with the textClient instance
    const getPageText = createGetPageTextTool(textClient);
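
The two helpers above repeat the same ARK-prefix stripping, and `parseAltoXml` is referenced but not shown. The sketch below illustrates one way these pieces could look: a shared `normalizeArk` helper covering the same two prefix forms, the ALTO request URL assembled with `URLSearchParams`, and a deliberately naive regex-based extraction of `CONTENT` attributes from ALTO `String` elements. All function names, the base URL, and the sample XML are assumptions for illustration. The server's actual `parseAltoXml` may work quite differently, and a production parser would normally use a proper XML library.

```typescript
// Assumed base URL; the server reads its own value from this.baseUrl.
const BASE_URL = 'https://gallica.bnf.fr';

// Strip an optional leading slash and the "ark:/12148/" prefix, if present.
function normalizeArk(ark: string): string {
  return ark.replace(/^\/?ark:\/12148\//, '');
}

// Build the ALTO request URL with the same O, E and Deb parameters as getAltoText.
function buildAltoUrl(ark: string, page: number): string {
  const params = new URLSearchParams({
    O: `ark:/12148/${normalizeArk(ark)}`,
    E: 'ALTO',
    Deb: String(page),
  });
  return `${BASE_URL}/RequestDigitalElement?${params.toString()}`;
}

// Naive ALTO text extraction: one output line per <TextLine>, words taken from
// the CONTENT attribute of each <String>. XML entities are NOT decoded here.
function extractAltoText(xml: string): string | null {
  const lines: string[] = [];
  const lineRe = /<TextLine\b[^>]*>([\s\S]*?)<\/TextLine>/g;
  let lineMatch: RegExpExecArray | null;
  while ((lineMatch = lineRe.exec(xml)) !== null) {
    const words: string[] = [];
    const wordRe = /<String\b[^>]*?\bCONTENT="([^"]*)"/g;
    let wordMatch: RegExpExecArray | null;
    while ((wordMatch = wordRe.exec(lineMatch[1])) !== null) {
      words.push(wordMatch[1]);
    }
    if (words.length > 0) lines.push(words.join(' '));
  }
  // Follow the tool's convention: null means "no text available".
  return lines.length > 0 ? lines.join('\n') : null;
}

// Made-up ALTO fragment for illustration only.
const sampleAlto = `<alto><Layout><Page><PrintSpace><TextBlock>
<TextLine><String CONTENT="Bonjour" HPOS="1"/><SP/><String CONTENT="monde"/></TextLine>
<TextLine><String CONTENT="Gallica"/></TextLine>
</TextBlock></PrintSpace></Page></Layout></alto>`;

console.log(buildAltoUrl('/ark:/12148/EXAMPLE', 2));
console.log(extractAltoText(sampleAlto));
```

Returning `null` for an empty extraction keeps the sketch consistent with the tool's contract, where missing text is reported through `text: null` and `available: false` instead of an exception.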
