Skip to main content
Glama
ukicar

Gallica/BnF MCP Server

by ukicar

get_page_text

Extract OCR or TEI text from specific pages in Gallica digital library documents using ARK identifiers and page numbers. Returns text when available, null if not.

Instructions

Retrieve OCR or TEI text for a specific page when available. Returns null if text is not available.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
arkYesARK identifier
pageYesPage number
formatNoText format (default: plain)

Implementation Reference

  • Core implementation of getPageText method that dispatches to format-specific handlers (alto, tei, plain) and handles errors gracefully by returning null when text is unavailable
    async getPageText(
      ark: string,
      page: number,
      format: 'plain' | 'alto' | 'tei' = 'plain'
    ): Promise<string | null> {
      try {
        if (format === 'alto') {
          return await this.getAltoText(ark, page);
        } else if (format === 'tei') {
          return await this.getTeiText(ark, page);
        } else {
          return await this.getPlainText(ark, page);
        }
      } catch (error) {
        logger.debug(`Text not available for ${ark}, page ${page}: ${error instanceof Error ? error.message : String(error)}`);
        return null;
      }
    }
  • Tool definition with input schema validation using zod and the MCP handler wrapper that calls textClient.getPageText and formats the response with ark, page, format, text, and available fields
    export function createGetPageTextTool(textClient: TextClient) {
      return {
        name: 'get_page_text',
        description: 'Retrieve OCR or TEI text for a specific page when available. Returns null if text is not available.',
        inputSchema: {
          type: 'object',
          properties: {
            ark: {
              type: 'string',
              description: 'ARK identifier',
            },
            page: {
              type: 'number',
              description: 'Page number',
            },
            format: {
              type: 'string',
              enum: ['plain', 'alto', 'tei'],
              description: 'Text format (default: plain)',
            },
          },
          required: ['ark', 'page'],
        },
        handler: async (args: unknown) => {
          const parsed = z.object({
            ark: z.string(),
            page: z.number().int().positive(),
            format: z.enum(['plain', 'alto', 'tei']).optional(),
          }).parse(args);
    
          const text = await textClient.getPageText(parsed.ark, parsed.page, parsed.format || 'plain');
    
          return {
            ark: parsed.ark,
            page: parsed.page,
            format: parsed.format || 'plain',
            text: text,
            available: text !== null,
          };
        },
      };
  • Helper method getAltoText that fetches ALTO XML from Gallica API and parses it to extract OCR text
    private async getAltoText(ark: string, page: number): Promise<string | null> {
      try {
        // Extract ARK identifier
        const arkId = ark.replace(/^ark:\/12148\//, '').replace(/^\/ark:\/12148\//, '');
        
        const url = `${this.baseUrl}/RequestDigitalElement`;
        const params = {
          O: `ark:/12148/${arkId}`,
          E: 'ALTO',
          Deb: String(page),
        };
    
        const xmlBody = await this.httpClient.getXml(url, params);
        return this.parseAltoXml(xmlBody);
      } catch (error) {
        // ALTO not available, return null (not an error)
        return null;
      }
    }
  • Helper method getPlainText that fetches plain text from Gallica's texteBrut endpoint as a fallback when ALTO is unavailable
    private async getPlainText(ark: string, _page: number): Promise<string | null> {
      try {
        // Extract ARK identifier
        const arkId = ark.replace(/^ark:\/12148\//, '').replace(/^\/ark:\/12148\//, '');
        
        // Try plain text endpoint
        const url = `${this.baseUrl}/ark:/12148/${arkId}.texteBrut`;
        const text = await this.httpClient.get(url);
        
        if (text.statusCode === 200 && text.body.trim().length > 0) {
          // If we have page-specific text, extract relevant portion
          // For now, return full text (page extraction would require parsing)
          return text.body;
        }
        
        return null;
      } catch (error) {
        return null;
      }
    }
  • src/mcpServer.ts:88-88 (registration)
    Registration of get_page_text tool by calling createGetPageTextTool with the textClient instance
    const getPageText = createGetPageTextTool(textClient);
Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden. It discloses that the tool may return null if text is unavailable, which is a key behavioral trait. However, it lacks details on permissions, rate limits, or error handling, leaving gaps for a tool that interacts with external data.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two concise sentences with zero waste. It front-loads the core purpose and follows with important behavioral context (null returns), making it efficiently structured and easy to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given no annotations and no output schema, the description is minimal but covers the essential behavior (retrieving text and handling unavailability). For a tool with 3 parameters and external dependencies, it could benefit from more context on error cases or output structure, but it meets the minimum viable threshold.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all parameters (ark, page, format). The description adds no additional meaning beyond what the schema provides, such as explaining ARK identifiers or format differences. Baseline 3 is appropriate as the schema handles the heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb 'Retrieve' and the resource 'OCR or TEI text for a specific page', specifying what the tool does. It distinguishes from siblings like get_page_image (which retrieves images) and get_item_pages (which likely lists pages), making the purpose specific and differentiated.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage by noting 'when available' and 'Returns null if text is not available', suggesting it's for pages with text content. However, it doesn't explicitly state when to use this tool versus alternatives like get_page_image for images or advanced_search for broader queries, leaving some ambiguity.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ukicar/sweet-bnf'

If you have feedback or need assistance with the MCP directory API, please join our Discord server