Glama
cablate

Simple Document Processing MCP Server

document_reader

Read extractable text from non-image document files including PDF, DOCX, TXT, HTML, and CSV at specified paths.

Instructions

Read content from non-image document-files at specified paths, supporting various file formats: .pdf, .docx, .txt, .html, .csv

Input Schema

Name | Required | Description | Default
filePath | Yes | Path to the file to be read |

Implementation Reference

  • Main handler function 'readFile' that dispatches to format-specific readers (.pdf, .docx, .txt, .html, .csv) based on file extension.
    export async function readFile(filePath: string) {
      try {
        const ext = path.extname(filePath).toLowerCase();
        let content: string;
    
        switch (ext) {
          case ".pdf":
            content = await readPDFFile(filePath);
            break;
          case ".docx":
            content = await readDocxFile(filePath);
            break;
          case ".txt":
            content = await readTextFile(filePath);
            break;
          case ".html":
            content = await readHTMLFile(filePath);
            break;
          case ".csv":
            content = await readCSVFile(filePath);
            break;
          default:
            throw new Error(`Unsupported file format: ${ext}`);
        }
    
        return {
          success: true,
          data: content,
        };
      } catch (error) {
        return {
          success: false,
          error: error instanceof Error ? error.message : "Unknown error",
        };
      }
    } 
  • Helper function to read plain text (.txt) files.
    async function readTextFile(filePath: string): Promise<string> {
      return await fs.readFile(filePath, "utf-8");
    }
  • Helper function to read PDF files using pdfreader library.
    async function readPDFFile(filePath: string): Promise<string> {
      const buffer = await fs.readFile(filePath);
    
      return new Promise((resolve, reject) => {
        let content = "";
        const reader = new PdfReader();
    
        reader.parseBuffer(buffer, ((err: null | Error, item: Item | undefined) => {
          if (err) {
            reject(err);
          } else if (!item) {
            resolve(content);
          } else if (item.text) {
            content += item.text + " ";
          }
        }) as ItemHandler);
      });
    }
  • Helper function to read DOCX files using mammoth library.
    async function readDocxFile(filePath: string): Promise<string> {
      const buffer = await fs.readFile(filePath);
      const result = await mammoth.extractRawText({ buffer });
      return result.value;
    }
  • Helper function to read HTML files using JSDOM to extract text content.
    async function readHTMLFile(filePath: string): Promise<string> {
      const content = await fs.readFile(filePath, "utf-8");
      const dom = new JSDOM(content);
      return dom.window.document.body.textContent || "";
    }
  • Input schema and type definitions for the document_reader tool, requiring a 'filePath' string property.
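    // Excerpt: closing part of the DOCUMENT_READER_TOOL object literal, followed by the argument type.
      inputSchema: {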
        type: "object",
        properties: {
          filePath: {
            type: "string",
            description: "Path to the file to be read",
          },
        },
        required: ["filePath"],
      },
    };
    
    export interface FileReaderArgs {
      filePath: string;
    }
  • Tool definition object DOCUMENT_READER_TOOL with name 'document_reader', description, and inputSchema.
    export const DOCUMENT_READER_TOOL: Tool = {
      name: "document_reader",
      description:
        "Read content from non-image document-files at specified paths, supporting various file formats: .pdf, .docx, .txt, .html, .csv",
      inputSchema: {
        type: "object",
        properties: {
          filePath: {
            type: "string",
            description: "Path to the file to be read",
          },
        },
        required: ["filePath"],
      },
    };
  • Import of DOCUMENT_READER_TOOL into the tools index module and its registration in the exported tools array.
    import { DOCUMENT_READER_TOOL } from "./documentReader.js";
    import { DOCX_TO_HTML_TOOL, DOCX_TO_PDF_TOOL } from "./docxTools.js";
    import { EXCEL_READ_TOOL } from "./excelTools.js";
    import { FORMAT_CONVERTER_TOOL } from "./formatConverterPlus.js";
    import { HTML_CLEAN_TOOL, HTML_EXTRACT_RESOURCES_TOOL, HTML_FORMAT_TOOL, HTML_TO_MARKDOWN_TOOL, HTML_TO_TEXT_TOOL } from "./htmlTools.js";
    import { PDF_MERGE_TOOL, PDF_SPLIT_TOOL } from "./pdfTools.js";
    import { TEXT_DIFF_TOOL, TEXT_ENCODING_CONVERT_TOOL, TEXT_FORMAT_TOOL, TEXT_SPLIT_TOOL } from "./txtTools.js";
    
    export const tools = [DOCUMENT_READER_TOOL, PDF_MERGE_TOOL, PDF_SPLIT_TOOL, DOCX_TO_PDF_TOOL, DOCX_TO_HTML_TOOL, HTML_CLEAN_TOOL, HTML_TO_TEXT_TOOL, HTML_TO_MARKDOWN_TOOL, HTML_EXTRACT_RESOURCES_TOOL, HTML_FORMAT_TOOL, TEXT_DIFF_TOOL, TEXT_SPLIT_TOOL, TEXT_FORMAT_TOOL, TEXT_ENCODING_CONVERT_TOOL, EXCEL_READ_TOOL, FORMAT_CONVERTER_TOOL];
  • Server request handler that validates args and calls readFile() for the 'document_reader' tool name.
    if (name === "document_reader") {
      if (!isFileReaderArgs(args)) {
        throw new Error("Invalid arguments for document_reader");
      }
    
      const result = await readFile(args.filePath);
      if (!result.success) {
        return {
          content: [{ type: "text", text: `Error: ${result.error}` }],
          isError: true,
        };
      }
      return {
        content: [{ type: "text", text: result.data }],
        isError: false,
      };
    }
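The request handler above calls isFileReaderArgs, which the listing does not show. A minimal sketch of such a type guard follows, assuming it only needs to confirm that the arguments are an object with a string filePath. The function body is hypothetical and may differ from the server's actual implementation.

```typescript
// Mirrors the FileReaderArgs interface from the listing above.
interface FileReaderArgs {
  filePath: string;
}

// Hypothetical type guard: the real isFileReaderArgs is not shown in the source.
function isFileReaderArgs(args: unknown): args is FileReaderArgs {
  return (
    typeof args === "object" &&
    args !== null &&
    "filePath" in args &&
    typeof (args as { filePath: unknown }).filePath === "string"
  );
}

console.log(isFileReaderArgs({ filePath: "report.pdf" })); // true
console.log(isFileReaderArgs({ filePath: 42 }));           // false
console.log(isFileReaderArgs(null));                       // false
```

With a guard like this, malformed arguments are rejected before any file access is attempted, which matches the "Invalid arguments for document_reader" error thrown in the handler.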
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full responsibility for disclosing behavior. It states 'read content' implying a read-only, non-destructive operation. However, it fails to mention error handling (e.g., missing file, unsupported format) or any side effects, leaving some behavioral uncertainty.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
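Although the description is silent on error handling, the implementation reference suggests how failures surface: readFile returns a failure object, and the handler turns it into a text response flagged with isError. The sketch below reproduces only that mapping. The ReadResult type and the toToolResponse name are assumptions introduced for illustration; the source infers the result shape and inlines this logic in the handler.

```typescript
// Assumed result shape, modelled on the object literals returned by readFile.
type ReadResult =
  | { success: true; data: string }
  | { success: false; error: string };

// Same mapping as the document_reader branch of the request handler.
function toToolResponse(result: ReadResult) {
  if (result.success === false) {
    return {
      content: [{ type: "text", text: `Error: ${result.error}` }],
      isError: true,
    };
  }
  return {
    content: [{ type: "text", text: result.data }],
    isError: false,
  };
}

// An unsupported extension such as .xlsx reaches the default branch of the switch.
const response = toToolResponse({
  success: false,
  error: "Unsupported file format: .xlsx",
});
console.log(response.content[0].text); // prints "Error: Unsupported file format: .xlsx"
```

A missing file would presumably follow the same path, with the message coming from the rejected fs.readFile call rather than from the switch statement.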

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single sentence that efficiently conveys the purpose and supported formats. It is front-loaded with the key action ('Read content from non-image document-files…') and omits extraneous information, but it could be slightly more structured with separate clauses.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's simplicity (one parameter, no output schema, no nested objects), the description covers the essential purpose and file types. It does not detail return format or encoding, but for a straightforward reader, the context is reasonably complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has 100% coverage for its single parameter (filePath), so the schema already provides the path description. The tool description adds no additional semantics beyond mentioning supported formats, which are implicit from the file extension. Thus, it adds minimal value over the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
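Since the valid values of filePath are constrained only by the file extension, a caller could check the extension before invoking the tool. The sketch below assumes the same rule as readFile (path.extname followed by toLowerCase); the helper name hasSupportedExtension is hypothetical and not part of the server.

```typescript
import * as path from "node:path";

// Extensions handled by the switch statement in readFile.
const SUPPORTED_EXTENSIONS = new Set([".pdf", ".docx", ".txt", ".html", ".csv"]);

// Hypothetical helper (not part of the server): true when readFile would not
// fall through to its "Unsupported file format" branch.
function hasSupportedExtension(filePath: string): boolean {
  return SUPPORTED_EXTENSIONS.has(path.extname(filePath).toLowerCase());
}

console.log(hasSupportedExtension("notes/Report.PDF")); // true
console.log(hasSupportedExtension("data/sheet.xlsx"));  // false
```

The check is case-insensitive because the extension is lowercased before the lookup, so Report.PDF and report.pdf are treated alike.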

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool reads content from non-image document files and lists specific formats (.pdf, .docx, .txt, .html, .csv). This specificity distinguishes it from sibling tools like docx_to_html or pdf_splitter, which perform targeted conversions or manipulations.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for reading text-based files, but it does not explicitly advise when to avoid this tool or suggest alternatives. For instance, it does not mention that image files should be handled by other tools or that excel_read should be used for .xlsx files.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/cablate/mcp-doc-forge'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.