Skip to main content
Glama
cordlesssteve

Document Organizer MCP Server

convert_pdf

Convert PDF files to Markdown format for better organization and processing using marker or pymupdf4llm libraries.

Instructions

Convert PDF files to Markdown format using marker (recommended) or pymupdf4llm

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
output_pathNoOptional path to write markdown output
pdf_pathYesAbsolute path to the PDF file to convert

Implementation Reference

  • Core handler function that orchestrates PDF to Markdown conversion, selecting between marker or pymupdf4llm engines based on options, with input validation and comprehensive error handling.
    async function convertPdfToMarkdown(
      pdfPath: string,
      outputPath?: string,
      options: {
        engine?: "pymupdf4llm" | "marker";
        page_chunks?: boolean;
        write_images?: boolean;
        image_path?: string;
        table_strategy?: "fast" | "accurate";
        extract_content?: "text" | "figures" | "both";
        auto_clean?: boolean;
      } = {}
    ): Promise<ConversionResult> {
      const startTime = Date.now();
      
      try {
        // Validate PDF exists
        const validatedPdfPath = validatePath(pdfPath);
        await fs.access(validatedPdfPath);
        
        // Determine which engine to use
        const engine = options.engine || "marker";
        
        if (engine === "marker") {
          return await convertWithMarker(validatedPdfPath, outputPath, options);
        } else {
          return await convertWithPymupdf4llm(validatedPdfPath, outputPath, options);
        }
      } catch (error) {
        return {
          success: false,
          error: error instanceof Error ? error.message : String(error),
          page_count: 0,
          char_count: 0,
          images_extracted: 0,
          processing_time: Date.now() - startTime,
          memory_used: 0,
          warnings: []
        };
      }
    }
  • Zod schema defining the input parameters for the convert_pdf tool, including pdf_path, optional output_path, and detailed conversion options.
    const ConvertPdfArgsSchema = z.object({
      pdf_path: z.string().describe("Absolute path to the PDF file to convert"),
      output_path: z.string().optional().describe("Optional path to write markdown output. If not provided, returns content directly"),
      options: z.object({
        engine: z.enum(["pymupdf4llm", "marker"]).optional().default("marker").describe("PDF conversion engine (marker recommended for complex documents)"),
        page_chunks: z.boolean().optional().default(false).describe("Process as individual pages for memory efficiency (pymupdf4llm only)"),
        write_images: z.boolean().optional().default(false).describe("Extract embedded images to files"),
        image_path: z.string().optional().describe("Directory for extracted images (requires write_images: true)"),
        table_strategy: z.enum(["fast", "accurate"]).optional().default("accurate").describe("Table extraction strategy (pymupdf4llm only)"),
        extract_content: z.enum(["text", "figures", "both"]).optional().default("both").describe("Content to extract from PDF (pymupdf4llm only)"),
        auto_clean: z.boolean().optional().default(true).describe("Automatically clean marker formatting artifacts")
      }).optional().default({})
    });
  • src/index.ts:1291-1295 (registration)
    Tool registration object defining the 'convert_pdf' tool's name, description, and input schema in the tools array.
      name: "convert_pdf",
      description: "Convert PDF files to Markdown format using marker (recommended) or pymupdf4llm. Marker provides superior quality for complex documents with tables and structured content, with automatic cleaning of formatting artifacts. Returns detailed conversion statistics including processing time and content metrics.",
      inputSchema: zodToJsonSchema(ConvertPdfArgsSchema) as ToolInput,
    },
    {
  • Helper function implementing PDF conversion specifically using the 'marker' engine, including subprocess execution, temporary directory management, automatic cleaning, and result processing.
    async function convertWithMarker(
      pdfPath: string,
      outputPath?: string,
      options: { auto_clean?: boolean } = {}
    ): Promise<ConversionResult> {
      const startTime = Date.now();
      
      try {
        // Check if marker is available
        const markerCheck = await checkMarker();
        if (!markerCheck.available) {
          throw new Error(`Marker not available: ${markerCheck.error}`);
        }
        
        // Create temporary output directory for marker
        const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), 'marker-'));
        
        // Run marker conversion with correct syntax
        const result = await new Promise<{ success: boolean; content?: string; error?: string }>((resolve) => {
          const markerProcess = spawn('marker_single', [pdfPath, '--output_dir', tempDir], {
            stdio: ['pipe', 'pipe', 'pipe']
          });
    
          let stdout = '';
          let stderr = '';
    
          markerProcess.stdout.on('data', (data: Buffer) => {
            stdout += data.toString();
          });
    
          markerProcess.stderr.on('data', (data: Buffer) => {
            stderr += data.toString();
          });
    
          markerProcess.on('close', async (code: number | null) => {
            try {
              if (code === 0) {
                // Find the markdown file that marker created
                // Marker creates a directory with PDF basename, then puts .md file inside
                const pdfBaseName = path.basename(pdfPath, '.pdf');
                const markerDir = path.join(tempDir, pdfBaseName);
                const expectedOutput = path.join(markerDir, `${pdfBaseName}.md`);
                
                // Check if the expected structure exists
                let outputFile = expectedOutput;
                try {
                  await fs.access(expectedOutput);
                } catch {
                  // Look for marker directory and .md file inside it
                  try {
                    const tempFiles = await fs.readdir(tempDir, { withFileTypes: true });
                    const markerDirEntry = tempFiles.find(entry => entry.isDirectory());
                    
                    if (markerDirEntry) {
                      const dirPath = path.join(tempDir, markerDirEntry.name);
                      const dirFiles = await fs.readdir(dirPath);
                      const mdFile = dirFiles.find(file => file.endsWith('.md'));
                      if (mdFile) {
                        outputFile = path.join(dirPath, mdFile);
                      } else {
                        resolve({ success: false, error: 'No markdown file found in marker directory' });
                        return;
                      }
                    } else {
                      resolve({ success: false, error: 'No marker output directory found' });
                      return;
                    }
                  } catch {
                    resolve({ success: false, error: 'Failed to find marker output' });
                    return;
                  }
                }
                
                // Read the converted content
                const content = await fs.readFile(outputFile, 'utf-8');
                resolve({ success: true, content });
              } else {
                resolve({ success: false, error: stderr || `Marker exited with code ${code}` });
              }
            } catch (readError) {
              resolve({ success: false, error: `Failed to read marker output: ${readError}` });
            } finally {
              // Clean up temp directory
              try {
                await fs.rm(tempDir, { recursive: true });
              } catch {}
            }
          });
    
          markerProcess.on('error', (error: Error) => {
            resolve({ success: false, error: error.message });
          });
        });
    
        if (!result.success || !result.content) {
          throw new Error(result.error || 'Marker conversion failed');
        }
        
        // Apply cleaning if enabled (default: true)
        let finalContent = result.content;
        if (options.auto_clean !== false) {
          finalContent = cleanMarkerOutput(result.content);
        }
        
        // Write to output file if specified
        if (outputPath) {
          await fs.writeFile(outputPath, finalContent, 'utf-8');
        }
        
        // Calculate statistics
        const charCount = finalContent.length;
        const pageCount = Math.max(1, Math.floor(charCount / 3000)); // Rough estimate
        
        return {
          success: true,
          markdown_content: finalContent,
          output_file: outputPath,
          page_count: pageCount,
          char_count: charCount,
          images_extracted: 0, // Marker doesn't extract images to separate files
          processing_time: Date.now() - startTime,
          memory_used: 0, // Would need process monitoring
          warnings: options.auto_clean !== false ? ['Content automatically cleaned (table-aware)'] : []
        };
        
      } catch (error) {
        return {
          success: false,
          error: error instanceof Error ? error.message : String(error),
          page_count: 0,
          char_count: 0,
          images_extracted: 0,
          processing_time: Date.now() - startTime,
          memory_used: 0,
          warnings: []
        };
      }
    }
  • Helper function implementing PDF conversion using pymupdf4llm via dynamically generated Python script execution, supporting advanced options like image extraction and page chunking.
    async function convertWithPymupdf4llm(
      pdfPath: string,
      outputPath?: string,
      options: {
        page_chunks?: boolean;
        write_images?: boolean;
        image_path?: string;
        table_strategy?: "fast" | "accurate";
        extract_content?: "text" | "figures" | "both";
      } = {}
    ): Promise<ConversionResult> {
      const startTime = Date.now();
      
      try {
        // Validate PDF exists
        const validatedPdfPath = validatePath(pdfPath);
        await fs.access(validatedPdfPath);
        
        // Build Python conversion script for pymupdf4llm
        const pythonScript = `
    import pymupdf4llm
    import json
    import sys
    import os
    import gc
    import psutil
    
    def get_memory_usage():
        process = psutil.Process(os.getpid())
        return process.memory_info().rss / 1024 / 1024  # MB
    
    # Get initial memory
    initial_memory = get_memory_usage()
    
    try:
        # Set up conversion options
        kwargs = {}
        ${options.page_chunks ? "kwargs['page_chunks'] = True" : ""}
        ${options.write_images ? "kwargs['write_images'] = True" : ""}
        ${options.image_path ? `kwargs['image_path'] = "${options.image_path}"` : ""}
        
        # Convert PDF to markdown
        pdf_path = "${validatedPdfPath.replace(/\\/g, '\\\\\\\\')}"
        md_content = pymupdf4llm.to_markdown(pdf_path, **kwargs)
        
        # Get memory usage after conversion
        peak_memory = get_memory_usage()
        
        # Calculate statistics
        char_count = len(md_content)
        page_count = 0  # pymupdf4llm doesn't directly expose page count
        images_extracted = 0  # Would need to count files in image_path if provided
        
        # Estimate page count from content (rough heuristic)
        if isinstance(md_content, list):
            page_count = len(md_content)
            md_content = "\\n\\n---\\n\\n".join(md_content)
        else:
            # Estimate based on typical PDF page length
            page_count = max(1, char_count // 3000)
        
        # Count extracted images if image_path was provided
        ${options.write_images && options.image_path ? `
        if os.path.exists("${options.image_path}"):
            images_extracted = len([f for f in os.listdir("${options.image_path}") if f.lower().endswith(('.png', '.jpg', '.jpeg', '.gif'))])
        ` : ""}
        
        result = {
            "success": True,
            "markdown_content": md_content,
            "page_count": page_count,
            "char_count": char_count,
            "images_extracted": images_extracted,
            "memory_used": peak_memory - initial_memory,
            "warnings": []
        }
        
        # Write to file if output_path provided
        ${outputPath ? `
        output_path = "${validatePath(outputPath).replace(/\\/g, '\\\\\\\\')}"
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(md_content)
        result["output_file"] = output_path
        ` : ""}
        
        print(json.dumps(result))
        
    except Exception as e:
        result = {
            "success": False,
            "error": str(e),
            "page_count": 0,
            "char_count": 0,
            "images_extracted": 0,
            "memory_used": get_memory_usage() - initial_memory,
            "warnings": []
        }
        print(json.dumps(result))
        sys.exit(1)
    `;
    
        // Execute conversion
        const result = await new Promise<ConversionResult>((resolve, reject) => {
          const pythonProcess = spawn('python3', ['-c', pythonScript], {
            stdio: ['pipe', 'pipe', 'pipe']
          });
    
          let stdout = '';
          let stderr = '';
    
          pythonProcess.stdout.on('data', (data: Buffer) => {
            stdout += data.toString();
          });
    
          pythonProcess.stderr.on('data', (data: Buffer) => {
            stderr += data.toString();
          });
    
          pythonProcess.on('close', (code: number | null) => {
            try {
              const result = JSON.parse(stdout.trim()) as ConversionResult;
              result.processing_time = Date.now() - startTime;
              
              if (stderr.trim()) {
                result.warnings.push(stderr.trim());
              }
              
              resolve(result);
            } catch (parseError) {
              reject(new Error(`Failed to parse conversion result: ${parseError}, stdout: ${stdout}, stderr: ${stderr}`));
            }
          });
    
          pythonProcess.on('error', (error: Error) => {
            reject(error);
          });
        });
    
        return result;
    
      } catch (error) {
        return {
          success: false,
          error: error instanceof Error ? error.message : String(error),
          page_count: 0,
          char_count: 0,
          images_extracted: 0,
          processing_time: Date.now() - startTime,
          memory_used: 0,
          warnings: []
        };
      }
    }

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/cordlesssteve/document-organizer-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server