Skip to main content
Glama

convert_pdf

Convert PDF files to Markdown format for better organization and processing using marker or pymupdf4llm libraries.

Instructions

Convert PDF files to Markdown format using marker (recommended) or pymupdf4llm

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
output_pathNoOptional path to write markdown output
pdf_pathYesAbsolute path to the PDF file to convert

Implementation Reference

  • Core handler function that orchestrates PDF to Markdown conversion, selecting between marker or pymupdf4llm engines based on options, with input validation and comprehensive error handling.
    async function convertPdfToMarkdown( pdfPath: string, outputPath?: string, options: { engine?: "pymupdf4llm" | "marker"; page_chunks?: boolean; write_images?: boolean; image_path?: string; table_strategy?: "fast" | "accurate"; extract_content?: "text" | "figures" | "both"; auto_clean?: boolean; } = {} ): Promise<ConversionResult> { const startTime = Date.now(); try { // Validate PDF exists const validatedPdfPath = validatePath(pdfPath); await fs.access(validatedPdfPath); // Determine which engine to use const engine = options.engine || "marker"; if (engine === "marker") { return await convertWithMarker(validatedPdfPath, outputPath, options); } else { return await convertWithPymupdf4llm(validatedPdfPath, outputPath, options); } } catch (error) { return { success: false, error: error instanceof Error ? error.message : String(error), page_count: 0, char_count: 0, images_extracted: 0, processing_time: Date.now() - startTime, memory_used: 0, warnings: [] }; } }
  • Zod schema defining the input parameters for the convert_pdf tool, including pdf_path, optional output_path, and detailed conversion options.
    const ConvertPdfArgsSchema = z.object({ pdf_path: z.string().describe("Absolute path to the PDF file to convert"), output_path: z.string().optional().describe("Optional path to write markdown output. If not provided, returns content directly"), options: z.object({ engine: z.enum(["pymupdf4llm", "marker"]).optional().default("marker").describe("PDF conversion engine (marker recommended for complex documents)"), page_chunks: z.boolean().optional().default(false).describe("Process as individual pages for memory efficiency (pymupdf4llm only)"), write_images: z.boolean().optional().default(false).describe("Extract embedded images to files"), image_path: z.string().optional().describe("Directory for extracted images (requires write_images: true)"), table_strategy: z.enum(["fast", "accurate"]).optional().default("accurate").describe("Table extraction strategy (pymupdf4llm only)"), extract_content: z.enum(["text", "figures", "both"]).optional().default("both").describe("Content to extract from PDF (pymupdf4llm only)"), auto_clean: z.boolean().optional().default(true).describe("Automatically clean marker formatting artifacts") }).optional().default({}) });
  • src/index.ts:1291-1295 (registration)
    Tool registration object defining the 'convert_pdf' tool's name, description, and input schema in the tools array.
    name: "convert_pdf", description: "Convert PDF files to Markdown format using marker (recommended) or pymupdf4llm. Marker provides superior quality for complex documents with tables and structured content, with automatic cleaning of formatting artifacts. Returns detailed conversion statistics including processing time and content metrics.", inputSchema: zodToJsonSchema(ConvertPdfArgsSchema) as ToolInput, }, {
  • Helper function implementing PDF conversion specifically using the 'marker' engine, including subprocess execution, temporary directory management, automatic cleaning, and result processing.
    async function convertWithMarker( pdfPath: string, outputPath?: string, options: { auto_clean?: boolean } = {} ): Promise<ConversionResult> { const startTime = Date.now(); try { // Check if marker is available const markerCheck = await checkMarker(); if (!markerCheck.available) { throw new Error(`Marker not available: ${markerCheck.error}`); } // Create temporary output directory for marker const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), 'marker-')); // Run marker conversion with correct syntax const result = await new Promise<{ success: boolean; content?: string; error?: string }>((resolve) => { const markerProcess = spawn('marker_single', [pdfPath, '--output_dir', tempDir], { stdio: ['pipe', 'pipe', 'pipe'] }); let stdout = ''; let stderr = ''; markerProcess.stdout.on('data', (data: Buffer) => { stdout += data.toString(); }); markerProcess.stderr.on('data', (data: Buffer) => { stderr += data.toString(); }); markerProcess.on('close', async (code: number | null) => { try { if (code === 0) { // Find the markdown file that marker created // Marker creates a directory with PDF basename, then puts .md file inside const pdfBaseName = path.basename(pdfPath, '.pdf'); const markerDir = path.join(tempDir, pdfBaseName); const expectedOutput = path.join(markerDir, `${pdfBaseName}.md`); // Check if the expected structure exists let outputFile = expectedOutput; try { await fs.access(expectedOutput); } catch { // Look for marker directory and .md file inside it try { const tempFiles = await fs.readdir(tempDir, { withFileTypes: true }); const markerDirEntry = tempFiles.find(entry => entry.isDirectory()); if (markerDirEntry) { const dirPath = path.join(tempDir, markerDirEntry.name); const dirFiles = await fs.readdir(dirPath); const mdFile = dirFiles.find(file => file.endsWith('.md')); if (mdFile) { outputFile = path.join(dirPath, mdFile); } else { resolve({ success: false, error: 'No markdown file found in marker directory' }); return; } } else { resolve({ success: false, error: 'No marker output directory found' }); return; } } catch { resolve({ success: false, error: 'Failed to find marker output' }); return; } } // Read the converted content const content = await fs.readFile(outputFile, 'utf-8'); resolve({ success: true, content }); } else { resolve({ success: false, error: stderr || `Marker exited with code ${code}` }); } } catch (readError) { resolve({ success: false, error: `Failed to read marker output: ${readError}` }); } finally { // Clean up temp directory try { await fs.rm(tempDir, { recursive: true }); } catch {} } }); markerProcess.on('error', (error: Error) => { resolve({ success: false, error: error.message }); }); }); if (!result.success || !result.content) { throw new Error(result.error || 'Marker conversion failed'); } // Apply cleaning if enabled (default: true) let finalContent = result.content; if (options.auto_clean !== false) { finalContent = cleanMarkerOutput(result.content); } // Write to output file if specified if (outputPath) { await fs.writeFile(outputPath, finalContent, 'utf-8'); } // Calculate statistics const charCount = finalContent.length; const pageCount = Math.max(1, Math.floor(charCount / 3000)); // Rough estimate return { success: true, markdown_content: finalContent, output_file: outputPath, page_count: pageCount, char_count: charCount, images_extracted: 0, // Marker doesn't extract images to separate files processing_time: Date.now() - startTime, memory_used: 0, // Would need process monitoring warnings: options.auto_clean !== false ? ['Content automatically cleaned (table-aware)'] : [] }; } catch (error) { return { success: false, error: error instanceof Error ? error.message : String(error), page_count: 0, char_count: 0, images_extracted: 0, processing_time: Date.now() - startTime, memory_used: 0, warnings: [] }; } }
  • Helper function implementing PDF conversion using pymupdf4llm via dynamically generated Python script execution, supporting advanced options like image extraction and page chunking.
    async function convertWithPymupdf4llm( pdfPath: string, outputPath?: string, options: { page_chunks?: boolean; write_images?: boolean; image_path?: string; table_strategy?: "fast" | "accurate"; extract_content?: "text" | "figures" | "both"; } = {} ): Promise<ConversionResult> { const startTime = Date.now(); try { // Validate PDF exists const validatedPdfPath = validatePath(pdfPath); await fs.access(validatedPdfPath); // Build Python conversion script for pymupdf4llm const pythonScript = ` import pymupdf4llm import json import sys import os import gc import psutil def get_memory_usage(): process = psutil.Process(os.getpid()) return process.memory_info().rss / 1024 / 1024 # MB # Get initial memory initial_memory = get_memory_usage() try: # Set up conversion options kwargs = {} ${options.page_chunks ? "kwargs['page_chunks'] = True" : ""} ${options.write_images ? "kwargs['write_images'] = True" : ""} ${options.image_path ? `kwargs['image_path'] = "${options.image_path}"` : ""} # Convert PDF to markdown pdf_path = "${validatedPdfPath.replace(/\\/g, '\\\\\\\\')}" md_content = pymupdf4llm.to_markdown(pdf_path, **kwargs) # Get memory usage after conversion peak_memory = get_memory_usage() # Calculate statistics char_count = len(md_content) page_count = 0 # pymupdf4llm doesn't directly expose page count images_extracted = 0 # Would need to count files in image_path if provided # Estimate page count from content (rough heuristic) if isinstance(md_content, list): page_count = len(md_content) md_content = "\\n\\n---\\n\\n".join(md_content) else: # Estimate based on typical PDF page length page_count = max(1, char_count // 3000) # Count extracted images if image_path was provided ${options.write_images && options.image_path ? ` if os.path.exists("${options.image_path}"): images_extracted = len([f for f in os.listdir("${options.image_path}") if f.lower().endswith(('.png', '.jpg', '.jpeg', '.gif'))]) ` : ""} result = { "success": True, "markdown_content": md_content, "page_count": page_count, "char_count": char_count, "images_extracted": images_extracted, "memory_used": peak_memory - initial_memory, "warnings": [] } # Write to file if output_path provided ${outputPath ? ` output_path = "${validatePath(outputPath).replace(/\\/g, '\\\\\\\\')}" with open(output_path, 'w', encoding='utf-8') as f: f.write(md_content) result["output_file"] = output_path ` : ""} print(json.dumps(result)) except Exception as e: result = { "success": False, "error": str(e), "page_count": 0, "char_count": 0, "images_extracted": 0, "memory_used": get_memory_usage() - initial_memory, "warnings": [] } print(json.dumps(result)) sys.exit(1) `; // Execute conversion const result = await new Promise<ConversionResult>((resolve, reject) => { const pythonProcess = spawn('python3', ['-c', pythonScript], { stdio: ['pipe', 'pipe', 'pipe'] }); let stdout = ''; let stderr = ''; pythonProcess.stdout.on('data', (data: Buffer) => { stdout += data.toString(); }); pythonProcess.stderr.on('data', (data: Buffer) => { stderr += data.toString(); }); pythonProcess.on('close', (code: number | null) => { try { const result = JSON.parse(stdout.trim()) as ConversionResult; result.processing_time = Date.now() - startTime; if (stderr.trim()) { result.warnings.push(stderr.trim()); } resolve(result); } catch (parseError) { reject(new Error(`Failed to parse conversion result: ${parseError}, stdout: ${stdout}, stderr: ${stderr}`)); } }); pythonProcess.on('error', (error: Error) => { reject(error); }); }); return result; } catch (error) { return { success: false, error: error instanceof Error ? error.message : String(error), page_count: 0, char_count: 0, images_extracted: 0, processing_time: Date.now() - startTime, memory_used: 0, warnings: [] }; } }

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/cordlesssteve/document-organizer-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server