Glama

get_paper_content

Extracts the full text of an arXiv paper from its PDF, given the paper ID, so the complete document can be read or analyzed.

Instructions

Get the full text content of a paper by downloading and extracting text from its PDF

Input Schema

Name      Required  Description                                      Default
paper_id  Yes       arXiv paper ID (e.g., 2104.13478 or cs/0001001)  (none)
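
The schema only requires `paper_id` to be a string. As a minimal sketch, a caller or the server could also check that the value looks like one of the two ID forms named in the description. The helper name and both regular expressions below are assumptions for illustration and are not part of the server source.

```typescript
// Hypothetical helper, not part of the server source: checks that a value
// looks like one of the two arXiv ID forms named in the schema description.

// New-style IDs: YYMM.NNNN or YYMM.NNNNN, optionally followed by a version (v2).
const NEW_STYLE_ID = /^\d{4}\.\d{4,5}(v\d+)?$/;

// Old-style IDs: archive name, optional subject class, slash, seven digits,
// e.g. cs/0001001 or math.GT/0309136.
const OLD_STYLE_ID = /^[a-z-]+(\.[A-Z]{2})?\/\d{7}(v\d+)?$/;

function looksLikeArxivId(value: unknown): value is string {
  return (
    typeof value === 'string' &&
    (NEW_STYLE_ID.test(value) || OLD_STYLE_ID.test(value))
  );
}
```

A check like this would reject obviously malformed input before any PDF download is attempted. It does not confirm that the paper exists.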

Implementation Reference

  • The primary handler function for the 'get_paper_content' tool. Downloads the PDF from arXiv, extracts the text with the pdf-parse library, cleans the text, and returns it in the MCP response format.
    private async getPaperContent(args: GetPaperContentArgs) {
      try {
        // Construct the PDF URL directly.
        // arXiv PDF URLs follow the pattern: https://arxiv.org/pdf/{paper_id}.pdf
        const pdfUrl = `https://arxiv.org/pdf/${args.paper_id}.pdf`;

        // Download the PDF
        const pdfPath = await this.downloadPdf(pdfUrl, args.paper_id);

        // Extract text from the PDF
        const textContent = await this.extractTextFromPdf(pdfPath);

        // Clean up the text: normalize line breaks first, then collapse runs of
        // spaces and tabs. (Collapsing all \s first would also remove the newlines.)
        const cleanedText = textContent
          .replace(/\r\n|\r/g, '\n')
          .replace(/[^\S\n]+/g, ' ')
          .trim();

        // Return the extracted text
        return {
          content: [
            {
              type: 'text',
              text: cleanedText,
            },
          ],
        };
      } catch (error) {
        console.error('Error in getPaperContent:', error);

        // Network or HTTP failure while downloading the PDF
        if (axios.isAxiosError(error)) {
          return {
            content: [
              {
                type: 'text',
                text: `Error retrieving paper content: ${error.response?.data || error.message}`,
              },
            ],
            isError: true,
          };
        }

        // Any other failure, such as PDF parsing
        return {
          content: [
            {
              type: 'text',
              text: `Error processing paper content: ${error instanceof Error ? error.message : String(error)}`,
            },
          ],
          isError: true,
        };
      }
    }
  • TypeScript interface defining the input arguments for the get_paper_content tool.
    interface GetPaperContentArgs { paper_id: string; }
  • src/index.ts:201-214 (registration)
    Tool registration in the ListToolsRequestSchema handler. Defines the tool name, description, and JSON schema for input validation.
    {
      name: 'get_paper_content',
      description: 'Get the full text content of a paper by downloading and extracting text from its PDF',
      inputSchema: {
        type: 'object',
        properties: {
          paper_id: {
            type: 'string',
            description: 'arXiv paper ID (e.g., 2104.13478 or cs/0001001)',
          },
        },
        required: ['paper_id'],
      },
    },
  • src/index.ts:239-246 (registration)
    Dispatch logic in the CallToolRequestSchema handler that validates input and invokes the getPaperContent method.
    case 'get_paper_content':
      if (!request.params.arguments || typeof request.params.arguments.paper_id !== 'string') {
        throw new McpError(
          ErrorCode.InvalidParams,
          'Missing or invalid paper_id parameter'
        );
      }
      return await this.getPaperContent(request.params.arguments as unknown as GetPaperContentArgs);
  • Helper function to extract text content from a downloaded PDF file using the pdf-parse library.
    private async extractTextFromPdf(pdfPath: string): Promise<string> {
      try {
        // Read the PDF file
        const dataBuffer = await fs.readFile(pdfPath);

        // Dynamically import pdf-parse
        const pdfParse = (await import('pdf-parse')).default;

        // Parse the PDF
        const data = await pdfParse(dataBuffer);

        // Return the text content
        return data.text;
      } catch (error) {
        console.error('Error extracting text from PDF:', error);
        throw new Error(`Failed to extract text from PDF: ${error instanceof Error ? error.message : String(error)}`);
      }
    }
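
The handler's cleanup step can be isolated as a small pure function, which makes it easy to test without downloading anything. The sketch below is one possible version that keeps paragraph breaks. Collapsing every `\s+` match to a single space would remove newlines as well, so it normalizes line endings first. The function name and the specific rules (for example, allowing at most one blank line in a row) are assumptions and are not taken from the server source.

```typescript
// Hypothetical standalone version of the cleanup step; the name is illustrative.
function cleanExtractedText(raw: string): string {
  return raw
    .replace(/\r\n|\r/g, '\n')   // normalize CRLF and lone CR line endings to \n
    .replace(/[^\S\n]+/g, ' ')   // collapse runs of non-newline whitespace to one space
    .replace(/ ?\n ?/g, '\n')    // drop spaces directly before or after a line break
    .replace(/\n{3,}/g, '\n\n')  // allow at most one blank line in a row
    .trim();
}

const sample = '  Title\r\n\r\n\r\nBody   text\there \n';
console.log(JSON.stringify(cleanExtractedText(sample)));
// prints "Title\n\nBody text here"
```

Keeping line breaks preserves some of the paper's paragraph structure, which may help downstream reading or chunking. A caller that wants a single line of text could replace the remaining newlines with spaces afterwards.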

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Mnehmos/mnehmos.arxiv.mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.