Glama

arXiv MCP Server

by Mnehmos

get_paper_content

Download and extract the full text of an arXiv paper, given its arXiv ID, for research or analysis.

Instructions

Get the full text content of a paper by downloading and extracting text from its PDF

Input Schema

Name      Required  Description                                      Default
paper_id  Yes       arXiv paper ID (e.g., 2104.13478 or cs/0001001)

Input Schema (JSON Schema)

{
  "properties": {
    "paper_id": {
      "description": "arXiv paper ID (e.g., 2104.13478 or cs/0001001)",
      "type": "string"
    }
  },
  "required": [
    "paper_id"
  ],
  "type": "object"
}
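The schema only requires `paper_id` to be a string, so a caller may want to check the identifier's shape before invoking the tool. The following is a minimal sketch of such a check, not part of the server: the helper name `isLikelyArxivId` and both regular expressions are assumptions, based on the two identifier styles the description mentions (new style such as 2104.13478, old style such as cs/0001001), each with an optional version suffix.

```typescript
// Hypothetical client-side helper; the server itself only checks that
// paper_id is a string.

// New-style IDs: YYMM.NNNN or YYMM.NNNNN, optional version (e.g. 2104.13478v2).
const NEW_STYLE_ID = /^\d{4}\.\d{4,5}(v\d+)?$/;

// Old-style IDs: archive, optional subject class, slash, seven digits,
// optional version (e.g. cs/0001001, math.GT/0309136).
const OLD_STYLE_ID = /^[a-z-]+(\.[A-Z]{2})?\/\d{7}(v\d+)?$/;

function isLikelyArxivId(paperId: string): boolean {
  return NEW_STYLE_ID.test(paperId) || OLD_STYLE_ID.test(paperId);
}
```

A check like this also rejects inputs such as `../../etc/passwd` before they are interpolated into a URL or a file name.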

Implementation Reference

  • The main execution handler for the 'get_paper_content' tool. It constructs the PDF URL, downloads the PDF, extracts the text with pdf-parse, cleans it up, and returns it as MCP text content. Failures are returned as MCP results with isError set to true, with separate messages for axios (HTTP) errors and all other errors.
    private async getPaperContent(args: GetPaperContentArgs) {
      try {
        // Construct the PDF URL directly
        // arXiv PDF URLs follow the pattern: https://arxiv.org/pdf/{paper_id}.pdf
        const pdfUrl = `https://arxiv.org/pdf/${args.paper_id}.pdf`;

        // Download the PDF
        const pdfPath = await this.downloadPdf(pdfUrl, args.paper_id);

        // Extract text from the PDF
        const textContent = await this.extractTextFromPdf(pdfPath);

        // Clean up the text: normalize line breaks first, then collapse runs of
        // other whitespace. Collapsing /\s+/ first would also swallow the line
        // breaks and leave nothing for the line-break replace to normalize.
        const cleanedText = textContent
          .replace(/\r\n|\r/g, '\n')
          .replace(/[^\S\n]+/g, ' ')
          .trim();

        // Return the extracted text
        return {
          content: [
            {
              type: 'text',
              text: cleanedText,
            },
          ],
        };
      } catch (error) {
        console.error('Error in getPaperContent:', error);

        // HTTP-level failures reported by axios
        if (axios.isAxiosError(error)) {
          return {
            content: [
              {
                type: 'text',
                text: `Error retrieving paper content: ${error.response?.data || error.message}`,
              },
            ],
            isError: true,
          };
        }

        // Any other failure (for example, PDF parsing)
        return {
          content: [
            {
              type: 'text',
              text: `Error processing paper content: ${error instanceof Error ? error.message : String(error)}`,
            },
          ],
          isError: true,
        };
      }
    }
  • TypeScript interface defining the input arguments for the get_paper_content tool.
    interface GetPaperContentArgs {
      paper_id: string;
    }
  • src/index.ts:201-214 (registration)
    Tool registration in the ListTools handler, including name, description, and input schema.
    {
      name: 'get_paper_content',
      description: 'Get the full text content of a paper by downloading and extracting text from its PDF',
      inputSchema: {
        type: 'object',
        properties: {
          paper_id: {
            type: 'string',
            description: 'arXiv paper ID (e.g., 2104.13478 or cs/0001001)',
          },
        },
        required: ['paper_id'],
      },
    },
  • src/index.ts:239-246 (registration)
    Dispatch case in the CallToolRequestSchema handler that validates input and calls the getPaperContent method.
    case 'get_paper_content':
      if (!request.params.arguments || typeof request.params.arguments.paper_id !== 'string') {
        throw new McpError(
          ErrorCode.InvalidParams,
          'Missing or invalid paper_id parameter'
        );
      }
      return await this.getPaperContent(request.params.arguments as unknown as GetPaperContentArgs);
  • Helper function to extract text content from a downloaded PDF file using the pdf-parse library.
    private async extractTextFromPdf(pdfPath: string): Promise<string> {
      try {
        // Read the PDF file
        const dataBuffer = await fs.readFile(pdfPath);

        // Dynamically import pdf-parse
        const pdfParse = (await import('pdf-parse')).default;

        // Parse the PDF
        const data = await pdfParse(dataBuffer);

        // Return the text content
        return data.text;
      } catch (error) {
        console.error('Error extracting text from PDF:', error);
        throw new Error(`Failed to extract text from PDF: ${error instanceof Error ? error.message : String(error)}`);
      }
    }
  • Helper function to download the PDF from arXiv URL to a temporary location, with caching support.
    private async downloadPdf(url: string, paperId: string): Promise<string> {
      try {
        // Ensure temp directory exists
        await fs.ensureDir(TEMP_PDF_DIR);

        // Create a unique filename based on the paper ID
        const sanitizedPaperId = paperId.replace(/\//g, '_');
        const pdfPath = path.join(TEMP_PDF_DIR, `${sanitizedPaperId}.pdf`);

        // Check if we already have this PDF cached
        if (await fs.pathExists(pdfPath)) {
          console.error(`Using cached PDF for ${paperId}`);
          return pdfPath;
        }

        console.error(`Downloading PDF for ${paperId} from ${url}`);

        // Download the PDF with proper headers
        // Note: Using responseType 'arraybuffer' to handle binary data
        const response = await axios.get(url, {
          responseType: 'arraybuffer',
          headers: {
            'User-Agent': 'arXiv-MCP-Server/0.1.0 (https://github.com/your-username/arxiv-mcp-server)',
          },
          // Add a timeout to prevent hanging on large files
          timeout: 30000,
        });

        // Save the PDF to disk
        await fs.outputFile(pdfPath, response.data);

        return pdfPath;
      } catch (error) {
        console.error('Error downloading PDF:', error);
        throw new Error(`Failed to download PDF: ${error instanceof Error ? error.message : String(error)}`);
      }
    }
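The side-effect-free steps in the listings above (building the PDF URL, deriving the cache file name, and cleaning the extracted text) can be tried in isolation. The sketch below is illustrative only: the function names `buildPdfUrl`, `cacheFileName`, and `cleanText` are hypothetical and do not appear in the server. Note that `cleanText` normalizes line endings before collapsing spaces and tabs, because collapsing every whitespace run first would leave no line breaks to normalize.

```typescript
// Hypothetical standalone versions of the handler's pure steps.

// Same URL pattern as getPaperContent: https://arxiv.org/pdf/{paper_id}.pdf
function buildPdfUrl(paperId: string): string {
  return `https://arxiv.org/pdf/${paperId}.pdf`;
}

// Same sanitization as downloadPdf: old-style IDs contain '/', which is
// replaced with '_' so the ID can serve as a single file name.
function cacheFileName(paperId: string): string {
  return `${paperId.replace(/\//g, '_')}.pdf`;
}

// Convert \r\n and \r to \n, collapse runs of non-newline whitespace
// into one space, then trim both ends.
function cleanText(raw: string): string {
  return raw
    .replace(/\r\n|\r/g, '\n')
    .replace(/[^\S\n]+/g, ' ')
    .trim();
}

console.log(buildPdfUrl('2104.13478'));
console.log(cacheFileName('cs/0001001'));
console.log(JSON.stringify(cleanText('  a \t b\r\nc\rd  ')));
```

Keeping these steps as small pure functions makes them easy to unit test without touching the network or the file system.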

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Mnehmos/arxiv-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.