Mnehmos

arXiv MCP Server

get_paper_content

Download and extract the full text of an arXiv paper, given its arXiv ID, for research or analysis.

Instructions

Get the full text content of a paper by downloading and extracting text from its PDF

Input Schema

Name      Required  Description                                      Default
paper_id  Yes       arXiv paper ID (e.g., 2104.13478 or cs/0001001)
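
As an illustration of the schema above, here is a minimal sketch of how a client might check a `paper_id` and build the arguments for a `tools/call` request. The two regular expressions are assumptions based on arXiv's documented identifier styles (new-style `YYMM.NNNNN` and old-style `archive/YYMMNNN`); the server itself only checks that `paper_id` is a string. The helper names `isLikelyArxivId` and `buildGetPaperContentRequest`, and the `requestId` parameter, are hypothetical.

```typescript
// Assumed patterns for the two arXiv ID styles, each with an optional version suffix.
const NEW_STYLE_ID = /^\d{4}\.\d{4,5}(v\d+)?$/; // e.g. 2104.13478
const OLD_STYLE_ID = /^[a-z-]+(\.[A-Z]{2})?\/\d{7}(v\d+)?$/; // e.g. cs/0001001

// Hypothetical helper: true if the string looks like an arXiv ID.
function isLikelyArxivId(id: string): boolean {
  return NEW_STYLE_ID.test(id) || OLD_STYLE_ID.test(id);
}

// Shape of a JSON-RPC 2.0 "tools/call" request as used by MCP clients.
interface ToolCallRequest {
  jsonrpc: '2.0';
  id: number;
  method: 'tools/call';
  params: {
    name: string;
    arguments: { paper_id: string };
  };
}

// Hypothetical helper: validates the ID, then wraps it in a tools/call request.
function buildGetPaperContentRequest(paperId: string, requestId: number): ToolCallRequest {
  if (!isLikelyArxivId(paperId)) {
    throw new Error(`Not a recognizable arXiv ID: ${paperId}`);
  }
  return {
    jsonrpc: '2.0',
    id: requestId,
    method: 'tools/call',
    params: {
      name: 'get_paper_content',
      arguments: { paper_id: paperId },
    },
  };
}
```

Checking the ID on the client side is optional, but it avoids a round trip and a PDF download attempt for input that cannot match a paper.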

Implementation Reference

  • The main execution handler for the 'get_paper_content' tool. It constructs the PDF URL, downloads the PDF, extracts the text with pdf-parse, cleans up whitespace, and returns the result in the MCP content format. On failure it logs the error and returns an MCP result with isError: true rather than throwing.
    private async getPaperContent(args: GetPaperContentArgs) {
      try {
        // Construct the PDF URL directly
        // arXiv PDF URLs follow the pattern: https://arxiv.org/pdf/{paper_id}.pdf
        const pdfUrl = `https://arxiv.org/pdf/${args.paper_id}.pdf`;
        
        // Download the PDF
        const pdfPath = await this.downloadPdf(pdfUrl, args.paper_id);
        
        // Extract text from the PDF
        const textContent = await this.extractTextFromPdf(pdfPath);
        
        // Clean up the text: normalize line breaks first, then collapse runs of
        // spaces and tabs (collapsing /\s+/ first would also swallow every newline)
        const cleanedText = textContent
          .replace(/\r\n|\r/g, '\n')
          .replace(/[ \t]+/g, ' ')
          .trim();
        
        // Return the extracted text
        return {
          content: [
            {
              type: 'text',
              text: cleanedText,
            },
          ],
        };
      } catch (error) {
        console.error('Error in getPaperContent:', error);
        
        if (axios.isAxiosError(error)) {
          return {
            content: [
              {
                type: 'text',
                text: `Error retrieving paper content: ${error.response?.data || error.message}`,
              },
            ],
            isError: true,
          };
        }
        
        return {
          content: [
            {
              type: 'text',
              text: `Error processing paper content: ${error instanceof Error ? error.message : String(error)}`,
            },
          ],
          isError: true,
        };
      }
    }
  • TypeScript interface defining the input arguments for the get_paper_content tool.
    interface GetPaperContentArgs {
      paper_id: string;
    }
  • src/index.ts:201-214 (registration)
    Tool registration in the ListTools handler, including name, description, and input schema.
    {
      name: 'get_paper_content',
      description: 'Get the full text content of a paper by downloading and extracting text from its PDF',
      inputSchema: {
        type: 'object',
        properties: {
          paper_id: {
            type: 'string',
            description: 'arXiv paper ID (e.g., 2104.13478 or cs/0001001)',
          },
        },
        required: ['paper_id'],
      },
    },
  • src/index.ts:239-246 (registration)
    Dispatch case in the CallToolRequestSchema handler that validates input and calls the getPaperContent method.
    case 'get_paper_content':
      if (!request.params.arguments || typeof request.params.arguments.paper_id !== 'string') {
        throw new McpError(
          ErrorCode.InvalidParams,
          'Missing or invalid paper_id parameter'
        );
      }
      return await this.getPaperContent(request.params.arguments as unknown as GetPaperContentArgs);
  • Helper function to extract text content from a downloaded PDF file using the pdf-parse library.
    private async extractTextFromPdf(pdfPath: string): Promise<string> {
      try {
        // Read the PDF file
        const dataBuffer = await fs.readFile(pdfPath);
    
        // Dynamically import pdf-parse
        const pdfParse = (await import('pdf-parse')).default;
    
        // Parse the PDF
        const data = await pdfParse(dataBuffer);
    
        // Return the text content
        return data.text;
      } catch (error) {
        console.error('Error extracting text from PDF:', error);
        throw new Error(`Failed to extract text from PDF: ${error instanceof Error ? error.message : String(error)}`);
      }
    }
  • Helper function to download the PDF from arXiv URL to a temporary location, with caching support.
    private async downloadPdf(url: string, paperId: string): Promise<string> {
      try {
        // Ensure temp directory exists
        await fs.ensureDir(TEMP_PDF_DIR);
        
        // Create a unique filename based on the paper ID
        const sanitizedPaperId = paperId.replace(/\//g, '_');
        const pdfPath = path.join(TEMP_PDF_DIR, `${sanitizedPaperId}.pdf`);
        
        // Check if we already have this PDF cached
        if (await fs.pathExists(pdfPath)) {
          console.error(`Using cached PDF for ${paperId}`);
          return pdfPath;
        }
        
        console.error(`Downloading PDF for ${paperId} from ${url}`);
        
        // Download the PDF with proper headers
        // Note: Using responseType 'arraybuffer' to handle binary data
        const response = await axios.get(url, {
          responseType: 'arraybuffer',
          headers: {
            'User-Agent': 'arXiv-MCP-Server/0.1.0 (https://github.com/your-username/arxiv-mcp-server)',
          },
          // Add a timeout to prevent hanging on large files
          timeout: 30000,
        });
        
        // Save the PDF to disk
        await fs.outputFile(pdfPath, response.data);
        
        return pdfPath;
      } catch (error) {
        console.error('Error downloading PDF:', error);
        throw new Error(`Failed to download PDF: ${error instanceof Error ? error.message : String(error)}`);
      }
    }
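
The dispatch case above checks `paper_id` inline and then casts the arguments through `unknown`. One alternative, sketched here as an assumption rather than as the server's actual code, is a type guard that lets the compiler narrow the value without a cast. The names `isGetPaperContentArgs` and `parseGetPaperContentArgs` are hypothetical, and a plain `Error` stands in for `McpError` so the example stays self-contained.

```typescript
// Mirrors the GetPaperContentArgs interface shown above.
interface GetPaperContentArgs {
  paper_id: string;
}

// Hypothetical type guard: performs the same check as the dispatch case
// (arguments present, paper_id is a string) and narrows the type on success.
function isGetPaperContentArgs(value: unknown): value is GetPaperContentArgs {
  return (
    typeof value === 'object' &&
    value !== null &&
    typeof (value as Record<string, unknown>).paper_id === 'string'
  );
}

// Hypothetical wrapper: returns typed arguments or throws.
// The real handler throws McpError with ErrorCode.InvalidParams instead.
function parseGetPaperContentArgs(raw: unknown): GetPaperContentArgs {
  if (!isGetPaperContentArgs(raw)) {
    throw new Error('Missing or invalid paper_id parameter');
  }
  return raw; // already narrowed, no cast needed
}
```

With a guard like this, the dispatch case could pass `parseGetPaperContentArgs(request.params.arguments)` straight to `getPaperContent`, keeping the validation and the type in one place.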

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Mnehmos/arxiv-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.