get_paper_content
Extracts the full text of an arXiv paper from its PDF, given the paper ID, so the complete document can be read or analyzed.
Instructions
Get the full text content of a paper by downloading and extracting text from its PDF
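
As a rough illustration of how a client might invoke this tool, the sketch below builds a JSON-RPC `tools/call` request in the shape the Model Context Protocol uses. The helper name `buildGetPaperContentRequest` and the default request `id` are assumptions made for this example and are not part of the server.

```typescript
// Shape of an MCP `tools/call` request addressed to this tool.
interface ToolCallRequest {
  jsonrpc: '2.0';
  id: number;
  method: 'tools/call';
  params: {
    name: string;
    arguments: { paper_id: string };
  };
}

// Hypothetical helper: builds the request body a client would send.
function buildGetPaperContentRequest(paperId: string, id = 1): ToolCallRequest {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: {
      name: 'get_paper_content',
      arguments: { paper_id: paperId },
    },
  };
}

const request = buildGetPaperContentRequest('2104.13478');
console.log(JSON.stringify(request, null, 2));
```

On success, the extracted text comes back in the first `content` entry of the result. On failure, the result carries `isError: true` and an error message in the same place.
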
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| paper_id | Yes | arXiv paper ID (e.g., 2104.13478 or cs/0001001) | |
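
The schema only requires `paper_id` to be a string, so a malformed ID is not caught until the download fails. A stricter check could look like the sketch below. The two regular expressions are an assumption about arXiv's new-style (`2104.13478`) and old-style (`cs/0001001`) identifier formats, and `isGetPaperContentArgs` is a hypothetical helper, not something the server defines.

```typescript
// New-style IDs: YYMM.NNNN or YYMM.NNNNN, optionally followed by a version (e.g. v2).
const NEW_STYLE_ID = /^\d{4}\.\d{4,5}(v\d+)?$/;
// Old-style IDs: archive[.SC]/YYMMNNN, e.g. cs/0001001 or math.GT/0309136.
const OLD_STYLE_ID = /^[a-z-]+(\.[A-Z]{2})?\/\d{7}(v\d+)?$/;

interface GetPaperContentArgs {
  paper_id: string;
}

// Hypothetical type guard: checks both the object shape and the ID format.
function isGetPaperContentArgs(value: unknown): value is GetPaperContentArgs {
  if (typeof value !== 'object' || value === null) return false;
  const id = (value as { paper_id?: unknown }).paper_id;
  return typeof id === 'string' && (NEW_STYLE_ID.test(id) || OLD_STYLE_ID.test(id));
}

for (const candidate of [{ paper_id: '2104.13478' }, { paper_id: 'cs/0001001' }, { paper_id: 42 }]) {
  console.log(JSON.stringify(candidate), isGetPaperContentArgs(candidate));
}
```

A guard of this kind narrows the type for the compiler, which would make a cast through `unknown` unnecessary at the call site.
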
Implementation Reference
- src/index.ts:558-610 (handler)
  The primary handler function for the 'get_paper_content' tool. Downloads the PDF from arXiv, extracts text content using the pdf-parse library, cleans the text, and returns it in the MCP response format.
  ```typescript
  private async getPaperContent(args: GetPaperContentArgs) {
    try {
      // Construct the PDF URL directly
      // arXiv PDF URLs follow the pattern: https://arxiv.org/pdf/{paper_id}.pdf
      const pdfUrl = `https://arxiv.org/pdf/${args.paper_id}.pdf`;

      // Download the PDF
      const pdfPath = await this.downloadPdf(pdfUrl, args.paper_id);

      // Extract text from the PDF
      const textContent = await this.extractTextFromPdf(pdfPath);

      // Clean up the text (remove excessive whitespace, normalize line breaks)
      const cleanedText = textContent
        .replace(/\s+/g, ' ')
        .replace(/(\r\n|\n|\r)/gm, '\n')
        .trim();

      // Return the extracted text
      return {
        content: [
          {
            type: 'text',
            text: cleanedText,
          },
        ],
      };
    } catch (error) {
      console.error('Error in getPaperContent:', error);
      if (axios.isAxiosError(error)) {
        return {
          content: [
            {
              type: 'text',
              text: `Error retrieving paper content: ${error.response?.data || error.message}`,
            },
          ],
          isError: true,
        };
      }
      return {
        content: [
          {
            type: 'text',
            text: `Error processing paper content: ${error instanceof Error ? error.message : String(error)}`,
          },
        ],
        isError: true,
      };
    }
  }
  ```
- src/index.ts:59-61 (schema)
  TypeScript interface defining the input arguments for the get_paper_content tool.
  ```typescript
  interface GetPaperContentArgs {
    paper_id: string;
  }
  ```
- src/index.ts:201-214 (registration)
  Tool registration in the ListToolsRequestSchema handler. Defines the tool name, description, and JSON schema for input validation.
  ```typescript
  {
    name: 'get_paper_content',
    description: 'Get the full text content of a paper by downloading and extracting text from its PDF',
    inputSchema: {
      type: 'object',
      properties: {
        paper_id: {
          type: 'string',
          description: 'arXiv paper ID (e.g., 2104.13478 or cs/0001001)',
        },
      },
      required: ['paper_id'],
    },
  },
  ```
- src/index.ts:239-246 (registration)
  Dispatch logic in the CallToolRequestSchema handler that validates input and invokes the getPaperContent method.
  ```typescript
  case 'get_paper_content':
    if (!request.params.arguments || typeof request.params.arguments.paper_id !== 'string') {
      throw new McpError(
        ErrorCode.InvalidParams,
        'Missing or invalid paper_id parameter'
      );
    }
    return await this.getPaperContent(request.params.arguments as unknown as GetPaperContentArgs);
  ```
- src/index.ts:534-551 (helper)
  Helper function to extract text content from a downloaded PDF file using the pdf-parse library.
  ```typescript
  private async extractTextFromPdf(pdfPath: string): Promise<string> {
    try {
      // Read the PDF file
      const dataBuffer = await fs.readFile(pdfPath);

      // Dynamically import pdf-parse
      const pdfParse = (await import('pdf-parse')).default;

      // Parse the PDF
      const data = await pdfParse(dataBuffer);

      // Return the text content
      return data.text;
    } catch (error) {
      console.error('Error extracting text from PDF:', error);
      throw new Error(`Failed to extract text from PDF: ${error instanceof Error ? error.message : String(error)}`);
    }
  }
  ```
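
One detail of the handler's cleanup step is worth illustrating. Because `\s` also matches `\r` and `\n`, the first `replace(/\s+/g, ' ')` turns every line break into a space. The line-break normalization that follows then has nothing left to match, and the returned text is a single line. If paragraph structure should survive, one possible reordering is sketched below. `cleanExtractedText` is a hypothetical name, and this is an alternative under that assumption rather than the server's implementation.

```typescript
// Hypothetical alternative to the handler's cleanup chain that keeps paragraph breaks.
function cleanExtractedText(raw: string): string {
  return raw
    .replace(/\r\n?/g, '\n')     // normalize CRLF and lone CR to LF first
    .replace(/[^\S\n]+/g, ' ')   // collapse runs of spaces/tabs, leaving newlines alone
    .replace(/ ?\n ?/g, '\n')    // drop the spaces hugging each line break
    .replace(/\n{3,}/g, '\n\n')  // keep at most one blank line in a row
    .trim();
}

// The chain from the handler, reproduced for comparison.
function handlerCleanup(raw: string): string {
  return raw.replace(/\s+/g, ' ').replace(/(\r\n|\n|\r)/gm, '\n').trim();
}

const sample = 'Title \r\n\r\n\r\n  First   line\r\nsecond\tline  ';
console.log(JSON.stringify(handlerCleanup(sample)));
// "Title First line second line"
console.log(JSON.stringify(cleanExtractedText(sample)));
// "Title\n\nFirst line\nsecond line"
```

Which behavior is preferable depends on the consumer: a single-line result is compact, while preserved breaks keep headings and paragraphs distinguishable.
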