# read_arxiv_paper
Download arXiv papers and convert them to Markdown format for easy reading and text extraction.
## Instructions
Download and extract the full text of an arXiv paper as Markdown.

**Args:**

- `paper_id`: arXiv ID (e.g., `'2106.12345'`).
- `save_path`: Directory to save the PDF (default: `~/paper_downloads`).

**Returns:** Full paper text in Markdown format.

**Example:** `read_arxiv_paper("2106.12345")`
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| paper_id | Yes | arXiv ID (e.g., `'2106.12345'`). | |
| save_path | No | Directory to save the PDF. | `~/paper_downloads` |
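Over the wire, an MCP client invokes this tool with a JSON-RPC `tools/call` request. A minimal request matching the schema above might look like the following (the `id` value is arbitrary, and `save_path` may be omitted to use the default):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "read_arxiv_paper",
    "arguments": {
      "paper_id": "2106.12345",
      "save_path": "~/paper_downloads"
    }
  }
}
```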
## Implementation Reference
- `paper_find_mcp/server.py:211-225` (registration & handler) — registers `read_arxiv_paper` via the `@mcp.tool()` decorator, which derives the input schema from the type hints and docstring; the handler itself delegates to the generic `_read` helper.

  ```python
  @mcp.tool()
  async def read_arxiv_paper(paper_id: str, save_path: Optional[str] = None) -> str:
      """Download and extract full text from arXiv paper as Markdown.

      Args:
          paper_id: arXiv ID (e.g., '2106.12345').
          save_path: Directory to save PDF (default: ~/paper_downloads).

      Returns:
          Full paper text in Markdown format.

      Example:
          read_arxiv_paper("2106.12345")
      """
      return await _read('arxiv', paper_id, save_path)
  ```
- `paper_find_mcp/server.py:137-157` (helper) — generic `_read` helper that resolves the appropriate searcher (`ArxivSearcher` for `'arxiv'`) and calls its `read_paper` method.

  ```python
  async def _read(
      searcher_name: str,
      paper_id: str,
      save_path: Optional[str] = None
  ) -> str:
      """Generic read helper."""
      if save_path is None:
          save_path = get_download_path()

      searcher = SEARCHERS.get(searcher_name)
      if not searcher:
          return f"Error: Unknown searcher {searcher_name}"

      try:
          return searcher.read_paper(paper_id, save_path)
      except NotImplementedError as e:
          return str(e)
      except Exception as e:
          logger.error(f"Read failed for {searcher_name}: {e}")
          return f"Error reading paper: {str(e)}"
  ```
- Core implementation in `ArxivSearcher.read_paper`: ensures the PDF is downloaded, then extracts it to Markdown with `pymupdf4llm.to_markdown` (in `_extract_markdown`).

  ```python
  def read_paper(
      self,
      paper_id: str,
      save_path: str,
      output_format: Literal["markdown", "text"] = "markdown",
      table_strategy: Literal["lines_strict", "lines", "text", "explicit"] = "lines_strict",
      pages: Optional[List[int]] = None
  ) -> str:
      """Read a paper and extract its content.

      Uses PyMuPDF4LLM for high-quality text extraction, supporting:
      - Markdown output (recommended; LLM-friendly)
      - Automatic conversion of tables to Markdown tables
      - Multiple table-detection strategies

      Args:
          paper_id: arXiv paper ID.
          save_path: Directory where the PDF is stored.
          output_format: Output format.
              - "markdown": Markdown (recommended; includes tables)
              - "text": plain text
          table_strategy: Table-detection strategy.
              - "lines_strict": strict mode; only detect fully bordered tables
              - "lines": line mode; detect partially bordered tables
              - "text": text mode; alignment-based detection (for borderless tables)
              - "explicit": explicit mode; only detect explicitly marked tables
          pages: List of pages to extract (0-indexed); None means all pages.

      Returns:
          str: Extracted paper content.
      """
      # Make sure the PDF has been downloaded
      pdf_path = self._ensure_pdf_downloaded(paper_id, save_path)

      if output_format == "markdown":
          return self._extract_markdown(pdf_path, table_strategy, pages)
      else:
          return self._extract_text(pdf_path, pages)
  ```
- `ArxivSearcher.download_pdf` — downloads the arXiv PDF; called indirectly by `read_paper` via `_ensure_pdf_downloaded`.

  ```python
  def download_pdf(self, paper_id: str, save_path: str) -> str:
      """Download an arXiv paper PDF.

      Args:
          paper_id: arXiv paper ID (e.g., '2106.12345').
          save_path: Target directory.

      Returns:
          str: Path to the PDF file.

      Raises:
          RuntimeError: If the download fails.
      """
      # Make sure the target directory exists
      os.makedirs(save_path, exist_ok=True)

      # Build the file path; handle IDs with a version suffix
      # (e.g. 2106.12345v2) and sanitize '/' and ':'
      safe_id = paper_id.replace('/', '_').replace(':', '_')
      output_file = os.path.join(save_path, f"{safe_id}.pdf")

      # Reuse the cached file if it already exists
      if os.path.exists(output_file):
          logger.info(f"PDF already exists: {output_file}")
          return output_file

      # Download the PDF
      pdf_url = f"https://arxiv.org/pdf/{paper_id}.pdf"
      try:
          response = requests.get(pdf_url, timeout=60)
          response.raise_for_status()
          with open(output_file, 'wb') as f:
              f.write(response.content)
          logger.info(f"PDF downloaded: {output_file}")
          return output_file
      except requests.RequestException as e:
          raise RuntimeError(f"Failed to download PDF: {e}")
  ```
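The `safe_id` substitution in `download_pdf` matters for legacy arXiv IDs (e.g. `hep-th/9901001`), whose slash would otherwise be treated as a directory separator. A minimal standalone sketch of that path logic (the helper name `pdf_cache_path` is hypothetical, not part of the source):

```python
import os

def pdf_cache_path(paper_id: str, save_path: str) -> str:
    # Mirrors the filename logic in download_pdf: '/' and ':' in the
    # arXiv ID are replaced with '_' so the ID is a safe flat filename.
    safe_id = paper_id.replace('/', '_').replace(':', '_')
    return os.path.join(save_path, f"{safe_id}.pdf")

# Versioned and legacy IDs both map to flat filenames (POSIX paths shown):
print(pdf_cache_path("2106.12345v2", "/tmp/papers"))    # /tmp/papers/2106.12345v2.pdf
print(pdf_cache_path("hep-th/9901001", "/tmp/papers"))  # /tmp/papers/hep-th_9901001.pdf
```

Because the cache key includes any version suffix, `2106.12345` and `2106.12345v2` are downloaded and cached as distinct files.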