read_semantic_paper
Extract full text from open-access academic papers using Semantic Scholar IDs, converting PDFs to Markdown format for analysis and reference.
Instructions
Read paper via Semantic Scholar (open-access only, use as LAST RESORT).
DOWNLOAD PRIORITY (try in order):
1. If arXiv paper -> use read_arxiv_paper(arxiv_id)
2. If published before 2023 -> use read_scihub_paper(doi)
3. Use this tool as last resort
Args:
paper_id: Semantic Scholar ID or prefixed ID (DOI:, ARXIV:, PMID:).
save_path: Directory to save PDF.
Returns:
Full paper text in Markdown format.
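The download priority above can be sketched as a small selection function. This is only an illustration: `choose_reader` and its parameters are hypothetical and not part of the server. It returns the name of the tool to call and the identifier to pass, using the documented `DOI:` prefix when falling back to this tool.

```python
from typing import Optional, Tuple


def choose_reader(
    arxiv_id: Optional[str] = None,
    doi: Optional[str] = None,
    year: Optional[int] = None,
    semantic_id: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (tool_name, identifier) following the documented priority.

    Hypothetical helper: the server itself does not expose this function.
    """
    # 1. arXiv papers go to the arXiv reader.
    if arxiv_id:
        return ("read_arxiv_paper", arxiv_id)
    # 2. Papers published before 2023 with a DOI go to the Sci-Hub reader.
    if doi and year is not None and year < 2023:
        return ("read_scihub_paper", doi)
    # 3. Last resort: Semantic Scholar, by native ID or prefixed DOI.
    if semantic_id:
        return ("read_semantic_paper", semantic_id)
    if doi:
        return ("read_semantic_paper", f"DOI:{doi}")
    raise ValueError("no usable identifier was provided")
```

For example, `choose_reader(doi="10.1234/example", year=2024)` selects `read_semantic_paper` with the ID `DOI:10.1234/example` (the DOI here is a placeholder).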
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| paper_id | Yes | Semantic Scholar ID or prefixed ID (DOI:, ARXIV:, PMID:). | |
| save_path | No | Directory to save PDF. | None |
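As a rough illustration of the schema, the sketch below builds the JSON-RPC `tools/call` request an MCP client would send for this tool. `build_tool_call` is a hypothetical helper, the DOI is a placeholder, and a real client library would normally construct this message for you.

```python
import json
from typing import Any, Dict, Optional


def build_tool_call(
    paper_id: str,
    save_path: Optional[str] = None,
    request_id: int = 1,
) -> Dict[str, Any]:
    """Build a tools/call request for read_semantic_paper (hypothetical helper)."""
    # paper_id is the only required argument in the schema.
    if not paper_id or not paper_id.strip():
        raise ValueError("paper_id is required")
    arguments: Dict[str, Any] = {"paper_id": paper_id.strip()}
    # save_path is optional; omit it to let the server pick its default directory.
    if save_path is not None:
        arguments["save_path"] = save_path
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "read_semantic_paper", "arguments": arguments},
    }


def to_wire(request: Dict[str, Any]) -> str:
    """Serialize the request the way it would travel over the transport."""
    return json.dumps(request)
```

Calling `to_wire(build_tool_call("DOI:10.1234/example"))` yields a request whose `arguments` object contains only `paper_id`.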
Implementation Reference
- paper_find_mcp/server.py:602-618 (handler): MCP tool handler for 'read_semantic_paper'. This is the primary entry point, decorated with @mcp.tool(). It delegates execution to the generic _read helper using the 'semantic' searcher instance.

  ```python
  @mcp.tool()
  async def read_semantic_paper(paper_id: str, save_path: Optional[str] = None) -> str:
      """Read paper via Semantic Scholar (open-access only, use as LAST RESORT).

      DOWNLOAD PRIORITY (try in order):
      1. If arXiv paper -> use read_arxiv_paper(arxiv_id)
      2. If published before 2023 -> use read_scihub_paper(doi)
      3. Use this tool as last resort

      Args:
          paper_id: Semantic Scholar ID or prefixed ID (DOI:, ARXIV:, PMID:).
          save_path: Directory to save PDF.

      Returns:
          Full paper text in Markdown format.
      """
      return await _read('semantic', paper_id, save_path)
  ```
- paper_find_mcp/server.py:137-157 (helper): Generic helper function _read that retrieves the searcher instance from the SEARCHERS dict and invokes searcher.read_paper(paper_id, save_path). Used by all platform-specific read_*_paper tools.

  ```python
  async def _read(
      searcher_name: str,
      paper_id: str,
      save_path: Optional[str] = None
  ) -> str:
      """Generic read function."""
      if save_path is None:
          save_path = get_download_path()

      searcher = SEARCHERS.get(searcher_name)
      if not searcher:
          return f"Error: Unknown searcher {searcher_name}"

      try:
          return searcher.read_paper(paper_id, save_path)
      except NotImplementedError as e:
          return str(e)
      except Exception as e:
          logger.error(f"Read failed for {searcher_name}: {e}")
          return f"Error reading paper: {str(e)}"
  ```
- Core implementation logic in SemanticSearcher.read_paper(). It downloads the open-access PDF (located via get_paper_details() and pdf_url), extracts the full text to Markdown using pymupdf4llm.to_markdown(), and prepends a metadata header.

  ```python
  def read_paper(self, paper_id: str, save_path: str) -> str:
      """Download the paper and extract its text.

      Uses PyMuPDF4LLM to extract Markdown.

      Args:
          paper_id: Paper ID
          save_path: Directory to save to

      Returns:
          The extracted text, or an error message
      """
      # Download the PDF first
      pdf_path = self.download_pdf(paper_id, save_path)
      if pdf_path.startswith("Error"):
          return pdf_path

      # Fetch the paper metadata
      paper = self.get_paper_details(paper_id)

      try:
          text = pymupdf4llm.to_markdown(pdf_path, show_progress=False)
          logger.info(f"Extracted {len(text)} characters using PyMuPDF4LLM")

          if not text.strip():
              return f"PDF downloaded to {pdf_path}, but no text could be extracted."

          # Prepend metadata
          metadata = ""
          if paper:
              metadata = f"# {paper.title}\n\n"
              metadata += f"**Authors**: {', '.join(paper.authors)}\n"
              metadata += f"**Published**: {paper.published_date}\n"
              metadata += f"**URL**: {paper.url}\n"
              metadata += f"**PDF**: {pdf_path}\n\n"
              metadata += "---\n\n"

          return metadata + text
      except Exception as e:
          logger.error(f"Failed to extract text: {e}")
          return f"Error extracting text: {e}"
  ```
- paper_find_mcp/server.py:75-85 (registration): Global SEARCHERS dictionary where the 'semantic': SemanticSearcher() instance is registered. The _read and _download helpers use it to dispatch to platform-specific implementations.

  ```python
  SEARCHERS = {
      'arxiv': ArxivSearcher(),
      'pubmed': PubMedSearcher(),
      'biorxiv': BioRxivSearcher(),
      'medrxiv': MedRxivSearcher(),
      'google_scholar': GoogleScholarSearcher(),
      'iacr': IACRSearcher(),
      'semantic': SemanticSearcher(),
      'crossref': CrossRefSearcher(),
      'repec': RePECSearcher(),
  }
  ```
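
The registry-plus-helper pattern referenced above can be tried in isolation with stub searchers. The sketch below is a simplified stand-in, not the server's code: `StubSearcher`, `DownloadOnlySearcher`, `REGISTRY`, `read_via_registry`, and the `./downloads` default are all hypothetical. It shows how an unknown searcher name, a `NotImplementedError`, and any other exception each become a returned string rather than a raised error.

```python
import asyncio
from typing import Dict, Optional


class StubSearcher:
    """Hypothetical stand-in for a platform searcher such as SemanticSearcher."""

    def read_paper(self, paper_id: str, save_path: str) -> str:
        if not paper_id:
            raise ValueError("empty paper_id")
        return f"# Stub paper {paper_id}\n\nsaved under {save_path}"


class DownloadOnlySearcher:
    """Hypothetical searcher that does not support reading."""

    def read_paper(self, paper_id: str, save_path: str) -> str:
        raise NotImplementedError("This platform does not support reading papers.")


# Assumption: a tiny registry playing the role of SEARCHERS.
REGISTRY: Dict[str, object] = {
    "stub": StubSearcher(),
    "download_only": DownloadOnlySearcher(),
}


async def read_via_registry(
    name: str, paper_id: str, save_path: Optional[str] = None
) -> str:
    """Look up a searcher by name and turn every failure into a message."""
    if save_path is None:
        save_path = "./downloads"  # assumption: stands in for get_download_path()
    searcher = REGISTRY.get(name)
    if searcher is None:
        return f"Error: Unknown searcher {name}"
    try:
        return searcher.read_paper(paper_id, save_path)
    except NotImplementedError as e:
        # Unsupported operations surface their own message unchanged.
        return str(e)
    except Exception as e:
        return f"Error reading paper: {e}"
```

Running `asyncio.run(read_via_registry("stub", "abc123"))` returns the stub's Markdown text, while an unregistered name returns an error string. Returning strings in every branch means the tool caller always receives readable text, at the cost of having to inspect the text to detect failure.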