# read_scihub_paper
Extract full text from pre-2023 academic papers via Sci-Hub, converting PDFs to Markdown format for analysis, summarization, or answering questions about research content.
## Instructions
Download and extract full text from a paper via Sci-Hub (older papers only).
USE THIS TOOL WHEN:
- You need the complete text content of a paper (not just abstract)
- The paper was published BEFORE 2023
- You want to analyze, summarize, or answer questions about a paper
This downloads the PDF and extracts its text as clean Markdown suitable for LLM processing, with paper metadata included at the start.
WORKFLOW: search_crossref(query) -> get DOI -> read_scihub_paper(doi)
Args:
- `doi`: Paper DOI (e.g., `'10.1038/nature12373'`).
- `save_path`: Directory to save the PDF (default: `~/paper_downloads`).

Returns:
Full paper text in Markdown format with a metadata header, or an error message if the download or extraction fails.

Example:
`read_scihub_paper("10.1038/nature12373")`
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| doi | Yes | Paper DOI (e.g., `10.1038/nature12373`) | |
| save_path | No | Directory to save the PDF | `~/paper_downloads` |
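For illustration, the two call shapes the schema allows; the handler is `async` in `server.py`, and direct in-process access is assumed here:

```python
# Illustrative call shapes per the schema above; in-process access assumed.
import asyncio

async def main() -> None:
    # Default save directory (~/paper_downloads)
    await read_scihub_paper("10.1038/nature12373")
    # Explicit directory via the optional save_path argument
    await read_scihub_paper("10.1038/nature12373", save_path="/tmp/papers")

asyncio.run(main())
```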
## Implementation Reference
- `paper_find_mcp/server.py:924-956` (registration): MCP tool registration and handler for `read_scihub_paper`; delegates to `SciHubFetcher.read_paper` after resolving `save_path`.

  ```python
  @mcp.tool()
  async def read_scihub_paper(doi: str, save_path: Optional[str] = None) -> str:
      """Download and extract full text from paper via Sci-Hub (older papers only).

      USE THIS TOOL WHEN:
      - You need the complete text content of a paper (not just abstract)
      - The paper was published BEFORE 2023
      - You want to analyze, summarize, or answer questions about a paper

      This downloads the PDF and extracts text as clean Markdown format,
      suitable for LLM processing. Includes paper metadata at the start.

      WORKFLOW: search_crossref(query) -> get DOI -> read_scihub_paper(doi)

      Args:
          doi: Paper DOI (e.g., '10.1038/nature12373').
          save_path: Directory to save PDF (default: ~/paper_downloads).

      Returns:
          Full paper text in Markdown format with metadata header,
          or error message if download/extraction fails.

      Example:
          read_scihub_paper("10.1038/nature12373")
      """
      if save_path is None:
          save_path = get_download_path()
      try:
          return SCIHUB.read_paper(doi, save_path)
      except Exception as e:
          logger.error(f"Sci-Hub read failed: {e}")
          return f"Error: {e}"
  ```
- Core handler logic in `SciHubFetcher`: downloads the PDF via `download_pdf`, extracts text to Markdown with `pymupdf4llm`, and prepends metadata (see the extraction sketch after this list).

  ```python
  def read_paper(self, doi: str, save_path: Optional[str] = None) -> str:
      """Download the paper and extract its text.

      Args:
          doi: Paper DOI
          save_path: Directory to save to

      Returns:
          Extracted Markdown text or an error message
      """
      # Download the PDF first
      result = self.download_pdf(doi, save_path)
      if result.startswith("Error"):
          return result

      pdf_path = result
      try:
          text = pymupdf4llm.to_markdown(pdf_path, show_progress=False)
          logger.info(f"Extracted {len(text)} characters from {pdf_path}")

          if not text.strip():
              return f"PDF downloaded to {pdf_path}, but no text could be extracted."

          # Prepend metadata
          metadata = f"# Paper: {doi}\n\n"
          metadata += f"**DOI**: https://doi.org/{doi}\n"
          metadata += f"**PDF**: {pdf_path}\n"
          metadata += f"**Source**: Sci-Hub\n\n"
          metadata += "---\n\n"

          return metadata + text
      except Exception as e:
          logger.error(f"Failed to extract text: {e}")
          return f"Error extracting text: {e}"
  ```
- Helper method to download the PDF from Sci-Hub mirrors: prefers `curl`, falls back to `requests` with retries, and validates that the result is a PDF (see the curl sketch after this list).

  ```python
  def download_pdf(self, doi: str, save_path: Optional[str] = None) -> str:
      """Download a paper's PDF by DOI.

      Prefers curl (more reliable); falls back to requests on failure.

      Args:
          doi: Paper DOI (e.g., "10.1038/nature12373")
          save_path: Directory to save to (default: ~/paper_downloads)

      Returns:
          Path of the downloaded file or an error message
      """
      if not doi or not doi.strip():
          return "Error: DOI is empty"

      doi = doi.strip()

      # If no path is given, use paper_downloads under the user's home directory
      output_dir = Path(save_path) if save_path else Path.home() / "paper_downloads"
      output_dir.mkdir(parents=True, exist_ok=True)

      try:
          # Resolve the PDF URL (HTML parsing requires requests)
          pdf_url = self._get_pdf_url(doi)
          if not pdf_url:
              return f"Error: Could not find PDF for DOI {doi} on Sci-Hub"

          # Build the target file path
          clean_doi = re.sub(r'[^\w\-_.]', '_', doi)
          file_path = output_dir / f"scihub_{clean_doi}.pdf"

          # Method 1: prefer curl (more reliable)
          if self._download_with_curl(pdf_url, str(file_path)):
              return str(file_path)

          logger.info("curl failed, falling back to requests...")

          # Method 2: fall back to requests (with retries)
          max_retries = 3
          for attempt in range(max_retries):
              try:
                  response = self.session.get(
                      pdf_url,
                      verify=False,
                      timeout=(30, 180),  # 30s connect, 180s read
                      stream=True
                  )
                  if response.status_code != 200:
                      logger.warning(f"Download failed with status {response.status_code}")
                      continue

                  # Stream the body to disk
                  with open(file_path, 'wb') as f:
                      for chunk in response.iter_content(chunk_size=8192):
                          if chunk:
                              f.write(chunk)

                  # Verify the file is a PDF
                  with open(file_path, 'rb') as f:
                      header = f.read(4)
                  if header != b'%PDF':
                      logger.warning("Downloaded file is not a PDF")
                      os.remove(file_path)
                      continue

                  logger.info(f"PDF downloaded with requests: {file_path}")
                  return str(file_path)

              except requests.exceptions.Timeout:
                  logger.warning(f"Timeout (attempt {attempt + 1}/{max_retries})")
              except Exception as e:
                  logger.warning(f"Download error (attempt {attempt + 1}/{max_retries}): {e}")

          return f"Error: Could not download PDF for DOI {doi}"

      except Exception as e:
          logger.error(f"Download failed for {doi}: {e}")
          return f"Error downloading PDF: {e}"
  ```
- `paper_find_mcp/server.py:88` (helper): Global `SciHubFetcher` instance used by the tool.

  ```python
  SCIHUB = SciHubFetcher()
  ```
- `paper_find_mcp/server.py:36` (helper): Import of the `SciHubFetcher` class.

  ```python
  from .academic_platforms.sci_hub import SciHubFetcher, check_paper_year
  ```
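The extraction step in `read_paper` hinges on `pymupdf4llm.to_markdown`; a minimal standalone sketch of that step, where the path is a placeholder following `download_pdf`'s `scihub_<clean_doi>.pdf` naming scheme:

```python
# Standalone sketch of the Markdown extraction step from read_paper.
# The path below is a placeholder, not a file this doc guarantees exists.
import os
import pymupdf4llm

pdf_path = os.path.expanduser("~/paper_downloads/scihub_10.1038_nature12373.pdf")
text = pymupdf4llm.to_markdown(pdf_path, show_progress=False)
if not text.strip():
    print("No text extracted (e.g., a scanned PDF).")  # mirrors read_paper's check
else:
    print(text[:500])  # preview the opening of the extracted Markdown
```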
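`download_pdf` calls a `_download_with_curl` helper whose body isn't shown above; a hypothetical sketch of what such a subprocess-based downloader might look like (the flag choices and validation here are assumptions, not the project's actual implementation):

```python
# Hypothetical sketch in the spirit of _download_with_curl; the real
# helper's body isn't shown, so the flags below are assumptions.
import subprocess

def download_with_curl(url: str, file_path: str, timeout: int = 180) -> bool:
    """Return True if curl fetched a file starting with the %PDF magic bytes."""
    result = subprocess.run(
        ["curl", "-L", "--fail", "--silent", "--show-error",
         "--max-time", str(timeout), "-o", file_path, url],
        capture_output=True,
    )
    if result.returncode != 0:
        return False
    # Same %PDF magic-byte check the requests fallback performs
    with open(file_path, "rb") as f:
        return f.read(4) == b"%PDF"
```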