# read_scihub_paper
Extract full text from pre-2023 academic papers via Sci-Hub, converting PDFs to Markdown format for analysis, summarization, or answering questions about research content.
## Instructions
Download and extract full text from a paper via Sci-Hub (older papers only).
USE THIS TOOL WHEN:
- You need the complete text content of a paper (not just abstract)
- The paper was published BEFORE 2023
- You want to analyze, summarize, or answer questions about a paper
This downloads the PDF and extracts its text as clean Markdown suitable for LLM processing, with paper metadata included at the start.
WORKFLOW: search_crossref(query) -> get DOI -> read_scihub_paper(doi)
Args:
- `doi`: Paper DOI (e.g., `'10.1038/nature12373'`).
- `save_path`: Directory to save the PDF (default: `~/paper_downloads`).

Returns:
Full paper text in Markdown format with a metadata header, or an error message if the download or extraction fails.

Example:
`read_scihub_paper("10.1038/nature12373")`
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| doi | Yes | Paper DOI (e.g., `10.1038/nature12373`) | |
| save_path | No | Directory to save the PDF | `~/paper_downloads` |
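For illustration, the two call shapes the schema allows; the handler is `async` in `server.py`, and direct in-process access is assumed here:

```python
# Illustrative call shapes per the schema above; in-process access assumed.
import asyncio

async def main() -> None:
    # Default save directory (~/paper_downloads)
    await read_scihub_paper("10.1038/nature12373")
    # Explicit directory via the optional save_path argument
    await read_scihub_paper("10.1038/nature12373", save_path="/tmp/papers")

asyncio.run(main())
```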
## Implementation Reference
- `paper_find_mcp/server.py:924-956` (registration): MCP tool registration and handler for `read_scihub_paper`; delegates to `SciHubFetcher.read_paper` after resolving `save_path`.

  ```python
  @mcp.tool()
  async def read_scihub_paper(doi: str, save_path: Optional[str] = None) -> str:
      """Download and extract full text from paper via Sci-Hub (older papers only).

      USE THIS TOOL WHEN:
      - You need the complete text content of a paper (not just abstract)
      - The paper was published BEFORE 2023
      - You want to analyze, summarize, or answer questions about a paper

      This downloads the PDF and extracts text as clean Markdown format,
      suitable for LLM processing. Includes paper metadata at the start.

      WORKFLOW: search_crossref(query) -> get DOI -> read_scihub_paper(doi)

      Args:
          doi: Paper DOI (e.g., '10.1038/nature12373').
          save_path: Directory to save PDF (default: ~/paper_downloads).

      Returns:
          Full paper text in Markdown format with metadata header,
          or error message if download/extraction fails.

      Example:
          read_scihub_paper("10.1038/nature12373")
      """
      if save_path is None:
          save_path = get_download_path()
      try:
          return SCIHUB.read_paper(doi, save_path)
      except Exception as e:
          logger.error(f"Sci-Hub read failed: {e}")
          return f"Error: {e}"
  ```
- Core handler logic in `SciHubFetcher`: downloads the PDF via `download_pdf`, extracts text to Markdown with `pymupdf4llm`, and prepends metadata (see the extraction sketch after this list).

  ```python
  def read_paper(self, doi: str, save_path: Optional[str] = None) -> str:
      """Download the paper and extract its text.

      Args:
          doi: Paper DOI
          save_path: Directory to save to

      Returns:
          Extracted Markdown text or an error message
      """
      # Download the PDF first
      result = self.download_pdf(doi, save_path)
      if result.startswith("Error"):
          return result

      pdf_path = result
      try:
          text = pymupdf4llm.to_markdown(pdf_path, show_progress=False)
          logger.info(f"Extracted {len(text)} characters from {pdf_path}")

          if not text.strip():
              return f"PDF downloaded to {pdf_path}, but no text could be extracted."

          # Prepend metadata
          metadata = f"# Paper: {doi}\n\n"
          metadata += f"**DOI**: https://doi.org/{doi}\n"
          metadata += f"**PDF**: {pdf_path}\n"
          metadata += f"**Source**: Sci-Hub\n\n"
          metadata += "---\n\n"

          return metadata + text
      except Exception as e:
          logger.error(f"Failed to extract text: {e}")
          return f"Error extracting text: {e}"
  ```
- Helper method to download the PDF from Sci-Hub mirrors: prefers `curl`, falls back to `requests` with retries, and validates that the result is a PDF (see the curl sketch after this list).

  ```python
  def download_pdf(self, doi: str, save_path: Optional[str] = None) -> str:
      """Download a paper's PDF by DOI.

      Prefers curl (more reliable); falls back to requests on failure.

      Args:
          doi: Paper DOI (e.g., "10.1038/nature12373")
          save_path: Directory to save to (default: ~/paper_downloads)

      Returns:
          Path of the downloaded file or an error message
      """
      if not doi or not doi.strip():
          return "Error: DOI is empty"

      doi = doi.strip()

      # If no path is given, use paper_downloads under the user's home directory
      output_dir = Path(save_path) if save_path else Path.home() / "paper_downloads"
      output_dir.mkdir(parents=True, exist_ok=True)

      try:
          # Resolve the PDF URL (HTML parsing requires requests)
          pdf_url = self._get_pdf_url(doi)
          if not pdf_url:
              return f"Error: Could not find PDF for DOI {doi} on Sci-Hub"

          # Build the target file path
          clean_doi = re.sub(r'[^\w\-_.]', '_', doi)
          file_path = output_dir / f"scihub_{clean_doi}.pdf"

          # Method 1: prefer curl (more reliable)
          if self._download_with_curl(pdf_url, str(file_path)):
              return str(file_path)

          logger.info("curl failed, falling back to requests...")

          # Method 2: fall back to requests (with retries)
          max_retries = 3
          for attempt in range(max_retries):
              try:
                  response = self.session.get(
                      pdf_url,
                      verify=False,
                      timeout=(30, 180),  # 30s connect, 180s read
                      stream=True
                  )
                  if response.status_code != 200:
                      logger.warning(f"Download failed with status {response.status_code}")
                      continue

                  # Stream the body to disk
                  with open(file_path, 'wb') as f:
                      for chunk in response.iter_content(chunk_size=8192):
                          if chunk:
                              f.write(chunk)

                  # Verify the file is a PDF
                  with open(file_path, 'rb') as f:
                      header = f.read(4)
                  if header != b'%PDF':
                      logger.warning("Downloaded file is not a PDF")
                      os.remove(file_path)
                      continue

                  logger.info(f"PDF downloaded with requests: {file_path}")
                  return str(file_path)

              except requests.exceptions.Timeout:
                  logger.warning(f"Timeout (attempt {attempt + 1}/{max_retries})")
              except Exception as e:
                  logger.warning(f"Download error (attempt {attempt + 1}/{max_retries}): {e}")

          return f"Error: Could not download PDF for DOI {doi}"

      except Exception as e:
          logger.error(f"Download failed for {doi}: {e}")
          return f"Error downloading PDF: {e}"
  ```
- `paper_find_mcp/server.py:88` (helper): Global `SciHubFetcher` instance used by the tool.

  ```python
  SCIHUB = SciHubFetcher()
  ```
- `paper_find_mcp/server.py:36` (helper): Import of the `SciHubFetcher` class.

  ```python
  from .academic_platforms.sci_hub import SciHubFetcher, check_paper_year
  ```
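The extraction step in `read_paper` hinges on `pymupdf4llm.to_markdown`; a minimal standalone sketch of that step, where the path is a placeholder following `download_pdf`'s `scihub_<clean_doi>.pdf` naming scheme:

```python
# Standalone sketch of the Markdown extraction step from read_paper.
# The path below is a placeholder, not a file this doc guarantees exists.
import os
import pymupdf4llm

pdf_path = os.path.expanduser("~/paper_downloads/scihub_10.1038_nature12373.pdf")
text = pymupdf4llm.to_markdown(pdf_path, show_progress=False)
if not text.strip():
    print("No text extracted (e.g., a scanned PDF).")  # mirrors read_paper's check
else:
    print(text[:500])  # preview the opening of the extracted Markdown
```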
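`download_pdf` calls a `_download_with_curl` helper whose body isn't shown above; a hypothetical sketch of what such a subprocess-based downloader might look like (the flag choices and validation here are assumptions, not the project's actual implementation):

```python
# Hypothetical sketch in the spirit of _download_with_curl; the real
# helper's body isn't shown, so the flags below are assumptions.
import subprocess

def download_with_curl(url: str, file_path: str, timeout: int = 180) -> bool:
    """Return True if curl fetched a file starting with the %PDF magic bytes."""
    result = subprocess.run(
        ["curl", "-L", "--fail", "--silent", "--show-error",
         "--max-time", str(timeout), "-o", file_path, url],
        capture_output=True,
    )
    if result.returncode != 0:
        return False
    # Same %PDF magic-byte check the requests fallback performs
    with open(file_path, "rb") as f:
        return f.read(4) == b"%PDF"
```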