Paper Search MCP Server

by h-lu

read_scihub_paper

Extract full text from pre-2023 academic papers via Sci-Hub, converting PDFs to Markdown format for analysis, summarization, or answering questions about research content.

Instructions

Download and extract full text from paper via Sci-Hub (older papers only).

USE THIS TOOL WHEN:
- You need the complete text content of a paper (not just abstract)
- The paper was published BEFORE 2023
- You want to analyze, summarize, or answer questions about a paper

This downloads the PDF and extracts text as clean Markdown format,
suitable for LLM processing. Includes paper metadata at the start.

WORKFLOW: search_crossref(query) -> get DOI -> read_scihub_paper(doi)

Args:
    doi: Paper DOI (e.g., '10.1038/nature12373').
    save_path: Directory to save PDF (default: ~/paper_downloads).

Returns:
    Full paper text in Markdown format with metadata header,
    or error message if download/extraction fails.

Example:
    read_scihub_paper("10.1038/nature12373")

Input Schema

Name        Required  Description                                Default
doi         Yes       Paper DOI (e.g., '10.1038/nature12373')    (none)
save_path   No        Directory to save the PDF                  ~/paper_downloads
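
A tool call supplying both arguments might look like the following (the `save_path` value is illustrative):

```json
{
  "name": "read_scihub_paper",
  "arguments": {
    "doi": "10.1038/nature12373",
    "save_path": "/tmp/papers"
  }
}
```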

Implementation Reference

  • MCP tool registration and handler function for 'read_scihub_paper'. Delegates to SciHubFetcher.read_paper after setting save_path.
    @mcp.tool()
    async def read_scihub_paper(doi: str, save_path: Optional[str] = None) -> str:
        """Download and extract full text from paper via Sci-Hub (older papers only).
        
        USE THIS TOOL WHEN:
        - You need the complete text content of a paper (not just abstract)
        - The paper was published BEFORE 2023
        - You want to analyze, summarize, or answer questions about a paper
        
        This downloads the PDF and extracts text as clean Markdown format,
        suitable for LLM processing. Includes paper metadata at the start.
        
        WORKFLOW: search_crossref(query) -> get DOI -> read_scihub_paper(doi)
        
        Args:
            doi: Paper DOI (e.g., '10.1038/nature12373').
            save_path: Directory to save PDF (default: ~/paper_downloads).
        
        Returns:
            Full paper text in Markdown format with metadata header,
            or error message if download/extraction fails.
        
        Example:
            read_scihub_paper("10.1038/nature12373")
        """
        if save_path is None:
            save_path = get_download_path()
        try:
            return SCIHUB.read_paper(doi, save_path)
        except Exception as e:
            logger.error(f"Sci-Hub read failed: {e}")
            return f"Error: {e}"
  • Core handler logic in SciHubFetcher: downloads PDF using download_pdf, extracts text to Markdown with pymupdf4llm, adds metadata.
    def read_paper(self, doi: str, save_path: Optional[str] = None) -> str:
        """Download the paper and extract its text.
        
        Args:
            doi: Paper DOI
            save_path: Save directory
            
        Returns:
            Extracted Markdown text, or an error message
        """
        # Download the PDF first
        result = self.download_pdf(doi, save_path)
        if result.startswith("Error"):
            return result
        
        pdf_path = result
        
        try:
            text = pymupdf4llm.to_markdown(pdf_path, show_progress=False)
            logger.info(f"Extracted {len(text)} characters from {pdf_path}")
            
            if not text.strip():
                return f"PDF downloaded to {pdf_path}, but no text could be extracted."
            
            # Prepend a metadata header
            metadata = f"# Paper: {doi}\n\n"
            metadata += f"**DOI**: https://doi.org/{doi}\n"
            metadata += f"**PDF**: {pdf_path}\n"
            metadata += f"**Source**: Sci-Hub\n\n"
            metadata += "---\n\n"
            
            return metadata + text
            
        except Exception as e:
            logger.error(f"Failed to extract text: {e}")
            return f"Error extracting text: {e}"
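The metadata header that `read_paper` prepends can be reproduced in isolation. A minimal sketch; `build_metadata_header` is a hypothetical helper for illustration, not part of the server:

```python
def build_metadata_header(doi: str, pdf_path: str) -> str:
    """Build the Markdown metadata header that read_paper prepends."""
    return (
        f"# Paper: {doi}\n\n"
        f"**DOI**: https://doi.org/{doi}\n"
        f"**PDF**: {pdf_path}\n"
        f"**Source**: Sci-Hub\n\n"
        "---\n\n"
    )

header = build_metadata_header(
    "10.1038/nature12373", "/tmp/scihub_10.1038_nature12373.pdf"
)
print(header.splitlines()[0])  # → # Paper: 10.1038/nature12373
```

The trailing `---\n\n` separates the header from the extracted body text, so downstream LLM processing can split metadata from content on the first horizontal rule.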
  • Helper method to download PDF from Sci-Hub mirrors, prefers curl, falls back to requests, validates PDF.
    def download_pdf(self, doi: str, save_path: Optional[str] = None) -> str:
        """Download a paper PDF by its DOI.
        
        Prefers curl (more reliable); falls back to requests on failure.
        
        Args:
            doi: Paper DOI (e.g., "10.1038/nature12373")
            save_path: Save directory (default: ~/paper_downloads)
        
        Returns:
            Path to the downloaded file, or an error message
        """
        if not doi or not doi.strip():
            return "Error: DOI is empty"
        
        doi = doi.strip()
        # If no path is given, use paper_downloads under the user's home directory
        output_dir = Path(save_path) if save_path else Path.home() / "paper_downloads"
        output_dir.mkdir(parents=True, exist_ok=True)
        
        try:
            # Get the PDF URL (HTML parsing requires requests)
            pdf_url = self._get_pdf_url(doi)
            if not pdf_url:
                return f"Error: Could not find PDF for DOI {doi} on Sci-Hub"
            
            # Build the output file path
            clean_doi = re.sub(r'[^\w\-_.]', '_', doi)
            file_path = output_dir / f"scihub_{clean_doi}.pdf"
            
            # Method 1: prefer curl (more reliable)
            if self._download_with_curl(pdf_url, str(file_path)):
                return str(file_path)
            
            logger.info("curl failed, falling back to requests...")
            
            # Method 2: fall back to requests (with retries)
            max_retries = 3
            for attempt in range(max_retries):
                try:
                    response = self.session.get(
                        pdf_url, 
                        verify=False, 
                        timeout=(30, 180),  # 30s connect, 180s read
                        stream=True
                    )
                    
                    if response.status_code != 200:
                        logger.warning(f"Download failed with status {response.status_code}")
                        continue
                    
                    # Stream the response to disk
                    with open(file_path, 'wb') as f:
                        for chunk in response.iter_content(chunk_size=8192):
                            if chunk:
                                f.write(chunk)
                    
                    # Verify the file is a PDF
                    with open(file_path, 'rb') as f:
                        header = f.read(4)
                    
                    if header != b'%PDF':
                        logger.warning("Downloaded file is not a PDF")
                        os.remove(file_path)
                        continue
                    
                    logger.info(f"PDF downloaded with requests: {file_path}")
                    return str(file_path)
                    
                except requests.exceptions.Timeout:
                    logger.warning(f"Timeout (attempt {attempt + 1}/{max_retries})")
                except Exception as e:
                    logger.warning(f"Download error (attempt {attempt + 1}/{max_retries}): {e}")
            
            return f"Error: Could not download PDF for DOI {doi}"
            
        except Exception as e:
            logger.error(f"Download failed for {doi}: {e}")
            return f"Error downloading PDF: {e}"
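Two details of `download_pdf` can be exercised on their own: the DOI-to-filename sanitization and the `%PDF` magic-byte check. A standalone sketch using only the standard library (the function names are illustrative, not part of the server):

```python
import re

def sanitize_doi(doi: str) -> str:
    """Replace filename-unsafe characters, as download_pdf does."""
    return re.sub(r'[^\w\-_.]', '_', doi)

def looks_like_pdf(first_bytes: bytes) -> bool:
    """A valid PDF file starts with the magic bytes %PDF."""
    return first_bytes[:4] == b'%PDF'

print(sanitize_doi("10.1038/nature12373"))  # → 10.1038_nature12373
print(looks_like_pdf(b'%PDF-1.7\n'))        # → True
print(looks_like_pdf(b'<html>'))            # → False
```

The magic-byte check matters because Sci-Hub mirrors sometimes return an HTML error or captcha page with a 200 status; without it, a broken "PDF" would be passed on to text extraction.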
  • Global SciHubFetcher instance used by the tool.
    SCIHUB = SciHubFetcher()
  • Import of SciHubFetcher class.
    from .academic_platforms.sci_hub import SciHubFetcher, check_paper_year

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/h-lu/paper-search-mcp'
