Which integrations are available for this server?

Provides comprehensive access to scientific papers from the arXiv repository, allowing users to browse categories, search by metadata, and extract full-text content. Features an advanced DOI resolution system that utilizes a fallback chain including Unpaywall, Crossref, and Semantic Scholar to locate and retrieve paper content. Specializes in harvesting and extracting text from over 200 million open access research papers across major academic databases and repositories. Enables searching and full-text extraction of biomedical and life science literature specifically from the PubMed Central (PMC) database. Integrates the Semantic Scholar Academic Graph as a metadata source and fallback provider for resolving paper DOIs and retrieving scholarly content.

How do I use Scientific Paper Harvester MCP Server?

1. Click on "Install Server". 2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state. 3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@Scientific Paper Harvester MCP Server find the most cited machine learning papers since 2024" That's it! The server will respond to your query, and you can continue using it as needed. Here is a step-by-step guide with screenshots.

Scientific Paper Harvester MCP Server

A comprehensive Model Context Protocol (MCP) server that provides LLMs with real-time access to scientific papers from 6 major academic sources: arXiv, OpenAlex, PMC (PubMed Central), Europe PMC, bioRxiv/medRxiv, and CORE.

🚀 Features

Comprehensive Source Coverage

arXiv: Computer science, physics, mathematics preprints and papers
OpenAlex: Open catalog of scholarly papers with citation data
PMC: PubMed Central biomedical and life science literature
Europe PMC: European life science literature database
bioRxiv/medRxiv: Biology and medical preprint servers
CORE: World's largest collection of open access research papers

Advanced Capabilities

Paper Fetching: Get latest papers from any source by category/concept
Paper Search: Search papers by title, abstract, author, or full-text across 4 major sources
Full-Text Extraction: Extract complete text content with intelligent fallback strategies
Citation Analysis: Find top cited papers from OpenAlex since a specific date
Paper Lookup: Retrieve full metadata for specific papers by ID
Category Discovery: Browse available categories from all sources
Smart Rate Limiting: Respectful API usage with per-source rate limiting
DOI Resolution: Advanced DOI resolver with Unpaywall → Crossref → Semantic Scholar fallback
Dual Interface: Both MCP protocol and CLI access
TypeScript: Full type safety with ESM modules

📊 Coverage Statistics

Total Sources: 6 academic databases
Category Coverage: 100+ categories across all disciplines
Paper Access: 200M+ papers with intelligent text extraction
Text Extraction Success: >90% for supported paper types
Response Time: <15 seconds average for paper fetching

🛠 Installation

npm install npm run build

📋 MCP Client Configuration

To use this server with an MCP client (like Claude Desktop), add the following to your MCP client configuration:

For published package (available on npm):

Option 1: Using npx (recommended for AI tools like Claude)

{ "mcpServers": { "scientific-papers": { "command": "npx", "args": [ "-y", "@futurelab-studio/latest-science-mcp@latest" ] } } }

Option 2: Global installation

npm install -g @futurelab-studio/latest-science-mcp

Then configure:

{ "mcpServers": { "scientific-papers": { "command": "latest-science-mcp" } } }

📖 Usage

CLI Interface

# List arXiv categories node dist/cli.js list-categories --source=arxiv # List OpenAlex concepts node dist/cli.js list-categories --source=openalex # List PMC biomedical categories node dist/cli.js list-categories --source=pmc # List Europe PMC life science categories node dist/cli.js list-categories --source=europepmc # List bioRxiv/medRxiv categories (includes both servers) node dist/cli.js list-categories --source=biorxiv # List CORE academic categories node dist/cli.js list-categories --source=core

Fetch Latest Papers

# Get latest AI papers from arXiv node dist/cli.js fetch-latest --source=arxiv --category=cs.AI --count=10 # Get latest biology papers from bioRxiv node dist/cli.js fetch-latest --source=biorxiv --category="biorxiv:biology" --count=5 # Get latest immunology papers from PMC node dist/cli.js fetch-latest --source=pmc --category=immunology --count=3 # Get latest papers from CORE by subject node dist/cli.js fetch-latest --source=core --category=computer_science --count=5 # Search by concept name (OpenAlex) node dist/cli.js fetch-latest --source=openalex --category="machine learning" --count=3

Fetch Top Cited Papers

# Get top 20 cited papers in machine learning since 2024 node dist/cli.js fetch-top-cited --concept="machine learning" --since=2024-01-01 --count=20 # Get top cited papers by concept ID node dist/cli.js fetch-top-cited --concept=C41008148 --since=2023-06-01 --count=10

Search Papers

# Search by keywords across all fields node dist/cli.js search-papers --source=arxiv --query="machine learning" --count=10 # Search by paper title node dist/cli.js search-papers --source=openalex --query="neural networks" --field=title --count=5 # Search by author name node dist/cli.js search-papers --source=europepmc --query="John Smith" --field=author --count=10 # Search full-text content sorted by citations node dist/cli.js search-papers --source=core --query="climate change" --field=fulltext --sortBy=citations --count=20

Fetch Specific Paper Content

# Get arXiv paper by ID node dist/cli.js fetch-content --source=arxiv --id=2401.12345 # Get bioRxiv paper by DOI node dist/cli.js fetch-content --source=biorxiv --id="10.1101/2021.01.01.425001" # Get PMC paper by ID node dist/cli.js fetch-content --source=pmc --id=PMC8245678 # Get CORE paper by ID node dist/cli.js fetch-content --source=core --id=12345678 # Show text content with preview node dist/cli.js fetch-content --source=arxiv --id=2401.12345 --show-text --text-preview=500

🔧 Available Tools

`list_categories`

Lists available categories/concepts from any data source.

Parameters:

source: "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core"

Returns:

Array of category objects with id, name, and optional description

Examples:

{ "name": "list_categories", "arguments": { "source": "biorxiv" } }

`fetch_latest`

Fetches the latest papers from any source for a given category with metadata only (no text extraction).

Parameters:

source: "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core"
category: Category ID or concept name (varies by source)
count: Number of papers to fetch (default: 50, max: 200)

Category Examples by Source:

arXiv: "cs.AI", "physics.gen-ph", "math.CO"
OpenAlex: "artificial intelligence", "machine learning", "C41008148"
PMC: "immunology", "genetics", "neuroscience"
Europe PMC: "biology", "medicine", "cancer"
bioRxiv/medRxiv: "biorxiv:neuroscience", "medrxiv:psychiatry"
CORE: "computer_science", "mathematics", "physics"

Returns:

Array of paper objects with metadata (id, title, authors, date, pdf_url)
Text field: Empty string (text: "") - use fetch_content for full text

`fetch_top_cited`

Fetches the top cited papers from OpenAlex for a given concept since a specific date.

Parameters:

concept: Concept name or OpenAlex concept ID
since: Start date in YYYY-MM-DD format
count: Number of papers to fetch (default: 50, max: 200)

`search_papers`

Searches for papers across multiple academic sources with field-specific search and sorting options.

Parameters:

source: "arxiv" | "openalex" | "europepmc" | "core"
query: Search query string (max 1500 characters)
field: "all" | "title" | "abstract" | "author" | "fulltext" (default: "all")
count: Number of results to return (default: 50, max: 200)
sortBy: "relevance" | "date" | "citations" (default: "relevance")

Search Capabilities by Source:

arXiv: Title, abstract, author, and general search with Boolean operators
OpenAlex: Advanced search with relevance scoring and citation sorting
Europe PMC: Biomedical literature with MeSH terms and full-text search
CORE: Global academic papers with advanced query language

Example Queries:

Keywords: "machine learning", "climate change"
Phrases: "artificial intelligence" (use quotes for exact phrases)
Boolean: "deep learning AND neural networks" (arXiv supports this)
Authors: "John Smith", "Smith J"

Returns:

Array of paper objects with metadata (id, title, authors, date, pdf_url)
Text field: Empty string (text: "") - use fetch_content for full text

`fetch_content`

Fetches full metadata and text content for a specific paper by ID with complete text extraction.

Parameters:

source: Any of the 6 supported sources
id: Paper ID (format varies by source)

ID Formats by Source:

arXiv: "2401.12345", "cs/0601001", "1234.5678v2"
OpenAlex: "W2741809807" or numeric 2741809807
PMC: "PMC8245678" or "12345678"
Europe PMC: "PMC8245678", "12345678", or DOI
bioRxiv/medRxiv: "10.1101/2021.01.01.425001" or "2021.01.01.425001"
CORE: Numeric ID like "12345678"

📄 Paper Metadata Format

All tools return paper objects with the following structure:

{ id: string; // Paper ID title: string; // Paper title authors: string[]; // List of author names date: string; // Publication date (ISO format) pdf_url?: string; // PDF URL (if available) text: string; // Extracted full text content textTruncated?: boolean; // Warning: text was truncated due to size limits textExtractionFailed?: boolean; // Warning: text extraction failed }

🧠 Advanced Text Extraction

Multi-Source Strategy

Each source has specialized text extraction approaches:

arXiv: HTML from arxiv.org/html with ar5iv.labs.arxiv.org fallback
OpenAlex: HTML sources with DOI resolver fallback chain
PMC: E-utilities API with XML/HTML extraction
Europe PMC: REST API with multiple URL strategies
bioRxiv/medRxiv: Direct HTML extraction with abstract fallback
CORE: PDF/HTML with source URL fallback

DOI Resolution Chain

Advanced DOI resolver with multiple fallback strategies:

Unpaywall → Free full-text sources
Crossref → Publisher metadata and links
Semantic Scholar Academic Graph → Alternative access

Performance & Reliability

Text Extraction Success: >90% for HTML-available papers
Graceful Degradation: Always returns metadata even if text extraction fails
Size Management: 6MB text limit with intelligent truncation
Caching: 24-hour LRU cache for DOI resolution

🔄 Rate Limiting

Respectful API usage with per-source rate limiting:

arXiv: 5 requests per minute
OpenAlex: 10 requests per minute
PMC: 3 requests per second
Europe PMC: 10 requests per minute
bioRxiv/medRxiv: 5 requests per minute
CORE: 10 requests per minute (public), higher with API key

CORE API Configuration

For enhanced CORE access, set environment variable:

export CORE_API_KEY="your-api-key"

🧪 Testing

Run Test Suite

# Run all tests npm test # Run integration tests npm run test -- tests/integration # Run end-to-end workflow tests npm run test -- tests/e2e # Run performance benchmarks npm run test -- tests/integration/performance.test.ts

Test Coverage

Integration Tests: All 6 sources tested end-to-end
Performance Tests: Response time and throughput benchmarks
Workflow Tests: Real research scenarios across multiple sources
Unit Tests: Core components and edge cases

🏗 Architecture

Modular Driver System

Clean separation between sources
Consistent interface across all drivers
Specialized text extraction per source

Advanced Features

DOI Resolution: Multi-provider fallback chain
Rate Limiting: Token bucket algorithm per source
Text Processing: HTML cleaning and normalization
Error Handling: Structured responses with actionable suggestions
Caching: Intelligent caching for DOI resolution

Technology Stack

TypeScript + ESM: Modern JavaScript with full type safety
Modular Design: Clean separation of concerns
Graceful Degradation: Always functional even with partial failures
Response Size Management: Automatic truncation and warnings

📊 Source Comparison

Source	Papers	Disciplines	Full-Text	Citation Data	Preprints	Search
arXiv	2.3M+	STEM	HTML ✓	Limited	✓	✓✓✓
OpenAlex	200M+	All	Variable	✓✓✓	✓	✓✓✓
PMC	7M+	Biomedical	XML/HTML ✓	Limited	✗	Limited
Europe PMC	40M+	Life Sciences	HTML ✓	Limited	✓	✓✓✓
bioRxiv/medRxiv	500K+	Bio/Medical	HTML ✓	Limited	✓✓✓	Limited
CORE	200M+	All	PDF/HTML ✓	Limited	✓	✓✓✓

🔧 Development

Build

npm run build

Test Individual Sources

# Test specific sources node dist/cli.js list-categories --source=arxiv node dist/cli.js fetch-latest --source=biorxiv --category="biorxiv:biology" --count=3 node dist/cli.js fetch-content --source=core --id=12345678 # Test search functionality node dist/cli.js search-papers --source=arxiv --query="artificial intelligence" --count=5 node dist/cli.js search-papers --source=openalex --query="quantum computing" --field=title --count=3

Performance Testing

# Run performance benchmarks npm run test -- tests/integration/performance.test.ts # Test memory usage npm run test -- --reporter=verbose

🚨 Error Handling

Comprehensive error handling for all sources:

Invalid paper IDs with format suggestions
Rate limiting with retry-after information
API timeouts and server errors
Missing authentication (CORE API key)
Network connectivity issues
Text extraction failures with fallback strategies

🔍 Troubleshooting

Common Issues

Rate limiting: Automatic retry with exponential backoff
Missing papers: Try alternative sources for the same content
Text extraction failures: Fallback to abstract or metadata
CORE API limits: Set CORE_API_KEY environment variable

Performance Optimization

Use appropriate count parameters (smaller for faster responses)
Cache results when possible
Use fetch_latest for discovery, fetch_content for detailed reading

📝 License

MIT

Ready to explore the world's scientific knowledge? Start with any of the 6 sources and discover papers across all academic disciplines! 🔬📚

Scientific Paper Harvester MCP Server

🚀 Features

Comprehensive Source Coverage

Advanced Capabilities

📊 Coverage Statistics

🛠 Installation

📋 MCP Client Configuration

For published package (available on npm):

📖 Usage

CLI Interface

List Categories

Fetch Latest Papers

Fetch Top Cited Papers

Search Papers

Fetch Specific Paper Content

🔧 Available Tools

list_categories

fetch_latest

fetch_top_cited

search_papers

fetch_content

📄 Paper Metadata Format

🧠 Advanced Text Extraction

Multi-Source Strategy

DOI Resolution Chain

Performance & Reliability

🔄 Rate Limiting

CORE API Configuration

🧪 Testing

Run Test Suite

Test Coverage

🏗 Architecture

Modular Driver System

Advanced Features

Technology Stack

📊 Source Comparison

🔧 Development

Build

Test Individual Sources

Performance Testing

🚨 Error Handling

🔍 Troubleshooting

Common Issues

Performance Optimization

📝 License

Resources

Appeared in Searches

Latest Blog Posts

MCP directory API

`list_categories`

`fetch_latest`

`fetch_top_cited`

`search_papers`

`fetch_content`