Scientific Paper Harvester MCP Server

story-2.story.md•8.52 kB

# Story-2: Paper Fetching Tools (Metadata Only)

**Epic:** Scientific Paper Harvester MCP Server MVP
**Status:** Completed
**Assignee:** Developer
**Sprint:** 2
**Estimated Effort:** Medium

## User Story

As a developer or researcher, I want to fetch the latest papers from arXiv and OpenAlex, get top cited papers from OpenAlex, and retrieve metadata for specific papers so that I can access and query scientific literature metadata through the MCP server before adding text extraction capabilities.

## Acceptance Criteria

- [x] **AC-1:** `fetch_latest` tool works for arXiv source and returns papers with metadata (id, title, authors, date, pdf_url)
- [x] **AC-2:** `fetch_latest` tool works for OpenAlex source and returns papers with metadata
- [x] **AC-3:** `fetch_top_cited` tool works for OpenAlex and returns top cited papers sorted by citation count (metadata only)
- [x] **AC-4:** `fetch_content` tool retrieves full metadata for a specific paper by ID from both sources (no text yet)
- [x] **AC-5:** All tools respect rate limiting (5 req/s max per source) and handle errors gracefully
- [x] **AC-6:** CLI interface provides access to all three new tools for offline testing
- [x] **AC-7:** Response payloads are properly structured and validated with consistent schema
- [x] **AC-8:** Tools integrate seamlessly with existing MCP server from Story 1

## Technical Requirements

### Prerequisites

- Story 1 completed (MCP server foundation with list_categories)
- Node.js 20 LTS with ES modules
- Existing rate limiter and driver architecture

### Implementation Tasks

- [x] **T-1:** Extend arXiv driver to support `fetch_latest` returning metadata from API search results
- [x] **T-2:** Extend OpenAlex driver to support `fetch_latest` and `fetch_top_cited` with metadata from API
- [x] **T-3:** Implement `fetch_content` tool for both arXiv and OpenAlex sources (metadata only)
- [x] **T-4:** Add MCP tool definitions with proper Zod validation for all three new tools
- [x] **T-5:** Implement response size monitoring framework for future text extraction
- [x] **T-6:** Extend CLI interface to support all three new tools with proper parameter handling
- [x] **T-7:** Add comprehensive error handling for API failures and rate limiting
- [x] **T-8:** Update documentation in README with usage examples for all new tools
- [x] **T-9:** Exploratory testing by the user, running the server via `npx @srbhptl39/mcp-superassistant-proxy@latest --config ./mcpconfig.json`

### Definition of Done

- [x] All acceptance criteria met
- [x] Unit tests pass (>90% coverage) for all new tools and driver extensions
- [x] Integration tests pass for all tools via both MCP and CLI interfaces
- [x] Rate limiting works correctly for all tools
- [x] Response schemas are consistent and properly validated
- [x] Code follows TypeScript/ESM standards with proper error handling
- [x] Documentation in README updated with examples and usage instructions

## Technical Design

### Architecture Components

- **Enhanced Drivers**: Extended arXiv and OpenAlex drivers with new query methods
- **MCP Tools**: Three new tool implementations with Zod validation
- **Response Framework**: Foundation for future payload size monitoring
- **CLI Extensions**: Command-line access to all new tools

### Data Models

```typescript
interface PaperMetadata {
  id: string;
  title: string;
  authors: string[];
  date: string; // ISO format
  pdf_url?: string;
  // text field will be added in Story 3
}

interface FetchLatestRequest {
  source: 'arxiv' | 'openalex';
  category: string;
  count: number; // default 50
}

interface FetchTopCitedRequest {
  concept: string;
  since: string; // ISO date format
  count: number; // default 50
}

interface FetchContentRequest {
  source: 'arxiv' | 'openalex';
  id: string;
}
```

### API Design

- **MCP Tool**: `fetch_latest({ source, category, count })`
- **MCP Tool**: `fetch_top_cited({ concept, since, count })`
- **MCP Tool**: `fetch_content({ source, id })`
- **CLI Commands**:
  - `npx latest-science-mcp fetch-latest --source=arxiv --category=cs.AI --count=10`
  - `npx latest-science-mcp fetch-top-cited --concept="artificial intelligence" --since=2024-01-01 --count=20`
  - `npx latest-science-mcp fetch-content --source=arxiv --id=2401.12345`

### Query Implementation

1. **arXiv fetch_latest**: Use arXiv API search endpoint with category filter and date sorting
2. **OpenAlex fetch_latest**: Use OpenAlex works endpoint with concept filter and publication_date sorting
3. **OpenAlex fetch_top_cited**: Use OpenAlex works endpoint with concept filter and cited_by_count sorting
4. **fetch_content**: Direct lookup by ID for both sources returning full metadata

## Test Plan

### Unit Tests

- arXiv driver fetch_latest with mocked API responses
- OpenAlex driver fetch_latest and fetch_top_cited with mocked responses
- fetch_content tool with both sources
- MCP tool parameter validation with Zod schemas
- Error handling for API failures and rate limiting

### Integration Tests

- CLI commands for all three tools return valid JSON
- MCP server responds correctly to all tool calls via stdio
- Rate limiting prevents excessive API calls during bulk operations
- End-to-end pipeline from MCP call to formatted metadata response

## Dependencies

### Internal Dependencies

- Story 1 foundation (MCP server, rate limiter, base drivers)
- Existing driver architecture and interfaces
- Rate limiting service

### External Dependencies

- arXiv API for search and individual paper lookup
- OpenAlex API for works search with concept and citation filtering
- Existing dependencies: axios, zod, winston

## Chat Log

**2025-01-XX**: Story 2 updated to focus specifically on metadata-only paper fetching tools. This provides a solid foundation for the three core tools (fetch_latest, fetch_top_cited, fetch_content) without the complexity of text extraction.

**Key Technical Decisions:**

- Implement all three tools but return metadata only (no text extraction yet)
- Build response size monitoring framework for future text integration
- Focus on robust API integration and error handling
- Ensure comprehensive CLI and MCP interfaces
- Prepare architecture for Story 3 text extraction integration

**Scope Reduction Benefits:**

- Delivers working tools immediately with clear value
- Reduces implementation risk by isolating text extraction complexity
- Allows thorough testing of tool interfaces before adding text processing
- Enables user feedback on core functionality before text features

**2025-05-23**: Story 2 implementation completed successfully. All acceptance criteria met:

**✅ Implementation Completed:**

- Extended BaseDriver with fetchLatest and fetchContent abstract methods
- Implemented ArxivDriver with XML parsing for arXiv API responses
- Implemented OpenAlexDriver with JSON parsing and concept filtering
- Created three new MCP tools: fetch_latest, fetch_top_cited, fetch_content
- Extended MCP server with proper Zod validation for all new tools
- Extended CLI with comprehensive command support and help text
- Updated README with complete documentation and examples

**✅ Testing Completed:**

- All CLI commands tested successfully:
  - `fetch-latest` works for both arXiv (cs.AI) and OpenAlex (concept IDs)
  - `fetch-top-cited` works with OpenAlex concept filtering and date ranges
  - `fetch-content` works for individual paper lookup by ID
- MCP server tested with proxy - all tools register and validate correctly
- Rate limiting implemented and working (5 req/min arXiv, 10 req/min OpenAlex)
- Error handling tested with invalid inputs (proper Zod validation errors)
- Exploratory testing completed with mcp-superassistant-proxy

**✅ Key Features Delivered:**

- Metadata-only paper fetching (id, title, authors, date, pdf_url)
- Robust API integration with both arXiv and OpenAlex
- Comprehensive error handling and input validation
- Rate limiting with proper retry-after messaging
- Dual interface (MCP + CLI) with consistent functionality
- Foundation ready for Story 3 text extraction integration

**Technical Notes:**

- OpenAlex works best with concept IDs (e.g., C41008148) rather than concept names
- arXiv XML parsing implemented with regex for MVP (can be enhanced with xml2js later)
- Response size monitoring framework in place for future text extraction
- All tools respect API rate limits and provide clear error messages

---

**Created:** 2025-01-XX
**Last Updated:** 2025-05-23
**Completed:** 2025-05-23
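
**Illustrative sketch (not from the implementation):** The query strategies listed under Query Implementation could be expressed as small URL builders. This is a minimal sketch under stated assumptions, not the project's driver code. The function names (`toQuery`, `arxivLatestUrl`, `openAlexLatestUrl`, `openAlexTopCitedUrl`) are hypothetical, and the query parameters assume the public arXiv query API (`search_query`, `sortBy`, `sortOrder`, `max_results`) and the OpenAlex works endpoint (`filter`, `sort`, `per-page`). Concept IDs are used for OpenAlex, following the Technical Notes.

```typescript
// Hypothetical helpers for illustration; names are not taken from the project.

// Serialise key/value pairs into a percent-encoded query string.
function toQuery(params: Record<string, string>): string {
  return Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`)
    .join('&');
}

// arXiv fetch_latest: filter by category, newest submissions first.
function arxivLatestUrl(category: string, count = 50): string {
  return 'http://export.arxiv.org/api/query?' + toQuery({
    search_query: `cat:${category}`,
    sortBy: 'submittedDate',
    sortOrder: 'descending',
    max_results: String(count),
  });
}

// OpenAlex fetch_latest: filter works by concept ID, newest publications first.
function openAlexLatestUrl(conceptId: string, count = 50): string {
  return 'https://api.openalex.org/works?' + toQuery({
    filter: `concepts.id:${conceptId}`,
    sort: 'publication_date:desc',
    'per-page': String(count),
  });
}

// OpenAlex fetch_top_cited: concept filter plus a start date, most cited first.
function openAlexTopCitedUrl(conceptId: string, since: string, count = 50): string {
  return 'https://api.openalex.org/works?' + toQuery({
    filter: `concepts.id:${conceptId},from_publication_date:${since}`,
    sort: 'cited_by_count:desc',
    'per-page': String(count),
  });
}

console.log(arxivLatestUrl('cs.AI', 10));
// prints "http://export.arxiv.org/api/query?search_query=cat%3Acs.AI&sortBy=submittedDate&sortOrder=descending&max_results=10"
```

Keeping URL construction in pure functions like these would let the unit tests in the Test Plan check query parameters without mocking HTTP calls.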
