Scientific Paper Harvester MCP Server

architecture.md•14.4 KiB

# Architecture for MCP Server – Scientific‑Paper Harvester & Text‑Search

Status: Draft

## Technical Summary

This architecture defines a Node.js-based Model Context Protocol (MCP) server that provides LLMs with real-time access to scientific papers from arXiv and OpenAlex. The system follows a streaming, stateless design with per-source rate limiting and robust error handling. Built as an ESM package targeting Node.js 20 LTS, it exposes four core tools through the MCP protocol while maintaining a CLI interface for testing and offline usage.

The architecture emphasizes simplicity for the MVP with clean separation between data source drivers, text extraction pipelines, and MCP protocol handling. Rate limiting is implemented per data source per session, ensuring respectful API usage while maintaining responsive performance for LLM interactions.

## Technology Table

| Technology | Version | Description |
| ---------- | ------- | ----------- |
| Node.js | 20 LTS | Runtime environment with ESM support |
| @modelcontextprotocol/sdk | ^1.0.0 | Official MCP server framework and transport |
| TypeScript | ^5.0.0 | Type safety and development experience |
| Zod | ^3.22.0 | Runtime schema validation for MCP tools |
| axios | ^1.6.0 | HTTP client with retry and timeout support |
| cheerio | ^1.0.0 | Server-side HTML parsing for text extraction |
| winston | ^3.11.0 | Structured logging with multiple levels |
| vitest | ^1.0.0 | Fast unit testing framework |
| nock | ^13.0.0 | HTTP mocking for offline testing |
| execa | ^8.0.0 | CLI testing with subprocess execution |

## High-Level Overview

The system follows a **Service-Oriented Architecture** within a single Node.js process, structured as a pipeline of specialized services. This approach balances simplicity for the MVP while providing clear separation of concerns for future extensibility.
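The per-source, per-session rate limiting described in the Technical Summary could be implemented as a token bucket. The following is a minimal sketch under stated assumptions, not the project's actual implementation: the class name `SessionRateLimiter`, the injected clock, and the example limits are hypothetical, while the bucket fields mirror the `RateLimiterState` schema in this document.

```typescript
// Hypothetical sketch of a per-source token bucket scoped to one session.
// Field names follow the RateLimiterState schema; everything else is illustrative.
interface Bucket {
  tokens: number;
  lastRefill: number; // epoch milliseconds
  maxTokens: number;
  refillRate: number; // tokens per second
}

type AcquireResult = { ok: true } | { ok: false; retryAfter: number };

class SessionRateLimiter {
  private buckets = new Map<string, Bucket>();
  private now: () => number;

  // The clock is injectable so the limiter can be tested deterministically.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Registers a source with a full bucket.
  register(source: string, maxTokens: number, refillRate: number): void {
    this.buckets.set(source, {
      tokens: maxTokens,
      lastRefill: this.now(),
      maxTokens,
      refillRate,
    });
  }

  // Consumes one token if available; otherwise reports how many
  // seconds to wait until the next token has been refilled.
  tryAcquire(source: string): AcquireResult {
    const bucket = this.buckets.get(source);
    if (!bucket) {
      throw new Error(`Unknown source: ${source}`);
    }

    // Refill proportionally to the time elapsed since the last call.
    const current = this.now();
    const elapsedSeconds = (current - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(
      bucket.maxTokens,
      bucket.tokens + elapsedSeconds * bucket.refillRate
    );
    bucket.lastRefill = current;

    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return { ok: true };
    }
    const retryAfter = Math.ceil((1 - bucket.tokens) / bucket.refillRate);
    return { ok: false, retryAfter };
  }
}

// Usage: one limiter per session, one bucket per source (limits are made up).
const limiter = new SessionRateLimiter();
limiter.register("arxiv", 5, 1);
limiter.register("openalex", 10, 2);
const firstCall = limiter.tryAcquire("arxiv");
console.log("arxiv request allowed:", firstCall.ok);
```

Keeping one limiter instance per session matches the stateless design: nothing is persisted, and the `retryAfter` value maps directly onto the `retryAfter` field of a `RateLimited` error.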
### Component View

```mermaid
graph TB
    subgraph "MCP Protocol Layer"
        MCP[MCP Server]
        CLI[CLI Interface]
    end

    subgraph "Tool Orchestration"
        TO[Tool Orchestrator]
        RV[Request Validator]
    end

    subgraph "Data Source Drivers"
        AD[arXiv Driver]
        OD[OpenAlex Driver]
    end

    subgraph "Processing Pipeline"
        TE[Text Extractor]
        RL[Rate Limiter]
        EH[Error Handler]
    end

    subgraph "External APIs"
        ARXIV[arXiv API]
        OPENALEX[OpenAlex API]
        HTML[HTML Sources]
    end

    MCP --> TO
    CLI --> TO
    TO --> RV
    RV --> AD
    RV --> OD
    AD --> RL
    OD --> RL
    RL --> ARXIV
    RL --> OPENALEX
    AD --> TE
    OD --> TE
    TE --> HTML
    TO --> EH

    classDef external fill:#e1f5fe
    classDef processing fill:#f3e5f5
    classDef protocol fill:#e8f5e8

    class ARXIV,OPENALEX,HTML external
    class TE,RL,EH processing
    class MCP,CLI protocol
```

## Architectural Diagrams, Data Models, Schemas

### MCP Tool Interface Schema

```typescript
// Tool Response Schema
interface PaperMetadata {
  id: string;
  title: string;
  authors: string[];
  date: string; // ISO format
  pdf_url?: string;
  text: string;
}

interface ToolResponse {
  content: PaperMetadata[] | CategoryList | PaperMetadata;
  warnings?: string[];
  errors?: string[];
}

// Error Response Schema
interface MCPError {
  code: 'NotAvailable' | 'PartialSuccess' | 'RateLimited' | 'SourceDown' | 'InvalidQuery';
  message: string;
  suggestions?: string[];
  retryAfter?: number; // seconds
}
```

### Rate Limiter State Schema

```typescript
interface RateLimiterState {
  [source: string]: {
    tokens: number;
    lastRefill: number;
    maxTokens: number;
    refillRate: number; // tokens per second
  };
}

// Per session state
interface SessionState {
  rateLimiters: RateLimiterState;
  startTime: number;
}
```

### Data Source Response Mapping

```typescript
// arXiv Entry to Internal Format
interface ArxivEntry {
  id: string; // "2401.12345"
  title: string;
  authors: Author[];
  published: string;
  pdf_url: string;
  html_url?: string;
}

// OpenAlex Work to Internal Format
interface OpenAlexWork {
  id: string; // "W2741809807"
  title: string;
  authorships: Authorship[];
  publication_date: string;
  primary_location: {
    landing_page_url?: string;
    pdf_url?: string;
    source_type: string;
  };
}
```

### Text Extraction Pipeline Flow

```mermaid
sequenceDiagram
    participant TO as Tool Orchestrator
    participant Driver as Source Driver
    participant RL as Rate Limiter
    participant API as External API
    participant TE as Text Extractor
    participant EH as Error Handler

    TO->>Driver: fetch_latest(source, category, count)
    Driver->>RL: checkRateLimit(source)

    alt Rate limit OK
        RL-->>Driver: proceed
        Driver->>API: GET /search?category=X&count=N
        API-->>Driver: paper metadata list

        loop For each paper
            Driver->>RL: checkRateLimit(source)
            RL-->>Driver: proceed
            Driver->>TE: extractText(paper.html_url)
            TE->>API: GET html_url
            API-->>TE: HTML content
            TE->>TE: parse & clean text
            TE-->>Driver: cleaned text
            Driver->>Driver: augment paper with text
        end

        Driver-->>TO: PaperMetadata[]
    else Rate limited
        RL-->>Driver: rate limited
        Driver->>EH: handleRateLimit()
        EH-->>TO: MCPError{code: "RateLimited"}
    end
```

### Error Handling Decision Tree

```mermaid
flowchart TD
    A[API Request] --> B{Response Status}

    B -->|200| C[Extract Text]
    B -->|429| D[Rate Limited]
    B -->|404| E[Not Found]
    B -->|500-503| F[Server Error]
    B -->|Network Error| G[Connection Failed]

    C --> H{Text Extraction Success?}
    H -->|Yes| I[Return Paper Data]
    H -->|No| J[Return Metadata Only]

    D --> K[Update Rate Limiter]
    K --> L[Return RateLimited Error]

    E --> M[Skip Paper with Warning]
    F --> N[Retry with Backoff]
    G --> O[Return SourceDown Error]

    N --> P{Max Retries?}
    P -->|No| A
    P -->|Yes| O

    style I fill:#c8e6c9
    style L fill:#ffcdd2
    style O fill:#ffcdd2
    style M fill:#fff3e0
```

## Project Structure

```
latest-science-mcp/
├── src/
│   ├── server.ts                 # MCP server entry point
│   ├── cli.ts                    # CLI interface entry point
│   ├── tools/                    # MCP tool implementations
│   │   ├── fetch-latest.ts
│   │   ├── fetch-top-cited.ts
│   │   ├── list-categories.ts
│   │   └── fetch-content.ts
│   ├── drivers/                  # Data source drivers
│   │   ├── base-driver.ts        # Abstract base class
│   │   ├── arxiv-driver.ts       # arXiv API integration
│   │   └── openalex-driver.ts    # OpenAlex API integration
│   ├── extractors/               # Text extraction pipeline
│   │   ├── base-extractor.ts     # Abstract extractor
│   │   ├── html-extractor.ts     # HTML parsing & cleaning
│   │   └── text-cleaner.ts       # Text normalization
│   ├── core/                     # Core services
│   │   ├── rate-limiter.ts       # Per-source rate limiting
│   │   ├── error-handler.ts      # Structured error responses
│   │   ├── validator.ts          # Request validation
│   │   └── logger.ts             # Structured logging
│   ├── types/                    # TypeScript definitions
│   │   ├── mcp.ts                # MCP-specific types
│   │   ├── papers.ts             # Paper metadata types
│   │   └── sources.ts            # Data source types
│   └── config/                   # Configuration
│       ├── constants.ts          # API endpoints, defaults
│       └── schemas.ts            # Zod validation schemas
├── tests/                        # Test suites
│   ├── unit/                     # Unit tests with mocks
│   ├── integration/              # CLI integration tests
│   ├── mcp/                      # MCP protocol tests
│   └── __fixtures__/             # Test data and mocks
│       ├── arxiv-responses/
│       ├── openalex-responses/
│       └── html-samples/
├── docs/                         # Documentation
│   ├── api.md                    # MCP tool documentation
│   ├── cli.md                    # CLI usage guide
│   └── examples/                 # Usage examples
├── package.json                  # Package configuration
├── tsconfig.json                 # TypeScript configuration
├── vitest.config.ts              # Test configuration
└── README.md                     # Project overview
```

## Testing Requirements and Framework

### Test Strategy by Layer

| Test Type | Framework | Coverage Goal | Mock Strategy |
|-----------|-----------|---------------|---------------|
| Unit Tests | Vitest | >90% line coverage | nock for HTTP, in-memory mocks for services |
| Integration | Vitest + execa | All CLI commands | nock fixtures, real MCP protocol |
| MCP Contract | Custom harness | All tool schemas | Stdio transport testing |
| E2E | GitHub Actions | Happy path only | Blocked external network |

### Test Data Management

```typescript
// Fixture structure
interface TestFixture {
  arxiv: {
    searchResponse: ArxivSearchResponse;
    paperHtml: string;
    expectedText: string;
  };
  openalex: {
    worksResponse: OpenAlexWorksResponse;
    paperHtml: string;
    expectedText: string;
  };
}
```

## Patterns and Standards

### Architectural Patterns

- **Repository Pattern**: Each data source driver implements a common interface (`BaseDriver`) for consistent tool orchestration
- **Strategy Pattern**: Text extraction varies by source type (HTML vs future PDF support)
- **Factory Pattern**: Driver instantiation based on source parameter
- **Chain of Responsibility**: Error handling pipeline with source-specific handlers

### API Design Standards

- **Protocol**: MCP 1.0 with stdio transport
- **Validation**: All tool parameters validated with Zod schemas before processing
- **Error Format**: Structured MCPError objects with actionable suggestions
- **Response Size**: Maximum 8MB payload with automatic truncation warnings

### Coding Standards

- **Style Guide**: Standard TypeScript with ESLint + Prettier
- **Naming Conventions**:
  - Files: kebab-case (`fetch-latest.ts`)
  - Classes: PascalCase (`ArxivDriver`)
  - Functions/variables: camelCase (`fetchContent`)
  - Constants: SCREAMING_SNAKE_CASE (`MAX_PAPERS_PER_REQUEST`)
- **Import Strategy**: Barrel exports from each module directory
- **Documentation**: JSDoc for all public methods and interfaces

### Error Handling Strategy

```typescript
// Standardized error creation
class MCPErrorBuilder {
  static notAvailable(source: string, id: string): MCPError {
    return {
      code: 'NotAvailable',
      message: `Paper ${id} from ${source} is not accessible`,
      suggestions: [
        'Try searching for a different paper',
        'Check if the paper ID is correct',
        'Use fetch_latest to find accessible papers'
      ]
    };
  }

  static partialSuccess(successCount: number, totalCount: number): MCPError {
    return {
      code: 'PartialSuccess',
      message: `Retrieved ${successCount}/${totalCount} papers successfully`,
      suggestions: successCount === 0 ? [
        'Try a different category or time range',
        'Check if the source is currently available'
      ] : []
    };
  }
}
```

### Logging Standards

```typescript
import winston from 'winston';

// Winston configuration
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: { service: 'latest-science-mcp' },
  transports: [
    new winston.transports.Console({
      format: winston.format.simple(),
      // The stdio transport reserves stdout for MCP messages,
      // so every log level is routed to stderr instead.
      stderrLevels: ['error', 'warn', 'info', 'http', 'verbose', 'debug', 'silly']
    })
  ]
});

// Usage patterns
logger.info('MCP tool called', {
  tool: 'fetch_latest',
  source: 'arxiv',
  category: 'cs.AI'
});

logger.warn('Rate limit approaching', {
  source: 'arxiv',
  remainingTokens: 2
});

// Inside a catch block, where `error` is the caught Error
logger.error('Text extraction failed', {
  source: 'openalex',
  paperId: 'W123',
  error: error.message
});
```

## Initial Project Setup (Manual Steps)

### Story 0: Project Initialization

1. **Repository Setup**:

   ```bash
   mkdir latest-science-mcp
   cd latest-science-mcp
   npm init -y
   git init
   ```

2. **Package Configuration**:

   ```bash
   # Update package.json with proper metadata
   npm pkg set name="@futurelab/latest-science-mcp"
   npm pkg set type="module"
   npm pkg set main="dist/server.js"
   npm pkg set bin.latest-science-mcp="dist/cli.js"
   ```

3. **Core Dependencies**:

   ```bash
   npm install @modelcontextprotocol/sdk zod axios cheerio winston
   npm install -D typescript @types/node vitest nock execa eslint prettier
   ```

4. **TypeScript Configuration**:
   - Create `tsconfig.json` with ES2022 target, ESM modules
   - Enable strict mode and path mapping for clean imports

5. **Environment Setup**:
   - No external accounts required for MVP
   - All API endpoints are public and open access
   - Rate limiting handles respectful usage automatically

## Infrastructure and Deployment

### MVP Deployment Strategy

- **Target Environment**: npm registry for global installation
- **Runtime**: Node.js 20 LTS (user-provided)
- **Distribution**: Pre-compiled TypeScript to JavaScript ESM
- **Installation**: Single command `npx @futurelab/latest-science-mcp`

### CI/CD Pipeline

```yaml
# GitHub Actions workflow
name: Test and Publish
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm test
      - run: npm run build

  publish:
    if: startsWith(github.ref, 'refs/tags/v')
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          registry-url: 'https://registry.npmjs.org'
      - run: npm ci
      - run: npm run build
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
```

### Resource Requirements

- **Memory**: ~50MB base + ~10MB per concurrent operation
- **Network**: Outbound HTTPS to arxiv.org and api.openalex.org
- **Storage**: No persistent storage required for MVP
- **Dependencies**: Node.js 20+ runtime environment only

## Change Log

| Date | Version | Changes | Author |
|------|---------|---------|---------|
| 2025-01-XX | 0.1.0 | Initial architecture draft | R&D |
