# arXiv CLI & MCP Server
A Python toolkit for searching and downloading papers from arXiv.org, with both a command-line interface and a Model Context Protocol (MCP) server for LLM integration.
Coding agents work best with well-documented CLI tools or MCP servers; this project provides both.
## Features
- **Search** arXiv papers by title, author, abstract, category, and more
- **Download** PDFs automatically with local caching
- **MCP Server** for integration with LLM assistants (Claude Desktop, etc.)
- **Typed responses** using Pydantic models for clean data handling
- **Rate limiting** built-in to respect arXiv API guidelines
- **Comprehensive tests** with 26 integration tests (no mocking)
## Installation
### Option 1: Install from GitHub (Recommended)
Install directly from the GitHub repository:
```bash
# Install the latest version
uv pip install git+https://github.com/LiamConnell/arxiv_for_agents.git
# Or with pip
pip install git+https://github.com/LiamConnell/arxiv_for_agents.git
# Now you can use the arxiv command
arxiv --help
```
### Option 2: Install from Source
Clone the repository and install locally:
```bash
# Clone the repository
git clone https://github.com/LiamConnell/arxiv_for_agents.git
cd arxiv_for_agents
# Install in editable mode
uv pip install -e .
# Now you can use the arxiv command
arxiv --help
```
### Option 3: Development Installation
For development with all dependencies:
```bash
# Clone and install with dev dependencies
git clone https://github.com/LiamConnell/arxiv_for_agents.git
cd arxiv_for_agents
uv pip install -e ".[dev]"
# Run tests
uv run pytest
```
### Verify Installation
```bash
# If installed as package
arxiv --help
# Or if using as module
uv run python -m arxiv --help
```
## Usage
**Note:** If you installed as a package, use `arxiv` directly. Otherwise, use `uv run python -m arxiv`.
### Search Papers
Search by title:
```bash
# Using installed package
arxiv search "ti:attention is all you need"
# Or using as module
uv run python -m arxiv search "ti:attention is all you need"
```
Search by author:
```bash
arxiv search "au:Hinton" --max-results 20
```
Search by category:
```bash
arxiv search "cat:cs.AI" --max-results 10
```
Combined search:
```bash
arxiv search "ti:transformer AND au:Vaswani"
```
### Get Specific Paper
Get paper metadata and download PDF:
```bash
arxiv get 1706.03762
```
Get metadata only (no download):
```bash
arxiv get 1706.03762 --no-download
```
Force re-download:
```bash
arxiv get 1706.03762 --force
```
### Download PDF
Download just the PDF:
```bash
arxiv download 1706.03762
```
### List Downloaded PDFs
```bash
arxiv list-downloads
```
### JSON Output
Get results as JSON for scripting:
```bash
arxiv search "ti:neural" --json
arxiv get 1706.03762 --json --no-download
```
## Search Query Syntax
The arXiv API supports field-specific searches:
- `ti:` - Title
- `au:` - Author
- `abs:` - Abstract
- `cat:` - Category (e.g., cs.AI, cs.LG)
- `all:` - All fields (default)
You can combine searches with `AND`, `OR`, and `ANDNOT`:
```bash
arxiv search "ti:neural AND cat:cs.LG"
arxiv search "au:Hinton OR au:Bengio"
```
## Download Directory
PDFs are downloaded to `./.arxiv` by default. Change this with:
```bash
arxiv --download-dir ./papers search "ti:transformer"
```
## MCP Server (Model Context Protocol)
The arXiv CLI includes a Model Context Protocol (MCP) server that allows LLM assistants (like Claude Desktop) to search and download arXiv papers programmatically.
### Running the MCP Server
```bash
# Option 1: Using the script entry point (recommended)
uv run arxiv-mcp
# Option 2: Using the module
uv run python -m arxiv.mcp
```
The server runs in stdio mode and communicates via JSON-RPC over stdin/stdout.
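For orientation, the first message an MCP client sends over the server's stdin is a JSON-RPC 2.0 `initialize` request. The sketch below builds such a message with the standard library; the `protocolVersion` value and `clientInfo` contents are illustrative assumptions, not taken from this project.

```python
import json

# Illustrative JSON-RPC 2.0 "initialize" request, as an MCP client
# (e.g. Claude Desktop) would send it to the server's stdin.
# The protocolVersion and clientInfo values are assumptions for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.1"},
    },
}

# The stdio transport exchanges one JSON message per line.
print(json.dumps(request))
```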
### MCP Tools
The server provides 4 tools for paper discovery and management:
1. **search_papers** - Search arXiv with advanced query syntax
- Supports field prefixes (ti:, au:, abs:, cat:)
- Boolean operators (AND, OR, ANDNOT)
- Pagination and sorting options
- Returns paper metadata including title, authors, abstract, categories
2. **get_paper** - Get detailed information about a specific paper
- Accepts flexible ID formats (1706.03762, arXiv:1706.03762, 1706.03762v1)
- Optionally downloads PDF automatically
- Returns complete metadata including DOI, journal references, comments
3. **download_paper** - Download PDF for a specific paper
- Downloads to local `.arxiv` directory
- Returns file path and size information
- Supports force re-download option
4. **list_downloaded_papers** - List all locally downloaded PDFs
- Shows arxiv IDs, file sizes, and paths
- Useful for managing local paper collection
### MCP Resources
The server exposes 2 resources for direct access:
- **paper://{arxiv_id}** - Get formatted paper metadata in markdown
- **downloads://list** - Get markdown table of all downloaded papers
### MCP Prompts
Pre-built prompt templates to guide usage:
- **search_arxiv_prompt** - Guide for searching arXiv papers
- **download_paper_prompt** - Guide for downloading and managing papers
### Claude Desktop Configuration
Add to your Claude Desktop config file (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):
**If installed from GitHub/pip:**
```json
{
  "mcpServers": {
    "arxiv": {
      "command": "arxiv-mcp"
    }
  }
}
```
**If running from source/development:**
```json
{
  "mcpServers": {
    "arxiv": {
      "command": "uv",
      "args": ["run", "arxiv-mcp"],
      "cwd": "/path/to/arxiv_for_agents"
    }
  }
}
```
Or use `--directory` to avoid needing `cwd`:
```json
{
  "mcpServers": {
    "arxiv": {
      "command": "uv",
      "args": ["--directory", "/path/to/arxiv_for_agents", "run", "arxiv-mcp"]
    }
  }
}
```
### MCP Use Cases
Once configured, you can ask Claude to:
- "Search arXiv for recent papers on transformer architectures"
- "Find papers by Geoffrey Hinton in the cs.AI category"
- "Download the 'Attention is All You Need' paper"
- "Show me papers about neural networks from 2023"
- "List all the papers I've downloaded"
- "Get the abstract for arXiv:1706.03762"
The MCP integration allows Claude to autonomously search, retrieve, and manage academic papers from arXiv.
## Architecture
### Module Structure
```
arxiv/
├── __init__.py          # Package exports
├── __main__.py          # CLI entry point
├── cli.py               # Click commands
├── models.py            # Pydantic models
├── services.py          # API client service
└── mcp/                 # MCP server
    ├── __init__.py      # MCP package exports
    ├── __main__.py      # MCP server entry point
    └── server.py        # FastMCP server with tools, resources, prompts
tests/
└── test_services.py     # Integration tests (26 tests)
```
### Pydantic Models
All API responses are typed using Pydantic:
```python
from arxiv import ArxivService

service = ArxivService()
result = service.search("ti:neural", max_results=5)

# result is typed as ArxivSearchResult
print(f"Total: {result.total_results}")
for entry in result.entries:
    # entry is typed as ArxivEntry
    print(f"{entry.arxiv_id}: {entry.title}")
    print(f"Authors: {', '.join(a.name for a in entry.authors)}")
```
### Key Models
- **ArxivSearchResult**: Search results with metadata
- `total_results`: Total matching papers
- `entries`: List of ArxivEntry objects
- **ArxivEntry**: Individual paper
- `arxiv_id`: Clean ID (e.g., "1706.03762")
- `title`, `summary`: Paper metadata
- `authors`: List of Author objects
- `categories`: Subject categories
- `pdf_url`: Direct PDF link
- `published`, `updated`: Datetime objects
- **Author**: Paper author
- `name`: Author name
- `affiliation`: Optional affiliation
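The shapes listed above can be sketched with stdlib dataclasses. The real project defines these as Pydantic models; dataclasses are used here only to keep the example dependency-free, and the sample values are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Dependency-free sketch of the model shapes described above.
# The actual project uses Pydantic; field names follow the list above.

@dataclass
class Author:
    name: str
    affiliation: Optional[str] = None

@dataclass
class ArxivEntry:
    arxiv_id: str
    title: str
    summary: str
    pdf_url: str
    published: datetime
    updated: datetime
    authors: list = field(default_factory=list)
    categories: list = field(default_factory=list)

# Illustrative instance for a well-known paper
entry = ArxivEntry(
    arxiv_id="1706.03762",
    title="Attention Is All You Need",
    summary="...",
    pdf_url="https://arxiv.org/pdf/1706.03762",
    published=datetime(2017, 6, 12),
    updated=datetime(2017, 6, 12),
    authors=[Author(name="Ashish Vaswani")],
    categories=["cs.CL"],
)
print(entry.arxiv_id, entry.authors[0].name)
```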
## Testing
Run all 26 integration tests (makes real API calls):
```bash
uv run pytest tests/test_services.py -v
```
Run specific test class:
```bash
uv run pytest tests/test_services.py::TestArxivServiceSearch -v
```
The tests are integration tests that hit the real arXiv API, ensuring the service works with actual data.
## API Rate Limiting
The service enforces a 3-second delay between API requests by default (arXiv's recommendation). You can adjust this:
```python
from arxiv import ArxivService
service = ArxivService(rate_limit_delay=5.0) # 5 seconds
```
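Internally, this kind of rate limiting amounts to a sleep-before-request guard. A minimal sketch (not the project's actual implementation, which lives in `services.py`):

```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive calls (illustrative sketch)."""

    def __init__(self, delay: float = 3.0) -> None:
        self.delay = delay
        self._last_call = float("-inf")  # no prior call yet

    def wait(self) -> None:
        # Sleep only for the remainder of the delay window, if any.
        remaining = self.delay - (time.monotonic() - self._last_call)
        if remaining > 0:
            time.sleep(remaining)
        self._last_call = time.monotonic()

limiter = RateLimiter(delay=0.1)  # short delay so the demo runs quickly
limiter.wait()                    # first call returns immediately
start = time.monotonic()
limiter.wait()                    # second call sleeps the remaining ~0.1 s
elapsed = time.monotonic() - start
print(f"Second call waited {elapsed:.2f}s")
```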
## Examples
### Python API
```python
from arxiv import ArxivService

# Initialize service
service = ArxivService(download_dir="./papers")

# Search
results = service.search(
    query="ti:attention is all you need",
    max_results=5,
    sort_by="relevance",
)
print(f"Found {results.total_results} papers")
for entry in results.entries:
    print(f"- {entry.title}")

# Get specific paper
entry = service.get("1706.03762", download_pdf=True)
print(f"Downloaded: {entry.title}")

# Just download the PDF
pdf_path = service.download_pdf("1706.03762")
print(f"PDF saved to: {pdf_path}")
```
### CLI Examples
```bash
# Find recent papers in a category
arxiv search "cat:cs.AI" \
  --max-results 10 \
  --sort-by submittedDate \
  --sort-order descending

# Search and output as JSON for processing
arxiv search "ti:transformer" --json | jq '.entries[].title'

# Batch-download multiple papers
for id in 1706.03762 1810.04805 2010.11929; do
  arxiv download "$id"
done
```
## Development
The codebase follows these principles:
1. **Type safety**: Pydantic models for all API responses
2. **Clean architecture**: Separation of CLI, service, and models
3. **Real tests**: Integration tests with actual API calls (no mocks)
4. **Rate limiting**: Respects arXiv API guidelines
5. **Caching**: Automatic local caching to avoid re-downloads
## arXiv API Reference
- Base URL: https://export.arxiv.org/api/query
- Format: Atom XML
- Rate limit: 3 seconds between requests (recommended)
- Documentation: https://info.arxiv.org/help/api/user-manual.html
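For scripting directly against the raw API, a query URL can be assembled with the standard library. The parameter names below (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) come from the arXiv API user manual, not from this project:

```python
from urllib.parse import urlencode

BASE_URL = "https://export.arxiv.org/api/query"

# Parameter names per the arXiv API user manual
params = {
    "search_query": "ti:transformer AND cat:cs.LG",
    "start": 0,
    "max_results": 5,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
}

# The API returns an Atom XML feed; fetch this URL with any HTTP client.
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```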
## License
This is a personal project for interacting with arXiv's public API.