# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Academia MCP is an MCP (Model Context Protocol) server that provides tools for searching, fetching, analyzing, and reporting on scientific papers and datasets. It integrates with multiple academic APIs (ArXiv, ACL Anthology, Semantic Scholar, Hugging Face) and web search providers (Exa, Brave, Tavily), plus optional LLM-powered document analysis tools.
**Key Features:**
- ArXiv and ACL Anthology search/download
- Semantic Scholar citation graphs
- Hugging Face datasets search
- Web search and page crawling
- LaTeX compilation and PDF reading
- LLM-powered document QA and research proposal workflows
**Tech Stack:**
- Python 3.12+ with type hints (strict mypy)
- FastMCP framework for the MCP server
- OpenAI SDK for LLM calls (via OpenRouter)
- Pydantic for data models and settings
- Fire for CLI argument parsing
- Multiple transport options: stdio, SSE, streamable-http
## Development Commands
**IMPORTANT: Always prefer `make` commands when available.** The Makefile provides consistent, tested workflows.
### Setup
```bash
# Create virtual environment and install dependencies
uv venv .venv
make install
```
### Validation (ALWAYS run before committing)
```bash
# Format code with black (line length: 100)
make black
# Run all validation: black, flake8, mypy --strict
make validate
```
`make validate` is the most important command; run it frequently during development.
### Testing
```bash
# Run full test suite (via make)
make test
# Run a single test file
uv run pytest -s ./tests/test_arxiv_search.py
# Run a specific test
uv run pytest -s ./tests/test_arxiv_search.py::test_arxiv_search
```
### Running the Server Locally
```bash
# Run with streamable-http (default, port 5056)
uv run -m academia_mcp --transport streamable-http
# Run with stdio (for Claude Desktop)
uv run -m academia_mcp --transport stdio
# Run with custom port
uv run -m academia_mcp --transport streamable-http --port 8080
```
### Publishing
```bash
make publish # Builds and publishes to PyPI
```
## Architecture
### Server Initialization (server.py)
The `create_server()` function in `academia_mcp/server.py` is the heart of the application:
1. **Core Tools** (always available): arxiv_search, arxiv_download, anthology_search, s2_* (Semantic Scholar), hf_datasets_search, visit_webpage, get_latex_templates_list, show_image, yt_transcript
2. **Conditional Tool Registration** (based on environment variables):
   - `WORKSPACE_DIR` set → enables compile_latex, download_pdf_paper, read_pdf
   - `OPENROUTER_API_KEY` set → enables LLM tools (document_qa, review_pdf_paper, bitflip tools, describe_image)
   - `EXA_API_KEY`/`BRAVE_API_KEY`/`TAVILY_API_KEY` set → enables respective web_search tools
3. **Transport Modes**:
   - `stdio`: for local MCP clients (Claude Desktop)
   - `streamable-http`: HTTP with CORS enabled for browser clients
   - `sse`: server-sent events
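
The conditional registration above can be sketched as a pure function over the environment. This is a minimal illustration, not the code in `server.py`: the function name `enabled_tools` and the three provider-specific web search tool names are hypothetical, and the `s2_*` and bitflip tools are left out because their exact names are not listed here. Only the environment variable names and the remaining tool names come from the lists above.

```python
from collections.abc import Mapping

# Tool names taken from the lists above; s2_* and bitflip tools are omitted
# because their exact names are not spelled out in this document.
CORE_TOOLS = [
    "arxiv_search",
    "arxiv_download",
    "anthology_search",
    "hf_datasets_search",
    "visit_webpage",
    "get_latex_templates_list",
    "show_image",
    "yt_transcript",
]
WORKSPACE_TOOLS = ["compile_latex", "download_pdf_paper", "read_pdf"]
LLM_TOOLS = ["document_qa", "review_pdf_paper", "describe_image"]
# Hypothetical names for the provider-specific web search tools.
WEB_SEARCH_TOOLS = {
    "EXA_API_KEY": "exa_web_search",
    "BRAVE_API_KEY": "brave_web_search",
    "TAVILY_API_KEY": "tavily_web_search",
}


def enabled_tools(env: Mapping[str, str]) -> list[str]:
    """Return the tool names that would be registered for a given environment."""
    tools = list(CORE_TOOLS)
    if env.get("WORKSPACE_DIR"):
        tools += WORKSPACE_TOOLS
    if env.get("OPENROUTER_API_KEY"):
        tools += LLM_TOOLS
    for key, tool_name in WEB_SEARCH_TOOLS.items():
        if env.get(key):
            tools.append(tool_name)
    return tools
```

Passing `os.environ` to a function like this is a quick way to check which tools a given `.env` would expose before starting the server.
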
### Tool Structure
All tools live in `academia_mcp/tools/` and follow this pattern:
- Each tool is a standalone async function with type hints
- Tools use Pydantic models for inputs/outputs (enables structured_output mode)
- Most tools are registered with `structured_output=True` for schema validation
- Tools import from shared utilities (`utils.py`, `llm.py`, `settings.py`)
**Key Tool Categories:**
- **Search tools**: arxiv_search.py, anthology_search.py, s2.py, hf_datasets_search.py, web_search.py
- **Fetch/download tools**: arxiv_download.py, visit_webpage.py, review.py
- **Document processing**: latex.py (compile_latex, read_pdf), image_processing.py
- **LLM-powered tools**: document_qa.py, bitflip.py (research proposals), review.py
### Settings Management (settings.py)
Uses `pydantic-settings` to load configuration from `.env` file or environment variables:
- API keys: OPENROUTER_API_KEY, TAVILY_API_KEY, EXA_API_KEY, BRAVE_API_KEY, OPENAI_API_KEY
- Model names: REVIEW_MODEL_NAME, BITFLIP_MODEL_NAME, DOCUMENT_QA_MODEL_NAME, DESCRIBE_IMAGE_MODEL_NAME
- Workspace: WORKSPACE_DIR (Path), PORT (int)
- Authentication: ENABLE_AUTH (bool, default False), TOKENS_FILE (Path, default ./tokens.json)
- All settings accessible via `from academia_mcp.settings import settings`
### Authentication System (auth/)
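The authentication system provides optional token-based security for HTTP transports. The middleware is currently attached only to streamable-http (see Server Integration below); stdio and sse are unaffected by default.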
**Architecture:**
- Integrated authentication model (single server handles both token validation and MCP tools)
- Bearer token validation via Starlette middleware
- Token storage in JSON file with metadata (client_id, scopes, expiration, etc.)
- CLI commands for token lifecycle management
**Key Components:**
1. **Token Models** (`academia_mcp/auth/models.py`):
   - `TokenMetadata`: Stores token_id, client_id, scopes, issued_at, expires_at, description, revoked, last_used
   - `TokenStore`: Container for all tokens with version tracking
   - Token format: `mcp_<32 hex chars>` (128 bits of entropy via `secrets.token_hex(16)`)
2. **Token Manager** (`academia_mcp/auth/token_manager.py`):
   - `generate_token()`: Creates cryptographically secure tokens
   - `issue_token()`: Creates and persists new token with metadata
   - `validate_token()`: Checks existence, revocation status, and expiration
   - `list_tokens()`: Returns all non-revoked tokens
   - `revoke_token()`: Marks token as revoked
   - `update_last_used()`: Updates last usage timestamp
   - File locking for concurrent access safety
   - Atomic writes via temp file + rename
3. **Authentication Middleware** (`academia_mcp/auth/middleware.py`):
   - `BearerTokenAuthMiddleware`: Starlette BaseHTTPMiddleware implementation
   - Intercepts HTTP requests before they reach MCP tools
   - Validates `Authorization: Bearer <token>` header
   - Returns 401 with WWW-Authenticate header on auth failures
   - Skips OPTIONS requests (CORS preflight compatibility)
   - Updates last_used timestamp asynchronously (non-blocking)
4. **Server Integration** (`academia_mcp/server.py:165-169`):
   - Middleware added BEFORE CORS when `ENABLE_AUTH=true`
   - Only applies to streamable-http transport (stdio/sse unaffected by default)
   - Logs "Authentication enabled for streamable-http transport" when active
5. **CLI Commands** (`academia_mcp/auth/cli.py`):
   - `AuthCLI` class with Fire-compatible methods
   - `issue_token()`: Issues new token, displays ONCE with rich formatting
   - `list_tokens()`: Displays table with token prefixes, client IDs, timestamps
   - `revoke_token()`: Revokes token by ID
**CLI Usage:**
```bash
# Issue token
academia_mcp auth issue-token --client-id=my-client --description="Production"
# Issue with expiration
academia_mcp auth issue-token --client-id=test --expires-days=30
# List active tokens
academia_mcp auth list-tokens
# Revoke token
academia_mcp auth revoke-token mcp_a1b2c3d4e5f6...
# Run server with auth
ENABLE_AUTH=true academia_mcp run --transport=streamable-http
```
**Security Considerations:**
- Tokens stored in plaintext (standard for bearer tokens) with file permissions 600
- 128 bits of entropy for cryptographically secure token generation
- Tokens displayed only once during issuance
- HTTPS strongly recommended for production use
- Last-used timestamps for audit trails
**Testing:**
- Unit tests: `tests/test_auth.py` (token manager, middleware)
- Integration tests: `tests/test_server_auth.py` (server with auth enabled/disabled)
### LLM Integration (llm.py)
Two main functions for calling LLMs via OpenRouter:
- `llm_acall()`: unstructured text response
- `llm_acall_structured()`: structured response with Pydantic validation (uses OpenAI's `.parse()` with retry logic)
Both use `ChatMessage` model for message formatting.
### Utilities (utils.py)
Common helper functions used across tools:
- `get_with_retries()`: HTTP GET with retry logic
- File handling utilities
- Text processing helpers
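
A helper like `get_with_retries()` typically wraps a GET call in a bounded retry loop with a delay between attempts. The sketch below is an assumption about that general shape, not the actual implementation in `utils.py`: the name `get_with_retries_sketch`, its parameters, the exponential backoff, and the use of `urllib` are all illustrative.

```python
import time
import urllib.error
import urllib.request
from collections.abc import Callable


def _default_fetch(url: str, timeout: float) -> bytes:
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def get_with_retries_sketch(
    url: str,
    attempts: int = 3,
    backoff_seconds: float = 1.0,
    timeout: float = 30.0,
    fetch: Callable[[str, float], bytes] = _default_fetch,
) -> bytes:
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            return fetch(url, timeout)
        except (urllib.error.URLError, TimeoutError) as error:
            last_error = error
            if attempt < attempts - 1:
                # Exponential backoff: 1x, 2x, 4x, ... the base delay.
                time.sleep(backoff_seconds * 2**attempt)
    raise RuntimeError(f"GET {url} failed after {attempts} attempts") from last_error
```

The injectable `fetch` argument exists only so the retry logic can be exercised in tests without network access.
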
## Adding New Tools
To add a new tool:
1. Create a new file in `academia_mcp/tools/` (e.g., `my_tool.py`)
2. Define Pydantic models for input/output if using structured output
3. Implement an async function with proper type hints
4. Export the function in `academia_mcp/tools/__init__.py`
5. Register the tool in `create_server()` in `academia_mcp/server.py`
6. Add tests in `tests/test_my_tool.py`
Example pattern:
```python
from pydantic import BaseModel, Field


class MyToolInput(BaseModel):
    query: str = Field(description="Search query")


class MyToolOutput(BaseModel):
    result: str = Field(description="Result")


async def my_tool(query: str) -> MyToolOutput:
    """Describe what the tool does, its parameters, and what it returns."""
    # Implementation
    return MyToolOutput(result="...")
```
Then in server.py:
```python
from academia_mcp.tools.my_tool import my_tool
# ...
server.add_tool(my_tool, structured_output=True)
```
## Testing Notes
- Tests use pytest with asyncio support (see `[tool.pytest.ini_options]` in pyproject.toml)
- `conftest.py` contains shared fixtures
- Tests requiring API keys should check for env vars or use mocking
- Workspace-dependent tests use `tests/workdir/` for temporary files
## Code Style
- Line length: 100 characters (black)
- Strict mypy type checking
- Import sorting with isort
- All public APIs should have type hints
- Use Pydantic models for data validation
### Comments and Documentation
**DO NOT write inline comments explaining what code does.** The code should be self-explanatory through:
- Clear variable and function names
- Type hints
- Well-structured code
**ONLY write docstrings for MCP tools** (functions registered with `server.add_tool()`). These docstrings become the tool descriptions in the MCP protocol, so they must clearly explain:
- What the tool does
- What parameters it accepts
- What it returns
Example of acceptable docstring for an MCP tool:
```python
async def arxiv_search(query: str, limit: int = 10) -> ArxivSearchResponse:
    """
    Search arXiv for papers matching the query.

    Supports field-specific queries (e.g., 'ti:neural networks' for title search).
    Returns paper metadata including title, authors, abstract, and arXiv ID.
    """
    ...
```
**Do not write docstrings** for internal helper functions, utilities, or Pydantic models; type hints and clear naming are sufficient.
## Environment Variables for Testing
When testing locally, create a `.env` file in the project root:
```
OPENROUTER_API_KEY=your_key_here
WORKSPACE_DIR=/path/to/workspace
# Optional: EXA_API_KEY, BRAVE_API_KEY, TAVILY_API_KEY, OPENAI_API_KEY
```
## LaTeX/PDF Requirements
For LaTeX compilation and PDF processing:
- Install TeX Live: `sudo apt install texlive-latex-base texlive-fonts-recommended texlive-latex-extra texlive-science latexmk`
- Ensure `pdflatex` and `latexmk` are on PATH
## Docker
Pre-built image available: `phoenix120/academia_mcp`
Build locally:
```bash
docker build -t academia_mcp .
```
Run with workspace volume:
```bash
docker run --rm -p 5056:5056 \
  -e OPENROUTER_API_KEY=your_key \
  -e WORKSPACE_DIR=/workspace \
  -v "$PWD/workdir:/workspace" \
  academia_mcp
```