paperstack
Provides tools for searching, retrieving, and analyzing research papers from arXiv, including PDF download, text extraction, citation graphs, and paper comparisons.
Enables extraction of official code repositories from papers, comparison of paper claims with GitHub implementations, and reproducibility assessment.
Extracts Kaggle dataset and competition links from research papers.
Uses a local Ollama LLM to generate audience-specific explanations of research papers.
Integrates with Semantic Scholar API to enhance citation graph analysis and author/paper citation networks.
paperstack (Model Context Protocol)
Overview
paperstack is a production-grade Model Context Protocol (MCP) server focused on arXiv research retrieval.
It provides:
arXiv Atom API search by ID/query
PDF download, validation, and cache
PDF text extraction (title, abstract, body, references)
Token-aware context chunking for LLM pipelines
CLI, API, and autonomous agent integration support
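As a rough illustration of the first item, the sketch below builds a query URL for the public arXiv Atom API and parses the returned feed with the standard library. It is not paperstack's own client: the function names `build_query_url` and `parse_atom_feed` are hypothetical, and the `all:` field prefix is just one of the search fields the arXiv API accepts.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(query: str, max_results: int = 3) -> str:
    """Build a search URL for the arXiv Atom API (hypothetical helper)."""
    params = {"search_query": f"all:{query}", "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def _text(entry: ET.Element, tag: str) -> str:
    """Return the whitespace-normalized text of a child element, or ''."""
    raw = entry.findtext(f"atom:{tag}", default="", namespaces=ATOM_NS)
    return " ".join(raw.split())


def parse_atom_feed(xml_text: str) -> list[dict]:
    """Extract id, title, and summary from each <entry> in an Atom feed."""
    root = ET.fromstring(xml_text)
    return [
        {"id": _text(e, "id"), "title": _text(e, "title"), "summary": _text(e, "summary")}
        for e in root.findall("atom:entry", ATOM_NS)
    ]
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response body to `parse_atom_feed` yields one dictionary per paper.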
Quickstart
1. Clone repository
git clone https://github.com/Aldrin-Joan/paperstack.git
cd paperstack
2. Set up Python environment (recommended)
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate
3. Install dependencies
pip install -r requirements.txt
4. Run smoke test
python test_smoke.py
Installation
From source:
pip install -e .
From PyPI:
pip install paperstack-mcp
Usage
CLI
paperstack --help
Run server locally:
python -m src.mcp_server
Python API
from paperstack_mcp import entrypoint # import alias for the package
from src.arxiv_client import ArxivClient
from src.pdf_fetcher import PdfFetcher
from src.pdf_parser import PdfParser
from src.context_builder import ContextBuilder
client = ArxivClient()
results = client.search('quantum computing', max_results=3)
pdf_path = PdfFetcher().fetch_paper(results[0].id)
parsed = PdfParser().parse(pdf_path)
context = ContextBuilder().build(parsed)
print(context.summary)
Architecture Layers
Layer | Features |
Layer 1 — retrieval (both tools have this) | Search · PDF fetch + cache · Text extraction + chunking |
Layer 2 — intelligence (your opportunity) | Citation graph · Concept extraction · Cross-paper synthesis |
Layer 3 — dev tooling (highly unique) | Code + dataset links · Implementation diff · Reproducibility audit |
Layer 4 — research workflows (unique) | Reading lists · Topic tracking + alerts · Agent-ready Q&A |
MCP Server
src/mcp_server/__main__.py starts an MCP tool server exposing:
arxiv_search (query or ID expand)
arxiv_fetch_pdf (download + cache)
arxiv_parse_pdf (extract text and metadata)
arxiv_build_context (chunk to LLM-friendly context)
arxiv_citation_graph (author/paper citation network)
arxiv_extract_contributions (structured contribution extractor)
arxiv_semantic_index (semantic similarity index builder/query)
arxiv_compare_papers (paper comparison report)
arxiv_extract_code_links (discover official GitHub/HuggingFace/Kaggle links from a paper)
arxiv_reproducibility_score (reproducibility heuristic score with evidence details)
arxiv_diff_implementations (compare paper method claims against a GitHub implementation)
arxiv_reading_list (persistent reading list CRUD and filters)
arxiv_watch_topic (watch query topics and detect new papers)
arxiv_explain_for_audience (audience-specific explanation synthesis)
Use any MCP-capable client (VS Code MCP extension, custom agent SDK) to connect.
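For readers writing a custom client, the sketch below shows the general shape of the JSON-RPC 2.0 message that MCP uses to invoke a tool. The helper `make_tool_call` is hypothetical, and the argument names `query` and `max_results` are assumptions; the real input schema for each tool is whatever the server reports in response to a `tools/list` request.

```python
import json
from itertools import count

_request_ids = count(1)  # each JSON-RPC request in a session needs its own id


def make_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Argument names are assumed for illustration; check the tool's input schema.
request = make_tool_call("arxiv_search", {"query": "quantum computing", "max_results": 3})
```

In practice an MCP client library handles this framing and the transport, so the message rarely needs to be built by hand.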
VS Code MCP server setup
In VS Code, add an MCP server entry to your workspace settings (e.g., .vscode/settings.json):
{
  "servers": {
    "arxiv-mcp": {
      "command": "D:/Softwares/Anaconda3/python.exe",
      "args": ["-m", "src.mcp_server"],
      "cwd": "${workspaceFolder}",
      "env": {
        "PYTHONPATH": "${workspaceFolder}",
        "ARXIV_DOWNLOAD_DIR": "${workspaceFolder}/downloads",
        "ARXIV_KEEP_PDFS": "true",
        "CHUNK_SIZE_TOKENS": "800",
        "CHUNK_OVERLAP_TOKENS": "100",
        "ARXIV_RATE_LIMIT_DELAY": "3.0",
        "MAX_RETRIES": "3",
        "HTTP_TIMEOUT": "60"
      }
    }
  }
}
MCP JSON entry for paperstack-mcp
If you installed from PyPI (pip install paperstack-mcp), the MCP server command can be the package executable instead of a direct Python module path. In .vscode/mcp.json or your .code-workspace settings, use an entry like:
{
  "servers": {
    "paperstack-mcp": {
      "command": "paperstack-mcp",
      "args": [],
      "cwd": "C:\\path\\to\\your\\project",
      "env": {
        "PYTHONPATH": "C:\\path\\to\\your\\project",
        "ARXIV_DOWNLOAD_DIR": "C:\\path\\to\\your\\project\\downloads",
        "ARXIV_KEEP_PDFS": "false",
        "CHUNK_SIZE_TOKENS": "800",
        "CHUNK_OVERLAP_TOKENS": "100",
        "ARXIV_RATE_LIMIT_DELAY": "3.0",
        "MAX_RETRIES": "3",
        "HTTP_TIMEOUT": "60"
      }
    }
  }
}
Adjust values for your local path, rate limit, and retry/timeouts.
Run pip install paperstack-mcp first.
Ensure workspace cwd and PYTHONPATH point to the project root.
Customize ARXIV_DOWNLOAD_DIR for your downloaded PDF cache location.
ARXIV_DOWNLOAD_DIR: local storage for downloaded PDFs.
ARXIV_KEEP_PDFS: keep cached PDFs after parse.
CHUNK_SIZE_TOKENS / CHUNK_OVERLAP_TOKENS: control text chunking in the context builder.
ARXIV_RATE_LIMIT_DELAY: delay between arXiv API calls.
MAX_RETRIES, HTTP_TIMEOUT: network robustness.
You can also apply this configuration in other compatible MCP clients using their server configuration schema.
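To make the two chunking variables concrete, here is a minimal sketch of overlapping-window chunking driven by CHUNK_SIZE_TOKENS and CHUNK_OVERLAP_TOKENS. It is an illustration under assumptions, not the context builder's actual code: `chunk_tokens` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
import os


def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Defaults mirror the values used in the configuration examples.
size = int(os.environ.get("CHUNK_SIZE_TOKENS", "800"))
overlap = int(os.environ.get("CHUNK_OVERLAP_TOKENS", "100"))
# Whitespace splitting is a stand-in for a real tokenizer.
chunks = chunk_tokens("some long paper text".split(), size, overlap)
```

With the example values of 800 and 100, each chunk after the first repeats the last 100 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.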
Project structure
src/ - package source
  arxiv_client/ - arXiv Atom API logic
  pdf_fetcher/ - download/cache PDF
  pdf_parser/ - extract/clean PDF text
  context_builder/ - tokenization + chunking
  mcp_server/ - MCP protocol/adapters
tests/ - pytest suite
requirements.txt - dependencies
pyproject.toml - package metadata
Configuration
Environment variables:
ARXIV_CACHE_DIR (default: ./downloads)
ARXIV_CACHE_TTL (default: 604800 seconds / 7 days)
ARXIV_DB_PATH (default: ${ARXIV_DOWNLOAD_DIR}/arxiv_mcp.db) path to the SQLite workflow database
ARXIV_RATE_LIMIT (default: 1 request/sec)
S2_API_KEY (optional; Semantic Scholar API key for higher rate limits)
OLLAMA_BASE_URL (default: http://localhost:11434)
OLLAMA_MODEL (default: mistral)
SEMANTIC_INDEX_DIR (default: ${ARXIV_DOWNLOAD_DIR}/semantic_index)
CITATION_CACHE_TTL (default: 86400 seconds / 24 hours)
CONTRIBUTION_CACHE_TTL (default: 604800 seconds / 7 days)
EMBEDDING_MODEL (default: sentence-transformers/all-MiniLM-L6-v2)
GITHUB_TOKEN (optional; for GitHub API auth, improves 60 -> 5000 req/hour)
LINK_CACHE_TTL (default: 172800 seconds / 48 hours)
REPRO_CACHE_TTL (default: 604800 seconds / 7 days)
DIFF_CACHE_TTL (default: 86400 seconds / 24 hours)
GITHUB_MAX_FILES (default: 20)
GITHUB_MAX_FILE_SIZE_KB (default: 50)
Set in shell or via .env before running.
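The following sketch shows one way such variables could be read with their documented defaults. The `Settings` dataclass and `load_settings` function are hypothetical names introduced for illustration, and only a handful of the variables are covered; how paperstack itself loads its configuration is not shown in this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """A small, illustrative subset of the documented variables."""
    cache_dir: str
    cache_ttl: int
    ollama_base_url: str
    ollama_model: str
    s2_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, falling back to the documented defaults."""
    return Settings(
        cache_dir=env.get("ARXIV_CACHE_DIR", "./downloads"),
        cache_ttl=int(env.get("ARXIV_CACHE_TTL", "604800")),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "mistral"),
        s2_api_key=env.get("S2_API_KEY") or None,  # optional; empty means unset
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"ARXIV_CACHE_TTL": "60"})`.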
Testing
Run full tests:
pytest -q
Smoke test:
python test_smoke.py
Troubleshooting
arxiv-mcp command not found: ensure the virtualenv is active and the package is installed.
PDF download failure: check network access to https://arxiv.org/pdf/
Rate-limit errors: lower request frequency or adjust ARXIV_RATE_LIMIT.
Topic duplicates observed after repeated tests: use DatabaseClient.reset() on the workflow DB; topic_watcher.add now also enforces dedupe by (query, label).
Reading list duplicate notes: ReadingListManager.add now avoids re-appending identical note blocks.
Ollama not available fallback: _passthrough now uses arXiv metadata.abstract for all explanation fields (what_it_is/problem_solved/how_it_works/why_it_matters/key_result).
Dependency pin check: pip install -r requirements.txt includes protobuf==3.20.3 and urllib3>=2.0.0,<3 to avoid known warning/conflict cases (TensorFlow + ChromaDB MessageFactory and Requests RequestsDependencyWarning).
Smoke harness summary: scripts/run_all_tools.py prints final status with count of run/passed/failed tools.
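For the rate-limit item above, the sketch below shows the general idea of spacing out requests by a fixed delay, in the spirit of ARXIV_RATE_LIMIT_DELAY. The `Throttle` class is a hypothetical illustration rather than paperstack's implementation; the clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum delay between consecutive calls."""

    def __init__(
        self,
        delay: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Block until `delay` seconds have passed since the last call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.delay - (now - self._last))
        if pause > 0:
            self._sleep(pause)
        self._last = now + pause
        return pause
```

A caller would create one `Throttle(3.0)` and call `wait()` before each arXiv request, so back-to-back requests are spaced at least three seconds apart.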
Contributing
Fork repo
Create feature branch
Add tests and update README
Open PR
Follow style checks (Black, formatting and lint).
License
Apache-2.0