parse_paper_content
Extracts the text of an arXiv research paper, preferring the HTML version and falling back to the PDF, and exposes it as a tool through the Model Context Protocol (MCP).
Instructions
Parse paper content (prefer the HTML version, fall back to PDF).
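The dispatcher listed under Implementation Reference returns only the text of the result, so the `source` field computed by the handler is not passed back to the caller. A caller that needs to know which version was parsed can read it from the header the handler writes. The following is a minimal sketch that assumes the header formats shown in the handler stay as they are; `detectContentSource` is a hypothetical helper, not part of the server.

```typescript
type ContentSource = "html" | "pdf";

// Hypothetical helper: both header variants written by the handler contain
// the label "来源: " ("source: ") followed by HTML or PDF in upper case.
function detectContentSource(toolText: string): ContentSource | null {
  const match = toolText.match(/来源: (HTML|PDF)/);
  if (!match) return null; // header missing or format changed
  return match[1] === "HTML" ? "html" : "pdf";
}

const sample = "=== 论文内容 (来源: PDF) ===\n\nIntroduction ...";
console.log(detectContentSource(sample)); // prints "pdf"
```

The match takes the first occurrence in the text, which is the header in both output layouts because the source label is written before the abstract and the body.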
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | arXiv paper URL or arXiv ID | |
| paperInfo | No | Paper information (optional; used to add paper metadata to the output) | |
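As an illustration of the schema above, the sketch below builds an MCP `tools/call` request for this tool. The interface and function names (`PaperInfo`, `ParsePaperContentArgs`, `buildParsePaperContentRequest`) are assumptions made for the example and do not appear in `src/index.ts`; only the tool name and the parameter names come from the schema.

```typescript
// Hypothetical types mirroring the input schema.
interface PaperInfo {
  title?: string;
  summary?: string;
  published?: string;
  // The handler accepts either plain strings or objects with a `name` field.
  authors?: Array<string | { name: string }>;
}

interface ParsePaperContentArgs {
  input: string; // arXiv URL or arXiv ID (required)
  paperInfo?: PaperInfo; // optional metadata echoed into the output header
}

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: ParsePaperContentArgs };
}

// Builds the JSON-RPC message an MCP client would send to invoke the tool.
function buildParsePaperContentRequest(
  id: number,
  args: ParsePaperContentArgs
): ToolCallRequest {
  if (args.input.trim() === "") {
    throw new Error("input must be a non-empty arXiv URL or arXiv ID");
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "parse_paper_content", arguments: args },
  };
}

const request = buildParsePaperContentRequest(1, { input: "1706.03762" });
console.log(JSON.stringify(request, null, 2));
```

When `paperInfo` is supplied, the handler interpolates `title`, `published`, and `summary` directly into the header, so any of those fields left out will render as `undefined` in the output. Passing all of them, or omitting `paperInfo` entirely, avoids that.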
Implementation Reference
- src/index.ts:247-318 (handler) The core handler function that implements the 'parse_paper_content' tool logic. It extracts content from arXiv papers, preferring HTML versions and falling back to PDF parsing, optionally including paper metadata.

      async function parsePaperContent(input: string, paperInfo?: any): Promise<{content: string, source: 'html' | 'pdf'}> {
        let tempPdfPath: string | null = null;

        try {
          // Get the arXiv ID
          let arxivId: string;
          if (input.startsWith('http://') || input.startsWith('https://')) {
            const urlParts = input.split('/');
            arxivId = urlParts[urlParts.length - 1];
          } else {
            arxivId = input;
          }

          // Try the HTML version first
          console.log("尝试获取 HTML 版本...");
          const htmlContent = await getArxivHtmlContent(arxivId);

          let paperText: string;
          let source: 'html' | 'pdf';

          if (htmlContent) {
            // Use the HTML version
            console.log("使用 HTML 版本解析内容");
            paperText = extractTextFromHtml(htmlContent);
            source = 'html';
          } else {
            // Fall back to the PDF version
            console.log("HTML 版本不可用,回退到 PDF 版本");
            const pdfUrl = getArxivPdfUrl(input);
            tempPdfPath = await downloadTempPdf(pdfUrl);
            paperText = await extractPdfText(tempPdfPath);
            source = 'pdf';
          }

          // Build the output content
          let outputContent = '';

          if (paperInfo) {
            outputContent += `=== 论文信息 ===\n`; // "Paper information"
            outputContent += `标题: ${paperInfo.title}\n`; // "Title"
            outputContent += `arXiv ID: ${arxivId}\n`;
            outputContent += `发布日期: ${paperInfo.published}\n`; // "Published"
            outputContent += `内容来源: ${source.toUpperCase()}\n`; // "Content source"

            if (paperInfo.authors && paperInfo.authors.length > 0) {
              // "Authors"
              outputContent += `作者: ${paperInfo.authors.map((author: any) => author.name || author).join(', ')}\n`;
            }

            outputContent += `摘要: ${paperInfo.summary}\n`; // "Abstract"
            outputContent += `\n=== 论文内容 ===\n\n`; // "Paper content"
          } else {
            // "Paper content (source: ...)"
            outputContent += `=== 论文内容 (来源: ${source.toUpperCase()}) ===\n\n`;
          }

          outputContent += paperText;

          return { content: outputContent, source };
        } catch (error) {
          console.error("解析论文内容时出错:", error);
          throw new Error(`论文内容解析失败: ${error instanceof Error ? error.message : String(error)}`);
        } finally {
          // Clean up the temporary PDF file
          if (tempPdfPath && fs.existsSync(tempPdfPath)) {
            try {
              fs.unlinkSync(tempPdfPath);
              console.log(`临时文件已删除: ${tempPdfPath}`);
            } catch (cleanupError) {
              console.warn(`清理临时文件失败: ${cleanupError}`);
            }
          }
        }
      }

- src/index.ts:369-388 (schema) The input schema definition for the 'parse_paper_content' tool, specifying the required 'input' parameter and the optional 'paperInfo' object.

      inputSchema: {
        type: "object",
        properties: {
          input: {
            type: "string",
            description: "arXiv 论文URL或 arXiv ID"
          },
          paperInfo: {
            type: "object",
            description: "论文信息(可选,用于添加论文元数据)",
            properties: {
              title: { type: "string" },
              summary: { type: "string" },
              published: { type: "string" },
              authors: { type: "array" }
            }
          }
        },
        required: ["input"]
      }

- src/index.ts:366-389 (registration) The tool registration in the ListTools response, defining the name, description, and input schema for 'parse_paper_content'.

      {
        name: "parse_paper_content",
        description: "解析论文内容(优先使用 HTML 版本,回退到 PDF)",
        inputSchema: {
          type: "object",
          properties: {
            input: {
              type: "string",
              description: "arXiv 论文URL或 arXiv ID"
            },
            paperInfo: {
              type: "object",
              description: "论文信息(可选,用于添加论文元数据)",
              properties: {
                title: { type: "string" },
                summary: { type: "string" },
                published: { type: "string" },
                authors: { type: "array" }
              }
            }
          },
          required: ["input"]
        }
      }

- src/index.ts:437-447 (registration) The switch case in CallToolRequestHandler that dispatches calls to the 'parse_paper_content' handler function.

      case "parse_paper_content": {
        const { input, paperInfo } = args as { input: string; paperInfo?: any };
        const result = await parsePaperContent(input, paperInfo);
        return {
          content: [{ type: "text", text: result.content }]
        };
      }

- src/index.ts:71-104 (helper) Key helper function that fetches the HTML version of an arXiv paper, used by the main handler.

      async function getArxivHtmlContent(arxivId: string): Promise<string | null> {
        try {
          const cleanArxivId = arxivId.replace(/v\d+$/, '');
          const htmlUrl = `https://arxiv.org/html/${cleanArxivId}`;

          console.log(`尝试获取 HTML 版本: ${htmlUrl}`);

          const response = await axios({
            method: 'GET',
            url: htmlUrl,
            timeout: 20000,
            headers: {
              'User-Agent': 'Mozilla/5.0 (compatible; ArXiv-Paper-MCP/1.0)'
            }
          });

          // Check the response status and content type
          if (response.status === 200 && response.headers['content-type']?.includes('text/html')) {
            const html = response.data;

            // Simple check that this is a valid paper HTML page (rather than an error page)
            if (html.includes('ltx_document') || html.includes('ltx_page_main') || html.includes('ltx_abstract')) {
              console.log(`成功获取 HTML 版本: ${htmlUrl}`);
              return html;
            }
          }

          console.log(`HTML 版本不可用或无效: ${htmlUrl}`);
          return null;
        } catch (error) {
          console.log(`HTML 版本获取失败,将使用 PDF: ${error instanceof Error ? error.message : String(error)}`);
          return null;
        }
      }
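The handler derives the arXiv ID by taking the last path segment of a URL. As written, a PDF link such as `https://arxiv.org/pdf/1706.03762.pdf` would yield `1706.03762.pdf`, and an old-style identifier such as `hep-th/9901001` would lose its archive prefix. The sketch below shows one possible way to normalize the input before the HTML lookup. `normalizeArxivId` is a hypothetical helper that is not part of `src/index.ts`, and it assumes URLs of the form `/abs/...`, `/pdf/...`, or `/html/...`.

```typescript
// Hypothetical helper: turns an arXiv URL or ID into a bare identifier.
function normalizeArxivId(input: string): string {
  let id = input.trim();

  // If a URL was given, keep everything after /abs/, /pdf/ or /html/.
  const urlMatch = id.match(/^https?:\/\/[^/]+\/(?:abs|pdf|html)\/(.+)$/i);
  const captured = urlMatch?.[1];
  if (captured !== undefined) {
    id = captured;
  }

  // Drop a query string or fragment, trailing slashes, and a ".pdf" suffix.
  return id
    .replace(/[?#].*$/, "")
    .replace(/\/+$/, "")
    .replace(/\.pdf$/i, "");
}

console.log(normalizeArxivId("https://arxiv.org/pdf/1706.03762.pdf"));  // prints "1706.03762"
console.log(normalizeArxivId("https://arxiv.org/abs/hep-th/9901001"));  // prints "hep-th/9901001"
console.log(normalizeArxivId("1706.03762v5"));                          // prints "1706.03762v5"
```

The version suffix is deliberately left in place here, since `getArxivHtmlContent` already strips it with `/v\d+$/` before building the HTML URL.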