parse_paper_content
Extract structured content from arXiv papers by parsing HTML or PDF formats to retrieve text, metadata, and research information for analysis.
Instructions
Parse paper content (prefer the HTML version, fall back to PDF).
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | arXiv paper URL or arXiv ID | |
| paperInfo | No | Paper information (optional; used to add paper metadata to the output) | |
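
To make the table concrete, here is a minimal sketch of argument objects that fit this schema, together with a runtime check. The names `ParsePaperContentArgs`, `PaperInfo`, and `isParsePaperContentArgs` are hypothetical and do not appear in src/index.ts; the `authors` element type is an assumption based on the handler, which reads `author.name || author`. The sample values are placeholders.

```typescript
// Hypothetical types mirroring the documented input schema.
interface PaperInfo {
  title?: string;
  summary?: string;
  published?: string;
  // Assumption: the handler accepts plain strings or objects with a `name` field.
  authors?: Array<string | { name: string }>;
}

interface ParsePaperContentArgs {
  input: string;          // arXiv URL or arXiv ID (required)
  paperInfo?: PaperInfo;  // optional metadata
}

// Hypothetical runtime check: true when `value` has the shape the schema describes.
function isParsePaperContentArgs(value: unknown): value is ParsePaperContentArgs {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v.input !== "string") return false;
  if (v.paperInfo === undefined) return true;
  if (typeof v.paperInfo !== "object" || v.paperInfo === null || Array.isArray(v.paperInfo)) {
    return false;
  }
  const info = v.paperInfo as Record<string, unknown>;
  for (const field of ["title", "summary", "published"]) {
    if (info[field] !== undefined && typeof info[field] !== "string") return false;
  }
  return info.authors === undefined || Array.isArray(info.authors);
}

// Smallest valid call: only the required field.
const minimalArgs: ParsePaperContentArgs = { input: "2301.00001" };

// Call with optional metadata (placeholder values).
const fullArgs: ParsePaperContentArgs = {
  input: "https://arxiv.org/abs/2301.00001",
  paperInfo: {
    title: "Example title",
    summary: "Example abstract text.",
    published: "2023-01-01",
    authors: ["First Author", { name: "Second Author" }],
  },
};

console.log(isParsePaperContentArgs(minimalArgs)); // true
console.log(isParsePaperContentArgs(fullArgs));    // true
console.log(isParsePaperContentArgs({}));          // false: `input` is missing
```

A check of this kind is only a suggestion; it is not part of the tool as documented here.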
Implementation Reference
- src/index.ts:247-318 (handler) The core handler function implementing the parse_paper_content tool: it extracts text from an arXiv paper, preferring the HTML version and falling back to PDF parsing, and formats the output with optional paper metadata.

    // Uses `fs` and the helpers getArxivHtmlContent, extractTextFromHtml,
    // getArxivPdfUrl, downloadTempPdf, and extractPdfText, none of which are shown in this excerpt.
    async function parsePaperContent(input: string, paperInfo?: any): Promise<{content: string, source: 'html' | 'pdf'}> {
      let tempPdfPath: string | null = null;

      try {
        // Get the arXiv ID (for a URL, the last "/"-separated segment)
        let arxivId: string;
        if (input.startsWith('http://') || input.startsWith('https://')) {
          const urlParts = input.split('/');
          arxivId = urlParts[urlParts.length - 1];
        } else {
          arxivId = input;
        }

        // Try the HTML version first
        console.log("尝试获取 HTML 版本...");
        const htmlContent = await getArxivHtmlContent(arxivId);

        let paperText: string;
        let source: 'html' | 'pdf';

        if (htmlContent) {
          // Use the HTML version
          console.log("使用 HTML 版本解析内容");
          paperText = extractTextFromHtml(htmlContent);
          source = 'html';
        } else {
          // HTML version unavailable: fall back to the PDF version
          console.log("HTML 版本不可用,回退到 PDF 版本");
          const pdfUrl = getArxivPdfUrl(input);
          tempPdfPath = await downloadTempPdf(pdfUrl);
          paperText = await extractPdfText(tempPdfPath);
          source = 'pdf';
        }

        // Build the output.
        // Labels: 论文信息 = paper info, 标题 = title, 发布日期 = published date,
        // 内容来源 = content source, 作者 = authors, 摘要 = abstract, 论文内容 = paper content
        let outputContent = '';

        if (paperInfo) {
          outputContent += `=== 论文信息 ===\n`;
          outputContent += `标题: ${paperInfo.title}\n`;
          outputContent += `arXiv ID: ${arxivId}\n`;
          outputContent += `发布日期: ${paperInfo.published}\n`;
          outputContent += `内容来源: ${source.toUpperCase()}\n`;
          if (paperInfo.authors && paperInfo.authors.length > 0) {
            outputContent += `作者: ${paperInfo.authors.map((author: any) => author.name || author).join(', ')}\n`;
          }
          outputContent += `摘要: ${paperInfo.summary}\n`;
          outputContent += `\n=== 论文内容 ===\n\n`;
        } else {
          outputContent += `=== 论文内容 (来源: ${source.toUpperCase()}) ===\n\n`;
        }

        outputContent += paperText;

        return { content: outputContent, source };
      } catch (error) {
        // Log the failure and rethrow with a "paper content parsing failed" message
        console.error("解析论文内容时出错:", error);
        throw new Error(`论文内容解析失败: ${error instanceof Error ? error.message : String(error)}`);
      } finally {
        // Clean up the temporary PDF file
        if (tempPdfPath && fs.existsSync(tempPdfPath)) {
          try {
            fs.unlinkSync(tempPdfPath);
            console.log(`临时文件已删除: ${tempPdfPath}`);
          } catch (cleanupError) {
            console.warn(`清理临时文件失败: ${cleanupError}`);
          }
        }
      }
    }

- src/index.ts:369-388 (schema) Input schema defining the parameters of the parse_paper_content tool: a required 'input' (arXiv URL or ID) and an optional 'paperInfo' object with title, summary, published, and authors.

    inputSchema: {
      type: "object",
      properties: {
        input: {
          type: "string",
          // "arXiv paper URL or arXiv ID"
          description: "arXiv 论文URL或 arXiv ID"
        },
        paperInfo: {
          type: "object",
          // "Paper information (optional; used to add paper metadata)"
          description: "论文信息(可选,用于添加论文元数据)",
          properties: {
            title: { type: "string" },
            summary: { type: "string" },
            published: { type: "string" },
            authors: { type: "array" }
          }
        }
      },
      required: ["input"]
    }

- src/index.ts:366-389 (registration) Registration of the parse_paper_content tool in the ListToolsRequestSchema handler, specifying its name, description, and input schema.

    {
      name: "parse_paper_content",
      // "Parse paper content (prefer the HTML version, fall back to PDF)"
      description: "解析论文内容(优先使用 HTML 版本,回退到 PDF)",
      inputSchema: {
        type: "object",
        properties: {
          input: {
            type: "string",
            description: "arXiv 论文URL或 arXiv ID"
          },
          paperInfo: {
            type: "object",
            description: "论文信息(可选,用于添加论文元数据)",
            properties: {
              title: { type: "string" },
              summary: { type: "string" },
              published: { type: "string" },
              authors: { type: "array" }
            }
          }
        },
        required: ["input"]
      }
    }

- src/index.ts:437-447 (registration) Tool dispatch in the switch statement of the CallToolRequestSchema handler: it invokes parsePaperContent and returns the result as text content.

    case "parse_paper_content": {
      const { input, paperInfo } = args as { input: string; paperInfo?: any };
      const result = await parsePaperContent(input, paperInfo);
      return {
        content: [{
          type: "text",
          text: result.content
        }]
      };
    }
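
The handler derives the arXiv ID from a URL by taking its last "/"-separated segment. With that approach a trailing slash yields an empty ID, a `.pdf` suffix stays attached, and old-style IDs such as `hep-th/9901001` lose their archive prefix. The following is a minimal sketch of a stricter normalizer, offered only as an illustration: `normalizeArxivId` is a hypothetical helper that is not part of src/index.ts, and the route names `abs`, `pdf`, and `html` are assumptions about the URL shapes a caller might pass.

```typescript
// Hypothetical helper, not part of src/index.ts: turn an arXiv URL or bare ID
// into a bare ID.
function normalizeArxivId(input: string): string {
  const trimmed = input.trim();

  // Bare IDs pass through unchanged.
  if (!/^https?:\/\//i.test(trimmed)) {
    return trimmed;
  }

  // `pathname` excludes the query string and fragment.
  const { pathname } = new URL(trimmed);

  // Assumed routes: /abs/<id>, /pdf/<id>, /html/<id>, with an optional trailing slash.
  const match = pathname.match(/^\/(?:abs|pdf|html)\/(.+?)\/?$/);

  // Otherwise fall back to the last non-empty path segment.
  const id = match
    ? match[1]
    : (pathname.split("/").filter(Boolean).pop() ?? "");

  // Drop a trailing ".pdf" if present.
  return id.replace(/\.pdf$/i, "");
}

console.log(normalizeArxivId("2301.00001"));
console.log(normalizeArxivId("https://arxiv.org/abs/2301.00001"));
console.log(normalizeArxivId("https://arxiv.org/pdf/2301.00001v2.pdf"));
console.log(normalizeArxivId("https://arxiv.org/abs/hep-th/9901001/"));
```

Whether the helpers that fetch the HTML and PDF versions accept old-style or versioned IDs is not stated in this reference, so a normalizer like this would need to be checked against them.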