unpaywall_fetch_pdf_text

Extract text from open access research papers by DOI or PDF URL to analyze academic content without accessing paywalls.

Instructions

Download and extract text from best OA PDF for a DOI, or from a provided PDF URL.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| doi | No | DOI string or DOI URL. Used if pdf_url is not provided. | |
| pdf_url | No | Direct PDF URL to download and parse (takes precedence over DOI). | |
| email | No | Email to identify requests to Unpaywall (required when resolving via DOI). | |
| truncate_chars | No | Max characters of extracted text to return (default 20000). | |
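The way these parameters interact can be sketched as follows. This is a minimal illustration based on the handler listed under Implementation Reference, not the server's own code: `planFetch`, `Plan`, and the `envEmail` parameter (standing in for the UNPAYWALL_EMAIL environment variable) are hypothetical names introduced here.

```typescript
// Hypothetical helper: mirrors the handler's argument rules; not part of the server source.
type FetchPdfTextArgs = {
  doi?: string;
  pdf_url?: string;
  email?: string;
  truncate_chars?: number;
};

type Plan =
  | { mode: "pdf_url"; pdfUrl: string; truncate: number }
  | { mode: "doi"; doi: string; email: string; truncate: number }
  | { mode: "error"; message: string };

// envEmail is an assumption: it stands in for process.env.UNPAYWALL_EMAIL.
function planFetch(args: FetchPdfTextArgs, envEmail = ""): Plan {
  // Default is 20000; explicit values are floored and raised to at least 1000.
  const truncate =
    args.truncate_chars && Number.isFinite(args.truncate_chars)
      ? Math.max(1000, Math.floor(args.truncate_chars))
      : 20000;

  // A direct PDF URL takes precedence over the DOI.
  const pdfUrl = (args.pdf_url ?? "").trim();
  if (pdfUrl) return { mode: "pdf_url", pdfUrl, truncate };

  const doi = (args.doi ?? "").trim();
  if (!doi) return { mode: "error", message: "Provide either 'pdf_url' or 'doi'" };

  // Resolving via DOI needs an email, from the arguments or the environment.
  const email = (args.email || envEmail).trim();
  if (!email) {
    return { mode: "error", message: "Unpaywall requires an email. Set UNPAYWALL_EMAIL or pass 'email'." };
  }
  return { mode: "doi", doi, email, truncate };
}
```

Under these rules a call that supplies only `doi` fails unless an email is available, while a call that supplies `pdf_url` never contacts Unpaywall at all.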

Implementation Reference

  • Handler for TOOL_FETCH_PDF_TEXT: uses pdf_url if given, otherwise resolves a PDF URL from the DOI via Unpaywall; downloads the PDF into a buffer, extracts text with pdf-parse, truncates it to truncate_chars (default 20000, minimum 1000), and returns structured JSON with the text and metadata.
    if (tool === TOOL_FETCH_PDF_TEXT) {
      const args = (req.params.arguments ?? {}) as Partial<FetchPdfTextArgs>;
      const truncate = args.truncate_chars && Number.isFinite(args.truncate_chars)
        ? Math.max(1000, Math.floor(Number(args.truncate_chars)))
        : 20000;

      let pdfUrl = (args.pdf_url ?? "").toString().trim();
      if (!pdfUrl) {
        const rawDoi = (args.doi ?? "").toString().trim();
        if (!rawDoi) {
          return { content: [{ type: "text", text: "Provide either 'pdf_url' or 'doi'" }], isError: true };
        }
        const email = (args.email || process.env.UNPAYWALL_EMAIL || "").toString().trim();
        if (!email) {
          return { content: [{ type: "text", text: "Unpaywall requires an email. Set UNPAYWALL_EMAIL or pass 'email'." }], isError: true };
        }
        const doi = normalizeDoi(rawDoi);
        const obj = await fetchUnpaywallByDoi(doi, email);
        const best = obj?.best_oa_location ?? null;
        const locations: any[] = Array.isArray(obj?.oa_locations) ? obj.oa_locations : [];
        const pickPdfFrom = (locs: any[]) => locs.find(l => l?.url_for_pdf) || locs.find(l => l?.url);
        pdfUrl = best?.url_for_pdf || (pickPdfFrom(locations)?.url_for_pdf || pickPdfFrom(locations)?.url) || "";
        if (!pdfUrl) {
          return { content: [{ type: "text", text: "No OA PDF URL found for the provided DOI." }], isError: true };
        }
      }

      // Download and parse PDF
      const pdfBuffer = await downloadPdfAsBuffer(pdfUrl);
      const parsed = await pdfParse(pdfBuffer);
      const text = parsed.text || "";
      const truncated = text.length > truncate;
      const output = {
        pdf_url: pdfUrl,
        length_chars: text.length,
        truncated,
        text: truncated ? text.slice(0, truncate) : text,
        metadata: {
          n_pages: parsed.numpages ?? undefined,
          info: parsed.info ?? undefined,
          metadata: parsed.metadata ?? undefined,
        },
      };
      return { content: [{ type: "json", json: output }] };
    }
  • src/index.ts:178-192 (registration)
    Tool registration in the ListTools response, including name, description, and input schema.
    {
      name: TOOL_FETCH_PDF_TEXT,
      description: "Download and extract text from best OA PDF for a DOI, or from a provided PDF URL.",
      inputSchema: {
        type: "object",
        properties: {
          doi: { type: "string", description: "DOI string or DOI URL. Used if pdf_url is not provided." },
          pdf_url: { type: "string", description: "Direct PDF URL to download and parse (takes precedence over DOI)." },
          email: { type: "string", description: "Email to identify requests to Unpaywall (required when resolving via DOI)." },
          truncate_chars: { type: "integer", minimum: 1000, description: "Max characters of extracted text to return (default 20000)." },
        },
        required: [],
        additionalProperties: false,
      },
    },
  • TypeScript type definition matching the tool's input schema for type safety in the handler.
    type FetchPdfTextArgs = {
      doi?: string; // if provided, we will resolve best OA PDF via Unpaywall
      pdf_url?: string; // optional direct PDF URL (takes precedence if provided)
      email?: string; // required if using DOI
      truncate_chars?: number; // optional truncation to avoid massive outputs (default 20000)
    };
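  • Helper function that downloads a PDF into a buffer with a 30-second timeout and a 30 MB size limit. The limit is checked while the response streams in, so an oversized download is aborted before it is fully buffered.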
    async function downloadPdfAsBuffer(url: string, maxBytes = 30 * 1024 * 1024) {
      // Limit to 30MB by default to avoid extremely large downloads
      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), 30_000);
      try {
        const resp = await fetch(url, {
          signal: controller.signal,
          headers: {
            "Accept": "application/pdf, application/octet-stream;q=0.9,*/*;q=0.8",
          },
          redirect: "follow",
        });
        if (!resp.ok) {
          const text = await resp.text().catch(() => "");
          throw new Error(`PDF download HTTP ${resp.status}: ${text.slice(0, 400)}`);
        }
        const reader = resp.body?.getReader();
        if (!reader) return Buffer.from(await resp.arrayBuffer());
        const chunks: Uint8Array[] = [];
        let received = 0;
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          if (value) {
            received += value.byteLength;
            if (received > maxBytes) throw new Error(`PDF exceeds size limit of ${maxBytes} bytes`);
            chunks.push(value);
          }
        }
        return Buffer.concat(chunks);
      } finally {
        clearTimeout(timeout);
      }
    }
  • Helper to fetch Unpaywall metadata by DOI, used to resolve best OA PDF URL when DOI is provided.
    async function fetchUnpaywallByDoi(doi: string, email: string) {
      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), 20_000);
      try {
        const url = `https://api.unpaywall.org/v2/${encodeURIComponent(doi)}?email=${encodeURIComponent(email)}`;
        const resp = await fetch(url, { signal: controller.signal, headers: { "Accept": "application/json" } });
        if (!resp.ok) {
          const text = await resp.text().catch(() => "");
          throw new Error(`Unpaywall HTTP ${resp.status}: ${text.slice(0, 400)}`);
        }
        return await resp.json();
      } finally {
        clearTimeout(timeout);
      }
    }
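The handler calls a `normalizeDoi` helper that is not shown on this page, and it picks the PDF URL inline. The sketch below illustrates both steps under stated assumptions: the `normalizeDoi` body is a guess at plausible behavior (stripping a doi.org URL prefix or a `doi:` label and lowercasing), and `pickPdfUrl`, `OaLocation`, and `UnpaywallRecord` are hypothetical names. Only the selection order is taken from the handler.

```typescript
// Assumed behavior only: the server's real normalizeDoi is not shown in the source.
function normalizeDoi(raw: string): string {
  return raw
    .trim()
    .replace(/^https?:\/\/(dx\.)?doi\.org\//i, "") // drop a DOI URL prefix
    .replace(/^doi:\s*/i, "") // drop a leading "doi:" label
    .toLowerCase(); // DOIs are case-insensitive
}

// Only the Unpaywall response fields that the handler reads.
type OaLocation = { url_for_pdf?: string | null; url?: string | null };
type UnpaywallRecord = {
  best_oa_location?: OaLocation | null;
  oa_locations?: OaLocation[] | null;
};

// Same precedence as the handler: the best location's PDF link, then the first
// location that has a PDF link, then the first location with any URL.
function pickPdfUrl(record: UnpaywallRecord | null | undefined): string {
  const locations: OaLocation[] = record?.oa_locations ?? [];
  const withPdf = locations.find((l) => l?.url_for_pdf);
  const withUrl = locations.find((l) => l?.url);
  return (
    record?.best_oa_location?.url_for_pdf ||
    withPdf?.url_for_pdf ||
    withUrl?.url ||
    ""
  );
}
```

Note that the last fallback may return a landing-page URL rather than a PDF, in which case the subsequent download and parse step can fail even though a URL was found.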

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ElliotPadfield/unpaywall-mcp'
