Skip to main content
Glama
ElliotPadfield

Unpaywall MCP Server

unpaywall_fetch_pdf_text

Extract text from open access research papers by DOI or PDF URL to analyze academic content without accessing paywalls.

Instructions

Download and extract text from best OA PDF for a DOI, or from a provided PDF URL.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
doiNoDOI string or DOI URL. Used if pdf_url is not provided.
pdf_urlNoDirect PDF URL to download and parse (takes precedence over DOI).
emailNoEmail to identify requests to Unpaywall (required when resolving via DOI).
truncate_charsNoMax characters of extracted text to return (default 20000).

Implementation Reference

  • Handler for TOOL_FETCH_PDF_TEXT: resolves PDF URL from DOI if provided, downloads the PDF buffer, extracts text using pdf-parse, applies truncation if specified, and returns structured JSON with text and metadata.
    if (tool === TOOL_FETCH_PDF_TEXT) {
      const args = (req.params.arguments ?? {}) as Partial<FetchPdfTextArgs>;
      const truncate = args.truncate_chars && Number.isFinite(args.truncate_chars)
        ? Math.max(1000, Math.floor(Number(args.truncate_chars)))
        : 20000;
    
      let pdfUrl = (args.pdf_url ?? "").toString().trim();
      if (!pdfUrl) {
        const rawDoi = (args.doi ?? "").toString().trim();
        if (!rawDoi) {
          return { content: [{ type: "text", text: "Provide either 'pdf_url' or 'doi'" }], isError: true };
        }
        const email = (args.email || process.env.UNPAYWALL_EMAIL || "").toString().trim();
        if (!email) {
          return { content: [{ type: "text", text: "Unpaywall requires an email. Set UNPAYWALL_EMAIL or pass 'email'." }], isError: true };
        }
        const doi = normalizeDoi(rawDoi);
        const obj = await fetchUnpaywallByDoi(doi, email);
        const best = obj?.best_oa_location ?? null;
        const locations: any[] = Array.isArray(obj?.oa_locations) ? obj.oa_locations : [];
        const pickPdfFrom = (locs: any[]) => locs.find(l => l?.url_for_pdf) || locs.find(l => l?.url);
        pdfUrl = best?.url_for_pdf || (pickPdfFrom(locations)?.url_for_pdf || pickPdfFrom(locations)?.url) || "";
        if (!pdfUrl) {
          return { content: [{ type: "text", text: "No OA PDF URL found for the provided DOI." }], isError: true };
        }
      }
    
      // Download and parse PDF
      const pdfBuffer = await downloadPdfAsBuffer(pdfUrl);
      const parsed = await pdfParse(pdfBuffer);
      const text = parsed.text || "";
      const truncated = text.length > truncate;
      const output = {
        pdf_url: pdfUrl,
        length_chars: text.length,
        truncated,
        text: truncated ? text.slice(0, truncate) : text,
        metadata: {
          n_pages: parsed.numpages ?? undefined,
          info: parsed.info ?? undefined,
          metadata: parsed.metadata ?? undefined,
        },
      };
      return { content: [{ type: "json", json: output }] };
    }
  • src/index.ts:178-192 (registration)
    Tool registration in the ListTools response, including name, description, and input schema.
    {
      name: TOOL_FETCH_PDF_TEXT,
      description: "Download and extract text from best OA PDF for a DOI, or from a provided PDF URL.",
      inputSchema: {
        type: "object",
        properties: {
          doi: { type: "string", description: "DOI string or DOI URL. Used if pdf_url is not provided." },
          pdf_url: { type: "string", description: "Direct PDF URL to download and parse (takes precedence over DOI)." },
          email: { type: "string", description: "Email to identify requests to Unpaywall (required when resolving via DOI)." },
          truncate_chars: { type: "integer", minimum: 1000, description: "Max characters of extracted text to return (default 20000)." },
        },
        required: [],
        additionalProperties: false,
      },
    },
  • TypeScript type definition matching the tool's input schema for type safety in the handler.
    type FetchPdfTextArgs = {
      doi?: string; // if provided, we will resolve best OA PDF via Unpaywall
      pdf_url?: string; // optional direct PDF URL (takes precedence if provided)
      email?: string; // required if using DOI
      truncate_chars?: number; // optional truncation to avoid massive outputs (default 20000)
    };
  • Helper function to download PDF as buffer with timeout, size limit (30MB), and streaming to prevent memory issues.
    async function downloadPdfAsBuffer(url: string, maxBytes = 30 * 1024 * 1024) {
      // Limit to 30MB by default to avoid extremely large downloads
      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), 30_000);
      try {
        const resp = await fetch(url, {
          signal: controller.signal,
          headers: {
            "Accept": "application/pdf, application/octet-stream;q=0.9,*/*;q=0.8",
          },
          redirect: "follow",
        });
        if (!resp.ok) {
          const text = await resp.text().catch(() => "");
          throw new Error(`PDF download HTTP ${resp.status}: ${text.slice(0, 400)}`);
        }
        const reader = resp.body?.getReader();
        if (!reader) return Buffer.from(await resp.arrayBuffer());
        const chunks: Uint8Array[] = [];
        let received = 0;
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          if (value) {
            received += value.byteLength;
            if (received > maxBytes) throw new Error(`PDF exceeds size limit of ${maxBytes} bytes`);
            chunks.push(value);
          }
        }
        return Buffer.concat(chunks);
      } finally {
        clearTimeout(timeout);
      }
    }
  • Helper to fetch Unpaywall metadata by DOI, used to resolve best OA PDF URL when DOI is provided.
    async function fetchUnpaywallByDoi(doi: string, email: string) {
      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), 20_000);
      try {
        const url = `https://api.unpaywall.org/v2/${encodeURIComponent(doi)}?email=${encodeURIComponent(email)}`;
        const resp = await fetch(url, { signal: controller.signal, headers: { "Accept": "application/json" } });
        if (!resp.ok) {
          const text = await resp.text().catch(() => "");
          throw new Error(`Unpaywall HTTP ${resp.status}: ${text.slice(0, 400)}`);
        }
        return await resp.json();
      } finally {
        clearTimeout(timeout);
      }
    }

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ElliotPadfield/unpaywall-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server