Skip to main content
Glama
pubtator3_paper.txt26.2 kB
Nucleic Acids Research, 2024, 52, W540–W546 https://doi.org/10.1093/nar/gkae235 Advance access publication date: 4 April 2024 Web Server issue PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge Chih-Hsuan Wei † , Alexis Allot † , Po-Ting Lai , Robert Leaman , Shubo Tian , Ling Luo , Qiao Jin , Zhizheng Wang , Qingyu Chen and Zhiyong Lu * National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA To whom correspondence should be addressed. Tel: +1 301 594 7089; Email: zhiyong.lu@nih.gov The first two authors should be regarded as Joint First Authors. Present addresses: Alexis Allot, The Neuro (Montreal Neurological Institute-Hospital), McGill University, Montreal, Quebec H3A 2B4, Canada. Ling Luo, School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China. Qingyu Chen, Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT 06510, USA. † Abstract PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0’s online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery. Graphical abstract Introduction The biomedical literature is a primary resource to address information needs across the biological and clinical sciences (1), however the requirements for literature search vary widely. Activities such as formulating a research hypothesis require an exploratory approach, whereas tasks like interpreting the clinical significance of genetic variants are more focused. Traditional keyword-based search methods have long formed the foundation of biomedical literature search (2). While generally effective for basic search, these methods also have significant limitations, such as missing relevant articles due to differing terminology or including irrelevant articles because surface-level term matches cannot adequately represent the required association between query terms. These limitations cost time and risk information needs remaining unmet. Natural language processing (NLP) methods provide substantial value for creating bioinformatics resources (3–5), and may improve literature search by enabling semantic and relation search (6). In semantic search, users indicate specific concepts of interest (entities) for which the system has precomputed matches regardless of the terminology used. Relation search increases precision by allowing users to specify the Received: January 18, 2024. Revised: March 2, 2024. Editorial Decision: March 16, 2024. Accepted: March 21, 2024 Published by Oxford University Press on behalf of Nucleic Acids Research 2024. This work is written by (a) US Government employee(s) and is in the public domain in the US. Downloaded from https://academic.oup.com/nar/article/52/W1/W540/7640526 by guest on 10 March 2025 * W541 Nucleic Acids Research, 2024, Vol. 52, Web Server issue type of relationship desired between entities, such as whether a chemical enhances or reduces expression of a gene. In this regard, we present PubTator 3.0, a novel resource engineered to support semantic and relation search in the biomedical literature. Its search capabilities allow users to explore automated entity annotations for six key biomedical entities: genes, diseases, chemicals, genetic variants, species, and cell lines. PubTator 3.0 also identifies and makes searchable 12 common types of relations between entities, enhancing its utility for both targeted and exploratory searches. Focusing on relations and entity types of interest across the biomedical sciences allows PubTator 3.0 to retrieve information precisely while providing broad utility (see detailed comparisons with its predecessor in Supplementary Table S1). The PubTator 3.0 online interface, illustrated in Figure 1 and Supplementary Figure S1, is designed for interactive literature exploration, supporting semantic, relation, keyword, and Boolean queries. An auto-complete function provides semantic search suggestions to assist users with query formulation. For example, it automatically suggests replacing either ‘COVID-19 or "SARS-CoV-2 infection’ with the semantic term ‘@DISEASE_COVID_19 . Relation queries – new to PubTator 3.0 – provide increased precision, allowing users to target articles which discuss specific relationships between entities. PubTator 3.0 offers unified search results, simultaneously searching approximately 36 million PubMed abstracts and over 6 million full-text articles from the PMC Open Access Subset (PMC-OA), improving access to the substantial amount of relevant information present in the article full text (7). Search results are prioritized based on the depth of the relationship between the query terms: articles containing identifiable relations between semantic terms receive the highest priority, while articles where semantic or keyword terms cooccur nearby (e.g. within the same sentence) receive secondary priority. Search results are also prioritized based on the article section where the match appears (e.g. matches within the title receive higher priority). Users can further refine results by employing filters, narrowing articles returned to specific publication types, journals, or article sections. PubTator 3.0 is supported by an NLP pipeline, depicted in Figure 2A. This pipeline, run weekly, first identifies articles newly added to PubMed and PMC-OA. Articles are then processed through three major steps: (i) named entity recognition, provided by the recently developed deep-learning transformer model AIONER (8), (ii) identifier mapping and (iii) relation extraction, performed by BioREx (9) of 12 common types of relations (described in Supplementary Table S2). In total, PubTator 3.0 contains over 1.6 billion entity annotations (4.6 million unique identifiers) and 33 million relations (8.8 million unique pairs). It provides enhanced entity recognition and normalization performance over its previous version, PubTator 2 (10), also known as PubTator Central (Figure 2B and Supplementary Table S3). We show the relation extraction performance of PubTator 3.0 in Figure 2C and its comparison results to the previous state-of-the-art systems (11–13) on the BioCreative V Chemical-Disease Relation (14) corpus, finding that PubTator 3.0 provided substantially higher accuracy. Moreover, when evaluating a randomized sample of entity pair queries compared to PubMed and Google Scholar, Materials and methods Data sources and article processing PubTator 3.0 downloads new articles weekly from the BioC PubMed API (https://www.ncbi.nlm.nih.gov/research/bionlp/ APIs/BioC-PubMed/) and the BioC PMC API (https://www. ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/) in BioCXML format (16). Local abbreviations are identified using Ab3P (17). Article text and extracted data are stored internally using MongoDB and indexed for search with Solr, ensuring robust and scalable accessibility unconstrained by external dependencies such as the NCBI eUtils API. Entity recognition and normalization/linking PubTator 3.0 uses AIONER (8), a recently developed named entity recognition (NER) model, to recognize entities of six types: genes/proteins, chemicals, diseases, species, genetic variants, and cell lines. AIONER utilizes a flexible tagging scheme to integrate training data created separately into a single resource. These training datasets include NLM-Gene (18), NLM-Chem (19), NCBI-Disease (20), BC5CDR (14), tmVar3 (21), Species-800 (22), BioID (23) and BioRED (15). This consolidation creates a larger training set, improving the model’s ability to generalize to unseen data. Furthermore, it enables recognizing multiple entity types simultaneously, enhancing efficiency and simplifying the challenge of distinguishing boundaries between entities that reference others, such as the disorder ‘Alpha-1 antitrypsin deficiency’ and the protein ‘Alpha-1 antitrypsin’. We previously evaluated the performance of AIONER on 14 benchmark datasets (8), including the test sets for the aforementioned training sets. This evaluation demonstrated that AIONER’s performance surpasses or matches previous state-of-the-art methods. Entity mentions found by AIONER are normalized (linked) to a unique identifier in an appropriate entity database. Normalization is performed by a module designed for (or adapted to) each entity type, using the latest version. The recentlyupgraded GNorm2 system (24) normalizes genes to NCBI Gene identifiers and species mentions to NCBI Taxonomy. tmVar3 (21), also recently upgraded, normalizes genetic variants; it uses dbSNP identifiers for variants listed in dbSNP and HGNV format otherwise. Chemicals are normalized by the NLM-Chem tagger (19) to MeSH identifiers (25). TaggerOne (26) normalizes diseases to MeSH and cell lines to Cellosaurus (27) using a new normalization-only mode. This mode only applies the normalization model, which converts both mentions and lexicon names into high-dimensional TFIDF vectors and learns a mapping, as before. However, it now augments the training data by mapping each lexicon name to itself, resulting in a large performance improvement for names present in the lexicon but not in the annotated training data. These enhancements provide a significant overall improvement in entity normalization performance (Supplementary Table S3). Relation extraction Relations for PubTator 3.0 are extracted by the unified relation extraction model BioREx (9), designed to simulta- Downloaded from https://academic.oup.com/nar/article/52/W1/W540/7640526 by guest on 10 March 2025 System overview PubTator 3.0 consistently returns a greater number of articles with higher precision in the top 20 results (Figure 2D and Supplementary Table S4). W542 Nucleic Acids Research, 2024, Vol. 52, Web Server issue neously extract 12 types of relations across eight entity type pairs: chemical–chemical, chemical–disease, chemical– gene, chemical–variant, disease–gene, disease–variant, gene– gene and variant–variant. Detailed definitions of these relation types and their corresponding entity pairs are presented in Supplementary Table S2. Deep-learning methods for relation extraction, such as BioREx, require ample training data. However, training data for relation extraction is fragmented into many datasets, often tailored to specific entity pairs. BioREx overcomes this limitation with a data-centric approach, reconciling discrepancies between disparate training datasets to construct a comprehensive, unified dataset. We evaluated the relations extracted by BioREx using performance on manually annotated relation extraction datasets as well as a comparative analysis between BioREx and notable comparable systems. BioREx established a new performance benchmark on the BioRED corpus test set (15), elevating the performance from 74.4% (F-score) to 79.6%, and demonstrating higher performance than alternative models such as transfer learning (TL), multi-task learning (MTL), and stateof-the-art models trained on isolated datasets (9). For PubTator 3.0, we replaced its deep learning module, PubMedBERT (28), with LinkBERT (29), further increasing the performance to 82.0%. Furthermore, we conducted a comparative analysis between BioREx and SemRep (11), a widely used rule- based method for extracting diverse relations, the CD-REST (13) system, and the previous state-of-the-art system (12), using the BioCreative V Chemical Disease Relation corpus test set (14). Our evaluation demonstrated that PubTator 3.0 provided substantially higher F-score than previous methods. Programmatic access and data formats PubTator 3.0 offers programmatic access through its API and bulk download. The API (https://www.ncbi. nlm.nih.gov/research/pubtator3/) supports keyword, entity and relation search, and also supports exporting annotations in XML and JSON-based BioC (16) formats and tab-delimited free text. The PubTator 3.0 FTP site (https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3) provides bulk downloads of annotated articles and extraction summaries for entities and relations. Programmatic access supports more flexible query options; for example, the information need ‘what chemicals reduce expression of JAK1?’ can be answered directly via API (e.g. https: //www.ncbi.nlm.nih.gov/research/pubtator3-api/relations? e1=@GENE_JAK1&type=negative_correlate&e2=Chemical) or by filtering the bulk relations file. Additionally, the PubTator 3.0 API supports annotation of user-defined free text. Downloaded from https://academic.oup.com/nar/article/52/W1/W540/7640526 by guest on 10 March 2025 Figure 1. PubTator 3.0 system overview and search results page: 1. Query auto-complete enhances search accuracy and synonym matching. 2. Natural language processing (NLP)-enhanced relevance: Search results are prioritized according to the strength of the relationship between the entities queried. 3. Users can further refine results with facet filters—section, journal and type. 4. Search results include highlighted entity snippets explaining relevance. 5. Histogram visualizes number of results by publication year. 6. Entity highlighting can be switched on or off according to user preference. Nucleic Acids Research, 2024, Vol. 52, Web Server issue W543 Case study I: entity relation queries We analyzed the retrieval quality of PubTator 3.0 by preparing a series of 12 entity pairs to serve as case studies for comparison between PubTator 3.0, PubMed and Google Scholar. To provide an equal comparison, we filtered about 30% of the Google Scholar results for articles not present in PubMed. To ensure that the number of results would remain low enough to allow filtering Google Scholar results for articles not in PubMed, we identified entity pairs first discussed together in the literature in 2022 or later. We then randomly selected two entity pairs of each of the following types: disease/gene, chemical/disease, chemical/gene, chemical/chemical, gene/gene and disease/variant. None of the relation pairs selected appears in the training set. The comparison was performed with respect to a snapshot of the search results returned by all search engines on 19 May 2023. We manually evaluated the top 20 results for each system and each query; articles were judged to be relevant if they mentioned both entities in the query and supported a relationship between them. Two curators independently judged each article, and discrepancies were discussed until agreement. The curators were not blinded to the retrieval method but were required to record the text supporting the relationship, if relevant. This experiment evaluated the relevance of the top 20 results for each retrieval method, regardless of whether the article appeared in PubMed. Downloaded from https://academic.oup.com/nar/article/52/W1/W540/7640526 by guest on 10 March 2025 Figure 2. (A) The PubTator 3.0 processing pipeline: AIONER (8) identifies six types of entities in PubMed abstracts and PMC-OA full-text articles. Entity annotations are associated with database identifiers by specialized mappers and BioREx (9) identifies relations between entities. Extracted data is stored in MongoDB and made searchable using Solr. (B) Entity recognition performance for each entity type compared with PubTator2 (also known as PubTatorCentral) (13) on the BioRED corpus (15). (C) Relation extraction performance compared with SemRep (11) and notable previous best systems (12,13) on the BioCreative V Chemical-Disease Relation (14) corpus. (D) Comparison of information retrieval for PubTator 3.0, PubMed, and Google Scholar for entity pair queries, with respect to total article count and top-20 article precision. W544 Case study II: retrieval-augmented generation In the era of large language models (LLMs), PubTator 3.0 can also enhance their factual accuracy via retrieval augmented generation. Despite their strong language ability, LLMs are prone to generating incorrect assertions, sometimes known as hallucinations (30,31). For example, when requested to cite sources for questions such as ‘which diseases can doxorubicin treat’, GPT-4 frequently provides seemingly plausible but nonexistent references. Augmenting GPT-4 with PubTator 3.0 APIs can anchor the model’s response to verifiable references via the extracted relations, significantly reducing hallucinations. We assessed the citation accuracy of responses from three GPT-4 variations: PubTator-augmented GPT-4, PubMedaugmented GPT-4 and standard GPT-4. We performed a qualitative evaluation based on eight questions selected as follows. We identified entities mentioned in the PubMed query logs and randomly selected from entities searched both frequently and rarely. We then identified the common queries for each entity that request relational information and adapted one into a natural language question. Each question is therefore grounded on common information needs of real PubMed users. For example, the questions ‘What can be caused by tocilizumab?’ and ‘What can be treated by doxorubicin?’ are adapted from the user queries ‘tocilizumab side effects’ and ‘doxorubicin treatment’ respectively. Such questions typically require extracting information from multiple articles and an understanding of biomedical entities and relationship descriptions. Supplementary Table S5 lists the questions chosen. We augmented the GPT-4 large language model (LLM) with PubTator 3.0 via the function calling mechanism of the OpenAI ChatCompletion API. This integration involved prompt- ing GPT-4 with descriptions of three PubTator APIs: (i) find entity ID, which retrieves PubTator entity identifiers; (ii) find related entities, which identifies related entities based on an input entity and specified relations and (iii) export relevant search results, which returns PubMed article identifiers containing textual evidence for specific entity relationships. Our instructions prompted GPT-4 to decompose user questions into sub-questions addressable by these APIs, execute the function calls, and synthesize the responses into a coherent final answer. Our prompt promoted a summarized response by instructing GPT-4 to start its message with ‘Summary:’ and requested the response include citations to the articles providing evidence. The PubMed augmentation experiments provided GPT-4 with access to PubMed database search via the National Center for Biotechnology Information (NCBI) E-utils APIs (32). We used Azure OpenAI Services (version 2023-0701-preview) and GPT-4 (version 2023-06-13) and set the decoding temperature to zero to obtain deterministic outputs. The full prompts are provided in Supplementary Table S6. PubTator-augmented GPT-4 generally processed the questions in three steps: (i) finding the standard entity identifiers, (ii) finding its related entity identifiers and (iii) searching PubMed articles. For example, to answer ‘What drugs can treat breast cancer?’, GPT-4 first found the PubTator entity identifier for breast cancer (@DISEASE_Breast_Cancer) using the Find Entity ID API. It then used the Find Related Entities API to identify entities related to @DISEASE_Breast_Cancer through a ‘treat’ relation. For demonstration purposes, we limited the maximum number of output entities to five. Finally, GPT-4 called the Export Relevant Search Results API for the PubMed article identifiers containing evidence for these relationships. The raw responses to each prompt for each method are provided in Supplementary Table S6. We manually evaluated the accuracy of the citations in the responses by reviewing each PubMed article and verifying whether each PubMed article cited supported the stated relationship (e.g. Tamoxifen treating breast cancer). Supplementary Table S5 reports the proportion of the cited articles with valid supporting evidence for each method. GPT4 frequently generated fabricated citations, widely known as the hallucination issue. While PubMed-augmented GPT-4 showed a higher proportion of accurate citations, some articles cited did not support the relation claims. This is likely because PubMed is based on keyword and Boolean search and does not support queries for specific relationships. Responses generated by PubTator-augmented GPT-4 demonstrated the highest level of citation accuracy, underscoring the potential of PubTator 3.0 as a high-quality knowledge source for addressing biomedical information needs through retrievalaugmented generation with LLMs such as GPT-4. In our experiment, using Azure for ChatGPT, the cost was approximately $1 for two questions with GPT-4-Turbo, or 40 questions when downgraded to GPT-3.5-Turbo, including the cost of input/output tokens. Discussion Previous versions of PubTator have fulfilled over one billion API requests since 2015, supporting a wide range of research applications. Numerous studies have harnessed PubTator annotations for disease-specific gene research, including efforts to prioritize candidate genes (33), determine gene–phenotype associations (34), and identify the genetic underpinnings of Downloaded from https://academic.oup.com/nar/article/52/W1/W540/7640526 by guest on 10 March 2025 Our analysis is summarized in Figure 2D, and Supplementary Table S4 presents a detailed comparison of the quality of retrieved results between PubTator 3.0, PubMed and Google Scholar. Our results demonstrate that PubTator 3.0 retrieves a greater number of articles than the comparison systems and its precision is higher for the top 20 results. For instance, PubTator 3.0 returned 346 articles for the query ‘GLPG0634 + ulcerative colitis’, and manual review of the top 20 articles showed that all contained statements about an association between GLPG0634 and ulcerative colitis. In contrast, PubMed only returned a total of 18 articles, with only 12 mentioning an association. Moreover, when searching for ‘COVID19 + PON1’, PubTator 3.0 returns 212 articles in PubMed, surpassing the 43 articles obtained from Google Scholar, only 29 of which are sourced from PubMed. These disparities can be attributed to several factors: (i) PubTator 3.0’s search includes full texts available in PMC-OA, resulting in significantly broader coverage of articles, (ii) entity normalization improves recall, for example, by matching ‘paraoxonase 1’ to ‘PON1’, (iii) PubTator 3.0 prioritizes articles containing relations between the query entities, (iv) Pubtator 3.0 prioritizes articles where the entities appear nearby, rather than distant paragraphs. Across the 12 information retrieval case studies, PubTator 3.0 demonstrated an overall precision of 90.0% for the top 20 articles (216 out of 240), which is significantly higher than PubMed’s precision of 81.6% (84 out of 103) and Google Scholar’s precision of 48.5% (98 out of 202). Nucleic Acids Research, 2024, Vol. 52, Web Server issue W545 Nucleic Acids Research, 2024, Vol. 52, Web Server issue Conclusion PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery. The PubTator 3.0 interface, API, and bulk file downloads are available at https: //www.ncbi.nlm.nih.gov/research/pubtator3/. Data availability Data is available through the online interface at https:// www.ncbi.nlm.nih.gov/research/pubtator3/, through the API at https://www.ncbi.nlm.nih.gov/research/pubtator3/api or bulk FTP download at https://ftp.ncbi.nlm.nih.gov/pub/lu/ PubTator3/. The source code for each component of PubTator 3.0 is openly accessible. The AIONER named entity recognizer is available at https://github.com/ncbi/AIONER. GNorm2, for gene name normalization, is available at https://github. com/ncbi/GNorm2. The tmVar3 variant name normalizer is available at https://github.com/ncbi/tmVar3. The NLMChem Tagger, for chemical name normalization, is available at https://ftp.ncbi.nlm.nih.gov/pub/lu/NLMChem. The TaggerOne system, for disease and cell line normalization, is available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/ taggerone. The BioREx relation extraction system is available at https://github.com/ncbi/BioREx. The code for customizing ChatGPT with the PubTator 3.0 API is available at https: //github.com/ncbi-nlp/pubtator-gpt. The details of the applications, performance, evaluation data, and citations for each tool are shown in Supplementary Table S7. All source code is also available at https://doi.org/10.5281/zenodo.10839630. Supplementary data Supplementary Data are available at NAR Online. Funding Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health; ODSS Support of the Exploration of Cloud in NIH Intramural Research. Funding for open access charge: Intramural Research Program of the National Library of Medicine, National Institutes of Health. Conflict of interest statement None declared.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/genomoncology/biomcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server