Skip to main content
Glama
06_Extractive_QA_with_Elasticsearch.ipynb18.1 kB
{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "zzZbP0LM6m5z" }, "source": [ "# Extractive QA with Elasticsearch\n", "\n", "txtai is datastore agnostic, the library analyzes sets of text. The following example shows how extractive question-answering can be added on top of an Elasticsearch system." ] }, { "cell_type": "markdown", "metadata": { "id": "xk7t5Jcd6reO" }, "source": [ "# Install dependencies\n", "\n", "Install `txtai` and `Elasticsearch`." ] }, { "cell_type": "code", "metadata": { "id": "0y1UA4-q-YdA" }, "source": [ "%%capture\n", "\n", "# Install txtai and elasticsearch python client\n", "!pip install git+https://github.com/neuml/txtai elasticsearch\n", "\n", "# Download and extract elasticsearch\n", "!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz\n", "!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz\n", "!chown -R daemon:daemon elasticsearch-7.10.1" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "nKWz-C5gCJy8" }, "source": [ "Start an instance of Elasticsearch directly within this notebook. " ] }, { "cell_type": "code", "metadata": { "id": "3ZfJeWbM6wmj" }, "source": [ "import os\n", "from subprocess import Popen, PIPE, STDOUT\n", "\n", "# If issues are encountered with this section, ES can be manually started as follows:\n", "# ./elasticsearch-7.10.1/bin/elasticsearch\n", "\n", "# Start and wait for server\n", "server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))\n", "!sleep 30" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "TWEn4w68-D1y" }, "source": [ "# Download data\n", "\n", "This example is going to work off a subset of the [CORD-19](https://www.semanticscholar.org/cord19) dataset. COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.\n", "\n", "The following download is a SQLite database generated from a [Kaggle notebook](https://www.kaggle.com/davidmezzetti/cord-19-slim/output). More information on this data format, can be found in the [CORD-19 Analysis](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings) notebook." ] }, { "cell_type": "code", "metadata": { "id": "8tVrIqSq-KBa" }, "source": [ "%%capture\n", "!wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz\n", "!gunzip tests.gz\n", "!mv tests articles.sqlite" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hSWFzkCn61tM" }, "source": [ "# Load data into Elasticsearch\n", "\n", "The following block copies rows from SQLite to Elasticsearch." ] }, { "cell_type": "code", "metadata": { "id": "So-OBvUT61QD", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "9647b8f8-8471-41bf-ccfa-a75306665638" }, "source": [ "import sqlite3\n", "\n", "import regex as re\n", "\n", "from elasticsearch import Elasticsearch, helpers\n", "\n", "# Connect to ES instance\n", "es = Elasticsearch(hosts=[\"http://localhost:9200\"], timeout=60, retry_on_timeout=True)\n", "\n", "# Connection to database file\n", "db = sqlite3.connect(\"articles.sqlite\")\n", "cur = db.cursor()\n", "\n", "# Elasticsearch bulk buffer\n", "buffer = []\n", "rows = 0\n", "\n", "# Select tagged sentences without a NLP label. NLP labels are set for non-informative sentences.\n", "cur.execute(\"SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null\")\n", "for row in cur:\n", " # Build dict of name-value pairs for fields\n", " article = dict(zip((\"id\", \"article\", \"title\", \"published\", \"reference\", \"name\", \"text\"), row))\n", " name = article[\"name\"]\n", "\n", " # Only process certain document sections\n", " if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n", " # Bulk action fields\n", " article[\"_id\"] = article[\"id\"]\n", " article[\"_index\"] = \"articles\"\n", "\n", " # Buffer article\n", " buffer.append(article)\n", "\n", " # Increment number of articles processed\n", " rows += 1\n", "\n", " # Bulk load every 1000 records\n", " if rows % 1000 == 0:\n", " helpers.bulk(es, buffer)\n", " buffer = []\n", "\n", " print(\"Inserted {} articles\".format(rows), end=\"\\r\")\n", "\n", "if buffer:\n", " helpers.bulk(es, buffer)\n", "\n", "print(\"Total articles inserted: {}\".format(rows))\n" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Total articles inserted: 21499\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "X5RO-VNwzMAo" }, "source": [ "# Query data\n", "\n", "The following runs a query against Elasticsearch for the terms \"risk factors\". It finds the top 5 matches and returns the corresponding documents associated with each match.\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "ucd9mwSfFTMm", "colab": { "base_uri": "https://localhost:8080/", "height": 348 }, "outputId": "b21d6aff-6abe-48f5-9914-7b7fb8472adb" }, "source": [ "import pandas as pd\n", "\n", "from IPython.display import display, HTML\n", "\n", "pd.set_option(\"display.max_colwidth\", None)\n", "\n", "query = {\n", " \"_source\": [\"article\", \"title\", \"published\", \"reference\", \"text\"],\n", " \"size\": 5,\n", " \"query\": {\n", " \"query_string\": {\"query\": \"risk factors\"}\n", " }\n", "}\n", "\n", "results = []\n", "for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n", " source = result[\"_source\"]\n", " results.append((source[\"title\"], source[\"published\"], source[\"reference\"], source[\"text\"]))\n", "\n", "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\"])\n", "\n", "display(HTML(df.to_html(index=False)))" ], "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/html": [ "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>Title</th>\n", " <th>Published</th>\n", " <th>Reference</th>\n", " <th>Match</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n", " <td>2020-04-24 00:00:00</td>\n", " <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n", " <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n", " </tr>\n", " <tr>\n", " <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n", " <td>2020-04-27 00:00:00</td>\n", " <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n", " <td>Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .</td>\n", " </tr>\n", " <tr>\n", " <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n", " <td>2020-07-23 00:00:00</td>\n", " <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n", " <td>BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.</td>\n", " </tr>\n", " <tr>\n", " <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n", " <td>2020-03-15 00:00:00</td>\n", " <td>https://doi.org/10.7150/ijbs.45134</td>\n", " <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n", " </tr>\n", " <tr>\n", " <td>Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study</td>\n", " <td>2020-05-11 00:00:00</td>\n", " <td>http://medrxiv.org/cgi/content/short/2020.05.06.20092957v1?rss=1</td>\n", " <td>In addition, many risk factors for covid-19 documented in the literature are highly correlated and it is not clear which may be independently related to risk.</td>\n", " </tr>\n", " </tbody>\n", "</table>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {} } ] }, { "cell_type": "markdown", "metadata": { "id": "ylxOKji1-9_K" }, "source": [ "# Derive columns with Extractive QA\n", "\n", "The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions are asked of the document. The answers are added as a derived column per article." ] }, { "cell_type": "code", "metadata": { "id": "mwBTrCkcOM_H" }, "source": [ "%%capture\n", "from txtai.embeddings import Embeddings\n", "from txtai.pipeline import Extractor\n", "\n", "# Create embeddings model, backed by sentence-transformers & transformers\n", "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\"})\n", "\n", "# Create extractor instance using qa model designed for the CORD-19 dataset\n", "extractor = Extractor(embeddings, \"NeuML/bert-small-cord19qa\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Yv75Lh-cOpL9", "colab": { "base_uri": "https://localhost:8080/", "height": 400 }, "outputId": "adee88e1-02bf-4a20-febb-6d2c170a63f9" }, "source": [ "document = {\n", " \"_source\": [\"id\", \"name\", \"text\"],\n", " \"size\": 1000,\n", " \"query\": {\n", " \"term\": {\"article\": None}\n", " },\n", " \"sort\" : [\"id\"]\n", "}\n", "\n", "def sections(article):\n", " rows = []\n", "\n", " search = document.copy()\n", " search[\"query\"][\"term\"][\"article\"] = article\n", "\n", " for result in es.search(index=\"articles\", body=search)[\"hits\"][\"hits\"]:\n", " source = result[\"_source\"]\n", " name, text = source[\"name\"], source[\"text\"]\n", "\n", " if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n", " rows.append(text)\n", " \n", " return rows\n", "\n", "results = []\n", "for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n", " source = result[\"_source\"]\n", "\n", " # Use QA extractor to derive additional columns\n", " answers = extractor([(\"Risk factors\", \"risk factor\", \"What are names of risk factors?\", False),\n", " (\"Locations\", \"city country state\", \"What are names of locations?\", False)], sections(source[\"article\"]))\n", "\n", " results.append((source[\"title\"], source[\"published\"], source[\"reference\"], source[\"text\"]) + tuple([answer[1] for answer in answers]))\n", "\n", "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\", \"Risk Factors\", \"Locations\"])\n", "\n", "display(HTML(df.to_html(index=False)))" ], "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/html": [ "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>Title</th>\n", " <th>Published</th>\n", " <th>Reference</th>\n", " <th>Match</th>\n", " <th>Risk Factors</th>\n", " <th>Locations</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n", " <td>2020-05-21 00:00:00</td>\n", " <td>https://doi.org/10.1002/cpt.1910</td>\n", " <td>Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .</td>\n", " <td>sex, obesity, genetic factors and mechanical factors</td>\n", " <td>None</td>\n", " </tr>\n", " <tr>\n", " <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n", " <td>2020-04-24 00:00:00</td>\n", " <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n", " <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n", " <td>None</td>\n", " <td>Abbott, Abbott Park, Illinois</td>\n", " </tr>\n", " <tr>\n", " <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n", " <td>2020-04-27 00:00:00</td>\n", " <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n", " <td>Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .</td>\n", " <td>None</td>\n", " <td>None</td>\n", " </tr>\n", " <tr>\n", " <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n", " <td>2020-07-23 00:00:00</td>\n", " <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n", " <td>BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.</td>\n", " <td>Frailty and multimorbidity</td>\n", " <td>comorbidity groupings and the corresponding health conditions</td>\n", " </tr>\n", " <tr>\n", " <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n", " <td>2020-03-15 00:00:00</td>\n", " <td>https://doi.org/10.7150/ijbs.45134</td>\n", " <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n", " <td>Mandatory contact tracing and quarantine</td>\n", " <td>cities, provinces, and countries</td>\n", " </tr>\n", " </tbody>\n", "</table>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {} } ] } ] }

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/neuml/txtai'

If you have feedback or need assistance with the MCP directory API, please join our Discord server