TxtAI MCP Server



06_Extractive_QA_with_Elasticsearch.ipynb•17.7 KiB

{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "zzZbP0LM6m5z" }, "source": [ "# Extractive QA with Elasticsearch\n", "\n", "txtai is datastore agnostic; the library analyzes sets of text. The following example shows how extractive question-answering can be added on top of an Elasticsearch system." ] }, { "cell_type": "markdown", "metadata": { "id": "xk7t5Jcd6reO" }, "source": [ "# Install dependencies\n", "\n", "Install `txtai` and `Elasticsearch`." ] }, { "cell_type": "code", "metadata": { "id": "0y1UA4-q-YdA" }, "source": [ "%%capture\n", "\n", "# Install txtai and elasticsearch python client\n", "!pip install git+https://github.com/neuml/txtai elasticsearch\n", "\n", "# Download and extract elasticsearch\n", "!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz\n", "!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz\n", "!chown -R daemon:daemon elasticsearch-7.10.1" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "nKWz-C5gCJy8" }, "source": [ "Start an instance of Elasticsearch directly within this notebook. 
" ] }, { "cell_type": "code", "metadata": { "id": "3ZfJeWbM6wmj" }, "source": [ "import os\n", "from subprocess import Popen, PIPE, STDOUT\n", "\n", "# If issues are encountered with this section, ES can be manually started as follows:\n", "# ./elasticsearch-7.10.1/bin/elasticsearch\n", "\n", "# Start and wait for server\n", "server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))\n", "!sleep 30" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "TWEn4w68-D1y" }, "source": [ "# Download data\n", "\n", "This example works off a subset of the [CORD-19](https://www.semanticscholar.org/cord19) dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.\n", "\n", "The following download is a SQLite database generated from a [Kaggle notebook](https://www.kaggle.com/davidmezzetti/cord-19-slim/output). More information on this data format can be found in the [CORD-19 Analysis](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings) notebook." ] }, { "cell_type": "code", "metadata": { "id": "8tVrIqSq-KBa" }, "source": [ "%%capture\n", "!wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz\n", "!gunzip tests.gz\n", "!mv tests articles.sqlite" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hSWFzkCn61tM" }, "source": [ "# Load data into Elasticsearch\n", "\n", "The following block copies rows from SQLite to Elasticsearch." 
] }, { "cell_type": "code", "metadata": { "id": "So-OBvUT61QD", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "9647b8f8-8471-41bf-ccfa-a75306665638" }, "source": [ "import sqlite3\n", "\n", "import regex as re\n", "\n", "from elasticsearch import Elasticsearch, helpers\n", "\n", "# Connect to ES instance\n", "es = Elasticsearch(hosts=[\"http://localhost:9200\"], timeout=60, retry_on_timeout=True)\n", "\n", "# Connection to database file\n", "db = sqlite3.connect(\"articles.sqlite\")\n", "cur = db.cursor()\n", "\n", "# Elasticsearch bulk buffer\n", "buffer = []\n", "rows = 0\n", "\n", "# Select tagged sentences without an NLP label. NLP labels are set for non-informative sentences.\n", "cur.execute(\"SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null\")\n", "for row in cur:\n", "  # Build dict of name-value pairs for fields\n", "  article = dict(zip((\"id\", \"article\", \"title\", \"published\", \"reference\", \"name\", \"text\"), row))\n", "  name = article[\"name\"]\n", "\n", "  # Only process certain document sections\n", "  if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n", "    # Bulk action fields\n", "    article[\"_id\"] = article[\"id\"]\n", "    article[\"_index\"] = \"articles\"\n", "\n", "    # Buffer article\n", "    buffer.append(article)\n", "\n", "    # Increment number of articles processed\n", "    rows += 1\n", "\n", "    # Bulk load every 1000 records\n", "    if rows % 1000 == 0:\n", "      helpers.bulk(es, buffer)\n", "      buffer = []\n", "\n", "      print(\"Inserted {} articles\".format(rows), end=\"\\r\")\n", "\n", "if buffer:\n", "  helpers.bulk(es, buffer)\n", "\n", "print(\"Total articles inserted: {}\".format(rows))\n" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Total articles inserted: 21499\n" ] 
} ] }, { "cell_type": "markdown", "metadata": { "id": "X5RO-VNwzMAo" }, "source": [ "# Query data\n", "\n", "The following runs a query against Elasticsearch for the terms \"risk factors\". It finds the top 5 matches and returns the corresponding documents associated with each match.\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "ucd9mwSfFTMm", "colab": { "base_uri": "https://localhost:8080/", "height": 348 }, "outputId": "b21d6aff-6abe-48f5-9914-7b7fb8472adb" }, "source": [ "import pandas as pd\n", "\n", "from IPython.display import display, HTML\n", "\n", "pd.set_option(\"display.max_colwidth\", None)\n", "\n", "query = {\n", " \"_source\": [\"article\", \"title\", \"published\", \"reference\", \"text\"],\n", " \"size\": 5,\n", " \"query\": {\n", " \"query_string\": {\"query\": \"risk factors\"}\n", " }\n", "}\n", "\n", "results = []\n", "for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n", " source = result[\"_source\"]\n", " results.append((source[\"title\"], source[\"published\"], source[\"reference\"], source[\"text\"]))\n", "\n", "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\"])\n", "\n", "display(HTML(df.to_html(index=False)))" ], "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/html": [ "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>Title</th>\n", " <th>Published</th>\n", " <th>Reference</th>\n", " <th>Match</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n", " <td>2020-04-24 00:00:00</td>\n", " <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n", " <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n", " </tr>\n", " <tr>\n", " <td>Does apolipoprotein E 
genotype predict COVID-19 severity?</td>\n", " <td>2020-04-27 00:00:00</td>\n", " <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n", " <td>Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .</td>\n", " </tr>\n", " <tr>\n", " <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n", " <td>2020-07-23 00:00:00</td>\n", " <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n", " <td>BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.</td>\n", " </tr>\n", " <tr>\n", " <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n", " <td>2020-03-15 00:00:00</td>\n", " <td>https://doi.org/10.7150/ijbs.45134</td>\n", " <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n", " </tr>\n", " <tr>\n", " <td>Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study</td>\n", " <td>2020-05-11 00:00:00</td>\n", " <td>http://medrxiv.org/cgi/content/short/2020.05.06.20092957v1?rss=1</td>\n", " <td>In addition, many risk factors for covid-19 documented in the literature are highly correlated and it is not clear which may be independently related to risk.</td>\n", " </tr>\n", " </tbody>\n", "</table>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {} } ] }, { "cell_type": "markdown", "metadata": { "id": "ylxOKji1-9_K" }, "source": [ "# Derive columns with Extractive QA\n", "\n", "The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions are asked of the document. The answers are added as a derived column per article." 
] }, { "cell_type": "code", "metadata": { "id": "mwBTrCkcOM_H" }, "source": [ "%%capture\n", "from txtai.embeddings import Embeddings\n", "from txtai.pipeline import Extractor\n", "\n", "# Create embeddings model, backed by sentence-transformers & transformers\n", "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\"})\n", "\n", "# Create extractor instance using qa model designed for the CORD-19 dataset\n", "extractor = Extractor(embeddings, \"NeuML/bert-small-cord19qa\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Yv75Lh-cOpL9", "colab": { "base_uri": "https://localhost:8080/", "height": 400 }, "outputId": "adee88e1-02bf-4a20-febb-6d2c170a63f9" }, "source": [ "import copy\n", "\n", "document = {\n", "  \"_source\": [\"id\", \"name\", \"text\"],\n", "  \"size\": 1000,\n", "  \"query\": {\n", "    \"term\": {\"article\": None}\n", "  },\n", "  \"sort\" : [\"id\"]\n", "}\n", "\n", "def sections(article):\n", "  rows = []\n", "\n", "  # Deep copy so the shared query template is not mutated\n", "  search = copy.deepcopy(document)\n", "  search[\"query\"][\"term\"][\"article\"] = article\n", "\n", "  for result in es.search(index=\"articles\", body=search)[\"hits\"][\"hits\"]:\n", "    source = result[\"_source\"]\n", "    name, text = source[\"name\"], source[\"text\"]\n", "\n", "    if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n", "      rows.append(text)\n", "\n", "  return rows\n", "\n", "results = []\n", "for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n", "  source = result[\"_source\"]\n", "\n", "  # Use QA extractor to derive additional columns\n", "  answers = extractor([(\"Risk factors\", \"risk factor\", \"What are names of risk factors?\", False),\n", "                       (\"Locations\", \"city country state\", \"What are names of locations?\", False)], sections(source[\"article\"]))\n", "\n", "  results.append((source[\"title\"], source[\"published\"], source[\"reference\"], source[\"text\"]) + tuple([answer[1] for answer in 
answers]))\n", "\n", "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\", \"Risk Factors\", \"Locations\"])\n", "\n", "display(HTML(df.to_html(index=False)))" ], "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/html": [ "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>Title</th>\n", " <th>Published</th>\n", " <th>Reference</th>\n", " <th>Match</th>\n", " <th>Risk Factors</th>\n", " <th>Locations</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n", " <td>2020-05-21 00:00:00</td>\n", " <td>https://doi.org/10.1002/cpt.1910</td>\n", " <td>Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .</td>\n", " <td>sex, obesity, genetic factors and mechanical factors</td>\n", " <td>None</td>\n", " </tr>\n", " <tr>\n", " <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n", " <td>2020-04-24 00:00:00</td>\n", " <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n", " <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n", " <td>None</td>\n", " <td>Abbott, Abbott Park, Illinois</td>\n", " </tr>\n", " <tr>\n", " <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n", " <td>2020-04-27 00:00:00</td>\n", " <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n", " <td>Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .</td>\n", " <td>None</td>\n", " <td>None</td>\n", " </tr>\n", " <tr>\n", " <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n", " <td>2020-07-23 
00:00:00</td>\n", " <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n", " <td>BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.</td>\n", " <td>Frailty and multimorbidity</td>\n", " <td>comorbidity groupings and the corresponding health conditions</td>\n", " </tr>\n", " <tr>\n", " <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n", " <td>2020-03-15 00:00:00</td>\n", " <td>https://doi.org/10.7150/ijbs.45134</td>\n", " <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n", " <td>Mandatory contact tracing and quarantine</td>\n", " <td>cities, provinces, and countries</td>\n", " </tr>\n", " </tbody>\n", "</table>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {} } ] } ] }
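
The load step in the notebook buffers rows and flushes them to Elasticsearch every 1000 records, with a final flush for the remainder. A minimal sketch of that batching pattern follows, using only the standard library so it runs without a server. The helper name `batched_actions`, the two-column `sections` table and the batch size of 2 are hypothetical and chosen for illustration; in the notebook each batch would instead be handed to `helpers.bulk(es, buffer)`.

```python
import sqlite3


def batched_actions(rows, index, size=1000):
    """Yield lists of bulk actions, each list holding at most `size` items."""
    buffer = []
    for row in rows:
        # Build a dict of name-value pairs, then add the bulk action fields
        action = dict(zip(("id", "text"), row))
        action["_id"] = action["id"]
        action["_index"] = index
        buffer.append(action)

        # Emit a full batch and start a new one
        if len(buffer) == size:
            yield buffer
            buffer = []

    # Emit whatever is left over after the last full batch
    if buffer:
        yield buffer


# Hypothetical in-memory table standing in for articles.sqlite
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sections (id INTEGER PRIMARY KEY, text TEXT)")
db.executemany(
    "INSERT INTO sections (id, text) VALUES (?, ?)",
    [(i, "sentence {}".format(i)) for i in range(5)],
)

cursor = db.execute("SELECT id, text FROM sections ORDER BY id")
batches = list(batched_actions(cursor, "articles", size=2))

for batch in batches:
    print(len(batch), [action["_id"] for action in batch])
```

Writing the buffer as a generator keeps the flush logic in one place, so the trailing partial batch cannot be forgotten.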


