48_Benefits_of_hybrid_search.ipynb (64.5 kB)
{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "source": [ "# Benefits of hybrid search\n", "\n", "Semantic search is a new category of search built on recent advances in Natural Language Processing (NLP). Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.\n", "\n", "While semantic search adds amazing capabilities, sparse keyword indexes can still add value. There may be cases where finding an exact match is important or we just want a fast index to quickly do an initial scan of a dataset.\n", "\n", "Both methods have their merits. What if we combine them together to build a unified `hybrid` search capability? Can we get the best of both worlds?\n", "\n", "This notebook will explore the benefits of hybrid search." ], "metadata": { "id": "v4J3FxbUn9CT" } }, { "cell_type": "markdown", "source": [ "# Install dependencies\n", "\n", "Install `txtai` and all dependencies." ], "metadata": { "id": "W70a-UjTdDiA" } }, { "cell_type": "code", "source": [ "%%capture\n", "!pip install txtai pytrec_eval rank-bm25 elasticsearch\n", "!pip uninstall -y tensorflow" ], "metadata": { "id": "nfgwb14J4LO2" }, "execution_count": 1, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Introducing semantic, keyword and hybrid search\n", "\n", "Before diving into the benchmarks, let's briefly discuss how semantic and keyword search work.\n", "\n", "Semantic search uses large language models to vectorize inputs into arrays of numbers. Similar concepts will have similar values. The vectors are typically stored in a vector database, which is a system that specializes in storing these numerical arrays and finding matches. 
Vector search transforms an input query into a vector and then runs a search to find the best conceptual results.\n", "\n", "Keyword search tokenizes text into lists of tokens per document. These tokens are aggregated into token frequencies per document and stored in term frequency sparse arrays. At search time, the query is tokenized and the tokens of the query are compared to the tokens in the dataset. This is a more literal process. Keyword search is like string matching: it has no conceptual understanding and matches on characters and bytes.\n", "\n", "Hybrid search combines the scores from semantic and keyword indexes. Given that semantic search scores are typically between 0 and 1 and keyword search scores are unbounded, a method is needed to combine the results.\n", "\n", "The two methods supported in txtai are:\n", "\n", "- [Convex Combination](https://en.wikipedia.org/wiki/Convex_combination) when sparse scores are normalized\n", "- [Reciprocal Rank Fusion (RRF)](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) when sparse scores aren't normalized\n", "\n", "The default method in txtai is convex combination and we'll use that." ], "metadata": { "id": "gPeTqCflzP5B" } }, { "cell_type": "markdown", "source": [ "# Evaluating performance\n", "\n", "Now it's time to benchmark the results. For these tests, we'll use the BEIR dataset. We'll also use a [benchmarks script](https://raw.githubusercontent.com/neuml/txtai/master/examples/benchmarks.py) from the txtai project. This benchmarks script has methods to work with the BEIR dataset.\n", "\n", "We'll select a subset of the BEIR sources for brevity. For each source, we'll benchmark a `bm25` index, an `embeddings` index and a `hybrid` or combined index." 
], "metadata": { "id": "rKCRLFNh39hV" } }, { "cell_type": "code", "source": [ "%%capture\n", "import os\n", "\n", "# Get benchmarks script\n", "os.system(\"wget https://raw.githubusercontent.com/neuml/txtai/master/examples/benchmarks.py\")\n", "\n", "# Create output directory\n", "os.makedirs(\"beir\", exist_ok=True)\n", "\n", "# Download subset of BEIR datasets\n", "datasets = [\"nfcorpus\", \"fiqa\", \"arguana\", \"scidocs\", \"scifact\"]\n", "for dataset in datasets:\n", "    url = f\"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip\"\n", "    os.system(f\"wget {url}\")\n", "    os.system(f\"mv {dataset}.zip beir\")\n", "    os.system(f\"unzip -d beir beir/{dataset}.zip\")\n", "\n", "# Remove existing benchmark data\n", "if os.path.exists(\"benchmarks.json\"):\n", "    os.remove(\"benchmarks.json\")" ], "metadata": { "id": "IGKzkKWB60pg" }, "execution_count": 2, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now let's run the benchmarks." ], "metadata": { "id": "SEH7Og8LiWRd" } }, { "cell_type": "code", "source": [ "# Remove existing benchmark data\n", "if os.path.exists(\"benchmarks.json\"):\n", "    os.remove(\"benchmarks.json\")\n", "\n", "# Runs benchmark evaluation\n", "def evaluate(method):\n", "    for dataset in datasets:\n", "        command = f\"python benchmarks.py beir {dataset} {method}\"\n", "        print(command)\n", "        os.system(command)\n", "\n", "# Calculate benchmarks\n", "for method in [\"bm25\", \"embed\", \"hybrid\"]:\n", "    evaluate(method)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Hfpok07_5N1m", "outputId": "0c0c0906-5192-4b9e-b23a-d0f455106a70" }, "execution_count": 3, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "python benchmarks.py beir nfcorpus bm25\n", "python benchmarks.py beir fiqa bm25\n", "python benchmarks.py beir arguana bm25\n", "python benchmarks.py beir scidocs bm25\n", "python benchmarks.py beir scifact bm25\n", "python benchmarks.py beir nfcorpus embed\n", 
"python benchmarks.py beir fiqa embed\n", "python benchmarks.py beir arguana embed\n", "python benchmarks.py beir scidocs embed\n", "python benchmarks.py beir scifact embed\n", "python benchmarks.py beir nfcorpus hybrid\n", "python benchmarks.py beir fiqa hybrid\n", "python benchmarks.py beir arguana hybrid\n", "python benchmarks.py beir scidocs hybrid\n", "python benchmarks.py beir scifact hybrid\n" ] } ] }, { "cell_type": "code", "source": [ "import json\n", "import pandas as pd\n", "\n", "def benchmarks():\n", "    # Read JSON lines data\n", "    with open(\"benchmarks.json\") as f:\n", "        data = f.read()\n", "\n", "    # Sort by source, then by descending NDCG@10\n", "    df = pd.read_json(data, lines=True).sort_values(by=[\"source\", \"ndcg_cut_10\"], ascending=[True, False])\n", "    return df[[\"source\", \"method\", \"ndcg_cut_10\", \"map_cut_10\", \"recall_10\", \"P_10\", \"index\", \"search\", \"memory\"]].reset_index(drop=True)\n", "\n", "# Load benchmarks dataframe\n", "df = benchmarks()" ], "metadata": { "id": "cpmNpwag73DW" }, "execution_count": 4, "outputs": [] }, { "cell_type": "code", "source": [ "df[df.source == \"nfcorpus\"].reset_index(drop=True)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "DjApptqajCi_", "outputId": "6d444b6e-a487-4661-dc73-bd42eb03b165" }, "execution_count": 9, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " source method ndcg_cut_10 map_cut_10 recall_10 P_10 index \\\n", "0 nfcorpus hybrid 0.34531 0.13369 0.17437 0.25480 29.46 \n", "1 nfcorpus embed 0.30917 0.10810 0.15327 0.23591 33.64 \n", "2 nfcorpus bm25 0.30639 0.11728 0.14891 0.21734 2.72 \n", "\n", " search memory \n", "0 3.57 2900 \n", "1 3.33 2876 \n", "2 0.96 652 " ], "text/html": [ "\n", " <div id=\"df-f75c29d9-da14-44a5-98cd-ffab53966d9d\" class=\"colab-df-container\">\n", " <div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", 
"\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>source</th>\n", " <th>method</th>\n", " <th>ndcg_cut_10</th>\n", " <th>map_cut_10</th>\n", " <th>recall_10</th>\n", " <th>P_10</th>\n", " <th>index</th>\n", " <th>search</th>\n", " <th>memory</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>nfcorpus</td>\n", " <td>hybrid</td>\n", " <td>0.34531</td>\n", " <td>0.13369</td>\n", " <td>0.17437</td>\n", " <td>0.25480</td>\n", " <td>29.46</td>\n", " <td>3.57</td>\n", " <td>2900</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>nfcorpus</td>\n", " <td>embed</td>\n", " <td>0.30917</td>\n", " <td>0.10810</td>\n", " <td>0.15327</td>\n", " <td>0.23591</td>\n", " <td>33.64</td>\n", " <td>3.33</td>\n", " <td>2876</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>nfcorpus</td>\n", " <td>bm25</td>\n", " <td>0.30639</td>\n", " <td>0.11728</td>\n", " <td>0.14891</td>\n", " <td>0.21734</td>\n", " <td>2.72</td>\n", " <td>0.96</td>\n", " <td>652</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>\n", " <div class=\"colab-df-buttons\">\n", "\n", " <div class=\"colab-df-container\">\n", " <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f75c29d9-da14-44a5-98cd-ffab53966d9d')\"\n", " title=\"Convert this dataframe to an interactive table.\"\n", " style=\"display:none;\">\n", "\n", " <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n", " <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n", " </svg>\n", " </button>\n", "\n", " <style>\n", " .colab-df-container {\n", " display:flex;\n", " gap: 12px;\n", " }\n", "\n", " .colab-df-convert {\n", " background-color: 
#E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " width: 32px;\n", " }\n", "\n", " .colab-df-convert:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " .colab-df-buttons div {\n", " margin-bottom: 4px;\n", " }\n", "\n", " [theme=dark] .colab-df-convert {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-convert:hover {\n", " background-color: #434B5C;\n", " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", " </style>\n", "\n", " <script>\n", " const buttonEl =\n", " document.querySelector('#df-f75c29d9-da14-44a5-98cd-ffab53966d9d button.colab-df-convert');\n", " buttonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", "\n", " async function convertToInteractive(key) {\n", " const element = document.querySelector('#df-f75c29d9-da14-44a5-98cd-ffab53966d9d');\n", " const dataTable =\n", " await google.colab.kernel.invokeFunction('convertToInteractive',\n", " [key], {});\n", " if (!dataTable) return;\n", "\n", " const docLinkHtml = 'Like what you see? 
Visit the ' +\n", " '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n", " + ' to learn more about interactive tables.';\n", " element.innerHTML = '';\n", " dataTable['output_type'] = 'display_data';\n", " await google.colab.output.renderOutput(dataTable, element);\n", " const docLink = document.createElement('div');\n", " docLink.innerHTML = docLinkHtml;\n", " element.appendChild(docLink);\n", " }\n", " </script>\n", " </div>\n", "\n", "\n", "<div id=\"df-96b691af-6be6-416f-b614-605a1c4eba2f\">\n", " <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-96b691af-6be6-416f-b614-605a1c4eba2f')\"\n", " title=\"Suggest charts.\"\n", " style=\"display:none;\">\n", "\n", "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", " width=\"24px\">\n", " <g>\n", " <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n", " </g>\n", "</svg>\n", " </button>\n", "\n", "<style>\n", " .colab-df-quickchart {\n", " background-color: #E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " width: 32px;\n", " }\n", "\n", " .colab-df-quickchart:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " [theme=dark] .colab-df-quickchart {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-quickchart:hover {\n", " background-color: #434B5C;\n", " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", "</style>\n", "\n", " <script>\n", " async function quickchart(key) {\n", " const charts = await google.colab.kernel.invokeFunction(\n", " 'suggestCharts', 
[key], {});\n", " }\n", " (() => {\n", " let quickchartButtonEl =\n", " document.querySelector('#df-96b691af-6be6-416f-b614-605a1c4eba2f button');\n", " quickchartButtonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", " })();\n", " </script>\n", "</div>\n", " </div>\n", " </div>\n" ] }, "metadata": {}, "execution_count": 9 } ] }, { "cell_type": "code", "source": [ "df[df.source == \"fiqa\"].reset_index(drop=True)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "bSx6dXhLM66g", "outputId": "5cf284c0-8bf3-4167-8cca-40c41370fea7" }, "execution_count": 8, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " source method ndcg_cut_10 map_cut_10 recall_10 P_10 index search \\\n", "0 fiqa hybrid 0.36642 0.28846 0.43799 0.10340 233.90 68.42 \n", "1 fiqa embed 0.36071 0.28450 0.43188 0.10216 212.30 58.83 \n", "2 fiqa bm25 0.23559 0.17619 0.29855 0.06559 19.78 12.84 \n", "\n", " memory \n", "0 3073 \n", "1 2924 \n", "2 761 " ], "text/html": [ "\n", " <div id=\"df-2af7a1e0-88b6-4545-a46f-853c28e6ccb0\" class=\"colab-df-container\">\n", " <div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>source</th>\n", " <th>method</th>\n", " <th>ndcg_cut_10</th>\n", " <th>map_cut_10</th>\n", " <th>recall_10</th>\n", " <th>P_10</th>\n", " <th>index</th>\n", " <th>search</th>\n", " <th>memory</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>fiqa</td>\n", " <td>hybrid</td>\n", " <td>0.36642</td>\n", " <td>0.28846</td>\n", " <td>0.43799</td>\n", " <td>0.10340</td>\n", " <td>233.90</td>\n", " <td>68.42</td>\n", " <td>3073</td>\n", " 
</tr>\n", " <tr>\n", " <th>1</th>\n", " <td>fiqa</td>\n", " <td>embed</td>\n", " <td>0.36071</td>\n", " <td>0.28450</td>\n", " <td>0.43188</td>\n", " <td>0.10216</td>\n", " <td>212.30</td>\n", " <td>58.83</td>\n", " <td>2924</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>fiqa</td>\n", " <td>bm25</td>\n", " <td>0.23559</td>\n", " <td>0.17619</td>\n", " <td>0.29855</td>\n", " <td>0.06559</td>\n", " <td>19.78</td>\n", " <td>12.84</td>\n", " <td>761</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>\n", " <div class=\"colab-df-buttons\">\n", "\n", " <div class=\"colab-df-container\">\n", " <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2af7a1e0-88b6-4545-a46f-853c28e6ccb0')\"\n", " title=\"Convert this dataframe to an interactive table.\"\n", " style=\"display:none;\">\n", "\n", " <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n", " <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n", " </svg>\n", " </button>\n", "\n", " <style>\n", " .colab-df-container {\n", " display:flex;\n", " gap: 12px;\n", " }\n", "\n", " .colab-df-convert {\n", " background-color: #E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " width: 32px;\n", " }\n", "\n", " .colab-df-convert:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " .colab-df-buttons div {\n", " margin-bottom: 4px;\n", " }\n", "\n", " [theme=dark] .colab-df-convert {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-convert:hover {\n", " background-color: #434B5C;\n", " box-shadow: 0px 1px 3px 1px rgba(0, 
0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", " </style>\n", "\n", " <script>\n", " const buttonEl =\n", " document.querySelector('#df-2af7a1e0-88b6-4545-a46f-853c28e6ccb0 button.colab-df-convert');\n", " buttonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", "\n", " async function convertToInteractive(key) {\n", " const element = document.querySelector('#df-2af7a1e0-88b6-4545-a46f-853c28e6ccb0');\n", " const dataTable =\n", " await google.colab.kernel.invokeFunction('convertToInteractive',\n", " [key], {});\n", " if (!dataTable) return;\n", "\n", " const docLinkHtml = 'Like what you see? Visit the ' +\n", " '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n", " + ' to learn more about interactive tables.';\n", " element.innerHTML = '';\n", " dataTable['output_type'] = 'display_data';\n", " await google.colab.output.renderOutput(dataTable, element);\n", " const docLink = document.createElement('div');\n", " docLink.innerHTML = docLinkHtml;\n", " element.appendChild(docLink);\n", " }\n", " </script>\n", " </div>\n", "\n", "\n", "<div id=\"df-5679fcfd-d40f-46e1-806c-43138d91ed8f\">\n", " <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-5679fcfd-d40f-46e1-806c-43138d91ed8f')\"\n", " title=\"Suggest charts.\"\n", " style=\"display:none;\">\n", "\n", "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", " width=\"24px\">\n", " <g>\n", " <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n", " </g>\n", "</svg>\n", " </button>\n", "\n", "<style>\n", " .colab-df-quickchart {\n", " background-color: #E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " width: 32px;\n", " }\n", 
"\n", " .colab-df-quickchart:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " [theme=dark] .colab-df-quickchart {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-quickchart:hover {\n", " background-color: #434B5C;\n", " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", "</style>\n", "\n", " <script>\n", " async function quickchart(key) {\n", " const charts = await google.colab.kernel.invokeFunction(\n", " 'suggestCharts', [key], {});\n", " }\n", " (() => {\n", " let quickchartButtonEl =\n", " document.querySelector('#df-5679fcfd-d40f-46e1-806c-43138d91ed8f button');\n", " quickchartButtonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", " })();\n", " </script>\n", "</div>\n", " </div>\n", " </div>\n" ] }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "code", "source": [ "df[df.source == \"arguana\"].reset_index(drop=True)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "7LxFmbAEjKmP", "outputId": "4b02d802-0dc9-4411-dc3b-1792de67cf1a" }, "execution_count": 10, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " source method ndcg_cut_10 map_cut_10 recall_10 P_10 index \\\n", "0 arguana hybrid 0.48467 0.40101 0.75320 0.07532 37.80 \n", "1 arguana embed 0.47781 0.38781 0.76671 0.07667 34.11 \n", "2 arguana bm25 0.45713 0.37118 0.73471 0.07347 3.39 \n", "\n", " search memory \n", "0 21.22 2924 \n", "1 10.21 2910 \n", "2 10.95 663 " ], "text/html": [ "\n", " <div id=\"df-41957eb2-72ed-4ec3-ab02-8a414f55db1c\" class=\"colab-df-container\">\n", " <div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " 
vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>source</th>\n", " <th>method</th>\n", " <th>ndcg_cut_10</th>\n", " <th>map_cut_10</th>\n", " <th>recall_10</th>\n", " <th>P_10</th>\n", " <th>index</th>\n", " <th>search</th>\n", " <th>memory</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>arguana</td>\n", " <td>hybrid</td>\n", " <td>0.48467</td>\n", " <td>0.40101</td>\n", " <td>0.75320</td>\n", " <td>0.07532</td>\n", " <td>37.80</td>\n", " <td>21.22</td>\n", " <td>2924</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>arguana</td>\n", " <td>embed</td>\n", " <td>0.47781</td>\n", " <td>0.38781</td>\n", " <td>0.76671</td>\n", " <td>0.07667</td>\n", " <td>34.11</td>\n", " <td>10.21</td>\n", " <td>2910</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>arguana</td>\n", " <td>bm25</td>\n", " <td>0.45713</td>\n", " <td>0.37118</td>\n", " <td>0.73471</td>\n", " <td>0.07347</td>\n", " <td>3.39</td>\n", " <td>10.95</td>\n", " <td>663</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>\n", " <div class=\"colab-df-buttons\">\n", "\n", " <div class=\"colab-df-container\">\n", " <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-41957eb2-72ed-4ec3-ab02-8a414f55db1c')\"\n", " title=\"Convert this dataframe to an interactive table.\"\n", " style=\"display:none;\">\n", "\n", " <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n", " <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n", " </svg>\n", " </button>\n", "\n", " <style>\n", " .colab-df-container {\n", " display:flex;\n", " gap: 12px;\n", " }\n", "\n", " .colab-df-convert 
{\n", " background-color: #E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " width: 32px;\n", " }\n", "\n", " .colab-df-convert:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " .colab-df-buttons div {\n", " margin-bottom: 4px;\n", " }\n", "\n", " [theme=dark] .colab-df-convert {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-convert:hover {\n", " background-color: #434B5C;\n", " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", " </style>\n", "\n", " <script>\n", " const buttonEl =\n", " document.querySelector('#df-41957eb2-72ed-4ec3-ab02-8a414f55db1c button.colab-df-convert');\n", " buttonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", "\n", " async function convertToInteractive(key) {\n", " const element = document.querySelector('#df-41957eb2-72ed-4ec3-ab02-8a414f55db1c');\n", " const dataTable =\n", " await google.colab.kernel.invokeFunction('convertToInteractive',\n", " [key], {});\n", " if (!dataTable) return;\n", "\n", " const docLinkHtml = 'Like what you see? 
Visit the ' +\n", " '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n", " + ' to learn more about interactive tables.';\n", " element.innerHTML = '';\n", " dataTable['output_type'] = 'display_data';\n", " await google.colab.output.renderOutput(dataTable, element);\n", " const docLink = document.createElement('div');\n", " docLink.innerHTML = docLinkHtml;\n", " element.appendChild(docLink);\n", " }\n", " </script>\n", " </div>\n", "\n", "\n", "<div id=\"df-b1b1edea-3e7f-4bc5-9b7d-da2bc7cd2ec3\">\n", " <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-b1b1edea-3e7f-4bc5-9b7d-da2bc7cd2ec3')\"\n", " title=\"Suggest charts.\"\n", " style=\"display:none;\">\n", "\n", "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", " width=\"24px\">\n", " <g>\n", " <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n", " </g>\n", "</svg>\n", " </button>\n", "\n", "<style>\n", " .colab-df-quickchart {\n", " background-color: #E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " width: 32px;\n", " }\n", "\n", " .colab-df-quickchart:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " [theme=dark] .colab-df-quickchart {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-quickchart:hover {\n", " background-color: #434B5C;\n", " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", "</style>\n", "\n", " <script>\n", " async function quickchart(key) {\n", " const charts = await google.colab.kernel.invokeFunction(\n", " 'suggestCharts', 
[key], {});\n", " }\n", " (() => {\n", " let quickchartButtonEl =\n", " document.querySelector('#df-b1b1edea-3e7f-4bc5-9b7d-da2bc7cd2ec3 button');\n", " quickchartButtonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", " })();\n", " </script>\n", "</div>\n", " </div>\n", " </div>\n" ] }, "metadata": {}, "execution_count": 10 } ] }, { "cell_type": "code", "source": [ "df[df.source == \"scidocs\"].reset_index(drop=True)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "ln7p-b9XNPmO", "outputId": "3ae0445f-6fda-4a51-b229-b3e85273bb78" }, "execution_count": 6, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " source method ndcg_cut_10 map_cut_10 recall_10 P_10 index \\\n", "0 scidocs embed 0.21718 0.12982 0.23217 0.1146 127.63 \n", "1 scidocs hybrid 0.21104 0.12450 0.22938 0.1134 138.00 \n", "2 scidocs bm25 0.15063 0.08756 0.15637 0.0772 13.07 \n", "\n", " search memory \n", "0 4.41 2929 \n", "1 6.43 2999 \n", "2 1.42 722 " ], "text/html": [ "\n", " <div id=\"df-49647868-b8b6-4cad-869e-2d58db32b996\" class=\"colab-df-container\">\n", " <div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>source</th>\n", " <th>method</th>\n", " <th>ndcg_cut_10</th>\n", " <th>map_cut_10</th>\n", " <th>recall_10</th>\n", " <th>P_10</th>\n", " <th>index</th>\n", " <th>search</th>\n", " <th>memory</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>scidocs</td>\n", " <td>embed</td>\n", " <td>0.21718</td>\n", " <td>0.12982</td>\n", " <td>0.23217</td>\n", " <td>0.1146</td>\n", " <td>127.63</td>\n", " <td>4.41</td>\n", " 
<td>2929</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>scidocs</td>\n", " <td>hybrid</td>\n", " <td>0.21104</td>\n", " <td>0.12450</td>\n", " <td>0.22938</td>\n", " <td>0.1134</td>\n", " <td>138.00</td>\n", " <td>6.43</td>\n", " <td>2999</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>scidocs</td>\n", " <td>bm25</td>\n", " <td>0.15063</td>\n", " <td>0.08756</td>\n", " <td>0.15637</td>\n", " <td>0.0772</td>\n", " <td>13.07</td>\n", " <td>1.42</td>\n", " <td>722</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>\n", " <div class=\"colab-df-buttons\">\n", "\n", " <div class=\"colab-df-container\">\n", " <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-49647868-b8b6-4cad-869e-2d58db32b996')\"\n", " title=\"Convert this dataframe to an interactive table.\"\n", " style=\"display:none;\">\n", "\n", " <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n", " <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n", " </svg>\n", " </button>\n", "\n", " <style>\n", " .colab-df-container {\n", " display:flex;\n", " gap: 12px;\n", " }\n", "\n", " .colab-df-convert {\n", " background-color: #E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " width: 32px;\n", " }\n", "\n", " .colab-df-convert:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " .colab-df-buttons div {\n", " margin-bottom: 4px;\n", " }\n", "\n", " [theme=dark] .colab-df-convert {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-convert:hover {\n", " background-color: #434B5C;\n", " box-shadow: 
0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", " </style>\n", "\n", " <script>\n", " const buttonEl =\n", " document.querySelector('#df-49647868-b8b6-4cad-869e-2d58db32b996 button.colab-df-convert');\n", " buttonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", "\n", " async function convertToInteractive(key) {\n", " const element = document.querySelector('#df-49647868-b8b6-4cad-869e-2d58db32b996');\n", " const dataTable =\n", " await google.colab.kernel.invokeFunction('convertToInteractive',\n", " [key], {});\n", " if (!dataTable) return;\n", "\n", " const docLinkHtml = 'Like what you see? Visit the ' +\n", " '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n", " + ' to learn more about interactive tables.';\n", " element.innerHTML = '';\n", " dataTable['output_type'] = 'display_data';\n", " await google.colab.output.renderOutput(dataTable, element);\n", " const docLink = document.createElement('div');\n", " docLink.innerHTML = docLinkHtml;\n", " element.appendChild(docLink);\n", " }\n", " </script>\n", " </div>\n", "\n", "\n", "<div id=\"df-aa4b2fc1-4793-43cd-b93f-2ab8c5f7df89\">\n", " <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-aa4b2fc1-4793-43cd-b93f-2ab8c5f7df89')\"\n", " title=\"Suggest charts.\"\n", " style=\"display:none;\">\n", "\n", "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", " width=\"24px\">\n", " <g>\n", " <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n", " </g>\n", "</svg>\n", " </button>\n", "\n", "<style>\n", " .colab-df-quickchart {\n", " background-color: #E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " 
width: 32px;\n", " }\n", "\n", " .colab-df-quickchart:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " [theme=dark] .colab-df-quickchart {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-quickchart:hover {\n", " background-color: #434B5C;\n", " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", "</style>\n", "\n", " <script>\n", " async function quickchart(key) {\n", " const charts = await google.colab.kernel.invokeFunction(\n", " 'suggestCharts', [key], {});\n", " }\n", " (() => {\n", " let quickchartButtonEl =\n", " document.querySelector('#df-aa4b2fc1-4793-43cd-b93f-2ab8c5f7df89 button');\n", " quickchartButtonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", " })();\n", " </script>\n", "</div>\n", " </div>\n", " </div>\n" ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "code", "source": [ "df[df.source == \"scifact\"].reset_index(drop=True)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "CsHEwmV0NTjm", "outputId": "6e97c4e8-318f-42e4-d162-037646cc3ed3" }, "execution_count": 7, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " source method ndcg_cut_10 map_cut_10 recall_10 P_10 index \\\n", "0 scifact hybrid 0.71305 0.66773 0.83722 0.09367 39.51 \n", "1 scifact bm25 0.66324 0.61764 0.78761 0.08700 4.40 \n", "2 scifact embed 0.65149 0.60193 0.78972 0.08867 35.15 \n", "\n", " search memory \n", "0 2.35 2918 \n", "1 0.93 658 \n", "2 1.48 2889 " ], "text/html": [ "\n", " <div id=\"df-3f630f69-e077-4aa6-b9ab-b8151f936095\" class=\"colab-df-container\">\n", " <div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe 
tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>source</th>\n", " <th>method</th>\n", " <th>ndcg_cut_10</th>\n", " <th>map_cut_10</th>\n", " <th>recall_10</th>\n", " <th>P_10</th>\n", " <th>index</th>\n", " <th>search</th>\n", " <th>memory</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>scifact</td>\n", " <td>hybrid</td>\n", " <td>0.71305</td>\n", " <td>0.66773</td>\n", " <td>0.83722</td>\n", " <td>0.09367</td>\n", " <td>39.51</td>\n", " <td>2.35</td>\n", " <td>2918</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>scifact</td>\n", " <td>bm25</td>\n", " <td>0.66324</td>\n", " <td>0.61764</td>\n", " <td>0.78761</td>\n", " <td>0.08700</td>\n", " <td>4.40</td>\n", " <td>0.93</td>\n", " <td>658</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>scifact</td>\n", " <td>embed</td>\n", " <td>0.65149</td>\n", " <td>0.60193</td>\n", " <td>0.78972</td>\n", " <td>0.08867</td>\n", " <td>35.15</td>\n", " <td>1.48</td>\n", " <td>2889</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>\n", " <div class=\"colab-df-buttons\">\n", "\n", " <div class=\"colab-df-container\">\n", " <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-3f630f69-e077-4aa6-b9ab-b8151f936095')\"\n", " title=\"Convert this dataframe to an interactive table.\"\n", " style=\"display:none;\">\n", "\n", " <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n", " <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n", " </svg>\n", " </button>\n", "\n", " <style>\n", " .colab-df-container {\n", " display:flex;\n", " gap: 12px;\n", " }\n", "\n", " 
.colab-df-convert {\n", " background-color: #E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " width: 32px;\n", " }\n", "\n", " .colab-df-convert:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " .colab-df-buttons div {\n", " margin-bottom: 4px;\n", " }\n", "\n", " [theme=dark] .colab-df-convert {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-convert:hover {\n", " background-color: #434B5C;\n", " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", " </style>\n", "\n", " <script>\n", " const buttonEl =\n", " document.querySelector('#df-3f630f69-e077-4aa6-b9ab-b8151f936095 button.colab-df-convert');\n", " buttonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", "\n", " async function convertToInteractive(key) {\n", " const element = document.querySelector('#df-3f630f69-e077-4aa6-b9ab-b8151f936095');\n", " const dataTable =\n", " await google.colab.kernel.invokeFunction('convertToInteractive',\n", " [key], {});\n", " if (!dataTable) return;\n", "\n", " const docLinkHtml = 'Like what you see? 
Visit the ' +\n", " '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n", " + ' to learn more about interactive tables.';\n", " element.innerHTML = '';\n", " dataTable['output_type'] = 'display_data';\n", " await google.colab.output.renderOutput(dataTable, element);\n", " const docLink = document.createElement('div');\n", " docLink.innerHTML = docLinkHtml;\n", " element.appendChild(docLink);\n", " }\n", " </script>\n", " </div>\n", "\n", "\n", "<div id=\"df-46cb31e4-5978-45c3-afae-a71bbc8f46b3\">\n", " <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-46cb31e4-5978-45c3-afae-a71bbc8f46b3')\"\n", " title=\"Suggest charts.\"\n", " style=\"display:none;\">\n", "\n", "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", " width=\"24px\">\n", " <g>\n", " <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n", " </g>\n", "</svg>\n", " </button>\n", "\n", "<style>\n", " .colab-df-quickchart {\n", " background-color: #E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " width: 32px;\n", " }\n", "\n", " .colab-df-quickchart:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " [theme=dark] .colab-df-quickchart {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-quickchart:hover {\n", " background-color: #434B5C;\n", " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", "</style>\n", "\n", " <script>\n", " async function quickchart(key) {\n", " const charts = await google.colab.kernel.invokeFunction(\n", " 'suggestCharts', 
[key], {});\n", " }\n", " (() => {\n", " let quickchartButtonEl =\n", " document.querySelector('#df-46cb31e4-5978-45c3-afae-a71bbc8f46b3 button');\n", " quickchartButtonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", " })();\n", " </script>\n", "</div>\n", " </div>\n", " </div>\n" ] }, "metadata": {}, "execution_count": 7 } ] }, { "cell_type": "markdown", "source": [ "The sections above show the metrics per source and method.\n", "\n", "The table headers list the `source (dataset)`, `index method`, `NDCG@10`/`MAP@10`/`RECALL@10`/`P@10` accuracy metrics, `index time(s)`, `search time(s)` and `memory usage(MB)`. The tables are sorted by `NDCG@10` descending.\n", "\n", "Looking at the results, we can see that `hybrid` search often performs better than `embeddings` or `bm25` individually. In some cases, as with scidocs, the combination scores below the best individual method. But in the aggregate, the scores are better. This holds true for the entire BEIR dataset. For some sources `bm25` does best, for others `embeddings`, but overall the combined `hybrid` scores do the best.\n", "\n", "Hybrid search isn't free though. It is slower, as it has extra logic to combine the results, and it takes somewhat longer to index. For individual queries, the difference in search time is often negligible." ], "metadata": { "id": "tU1eFDZUh0NQ" } }, { "cell_type": "markdown", "source": [ "# Wrapping up\n", "\n", "This notebook covered ways to improve search accuracy using a hybrid approach. We evaluated performance over a subset of the BEIR dataset to show how hybrid search, in many situations, can improve overall accuracy.\n", "\n", "Custom datasets can also be evaluated using this method, as [specified in this link](https://github.com/beir-cellar/beir/wiki/Load-your-custom-dataset). This notebook and the associated benchmarks script can be reused to evaluate which method works best on your data.\n" ], "metadata": { "id": "f41NSYWc0dsy" } } ] }
