2-Creating-Vector-DB.ipynbā¢26.4 kB
{
"cells": [
{
"cell_type": "markdown",
"id": "1985e7fa-3c10-4522-a8b4-277dbf70149d",
"metadata": {},
"source": [
"# Creating a Vector DB\n",
"\n",
"Before creating our vector search, we need to create a database of vectors. Notebook will show you how to do this:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b82622bd-3964-4b05-bee4-1e5919e32300",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:root:IRISINSTALLDIR or ISC_PACKAGE_INSTALLDIR environment variable must be set\n",
"WARNING:root:Embedded Python not available\n",
"WARNING:root:Error importing pythonint: No module named 'pythonint'\n",
"WARNING:root:Embedded Python not available\n"
]
}
],
"source": [
"## Create iris cursor\n",
"## I've hidden the code for this in the file Utils.get_iris_connection - its very simple and shown in the previous tutorial\n",
"from Utils.get_iris_connection import get_cursor\n",
"cursor = get_cursor()"
]
},
{
"cell_type": "markdown",
"id": "9e1fa495-5730-4128-bc7b-300ccc6b1e80",
"metadata": {},
"source": [
"### Download data and load into pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8775802b-c36b-4700-bf3b-d8a51ffe3080",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ClinicalNotes</th>\n",
" <th>Patient</th>\n",
" <th>PatientID</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>4f7469746973204d65646961204576616c756174696f6e...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>446174653a20323032352d30382d30360a50726f766964...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>466f6c6c6f772d557020666f72204f7469746973204d65...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>446174653a20323032342d31312d32310a50726f766964...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>446174653a20323032342d30382d30360a50726f766964...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ClinicalNotes Patient PatientID\n",
"0 4f7469746973204d65646961204576616c756174696f6e... Patient/3 3\n",
"1 446174653a20323032352d30382d30360a50726f766964... Patient/3 3\n",
"2 466f6c6c6f772d557020666f72204f7469746973204d65... Patient/3 3\n",
"3 446174653a20323032342d31312d32310a50726f766964... Patient/3 3\n",
"4 446174653a20323032342d30382d30360a50726f766964... Patient/3 3"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql = \"\"\"SELECT \n",
"DocumentReferenceContentAttachmentData, DocumentReferenceSubjectReference\n",
"FROM VectorSearchApp.DocumentReference\"\"\"\n",
"\n",
"cursor.execute(sql)\n",
"out = cursor.fetchall()\n",
"\n",
"import pandas as pd\n",
"## The result_set doesn't include column names so we will add them ourselvs\n",
"cols = [\"ClinicalNotes\", \"Patient\"] \n",
"\n",
"df = pd.DataFrame(out, columns=cols)\n",
"df[\"PatientID\"] = pd.to_numeric(df[\"Patient\"].astype(str).str.strip(\"Patient/\"))\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "b0b61413-8c29-44cd-9c4a-d38b3f0def55",
"metadata": {},
"source": [
"### Decode the clinical notes into plain text"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e395d29a-7c3e-48a2-ac6a-873c9358aa6f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ClinicalNotes</th>\n",
" <th>Patient</th>\n",
" <th>PatientID</th>\n",
" <th>NotesDecoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>4f7469746973204d65646961204576616c756174696f6e...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" <td>Otitis Media Evaluation\\nDate: 2024-01-25\\nPro...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>446174653a20323032352d30382d30360a50726f766964...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" <td>Date: 2025-08-06\\nProvider: Dr. Jason347 Frami...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>466f6c6c6f772d557020666f72204f7469746973204d65...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" <td>Follow-Up for Otitis Media\\nDate: 2024-05-07\\n...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>446174653a20323032342d31312d32310a50726f766964...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" <td>Date: 2024-11-21\\nProvider: Dr. Chin306 Kulas5...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>446174653a20323032342d30382d30360a50726f766964...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" <td>Date: 2024-08-06\\nProvider: Dr. Jason347 Frami...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ClinicalNotes Patient PatientID \\\n",
"0 4f7469746973204d65646961204576616c756174696f6e... Patient/3 3 \n",
"1 446174653a20323032352d30382d30360a50726f766964... Patient/3 3 \n",
"2 466f6c6c6f772d557020666f72204f7469746973204d65... Patient/3 3 \n",
"3 446174653a20323032342d31312d32310a50726f766964... Patient/3 3 \n",
"4 446174653a20323032342d30382d30360a50726f766964... Patient/3 3 \n",
"\n",
" NotesDecoded \n",
"0 Otitis Media Evaluation\\nDate: 2024-01-25\\nPro... \n",
"1 Date: 2025-08-06\\nProvider: Dr. Jason347 Frami... \n",
"2 Follow-Up for Otitis Media\\nDate: 2024-05-07\\n... \n",
"3 Date: 2024-11-21\\nProvider: Dr. Chin306 Kulas5... \n",
"4 Date: 2024-08-06\\nProvider: Dr. Jason347 Frami... "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import base64\n",
"\n",
"## here I am using a python lambda function (single-line function) to take a row of data \n",
"## and decode the ClinicalNotes column, and return it to a new column called NotesDecoded \n",
"df[\"NotesDecoded\"] = df[\"ClinicalNotes\"].apply(lambda x: bytes.fromhex(x).decode(\"utf-8\", errors=\"replace\"))\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "607bf543-b73b-45a0-916b-c738437b6bf5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Otitis Media Evaluation\n",
"Date: 2024-01-25\n",
"Provider: Dr. Lemuel304 Stokes453\n",
"Location: Beth Israel Deaconess Hospital ā Needham\n",
"Reason for Visit: Ear pain and irritability\n",
"Subjective:\n",
"Aurora presented with symptoms of ear discomfort, mild fever, and increased fussiness. Parent reports onset 2 days ago. No vomiting or diarrhea. No prior history of ear infections.\n",
"Objective:\n",
"\n",
"Vitals: Temp 37.8°C, BP 107/80 mmHg\n",
"Physical Exam:\n",
"\n",
"Tympanic membrane: Erythematous and bulging on the right side\n",
"No discharge noted\n",
"Mild tenderness on palpation of the mastoid process\n",
"Lungs clear, no respiratory distress\n",
"\n",
"\n",
"\n",
"Assessment:\n",
"\n",
"Acute Otitis Media ā Right ear\n",
"\n",
"Plan:\n",
"\n",
"Prescribed Amoxicillin 250 mg oral capsule, 1 capsule twice daily for 7 days\n",
"Supportive care: fluids, rest, acetaminophen for fever\n",
"Follow-up in 10 days or sooner if symptoms worsen\n",
"Educated parent on signs of complications\n"
]
}
],
"source": [
"## View one example of a note: \n",
"print(df[\"NotesDecoded\"][0])"
]
},
{
"cell_type": "markdown",
"id": "0828c5f2-919b-471c-bc00-298d862a6619",
"metadata": {},
"source": [
"### Encoding clinical notes to vectors\n",
"\n",
"Here we are using a pre-trained model from the hugging-faces `sentance_transformers` libary. I have done this using the model `all-MiniLM-L6-v2`. This is a basic sentence transformer model which outputs vectors with 384 dimensions. There are many models available, some of which will be more relevant to your needs. For example, if you are using a language other than english, you would likely need a multi-lingual model or a model in your specific language. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a01cd909-93b8-4443-9ff1-a8f9758dd48c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu\n",
"INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "82b3489bffc148f8997869cd2ee979ed",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Batches: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"\n",
"# Load a pre-trained sentence transformer model. This model's output vectors are of size 384\n",
"\n",
"model = SentenceTransformer('all-MiniLM-L6-v2') \n",
"\n",
"# Generate embeddings for all descriptions at once. Batch processing makes it faster\n",
"embeddings = model.encode(df['NotesDecoded'].tolist(), normalize_embeddings=True)\n",
"\n",
"# Add the embeddings to the DataFrame\n",
"df['Notes_Vector'] = embeddings.tolist()"
]
},
{
"cell_type": "markdown",
"id": "6d6e84aa-dc0f-4f7f-9fb0-66e2a748cd7c",
"metadata": {},
"source": [
"As we are using health data, an example of a better model for this process would be [pritamdeka/S-PubMedBert-MS-MARCO](https://huggingface.co/pritamdeka/S-PubMedBert-MS-MARCO), which is designed for medical text. This model has double the number of dimensions (768) as the model used above, which will slow the time taken to create the vectors. For this reason, I'm going to stick with the smaller model for now. \n",
"\n",
"If you do change the model used, remember to change the number of dimensions given when defining the SQL table below. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "8017460f-f968-43c4-abb5-6e7f23f3d9e4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ClinicalNotes</th>\n",
" <th>Patient</th>\n",
" <th>PatientID</th>\n",
" <th>NotesDecoded</th>\n",
" <th>Notes_Vector</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>4f7469746973204d65646961204576616c756174696f6e...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" <td>Otitis Media Evaluation\\nDate: 2024-01-25\\nPro...</td>\n",
" <td>[0.037052784115076065, -0.01652533747255802, 0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>446174653a20323032352d30382d30360a50726f766964...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" <td>Date: 2025-08-06\\nProvider: Dr. Jason347 Frami...</td>\n",
" <td>[-0.03362147510051727, 0.05015568435192108, 0....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>466f6c6c6f772d557020666f72204f7469746973204d65...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" <td>Follow-Up for Otitis Media\\nDate: 2024-05-07\\n...</td>\n",
" <td>[-0.0073646255768835545, -0.02232077717781067,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>446174653a20323032342d31312d32310a50726f766964...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" <td>Date: 2024-11-21\\nProvider: Dr. Chin306 Kulas5...</td>\n",
" <td>[-0.02497003600001335, 0.051254600286483765, 0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>446174653a20323032342d30382d30360a50726f766964...</td>\n",
" <td>Patient/3</td>\n",
" <td>3</td>\n",
" <td>Date: 2024-08-06\\nProvider: Dr. Jason347 Frami...</td>\n",
" <td>[-0.006600924767553806, 0.04176628217101097, 0...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ClinicalNotes Patient PatientID \\\n",
"0 4f7469746973204d65646961204576616c756174696f6e... Patient/3 3 \n",
"1 446174653a20323032352d30382d30360a50726f766964... Patient/3 3 \n",
"2 466f6c6c6f772d557020666f72204f7469746973204d65... Patient/3 3 \n",
"3 446174653a20323032342d31312d32310a50726f766964... Patient/3 3 \n",
"4 446174653a20323032342d30382d30360a50726f766964... Patient/3 3 \n",
"\n",
" NotesDecoded \\\n",
"0 Otitis Media Evaluation\\nDate: 2024-01-25\\nPro... \n",
"1 Date: 2025-08-06\\nProvider: Dr. Jason347 Frami... \n",
"2 Follow-Up for Otitis Media\\nDate: 2024-05-07\\n... \n",
"3 Date: 2024-11-21\\nProvider: Dr. Chin306 Kulas5... \n",
"4 Date: 2024-08-06\\nProvider: Dr. Jason347 Frami... \n",
"\n",
" Notes_Vector \n",
"0 [0.037052784115076065, -0.01652533747255802, 0... \n",
"1 [-0.03362147510051727, 0.05015568435192108, 0.... \n",
"2 [-0.0073646255768835545, -0.02232077717781067,... \n",
"3 [-0.02497003600001335, 0.051254600286483765, 0... \n",
"4 [-0.006600924767553806, 0.04176628217101097, 0... "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## View output\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "623e5d97-d335-4ef6-9a04-aa699c8c9850",
"metadata": {},
"source": [
"### Create new table in IRIS\n",
"\n",
"Here we create a SQL query to create a new table in IRIS.\n",
"\n",
"We have to define the datatypes required for this table. The vectors are loaded in the IRIS-specific data type 'VECTOR'. We also specify the type of each element in the vector (Double - a floating point number) and the dimensionality of the vector (384).\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f4879161-dccb-47ac-a95e-dc55e14200bc",
"metadata": {},
"outputs": [],
"source": [
"table_name = \"VectorSearch.DocRefVectors\"\n",
"\n",
"create_table_query = f\"\"\"\n",
"CREATE TABLE {table_name} (\n",
"PatientID INTEGER,\n",
"ClinicalNotes LONGVARCHAR,\n",
"NotesVector VECTOR(DOUBLE, 384)\n",
")\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4e791be0-d3f6-4649-a66c-390bd816ae21",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cursor.execute(create_table_query)"
]
},
{
"cell_type": "markdown",
"id": "b5d590c2-8163-4838-a8e5-04bf6c92949c",
"metadata": {},
"source": [
"### Adding data\n",
"\n",
"We will now insert our vector dataset into the table we've generated. There are multiple ways of adding data to our table, here I will show two and do a very brief speed comparison.\n",
"\n",
"Firstly we insert each row individually, iterating over each row individually. For this iteration, we are using df.apply() to efficiently perform the same function to each row of the data table. \n",
"\n",
"Secondly, we use `cursor.executemany()` with a single query and a list of parameter lists to execute all the insertions at once. \n",
"\n",
"Importantly, the IRIS-SQL `TO_VECTOR()` function needs the vector to be in string format, so you will see both methods involve converting this data to a string before executing the query.,"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "23e4df0f-3d5a-45cf-8e07-bd777f809925",
"metadata": {},
"outputs": [],
"source": [
"## Create a reusable query string with ? placeholders for the values\n",
"insert_query = f\"INSERT INTO {table_name} ( PatientID, ClinicalNotes, NotesVector) values (?, ?, TO_VECTOR(?))\""
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "12182458-89d4-4954-9d9a-1f9159d7592b",
"metadata": {},
"outputs": [],
"source": [
"## Needed for a speed comparison\n",
"import time"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "77dec808-3c1a-4099-ad64-19899cce9461",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Method 1 took 0.17098617553710938 Seconds\n"
]
}
],
"source": [
"## Method 1: Inserting rows one at time (iterating with df.apply())\n",
"st = time.time()\n",
"\n",
"def addRow(row): ## Create a function to insert each row\n",
" cursor.execute(insert_query, [ row[\"PatientID\"], row[\"NotesDecoded\"], str(row[\"Notes_Vector\"])])\n",
"## Apply the row insertion function to each row in the table (axis=1 specifies that we are iterating over rows, not columns). \n",
"df.apply(addRow, axis=1)\n",
"\n",
"print(f\"Method 1 took {time.time()-st} Seconds\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e252fcaa-e2f0-4a5c-baa9-83aeb22c7384",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Reset the table between the insertion methods\n",
"cursor.execute(f\"Drop TABLE {table_name}\" )\n",
"cursor.execute(create_table_query)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "31392d0f-c15b-44aa-b276-c7663f23cc9f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Method 2 took 0.05177664756774902 Seconds\n"
]
}
],
"source": [
"st = time.time()\n",
"df[\"Notes_Vector_str\"] = df[\"Notes_Vector\"].astype(str)\n",
"rows_list = df[[\"PatientID\", \"NotesDecoded\", \"Notes_Vector_str\"]].values.tolist()\n",
"\n",
"cursor.executemany(insert_query, rows_list)\n",
"print(f\"Method 2 took {time.time()-st} Seconds\")"
]
},
{
"cell_type": "markdown",
"id": "8556fe23-ad81-41c7-8762-627f30a3433e",
"metadata": {},
"source": [
"Both methods are quick because our table is very small. However, the second method is generally quicker. This speed boast might be useful if you are dealing with very large datasets. Either way though, we now have our data table complete with Vectors for use in a vector search! \n",
"\n",
"\n",
"### Querying table\n",
"\n",
"Before moving on to the vector search, lets quickly query the database to check everything looks as expected:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b1fc691e-6625-4f29-82c2-7f9140118fd1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PatientID</th>\n",
" <th>NotesDecoded</th>\n",
" <th>Notes_Vector</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>Otitis Media Evaluation\\nDate: 2024-01-25\\nPro...</td>\n",
" <td>.037052784115076065063,-.016525337472558021546...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>Date: 2025-08-06\\nProvider: Dr. Jason347 Frami...</td>\n",
" <td>-.033621475100517272949,.050155684351921081542...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Follow-Up for Otitis Media\\nDate: 2024-05-07\\n...</td>\n",
" <td>-.0073646255768835544586,-.0223207771778106689...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PatientID NotesDecoded \\\n",
"0 3 Otitis Media Evaluation\\nDate: 2024-01-25\\nPro... \n",
"1 3 Date: 2025-08-06\\nProvider: Dr. Jason347 Frami... \n",
"2 3 Follow-Up for Otitis Media\\nDate: 2024-05-07\\n... \n",
"\n",
" Notes_Vector \n",
"0 .037052784115076065063,-.016525337472558021546... \n",
"1 -.033621475100517272949,.050155684351921081542... \n",
"2 -.0073646255768835544586,-.0223207771778106689... "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_query = f\"Select TOP 3 * FROM {table_name}\"\n",
"cursor.execute(sql_query) \n",
"results = cursor.fetchall()\n",
"results_df = pd.DataFrame(results, columns= [\"PatientID\",\"NotesDecoded\", \"Notes_Vector\"] )\n",
"results_df.head()"
]
},
{
"cell_type": "markdown",
"id": "071d3dee",
"metadata": {},
"source": [
"### Next Steps\n",
"\n",
"Great, we've now got our database of clinical notes converted to Vector format based on semantic meaning, we can now continue by [implementing the vector search and prompting a Large Language model](./3-Vector-Search-LLM-Prompting.ipynb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}