{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!-- SPDX-License-Identifier: CC-BY-NC-SA-4.0 -->\n",
"\n",
"*This notebook is © [Braintrust Cookbook](https://www.braintrust.dev/docs/cookbook/recipes/Text2SQL-Data) and licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
" <br>\n",
" <br>\n",
" <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
" </p>\n",
"</center>\n",
"<h1 align=\"center\">Experiments: Text2SQL</h1>\n",
"\n",
"Building effective text-to-SQL systems requires rigorous evaluation and systematic experimentation. In this tutorial, we'll walk through the complete evaluation-driven development process, starting from scratch without pre-existing datasets of questions or expected responses.\n",
"\n",
"We'll use a movie database containing recent titles, ratings, box office performance, and metadata to demonstrate how to build, evaluate, and systematically improve a text-to-SQL system using Phoenix's experimentation framework. Think of Phoenix as your scientific laboratory, meticulously recording every experiment to help you build better AI systems."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"arize-phoenix>=11.0.0\" openai 'httpx<0.28' duckdb datasets pyarrow \"pydantic>=2.0.0\" openinference-instrumentation-openai --quiet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first start a phoenix server to act as our evaluation dashboard and experiment tracker. This will be our central hub for observing, measuring, and improving our text-to-SQL system.\n",
"\n",
"Note: this step is not necessary if you already have a Phoenix server running."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import phoenix as px\n",
"\n",
"px.launch_app().view()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also setup tracing for OpenAI. Tracing is crucial for evaluation-driven development - it allows Phoenix to observe every step of our text-to-SQL pipeline, capturing inputs, outputs, and metrics like latency and cost that we'll use to systematically improve our system."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" endpoint=\"http://localhost:6006/v1/traces\", auto_instrument=True, verbose=False\n",
") # Instruments all openai calls\n",
"\n",
"tracer = tracer_provider.get_tracer(__name__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Initialize a Phoenix Client. This acts as the main entry point for interacting with the Phoenix API. This package can be installed independently of Phoenix itself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.client import AsyncClient\n",
"\n",
"px_client = AsyncClient()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, let's make sure we have our openai API key set up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"if not os.getenv(\"OPENAI_API_KEY\"):\n",
" os.environ[\"OPENAI_API_KEY\"] = getpass(\"🔑 Enter your OpenAI API key: \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Data\n",
"\n",
"We are going to use a movie dataset that contains recent titles and their ratings. We will use DuckDB as our database so that we can run the queries directly in the notebook, but you can imagine that this could be a pre-existing SQL database with business-specific data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import duckdb\n",
"from datasets import load_dataset\n",
"\n",
"data = load_dataset(\"wykonos/movies\")[\"train\"]\n",
"\n",
"conn = duckdb.connect(database=\":memory:\", read_only=False)\n",
"conn.register(\"movies\", data.to_pandas())\n",
"\n",
"records = conn.query(\"SELECT * FROM movies LIMIT 5\").to_df().to_dict(orient=\"records\")\n",
"\n",
"for record in records:\n",
" print(record)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implement Text2SQL\n",
"\n",
"Let's start by implementing a simple text2sql logic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"\n",
"openai_client = openai.AsyncClient()\n",
"\n",
"columns = conn.query(\"DESCRIBE movies\").to_df().to_dict(orient=\"records\")\n",
"\n",
"# We will use GPT4o to start\n",
"TASK_MODEL = \"gpt-4o\"\n",
"CONFIG = {\"model\": TASK_MODEL}\n",
"\n",
"\n",
"system_prompt = (\n",
" \"You are a SQL expert, and you are given a single table named movies with the following columns:\\n\"\n",
" f\"{','.join(column['column_name'] + ': ' + column['column_type'] for column in columns)}\\n\"\n",
" \"Write a SQL query corresponding to the user's request. Return just the query text, \"\n",
" \"with no formatting (backticks, markdown, etc.).\"\n",
")\n",
"\n",
"\n",
"@tracer.chain\n",
"async def generate_query(input):\n",
" response = await openai_client.chat.completions.create(\n",
" model=TASK_MODEL,\n",
" temperature=0,\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": system_prompt,\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": input,\n",
" },\n",
" ],\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = await generate_query(\"what was the most popular movie?\")\n",
"print(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Awesome, looks like the we are producing SQL! let's try running the query and see if we get the expected results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@tracer.tool\n",
"def execute_query(query):\n",
" return conn.query(query).fetchdf().to_dict(orient=\"records\")\n",
"\n",
"\n",
"execute_query(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Three Pillars of Evaluation\n",
"\n",
"Effective AI evaluation rests on three fundamental pillars:\n",
"\n",
"1. **Data**: Curated examples that represent real-world use cases\n",
"2. **Task**: The actual function or workflow being evaluated\n",
"3. **Evaluators**: Quantitative measures of performance\n",
"\n",
"Let's start by creating our **data** - a set of movie-related questions that we want our text-to-SQL system to handle correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"questions = [\n",
" \"Which Brad Pitt movie received the highest rating?\",\n",
" \"What is the top grossing Marvel movie?\",\n",
" \"What foreign-language fantasy movie was the most popular?\",\n",
" \"what are the best sci-fi movies of 2017?\",\n",
" \"What anime topped the box office in the 2010s?\",\n",
" \"Recommend a romcom that stars Paul Rudd.\",\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's store the data above as a versioned dataset in phoenix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = await px_client.datasets.create_dataset(\n",
" name=\"movie-example-questions\",\n",
" inputs=[{\"question\": question} for question in questions],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll define the task. The task is to generate SQL queries from natural language questions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@tracer.chain\n",
"async def text2sql(question):\n",
" query = await generate_query(question)\n",
" results = None\n",
" error = None\n",
" try:\n",
" results = execute_query(query)\n",
" except duckdb.Error as e:\n",
" error = str(e)\n",
"\n",
" return {\n",
" \"query\": query,\n",
" \"results\": results,\n",
" \"error\": error,\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we'll define the evaluation scores. We'll use the following simple functions to see if the generated SQL queries are correct. Note that `has_results` is a good metric here because we know that all the questions we added to the dataset can be answered via sql."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test if there are no sql execution errors\n",
"\n",
"\n",
"def no_error(output):\n",
" return 1.0 if output.get(\"error\") is None else 0.0\n",
"\n",
"\n",
"# Test if the query has results\n",
"def has_results(output):\n",
" results = output.get(\"results\")\n",
" has_results = results is not None and len(results) > 0\n",
" return 1.0 if has_results else 0.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's run the evaluation experiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from typing import Any\n",
"\n",
"\n",
"async def task(input: Any):\n",
" return await text2sql(input[\"question\"])\n",
"\n",
"\n",
"experiment = await px_client.experiments.run_experiment(\n",
" dataset=ds, task=task, evaluators=[no_error, has_results], experiment_metadata=CONFIG\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def show_evaluation_summary(exp):\n",
" no_error_scores = [\n",
" run.result[\"score\"]\n",
" for run in exp[\"evaluation_runs\"]\n",
" if run.result and \"score\" in run.result and run.name == \"no_error\"\n",
" ]\n",
" has_results_scores = [\n",
" run.result[\"score\"]\n",
" for run in exp[\"evaluation_runs\"]\n",
" if run.result and \"score\" in run.result and run.name == \"has_results\"\n",
" ]\n",
"\n",
" print(\"📊 Evaluation Results:\")\n",
" if no_error_scores:\n",
" avg_no_error = sum(no_error_scores) / len(no_error_scores)\n",
" print(f\" No Error: {avg_no_error:.3f} (n={len(no_error_scores)})\")\n",
"\n",
" if has_results_scores:\n",
" avg_has_results = sum(has_results_scores) / len(has_results_scores)\n",
" print(f\" Has Results: {avg_has_results:.3f} (n={len(has_results_scores)})\")\n",
"\n",
"\n",
"show_evaluation_summary(experiment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! Let's see how our baseline model performed on the movie questions. We can analyze both successful queries and any failures to understand where improvements are needed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Interpreting the results\n",
"\n",
"Now that we ran the initial evaluation, let's analyze what might be causing any failures.\n",
"\n",
"From looking at the query where there are no results, genre-related queries might fail because the model doesn't know how genres are stored (e.g., \"Sci-Fi\" vs \"Science Fiction\")\n",
"\n",
"These types of issues would probably be improved by showing a sample of the data to the model (few-shot examples) since the data will show the LLM what is queryable."
]
},
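{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before changing the prompt, it can help to peek at how genre values are actually stored. The quick check below is a sketch that assumes the table exposes a `genres` column; adjust the column name if your schema differs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: how are genres actually stored? (assumes a `genres` column; adjust if needed)\n",
"genre_samples = conn.query(\n",
"    \"SELECT DISTINCT genres FROM movies WHERE genres IS NOT NULL LIMIT 10\"\n",
").to_df()\n",
"print(genre_samples)"
]
},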
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to improve the prompt with few-shot examples and see if we can get better results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"samples = conn.query(\"SELECT * FROM movies LIMIT 5\").to_df().to_dict(orient=\"records\")\n",
"\n",
"example_row = \"\\n\".join(\n",
" f\"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}\"\n",
" for column in columns\n",
")\n",
"\n",
"column_header = \" | \".join(column[\"column_name\"] for column in columns)\n",
"\n",
"few_shot_examples = \"\\n\".join(\n",
" \" | \".join(str(sample[column[\"column_name\"]]) for column in columns) for sample in samples\n",
")\n",
"\n",
"system_prompt = (\n",
" \"You are a SQL expert, and you are given a single table named `movies` with the following columns:\\n\\n\"\n",
" \"Column | Type | Example\\n\"\n",
" \"-------|------|--------\\n\"\n",
" f\"{example_row}\\n\"\n",
" \"\\n\"\n",
" \"Examples:\\n\"\n",
" f\"{column_header}\\n\"\n",
" f\"{few_shot_examples}\\n\"\n",
" \"\\n\"\n",
" \"Write a DuckDB SQL query corresponding to the user's request. \"\n",
" \"Return just the query text, with no formatting (backticks, markdown, etc.).\"\n",
")\n",
"\n",
"\n",
"async def generate_query(input):\n",
" response = await openai_client.chat.completions.create(\n",
" model=TASK_MODEL,\n",
" temperature=0,\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": system_prompt,\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": input,\n",
" },\n",
" ],\n",
" )\n",
" return response.choices[0].message.content\n",
"\n",
"\n",
"print(await generate_query(\"what are the best sci-fi movies in the 2000s?\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking much better! Finally, let's add a scoring function that compares the results, if they exist, with the expected results.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"experiment = await px_client.experiments.run_experiment(\n",
" dataset=ds, task=task, evaluators=[has_results, no_error], experiment_metadata=CONFIG\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_evaluation_summary(experiment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Amazing. It looks like the LLM is generating a valid query for all questions. Let's try out using LLM as a judge to see how well it can assess the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"from phoenix.client.experiments import create_evaluator\n",
"\n",
"judge_instructions = \"\"\"\n",
"You are a judge that determines if a given question can be answered with the provided SQL query and results.\n",
"Make sure to ensure that the SQL query maps to the question accurately.\n",
"\n",
"Provide the label `correct` if the SQL query and results accurately answer the question.\n",
"Provide the label `invalid` if the SQL query does not map to the question or is not valid.\n",
"\"\"\"\n",
"\n",
"\n",
"@create_evaluator(kind=\"llm\")\n",
"async def qa_correctness(input: Any, output: Any) -> float:\n",
" question = input.get(\"question\")\n",
" query = output.get(\"query\")\n",
" results = output.get(\"results\")\n",
" response = await openai_client.chat.completions.create(\n",
" model=\"gpt-4o\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": judge_instructions},\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": f\"Question: {question}\\nSQL Query: {query}\\nSQL Results: {results}\",\n",
" },\n",
" ],\n",
" tool_choice=\"required\",\n",
" tools=[\n",
" {\n",
" \"type\": \"function\",\n",
" \"function\": {\n",
" \"name\": \"qa_correctness\",\n",
" \"description\": \"Determine if the SQL query and results accurately answer the question.\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"explanation\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"Explain why the label is correct or invalid.\",\n",
" },\n",
" \"label\": {\"type\": \"string\", \"enum\": [\"correct\", \"invalid\"]},\n",
" },\n",
" },\n",
" },\n",
" }\n",
" ],\n",
" )\n",
" if response.choices[0].message.tool_calls is None:\n",
" raise ValueError(\"No tool call found in response\")\n",
" args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)\n",
" label = args[\"label\"]\n",
" score = 1 if label == \"correct\" else 0\n",
" return score\n",
"\n",
"\n",
"updated_experiment = await px_client.experiments.evaluate_experiment(\n",
" experiment=experiment, evaluators=[qa_correctness]\n",
")\n",
"\n",
"\n",
"def show_qa_correctness_results(exp):\n",
" qa_correctness_scores = [\n",
" run.result[\"score\"]\n",
" for run in exp[\"evaluation_runs\"]\n",
" if run.result and \"score\" in run.result and run.name == \"qa_correctness\"\n",
" ]\n",
"\n",
" print(\"📊 QA Correctness Evaluation Results:\")\n",
" if qa_correctness_scores:\n",
" avg_correctness = sum(qa_correctness_scores) / len(qa_correctness_scores)\n",
" print(f\" QA Correctness: {avg_correctness:.3f} (n={len(qa_correctness_scores)})\")\n",
"\n",
"\n",
"show_qa_correctness_results(updated_experiment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The LLM judge's scoring closely matches our manual evaluation, demonstrating its effectiveness as an automated evaluation method. This approach is particularly valuable when traditional rule-based scoring functions are difficult to implement.\n",
"\n",
"The LLM judge also shows an advantage in nuanced understanding - for example, it correctly identifies that 'Anime' and 'Animation' are distinct genres, a subtlety our code-based evaluators missed. This highlights why developing custom LLM judges tailored to your specific task requirements is crucial for accurate evaluation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"We now have a simple text2sql pipeline that can be used to generate SQL queries from natural language questions. Since Phoenix has been tracing the entire pipeline, we can now use the Phoenix UI to convert the spans that generated successful queries into examples to use in **Golden Dataset** for regression testing as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generating more data\n",
"\n",
"Let's generate some training data by having the model describe existing SQL queries from our dataset\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from typing import List\n",
"\n",
"from pydantic import BaseModel\n",
"\n",
"\n",
"class Question(BaseModel):\n",
" sql: str\n",
" question: str\n",
"\n",
"\n",
"class Questions(BaseModel):\n",
" questions: List[Question]\n",
"\n",
"\n",
"sample_rows = \"\\n\".join(\n",
" f\"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}\"\n",
" for column in columns\n",
")\n",
"synthetic_data_prompt = f\"\"\"You are a SQL expert, and you are given a single table named movies with the following columns:\n",
"\n",
"Column | Type | Example\n",
"-------|------|--------\n",
"{sample_rows}\n",
"\n",
"Generate SQL queries that would be interesting to ask about this table. Return the SQL query as a string, as well as the\n",
"question that the query answers. Keep the questions bounded so that they are not too broad or too narrow.\"\"\"\n",
"\n",
"response = await openai_client.chat.completions.create(\n",
" model=\"gpt-4o\",\n",
" temperature=0,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": synthetic_data_prompt,\n",
" }\n",
" ],\n",
" tools=[\n",
" {\n",
" \"type\": \"function\",\n",
" \"function\": {\n",
" \"name\": \"generate_questions\",\n",
" \"description\": \"Generate SQL queries that would be interesting to ask about this table.\",\n",
" \"parameters\": Questions.model_json_schema(),\n",
" },\n",
" }\n",
" ],\n",
" tool_choice={\"type\": \"function\", \"function\": {\"name\": \"generate_questions\"}},\n",
")\n",
"\n",
"assert response.choices[0].message.tool_calls is not None\n",
"generated_questions = json.loads(response.choices[0].message.tool_calls[0].function.arguments)[\n",
" \"questions\"\n",
"]\n",
"\n",
"print(\"Generated N questions: \", len(generated_questions))\n",
"print(\"First question: \", generated_questions[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"generated_dataset = []\n",
"for q in generated_questions:\n",
" try:\n",
" result = execute_query(q[\"sql\"])\n",
" example = {\n",
" \"input\": q[\"question\"],\n",
" \"expected\": {\n",
" \"results\": result or [],\n",
" \"query\": q[\"sql\"],\n",
" },\n",
" \"metadata\": {\n",
" \"category\": \"Generated\",\n",
" },\n",
" }\n",
" print(example)\n",
" generated_dataset.append(example)\n",
" except duckdb.Error as e:\n",
" print(f\"Query failed: {q['sql']}\", e)\n",
" print(\"Skipping...\")\n",
"\n",
"generated_dataset[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Awesome, let's crate a dataset with the new synthetic data.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"synthetic_dataset = await px_client.datasets.create_dataset(\n",
" name=\"movies-golden-synthetic\",\n",
" inputs=[{\"question\": example[\"input\"]} for example in generated_dataset],\n",
" outputs=[example[\"expected\"] for example in generated_dataset],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exp = await px_client.experiments.run_experiment(\n",
" dataset=synthetic_dataset,\n",
" task=task,\n",
" evaluators=[no_error, has_results],\n",
" experiment_metadata=CONFIG,\n",
" timeout=60,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_evaluation_summary(exp)"
]
},
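{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the synthetic dataset stores expected outputs, we could also score whether the generated results match the expected ones. The sketch below assumes experiment evaluators can receive the example's expected output through an `expected` argument; the exact-equality check is intentionally strict, and a fuzzier comparison (for example, comparing the sets of returned titles) may be more forgiving in practice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: compare generated results against the expected results stored with each example.\n",
"# Assumes evaluators can accept an `expected` argument and that expected outputs were\n",
"# stored above as {\"results\": [...], \"query\": ...}.\n",
"def results_match(output, expected):\n",
"    actual = output.get(\"results\") or []\n",
"    target = (expected or {}).get(\"results\") or []\n",
"    return 1.0 if actual == target else 0.0"
]
},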
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! We now have more data to work with. Here are some ways to improve it:\n",
"\n",
" - Review the generated data for issues\n",
" - Refine the prompt\n",
" - Show errors to the model\n",
"\n",
"This gives us a process to keep improving our system."
]
},
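{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to show errors to the model is a simple retry loop that feeds the SQL error message back into the conversation and asks for a corrected query. The sketch below reuses the existing `generate_query`, `execute_query`, `system_prompt`, and `TASK_MODEL`; the retry prompt wording is illustrative rather than part of the original pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: regenerate the query when execution fails, feeding the error back to the model.\n",
"@tracer.chain\n",
"async def text2sql_with_retry(question, max_retries=2):\n",
"    query = await generate_query(question)\n",
"    error = None\n",
"    for attempt in range(max_retries + 1):\n",
"        try:\n",
"            return {\"query\": query, \"results\": execute_query(query), \"error\": None}\n",
"        except duckdb.Error as e:\n",
"            error = str(e)\n",
"            if attempt == max_retries:\n",
"                break\n",
"            # Ask the model to fix the failing query, using the error message as feedback.\n",
"            response = await openai_client.chat.completions.create(\n",
"                model=TASK_MODEL,\n",
"                temperature=0,\n",
"                messages=[\n",
"                    {\"role\": \"system\", \"content\": system_prompt},\n",
"                    {\"role\": \"user\", \"content\": question},\n",
"                    {\"role\": \"assistant\", \"content\": query},\n",
"                    {\n",
"                        \"role\": \"user\",\n",
"                        \"content\": f\"That query failed with: {error}. Return a corrected DuckDB SQL query, with no formatting.\",\n",
"                    },\n",
"                ],\n",
"            )\n",
"            query = response.choices[0].message.content\n",
"    return {\"query\": query, \"results\": None, \"error\": error}"
]
},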
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"In this tutorial, we built a text-to-SQL system for querying movie data. We started with basic examples and evaluators, then improved performance by adding few-shot examples as well as using an llm judge for evaluation.\n",
"\n",
"Key takeaways:\n",
"- Start with simple evaluators to catch basic issues\n",
"- Use few-shot examples to improve accuracy\n",
"- Generate more training data using LLMs\n",
"- Track progress with Phoenix's experiments\n",
"\n",
"You can further improve this system by adding better evaluators or handling edge cases."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}