{
"cells": [
{
"cell_type": "markdown",
"id": "c65bdb85",
"metadata": {},
"source": [
"# Distilling Knowledge into Tiny LLMs\n",
"\n",
"Large Language Models (LLMs) are the magic behind AI. These massive billion and trillion parameter models have been shown to generalize well when trained on enough data.\n",
"\n",
"A big problem is that they are hard to run and expensive. So many just call LLMs through APIs such as OpenAI or Claude. Additionally, in many instances, developers spend a lot of time with complex prompt logic hoping to cover all the edge cases and believe they need a model that's large enough to handle all the rules.\n",
"\n",
"If you truly want control over your business processes, running a local model is a better choice. And the good news is that it doesn't have to be a giant and expensive multi-billion parameter model. We can finetune LLMs to handle our specific business logic, which helps us take control and limit prompt complexity. \n",
"\n",
"This notebook will show how we can distill knowledge into tiny LLMs."
]
},
{
"cell_type": "markdown",
"id": "279ecd4b",
"metadata": {},
"source": [
"# Install dependencies\n",
"\n",
"Install `txtai` and all dependencies."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f7bca3a",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline-train] datasets"
]
},
{
"cell_type": "markdown",
"id": "c7b1b4be",
"metadata": {},
"source": [
"# The LLM\n",
"\n",
"We'll use a [600M parameter Qwen3 model](https://hf.co/qwen/qwen3-0.6b) for this example. Our target task will be translating user requests into linux commands."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "e1d95dd8",
"metadata": {},
"outputs": [],
"source": [
"from txtai import LLM\n",
"\n",
"llm = LLM(\"Qwen/Qwen3-0.6B\")"
]
},
{
"cell_type": "markdown",
"id": "8f276cfc",
"metadata": {},
"source": [
"Let's try one with the base model as it is."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "686c3983",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'ps -e'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm(\"\"\"\n",
"Translate the following request into a linux command. Only print the command.\n",
"\n",
"Find number of logged in users\n",
"\"\"\", maxlength=1024)"
]
},
{
"cell_type": "markdown",
"id": "dcd92414",
"metadata": {},
"source": [
"As we can see, the model actually has a good understanding and at least prints a command. But in this case it's not correct. Let's get to fine-tuning!"
]
},
{
"cell_type": "markdown",
"id": "73666982",
"metadata": {},
"source": [
"# Finetuning the LLM with knowledge\n",
"\n",
"Yes, 600M parameters is small and we can't possibly expect it to do well with everything. But the good news is that we can distill knowledge into this tiny LLM and make it better. We'll use this [linux commands dataset](https://huggingface.co/datasets/mecha-org/linux-command-dataset) from the Hugging Face Hub. We'll also use this [training pipeline from txtai](https://neuml.github.io/txtai/pipeline/train/trainer).\n",
"\n",
"First, we'll create the training dataset. We'll use the same prompt strategy from above.\n",
"\n",
"```python\n",
"\"\"\"\n",
"Translate the following request into a linux command. Only print the command.\n",
"\n",
"{user request}\n",
"\"\"\"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "43d6b563",
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"from transformers import AutoTokenizer\n",
"\n",
"# LLM path\n",
"path = \"Qwen/Qwen3-0.6B\"\n",
"tokenizer = AutoTokenizer.from_pretrained(path)\n",
"\n",
"# Load the training dataset\n",
"dataset = load_dataset(\"mecha-org/linux-command-dataset\", split=\"train\")\n",
"\n",
"def prompt(row):\n",
" text = tokenizer.apply_chat_template([\n",
" {\"role\": \"system\", \"content\": \"Translate the following request into a linux command. Only print the command.\"},\n",
" {\"role\": \"user\", \"content\": row[\"input\"]},\n",
" {\"role\": \"assistant\", \"content\": row[\"output\"]}\n",
" ], tokenize=False, enable_thinking=False)\n",
"\n",
" return {\"text\": text}\n",
"\n",
"# Map to training prompts\n",
"train = dataset.map(prompt, remove_columns=[\"input\", \"output\"])"
]
},
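{
"cell_type": "markdown",
"id": "3f2a9c1e",
"metadata": {},
"source": [
"As a quick sanity check (an extra step, not required for training), we can print the first formatted example to confirm the chat template rendered the system, user and assistant turns as expected."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b7d0e2a",
"metadata": {},
"outputs": [],
"source": [
"# Inspect the first formatted training prompt\n",
"print(train[0][\"text\"])"
]
},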
{
"cell_type": "code",
"execution_count": 2,
"id": "7f71dce0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <div>\n",
" \n",
" <progress value='210' max='210' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
" [210/210 01:12, Epoch 1/1]\n",
" </div>\n",
" <table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>Step</th>\n",
" <th>Training Loss</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>50</td>\n",
" <td>0.625300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>100</td>\n",
" <td>0.490200</td>\n",
" </tr>\n",
" <tr>\n",
" <td>150</td>\n",
" <td>0.403300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>200</td>\n",
" <td>0.391800</td>\n",
" </tr>\n",
" </tbody>\n",
"</table><p>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from txtai.pipeline import HFTrainer\n",
"\n",
"# Load the training pipeline\n",
"trainer = HFTrainer()\n",
"\n",
"# Train the model\n",
"# Set output_dir to save, trained in memory for this example\n",
"model = trainer(\n",
" \"Qwen/Qwen3-0.6B\",\n",
" train,\n",
" task=\"language-generation\",\n",
" maxlength=512,\n",
" bf16=True,\n",
" per_device_train_batch_size=4,\n",
" num_train_epochs=1,\n",
" logging_steps=50,\n",
")"
]
},
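{
"cell_type": "markdown",
"id": "9e4c6a1b",
"metadata": {},
"source": [
"As the comment above notes, passing `output_dir` saves the model during training. For completeness, here is a minimal sketch of saving the in-memory result afterwards. It assumes `HFTrainer` returned a `(model, tokenizer)` tuple and that both objects support the standard Hugging Face `save_pretrained` method; the output path is hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c8f3d7e",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: persist the trained model and tokenizer to disk\n",
"# Assumes model is a (model, tokenizer) tuple; the output path is hypothetical\n",
"trained, trainedtokenizer = model\n",
"trained.save_pretrained(\"qwen3-linux-commands\")\n",
"trainedtokenizer.save_pretrained(\"qwen3-linux-commands\")"
]
},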
{
"cell_type": "code",
"execution_count": null,
"id": "68b5bc4e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'who | wc -l'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from txtai import LLM\n",
"\n",
"llm = LLM(model)\n",
"\n",
"llm([\n",
" {\"role\": \"system\", \"content\": \"Translate the following request into a linux command. Only print the command.\"},\n",
" {\"role\": \"user\", \"content\": \"Find number of logged in users\"}\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "7427186d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'ls ~/'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm([\n",
" {\"role\": \"system\", \"content\": \"Translate the following request into a linux command. Only print the command.\"},\n",
" {\"role\": \"user\", \"content\": \"List the files in my home directory\"}\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "a141d63b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'zip -r data.zip data'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm([\n",
" {\"role\": \"system\", \"content\": \"Translate the following request into a linux command. Only print the command.\"},\n",
" {\"role\": \"user\", \"content\": \"Zip the data directory with all it's contents\"}\n",
"])"
]
},
{
"cell_type": "markdown",
"id": "fc7d0fda",
"metadata": {},
"source": [
"It even works well without the system prompt."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "a7d0c671",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'du -sh ~'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm(\"Calculate the total amount of disk space used for my home directory. Only print the total.\")"
]
},
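{
"cell_type": "markdown",
"id": "7a2e5c9d",
"metadata": {},
"source": [
"Since we repeat the same system prompt for every request, we could wrap it in a small helper. The `to_command` function below is a hypothetical convenience wrapper for this notebook, not part of `txtai`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d6b8f0a",
"metadata": {},
"outputs": [],
"source": [
"def to_command(request):\n",
"    # Hypothetical helper that wraps the system prompt used throughout this notebook\n",
"    return llm([\n",
"        {\"role\": \"system\", \"content\": \"Translate the following request into a linux command. Only print the command.\"},\n",
"        {\"role\": \"user\", \"content\": request}\n",
"    ])\n",
"\n",
"to_command(\"Show the 10 largest files in the current directory\")"
]
},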
{
"cell_type": "markdown",
"id": "48cfcacf",
"metadata": {},
"source": [
"# Wrapping up\n",
"\n",
"This notebook demonstrated how it's very straightforward to distill knowledge into LLMs with `txtai`. Don't always go for the giant LLM, spend a little time finetuning a tiny LLM, it is well worth it!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "local",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.19"
}
},
"nbformat": 4,
"nbformat_minor": 5
}