# Evaluating Language Models in RustyFlow

Once a language model is trained, we need a way to measure its performance. This document explains the standard metric used for this purpose, **Perplexity**, and how to perform evaluation in the `RustyFlow` library.

## What is Perplexity?

**Perplexity (PPL)** is the standard metric for evaluating language models. It measures how well a probability model predicts a sample. In simple terms, perplexity is a measure of "surprise." A lower perplexity score indicates that the model is less surprised by the text in the test set, meaning it predicts the sequence of words more accurately.

### How it's Calculated

Perplexity is the exponentiated average cross-entropy loss.

- **Cross-Entropy Loss**: For each token in a sequence, the loss measures the negative log probability the model assigned to the correct next token. A lower loss means the model was more confident about the correct token.
- **Average Loss**: We calculate the average loss over an entire dataset (e.g., a validation or test set).
- **Perplexity**: The final perplexity is calculated as `exp(average_loss)`.

The formula is:

`PPL = exp( (1/N) * Σ( -log P(word_i | context_i) ) )`

where `N` is the number of tokens in the test set.

**Interpretation:** A perplexity of `K` means that, on average, the model is as confused as if it had to choose uniformly and independently from `K` possibilities for each token. A perfect model that assigns probability 1 to the correct next token would have a perplexity of 1.

## How to Evaluate in RustyFlow

The `language_model` example has been updated to automatically perform evaluation after the training loop completes.

### Train/Validation/Test Splits

Following standard machine learning practice, we split our data into three sets:

1. **Training Set**: Used to train the model's parameters.
2. **Validation Set**: Used to tune hyperparameters (like learning rate, model size, etc.) and check for overfitting during development. The model does not train on this data.
3. **Test Set**: A held-out set used only once to report the final performance of the fully trained and tuned model. This provides an unbiased measure of the model's generalization ability.

### The `language_model` Example

- **Training**: The model is trained on the training portion of the selected dataset.
- **Validation**: After training, the example automatically runs an evaluation on the validation set and reports the average loss and perplexity.
- **Dataset Handling**:
  - For **`wikitext-2`**, the example uses the official `wiki.train.tokens` and `wiki.valid.tokens` files. To get a final benchmark score, you would evaluate on `wiki.test.tokens`.
  - For **`tinyshakespeare`** and custom text files, the example performs a 90/10 split to create training and validation sets (see the sketch at the end of this document).

To run the demo and see the evaluation:

```bash
# This will train on TinyShakespeare and then evaluate it.
./run.sh demo
```

The output will include a final section with the validation loss and perplexity. This allows you to see if changes to the model or hyperparameters are improving its predictive power.
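
### Sketch: From Per-Token Losses to Perplexity

To make the relationship between the reported loss and perplexity concrete, here is a minimal, self-contained Rust sketch. It is illustrative only and does not use RustyFlow's actual API; the `perplexity` function and the hard-coded losses are hypothetical stand-ins for the per-token cross-entropy values a validation pass would produce.

```rust
/// Compute the average cross-entropy loss and perplexity from per-token losses.
/// Each entry is -ln P(token_i | context_i), the negative log probability the
/// model assigned to the correct next token.
fn perplexity(token_losses: &[f64]) -> (f64, f64) {
    let n = token_losses.len() as f64;
    let avg_loss = token_losses.iter().sum::<f64>() / n;
    // PPL = exp(average loss)
    (avg_loss, avg_loss.exp())
}

fn main() {
    // Hypothetical per-token losses collected over a validation set.
    let losses = vec![2.1, 1.7, 3.0, 2.4, 1.9];
    let (avg, ppl) = perplexity(&losses);
    println!("validation loss: {avg:.4}, perplexity: {ppl:.2}");
}
```

Because perplexity is just `exp` of the average loss, any change that lowers the validation loss lowers perplexity by the same ordering, so the two numbers always agree on whether a model improved.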
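
### Sketch: The 90/10 Train/Validation Split

The 90/10 split mentioned above for `tinyshakespeare` and custom text files can be sketched as below. Again, this is an illustrative sketch rather than the example's actual code; `train_valid_split` and the dummy token IDs are made up for demonstration.

```rust
/// Split a token sequence into roughly 90% training and 10% validation data.
fn train_valid_split(tokens: &[u32]) -> (&[u32], &[u32]) {
    let split_at = (tokens.len() * 9) / 10;
    tokens.split_at(split_at)
}

fn main() {
    // Hypothetical token IDs standing in for a tokenized corpus.
    let tokens: Vec<u32> = (0..100).collect();
    let (train, valid) = train_valid_split(&tokens);
    println!("train: {} tokens, valid: {} tokens", train.len(), valid.len());
}
```

The validation slice is taken from the tail of the corpus and is never used for training, which is what makes its loss and perplexity a fair check for overfitting.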
