# Data Handling in RustyFlow
This document explains how datasets are prepared, configured, and loaded when training models with the RustyFlow library.
## Configuring a Training Run
Training is configured via a central `config.env` file and executed with the `train.sh` script. This approach allows you to manage all hyperparameters and settings in one place.
### The `config.env` File
If `config.env` doesn't exist, you can create a default version by running:
```bash
./run.sh setup
```
This file contains key-value pairs for settings like:
- `DATASET`: The dataset to use (`tinyshakespeare`, `wikitext-2`, `short`, or a path to a file).
- `MODEL_PATH`: Where to save the trained model.
- `NUM_EPOCHS`, `BATCH_SIZE`, `SEQ_LEN`, `LEARNING_RATE`: Training hyperparameters.
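For reference, a populated `config.env` might look like the sketch below. The variable names come from the list above; the values are purely illustrative and may differ from the defaults that `./run.sh setup` writes.
```bash
# config.env -- illustrative values only; your generated defaults may differ
DATASET="tinyshakespeare"     # tinyshakespeare | wikitext-2 | short | path/to/corpus.txt
MODEL_PATH="models/lm.bin"    # hypothetical output path for the trained model
NUM_EPOCHS=5
BATCH_SIZE=32
SEQ_LEN=64
LEARNING_RATE=0.001
```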
### Starting a Training Session
To start training, simply run the `train.sh` script. It will automatically pick up the settings from `config.env`.
```bash
./train.sh
```
You can also manage multiple configurations by creating different `.env` files and passing the path as an argument:
```bash
# Create a custom config for wikitext
cp config.env wikitext_config.env
# (edit wikitext_config.env to set DATASET="wikitext-2")
# Run training with the custom config
./train.sh wikitext_config.env
```
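If you accumulate several such configs, a plain shell loop can run them back to back. This is only a sketch: it assumes your `.env` files live in a `configs/` directory, which is not part of the project layout described here.
```bash
# Run one training session per config file (hypothetical configs/ directory)
for cfg in configs/*.env; do
    [ -e "$cfg" ] || continue   # skip if the glob matched nothing
    echo "Training with $cfg"
    ./train.sh "$cfg"
done
```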
### Available Pre-configured Datasets
- `short`: A small, in-memory paragraph for quick sanity checks. No download needed.
- `tinyshakespeare`: (Default) A small plain-text sample of Shakespeare's writing; a good starting point. **Requires download.**
- `wikitext-2`: A more standard language modeling benchmark. **Included with the project.**
## Dataset Sources
### WikiText-2 (Included)
The **WikiText-2** dataset is a well-known benchmark for language modeling. The pre-tokenized version is **included** directly in this project under the `data/wikitext-2/` directory and is tracked by Git.
- **Files**: `wiki.train.tokens`, `wiki.valid.tokens`, `wiki.test.tokens`.
- **Usage**: No download is necessary. To use it, set `DATASET="wikitext-2"` in your `config.env` and run `./train.sh`.
### TinyShakespeare (Downloadable)
The **TinyShakespeare** dataset is a small plain-text sample of Shakespeare's writing, which makes it a good dataset for initial testing.
- **Download**: Download it with the provided script (make sure `curl` is installed):
```bash
./run.sh get-data tinyshakespeare
```
This will create `data/tinyshakespeare/input.txt`. The `data/tinyshakespeare` directory is ignored by Git.
- **Usage**: Set `DATASET="tinyshakespeare"` in your `config.env` and run `./train.sh`.
**Note:** The `language_model` example uses the pre-tokenized `wiki.train.tokens` for training and `wiki.valid.tokens` for validation. A proper benchmark result should be reported on `wiki.test.tokens` after model development is complete. See `docs/evaluation.md` for more details.
## A Note on Data in Version Control
The `wikitext-2` dataset is currently tracked by Git. This decision was made to ensure the project works "out of the box" after a fresh clone, without requiring users to download and set up data files separately.
**Trade-offs:**
- **Pros**: Maximum convenience. Anyone can clone the repository and immediately run training on a standard benchmark.
- **Cons**: Increases the repository size. The `wikitext-2` dataset is approximately 5MB. For much larger datasets, this approach would be unsuitable.
**Alternative (Compression):**
An alternative approach would be to store a compressed archive (e.g., `data/wikitext-2.zip`) in Git and add the uncompressed directory (`data/wikitext-2/`) to `.gitignore`. A setup script would then be required to extract the data before the first training run.
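A minimal sketch of such a setup step is shown below. The archive name mirrors the example above, and the script itself is hypothetical, since the project currently ships the data uncompressed.
```bash
#!/usr/bin/env bash
# Hypothetical setup step for the compressed-archive layout described above.
# Assumes data/wikitext-2.zip is tracked in Git and `unzip` is available.
set -euo pipefail

if [ ! -d data/wikitext-2 ]; then
    unzip -q data/wikitext-2.zip -d data/
fi
```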
For this project, convenience was prioritized, but for projects with larger datasets, the compression method is recommended.
## How the `DataLoader` Works
The current `DataLoader` (`src/data.rs`) is designed for simplicity. Here's a summary of its operation:
1. **Initialization**: `DataLoader::new(corpus, vocab, batch_size, seq_len)`
- It takes the entire text corpus as a single `&str`.
- It tokenizes the corpus into a flat vector of token IDs.
- It then creates overlapping sequences of `seq_len` tokens: for an input sequence `tokens[i..i+seq_len]`, the target is the same window shifted by one position, `tokens[i+1..i+1+seq_len]`. For example, with `seq_len = 3` and the token IDs `[5, 9, 2, 7, 4]`, the first input is `[5, 9, 2]` and its target is `[9, 2, 7]`.
2. **Iteration**:
- When you iterate over the `DataLoader`, it yields `DataBatch` structs.
- Each `DataBatch` contains `inputs` and `targets` `Tensor`s, with a shape of `[batch_size, seq_len]`.
- It processes the data in chunks of `batch_size`. For simplicity, it drops the final partial batch if the total number of sequences is not perfectly divisible by `batch_size`.
### Limitations and Future Improvements
- **Memory Usage**: The current implementation loads the entire corpus into memory and generates all possible input/target sequences at once. This is fine for datasets like TinyShakespeare, but will not scale to gigabyte-sized datasets.
- **Future Work**: A more advanced `DataLoader` would stream data from disk, tokenize on-the-fly, and use techniques like memory-mapped files to handle very large corpora efficiently.
## Using Your Own Dataset
You can easily train on a custom text file.
1. **Prepare your data**: Place your training data into a single `.txt` file (e.g., `my_corpus.txt`).
2. **Update your config**: In `config.env`, set `DATASET="path/to/your/my_corpus.txt"`.
3. **Adjust Hyperparameters**: In the same `config.env` file, adjust `NUM_EPOCHS`, `BATCH_SIZE`, `SEQ_LEN`, etc., to suit the size and complexity of your dataset (see the example config after this list).
4. **Run training**:
```bash
./train.sh
```
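Putting steps 2 and 3 together, a custom-corpus `config.env` might look like this; the file paths and hyperparameter values are purely illustrative.
```bash
# config.env for a custom corpus (illustrative values)
DATASET="data/my_corpus.txt"      # hypothetical path to your own text file
MODEL_PATH="models/my_model.bin"  # hypothetical output path
NUM_EPOCHS=3
BATCH_SIZE=16
SEQ_LEN=128
LEARNING_RATE=0.0005
```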
This approach allows for flexible experimentation with different text sources while keeping all configuration in one place.