---
language:
- en
license: apache-2.0
tags:
- zig
- code
- programming
- dataset
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- question-answering
---

# ZigNet Training Dataset

**Curated dataset of Zig programming examples for LLM fine-tuning**

This dataset was created for the [ZigNet](https://github.com/fulgidus/zignet) project to train language models on Zig programming language patterns, idioms, and documentation.

## Dataset Description

- **Total Examples**: 13,756
- **Source**: Official Zig documentation (v0.13, v0.14, v0.15)
- **Format**: JSONL (instruction-response pairs)
- **Language**: English
- **Zig Versions**: 0.13.0, 0.14.1, 0.15.1

## Dataset Structure

### Files

```
data/training/
├── dataset-train.jsonl       # 9,629 examples (70%)
├── dataset-validation.jsonl  # 2,063 examples (15%)
├── dataset-test.jsonl        # 2,064 examples (15%)
└── dataset-stats.json        # Dataset statistics
```

### Data Fields

Each example is a JSON object with:

```json
{
  "instruction": "Explain this Zig code feature",
  "input": "const x: i32 = 42;",
  "output": "This declares a constant variable `x` of type `i32` (32-bit signed integer) with value 42..."
}
```

**Fields**:

- `instruction`: Task description (explain/analyze/generate Zig code)
- `input`: Zig code snippet or context
- `output`: Expected response (explanation, corrected code, documentation)

## Dataset Statistics

| Split      | Examples   | Size     |
| ---------- | ---------- | -------- |
| Train      | 9,629      | ~4.2MB   |
| Validation | 2,063      | ~900KB   |
| Test       | 2,064      | ~900KB   |
| **Total**  | **13,756** | **~6MB** |

### Version Distribution

- **Zig 0.13.x**: ~4,200 examples
- **Zig 0.14.x**: ~4,700 examples
- **Zig 0.15.x**: ~4,856 examples

### Topic Coverage

- ✅ Variables and types (const, var, comptime)
- ✅ Functions and control flow
- ✅ Structs, enums, unions
- ✅ Generics and comptime programming
- ✅ Error handling (error sets, try, catch)
- ✅ Memory management (allocators, defer)
- ✅ Pointers and slices
- ✅ Async/await patterns
- ✅ Build system (build.zig)
- ✅ Standard library usage

## Data Collection

### Methodology

1. **Scraping**: Automated extraction from [ziglang.org/documentation](https://ziglang.org/documentation/)
2. **Parsing**: Extract code blocks with surrounding context
3. **Filtering**: Remove duplicates, invalid syntax, incomplete examples
4. **Augmentation**: Add instruction-response pairs for each example
5. **Validation**: Verify all examples compile with respective Zig versions

### Quality Control

- 100% syntax-valid (verified with `zig ast-check`)
- Manually reviewed for semantic correctness
- Deduplicated across versions
- Balanced distribution of difficulty levels

## Intended Use

### Primary Use Cases

- Fine-tuning LLMs for Zig code understanding
- Training code completion models
- Building Zig documentation assistants
- Research in programming language modeling

### Out of Scope

- General-purpose code generation (dataset is Zig-specific)
- Production-critical code without human review
- Zig versions outside the 0.13-0.15 range

## Usage

### Loading with Datasets Library

```python
from datasets import load_dataset

# Load full dataset
dataset = load_dataset("fulgidus/zignet-training-dataset")

# Load specific split
train = load_dataset("fulgidus/zignet-training-dataset", split="train")
val = load_dataset("fulgidus/zignet-training-dataset", split="validation")
test = load_dataset("fulgidus/zignet-training-dataset", split="test")
```

### Manual Loading (JSONL)

```python
import json

def load_jsonl(filepath):
    # One JSON object per line; blank lines are skipped.
    with open(filepath, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

train_data = load_jsonl("dataset-train.jsonl")
```

### Training Example (Transformers)

`Trainer` expects tokenized inputs, so the raw text fields are tokenized first. The prompt layout and `max_length` below are illustrative choices, not fixed by the dataset.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
dataset = load_dataset("fulgidus/zignet-training-dataset")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(example):
    # Join the three fields into one training string (illustrative layout).
    text = f"{example['instruction']}\n\n{example['input']}\n\n{example['output']}"
    return tokenizer(text, truncation=True, max_length=1024)

# Replace the raw text columns with token IDs.
tokenized = dataset.map(tokenize, remove_columns=dataset["train"].column_names)

training_args = TrainingArguments(
    output_dir="./zignet-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    # mlm=False pads each batch and copies input_ids into labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
```

## Versioning

The dataset is versioned to track Zig language evolution:

- **v1.0** (Oct 2025): Zig 0.13-0.15, 13,756 examples
- Future updates will include newer Zig versions as released

## Limitations

- **Version-specific syntax**: Some examples may not work across all Zig versions
- **Documentation-focused**: Real-world production patterns are underrepresented
- **English only**: Comments and explanations are in English
- **No executable context**: Snippets may require additional imports/setup

## Ethical Considerations

- **Source**: All examples come from official Zig documentation (permissive license)
- **Attribution**: Zig project acknowledged (MIT license compatible)
- **Bias**: Dataset reflects the official documentation's bias toward teaching examples
- **Privacy**: No user data, only public documentation

## Citation

```bibtex
@dataset{zignet_dataset2025,
  author = {fulgidus},
  title = {ZigNet Training Dataset: Zig Code Examples from Official Documentation},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/fulgidus/zignet-training-dataset}
}
```

## License

Apache-2.0. The Zig project itself is MIT-licensed, which is compatible with redistributing its documentation examples under this license with attribution.

## Acknowledgments

- **Zig Language**: [ziglang.org](https://ziglang.org) - Andrew Kelley and contributors
- **Documentation**: Official Zig documentation team
- **Tools**: HuggingFace Datasets, Transformers

---

**Project**: [github.com/fulgidus/zignet](https://github.com/fulgidus/zignet)
**Model**: [fulgidus/zignet-qwen2.5-coder-7b](https://huggingface.co/fulgidus/zignet-qwen2.5-coder-7b)
**Author**: fulgidus
**Date**: October 2025
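
As a companion to the Data Fields and Dataset Statistics sections, here is a minimal sketch of how a downloaded split could be sanity-checked before training. It is not part of the ZigNet tooling: the helper names (`validate_record`, `check_jsonl_lines`, `split_fractions`) are hypothetical, and treating all three fields as strings is an assumption based on the example record.

```python
import json
from collections import Counter

# Field names come from the Data Fields section; the string type is an assumption.
REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_record(record):
    """Return a list of problems found in one example (empty if none)."""
    if not isinstance(record, dict):
        return ["not a JSON object"]
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field is not a string: {field}")
    return problems


def check_jsonl_lines(lines):
    """Count valid and invalid examples in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["invalid"] += 1
            continue
        counts["invalid" if validate_record(record) else "valid"] += 1
    return counts


def split_fractions(sizes):
    """Return each split's share of the total, rounded to three decimals."""
    total = sum(sizes.values())
    return {name: round(n / total, 3) for name, n in sizes.items()}


# Hypothetical sample lines: one complete record, one missing "output", one not JSON.
sample = [
    '{"instruction": "Explain this Zig code feature", '
    '"input": "const x: i32 = 42;", "output": "Declares a constant."}',
    '{"instruction": "Explain this Zig code feature", "input": "var y: u8 = 1;"}',
    "not json",
]

print(check_jsonl_lines(sample))
# Split sizes as reported in the Dataset Statistics table.
print(split_fractions({"train": 9629, "validation": 2063, "test": 2064}))
```

To check a real file, pass an open file handle, for example `check_jsonl_lines(open("dataset-train.jsonl", encoding="utf-8"))`. The second call confirms that the reported split sizes match the stated 70/15/15 split.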
