estimate_training_cost
Estimate GPU-hours, wall-clock time, cost, and sharding strategy for training runs. Supports pretraining, fine-tuning, LoRA, and RL with automatic parallelism recommendations.
Instructions
Estimate GPU-hours, wall-clock time, cost, and sharding strategy for a training run.
Covers pre-training, continual pre-training, full SFT, parameter-efficient fine-tuning (LoRA / QLoRA), and RL. Uses Chinchilla scaling laws for pre-training compute estimates. LoRA/QLoRA train only small adapters, so they need far less VRAM and fewer GPUs than full fine-tuning (QLoRA quantizes the base to 4-bit). Also returns a recommended parallelism strategy (DDP / FSDP-ZeRO-3 / tensor+pipeline parallel) based on model footprint, GPU VRAM, and interconnect.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| model_params_b | Yes | Model size in billions of parameters (e.g. 7 for 7B). | |
| training_type | No | One of pretrain, continual_pretrain, sft, lora, qlora, rl. | sft |
| dataset_tokens | No | Number of training tokens. Uses sensible defaults if omitted. | |
| gpu_key | No | GPU type key (h100_sxm, a100_80gb_sxm, h200_sxm, rtx_4090, l40s). | h100_sxm |
| num_gpus | No | Override GPU count. Auto-calculated from VRAM if omitted. |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| training_type | Yes | ||
| model_params_b | Yes | ||
| dataset_tokens | Yes | ||
| gpu_type | Yes | ||
| gpu_count | Yes | ||
| mfu | Yes | ||
| total_flops_exaflops | Yes | ||
| effective_gpu_hours | Yes | ||
| wall_clock_days | Yes | ||
| vram_required_gb | Yes | ||
| cloud_costs | Yes | ||
| onprem_cost_usd | Yes | ||
| onprem_capex_usd | Yes | ||
| chinchilla_optimal_tokens | Yes | ||
| parallelism_strategy | Yes | ||
| parallelism_degrees | Yes | ||
| parallelism_framework | Yes | ||
| notes | Yes |