estimate_inference_cost
Compare cloud API and self-hosted inference costs for any token volume, with break-even analysis and options for quantization and latency targets.
Instructions
Compare cloud API and self-hosted inference costs for a given token volume.
Returns the monthly cost for each major API provider and each self-hosted option, with a break-even analysis. Self-hosted sizing accounts for two levers:
- Quantization (fp8/int8/int4) shrinks the model's VRAM footprint (so fewer GPUs per replica) and raises throughput, at a small cost in quality.
- The latency target determines how many replicas are needed to serve the daily output volume at peak load, so an option only counts as "cheaper" if it can actually keep up.
Each self-hosted option reports its per-replica topology, replicas_needed, and total GPUs.
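The two sizing levers can be pictured with a minimal sketch. Everything in it is an assumption made for illustration, not the tool's actual model: the bytes-per-parameter table (with `none` read as 16-bit weights), the peak-to-average factors per latency target, the flat 20% memory overhead, and the function names `gpus_per_replica` and `replicas_needed_for` are all hypothetical.

```python
import math

# Illustrative assumptions only; the tool's real constants are not documented here.
BYTES_PER_PARAM = {"none": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}  # "none" assumed to be fp16/bf16
PEAK_FACTOR = {"realtime": 4.0, "near_realtime": 2.0, "batch": 1.0, "offline": 1.0}  # hypothetical ratios
SECONDS_PER_DAY = 86_400


def gpus_per_replica(params_billion: float, quantization: str,
                     gpu_vram_gb: float, overhead: float = 1.2) -> int:
    """GPUs needed to hold one model copy, with a flat overhead for KV cache and activations."""
    weights_gb = params_billion * BYTES_PER_PARAM[quantization]
    return math.ceil(weights_gb * overhead / gpu_vram_gb)


def replicas_needed_for(daily_output_tokens: int, latency: str, replica_tps: float) -> int:
    """Replicas required to sustain peak output throughput; never fewer than one."""
    average_tps = daily_output_tokens / SECONDS_PER_DAY
    peak_tps = average_tps * PEAK_FACTOR[latency]
    return max(1, math.ceil(peak_tps / replica_tps))


# A hypothetical 70B model on 80 GB GPUs, serving 86.4M output tokens per day.
gpus = gpus_per_replica(70, "int4", gpu_vram_gb=80)
replicas = replicas_needed_for(86_400_000, "near_realtime", replica_tps=1500)
total_gpus = gpus * replicas
```

Under these assumed numbers, int4 fits the model on one GPU per replica where unquantized weights would need three, while the near_realtime target calls for two replicas to cover peak load. Total GPUs is the product of the two, which is why both levers affect whether self-hosting comes out cheaper.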
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| daily_input_tokens | Yes | Average input tokens per day. | |
| daily_output_tokens | Yes | Average output tokens per day. | |
| daily_images | No | Number of images processed per day (for vision/multimodal workloads). When non-zero, image costs are added to the monthly bill for vision-capable models. | |
| use_case | No | | general |
| quality | No | | high |
| latency | No | Target responsiveness (realtime/near_realtime/batch/offline) — drives replica sizing. | near_realtime |
| quantization | No | Self-hosted serving precision (none/fp8/int8/int4). | none |
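As a rough illustration of the input schema, the sketch below assembles an arguments payload and checks it before the tool is called. The helper `build_arguments` is hypothetical and not part of the tool. Reading `general` and `high` as the defaults for use_case and quality is an assumption drawn from the table, and only latency and quantization are validated because they are the only fields whose allowed values are listed.

```python
# Allowed values and defaults as read from the input schema table.
ALLOWED = {
    "latency": {"realtime", "near_realtime", "batch", "offline"},
    "quantization": {"none", "fp8", "int8", "int4"},
}
DEFAULTS = {
    "use_case": "general",       # assumed default
    "quality": "high",           # assumed default
    "latency": "near_realtime",
    "quantization": "none",
}
REQUIRED = ("daily_input_tokens", "daily_output_tokens")


def build_arguments(**overrides):
    """Merge caller values over the defaults and reject invalid payloads (hypothetical helper)."""
    args = {**DEFAULTS, **overrides}
    missing = [name for name in REQUIRED if name not in args]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    for name, allowed in ALLOWED.items():
        if args[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {args[name]!r}")
    return args


arguments = build_arguments(
    daily_input_tokens=50_000_000,
    daily_output_tokens=10_000_000,
    quantization="fp8",
)
```

The resulting dictionary can be passed as the tool's arguments. Fields left out, such as daily_images, are simply omitted rather than guessed.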
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| daily_input_tokens | Yes | ||
| daily_output_tokens | Yes | ||
| daily_images | Yes | ||
| monthly_input_tokens | Yes | ||
| monthly_output_tokens | Yes | ||
| quantization | Yes | ||
| required_throughput_tps | Yes | ||
| api_options | Yes | ||
| self_hosted_options | Yes | ||
| cheapest_api | Yes | ||
| cheapest_self_hosted | Yes | ||
| recommendation | Yes | ||
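The relationship between the monthly token fields and the break-even analysis can be sketched as follows. This is a simplified model under stated assumptions, not the tool's formula: a 30-day month, per-million-token API pricing, a fixed monthly self-hosted cost, and the function names and prices are all hypothetical.

```python
def monthly_api_cost(daily_input_tokens, daily_output_tokens,
                     input_price_per_mtok, output_price_per_mtok,
                     days_per_month=30):
    """Monthly API bill in dollars, given prices per million tokens (assumed pricing model)."""
    monthly_input_tokens = daily_input_tokens * days_per_month
    monthly_output_tokens = daily_output_tokens * days_per_month
    return (monthly_input_tokens * input_price_per_mtok
            + monthly_output_tokens * output_price_per_mtok) / 1_000_000


def break_even_daily_tokens(self_hosted_monthly_cost, blended_price_per_mtok,
                            days_per_month=30):
    """Daily token volume above which a fixed self-hosted cost undercuts the API bill."""
    return self_hosted_monthly_cost / (blended_price_per_mtok * days_per_month) * 1_000_000


# Hypothetical prices: $1 per million input tokens, $5 per million output tokens.
api_cost = monthly_api_cost(10_000_000, 2_000_000, 1.0, 5.0)

# Hypothetical self-hosted deployment at $3,000 per month against a $2 blended API price.
threshold = break_even_daily_tokens(3000, 2.0)
```

Below the threshold volume the API is cheaper in this model; above it the fixed self-hosted cost wins, provided the deployment has enough replicas to keep up with the required throughput.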