quantize

Quantize a HuggingFace model to GGUF, GPTQ, or AWQ format with bit width selection (2, 3, 4, 5, or 8). Reduces model size for deployment on Ollama, vLLM, LM Studio, or llama.cpp.

Instructions

Quantize a HuggingFace model to GGUF, GPTQ, or AWQ format.

This is a heavy operation that downloads and compresses the model. Requires appropriate backend dependencies to be installed.

Args:

  • model: HuggingFace model ID (e.g. 'meta-llama/Llama-3.1-8B-Instruct') or local path to a model directory.
  • format: Output format (gguf, gptq, or awq). Default: gguf.
  • bits: Quantization bit width (2, 3, 4, 5, or 8). Default: 4.
  • output_dir: Directory to write output files. Default: temp directory.
  • target: Deployment target. ollama/llamacpp/lmstudio force GGUF; vllm forces AWQ.

Returns: Quantization result with file paths, sizes, and compression ratios.
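
For orientation, a client-side call might look like the sketch below. It assumes an already-initialized ClientSession from the mcp Python SDK; the tool name comes from this page, but the argument values are illustrative:

    # Hypothetical client-side invocation (session setup not shown).
    result = await session.call_tool(
        "quantize",
        arguments={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "bits": 4,
            "target": "ollama",  # forces GGUF regardless of 'format'
        },
    )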

Input Schema

Name         Required   Description   Default
model        Yes
format       No                       gguf
bits         No                       4
output_dir   No
target       No

Output Schema

No fields defined.

Implementation Reference

  • MCP tool handler for 'quantize'. Decorated with @mcp.tool(), accepts model, format, bits, output_dir, and target parameters. Dispatches to quantize_model() and builds the response.
    @mcp.tool()
    def quantize(
        model: str,
        format: Literal["gguf", "gptq", "awq"] = "gguf",
        bits: Literal[2, 3, 4, 5, 8] = 4,
        output_dir: str | None = None,
        target: Literal["ollama", "vllm", "llamacpp", "lmstudio"] | None = None,
    ) -> dict[str, Any]:
        """Quantize a HuggingFace model to GGUF, GPTQ, or AWQ format.
    
        This is a heavy operation that downloads and compresses the model.
        Requires appropriate backend dependencies to be installed.
    
        Args:
            model: HuggingFace model ID (e.g. 'meta-llama/Llama-3.1-8B-Instruct')
                   or local path to a model directory.
            format: Output format — gguf, gptq, or awq. Default: gguf.
            bits: Quantization bit width — 2, 3, 4, 5, or 8. Default: 4.
            output_dir: Directory to write output files. Default: temp directory.
            target: Deployment target. ollama/llamacpp/lmstudio force GGUF, vllm forces AWQ.
    
        Returns:
            Quantization result with file paths, sizes, and compression ratios.
        """
        # Resolve target overrides
        fmt = format.lower()
        if target:
            target = target.lower()
            if target == "ollama":
                fmt = "gguf"
            elif target == "vllm":
                fmt = "awq"
            elif target in ("llamacpp", "lmstudio"):
                fmt = "gguf"
    
        if fmt not in SUPPORTED_FORMATS:
            return {
                "error": f"Unsupported format '{fmt}'. Use one of: {SUPPORTED_FORMATS}",
            }
        if bits not in SUPPORTED_BITS:
            return {
                "error": f"Unsupported bit width {bits}. Use one of: {SUPPORTED_BITS}",
            }
    
        # Get model info for the report
        model_info = get_model_info(model)
        if not model_info.get("found"):
            return {
                "error": f"Model not found: {model_info.get('error', 'unknown')}",
                "model": model,
            }
    
        # Set up output directory
        if not output_dir:
            model_slug = model.replace("/", "-").replace(".", "-")
            output_dir = os.path.join(
                tempfile.gettempdir(), "turboquant", f"{model_slug}-{fmt}-{bits}bit"
            )
        os.makedirs(output_dir, exist_ok=True)
    
        # Run quantization
        result = quantize_model(model, fmt, bits, output_dir)
    
        # Build response
        response = {
            "model": model,
            "architecture": model_info.get("arch", "unknown"),
            "parameters": model_info.get("params_human", "unknown"),
            "original_size": model_info.get("size_human", "unknown"),
            "target_bits": bits,
            "format": fmt,
            "theoretical_compression": f"{estimate_compression(16, bits):.1f}x",
        }
    
        if result["success"]:
            response["success"] = True
            response["output_file"] = result["file"]
            response["output_size"] = result.get("size_human", "unknown")
            response["output_size_bytes"] = result.get("size", 0)
    
            original_bytes = model_info.get("size_bytes", 0)
            if original_bytes and result.get("size"):
                actual = original_bytes / result["size"]
                response["actual_compression"] = f"{actual:.1f}x"
    
            if result.get("quant_type"):
                response["quant_type"] = result["quant_type"]
    
            # Generate Ollama Modelfile if target is ollama
            if target == "ollama" and fmt == "gguf":
                modelfile_path = generate_ollama_modelfile(
                    result["file"], model_info, output_dir
                )
                model_name = model.split("/")[-1].lower().replace(".", "-")
                quant_type = result.get("quant_type", "Q4_K_M")
                response["ollama"] = {
                    "modelfile": modelfile_path,
                    "import_command": f"cd {output_dir} && ollama create {model_name}-{quant_type.lower()} -f Modelfile",
                    "run_command": f"ollama run {model_name}-{quant_type.lower()}",
                }
        else:
            response["success"] = False
            response["error"] = result.get("error", "Unknown error")
            if result.get("install_cmd"):
                response["install_cmd"] = result["install_cmd"]
    
        return response
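
The handler references several module-level helpers (SUPPORTED_FORMATS, SUPPORTED_BITS, estimate_compression, format_size) that this listing omits. The sketches below are assumptions consistent with how the handler uses them, not the confirmed definitions:

    # Assumed constants and helpers; the real definitions are not shown above.
    SUPPORTED_FORMATS = ("gguf", "gptq", "awq")
    SUPPORTED_BITS = (2, 3, 4, 5, 8)

    def estimate_compression(source_bits: int, target_bits: int) -> float:
        """Theoretical size ratio from reducing weight precision."""
        return source_bits / target_bits

    def format_size(num_bytes: int) -> str:
        """Render a byte count as a human-readable string, e.g. '4.6 GB'."""
        size = float(num_bytes)
        for unit in ("B", "KB", "MB", "GB", "TB"):
            if size < 1024 or unit == "TB":
                return f"{size:.1f} {unit}"
            size /= 1024
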
  • Dispatcher function that routes to the appropriate quantization backend (GGUF, GPTQ, or AWQ) based on format.
    def quantize_model(
        model_id: str, fmt: str, bits: int, output_dir: str
    ) -> dict[str, Any]:
        """Dispatch quantization to the correct backend.
    
        Args:
            model_id: HuggingFace model ID or local path.
            fmt: One of 'gguf', 'gptq', 'awq'.
            bits: Quantization bit width (2, 3, 4, 5, or 8).
            output_dir: Directory to write output files.
    
        Returns:
            Result dict with success status and file info.
        """
        if fmt not in SUPPORTED_FORMATS:
            return {
                "success": False,
                "error": f"Unsupported format '{fmt}'. Use one of: {SUPPORTED_FORMATS}",
            }
        if bits not in SUPPORTED_BITS:
            return {
                "success": False,
                "error": f"Unsupported bit width {bits}. Use one of: {SUPPORTED_BITS}",
            }
    
        dispatch = {
            "gguf": quantize_gguf,
            "gptq": quantize_gptq,
            "awq": quantize_awq,
        }
    
        return dispatch[fmt](model_id, bits, output_dir)
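
Called directly, outside the MCP handler, the dispatcher can be exercised like this (the output path is illustrative):

    result = quantize_model(
        "meta-llama/Llama-3.1-8B-Instruct", "gguf", 4, "/tmp/llama31-gguf-4bit"
    )
    if result["success"]:
        print(result["file"], result["size_human"])
    else:
        print(result["error"])
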
  • GGUF quantization backend. Tries llama-cpp-python convert + llama-quantize binary, then falls back to convert_hf_to_gguf.py.
    def quantize_gguf(model_id: str, bits: int, output_dir: str) -> dict[str, Any]:
        """Quantize model to GGUF format using llama.cpp.
    
        Tries multiple methods in order:
        1. llama-cpp-python convert + llama-quantize binary
        2. convert_hf_to_gguf.py from llama.cpp source
        """
        quant_type = GGUF_QUANT_TYPES.get(bits, "Q4_K_M")
        output_file = os.path.join(output_dir, f"model-{quant_type}.gguf")
        os.makedirs(output_dir, exist_ok=True)
    
        # Method 1: Try llama-cpp-python convert + llama-quantize
        try:
            fp16_file = os.path.join(output_dir, "model-fp16.gguf")
            cmd_convert = [
                sys.executable,
                "-m",
                "llama_cpp.convert",
                "--outfile",
                fp16_file,
                "--outtype",
                "f16",
                model_id,
            ]
            result = subprocess.run(
                cmd_convert, capture_output=True, text=True, timeout=3600
            )
    
            if result.returncode == 0 and os.path.exists(fp16_file):
                cmd_quant = ["llama-quantize", fp16_file, output_file, quant_type]
                result = subprocess.run(
                    cmd_quant, capture_output=True, text=True, timeout=3600
                )
    
                if result.returncode == 0 and os.path.exists(output_file):
                    os.remove(fp16_file)
                    return {
                        "success": True,
                        "file": output_file,
                        "size": os.path.getsize(output_file),
                        "size_human": format_size(os.path.getsize(output_file)),
                        "format": "gguf",
                        "quant_type": quant_type,
                        "bits": bits,
                    }
        except (FileNotFoundError, subprocess.TimeoutExpired):
            pass
    
        # Method 2: Try convert_hf_to_gguf.py from llama.cpp
        try:
            convert_script = shutil.which("convert_hf_to_gguf.py")
            if not convert_script:
                for candidate in [
                    os.path.expanduser("~/llama.cpp/convert_hf_to_gguf.py"),
                    "/opt/llama.cpp/convert_hf_to_gguf.py",
                ]:
                    if os.path.exists(candidate):
                        convert_script = candidate
                        break
    
            if convert_script:
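                # NOTE (assumption, version-dependent): many builds of
                # convert_hf_to_gguf.py accept only f32/f16/bf16/q8_0/auto
                # for --outtype, so k-quant names like "q4_k_m" may be
                # rejected by this fallback path.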
                cmd = [
                    sys.executable,
                    convert_script,
                    model_id,
                    "--outfile",
                    output_file,
                    "--outtype",
                    quant_type.lower(),
                ]
                result = subprocess.run(
                    cmd, capture_output=True, text=True, timeout=3600
                )
                if result.returncode == 0 and os.path.exists(output_file):
                    return {
                        "success": True,
                        "file": output_file,
                        "size": os.path.getsize(output_file),
                        "size_human": format_size(os.path.getsize(output_file)),
                        "format": "gguf",
                        "quant_type": quant_type,
                        "bits": bits,
                    }
        except Exception:
            pass
    
        return {
            "success": False,
            "format": "gguf",
            "bits": bits,
            "error": (
                "GGUF quantization requires llama.cpp tools. "
                "Install: pip install llama-cpp-python, or build llama.cpp from source."
            ),
            "install_cmd": "pip install llama-cpp-python",
        }
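
GGUF_QUANT_TYPES is referenced above but not shown. The mapping below is an assumption consistent with common llama.cpp k-quant names, not the confirmed constant:

    # Assumed bit-width -> llama.cpp quant-type mapping (not shown above).
    GGUF_QUANT_TYPES = {
        2: "Q2_K",
        3: "Q3_K_M",
        4: "Q4_K_M",  # matches the .get() fallback in quantize_gguf
        5: "Q5_K_M",
        8: "Q8_0",
    }

The resulting .gguf file can be loaded with llama-cpp-python's Llama(model_path=...) or imported into Ollama via the generated Modelfile.
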
  • GPTQ quantization backend using auto-gptq library with c4 calibration data.
    def quantize_gptq(model_id: str, bits: int, output_dir: str) -> dict[str, Any]:
        """Quantize model using GPTQ via auto-gptq.
    
        Requires: torch, transformers, auto-gptq, datasets
        Uses c4 calibration data (128 samples, 2048 max length).
        """
        os.makedirs(output_dir, exist_ok=True)
    
        try:
            from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
            from transformers import AutoTokenizer
        except ImportError:
            return {
                "success": False,
                "format": "gptq",
                "bits": bits,
                "error": "GPTQ requires: pip install auto-gptq transformers datasets torch",
                "install_cmd": "pip install auto-gptq transformers datasets torch",
            }
    
        try:
            tokenizer = AutoTokenizer.from_pretrained(model_id)
    
            quantize_config = BaseQuantizeConfig(
                bits=bits,
                group_size=128,
                damp_percent=0.1,
                desc_act=False,
            )
    
            model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
    
            # Prepare calibration data from c4
            from datasets import load_dataset
    
            dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
            calibration_data = []
            for i, example in enumerate(dataset):
                if i >= 128:
                    break
                tokenized = tokenizer(
                    example["text"],
                    return_tensors="pt",
                    truncation=True,
                    max_length=2048,
                )
                calibration_data.append(tokenized.input_ids)
    
            model.quantize(calibration_data)
    
            output_path = os.path.join(output_dir, f"model-gptq-{bits}bit")
            model.save_quantized(output_path)
            tokenizer.save_pretrained(output_path)
    
            total_size = sum(
                os.path.getsize(os.path.join(output_path, f))
                for f in os.listdir(output_path)
                if f.endswith((".safetensors", ".bin"))
            )
    
            return {
                "success": True,
                "file": output_path,
                "size": total_size,
                "size_human": format_size(total_size),
                "format": "gptq",
                "bits": bits,
                "group_size": 128,
            }
    
        except Exception as e:
            return {
                "success": False,
                "format": "gptq",
                "bits": bits,
                "error": str(e),
            }
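
The saved directory can typically be reloaded for inference with auto-gptq's from_quantized. A minimal sketch; the path is illustrative and device placement depends on your hardware:

    from auto_gptq import AutoGPTQForCausalLM
    from transformers import AutoTokenizer

    path = "/tmp/llama31-gptq-4bit"  # illustrative output_path
    model = AutoGPTQForCausalLM.from_quantized(path, device="cuda:0")
    tokenizer = AutoTokenizer.from_pretrained(path)
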
  • AWQ quantization backend using autoawq library with GEMM kernel.
    def quantize_awq(model_id: str, bits: int, output_dir: str) -> dict[str, Any]:
        """Quantize model using AWQ via autoawq.
    
        Requires: torch, transformers, autoawq
        Uses GEMM kernel with group_size=128.
        """
        os.makedirs(output_dir, exist_ok=True)
    
        try:
            from awq import AutoAWQForCausalLM
            from transformers import AutoTokenizer
        except ImportError:
            return {
                "success": False,
                "format": "awq",
                "bits": bits,
                "error": "AWQ requires: pip install autoawq transformers torch",
                "install_cmd": "pip install autoawq transformers torch",
            }
    
        try:
            model = AutoAWQForCausalLM.from_pretrained(model_id)
            tokenizer = AutoTokenizer.from_pretrained(model_id)
    
            quant_config = {
                "zero_point": True,
                "q_group_size": 128,
                "w_bit": bits,
                "version": "GEMM",
            }
    
            model.quantize(tokenizer, quant_config=quant_config)
    
            output_path = os.path.join(output_dir, f"model-awq-{bits}bit")
            model.save_quantized(output_path)
            tokenizer.save_pretrained(output_path)
    
            total_size = sum(
                os.path.getsize(os.path.join(output_path, f))
                for f in os.listdir(output_path)
                if f.endswith((".safetensors", ".bin"))
            )
    
            return {
                "success": True,
                "file": output_path,
                "size": total_size,
                "size_human": format_size(total_size),
                "format": "awq",
                "bits": bits,
                "group_size": 128,
            }
    
        except Exception as e:
            return {
                "success": False,
                "format": "awq",
                "bits": bits,
                "error": str(e),
            }
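
Because the vllm target forces AWQ, the natural consumer of this output is vLLM. A minimal serving sketch, assuming vLLM is installed and pointing at an illustrative output path:

    from vllm import LLM, SamplingParams

    # Point 'model' at the save_quantized() output directory.
    llm = LLM(model="/tmp/llama31-awq-4bit", quantization="awq")
    outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)
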
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description must convey behavioral traits. It states the operation is heavy (downloads and compresses), but does not mention potential destruction (e.g., overwriting files in output_dir), idempotency, or authorization needs. The behavior is adequately described for basic use but lacks depth.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-organized with separate sections for task description and parameters. It uses bullet-like 'Args' and 'Returns' formatting, making it scannable. No unnecessary sentences; each adds value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given that the return value is explained and the tool takes 5 parameters (1 required), the description covers all essentials. However, it could be more complete by mentioning prerequisites (e.g., installed backends) or potential errors. Still, it provides sufficient context for an agent to use the tool correctly.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate entirely. It adds meaning to each parameter: model (HF ID or local path), format (enum with default), bits (enum with default), output_dir (default temp), and target (deployment targets with format constraints). The note about target forcing format is extra context beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states it quantizes HuggingFace models to specific formats (GGUF, GPTQ, AWQ). The verb 'quantize' is specific and the resource is well-defined. Sibling tools (check, evaluate, info, push, recommend) do not overlap, making this tool's purpose distinct.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description notes that this is a heavy operation requiring dependencies, setting expectations. It provides default values and examples, but does not explicitly state when to use this tool over alternatives or when not to use it. The absence of sibling overlap makes this less critical.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
