quantize

Quantize a HuggingFace model to GGUF, GPTQ, or AWQ format with bit width selection (2, 3, 4, 5, or 8). Reduces model size for deployment on Ollama, vLLM, LM Studio, or llama.cpp.

Instructions

Quantize a HuggingFace model to GGUF, GPTQ, or AWQ format.

This is a heavy operation that downloads and compresses the model. Requires appropriate backend dependencies to be installed.

Args:

  • model: HuggingFace model ID (e.g. 'meta-llama/Llama-3.1-8B-Instruct') or local path to a model directory.
  • format: Output format (gguf, gptq, or awq). Default: gguf.
  • bits: Quantization bit width (2, 3, 4, 5, or 8). Default: 4.
  • output_dir: Directory to write output files. Default: temp directory.
  • target: Deployment target. ollama/llamacpp/lmstudio force GGUF; vllm forces AWQ.

Returns: Quantization result with file paths, sizes, and compression ratios.
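
For orientation, a client-side call might look like the sketch below. It assumes an already-initialized ClientSession from the mcp Python SDK; the tool name comes from this page, but the argument values are illustrative:

    # Hypothetical client-side invocation (session setup not shown).
    result = await session.call_tool(
        "quantize",
        arguments={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "bits": 4,
            "target": "ollama",  # forces GGUF regardless of 'format'
        },
    )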

Input Schema

Name         Required   Description   Default
model        Yes
format       No                       gguf
bits         No                       4
output_dir   No
target       No

Output Schema

No fields defined.

Implementation Reference

  • MCP tool handler for 'quantize'. Decorated with @mcp.tool(), accepts model, format, bits, output_dir, and target parameters. Dispatches to quantize_model() and builds the response.
    @mcp.tool()
    def quantize(
        model: str,
        format: Literal["gguf", "gptq", "awq"] = "gguf",
        bits: Literal[2, 3, 4, 5, 8] = 4,
        output_dir: str | None = None,
        target: Literal["ollama", "vllm", "llamacpp", "lmstudio"] | None = None,
    ) -> dict[str, Any]:
        """Quantize a HuggingFace model to GGUF, GPTQ, or AWQ format.
    
        This is a heavy operation that downloads and compresses the model.
        Requires appropriate backend dependencies to be installed.
    
        Args:
            model: HuggingFace model ID (e.g. 'meta-llama/Llama-3.1-8B-Instruct')
                   or local path to a model directory.
            format: Output format — gguf, gptq, or awq. Default: gguf.
            bits: Quantization bit width — 2, 3, 4, 5, or 8. Default: 4.
            output_dir: Directory to write output files. Default: temp directory.
            target: Deployment target. ollama/llamacpp/lmstudio force GGUF, vllm forces AWQ.
    
        Returns:
            Quantization result with file paths, sizes, and compression ratios.
        """
        # Resolve target overrides
        fmt = format.lower()
        if target:
            target = target.lower()
            if target == "ollama":
                fmt = "gguf"
            elif target == "vllm":
                fmt = "awq"
            elif target in ("llamacpp", "lmstudio"):
                fmt = "gguf"
    
        if fmt not in SUPPORTED_FORMATS:
            return {
                "error": f"Unsupported format '{fmt}'. Use one of: {SUPPORTED_FORMATS}",
            }
        if bits not in SUPPORTED_BITS:
            return {
                "error": f"Unsupported bit width {bits}. Use one of: {SUPPORTED_BITS}",
            }
    
        # Get model info for the report
        model_info = get_model_info(model)
        if not model_info.get("found"):
            return {
                "error": f"Model not found: {model_info.get('error', 'unknown')}",
                "model": model,
            }
    
        # Set up output directory
        if not output_dir:
            model_slug = model.replace("/", "-").replace(".", "-")
            output_dir = os.path.join(
                tempfile.gettempdir(), "turboquant", f"{model_slug}-{fmt}-{bits}bit"
            )
        os.makedirs(output_dir, exist_ok=True)
    
        # Run quantization
        result = quantize_model(model, fmt, bits, output_dir)
    
        # Build response
        response = {
            "model": model,
            "architecture": model_info.get("arch", "unknown"),
            "parameters": model_info.get("params_human", "unknown"),
            "original_size": model_info.get("size_human", "unknown"),
            "target_bits": bits,
            "format": fmt,
            "theoretical_compression": f"{estimate_compression(16, bits):.1f}x",
        }
    
        if result["success"]:
            response["success"] = True
            response["output_file"] = result["file"]
            response["output_size"] = result.get("size_human", "unknown")
            response["output_size_bytes"] = result.get("size", 0)
    
            original_bytes = model_info.get("size_bytes", 0)
            if original_bytes and result.get("size"):
                actual = original_bytes / result["size"]
                response["actual_compression"] = f"{actual:.1f}x"
    
            if result.get("quant_type"):
                response["quant_type"] = result["quant_type"]
    
            # Generate Ollama Modelfile if target is ollama
            if target == "ollama" and fmt == "gguf":
                modelfile_path = generate_ollama_modelfile(
                    result["file"], model_info, output_dir
                )
                model_name = model.split("/")[-1].lower().replace(".", "-")
                quant_type = result.get("quant_type", "Q4_K_M")
                response["ollama"] = {
                    "modelfile": modelfile_path,
                    "import_command": f"cd {output_dir} && ollama create {model_name}-{quant_type.lower()} -f Modelfile",
                    "run_command": f"ollama run {model_name}-{quant_type.lower()}",
                }
        else:
            response["success"] = False
            response["error"] = result.get("error", "Unknown error")
            if result.get("install_cmd"):
                response["install_cmd"] = result["install_cmd"]
    
        return response
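
The handler references several module-level helpers (SUPPORTED_FORMATS, SUPPORTED_BITS, estimate_compression, format_size) that this listing omits. The sketches below are assumptions consistent with how the handler uses them, not the confirmed definitions:

    # Assumed constants and helpers; the real definitions are not shown above.
    SUPPORTED_FORMATS = ("gguf", "gptq", "awq")
    SUPPORTED_BITS = (2, 3, 4, 5, 8)

    def estimate_compression(source_bits: int, target_bits: int) -> float:
        """Theoretical size ratio from reducing weight precision."""
        return source_bits / target_bits

    def format_size(num_bytes: int) -> str:
        """Render a byte count as a human-readable string, e.g. '4.6 GB'."""
        size = float(num_bytes)
        for unit in ("B", "KB", "MB", "GB", "TB"):
            if size < 1024 or unit == "TB":
                return f"{size:.1f} {unit}"
            size /= 1024
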
  • Dispatcher function that routes to the appropriate quantization backend (GGUF, GPTQ, or AWQ) based on format.
    def quantize_model(
        model_id: str, fmt: str, bits: int, output_dir: str
    ) -> dict[str, Any]:
        """Dispatch quantization to the correct backend.
    
        Args:
            model_id: HuggingFace model ID or local path.
            fmt: One of 'gguf', 'gptq', 'awq'.
            bits: Quantization bit width (2, 3, 4, 5, or 8).
            output_dir: Directory to write output files.
    
        Returns:
            Result dict with success status and file info.
        """
        if fmt not in SUPPORTED_FORMATS:
            return {
                "success": False,
                "error": f"Unsupported format '{fmt}'. Use one of: {SUPPORTED_FORMATS}",
            }
        if bits not in SUPPORTED_BITS:
            return {
                "success": False,
                "error": f"Unsupported bit width {bits}. Use one of: {SUPPORTED_BITS}",
            }
    
        dispatch = {
            "gguf": quantize_gguf,
            "gptq": quantize_gptq,
            "awq": quantize_awq,
        }
    
        return dispatch[fmt](model_id, bits, output_dir)
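
Called directly, outside the MCP handler, the dispatcher can be exercised like this (the output path is illustrative):

    result = quantize_model(
        "meta-llama/Llama-3.1-8B-Instruct", "gguf", 4, "/tmp/llama31-gguf-4bit"
    )
    if result["success"]:
        print(result["file"], result["size_human"])
    else:
        print(result["error"])
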
  • GGUF quantization backend. Tries llama-cpp-python convert + llama-quantize binary, then falls back to convert_hf_to_gguf.py.
    def quantize_gguf(model_id: str, bits: int, output_dir: str) -> dict[str, Any]:
        """Quantize model to GGUF format using llama.cpp.
    
        Tries multiple methods in order:
        1. llama-cpp-python convert + llama-quantize binary
        2. convert_hf_to_gguf.py from llama.cpp source
        """
        quant_type = GGUF_QUANT_TYPES.get(bits, "Q4_K_M")
        output_file = os.path.join(output_dir, f"model-{quant_type}.gguf")
        os.makedirs(output_dir, exist_ok=True)
    
        # Method 1: Try llama-cpp-python convert + llama-quantize
        try:
            fp16_file = os.path.join(output_dir, "model-fp16.gguf")
            cmd_convert = [
                sys.executable,
                "-m",
                "llama_cpp.convert",
                "--outfile",
                fp16_file,
                "--outtype",
                "f16",
                model_id,
            ]
            result = subprocess.run(
                cmd_convert, capture_output=True, text=True, timeout=3600
            )
    
            if result.returncode == 0 and os.path.exists(fp16_file):
                cmd_quant = ["llama-quantize", fp16_file, output_file, quant_type]
                result = subprocess.run(
                    cmd_quant, capture_output=True, text=True, timeout=3600
                )
    
                if result.returncode == 0 and os.path.exists(output_file):
                    os.remove(fp16_file)
                    return {
                        "success": True,
                        "file": output_file,
                        "size": os.path.getsize(output_file),
                        "size_human": format_size(os.path.getsize(output_file)),
                        "format": "gguf",
                        "quant_type": quant_type,
                        "bits": bits,
                    }
        except (FileNotFoundError, subprocess.TimeoutExpired):
            pass
    
        # Method 2: Try convert_hf_to_gguf.py from llama.cpp
        try:
            convert_script = shutil.which("convert_hf_to_gguf.py")
            if not convert_script:
                for candidate in [
                    os.path.expanduser("~/llama.cpp/convert_hf_to_gguf.py"),
                    "/opt/llama.cpp/convert_hf_to_gguf.py",
                ]:
                    if os.path.exists(candidate):
                        convert_script = candidate
                        break
    
            if convert_script:
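                # NOTE (assumption, version-dependent): many builds of
                # convert_hf_to_gguf.py accept only f32/f16/bf16/q8_0/auto
                # for --outtype, so k-quant names like "q4_k_m" may be
                # rejected by this fallback path.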
                cmd = [
                    sys.executable,
                    convert_script,
                    model_id,
                    "--outfile",
                    output_file,
                    "--outtype",
                    quant_type.lower(),
                ]
                result = subprocess.run(
                    cmd, capture_output=True, text=True, timeout=3600
                )
                if result.returncode == 0 and os.path.exists(output_file):
                    return {
                        "success": True,
                        "file": output_file,
                        "size": os.path.getsize(output_file),
                        "size_human": format_size(os.path.getsize(output_file)),
                        "format": "gguf",
                        "quant_type": quant_type,
                        "bits": bits,
                    }
        except Exception:
            pass
    
        return {
            "success": False,
            "format": "gguf",
            "bits": bits,
            "error": (
                "GGUF quantization requires llama.cpp tools. "
                "Install: pip install llama-cpp-python, or build llama.cpp from source."
            ),
            "install_cmd": "pip install llama-cpp-python",
        }
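
GGUF_QUANT_TYPES is referenced above but not shown. The mapping below is an assumption consistent with common llama.cpp k-quant names, not the confirmed constant:

    # Assumed bit-width -> llama.cpp quant-type mapping (not shown above).
    GGUF_QUANT_TYPES = {
        2: "Q2_K",
        3: "Q3_K_M",
        4: "Q4_K_M",  # matches the .get() fallback in quantize_gguf
        5: "Q5_K_M",
        8: "Q8_0",
    }

The resulting .gguf file can be loaded with llama-cpp-python's Llama(model_path=...) or imported into Ollama via the generated Modelfile.
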
  • GPTQ quantization backend using auto-gptq library with c4 calibration data.
    def quantize_gptq(model_id: str, bits: int, output_dir: str) -> dict[str, Any]:
        """Quantize model using GPTQ via auto-gptq.
    
        Requires: torch, transformers, auto-gptq, datasets
        Uses c4 calibration data (128 samples, 2048 max length).
        """
        os.makedirs(output_dir, exist_ok=True)
    
        try:
            from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
            from transformers import AutoTokenizer
        except ImportError:
            return {
                "success": False,
                "format": "gptq",
                "bits": bits,
                "error": "GPTQ requires: pip install auto-gptq transformers datasets torch",
                "install_cmd": "pip install auto-gptq transformers datasets torch",
            }
    
        try:
            tokenizer = AutoTokenizer.from_pretrained(model_id)
    
            quantize_config = BaseQuantizeConfig(
                bits=bits,
                group_size=128,
                damp_percent=0.1,
                desc_act=False,
            )
    
            model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
    
            # Prepare calibration data from c4
            from datasets import load_dataset
    
            dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
            calibration_data = []
            for i, example in enumerate(dataset):
                if i >= 128:
                    break
                tokenized = tokenizer(
                    example["text"],
                    return_tensors="pt",
                    truncation=True,
                    max_length=2048,
                )
                calibration_data.append(tokenized.input_ids)
    
            model.quantize(calibration_data)
    
            output_path = os.path.join(output_dir, f"model-gptq-{bits}bit")
            model.save_quantized(output_path)
            tokenizer.save_pretrained(output_path)
    
            total_size = sum(
                os.path.getsize(os.path.join(output_path, f))
                for f in os.listdir(output_path)
                if f.endswith((".safetensors", ".bin"))
            )
    
            return {
                "success": True,
                "file": output_path,
                "size": total_size,
                "size_human": format_size(total_size),
                "format": "gptq",
                "bits": bits,
                "group_size": 128,
            }
    
        except Exception as e:
            return {
                "success": False,
                "format": "gptq",
                "bits": bits,
                "error": str(e),
            }
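
The saved directory can typically be reloaded for inference with auto-gptq's from_quantized. A minimal sketch; the path is illustrative and device placement depends on your hardware:

    from auto_gptq import AutoGPTQForCausalLM
    from transformers import AutoTokenizer

    path = "/tmp/llama31-gptq-4bit"  # illustrative output_path
    model = AutoGPTQForCausalLM.from_quantized(path, device="cuda:0")
    tokenizer = AutoTokenizer.from_pretrained(path)
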
  • AWQ quantization backend using autoawq library with GEMM kernel.
    def quantize_awq(model_id: str, bits: int, output_dir: str) -> dict[str, Any]:
        """Quantize model using AWQ via autoawq.
    
        Requires: torch, transformers, autoawq
        Uses GEMM kernel with group_size=128.
        """
        os.makedirs(output_dir, exist_ok=True)
    
        try:
            from awq import AutoAWQForCausalLM
            from transformers import AutoTokenizer
        except ImportError:
            return {
                "success": False,
                "format": "awq",
                "bits": bits,
                "error": "AWQ requires: pip install autoawq transformers torch",
                "install_cmd": "pip install autoawq transformers torch",
            }
    
        try:
            model = AutoAWQForCausalLM.from_pretrained(model_id)
            tokenizer = AutoTokenizer.from_pretrained(model_id)
    
            quant_config = {
                "zero_point": True,
                "q_group_size": 128,
                "w_bit": bits,
                "version": "GEMM",
            }
    
            model.quantize(tokenizer, quant_config=quant_config)
    
            output_path = os.path.join(output_dir, f"model-awq-{bits}bit")
            model.save_quantized(output_path)
            tokenizer.save_pretrained(output_path)
    
            total_size = sum(
                os.path.getsize(os.path.join(output_path, f))
                for f in os.listdir(output_path)
                if f.endswith((".safetensors", ".bin"))
            )
    
            return {
                "success": True,
                "file": output_path,
                "size": total_size,
                "size_human": format_size(total_size),
                "format": "awq",
                "bits": bits,
                "group_size": 128,
            }
    
        except Exception as e:
            return {
                "success": False,
                "format": "awq",
                "bits": bits,
                "error": str(e),
            }
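
Because the vllm target forces AWQ, the natural consumer of this output is vLLM. A minimal serving sketch, assuming vLLM is installed and pointing at an illustrative output path:

    from vllm import LLM, SamplingParams

    # Point 'model' at the save_quantized() output directory.
    llm = LLM(model="/tmp/llama31-awq-4bit", quantization="awq")
    outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)
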
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description must convey behavioral traits. It states the operation is heavy (downloads and compresses), but does not mention potential destruction (e.g., overwriting files in output_dir), idempotency, or authorization needs. The behavior is adequately described for basic use but lacks depth.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-organized with separate sections for task description and parameters. It uses bullet-like 'Args' and 'Returns' formatting, making it scannable. No unnecessary sentences; each adds value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given that the return value is explained and the tool takes 5 parameters (1 required), the description covers all essentials. However, it could be more complete by mentioning prerequisites (e.g., installed backends) or potential errors. Still, it provides sufficient context for an agent to use the tool correctly.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate entirely. It adds meaning to each parameter: model (HF ID or local path), format (enum with default), bits (enum with default), output_dir (default temp), and target (deployment targets with format constraints). The note about target forcing format is extra context beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states it quantizes HuggingFace models to specific formats (GGUF, GPTQ, AWQ). The verb 'quantize' is specific and the resource is well-defined. Sibling tools (check, evaluate, info, push, recommend) do not overlap, making this tool's purpose distinct.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description notes that this is a heavy operation requiring dependencies, setting expectations. It provides default values and examples, but does not explicitly state when to use this tool over alternatives or when not to use it. The absence of sibling overlap makes this less critical.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
