## Server Configuration

Describes the environment variables required to run the server.
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |
## Capabilities

Features and capabilities supported by this server.
| Capability | Details |
|---|---|
| tools | `{"listChanged": false}` |
| prompts | `{"listChanged": false}` |
| resources | `{"subscribe": false, "listChanged": false}` |
| experimental | `{}` |
## Tools

Functions exposed to the LLM to take actions.
| Name | Description |
|---|---|
| infoA | Get model info from HuggingFace: parameters, size, architecture. Lightweight call using the HuggingFace API; no GPU or heavy dependencies required.<br>**Args:** `model`: HuggingFace model ID (e.g. `meta-llama/Llama-3.1-8B-Instruct`) or local path to a model directory.<br>**Returns:** Model metadata including architecture, parameter count, size, hidden dimensions, number of layers, vocabulary size, and context length. |
| checkA | Check available quantization backends on this system. Reports which quantization engines (GGUF/GPTQ/AWQ) are installed, whether PyTorch and transformers are available, GPU information (CUDA or Apple MPS), and system RAM. Lightweight system check; no arguments required.<br>**Returns:** Dictionary of available backends and hardware info. |
| recommendA | Recommend the best quantization format and bit width for a model. Analyzes the model size and your hardware (GPU VRAM, Apple Silicon, system RAM) to suggest the optimal format (GGUF/GPTQ/AWQ) and bit width (2-8), returning ranked recommendations with use-case explanations.<br>**Args:** `model`: HuggingFace model ID (e.g. `meta-llama/Llama-3.1-8B-Instruct`) or local path to a model directory.<br>**Returns:** Ranked recommendations with format, bits, reasoning, and use cases. |
| quantizeA | Quantize a HuggingFace model to GGUF, GPTQ, or AWQ format. This is a heavy operation that downloads and compresses the model; it requires the appropriate backend dependencies to be installed.<br>**Args:** `model`: HuggingFace model ID (e.g. `meta-llama/Llama-3.1-8B-Instruct`) or local path to a model directory. `format`: Output format: `gguf`, `gptq`, or `awq` (default: `gguf`). `bits`: Quantization bit width: 2, 3, 4, 5, or 8 (default: 4). `output_dir`: Directory to write output files (default: temp directory). `target`: Deployment target; `ollama`/`llamacpp`/`lmstudio` force GGUF, `vllm` forces AWQ.<br>**Returns:** Quantization result with file paths, sizes, and compression ratios. |
| evaluateA | Run perplexity evaluation on a quantized model. Measures model quality after quantization using perplexity scoring (lower perplexity means better quality) and includes a quality assessment (EXCELLENT/GOOD/FAIR/DEGRADED/POOR).<br>**Args:** `model_path`: Path to the quantized model file (GGUF) or directory (GPTQ/AWQ). `format`: Format of the quantized model; one of `gguf`, `gptq`, `awq`. `bits`: Bit width used during quantization (for quality context).<br>**Returns:** Perplexity score, quality assessment, and evaluation metadata. |
| pushA | Push a quantized model to HuggingFace Hub. Uploads all model files from the output directory to a HuggingFace repository and generates a model card (README.md) with metadata. Requires HuggingFace authentication (`huggingface-cli login` or `HF_TOKEN`).<br>**Args:** `repo_id`: HuggingFace repository ID (e.g. `username/model-GGUF-4bit`). `model_dir`: Local directory containing the quantized model files. `model`: Original model ID for the model card (optional). `bits`: Bit width used during quantization (for model card metadata).<br>**Returns:** Upload result with repository URL and file count. |
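As a sketch of how a client might invoke one of the tools above: MCP clients call tools through the JSON-RPC 2.0 `tools/call` method, passing the tool name and its arguments. The request shape below follows the MCP specification; the `id` value and the example model are illustrative, not fixed by this server.

```python
import json

# Hypothetical tools/call request for the `recommendA` tool. The `name` and
# `arguments` keys match the MCP tools/call params shape; the model ID is
# just an example value.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "recommendA",
        "arguments": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
        },
    },
}

# Serialize for sending over an MCP transport (stdio or HTTP).
payload = json.dumps(request)
print(payload)
```

The server's response arrives as a JSON-RPC result whose `content` carries the tool output, here the ranked format and bit-width recommendations.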
## Prompts

Interactive templates invoked by user choice.
| Name | Description |
|---|---|
| No prompts | |
## Resources

Contextual data attached and managed by the client.
| Name | Description |
|---|---|
| No resources | |