ZeroFuse
Point it at a model. It conducts trials, identifies the refusal architecture, and abliterates it.
Automated, capability-preserving refusal removal for open-weight transformer LLMs: a direct weight edit that produces a standard Hugging Face model with zero inference-time overhead.
ZeroFuse turns guardrail removal into a one-command, fully automated optimization problem, with no hand-picking layers, no guessing strengths, and no retraining. It estimates the model's refusal direction, orthogonalizes it out of the residual-writing weights, and uses a two-objective search to preserve capability. The output is a standard Hugging Face checkpoint you can load, quantize, or serve like any other.
Why ZeroFuse
Most abliteration workflows are manual: you pick a layer, eyeball a strength coefficient, run the model, check whether it still refuses, and repeat, often degrading the model's general capabilities along the way. ZeroFuse replaces that loop with a principled, automated search.
Fully automatic: no hand-picking layers, directions, or strengths. You point it at a model and it conducts the trials.
Capability-preserving by design: KL divergence from the original model is an explicit optimization objective, co-minimized alongside refusals, not an afterthought.
A real weight edit, not a runtime adapter: orthogonalizes the refusal direction directly out of attention o_proj and MLP down_proj (W' = W − strength · r(rᵀW)). The saved model has zero inference-time overhead: no LoRA to load, no runtime hooks, no wrapper.
Pareto-front control: a two-objective Optuna TPE search hands you the full trade-off curve. Pick the point you want: fewest refusals, lowest KL, or the knee.
Grounded in published research: difference-of-means refusal direction (Arditi et al. 2024) with optional projected refinement (grimjim 2025) to reduce collateral damage.
Broad model support: dense models, MoE (including per-expert down_proj), and many multimodal nestings.
Resumable: Optuna studies are journaled to disk; re-run the same command to continue where you left off.
Agent-native: ships an MCP server for Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client. Quiet by default, verbose on request.
One command, or one import: zerofuse --model <hf-id-or-path>, or from zerofuse import abliterate.
One Command
# Clone and install (editable)
git clone https://github.com/junainfinity/ZeroFuse.git
cd ZeroFuse
pip install -e . # core
pip install -e ".[mcp]" # + the agent/MCP server (optional)
# Point it at any Hugging Face model id or local path
zerofuse --model meta-llama/Llama-3.1-8B-Instruct

That's the whole loop. ZeroFuse:
Loads the target model and captures residual-stream activations.
Identifies the refusal direction via difference-of-means on harmful vs. harmless prompts.
Conducts trials: a two-objective Optuna search over layers and strengths, co-minimizing refusals and KL divergence.
Abliterates by orthogonalizing the chosen direction out of the weights.
Writes a standard Hugging Face model directory you can load with from_pretrained; no special runtime required.
# Resume an interrupted run: same command, picks up the journaled study
zerofuse --model meta-llama/Llama-3.1-8B-Instruct
# Quiet (only high-level phases), or fully verbose
zerofuse --model meta-llama/Llama-3.1-8B-Instruct --quiet
zerofuse --model meta-llama/Llama-3.1-8B-Instruct --verbose

ZeroFuse needs enough memory to load and run forward passes on the target model. Plan capacity for the model you point it at.
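As a rough aid for capacity planning, the sketch below estimates the memory needed just to hold a model's weights. The function name and the 8-billion-parameter figure are illustrative assumptions and not part of ZeroFuse; activations and other working buffers come on top of this number.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB.

    bytes_per_param: 2 for float16/bfloat16, 4 for float32.
    """
    return n_params * bytes_per_param / 2**30


# Illustrative figure: a model with roughly 8 billion parameters.
print(f"{weight_memory_gib(8e9):.1f} GiB in 16-bit")     # 14.9 GiB in 16-bit
print(f"{weight_memory_gib(8e9, 4):.1f} GiB in 32-bit")  # 29.8 GiB in 32-bit
```

Treat the result as a lower bound when choosing hardware.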
How It Works
ZeroFuse implements the published "refusal direction" line of research as a clean-room MIT build, wrapped in an automated optimizer.
1. Estimate the refusal direction
It captures residual-stream activations on a set of harmful and harmless prompts and takes the difference of means. The unit refusal direction is:
$$ r \;=\; \frac{\mu_{\text{harmful}} - \mu_{\text{harmless}}}{\lVert \mu_{\text{harmful}} - \mu_{\text{harmless}} \rVert} $$
where $\mu_{\text{harmful}}$ and $\mu_{\text{harmless}}$ are the mean residual-stream activations over harmful and harmless prompts respectively (Arditi et al., 2024). An optional projected refinement step (grimjim, 2025) sharpens the estimate to reduce collateral damage.
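The difference-of-means formula above can be illustrated on toy data. This is a minimal sketch using plain Python lists with made-up three-dimensional "activations"; the function names are hypothetical and do not reflect the contents of directions.py.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful, harmless):
    """Unit vector along mean(harmful) - mean(harmless)."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two means coincide; no direction to estimate")
    return [d / norm for d in diff]


# Toy activations (made-up numbers, one row per prompt).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)
print(r)  # [1.0, 0.0, 0.0]
```

Here the two groups differ only in the first coordinate, so the normalized difference of means points along that axis.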
2. Orthogonalize it out of the weights
Rather than subtract the direction at runtime, ZeroFuse edits the weights that write into the residual stream so they can no longer contribute along $r$:
$$ W' \;=\; W \;-\; \text{strength} \cdot r\,(r^{\top} W) $$
This is applied to the attention output projection (o_proj) and the MLP down-projection (down_proj), including MoE experts. The scalar strength controls how much of the $r$-component is removed: at strength = 1 this is a full orthogonal projection that removes the component entirely; smaller values remove it partially. Because the edit lives in the weights, the resulting model is indistinguishable in shape and speed from the original.
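To make the effect of strength concrete: because $r$ has unit length, $r^{\top} W' = (1 - \text{strength})\, r^{\top} W$. The sketch below checks this on a small matrix using plain Python lists. It assumes W maps inputs to the residual stream as y = Wx, with shape (d_model, d_in), and the helper names are hypothetical rather than taken from model.py.

```python
def along(W, r):
    """r^T W: how strongly each input column of W writes along r."""
    return [sum(r[i] * W[i][j] for i in range(len(r))) for j in range(len(W[0]))]


def orthogonalize(W, r, strength=1.0):
    """Return W - strength * r (r^T W) for a unit vector r.

    W is a list of rows with shape (d_model, d_in); r has length d_model.
    """
    rTW = along(W, r)
    return [
        [W[i][j] - strength * r[i] * rTW[j] for j in range(len(W[0]))]
        for i in range(len(r))
    ]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [0.0, 1.0, 0.0]  # toy unit direction

print(along(W, r))                          # [3.0, 4.0]
print(along(orthogonalize(W, r, 1.0), r))   # [0.0, 0.0]
print(along(orthogonalize(W, r, 0.5), r))   # [1.5, 2.0]
```

At strength 1 the edited matrix writes nothing along r; at 0.5 it writes half as much, and the rows orthogonal to r are left unchanged.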
3. Search the Pareto front
Choosing layers and strengths by hand is the hard part, so ZeroFuse doesn't. It runs an Optuna TPE multi-objective search that co-minimizes two objectives:
$$ \min \;\big(\; N_{\text{refusals}}, \;\; D_{\mathrm{KL}}(P_{\text{orig}} \,\Vert\, P_{\text{edited}}) \;\big) $$
$N_{\text{refusals}}$: how often the edited model still refuses, scored by the evaluator.
$D_{\mathrm{KL}}$: how far the edited model's output distribution has drifted from the original, as a proxy for lost capability.
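For readers unfamiliar with the second objective, here is a minimal sketch of $D_{\mathrm{KL}}(P_{\text{orig}} \,\Vert\, P_{\text{edited}})$ for a single next-token distribution, computed from toy logits. How ZeroFuse aggregates KL across tokens and prompts is not specified here, so this shows only the per-distribution quantity; the function names and numbers are illustrative.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions on the same support.

    Assumes q is nonzero wherever p is nonzero (true for softmax outputs).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


p_orig = softmax([2.0, 1.0, 0.0])
p_same = softmax([2.0, 1.0, 0.0])   # an edit that changed nothing
p_edit = softmax([1.0, 1.0, 0.0])   # an edit that shifted the first logit

print(kl_divergence(p_orig, p_same))  # 0.0
print(kl_divergence(p_orig, p_edit))  # a positive number
```

A value of zero means the edited distribution matches the original exactly; larger values mean more drift.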
The result is a Pareto front of non-dominated configurations. You choose the operating point that fits your goal (fewest refusals, lowest KL, or the knee of the curve) and ZeroFuse materializes that exact weight edit.
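The selection step can be sketched in a few lines. The dominance filter below follows the standard definition for two minimized objectives. The knee rule (nearest point to the origin after min-max scaling) is one common heuristic chosen for illustration, not necessarily the rule ZeroFuse uses, and the trial values are made up.

```python
def pareto_front(trials):
    """Keep the (refusals, kl) pairs that no other trial dominates.

    A trial dominates another if it is no worse on both objectives
    and differs on at least one (both objectives are minimized).
    """
    def dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1] and a != b

    return sorted(t for t in trials if not any(dominates(o, t) for o in trials))


def knee(front):
    """Illustrative heuristic: after min-max scaling, pick the point nearest (0, 0)."""
    def scale(vals):
        lo, hi = min(vals), max(vals)
        return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in vals]

    xs = scale([p[0] for p in front])
    ys = scale([p[1] for p in front])
    best = min(range(len(front)), key=lambda i: xs[i] ** 2 + ys[i] ** 2)
    return front[best]


# Made-up (refusals, KL) results from six trials.
trials = [(40, 0.01), (12, 0.05), (10, 0.30), (3, 0.60), (12, 0.20), (45, 0.02)]
front = pareto_front(trials)
print(front)        # [(3, 0.6), (10, 0.3), (12, 0.05), (40, 0.01)]
print(knee(front))  # (12, 0.05)
```

The two dropped trials are each beaten on both objectives by another trial, which is what "non-dominated" excludes.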
refusals
   ^
   |  x
   |    x
   |      x   <- knee
   |         x   x
   |               x   x   x
   +-------------------> KL divergence
(each x = a non-dominated trial on the Pareto front)

ZeroFuse vs. the Alternatives
| Capability | ZeroFuse | Manual abliteration | Fine-tuning |
| --- | --- | --- | --- |
| Setup effort | One command: point it at an HF id or path; layers and strengths are picked automatically | Hand-select target layers, directions, and strengths through trial and error | Assemble a dataset, configure a training run, and manage compute |
| Weights vs. runtime | Direct weight edit: orthogonalizes the refusal direction out of o_proj and down_proj | Also a weight edit, but applied manually with chosen parameters | Updates weights via gradient descent over a training corpus |
| Capability preservation | KL divergence from the original model is an explicit optimization objective | Depends on the operator's manual tuning; no built-in capability objective | Risk of catastrophic forgetting; mitigation depends on data and hyperparameters |
| Tuning the trade-off | Two-objective Optuna TPE search yields a Pareto front; pick fewest refusals, lowest KL, or the knee | Re-run by hand and eyeball results; no systematic Pareto search | Adjust data mix and hyperparameters and retrain to shift the trade-off |
| Inference-time overhead | None: output is a standard Hugging Face model | None if done as a weight edit; runtime adapters add overhead | None for a full fine-tune; LoRA adapters add overhead unless merged |
| Compute cost | Runs trials and a KL/refusal search; no gradient-based retraining | Low compute, but high human time per iteration | Highest: training compute proportional to model and dataset size |
| Resumability | Optuna studies journaled to disk; re-run the same command to continue | Manual: depends on your own bookkeeping | Checkpoint-based resume, depending on the training framework |
| Agent / automation | Ships an MCP server for Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client | None built in | None built in |
| Output format | Standard Hugging Face model: load, quantize, or serve like any other | Modified model; format depends on the tooling used | Standard weights or a LoRA adapter, depending on method |
| Model support | Dense, MoE (per-expert down_proj), and many multimodal nestings | Whatever the operator manually implements support for | Broad, subject to framework support for the architecture |

Agent-native / MCP
ZeroFuse ships a built-in Model Context Protocol server, so an agent can drive the whole pipeline as a tool. It works in Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP-compatible client.
Install the optional dependency and add it to your MCP client config:
pip install -e ".[mcp]"

{
  "mcpServers": {
    "zerofuse": {
      "command": "zerofuse-mcp"
    }
  }
}

The zerofuse-mcp command is installed alongside the CLI by pip install -e ".[mcp]". The server exposes a single abliterate tool and is designed to be a well-behaved citizen of an agent's context window:
Quiet by default. The harness sees only high-level phases (identifying refusal architecture, conducting trials, abliterating), not a firehose of internals.
Opt-in detail. Per-trial metrics, layer choices, and KL traces are emitted at MCP debug log level and surface only if the harness opts in to debug logs.
Override when you want it. A verbose argument forces full detail regardless of log level.
This keeps long-running optimization runs legible to an agent instead of flooding it with token-heavy progress chatter. See docs/agents.html for per-harness setup.
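The phase-versus-detail split described above can be illustrated with Python's standard logging module standing in for MCP log levels. This is a sketch of the idea only: PhaseReporter, ListHandler, and demo are hypothetical names, not the contents of reporting.py or mcp_server.py.

```python
import logging


class PhaseReporter:
    """Hypothetical reporter: phases at INFO, per-trial details at DEBUG."""

    def __init__(self, logger: logging.Logger, verbose: bool = False):
        self.logger = logger
        # With verbose=True, details are promoted to INFO so they always show.
        self.detail_level = logging.INFO if verbose else logging.DEBUG

    def phase(self, name: str) -> None:
        self.logger.info("phase: %s", name)

    def detail(self, message: str) -> None:
        self.logger.log(self.detail_level, "detail: %s", message)


class ListHandler(logging.Handler):
    """Collects messages in memory so the example is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def demo(verbose: bool):
    logger = logging.getLogger(f"demo.verbose={verbose}")
    logger.setLevel(logging.INFO)  # the "harness" has not opted in to DEBUG
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    reporter = PhaseReporter(logger, verbose=verbose)
    reporter.phase("conducting trials")
    reporter.detail("trial 7: refusals=12 kl=0.05")
    logger.removeHandler(handler)
    return handler.messages


print(demo(verbose=False))  # ['phase: conducting trials']
print(demo(verbose=True))   # ['phase: conducting trials', 'detail: trial 7: refusals=12 kl=0.05']
```

With the consumer at INFO level, details are dropped unless the verbose override promotes them.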
Python API
Everything the CLI does is available as a library:
from zerofuse import abliterate
# One call: returns the saved HF model dir + the Pareto front to pick from.
result = abliterate("meta-llama/Llama-3.1-8B-Instruct", n_trials=100)
print(result.selected.refusals, result.selected.kl, result.output_dir)

Or build a full configuration explicitly:
from zerofuse import ZeroFuseConfig, run
config = ZeroFuseConfig.from_dict({
"model": "meta-llama/Llama-3.1-8B-Instruct",
"optimization": {"n_trials": 100},
})
result = run(config, selection="knee")

Supported Models
ZeroFuse is built to work on most open-weight transformer models you point it at:
| Architecture | Support |
| --- | --- |
| Dense transformer LLMs | Supported |
| Mixture-of-Experts (per-expert down_proj) | Supported |
| Multimodal nestings with a transformer LLM backbone | Many supported |
| Pure state-space models | Out of scope |

Because architectures vary, ZeroFuse is designed to generalize across these families rather than guaranteed to abliterate every model; it adapts to the residual-writing weights it finds.
Project Structure
ZeroFuse/
├── src/zerofuse/
│   ├── config.py        # Run configuration & defaults (TOML + CLI)
│   ├── prompts.py       # Harmful / harmless prompt loading + batching
│   ├── directions.py    # Pure math: difference-of-means, projected refinement
│   ├── model.py         # Loading, activation capture, weight orthogonalization
│   ├── evaluator.py     # Scoring: refusal detection + KL divergence
│   ├── optimizer.py     # Optuna TPE search + Pareto-front selection
│   ├── pipeline.py      # End-to-end orchestration
│   ├── reporting.py     # Quiet-by-default progress (phases vs. details)
│   ├── cli.py           # `zerofuse` command-line entrypoint
│   └── mcp_server.py    # Model Context Protocol server (agent-native)
├── docs/                # Self-contained HTML documentation site
├── config/default.toml  # Fully-commented configuration template
└── tests/               # Unit tests for the pure-logic parts

Each module has a single responsibility. directions.py is pure math: no model objects, easy to test and audit. model.py is the only place weights are touched.
FAQ
ZeroFuse estimates the model's "refusal direction" via difference-of-means of residual-stream activations on harmful vs. harmless prompts (Arditi et al. 2024, arXiv:2406.11717), then orthogonalizes that direction out of the residual-writing weights (attention o_proj and MLP down_proj, including MoE experts) using W' = W − strength · r(rᵀW). No gradient-based training is involved; it's a direct edit to the existing weights.
ZeroFuse is built to minimize capability loss. KL divergence from the original model is an explicit optimization objective alongside the number of refusals, and a two-objective Optuna TPE search produces a Pareto front so you can choose how to balance fewest refusals against lowest KL. There's also an optional projected refinement step (grimjim 2025) designed to reduce collateral damage. As a v0.1.0 project, these are design goals rather than independently benchmarked guarantees.
ZeroFuse is designed to work on most open-weight transformer models you point it at: dense models, MoE models (including per-expert down_proj), and many multimodal nestings. Pure state-space models are out of scope.
You get a standard Hugging Face model. The refusal behavior is edited into the weights themselves, not delivered as a runtime LoRA adapter. That means zero inference-time overhead: you can load, quantize, and serve it exactly like any other Hugging Face model.
Install with pip install -e . and run zerofuse --model <hf-id-or-path>, or use the Python API with from zerofuse import abliterate. ZeroFuse also ships an MCP server (pip install -e ".[mcp]") that works in Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client. It's quiet by default: the harness sees only high-level phases like "identifying refusal architecture," "conducting trials," and "abliterating," with internal details emitted at MCP debug log level and shown only if the harness opts in (a verbose flag overrides). Runs are resumable: Optuna studies are journaled to disk, so re-running the same command continues where you left off.
ZeroFuse is MIT-licensed. It is an independent clean-room build from published papers and an Apache-2.0 reference implementation that copies no copyleft tool, with citations documented in NOTICE.md. The MIT license covers only the tool, not the models you produce. Because ZeroFuse reduces a model's guardrails, you are responsible for complying with the base model's license and acceptable-use policy, applicable law, and any platform terms that apply to the models you create and deploy.
Status
v0.1.0: new project. ZeroFuse is early. The method is grounded in published research and the implementation is built to preserve capability, but it has not yet been independently benchmarked at scale. Where this README says "designed to" or "built to", that is a deliberate statement that the claim is true by construction, not yet third-party-verified. No benchmark numbers, star counts, or testimonials are presented here because there aren't any to honestly report yet. Issues and reproductions welcome.
Responsible Use
ZeroFuse reduces or removes safety guardrails from model weights. That capability carries real responsibility.
You are responsible for compliance with the base model's license and acceptable-use policy, all applicable law, and the terms of any platform you deploy on.
The MIT license covers this tool only; it does not grant you any rights over, or responsibility for, the models you produce or process. Those are governed by the original model's license.
Use it on models you are permitted to modify, for purposes you are permitted to pursue.
Removing guardrails does not remove accountability. Think before you point it at something.
Provenance & License
ZeroFuse is an independent, clean-room implementation built from published papers and an Apache-2.0 reference implementation. It does not copy, vendor, or derive from any copyleft tool. Citations and attributions are documented in NOTICE.md.
The tool is released under the MIT License. The MIT license covers the tool only, not the models you produce with it.
Citations
Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in Language Models Is Mediated by a Single Direction. NeurIPS 2024. arXiv:2406.11717
Jim Lai (grimjim) (2025). Projected & norm-preserving refinements of the refusal direction for reduced collateral damage. Hugging Face blog.
See NOTICE.md for the full reference list and attributions.
Contributing
PRs are welcome. Good first contributions: new model-family adapters, additional refusal evaluators, prompt-set improvements, and docs. Please keep directions.py pure and confine weight mutation to model.py.
ZeroFuse · MIT · built with PyTorch · Optuna · Transformers
Point it at a model. It does the rest.