
⚡ ZeroFuse

Point it at a model. It conducts trials, identifies the refusal architecture, and abliterates it.

Automated, capability-preserving refusal removal for open-weight transformer LLMs: a direct weight edit that produces a standard Hugging Face model with zero inference-time overhead.

License: MIT · Python 3.11+ · Built with PyTorch · 🤗 Transformers · Optuna · Agent-native: MCP · Status: v0.1.0


ZeroFuse turns guardrail removal into a one-command, fully automated optimization problem: no hand-picking layers, no guessing strengths, no retraining. It estimates the model's refusal direction, orthogonalizes it out of the residual-writing weights, and uses a two-objective search to preserve capability. The output is a standard Hugging Face checkpoint you can load, quantize, or serve like any other.


Why ZeroFuse

Most abliteration workflows are manual: you pick a layer, eyeball a strength coefficient, run the model, check whether it still refuses, and repeat, often degrading the model's general capabilities along the way. ZeroFuse replaces that loop with a principled, automated search.

  • Fully automatic: no hand-picking layers, directions, or strengths. You point it at a model and it conducts the trials.

  • Capability-preserving by design: KL divergence from the original model is an explicit optimization objective, co-minimized alongside refusals, not an afterthought.

  • A real weight edit, not a runtime adapter: orthogonalizes the refusal direction directly out of attention o_proj and MLP down_proj (W' = W − strength · r(rᵀW)). The saved model has zero inference-time overhead: no LoRA to load, no runtime hooks, no wrapper.

  • Pareto-front control: a two-objective Optuna TPE search hands you the full trade-off curve. Pick the point you want: fewest refusals, lowest KL, or the knee.

  • Grounded in published research: difference-of-means refusal direction (Arditi et al. 2024) with optional projected refinement (grimjim 2025) to reduce collateral damage.

  • Broad model support: dense models, MoE (including per-expert down_proj), and many multimodal nestings.

  • Resumable: Optuna studies are journaled to disk; re-run the same command to continue where you left off.

  • Agent-native: ships an MCP server for Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client. Quiet by default, verbose on request.

  • One command, or one import: zerofuse --model <hf-id-or-path>, or from zerofuse import abliterate.

🚀 One Command

# Clone and install (editable)
git clone https://github.com/junainfinity/ZeroFuse.git
cd ZeroFuse
pip install -e .                # core
pip install -e ".[mcp]"         # + the agent/MCP server (optional)

# Point it at any Hugging Face model id or local path
zerofuse --model meta-llama/Llama-3.1-8B-Instruct

That's the whole loop. ZeroFuse:

  1. Loads the target model and captures residual-stream activations.

  2. Identifies the refusal direction via difference-of-means on harmful vs. harmless prompts.

  3. Conducts trials: a two-objective Optuna search over layers and strengths, co-minimizing refusals and KL divergence.

  4. Abliterates by orthogonalizing the chosen direction out of the weights.

  5. Writes a standard Hugging Face model directory you can load with from_pretrained; no special runtime is required.

# Resume an interrupted run: same command, picks up the journaled study
zerofuse --model meta-llama/Llama-3.1-8B-Instruct

# Quiet (only high-level phases) or fully verbose
zerofuse --model meta-llama/Llama-3.1-8B-Instruct --quiet
zerofuse --model meta-llama/Llama-3.1-8B-Instruct --verbose
NOTE

ZeroFuse needs enough memory to load and run forward passes on the target model. Plan capacity for the model you point it at.

🔬 How It Works

ZeroFuse implements the published "refusal direction" line of research as a clean-room MIT build, wrapped in an automated optimizer.

1. Estimate the refusal direction

It captures residual-stream activations on a set of harmful and harmless prompts and takes the difference of means. The unit refusal direction is:

$$ r \;=\; \frac{\mu_{\text{harmful}} - \mu_{\text{harmless}}}{\lVert \mu_{\text{harmful}} - \mu_{\text{harmless}} \rVert} $$

where $\mu_{\text{harmful}}$ and $\mu_{\text{harmless}}$ are the mean residual-stream activations over harmful and harmless prompts respectively (Arditi et al., 2024). An optional projected refinement step (grimjim, 2025) sharpens the estimate to reduce collateral damage.
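The formula above can be sketched in a few lines of NumPy. This is an illustration only, not ZeroFuse's code: the function name `refusal_direction`, the array shapes, and the synthetic activations are all assumptions made for the example, and the optional projected refinement step is not shown.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit difference-of-means direction.

    Both inputs are assumed to be (n_prompts, d_model) arrays of
    residual-stream activations captured at one layer and token position.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("the two means coincide; the direction is undefined")
    return diff / norm

# Synthetic stand-in for captured activations: the "harmful" set is shifted
# along coordinate 3, so the recovered direction should point mostly there.
rng = np.random.default_rng(0)
d_model = 16
shift = np.zeros(d_model)
shift[3] = 3.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + shift

r = refusal_direction(harmful, harmless)
print(np.linalg.norm(r), int(np.argmax(np.abs(r))))
```

In a real run the activations would come from forward passes on the prompt sets rather than from a random generator.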

2. Orthogonalize it out of the weights

Rather than subtract the direction at runtime, ZeroFuse edits the weights that write into the residual stream so they can no longer contribute along $r$:

$$ W' \;=\; W \;-\; \text{strength} \cdot r\,(r^{\top} W) $$

This is applied to the attention output projection (o_proj) and the MLP down-projection (down_proj), including MoE experts. The scalar strength controls how much of the $r$-component is removed: at strength = 1 this is a full orthogonal projection that removes the component entirely; smaller values remove it partially. Because the edit lives in the weights, the resulting model is indistinguishable in shape and speed from the original.
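As a minimal sketch of this edit on a toy matrix (not ZeroFuse's implementation), the snippet below assumes `W` is laid out as (d_model, d_in), so that its output lives in the residual stream, and uses a hypothetical helper named `orthogonalize`. It checks the property stated above: after the edit, the component of the output along $r$ is scaled by (1 − strength).

```python
import numpy as np

def orthogonalize(W: np.ndarray, r: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Return W' = W - strength * r (r^T W).

    W is assumed to have shape (d_model, d_in); r has shape (d_model,).
    """
    r = r / np.linalg.norm(r)                    # make sure r is a unit vector
    return W - strength * np.outer(r, r @ W)     # rank-1 update, same shape as W

rng = np.random.default_rng(0)
d_model, d_in = 8, 32
W = rng.normal(size=(d_model, d_in))
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)

W_full = orthogonalize(W, r, strength=1.0)   # removes the r-component entirely
W_half = orthogonalize(W, r, strength=0.5)   # removes half of it

print(np.abs(r @ W_full).max())              # numerically zero
print(np.allclose(r @ W_half, 0.5 * (r @ W)))
```

Because the update is rank-1 and preserves the shape of `W`, the edited matrix can replace the original in place, which is why no runtime hook is needed.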

3. Search the Pareto front

Choosing layers and strengths by hand is the hard part, so ZeroFuse doesn't. It runs an Optuna TPE multi-objective search that co-minimizes two objectives:

$$ \min \;\big(\; N_{\text{refusals}}, \;\; D_{\mathrm{KL}}(P_{\text{orig}} \,\Vert\, P_{\text{edited}}) \;\big) $$

  • $N_{\text{refusals}}$: how often the edited model still refuses, scored by the evaluator.

  • $D_{\mathrm{KL}}$: how far the edited model's output distribution has drifted from the original, as a proxy for lost capability.

The result is a Pareto front of non-dominated configurations. You choose the operating point that fits your goal (fewest refusals, lowest KL, or the knee of the curve) and ZeroFuse materializes that exact weight edit.
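To make the second objective concrete, here is a small sketch of $D_{\mathrm{KL}}(P_{\text{orig}} \Vert P_{\text{edited}})$ computed from next-token logits. It is an assumption-laden illustration: the helper names, the (n_positions, vocab_size) layout, and averaging over positions are choices made for the example, and the README does not say which positions or prompts ZeroFuse's evaluator uses.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(orig_logits: np.ndarray, edited_logits: np.ndarray) -> float:
    """Mean KL(P_orig || P_edited) over positions.

    Both inputs are assumed to be (n_positions, vocab_size) logit arrays
    produced by the original and edited model on the same inputs.
    """
    log_p = log_softmax(orig_logits)
    log_q = log_softmax(edited_logits)
    kl_per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_position.mean())

rng = np.random.default_rng(0)
orig = rng.normal(size=(4, 50))
edited = orig + 0.1 * rng.normal(size=(4, 50))   # a small perturbation

kl_same = mean_kl(orig, orig)       # identical distributions
kl_edit = mean_kl(orig, edited)     # slightly drifted distributions
print(kl_same, kl_edit)
```

A value near zero means the edited model's predictions are close to the original's on those inputs; larger values indicate more drift.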

refusals
  ^
  |  x
  |   x
  |     x  <- knee
  |        x x
  |            x x x
  +-------------------> KL divergence
   (each x = a non-dominated trial on the Pareto front)
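The selection step can be illustrated in plain Python. The sketch below filters a list of (refusals, KL) results down to the non-dominated set and then picks a knee. The README does not define how ZeroFuse locates the knee, so the heuristic here (closest point to the ideal corner after min-max normalization) is an assumption, as are the `Trial` type and the sample numbers.

```python
from typing import NamedTuple

class Trial(NamedTuple):
    refusals: int
    kl: float

def pareto_front(trials: list[Trial]) -> list[Trial]:
    """Keep trials that no other trial beats on both objectives (lower is better)."""
    front = []
    for t in trials:
        dominated = any(
            o.refusals <= t.refusals and o.kl <= t.kl
            and (o.refusals < t.refusals or o.kl < t.kl)
            for o in trials
        )
        if not dominated:
            front.append(t)
    return sorted(set(front), key=lambda t: t.kl)

def knee(front: list[Trial]) -> Trial:
    """Hypothetical knee rule: nearest to (0, 0) after scaling each axis to [0, 1]."""
    r_lo, r_hi = min(t.refusals for t in front), max(t.refusals for t in front)
    k_lo, k_hi = min(t.kl for t in front), max(t.kl for t in front)
    r_span = (r_hi - r_lo) or 1.0   # avoid division by zero on a flat axis
    k_span = (k_hi - k_lo) or 1.0
    return min(
        front,
        key=lambda t: ((t.refusals - r_lo) / r_span) ** 2 + ((t.kl - k_lo) / k_span) ** 2,
    )

trials = [
    Trial(40, 0.01), Trial(25, 0.03), Trial(8, 0.06), Trial(6, 0.20),
    Trial(5, 0.45), Trial(30, 0.10), Trial(9, 0.30),
]
front = pareto_front(trials)
print(front)
print(knee(front))   # Trial(refusals=8, kl=0.06) for this made-up data
```

Selecting "fewest refusals" or "lowest KL" instead would simply take the minimum of the front along one axis.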

⚖️ ZeroFuse vs. the Alternatives

| Capability | ZeroFuse | Manual abliteration | Fine-tuning |
|---|---|---|---|
| Setup effort | One command: point it at an HF id or path; layers and strengths are picked automatically | Hand-select target layers, directions, and strengths through trial and error | Assemble a dataset, configure a training run, and manage compute |
| Weights vs. runtime | Direct weight edit: orthogonalizes the refusal direction out of o_proj and down_proj | Also a weight edit, but applied manually with chosen parameters | Updates weights via gradient descent over a training corpus |
| Capability preservation | KL divergence from the original model is an explicit optimization objective | Depends on the operator's manual tuning; no built-in capability objective | Risk of catastrophic forgetting; mitigation depends on data and hyperparameters |
| Tuning the trade-off | Two-objective Optuna TPE search yields a Pareto front; pick fewest refusals, lowest KL, or the knee | Re-run by hand and eyeball results; no systematic Pareto search | Adjust data mix and hyperparameters and retrain to shift the trade-off |
| Inference-time overhead | None: output is a standard Hugging Face model | None if done as a weight edit; runtime adapters add overhead | None for a full fine-tune; LoRA adapters add overhead unless merged |
| Compute cost | Runs trials and a KL/refusal search; no gradient-based retraining | Low compute, but high human time per iteration | Highest: training compute proportional to model and dataset size |
| Resumability | Optuna studies journaled to disk; re-run the same command to continue | Manual: depends on your own bookkeeping | Checkpoint-based resume, depending on the training framework |
| Agent / automation | Ships an MCP server for Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client | None built in | None built in |
| Output format | Standard Hugging Face model: load, quantize, or serve like any other | Modified model; format depends on the tooling used | Standard weights or a LoRA adapter, depending on method |
| Model support | Dense, MoE (per-expert down_proj), and many multimodal nestings; pure state-space out of scope | Whatever the operator manually implements support for | Broad, subject to framework support for the architecture |

🤖 Agent-native / MCP

ZeroFuse ships a built-in Model Context Protocol server, so an agent can drive the whole pipeline as a tool. It works in Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP-compatible client.

Install the optional dependency, which also installs the zerofuse-mcp command alongside the CLI:

pip install -e ".[mcp]"

Then add the server to your MCP client config:

{
  "mcpServers": {
    "zerofuse": {
      "command": "zerofuse-mcp"
    }
  }
}

It exposes a single abliterate tool and is designed to be a well-behaved citizen of an agent's context window:

  • Quiet by default. The harness sees only high-level phases (identifying refusal architecture, conducting trials, abliterating), not a firehose of internals.

  • Opt-in detail. Per-trial metrics, layer choices, and KL traces are emitted at MCP debug log level and surface only if the harness opts in to debug logs.

  • Override when you want it. A verbose argument forces full detail regardless of log level.

This keeps long-running optimization runs legible to an agent instead of flooding it with token-heavy progress chatter. See docs/agents.html for per-harness setup.

🐍 Python API

Everything the CLI does is available as a library:

from zerofuse import abliterate

# One call: returns the saved HF model dir + the Pareto front to pick from.
result = abliterate("meta-llama/Llama-3.1-8B-Instruct", n_trials=100)
print(result.selected.refusals, result.selected.kl, result.output_dir)

Or build a full configuration explicitly:

from zerofuse import ZeroFuseConfig, run

config = ZeroFuseConfig.from_dict({
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "optimization": {"n_trials": 100},
})
result = run(config, selection="knee")

🧩 Supported Models

ZeroFuse is built to work on most open-weight transformer models you point it at:

| Architecture | Support |
|---|---|
| Dense transformer LLMs | ✅ Supported |
| Mixture-of-Experts (per-expert down_proj) | ✅ Supported |
| Multimodal nestings with a transformer LLM backbone | ✅ Many supported |
| Pure state-space models | ❌ Out of scope |

Because architectures vary, ZeroFuse is designed to generalize across these families rather than guaranteed to abliterate every model; it adapts to the residual-writing weights it finds.

📁 Project Structure

ZeroFuse/
├── src/zerofuse/
│   ├── config.py        # Run configuration & defaults (TOML + CLI)
│   ├── prompts.py       # Harmful / harmless prompt loading + batching
│   ├── directions.py    # Pure math: difference-of-means, projected refinement
│   ├── model.py         # Loading, activation capture, weight orthogonalization
│   ├── evaluator.py     # Scoring: refusal detection + KL divergence
│   ├── optimizer.py     # Optuna TPE search + Pareto-front selection
│   ├── pipeline.py      # End-to-end orchestration
│   ├── reporting.py     # Quiet-by-default progress (phases vs. details)
│   ├── cli.py           # `zerofuse` command-line entrypoint
│   └── mcp_server.py    # Model Context Protocol server (agent-native)
├── docs/                # Self-contained HTML documentation site
├── config/default.toml  # Fully-commented configuration template
└── tests/               # Unit tests for the pure-logic parts

Each module has a single responsibility. directions.py is pure math: no model objects, easy to test and audit. model.py is the only place weights are touched.

โ“ FAQ

It estimates the model's "refusal direction" via difference-of-means of residual-stream activations on harmful vs. harmless prompts (Arditi et al. 2024, arXiv:2406.11717), then orthogonalizes that direction out of the residual-writing weights โ€” attention o_proj and MLP down_proj, including MoE experts โ€” using W' = W โˆ’ strength ยท r(rแต€W). No gradient-based training is involved; it's a direct edit to the existing weights.

ZeroFuse is built to minimize that. KL divergence from the original model is an explicit optimization objective alongside the number of refusals, and a two-objective Optuna TPE search produces a Pareto front so you can choose how to balance fewest refusals against lowest KL. There's also an optional projected refinement step (grimjim 2025) designed to reduce collateral damage. As a v0.1.0 project, these are design goals rather than independently benchmarked guarantees.

It's designed to work on most open-weight transformer models you point it at โ€” dense models, MoE models (including per-expert down_proj), and many multimodal nestings. Pure state-space models are out of scope.

You get a standard Hugging Face model โ€” the refusal behavior is edited into the weights themselves, not delivered as a runtime LoRA adapter. That means zero inference-time overhead: you can load, quantize, and serve it exactly like any other Hugging Face model.

Install with pip install -e . and run zerofuse --model <hf-id-or-path>, or use the Python API with from zerofuse import abliterate. ZeroFuse also ships an MCP server (pip install -e ".[mcp]") that works in Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client. It's quiet by default โ€” the harness sees only high-level phases like "identifying refusal architecture," "conducting trials," and "abliterating," with internal details emitted at MCP debug log level and shown only if the harness opts in (a verbose flag overrides). Runs are resumable: Optuna studies are journaled to disk, so re-running the same command continues where you left off.

ZeroFuse is MIT-licensed โ€” an independent clean-room build from published papers and an Apache-2.0 reference implementation that copies no copyleft tool, with citations documented in NOTICE.md. The MIT license covers only the tool, not the models you produce. Because ZeroFuse reduces a model's guardrails, you are responsible for complying with the base model's license and acceptable-use policy, applicable law, and any platform terms that apply to the models you create and deploy.

📌 Status

v0.1.0: new project. ZeroFuse is early. The method is grounded in published research and the implementation is built to preserve capability, but it has not yet been independently benchmarked at scale. Where this README says designed to or built to, that is a deliberate statement that the claim is true by construction, not yet third-party-verified. No benchmark numbers, star counts, or testimonials are presented here because there aren't any to honestly report yet. Issues and reproductions welcome.

🛡️ Responsible Use

ZeroFuse reduces or removes safety guardrails from model weights. That capability carries real responsibility.

  • You are responsible for compliance with the base model's license and acceptable-use policy, all applicable law, and the terms of any platform you deploy on.

  • The MIT license covers this tool only. It does not grant you any rights over, or responsibility for, the models you produce or process. Those are governed by the original model's license.

  • Use it on models you are permitted to modify, for purposes you are permitted to pursue.

Removing guardrails does not remove accountability. Think before you point it at something.

📜 Provenance & License

ZeroFuse is an independent, clean-room implementation built from published papers and an Apache-2.0 reference implementation. It does not copy, vendor, or derive from any copyleft tool. Citations and attributions are documented in NOTICE.md.

The tool is released under the MIT License. The MIT license covers the tool only, not the models you produce with it.

📚 Citations

  • Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in Language Models Is Mediated by a Single Direction. NeurIPS 2024. arXiv:2406.11717

  • Jim Lai (grimjim) (2025). Projected & norm-preserving refinements of the refusal direction for reduced collateral damage. Hugging Face blog.

See NOTICE.md for the full reference list and attributions.

🤝 Contributing

PRs are welcome. Good first contributions: new model-family adapters, additional refusal evaluators, prompt-set improvements, and docs. Please keep directions.py pure and confine weight mutation to model.py.


ZeroFuse · MIT · built with PyTorch · Optuna · 🤗 Transformers

Point it at a model. It does the rest.
