tool-smith

LoRA-fine-tune a small model into a JSON tool-call router, serve it over MCP, and prove the lift with a from-scratch base-vs-tuned eval — plus an observable agent loop with failure recovery.

The arc, end to end: build the dataset → LoRA-SFT a 0.5B model → measure it against base on a hard held-out split → serve the tuned model over MCP so any agent can call it → wrap it in an agent loop that validates, repair-retries, and falls back. Everything here actually ran on a 16 GB Apple-Silicon Mac; the loss curve, the adapter, and the eval numbers are committed real outputs, not placeholders.

Teaching-scale on purpose. The point isn't to claim I trained a frontier model — it's to demonstrate that I can stand up the full PyTorch + PEFT + TRL training loop, build the data, read the loss curve, and prove an improvement with a rigorous eval.


The result (real, from python -m toolsmith.eval)

Qwen2.5-0.5B-Instruct, held-out test set of 97 cases (54 easy + 43 hard), graded by code against the exact gold tool + args:

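| metric (all 97)   | base  | LoRA-tuned | Δ     |
|-------------------|-------|------------|-------|
| valid JSON        | 95.9% | 100.0%     | +4.1  |
| schema-valid call | 19.6% | 94.8%      | +75.2 |
| correct tool      | 36.1% | 85.6%      | +49.5 |
| exact args        | 4.1%  | 74.2%      | +70.1 |
| fully correct     | 4.1%  | 74.2%      | +70.1 |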

On the hard split (ambiguous wording, near-duplicate tools, distractors) the base model gets 0% fully correct; the tuned model gets 76.7%.

[Figure: base vs tuned]

The story is simple: the base 0.5B model already knows JSON syntax (95.9% valid) but rarely follows the tool schema (19.6% schema-valid, 4.1% exact args). LoRA SFT teaches it the schema and leaves the syntax it already had intact (valid JSON rises to 100.0%).
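A code-only grader can produce the five metrics in the table above. The following is a minimal sketch, not the repository's grade.py: the `TOOLS` table, `grade`, and `summarize` are hypothetical names, `send_email` and its arguments are invented for illustration, and only `get_weather` with a `city` argument comes from the example output in this README. It also assumes that "exact args" is only counted when the tool is correct.

```python
import json

# Hypothetical mini-schema: tool name -> {arg name: expected Python type}.
# The real project defines 8 tools; only get_weather/city is taken from the README.
TOOLS = {
    "get_weather": {"city": str},
    "send_email": {"to": str, "subject": str},
}


def grade(raw: str, gold: dict) -> dict:
    """Grade one raw model output against a gold {"tool", "args"} call."""
    result = {"valid_json": False, "schema_valid": False,
              "correct_tool": False, "exact_args": False, "fully_correct": False}
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return result  # not JSON at all: every check fails
    result["valid_json"] = True
    if not isinstance(call, dict):
        return result

    tool, args = call.get("tool"), call.get("args")
    spec = TOOLS.get(tool) if isinstance(tool, str) else None
    # Schema-valid: known tool, exactly the declared arg names, right types.
    result["schema_valid"] = (
        spec is not None
        and isinstance(args, dict)
        and set(args) == set(spec)
        and all(isinstance(args[k], t) for k, t in spec.items())
    )
    result["correct_tool"] = tool == gold["tool"]
    # Assumption: args only count as exact when the tool is also right.
    result["exact_args"] = result["correct_tool"] and args == gold["args"]
    result["fully_correct"] = result["schema_valid"] and result["exact_args"]
    return result


def summarize(rows: list) -> dict:
    """Percentage of cases passing each check, rounded to one decimal."""
    return {k: round(100 * sum(r[k] for r in rows) / len(rows), 1) for k in rows[0]}


gold = {"tool": "get_weather", "args": {"city": "Tokyo"}}
outputs = [
    '{"tool": "get_weather", "args": {"city": "Tokyo"}}',      # fully correct
    '{"tool": "get_weather", "args": {"location": "Tokyo"}}',  # right tool, wrong arg name
    "Sure! You should call the weather tool.",                 # not JSON
]
rows = [grade(o, gold) for o in outputs]
print(summarize(rows))
```

The second output shows how a model can score on "correct tool" while failing both "schema-valid call" and "exact args", which is the pattern the base model shows in the table.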

Generalization to hand-written (non-templated) inputs

The training/test data is templated, so the obvious question is "does it generalize beyond the templates?" data/real_test.jsonl is 12 hand-written, naturalistic requests (e.g. "is it shorts weather in Athens right now or should I bring a jacket", "shoot Priya a message, subject 'Q3 numbers'…") — never seen in any template. Run python -m toolsmith.eval --testfile data/real_test.jsonl --tag _real:

| metric (12 hand-written) | base  | tuned  | Δ     |
|--------------------------|-------|--------|-------|
| schema-valid             | 25.0% | 100.0% | +75.0 |
| correct tool             | 41.7% | 83.3%  | +41.6 |
| fully correct            | 8.3%  | 58.3%  | +50.0 |

The lift holds on genuinely out-of-distribution phrasing: tool selection and schema adherence generalize strongly. fully_correct (58.3%) is lower than the templated 74.2% because exact-arg matching on free-form text (e.g. "next Thursday" → a date string) is harder. That gap is the real generalization cost, reported rather than hidden.

The training run (real loss curve)

LoRA rank 16 on attention+MLP projections (~8.8M trainable params, 1.75% of the model), 3 epochs, ~6.5 min on MPS. train_loss 4.3 → 0.35.
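The parameter count above can be sanity-checked with a few lines of arithmetic. This is a sketch under stated assumptions: the layer dimensions and base parameter count are the ones I understand the public Qwen2.5-0.5B config to use (they are not given in this README), and the seven adapted projections are my reading of "attention+MLP projections".

```python
# Assumed Qwen2.5-0.5B dimensions (public model config, not stated in this README).
HIDDEN = 896                # hidden_size
INTERMEDIATE = 4864         # MLP intermediate_size
LAYERS = 24                 # num_hidden_layers
KV_DIM = 128                # num_key_value_heads (2) * head_dim (64)
BASE_PARAMS = 494_032_768   # assumed base parameter count (tied embeddings)

RANK = 16  # LoRA rank stated above

# (in_features, out_features) of each projection assumed to carry an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# A rank-r adapter on a d_in x d_out layer adds A (r x d_in) and B (d_out x r).
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
lora_params = per_layer * LAYERS
trainable_pct = 100 * lora_params / (BASE_PARAMS + lora_params)
adapter_mb = lora_params * 4 / 1e6  # assuming fp32 weights, 4 bytes each

print(f"{lora_params:,} trainable params")    # 8,798,208 trainable params
print(f"{trainable_pct:.2f}% of all params")  # 1.75% of all params
print(f"~{adapter_mb:.1f} MB adapter")        # ~35.2 MB adapter
```

Under these assumptions the arithmetic reproduces the ~8.8M and 1.75% figures, and an fp32 adapter of that size comes to about 35 MB.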

[Figure: training loss]

Quickstart

pip install -e .                                   # MCP server + agent + grader (light deps)
pip install -r requirements-train.txt              # torch/transformers/peft/trl/... for training

python -m toolsmith.data.build      # -> data/train.jsonl (243), data/test.jsonl (97)
python -m toolsmith.train           # LoRA SFT -> artifacts/adapter + artifacts/loss.png
python -m toolsmith.eval            # base vs tuned -> artifacts/eval_report.md + eval_chart.png
python -m toolsmith.agent --demo    # offline recovery demo -> logs/run-demo.jsonl

Serve the tuned model over MCP

python -m toolsmith.mcp_server      # stdio; exposes route_to_tool(request) + the 8 tools

mcp.json for Claude Desktop / Cursor:

{
  "mcpServers": {
    "tool-smith": {
      "command": "python",
      "args": ["-m", "toolsmith.mcp_server"],
      "cwd": "/path/to/tool-smith"
    }
  }
}

route_to_tool("What's the weather in Tokyo?") → {"tool": "get_weather", "args": {"city": "Tokyo"}, "valid": true, ...}.

The agent loop (validation + recovery + observability)

agent.py wraps the router: route → parse → validate against the tool schema → on failure, repair-retry with the error fed back → if still failing, fall back to a frontier/rule router → execute. Every step is appended to logs/run-*.jsonl (raw output, latency, validation verdict, retry count, recovery action). A real model-backed run (logs/run-model.jsonl) exercises all three paths:

ok=True recovery=none              | What's the weather in Tokyo?        (tuned, 1 attempt)
ok=True recovery=repair_retry      | Pack for Berlin? ... rain there.    (base model failed, retry fixed it)
ok=True recovery=frontier_fallback | ...what's sitting in refunds...     (base failed x3 -> fallback router)
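The three recovery paths shown above can be reproduced without a model by plugging stub routers into a small loop. This is a minimal sketch, not the project's agent.py: `validate`, `run`, the one-tool `TOOLS` schema, the repair prompt wording, and the three-attempt limit are assumptions chosen to match the description and the log lines.

```python
import json
import time
from typing import Callable, Optional

# Hypothetical one-tool schema; the real toolbox has 8 tools.
TOOLS = {"get_weather": {"city": str}}

Router = Callable[[str], str]  # prompt in, raw model text out


def validate(raw: str) -> tuple:
    """Return (call, "") for a schema-valid tool call, else (None, error)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    tool = call.get("tool") if isinstance(call, dict) else None
    if not isinstance(tool, str) or tool not in TOOLS:
        return None, "unknown or missing tool"
    spec, args = TOOLS[tool], call.get("args")
    if not isinstance(args, dict) or set(args) != set(spec):
        return None, f"args must be exactly {sorted(spec)}"
    if not all(isinstance(args[k], t) for k, t in spec.items()):
        return None, "argument has the wrong type"
    return call, ""


def run(request: str, router: Router, fallback: Router, max_attempts: int = 3) -> dict:
    """Route, validate, repair-retry with the error fed back, then fall back."""
    prompt, start = request, time.perf_counter()
    call: Optional[dict] = None
    for attempt in range(1, max_attempts + 1):
        call, error = validate(router(prompt))
        if call is not None:
            recovery = "none" if attempt == 1 else "repair_retry"
            break
        # Feed the validation error back so the next attempt can repair it.
        prompt = f"{request}\nPrevious output rejected: {error}. Reply with JSON only."
    else:
        # Every attempt failed validation: hand the original request to the fallback.
        call, error = validate(fallback(request))
        recovery = "frontier_fallback"
    return {"request": request, "ok": call is not None, "recovery": recovery,
            "attempts": attempt, "call": call,
            "latency_s": time.perf_counter() - start}


def good(prompt: str) -> str:
    return '{"tool": "get_weather", "args": {"city": "Tokyo"}}'


def broken(prompt: str) -> str:
    return "no idea"


def flaky() -> Router:
    """Stub that answers in prose first, then with a valid call."""
    replies = iter(["The weather tool!",
                    '{"tool": "get_weather", "args": {"city": "Berlin"}}'])
    return lambda prompt: next(replies)


records = [run("What's the weather in Tokyo?", good, good),
           run("Pack for Berlin?", flaky(), good),
           run("Weather in Tokyo?", broken, good)]
for r in records:
    print(f"ok={r['ok']} recovery={r['recovery']} attempts={r['attempts']}")
# ok=True recovery=none attempts=1
# ok=True recovery=repair_retry attempts=2
# ok=True recovery=frontier_fallback attempts=3
```

Each returned record is a plain dict, so appending it to a JSONL trace is a single `json.dumps` call per step.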

python -m toolsmith.logs_report → success rate, recovery breakdown, latency. The recovery logic is unit-tested with stub routers (tests/test_agent.py), so it's verified without a model.
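A summary like the one described above takes only a few lines over JSONL records. This sketch assumes each record carries `ok`, `recovery`, and `latency_s` fields; those field names, the `summarize_run` helper, and the sample values are hypothetical, not copied from the committed logs.

```python
import json
from collections import Counter
from statistics import mean, median


def summarize_run(lines) -> dict:
    """Aggregate JSONL records into success rate, recovery breakdown, and latency."""
    records = [json.loads(line) for line in lines if line.strip()]
    latencies = [r["latency_s"] for r in records]
    return {
        "runs": len(records),
        "success_rate": sum(r["ok"] for r in records) / len(records),
        "recovery": dict(Counter(r["recovery"] for r in records)),
        "latency_mean_s": mean(latencies),
        "latency_median_s": median(latencies),
    }


# Made-up records for illustration only; real traces live in logs/run-*.jsonl.
sample = """\
{"ok": true, "recovery": "none", "latency_s": 0.5}
{"ok": true, "recovery": "repair_retry", "latency_s": 1.5}
{"ok": true, "recovery": "frontier_fallback", "latency_s": 2.5}
{"ok": false, "recovery": "frontier_fallback", "latency_s": 3.5}
"""
report = summarize_run(sample.splitlines())
print(report)
```

To run it on a real trace, pass an open file handle instead of `sample.splitlines()`; an empty log would need a guard against division by zero.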

Layout

toolsmith/
  schema.py        # the fixed 8-tool toolbox + JSON validator (one source of truth)
  data/build.py    # deterministic dataset; TRAIN/TEST templates are DISJOINT + a hard split
  train.py         # PEFT LoRA SFT via TRL SFTTrainer (MPS), saves adapter + loss.png
  router.py        # load base (+adapter) and turn a request into a tool-call string
  eval.py          # base vs tuned, code-graded per bucket -> report.md + chart.png + csv
  grade.py         # exact tool/args grading (no LLM judge needed for routing)
  mcp_server.py    # FastMCP: route_to_tool + 8 mock tools
  agent.py         # validate / repair-retry / frontier-fallback loop + JSONL logging
  logs_report.py   # summarize agent runs
data/              # committed train/test jsonl
artifacts/         # committed: adapter/, loss.png, eval_report.md, eval_chart.png, eval.csv
logs/              # committed real agent traces
tests/             # pytest (grading + agent recovery), model-free
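The "deterministic dataset; TRAIN/TEST templates are DISJOINT" note for data/build.py can be illustrated with a toy builder. This is a sketch of the idea only: the templates, the city list, the `build` function, and the seeds are all hypothetical, and the real builder covers 8 tools plus a hard split.

```python
import random

# Hypothetical templates; the point is that no template appears in both sets.
TRAIN_TEMPLATES = ["What's the weather in {city}?", "Give me the forecast for {city}."]
TEST_TEMPLATES = ["Is it raining in {city} right now?"]
CITIES = ["Tokyo", "Athens", "Berlin"]


def build(templates: list, n: int, seed: int) -> list:
    """Generate n (request, gold call) rows; the same seed gives the same rows."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    rows = []
    for _ in range(n):
        city = rng.choice(CITIES)
        rows.append({
            "request": rng.choice(templates).format(city=city),
            "gold": {"tool": "get_weather", "args": {"city": city}},
        })
    return rows


# Guard against leakage: fail loudly if a template is shared between splits.
assert not set(TRAIN_TEMPLATES) & set(TEST_TEMPLATES), "train/test templates overlap"

train = build(TRAIN_TEMPLATES, 6, seed=0)
test = build(TEST_TEMPLATES, 3, seed=1)
print(len(train), "train rows,", len(test), "test rows")
```

Because the gold call is generated together with the request, every row has checkable ground truth, which is what makes code-only grading possible.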

Limitations & next steps

Stated plainly — knowing the limits is part of the work:

  • Teaching-scale: 0.5B model · LoRA (PEFT) · SFT-only — an adapter (~35 MB), not a full or from-scratch fine-tune, not algorithm research, not large-scale/distributed training.

  • Synthetic data: the 243 train / 97 test cases are templated (though TRAIN/TEST templates are disjoint + a hard split, and the 12 hand-written cases above show real generalization). Real user traffic is messier; the honest free-form fully_correct is 58% vs 74% templated.

  • Mock tools: the 8 tool bodies are stubs — the contribution is the routing model + eval + MCP serving + agent loop, not the tools.

  • Single base model, no judge in the headline metric (routing has checkable ground truth, so it's code-graded; the optional LLM-judge column needs an API key).

  • Next steps if taken further: train on real (de-identified) request logs, add tool-arg-type coercion in the agent, compare LoRA ranks / a 1.5B base, add function-calling-format export (OpenAI/Anthropic tool schemas), and a serving latency benchmark.

  • Every number here comes from an actual local run and is regenerable (fixed seed; requirements-train.txt pins the exact stack). No placeholder figures.

License

MIT
