tool-smith

by ganlin770

LoRA-fine-tune a small model into a JSON tool-call router, serve it over MCP, and prove the lift with a from-scratch base-vs-tuned eval — plus an observable agent loop with failure recovery.

The arc, end to end: build the dataset → LoRA-SFT a 0.5B model → measure it against base on a hard held-out split → serve the tuned model over MCP so any agent can call it → wrap it in an agent loop that validates, repair-retries, and falls back. Everything here actually ran on a 16 GB Apple-Silicon Mac; the loss curve, the adapter, and the eval numbers are committed real outputs, not placeholders.

Teaching-scale on purpose. The point isn't to claim I trained a frontier model — it's to demonstrate that I can stand up the full PyTorch + PEFT + TRL training loop, build the data, read the loss curve, and prove an improvement with a rigorous eval.

The result (real, from `python -m toolsmith.eval`)

Qwen2.5-0.5B-Instruct, held-out test set of 97 cases (54 easy + 43 hard), graded by code against the exact gold tool + args:

metric (all 97 cases)	base	LoRA-tuned	Δ (pp)
valid JSON	95.9%	100.0%	+4.1
schema-valid call	19.6%	94.8%	+75.2
correct tool	36.1%	85.6%	+49.5
exact args	4.1%	74.2%	+70.1
fully correct	4.1%	74.2%	+70.1

On the hard split (ambiguous wording, near-duplicate tools, distractors) the base model gets 0% fully correct; the tuned model gets 76.7%.
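
To make the metric definitions concrete, here is a minimal sketch of how one model output might be graded by code. The two-tool schema, the `grade` helper, and the exact rule for "fully correct" are assumptions for illustration; the project's own `schema.py` and `grade.py` may define them differently.

```python
import json

# Hypothetical two-tool schema: tool name -> {argument name: expected type}.
# The real toolbox has 8 tools; these names and fields are illustrative only.
SCHEMA = {
    "get_weather": {"city": str},
    "send_email": {"to": str, "subject": str},
}


def grade(raw: str, gold: dict) -> dict:
    """Grade one raw model output against a gold {"tool", "args"} record."""
    result = {"valid_json": False, "schema_valid": False,
              "correct_tool": False, "exact_args": False, "fully_correct": False}
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return result  # not even parseable, so every metric stays False
    result["valid_json"] = True
    if not isinstance(call, dict):
        return result
    tool, args = call.get("tool"), call.get("args")
    spec = SCHEMA.get(tool) if isinstance(tool, str) else None
    if spec is not None and isinstance(args, dict):
        # Schema-valid: known tool, exactly the declared argument names, right types.
        result["schema_valid"] = (
            set(args) == set(spec)
            and all(isinstance(args[name], kind) for name, kind in spec.items())
        )
    result["correct_tool"] = tool == gold["tool"]
    result["exact_args"] = result["correct_tool"] and args == gold["args"]
    result["fully_correct"] = result["schema_valid"] and result["exact_args"]
    return result


gold = {"tool": "get_weather", "args": {"city": "Tokyo"}}
print(grade('{"tool": "get_weather", "args": {"city": "Tokyo"}}', gold)["fully_correct"])
print(grade('{"tool": "get_weather", "args": {"location": "Tokyo"}}', gold))
```

The second call shows the failure pattern the table describes for the base model: the JSON parses and the tool is right, but the argument name does not match the schema, so the call is neither schema-valid nor exact.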

base vs tuned

The story is clean and honest: the base 0.5B model already knows JSON syntax (95.9% valid) but rarely follows the tool schema (19.6% schema-valid, 4.1% exact args). LoRA SFT teaches it the schema without disturbing the syntax it already had.

Generalization to hand-written (non-templated) inputs

The training/test data is templated, so the obvious question is whether the model generalizes beyond the templates. `data/real_test.jsonl` holds 12 hand-written, naturalistic requests (e.g. "is it shorts weather in Athens right now or should I bring a jacket", "shoot Priya a message, subject 'Q3 numbers'…"), none of which appear in any template. Run `python -m toolsmith.eval --testfile data/real_test.jsonl --tag _real`:

metric (12 hand-written cases)	base	tuned	Δ (pp)
schema-valid	25.0%	100.0%	+75.0
correct tool	41.7%	83.3%	+41.6
fully correct	8.3%	58.3%	+50.0

The lift holds on genuinely out-of-distribution phrasing — tool selection and schema adherence generalize strongly; fully_correct (58.3%) is honestly lower than the templated 74.2%, because exact-arg matching on free-form text (e.g. "next Thursday" → a date string) is harder. That gap is the real generalization cost, reported rather than hidden.

The training run (real loss curve)

LoRA rank 16 on attention+MLP projections (~8.8M trainable params, 1.75% of the model), 3 epochs, ~6.5 min on MPS. train_loss 4.3 → 0.35.
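
The trainable-parameter figure can be sanity-checked by hand. The sketch below assumes the layer dimensions of Qwen2.5-0.5B as published in its model config (they are not stated in this README), assumes all seven attention and MLP projections are adapted, and uses an approximate base parameter count; the share is computed over base plus adapter parameters.

```python
# Assumed Qwen2.5-0.5B dimensions (from the published model config, not this README).
HIDDEN = 896                 # hidden_size
KV_DIM = 128                 # num_key_value_heads (2) * head_dim (64)
FFN = 4864                   # intermediate_size
LAYERS = 24                  # num_hidden_layers
BASE_PARAMS = 494_000_000    # approximate base parameter count (assumption)
RANK = 16                    # LoRA rank stated above

# (in_features, out_features) of each adapted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}


def lora_params(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for every adapted matrix."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return per_layer * LAYERS


trainable = lora_params(RANK)
share = 100 * trainable / (BASE_PARAMS + trainable)
print(f"{trainable:,} trainable params, {share:.2f}% of the adapted model")
# -> 8,798,208 trainable params, 1.75% of the adapted model
```

Under these assumptions the arithmetic lands on the ~8.8M and 1.75% quoted above, which is consistent with rank-16 adapters on every attention and MLP projection.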

training loss

Quickstart

pip install -e .                                   # MCP server + agent + grader (light deps)
pip install -r requirements-train.txt              # torch/transformers/peft/trl/... for training

python -m toolsmith.data.build      # -> data/train.jsonl (243), data/test.jsonl (97)
python -m toolsmith.train           # LoRA SFT -> artifacts/adapter + artifacts/loss.png
python -m toolsmith.eval            # base vs tuned -> artifacts/eval_report.md + eval_chart.png
python -m toolsmith.agent --demo    # offline recovery demo -> logs/run-demo.jsonl

Serve the tuned model over MCP

python -m toolsmith.mcp_server      # stdio; exposes route_to_tool(request) + the 8 tools
# or containerized (installs the inference stack, pulls the base model on first run):
docker build -t tool-smith . && docker run --rm -i tool-smith

mcp.json for Claude Desktop / Cursor:

{
  "mcpServers": {
    "tool-smith": {
      "command": "python",
      "args": ["-m", "toolsmith.mcp_server"],
      "cwd": "/path/to/tool-smith"
    }
  }
}

route_to_tool("What's the weather in Tokyo?") → {"tool": "get_weather", "args": {"city": "Tokyo"}, "valid": true, ...}.
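
On the calling side, an agent only needs the `tool`, `args`, and `valid` fields to act on that result. A minimal sketch, assuming the response arrives as JSON text and that the caller keeps its own handler table; `HANDLERS`, `dispatch`, and the stub `get_weather` body are hypothetical names, not part of the project:

```python
import json
from typing import Any, Callable


def get_weather(city: str) -> str:
    # Stub body standing in for a real tool implementation.
    return f"(mock) weather for {city}"


# Hypothetical local handler table: tool name -> callable.
HANDLERS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch(response_text: str) -> Any:
    """Run the tool named in a route_to_tool response, if it is marked valid."""
    response = json.loads(response_text)
    if not response.get("valid"):
        raise ValueError(f"router returned an invalid call: {response}")
    handler = HANDLERS[response["tool"]]
    return handler(**response["args"])


print(dispatch('{"tool": "get_weather", "args": {"city": "Tokyo"}, "valid": true}'))
# -> (mock) weather for Tokyo
```

Checking `valid` before dispatching keeps a malformed call from reaching a handler with the wrong keyword arguments.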

The agent loop (validation + recovery + observability)

agent.py wraps the router: route → parse → validate against the tool schema → on failure, repair-retry with the error fed back → if still failing, fall back to a frontier/rule router → execute. Every step is appended to logs/run-*.jsonl (raw output, latency, validation verdict, retry count, recovery action). A real model-backed run (logs/run-model.jsonl) exercises all three paths:

ok=True recovery=none              | What's the weather in Tokyo?        (tuned, 1 attempt)
ok=True recovery=repair_retry      | Pack for Berlin? ... rain there.    (base model failed, retry fixed it)
ok=True recovery=frontier_fallback | ...what's sitting in refunds...     (base failed x3 -> fallback router)
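
The control flow behind those three outcomes can be sketched with stub routers. This is an illustration of the validate / repair-retry / fallback pattern under assumed names (`validate`, `attempt`, `run`, a one-tool schema), not the code in `agent.py`. It keeps steps in memory instead of appending to a JSONL file and stops before the execute step.

```python
import json
import time
from typing import Callable, Optional

# A router maps (request, previous_error) to a raw model output string.
Router = Callable[[str, Optional[str]], str]

# Hypothetical one-tool schema: tool name -> exact set of argument names.
REQUIRED_ARGS = {"get_weather": {"city"}}


def validate(raw: str):
    """Return (call, None) for a schema-valid call, else (None, error message)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    tool = call.get("tool") if isinstance(call, dict) else None
    if not isinstance(tool, str) or tool not in REQUIRED_ARGS:
        return None, "unknown or missing tool"
    args = call.get("args")
    if not isinstance(args, dict) or set(args) != REQUIRED_ARGS[tool]:
        return None, f"args for {tool} must be exactly {sorted(REQUIRED_ARGS[tool])}"
    return call, None


def attempt(router: Router, request: str, error: Optional[str], steps: list):
    """Call one router, validate its output, and record the step."""
    start = time.perf_counter()
    raw = router(request, error)
    call, new_error = validate(raw)
    steps.append({"raw": raw, "error": new_error,
                  "latency_s": time.perf_counter() - start})
    return call, new_error


def run(request: str, router: Router, fallback: Router, max_attempts: int = 3) -> dict:
    """Try the router up to max_attempts times, then hand off to the fallback."""
    steps, error = [], None
    for n in range(1, max_attempts + 1):
        # The previous validation error is fed back so the router can repair its output.
        call, error = attempt(router, request, error, steps)
        if call is not None:
            recovery = "none" if n == 1 else "repair_retry"
            return {"ok": True, "recovery": recovery, "call": call, "steps": steps}
    call, error = attempt(fallback, request, error, steps)
    return {"ok": call is not None, "recovery": "frontier_fallback",
            "call": call, "steps": steps}


def flaky(request: str, previous_error: Optional[str]) -> str:
    # Stub router: wrong argument name first, corrected once an error is fed back.
    if previous_error is None:
        return '{"tool": "get_weather", "args": {"location": "Berlin"}}'
    return '{"tool": "get_weather", "args": {"city": "Berlin"}}'


def rule_fallback(request: str, previous_error: Optional[str]) -> str:
    # Stub fallback router that always returns a valid call.
    return '{"tool": "get_weather", "args": {"city": "Berlin"}}'


result = run("Pack for Berlin?", flaky, rule_fallback)
print(result["ok"], result["recovery"], len(result["steps"]))
# -> True repair_retry 2
```

Because the routers are plain callables, the recovery paths can be exercised without loading a model.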

python -m toolsmith.logs_report → success rate, recovery breakdown, latency. The recovery logic is unit-tested with stub routers (tests/test_agent.py), so it's verified without a model.
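
A report of that kind reduces to a small aggregation over JSONL records. The sketch below assumes one record per run with hypothetical field names (`ok`, `recovery`, `latency_s`); the sample values are made up for illustration and are not taken from the committed logs.

```python
import json
from collections import Counter
from statistics import mean


def summarize(lines) -> dict:
    """Summarize JSONL agent records; the field names here are hypothetical."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        raise ValueError("no records to summarize")
    return {
        "runs": len(records),
        "success_rate": sum(bool(r["ok"]) for r in records) / len(records),
        "recovery": dict(Counter(r["recovery"] for r in records)),
        "mean_latency_s": mean(r["latency_s"] for r in records),
    }


# Made-up sample records, one per recovery path.
sample = [
    '{"ok": true, "recovery": "none", "latency_s": 0.5}',
    '{"ok": true, "recovery": "repair_retry", "latency_s": 1.5}',
    '{"ok": true, "recovery": "frontier_fallback", "latency_s": 2.5}',
]
print(summarize(sample))
```

To run it against a real trace, pass an open file handle in place of `sample`, after adjusting the field names to match the actual log format.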

Layout

toolsmith/
  schema.py        # the fixed 8-tool toolbox + JSON validator (one source of truth)
  data/build.py    # deterministic dataset; TRAIN/TEST templates are DISJOINT + a hard split
  train.py         # PEFT LoRA SFT via TRL SFTTrainer (MPS), saves adapter + loss.png
  router.py        # load base (+adapter) and turn a request into a tool-call string
  eval.py          # base vs tuned, code-graded per bucket -> report.md + chart.png + csv
  grade.py         # exact tool/args grading (no LLM judge needed for routing)
  mcp_server.py    # FastMCP: route_to_tool + 8 mock tools
  agent.py         # validate / repair-retry / frontier-fallback loop + JSONL logging
  logs_report.py   # summarize agent runs
data/              # committed train/test jsonl
artifacts/         # committed: adapter/, loss.png, eval_report.md, eval_chart.png, eval.csv
logs/              # committed real agent traces
tests/             # pytest (grading + agent recovery), model-free
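
The two properties claimed for `data/build.py`, deterministic output and disjoint train/test templates, are cheap to enforce in code. A minimal sketch with made-up templates and a hypothetical `build` helper; the real template sets and record format are not shown in this README:

```python
import random

# Made-up templates for illustration; the real sets live in data/build.py.
TRAIN_TEMPLATES = ["What's the weather in {city}?", "Weather forecast for {city}, please."]
TEST_TEMPLATES = ["Do I need an umbrella in {city} today?"]
CITIES = ["Tokyo", "Berlin", "Athens"]


def build(templates, n: int, seed: int) -> list:
    """Generate n records; the output depends only on the templates and the seed."""
    rng = random.Random(seed)  # local RNG, so global random state cannot leak in
    rows = []
    for _ in range(n):
        city = rng.choice(CITIES)
        request = rng.choice(templates).format(city=city)
        rows.append({"request": request, "tool": "get_weather", "args": {"city": city}})
    return rows


# No template may appear in both splits.
assert not set(TRAIN_TEMPLATES) & set(TEST_TEMPLATES)

train = build(TRAIN_TEMPLATES, 4, seed=0)
test = build(TEST_TEMPLATES, 2, seed=1)

# Same seed, same data: the build is reproducible.
assert train == build(TRAIN_TEMPLATES, 4, seed=0)
print(len(train), len(test))
```

Disjoint templates only rule out surface-form leakage. Entities such as city names can still recur across splits, which is one reason a separate hand-written set is a useful check.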

Limitations & next steps

Stated plainly — knowing the limits is part of the work:

Teaching-scale: 0.5B model · LoRA (PEFT) · SFT-only — an adapter (~35 MB), not a full or from-scratch fine-tune, not algorithm research, not large-scale/distributed training.
Synthetic data: the 243 train / 97 test cases are templated (though TRAIN/TEST templates are disjoint, there is a hard split, and the 12 hand-written cases above show real generalization). Real user traffic is messier; the honest free-form fully_correct is 58% vs 74% templated.
Mock tools: the 8 tool bodies are stubs — the contribution is the routing model + eval + MCP serving + agent loop, not the tools.
Single base model, and no LLM judge in the headline metric: routing has checkable ground truth, so it is code-graded; the optional LLM-judge column needs an API key.
Next steps if taken further: train on real (de-identified) request logs, add tool-arg-type coercion in the agent, compare LoRA ranks / a 1.5B base, add function-calling-format export (OpenAI/Anthropic tool schemas), and a serving latency benchmark.
Every number here comes from an actual local run, regenerable (fixed seed; requirements-train.txt pins the exact stack). No placeholder figures.

License

MIT
