tool-smith
LoRA-fine-tune a small model into a JSON tool-call router, serve it over MCP, and prove the lift with a from-scratch base-vs-tuned eval — plus an observable agent loop with failure recovery.
The arc, end to end: build the dataset → LoRA-SFT a 0.5B model → measure it against base on a hard held-out split → serve the tuned model over MCP so any agent can call it → wrap it in an agent loop that validates, repair-retries, and falls back. Everything here actually ran on a 16 GB Apple-Silicon Mac; the loss curve, the adapter, and the eval numbers are committed real outputs, not placeholders.
Teaching-scale on purpose. The point isn't to claim I trained a frontier model — it's to demonstrate that I can stand up the full PyTorch + PEFT + TRL training loop, build the data, read the loss curve, and prove an improvement with a rigorous eval.
The result (real, from python -m toolsmith.eval)
Qwen2.5-0.5B-Instruct, held-out test set of 97 cases (54 easy + 43 hard), graded by code against the exact gold tool + args:
| metric (all 97) | base | LoRA-tuned | Δ (pp) |
| --- | --- | --- | --- |
| valid JSON | 95.9% | 100.0% | +4.1 |
| schema-valid call | 19.6% | 94.8% | +75.2 |
| correct tool | 36.1% | 85.6% | +49.5 |
| exact args | 4.1% | 74.2% | +70.1 |
| fully correct | 4.1% | 74.2% | +70.1 |
On the hard split (ambiguous wording, near-duplicate tools, distractors) the base model gets 0% fully correct; the tuned model gets 76.7%.

The story is clean and honest: the base 0.5B already knows JSON syntax (95.9% valid) but doesn't follow the tool schema (4.1% exact args). LoRA SFT teaches it the schema — without touching syntax it already had.
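To make the five metrics concrete, here is a minimal sketch of how one model output could be code-graded against a gold call. It is not the repo's grade.py: the two-tool `TOOLS` schema, the `grade` function, and the exact definition of each metric (for example, that "fully correct" means schema-valid plus exact args) are assumptions made for illustration.

```python
import json

# Hypothetical two-tool schema: tool name -> {argument name: expected type}.
TOOLS = {
    "get_weather": {"city": str},
    "send_email": {"to": str, "subject": str},
}

METRICS = ["valid_json", "schema_valid", "correct_tool", "exact_args", "fully_correct"]


def grade(raw: str, gold: dict) -> dict:
    """Score one raw model output against a gold {"tool", "args"} call."""
    result = dict.fromkeys(METRICS, False)
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return result  # not even parseable, so every metric stays False
    result["valid_json"] = True
    if not isinstance(call, dict):
        return result  # valid JSON, but not a tool-call object
    tool, args = call.get("tool"), call.get("args")
    spec = TOOLS.get(tool) if isinstance(tool, str) else None
    # Schema-valid: known tool, exactly the declared argument names, right types.
    result["schema_valid"] = (
        spec is not None
        and isinstance(args, dict)
        and set(args) == set(spec)
        and all(isinstance(args[name], kind) for name, kind in spec.items())
    )
    result["correct_tool"] = tool == gold["tool"]
    result["exact_args"] = result["correct_tool"] and args == gold["args"]
    result["fully_correct"] = result["schema_valid"] and result["exact_args"]
    return result


gold = {"tool": "get_weather", "args": {"city": "Tokyo"}}
print(grade('{"tool": "get_weather", "args": {"city": "Tokyo"}}', gold))
print(grade('{"tool": "get_weather", "args": {"location": "Tokyo"}}', gold))
```

The second call shows the failure mode described above: the output parses as JSON and names the right tool, but uses an argument name the schema does not declare, so it scores on valid JSON and correct tool only.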
Generalization to hand-written (non-templated) inputs
The training/test data is templated, so the obvious question is "does it generalize beyond the templates?" data/real_test.jsonl is 12 hand-written, naturalistic requests (e.g. "is it shorts weather in Athens right now or should I bring a jacket", "shoot Priya a message, subject 'Q3 numbers'…") — never seen in any template. Run python -m toolsmith.eval --testfile data/real_test.jsonl --tag _real:
| metric (12 hand-written) | base | tuned | Δ (pp) |
| --- | --- | --- | --- |
| schema-valid | 25.0% | 100.0% | +75.0 |
| correct tool | 41.7% | 83.3% | +41.6 |
| fully correct | 8.3% | 58.3% | +50.0 |
The lift holds on genuinely out-of-distribution phrasing — tool selection and schema adherence generalize strongly; fully_correct (58.3%) is honestly lower than the templated 74.2%, because exact-arg matching on free-form text (e.g. "next Thursday" → a date string) is harder. That gap is the real generalization cost, reported rather than hidden.
The training run (real loss curve)
LoRA rank 16 on attention+MLP projections (~8.8M trainable params, 1.75% of the model), 3 epochs, ~6.5 min on MPS. train_loss 4.3 → 0.35.
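The trainable-parameter figure can be sanity-checked by hand. The sketch below is a back-of-the-envelope count, not output from train.py. It assumes the published Qwen2.5-0.5B shapes (hidden size 896, MLP size 4864, 24 layers, 2 key/value heads of dimension 64, tied embeddings) and assumes the adapter targets all seven attention and MLP projections; neither assumption is confirmed by the text above.

```python
# Assumed Qwen2.5-0.5B shapes, taken from the published model config (not from this repo).
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 896, 4864, 24, 151_936
KV_DIM = 2 * 64  # 2 key/value heads x head dim 64 (grouped-query attention)
RANK = 16        # LoRA rank stated above

# Assumed LoRA targets per layer: name -> (in_features, out_features, has_bias).
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN, True),
    "k_proj": (HIDDEN, KV_DIM, True),
    "v_proj": (HIDDEN, KV_DIM, True),
    "o_proj": (HIDDEN, HIDDEN, False),
    "gate_proj": (HIDDEN, INTERMEDIATE, False),
    "up_proj": (HIDDEN, INTERMEDIATE, False),
    "down_proj": (INTERMEDIATE, HIDDEN, False),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to every targeted projection."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out, _ in PROJECTIONS.values())
    return per_layer * LAYERS


def base_param_count() -> int:
    """Frozen weights: tied embedding, per-layer projections and norms, final norm."""
    per_layer = sum(
        n_in * n_out + (n_out if bias else 0)
        for n_in, n_out, bias in PROJECTIONS.values()
    )
    per_layer += 2 * HIDDEN  # two RMSNorm weight vectors per layer
    return VOCAB * HIDDEN + per_layer * LAYERS + HIDDEN


trainable = lora_param_count(RANK)
total = base_param_count() + trainable
print(f"trainable={trainable:,} total={total:,} share={100 * trainable / total:.2f}%")
```

Under these assumptions the count comes to 8,798,208 trainable parameters out of 502,830,976 in total (base plus adapter), a 1.75% share, which is consistent with the ~8.8M and 1.75% reported above.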

Quickstart
pip install -e . # MCP server + agent + grader (light deps)
pip install -r requirements-train.txt # torch/transformers/peft/trl/... for training
python -m toolsmith.data.build # -> data/train.jsonl (243), data/test.jsonl (97)
python -m toolsmith.train # LoRA SFT -> artifacts/adapter + artifacts/loss.png
python -m toolsmith.eval # base vs tuned -> artifacts/eval_report.md + eval_chart.png
python -m toolsmith.agent --demo # offline recovery demo -> logs/run-demo.jsonl
Serve the tuned model over MCP
python -m toolsmith.mcp_server # stdio; exposes route_to_tool(request) + the 8 tools
mcp.json for Claude Desktop / Cursor:
{
  "mcpServers": {
    "tool-smith": {
      "command": "python",
      "args": ["-m", "toolsmith.mcp_server"],
      "cwd": "/path/to/tool-smith"
    }
  }
}
route_to_tool("What's the weather in Tokyo?") → {"tool": "get_weather", "args": {"city": "Tokyo"}, "valid": true, ...}.
The agent loop (validation + recovery + observability)
agent.py wraps the router: route → parse → validate against the tool schema → on failure, repair-retry with the error fed back → if still failing, fall back to a frontier/rule router → execute. Every step is appended to logs/run-*.jsonl (raw output, latency, validation verdict, retry count, recovery action). A real model-backed run (logs/run-model.jsonl) exercises all three paths:
ok=True recovery=none | What's the weather in Tokyo? (tuned, 1 attempt)
ok=True recovery=repair_retry | Pack for Berlin? ... rain there. (base model failed, retry fixed it)
ok=True recovery=frontier_fallback | ...what's sitting in refunds... (base failed x3 -> fallback router)
python -m toolsmith.logs_report → success rate, recovery breakdown, latency. The recovery logic is unit-tested with stub routers (tests/test_agent.py), so it's verified without a model.
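The route, validate, repair-retry, fallback flow described in this section can be sketched with stub routers, in the same spirit as the model-free tests. This is an illustrative outline rather than the code in agent.py: the one-tool `SCHEMA`, the `validate` and `run` functions, the repair prompt wording, the record fields, and the limit of two retries are all assumptions.

```python
import json
import time
from typing import Callable, Optional, Tuple

Router = Callable[[str], str]  # prompt in, raw model text out

# Hypothetical one-tool schema: tool name -> required argument names.
SCHEMA = {"get_weather": {"city"}}


def validate(raw: str) -> Tuple[Optional[dict], str]:
    """Return (call, "") for a schema-valid tool call, else (None, error message)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    tool = call.get("tool") if isinstance(call, dict) else None
    if not isinstance(tool, str) or tool not in SCHEMA:
        return None, "unknown or missing tool"
    args = call.get("args")
    if not isinstance(args, dict) or set(args) != SCHEMA[tool]:
        return None, f"args must be exactly {sorted(SCHEMA[tool])}"
    return call, ""


def run(request: str, primary: Router, fallback: Router, max_retries: int = 2) -> dict:
    """Route a request, retrying with the error fed back, then falling back."""
    start = time.perf_counter()
    record = {"request": request, "attempts": 0, "recovery": "none"}
    prompt = request
    for attempt in range(1 + max_retries):
        call, error = validate(primary(prompt))
        record["attempts"] = attempt + 1
        if call is not None:
            record["recovery"] = "none" if attempt == 0 else "repair_retry"
            break
        # Feed the validation error back so the next attempt can repair it.
        prompt = f"{request}\nPrevious output rejected: {error}. Reply with JSON only."
    else:
        # Every primary attempt failed validation: hand over to the fallback router.
        call, error = validate(fallback(request))
        record["recovery"] = "frontier_fallback"
    latency_ms = round((time.perf_counter() - start) * 1000, 2)
    record.update(ok=call is not None, call=call, latency_ms=latency_ms)
    return record


def flaky(prompt: str) -> str:
    """Stub router that only answers correctly after seeing an error message."""
    if "rejected" in prompt:
        return '{"tool": "get_weather", "args": {"city": "Berlin"}}'
    return "get_weather(Berlin)"


def broken(prompt: str) -> str:
    """Stub router that never produces JSON."""
    return "not json"


def rules(prompt: str) -> str:
    """Stub fallback router with a fixed, valid answer."""
    return '{"tool": "get_weather", "args": {"city": "Tokyo"}}'


print(json.dumps(run("Pack for Berlin?", flaky, rules)))
print(json.dumps(run("What's the weather in Tokyo?", broken, rules)))
```

Each returned record is a plain dict, so appending it to a JSONL file is one `json.dumps` per line. Because the routers are ordinary callables, the recovery paths can be exercised without loading a model.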
Layout
toolsmith/
  schema.py # the fixed 8-tool toolbox + JSON validator (one source of truth)
  data/build.py # deterministic dataset; TRAIN/TEST templates are DISJOINT + a hard split
  train.py # PEFT LoRA SFT via TRL SFTTrainer (MPS), saves adapter + loss.png
  router.py # load base (+adapter) and turn a request into a tool-call string
  eval.py # base vs tuned, code-graded per bucket -> report.md + chart.png + csv
  grade.py # exact tool/args grading (no LLM judge needed for routing)
  mcp_server.py # FastMCP: route_to_tool + 8 mock tools
  agent.py # validate / repair-retry / frontier-fallback loop + JSONL logging
  logs_report.py # summarize agent runs
data/ # committed train/test jsonl
artifacts/ # committed: adapter/, loss.png, eval_report.md, eval_chart.png, eval.csv
logs/ # committed real agent traces
tests/ # pytest (grading + agent recovery), model-free
Limitations & next steps
Stated plainly — knowing the limits is part of the work:
Teaching-scale: 0.5B model · LoRA (PEFT) · SFT-only — an adapter (~35 MB), not a full or from-scratch fine-tune, not algorithm research, not large-scale/distributed training.
Synthetic data: the 243 train / 97 test cases are templated (though TRAIN/TEST templates are disjoint + a hard split, and the 12 hand-written cases above show real generalization). Real user traffic is messier; the honest free-form fully_correct is 58% vs 74% templated.
Mock tools: the 8 tool bodies are stubs; the contribution is the routing model + eval + MCP serving + agent loop, not the tools.
Single base model, no judge in the headline metric (routing has checkable ground truth, so it's code-graded; the optional LLM-judge column needs an API key).
Next steps if taken further: train on real (de-identified) request logs, add tool-arg-type coercion in the agent, compare LoRA ranks / a 1.5B base, add function-calling-format export (OpenAI/Anthropic tool schemas), and a serving latency benchmark.
Every number here comes from an actual local run, regenerable (fixed seed; requirements-train.txt pins the exact stack). No placeholder figures.
License
MIT