heracleum-tox-mcp-server

by chemagents

An MCP server that reproduces the results of:

Rassabina, A.E.; Fedorov, M.V. Analysis of the Toxicological Profile of Heracleum sosnowskyi Manden. Metabolites Using In Silico Methods. Plants 2025, 14, 3253. https://doi.org/10.3390/plants14213253

The paper runs entirely on the proprietary Syntelly platform. Because Syntelly is not openly accessible, this server rebuilds the same pipeline from open-source analogues of every Syntelly module. These include the same model architectures Syntelly itself uses (fingerprint-based CatBoost and fragment-based XGBoost; Sosnin et al., Molecules 2024, 29, 1826, the platform paper cited as ref. [36]), trained on the same open datasets the paper names (TOXRIC / ChemIDplus / PyTDC).

Syntelly → open-source analogue mapping

| Syntelly module (in the paper) | What it does | Open-source analogue used here |
| --- | --- | --- |
| Canonical SMILES search | name → SMILES, standardisation | RDKit + PubChemPy |
| SynMap (clustering, §2.3) | parametric multiscale t-SNE + differential fingerprints | differential fingerprint (Bemis–Murcko scaffold ECFP) + agglomerative (Tanimoto) + t-SNE |
| LD50 (mouse) prediction (§2.4) | fingerprint-CatBoost regression, RMSE | CatBoost on ECFP4+descriptors, trained on TDC LD50_Zhu (TOXRIC hook for exact routes) |
| General toxicity (§3.5) | CatBoost/XGBoost classification, ROC-AUC | XGBoost on fragment descriptors, trained on TDC DILI / hERG / Carcinogens_Lagunin |
| Applicability Domain (§2.5) | kNN(k=5) distance → normalise → Gaussian → % | Tanimoto kNN(k=5) + Gaussian, identical formula |
| Synthesis cost (§2.6) | USD/g over 1–6 stages | ASKCOS retrosynthesis (same engine as chemical-mcp-server) + heuristic fallback |
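
The applicability-domain row above (kNN distance, normalisation, Gaussian, percent) can be illustrated with a minimal pure-Python sketch. It is not the server's code: the normalisation constant `ref_distance`, the Gaussian form `exp(-z²/2)`, and the use of the mean over the k neighbours are assumptions made for illustration, and fingerprints are modelled as plain sets of "on" bit indices.

```python
import math


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def ad_percent(query: frozenset, training: list, k: int = 5,
               ref_distance: float = 0.5) -> float:
    """Applicability-domain score in percent (illustrative assumptions).

    1. Tanimoto distance (1 - similarity) to every training compound.
    2. Mean distance to the k nearest neighbours.
    3. Normalise by ref_distance (hypothetical constant).
    4. Map through a Gaussian so distance 0 gives 100 %.
    """
    if not training:
        raise ValueError("training set is empty")
    distances = sorted(1.0 - tanimoto(query, t) for t in training)
    nearest = distances[:k]
    mean_distance = sum(nearest) / len(nearest)
    z = mean_distance / ref_distance
    return 100.0 * math.exp(-0.5 * z * z)
```

A query identical to its five nearest training compounds scores 100 %, and the score decays smoothly as the neighbours become less similar, which is the behaviour the banded AD output relies on.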

Tools

| Tool | Reproduces | What it returns |
| --- | --- | --- |
| dataset_overview | §3.1 | reconstructed metabolite dataset, class & cluster breakdown |
| chemical_space_clustering | Fig. 1 | five chemical-family clusters (A–E) + t-SNE map + outliers |
| predict_ld50 | Fig. 2 / §3.3 | live CatBoost acute-LD50; cluster ranking + per-route table |
| predict_general_toxicity | Table 2 / §3.5 | hepatotox / DILI / cardiotox / carcinogenicity for cluster E + heatmap |
| applicability_domain | Fig. S1/S2 / §2.5 | kNN(k=5)+Gaussian AD % per cluster-E compound, banded |
| estimate_synthesis_cost | §3.6 | USD/g (published value, ASKCOS, or heuristic) |
| predict_molecule_profile | | full in-silico tox profile for any molecule (name/SMILES) |
| model_quality | Table S6 | trained-model RMSE / ROC-AUC |
| reproduce_all | | recomputes headline numbers and compares to the paper |
| reproduce_claims | all | the paper's conclusions, each restated with reproduced numbers |

Each tool returns {"answer": ..., "metadata": ...}. Figures are saved as PNG to a local artifacts dir (HERACLEUM_ARTIFACTS_DIR) or, if S3 is configured, uploaded and returned as presigned URLs (same pattern as chemical-mcp-server / tox-antitargets-mcp-server).
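
As a rough illustration of the `{"answer": ..., "metadata": ...}` envelope, the sketch below unpacks a tool result on the client side. Only the two top-level keys come from this README; the helper name `split_tool_result` and the example `metadata` field are hypothetical.

```python
import json


def split_tool_result(raw: str):
    """Parse a tool's JSON result and return (answer, metadata).

    Assumes the result is a JSON object with the two top-level keys
    described in this README; anything else is rejected.
    """
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("tool result is not a JSON object")
    missing = {"answer", "metadata"} - set(payload)
    if missing:
        raise ValueError(f"tool result lacks keys: {sorted(missing)}")
    return payload["answer"], payload["metadata"]


# Hypothetical result; the 'figure' key is illustrative only.
example = '{"answer": "Cluster E is the most toxic.", "metadata": {"figure": "ld50.png"}}'
answer, metadata = split_tool_result(example)
```

Validating the envelope up front keeps downstream code from failing later on a missing key.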

Reproduction fidelity

reproduce_all and pytest tests/ assert these against the paper:

| Metric | Paper | This server |
| --- | --- | --- |
| Dataset size | 225 metabolites | 225 (exact, from Supplementary S1–S5) |
| Cluster sizes A/B/C/D/E | 25/22/132/21/22 | 25/22/132/21/22 (exact) |
| Chemical-space clusters | 5 families (A–E) | all 5 recovered, ~95 % family agreement |
| Most-toxic cluster | E (furanocoumarins) | E |
| Cluster-E IV LD50 range | 62–450 mg/kg | 62–450 (bergamottin/phellopterin 62, umbelliferone 450) |
| LD50 regression error | RMSE 0.41–0.87 (Table S6) | RMSE 0.60 |
| Tox classification ROC-AUC | 0.79–0.93 (Table S6) | 0.80–0.87 |
| Synthesis-cost spread | $0.19–311/g | $0.19 / $24.9 / $311 (exact) |

The full 225-compound dataset (standardized SMILES + SynID + cluster A–E) is reproduced exactly from the paper's Supplementary Tables S1–S5. parse_supplementary.py parses those tables into server/data/supplementary_smiles.csv, and build_dataset.py then assembles the dataset, merging the cluster-E Table 2 toxicity values and resolving compound names via PubChem. The paper's own model-quality numbers (Supplementary Table S6) are bundled for comparison (model_quality / reproduce_all).

Documented open-analogue divergences (the method is faithful, but models trained on the small open datasets disagree with Syntelly's proprietary models):

  • DILI / hepatotoxicity: the open TDC DILI model (n=475) predicts most cluster-E furanocoumarins as non-hepatotoxic, the opposite of Syntelly's "all DILI-toxic" result. The applicability domain flags these predictions as moderate-reliability, which indicates that the open set under-covers furanocoumarins. This is the one paper claim (C5) that does not reproduce, and it is reported as such.

  • Cardiotoxicity: the open hERG proxy is more conservative than Syntelly's cardiotox model (it flags furanocoumarins as hERG blockers; the paper found none).

  • Per-route LD50: TOXRIC's six per-route mouse sets are not openly scriptable, so all routes share the open acute-LD50 model unless you supply per-route CSVs (see below). The cluster ranking (E most toxic) is the robust open reproduction.

The clustering uses a differential fingerprint (ECFP of the Bemis–Murcko scaffold) as the open analogue of SynMap's differential fingerprints + parametric t-SNE (Karlov/Sosnin/Tetko/Fedorov, ACS Omega 2021). Emphasising the core scaffold separates furanocoumarins (E) from simple aromatics (D) and recovers all five families at ~95 % agreement. Set HERACLEUM_CLUSTER_FINGERPRINT=ecfp4 to fall back to whole-molecule ECFP4, which reaches ~86 % agreement and merges D into E.

Run locally

git clone https://github.com/chemagents/heracleum-tox-mcp-server
cd heracleum-tox-mcp-server
cp .env.example .env
uv sync
uv pip install --no-deps "PyTDC==0.4.1"     # open datasets; pins old rdkit-pypi, so --no-deps
uv run python prepare_models.py             # train & cache the open models (downloads TDC data)
uv run python -m server.heracleum_server    # serves http://0.0.0.0:7331/mcp

# The 225-compound dataset is already bundled (server/data/heracleum_metabolites.csv).
# To regenerate it from the paper's Supplementary PDF:
#   pdftotext -layout plants-3875800-supplementary.pdf supp.txt
#   uv run python parse_supplementary.py supp.txt   # -> server/data/supplementary_smiles.csv
#   uv run python build_dataset.py                  # merges Table 2 refs + PubChem names

Run with Docker

docker compose up -d --build      # host port 7336 -> container 7331

To run it inside the CoScientist stack instead, add this repo as a service in mcp-servers/docker-compose.yml (the CoScientist repo already includes such an entry).

The Docker build installs PyTDC and pre-trains the models (best-effort; if there is no network at build time the server trains them lazily on first request).

Attach to CoScientist

CoScientist discovers MCP tools via RAG (Postgres + Qdrant). Register this server once:

# from the CoScientist repo root, with the RAG stack running and .env configured
python scripts/rag_tools/cli.py load mcp-servers/heracleum-tox-mcp-server/rag_registration.json
# or directly:
python scripts/rag_tools/cli.py add \
  --url http://localhost:7336/mcp \
  --name heracleum-tox \
  --description "In-silico toxicology of Heracleum sosnowskyi metabolites; LD50, hepato/DILI/cardio/carcinogenicity, furanocoumarins (Rassabina & Fedorov 2025)"

After registration the ToolRetrieverAgent surfaces these tools for plant-metabolite / toxicity / LD50 / furanocoumarin queries, and ExperimentAgent (FEDOT.MAS) calls them by URL. If CoScientist runs in the same Docker network, register the in-network URL instead: http://heracleum-tox-mcp-server:7331/mcp.

See REPRODUCTION_QUESTIONS.md for the exact prompts to ask CoScientist (one per paper assertion, plus a single "reproduce everything" prompt).

Exact per-route LD50 reproduction (optional)

The paper predicts LD50 for six mouse routes from TOXRIC. To reproduce those exactly, place one TOXRIC CSV per route in HERACLEUM_LD50_DATA_DIR. Each file has the columns smiles,y, where y = -log10(LD50 in mol/kg), and is named ld50_<route>.csv with <route> one of oral, iv, ip, sc, skin, im. Route-specific models then train automatically.
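
If the source values are in mg/kg, they have to be converted to the y = -log10(mol/kg) target first. The following is a minimal sketch of that conversion. It assumes each record carries a SMILES string, an LD50 in mg/kg, and a molecular weight in g/mol; the function names and the presence of a `smiles,y` header row are assumptions.

```python
import csv
import io
import math


def ld50_to_y(ld50_mg_per_kg: float, mol_weight_g_per_mol: float) -> float:
    """Convert an LD50 in mg/kg to y = -log10(LD50 in mol/kg)."""
    if ld50_mg_per_kg <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("LD50 and molecular weight must be positive")
    mol_per_kg = (ld50_mg_per_kg / 1000.0) / mol_weight_g_per_mol
    return -math.log10(mol_per_kg)


def rows_to_csv_text(rows) -> str:
    """Render (smiles, ld50_mg_per_kg, mol_weight) records as smiles,y CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["smiles", "y"])  # assumed header row
    for smiles, ld50, mol_weight in rows:
        writer.writerow([smiles, f"{ld50_to_y(ld50, mol_weight):.4f}"])
    return buffer.getvalue()
```

For example, the 450 mg/kg intravenous value quoted above for umbelliferone (molecular weight about 162.14 g/mol) maps to a y of roughly 2.56. The returned text would then be written to, say, ld50_iv.csv inside HERACLEUM_LD50_DATA_DIR.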

Tests

uv run pytest tests -v                 # all (trains models on first run, then cached)
uv run pytest tests -v -m "not slow"   # fast deterministic checks only

License / data

Open datasets via Therapeutics Data Commons (PyTDC) and TOXRIC. Please cite Rassabina & Fedorov (2025) when using these results, and TDC / the Syntelly platform paper (Sosnin et al., Molecules 2024, 29, 1826) for the methods.
