# xai-toolkit

ML model explainability as plain-English narratives, exposed via MCP.
Ask your model why in natural language. Get a deterministic, reproducible English answer backed by SHAP analysis — directly inside VS Code Copilot.
```
User: "Why was sample 42 classified as degraded?"

Copilot: The model classified this sample as degraded (probability: 0.91)
primarily because of 3 factors: total_acid_number = 4.8 (pushing toward
the positive class by +0.28), water_content_ppm = 312.0 (+0.19),
and viscosity_40c = 48.2 (+0.14). The top opposing factor is
flash_point = 215.0 (pushing away from the positive class by -0.06).
```

No SHAP plots to decipher. No code to write. English that a decision-maker can act on.
## Quick Start

```shell
# 1. Install dependencies
uv sync

# 2. Train the models (run once)
uv run python scripts/train_lubricant_model.py

# 3. Run the test suite (should show 100+ passing)
uv run python -m pytest tests/ -v

# 4. Open VS Code — the MCP server starts automatically via .vscode/mcp.json
#    Open Copilot chat in agent mode and ask: "What models are available?"
```

Requirements: Python 3.11+, uv, VS Code with GitHub Copilot.
## What It Does

Seven MCP tools answer the most common explainability questions:

| Ask | Returns |
| --- | --- |
| "What models are available?" | Model IDs, types, accuracy |
| "Tell me about the data" | Sample count, class distribution, stats |
| "What does this model do?" | Model type, accuracy, top 5 features |
| "Which features matter most?" | Ranked feature importance with magnitudes |
| "Why was sample N classified as X?" | Narrative + SHAP bar chart |
| "Show me the full SHAP breakdown" | Narrative + waterfall plot |
| "How does feature F affect predictions?" | Narrative + PDP/ICE plot |
Every response includes a complete audit trail: model ID, timestamp, tool version, and a SHA256 hash of the input data. Same question + same data = same answer, every time.
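The audit-trail guarantee can be sketched in a few lines. This is an illustrative sketch, not the project's actual `explainers.py` code; the function name `audit_fields` and the exact field names are assumptions:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_fields(model_id: str, rows: list[dict], tool_version: str = "0.1.0") -> dict:
    """Build the audit-trail portion of a tool response.

    Serializing the input rows with sorted keys and no extra whitespace
    makes the SHA256 digest deterministic: the same data always produces
    the same fingerprint, regardless of dict insertion order.
    """
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "model_id": model_id,
        "tool_version": tool_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when the answer was produced
        "data_sha256": hashlib.sha256(payload).hexdigest(),   # fingerprint of the input data
    }
```

Because the hash covers the serialized input rather than any intermediate state, two runs over the same data can be compared byte-for-byte even if they happen on different machines.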
## Architecture

```
VS Code Copilot (Sonnet 4.5)
  │ natural language question
  ▼
MCP Client (built into Copilot)
  │ JSON-RPC over stdio
  ▼
xai-toolkit MCP Server (server.py — thin adapter only)
  │
  ├── explainers.py   SHAP values, PDP/ICE, global importance, data hashing
  ├── narrators.py    Structured data → deterministic English paragraphs
  ├── plots.py        matplotlib → base64 PNG (bar, waterfall, PDP+ICE)
  ├── schemas.py      Pydantic contracts — single source of truth
  └── registry.py     ModelRegistry — load and serve multiple model types
```

Design principle: The LLM is the presenter, not the analyst. All computation and narrative generation happens in pure Python. The LLM chooses the right tool and wraps the pre-computed result conversationally. This guarantees reproducibility — the LLM cannot hallucinate SHAP values.
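A deterministic narrator in this style might look like the following. This is a sketch of the idea, not the actual `narrators.py` API; the function name and sentence template are assumptions:

```python
def narrate_prediction(label: str, probability: float,
                       contributions: dict[str, float], top_n: int = 3) -> str:
    """Turn SHAP-style feature contributions into one English sentence.

    Sorting by absolute magnitude (ties broken by feature name) fixes the
    output order, so the same inputs always yield the same text — no LLM
    is involved in producing the narrative itself.
    """
    ranked = sorted(contributions.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    factors = ", ".join(f"{name} ({value:+.2f})" for name, value in ranked[:top_n])
    return (f"The model classified this sample as {label} "
            f"(probability: {probability:.2f}) primarily because of: {factors}.")
```

The LLM then only paraphrases around this pre-computed string, so the numbers it presents are exactly the numbers SHAP produced.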
## Project Layout

```
xai-mcp/
├── src/xai_toolkit/         # Source package
├── tests/                   # 100+ pytest tests
├── docs/
│   ├── decisions/           # 7 Architecture Decision Records
│   └── scalability-path.md  # PoC → Production roadmap
├── scenarios/               # YAML acceptance criteria (day1–day5)
├── scripts/                 # Model training scripts
├── models/                  # Trained model artifacts
├── data/                    # Test datasets
└── .vscode/mcp.json         # MCP server configuration
```

## Running Tests
```shell
# Full test suite
uv run python -m pytest tests/ -v

# Write snapshot golden files (run once after first install)
uv run python -m pytest tests/test_snapshots.py --snapshot-update -v

# Just the fast unit tests (no model loading)
uv run python -m pytest tests/test_narrators.py tests/test_explainers.py tests/test_reproducibility.py -v

# Second model integration tests (requires trained RF model)
uv run python -m pytest tests/test_second_model.py -v
```

## Adding a New Model
1. Train and save your model following the convention in `scripts/train_lubricant_model.py`.
2. Add one line to the startup block in `server.py`: `registry.load_from_disk("your_model_id", MODELS_DIR, DATA_DIR)`.
3. Run `uv run python -m pytest tests/test_second_model.py -v`; all tools should work for your new model with zero code changes.
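The artifact convention behind step 1 could look roughly like this. It is a hypothetical sketch using stdlib `pickle` as a stand-in; the real training scripts may use joblib or another serializer, and may also save companion data files:

```python
import pickle
from pathlib import Path

def save_model(model: object, model_id: str, models_dir: str = "models") -> Path:
    """Write one artifact per model id.

    Naming the file after the model id is what lets a registry locate it
    later with nothing but the id, e.g. load_from_disk("your_model_id", ...).
    """
    path = Path(models_dir) / f"{model_id}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path
```

Whatever the exact format, the key point is that the id in the filename and the id passed to the registry must match, so no per-model glue code is needed.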
See AGENTS.md for full coding standards and architecture guidance.
## Architecture Decision Records

Seven decisions documented in docs/decisions/:

| ADR | Decision |
| --- | --- |
| 001 | Pure functions separated from MCP layer |
| 002 | Deterministic narratives — no LLM calls |
| 003 | stdio → Streamable HTTP migration path |
| 004 | Pydantic schemas as single source of truth |
| 005 | Consistent tool output structure |
| 006 | ModelRegistry pattern |
| 007 | Single-agent architecture |
## Production Path
This local PoC becomes a production service by changing infrastructure, not code.
See docs/scalability-path.md for the full roadmap,
including Databricks integration, MLflow model registry, Unity Catalog data access,
and the integration path with the existing Kedro-based XAI pipeline.
Estimated effort to production: 2–4 weeks (platform team, not application code).
## Related
FastMCP — MCP server framework used here
SHAP — explainability library
Model Context Protocol — the standard this implements