Feature Evaluation MCP Server
An MCP (Model Context Protocol) server that provides a complete toolkit for evaluating, selecting, and comparing features in classification datasets. Built with FastMCP, scikit-learn, pandas, and matplotlib.
Quick Start
1. Install
cd FeatureEngineering
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
2. Run the server
python feature_eval_server.py
The server starts in stdio transport mode, ready for any MCP client.
3. Connect from Claude Code
Add this to your MCP config (~/.claude/claude_desktop_config.json or project .mcp.json):
{
"mcpServers": {
"feature-eval": {
"command": "/home/jai/MachineLearnin-Repo/FeatureEngineering/.venv/bin/python",
"args": ["/home/jai/MachineLearnin-Repo/FeatureEngineering/feature_eval_server.py"]
}
}
}

Tools Reference (13 tools)
Data Loading
| Tool | Description |
|---|---|
| load_csv | Load any CSV file with a specified target column |
|  | Load a built-in sample dataset |
|  | Descriptive statistics, missing values, and dtypes |
|  | Show all currently loaded datasets |
Feature Importance
| Tool | Description |
|---|---|
| feature_importance_tree | Importance via Random Forest, Gradient Boosting, or Decision Tree |
| permutation_importance_analysis | Permutation-based importance on a held-out test set |
| statistical_feature_scores | Univariate scores: ANOVA F-test, Chi-squared, or Mutual Information |
Correlation Analysis
| Tool | Description |
|---|---|
| correlation_matrix | Pairwise correlation heatmap with high-correlation pair detection |
| target_correlation | Each feature's correlation with the target variable |
Feature Selection
| Tool | Description |
|---|---|
| recursive_feature_elimination | RFE using Logistic Regression or Decision Tree |
| select_k_best | Top-K univariate feature selection |
Model Evaluation
| Tool | Description |
|---|---|
| evaluate_model | Train/test split + cross-validation with full classification report |
|  | Compare CV accuracy across different feature subsets |
Every tool that produces a visualization returns a chart_base64_png field containing a PNG image encoded in base64.
Dataset Compatibility
The server works with any tabular classification dataset, not just the built-in samples.
Automatic preprocessing at load time
| Data issue | How it's handled |
|---|---|
| Categorical features (text columns) | Ordinal-encoded automatically; all tools see numeric data |
| Missing values | Imputed using the column median (numeric) at load time |
| Non-numeric target | Label-encoded to integers automatically |
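The load-time preprocessing in the table above can be approximated in a few lines of pandas/scikit-learn. This is a sketch of the described behavior, not the server's actual code; the `preprocess` helper and the toy DataFrame are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

def preprocess(df: pd.DataFrame, target_column: str):
    """Approximate the server's load-time cleanup described above."""
    X = df.drop(columns=[target_column])
    y = df[target_column]
    # Ordinal-encode text columns so every downstream tool sees numeric data
    cat_cols = X.select_dtypes(include=["object", "category"]).columns
    if len(cat_cols):
        X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])
    # Fill missing numeric values with the column median
    X = X.fillna(X.median(numeric_only=True))
    # Label-encode a non-numeric target to integer classes
    if y.dtype == object:
        y = pd.Series(LabelEncoder().fit_transform(y), name=y.name)
    return X, y

df = pd.DataFrame({
    "emails_sent_24h": [3, 250, np.nan, 5],
    "department": ["Engineering", "Sales", "Sales", "HR"],
    "compromised": ["no", "yes", "yes", "no"],
})
X, y = preprocess(df, "compromised")
```

The missing `emails_sent_24h` cell is filled with the median of the remaining values, and both text columns become integers.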
What the server does NOT support
Regression tasks — classification only (target must be discrete classes)
Multi-label targets — single target column only
Image/text/time-series features — tabular data only
How to Interpret Results — A Complete Guide
This section explains what each analysis technique measures, how to read the numbers, and what decisions to make based on the results. We use the Mailbox Compromise Detection case study as a running example.
Understanding the Problem
We have 5,000 email accounts with 20 features each. The goal is to classify whether an account is compromised (hacked) or legitimate. The dataset is imbalanced: 4,000 legit vs 1,000 compromised (4:1 ratio).
The 20 features fall into categories:
Login behavior (6 features): how the user logs in (countries, IPs, times, devices)
Email sending patterns (6 features): what emails look like (volume, recipients, links)
Account anomalies (5 features): suspicious changes (rules, forwarding, MFA)
Noise features (3 features): metadata that shouldn't predict compromise (mailbox size, account age, department)
The key question: Which of these 20 features actually matter for detecting compromise?
Step 1 — Loading and Preprocessing
load_csv("data/mailbox_compromise.csv", target_column="compromised", dataset_name="mailbox")

{
"shape": [5000, 21],
"categorical_features_encoded": ["department"],
"missing_values_imputed": 450,
"class_distribution": {"0": 4000, "1": 1000}
}

How to interpret:
shape: [5000, 21] — 5,000 rows (accounts) and 21 columns (20 features + 1 target). This is a reasonably sized dataset for ML.
categorical_features_encoded: ["department"] — The department column contained text values like "Engineering", "Sales", etc. The server automatically converted these to numbers (0, 1, 2...) so ML algorithms can process them.
missing_values_imputed: 450 — 450 cells had missing data (NaN). These were filled with the median value of their column. This prevents models from crashing on NaN, but be aware that imputation adds artificial values.
class_distribution: {"0": 4000, "1": 1000} — The classes are imbalanced (80% legit, 20% compromised). A model that always predicts "legit" would score 80% accuracy, so any useful model must beat this baseline significantly.
Decision: The 80/20 split means you should pay attention to recall for class 1 (compromised) — a model with 95% accuracy could still be missing half of all compromised accounts.
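To see why accuracy alone is misleading at a 4:1 imbalance, a majority-class baseline makes the point concretely. This sketch (not part of the server) uses scikit-learn's DummyClassifier on synthetic labels matching the 4,000/1,000 split:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 4:1 imbalance as in the mailbox data: 4000 legit (0), 1000 compromised (1)
y = np.array([0] * 4000 + [1] * 1000)
X = np.zeros((5000, 1))  # features are irrelevant to a majority-class baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(accuracy_score(y, pred))  # 0.8 -- looks respectable
print(recall_score(y, pred))    # 0.0 -- catches zero compromised accounts
```

An 80% accuracy score is achievable while missing every single compromised account, which is why recall for class 1 is the metric to watch.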
Step 2 — Feature Importance (Random Forest)
feature_importance_tree("mailbox", method="random_forest")

Rank Feature Importance
1. emails_sent_24h 0.3162
2. external_recipients_24h 0.2096
3. emails_with_links_24h 0.1564
4. send_spike_ratio 0.1278
5. emails_with_attachments_24h 0.0726
6. inbox_rules_changed_7d 0.0348
...
18. mailbox_size_mb 0.0002
19. password_changed_24h 0.0002
20. department 0.0001

What it measures: Random Forest builds many decision trees and tracks how much each feature reduces classification error when used as a split point. Features that create the cleanest separations between classes get higher importance scores. All scores sum to 1.0.
How to interpret the numbers:
Importance = 0.3162 for emails_sent_24h means this single feature is responsible for ~32% of the model's decision-making power. It is the strongest individual predictor.
Top 5 features sum to 0.88 — meaning 88% of the model's classification ability comes from just 5 of 20 features. This is a strong sign that most features can be dropped.
Importance < 0.001 (features 15-20) means these features almost never help distinguish compromised from legit accounts. The model rarely picks them as split points.
department at 0.0001 — The department an employee works in has virtually no bearing on whether their account is compromised. Attackers don't care if you're in Sales or Engineering.
What the gap tells you: There's a clear "elbow" between feature #5 (0.0726) and feature #6 (0.0348) — importance drops by half. This suggests the natural boundary is around 5 features.
Caveat: RF importance is biased toward high-cardinality features (features with many unique values). A continuous feature like emails_sent_24h (many values) may score higher than a binary feature like forwarding_rule_added (only 0/1) even if both are equally useful. That's why we use multiple methods.
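As a minimal illustration of tree-based importance (not the server's implementation), here is the scikit-learn pattern on a synthetic 20-feature dataset with 5 informative features, a stand-in for the mailbox data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 5 informative features out of 20, as in the case study
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=5, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
imp = rf.feature_importances_      # normalized: the scores sum to 1.0
ranked = np.argsort(imp)[::-1]     # feature indices, most important first
```

Because the scores sum to 1.0, each value can be read as a share of the model's total decision-making power, exactly as in the ranking above.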
Step 3 — Feature Importance (Gradient Boosting)
feature_importance_tree("mailbox", method="gradient_boosting")

1. emails_sent_24h 0.9687
2. external_recipients_24h 0.0150
3. emails_with_links_24h 0.0111

What it measures: Gradient Boosting builds trees sequentially — each tree corrects the mistakes of the previous one. Importance measures how much each feature contributed across all correction steps.
How to interpret:
0.9687 for emails_sent_24h — GB found that nearly all classification errors can be fixed by this single feature; it alone solves 97% of the problem.
Why so concentrated vs RF? GB is greedier — it uses the single best feature first, then only looks at others for residual errors. RF spreads usage across correlated features more evenly. Both are valid views.
Decision: When one method says "this feature is dominant" and another says "these 5 features are all important," it usually means the top feature is genuinely powerful, and the others provide overlapping information. Don't dismiss the others — they act as backup signals.
Step 4 — Permutation Importance
permutation_importance_analysis("mailbox", n_repeats=10)

1. emails_sent_24h mean_decrease=0.0441 std=0.0029
2. external_recipients_24h mean_decrease=0.0009
3. emails_with_links_24h mean_decrease=0.0006
...
4-20. (all remaining features) mean_decrease=0.0000

What it measures: After training a model, we randomly shuffle one feature's values and check how much accuracy drops. If accuracy drops a lot, the feature was important. If it doesn't change, the feature was irrelevant (or redundant with others).
How to interpret the numbers:
mean_decrease=0.0441 means that shuffling emails_sent_24h caused accuracy to drop by 4.41 percentage points (e.g., from 100% to 95.6%). This is a significant real-world impact.
std=0.0029 means across 10 shuffles, the drop was consistent (low variance). This result is reliable.
mean_decrease=0.0000 for features 4-20 does NOT necessarily mean these features are useless in isolation. It means that given the other features already in the model, removing them makes no difference. They are redundant — their information is already captured by the top features.
Key insight — Redundancy vs Uselessness:
login_time_deviation_hrs shows 0.0 permutation importance here but scored well in ANOVA (F=1,965). It IS predictive on its own, but adds nothing when emails_sent_24h is already present.
mailbox_size_mb shows 0.0 here AND near-zero everywhere else. It is genuinely useless.
Decision: Permutation importance on a test set is the most honest measure of real-world impact. Use it as the final arbiter when tree-importance and statistical tests disagree.
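The same held-out-set measurement is available directly in scikit-learn via `permutation_importance`. A hedged sketch on synthetic data (the server's own implementation may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Shuffle each feature on the held-out set and measure the accuracy drop
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
# result.importances_mean -> mean accuracy decrease per feature
# result.importances_std  -> stability of that estimate across repeats
```

Measuring on the test split (not the training data) is what makes this an honest estimate of real-world impact.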
Step 5 — ANOVA F-Test Scores
statistical_feature_scores("mailbox", method="f_classif")

1. emails_sent_24h F=25,747 p≈0
2. external_recipients_24h F=16,418 p≈0
...
18. department F=1.19 p=0.275
19. account_age_days F=1.07 p=0.301
20. mailbox_size_mb F=0.65 p=0.420

What it measures: ANOVA F-test asks: "Is the mean value of this feature significantly different between the classes?" For each feature, it compares the variance between classes to the variance within classes. A high F-score means the classes have very different distributions for that feature.
How to interpret the numbers:
F-score: Higher = more separation between classes. There's no universal threshold — compare features against each other. F=25,747 vs F=1.19 is a 20,000x difference, which is massive.
p-value: The probability that this score happened by chance.
p ≈ 0 (features 1-17): The difference is statistically significant. There is a real relationship between this feature and the target.
p = 0.275 (department): There's a 27.5% chance this apparent pattern is just noise. By convention, p > 0.05 means not significant — we cannot confidently say this feature helps.
p = 0.420 (mailbox_size_mb): 42% chance of being noise. Clearly not useful.
How to read the tiers:
| F-Score Range | Meaning in this dataset |
|---|---|
| > 10,000 | Strong predictor — clear class separation |
| 1,000 - 10,000 | Moderate predictor — useful but overlapping distributions |
| 100 - 1,000 | Weak predictor — some signal but lots of overlap |
| < 10 (p > 0.05) | Not significant — this feature is noise |
Caveat: ANOVA only measures linear separation. A feature that perfectly separates classes in a non-linear way (e.g., compromised accounts have EITHER very high OR very low values) might score poorly on F-test but well on Mutual Information.
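Both univariate scorers come straight from scikit-learn's `feature_selection` module. A minimal sketch on synthetic data, showing F-scores and p-values alongside the Mutual Information scores discussed in the next step:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=1)

# ANOVA F-test: per-feature F statistic and p-value (linear separation only)
f_scores, p_values = f_classif(X, y)

# Mutual Information: non-negative, captures non-linear relationships too
mi_scores = mutual_info_classif(X, y, random_state=1)
```

Comparing the two rankings on your own data is a quick way to spot features that are only non-linearly predictive.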
Step 6 — Mutual Information Scores
statistical_feature_scores("mailbox", method="mutual_info")

1. emails_sent_24h MI=0.487
2. external_recipients_24h MI=0.445
3. emails_with_links_24h MI=0.421
4. send_spike_ratio MI=0.417
5. emails_with_attachments_24h MI=0.371
--- gap ---
6. login_countries_7d MI=0.146
...
18. department MI=0.002
19. account_age_days MI=0.000
20. mailbox_size_mb MI=0.000

What it measures: Mutual Information (MI) measures how much knowing a feature's value reduces uncertainty about the class. Unlike ANOVA, MI captures any kind of relationship — linear, non-linear, categorical, or complex. MI = 0 means the feature is completely independent of the target. Higher = more informative.
How to interpret the numbers:
MI = 0.487 — Knowing emails_sent_24h eliminates about 49% of the uncertainty about whether an account is compromised. This is very high.
MI = 0.000 for mailbox_size_mb — Knowing the mailbox size tells you absolutely nothing about whether the account is compromised. Zero information gained.
The gap between 0.371 and 0.146 (feature 5 to 6) is the same "elbow" we saw in RF importance. The top 5 features are in a different league.
Why MI and ANOVA agree here: When relationships are mostly linear (which they are in this dataset), ANOVA and MI tend to agree. When they disagree, trust MI for the more complete picture.
Step 7 — Correlation Matrix
correlation_matrix("mailbox", threshold=0.7)

emails_sent_24h <-> external_recipients_24h r=0.80
emails_sent_24h <-> send_spike_ratio r=0.79
emails_sent_24h <-> emails_with_links_24h r=0.78
external_recipients_24h <-> emails_with_links_24h r=0.75
emails_sent_24h <-> emails_with_attachments_24h r=0.74
external_recipients_24h <-> emails_with_attachments_24h r=0.71

What it measures: Pearson correlation (r) measures the linear relationship between two features. r = +1 means they move together perfectly, r = -1 means they move in opposite directions, r = 0 means no linear relationship.
How to interpret the numbers:
| r value | Meaning |
|---|---|
| 0.9 - 1.0 | Near-duplicate features — one can replace the other |
| 0.7 - 0.9 | Strongly correlated — significant redundancy |
| 0.4 - 0.7 | Moderately correlated — some shared information |
| 0.0 - 0.4 | Weakly or not correlated — independent information |
What this tells us about the mailbox data:
All 6 correlated pairs are email-sending features. This makes logical sense: when an attacker sends a burst of emails, ALL sending metrics go up together — more emails means more recipients, more links, more attachments, higher spike ratio.
r=0.80 between emails_sent_24h and external_recipients_24h means they carry heavily overlapping information (r² = 0.64, i.e., about 64% shared variance). Including both in a model adds only limited new information over using one alone.
No login features are correlated with email features (r < 0.4), meaning they provide independent signals. Even if login features are weaker individually, they add unique information.
Decision — What to do with correlated features:
For model accuracy: Correlated features usually don't hurt tree-based models, but they inflate importance scores and make the model harder to interpret.
For interpretability: Pick ONE representative from each correlated cluster. From the sending cluster, emails_sent_24h alone captures most of the signal.
For linear models (Logistic Regression): High correlation causes unstable coefficients. Dropping correlated features or using regularization helps.
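The high-correlation pair detection described above can be reproduced with plain pandas. A sketch using a tiny synthetic frame: two of the column names mirror the mailbox features but the data itself is invented (one shared "burst" signal drives both sending metrics, while the login feature is independent).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
burst = rng.normal(size=500)  # one underlying event drives both sending metrics
df = pd.DataFrame({
    "emails_sent_24h": burst + rng.normal(scale=0.3, size=500),
    "external_recipients_24h": burst + rng.normal(scale=0.3, size=500),
    "login_countries_7d": rng.normal(size=500),  # independent signal
})

corr = df.corr()
# Flag pairs above the redundancy threshold used in the tool call
pairs = [(a, b, corr.loc[a, b])
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if abs(corr.loc[a, b]) > 0.7]
```

Only the two sending metrics exceed the 0.7 threshold; the login feature correlates with neither, mirroring the independent-signal pattern in the mailbox data.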
Step 8 — Feature-Target Correlation
target_correlation("mailbox")

1. emails_sent_24h +0.9151
2. external_recipients_24h +0.8756
3. emails_with_links_24h +0.8533
4. emails_with_attachments_24h +0.8058
5. send_spike_ratio +0.7397
...
14. device_diversity_7d +0.4227
15. oauth_consent_granted_7d +0.3502
...
18. department -0.0155
19. account_age_days +0.0146
20. mailbox_size_mb -0.0114

What it measures: How strongly each individual feature correlates with the target variable (compromised = 0 or 1). This is the most direct measure of "does this feature move with the outcome?"
How to interpret the numbers:
+0.9151 for emails_sent_24h — A near-perfect positive correlation. As emails sent increases, the probability of being compromised increases almost linearly. This is the single most direct indicator.
+0.4227 for device_diversity_7d — A moderate positive correlation. Compromised accounts tend to show more device diversity, but there's substantial overlap with legitimate accounts (travelers, people with multiple devices).
-0.0155 for department — Essentially zero. The tiny negative sign is meaningless at this magnitude; it's just random noise.
All signs are positive (except noise features) — This makes sense because all suspicious behaviors (more emails, more logins, more rule changes) are higher for compromised accounts.
How to read the tiers in this dataset:
| Correlation | Features | Interpretation |
|---|---|---|
| > 0.7 | emails_sent, external_recipients, links, attachments, spike_ratio | Primary indicators — individually sufficient for detection |
| 0.4 - 0.7 | login_time, failed_logins, countries, legacy_protocol, inbox_rules, new_IPs, forwarding, pct_external, device_diversity | Secondary indicators — useful for edge cases and model robustness |
| 0.3 - 0.4 | oauth_consent, mfa_disabled, password_changed | Weak indicators — some signal but too noisy to rely on alone |
| < 0.05 | department, account_age, mailbox_size | No signal — safe to drop |
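Feature-target correlation is a one-liner with `DataFrame.corrwith`. A sketch on invented data (a strong signal column and a pure-noise column, named after two mailbox features for readability):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
y = rng.integers(0, 2, size=1000)
df = pd.DataFrame({
    "emails_sent_24h": y * 5 + rng.normal(size=1000),  # strong signal
    "mailbox_size_mb": rng.normal(size=1000),          # pure noise
    "compromised": y,
})

# Correlate every feature with the target, strongest absolute value first
target_corr = (df.drop(columns=["compromised"])
                 .corrwith(df["compromised"])
                 .sort_values(key=abs, ascending=False))
```

The signal column lands at the top with a correlation near +0.9, while the noise column hovers near zero, matching the tier pattern in the table above.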
Step 9 — Recursive Feature Elimination (RFE)
recursive_feature_elimination("mailbox", n_features_to_select=8, estimator="logistic_regression")

Selected (rank 1): login_time_deviation_hrs, emails_sent_24h,
external_recipients_24h, pct_external_recipients,
emails_with_attachments_24h, emails_with_links_24h,
send_spike_ratio, inbox_rules_changed_7d
Eliminated: legacy_protocol (rank 2), login_countries (rank 3),
... mailbox_size_mb (rank 12), account_age_days (rank 13)

What it measures: RFE starts with all features, trains a model, and removes the least important feature. It repeats this process until only the desired number of features remain. The rank number tells you the order of elimination — rank 1 = kept, rank 13 = eliminated first.
How to interpret:
Rank 1 features — These are the features that survived all rounds of elimination. The model needs them.
Rank 13 (account_age_days) — This was the first feature eliminated, meaning it contributed the least to the Logistic Regression model's performance.
Why RFE picked pct_external_recipients but RF importance ranked it lower: RFE uses Logistic Regression, which weights features differently than Random Forest. LR benefits from pct_external_recipients because it captures a normalized ratio that helps the linear model.
Decision: RFE gives you a production-ready feature set. If you need exactly N features for a deployment, use RFE's selection. It accounts for feature interactions that univariate tests (ANOVA, MI) miss.
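Under the hood this maps directly onto scikit-learn's `RFE`. A hedged sketch on synthetic data (12 features reduced to 4, so the ranking runs 1 through 9):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=3)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
kept = rfe.support_   # boolean mask: True for the 4 surviving features
ranks = rfe.ranking_  # 1 = kept; higher rank = eliminated earlier
```

With 12 features and 4 kept, ranks run from 1 (kept) to 9 (the first feature eliminated), the same convention the tool's output uses.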
Step 10 — Select K Best
select_k_best("mailbox", k=10, score_func="mutual_info")

What it measures: Ranks all features by their individual MI score and picks the top K. Unlike RFE, this is univariate — it evaluates each feature independently, ignoring interactions.
When to use SelectKBest vs RFE:
| Method | Pros | Cons |
|---|---|---|
| SelectKBest | Fast, no model training needed | Ignores feature interactions |
| RFE | Considers interactions, model-specific | Slower, sensitive to model choice |
Decision: Use SelectKBest for quick screening. Use RFE for final feature selection before deployment.
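The quick-screening workflow is two lines with scikit-learn's `SelectKBest`. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=800, n_features=15, n_informative=5,
                           random_state=2)

# Score every feature independently by MI, keep the top 10
selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
X_top = selector.transform(X)  # reduced matrix: 800 rows x 10 columns
```

Because no model is trained, this runs in a fraction of the time RFE takes, which is exactly the screening-vs-final-selection trade-off in the table above.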
Step 11 — Model Evaluation (All Features vs Selected Features)
evaluate_model("mailbox", model="random_forest") # all 20 features
evaluate_model("mailbox", features=[top 8], model="random_forest") # 8 features

All 20 features:
{
"test_accuracy": 1.0000,
"cv_mean_accuracy": 1.0000,
"cv_std": 0.0000,
"Legit": {"precision": 1.0, "recall": 1.0, "f1": 1.0},
"Compromised": {"precision": 1.0, "recall": 1.0, "f1": 1.0}
}

Top 8 features only:
{
"test_accuracy": 1.0000,
"cv_mean_accuracy": 1.0000
}

How to interpret each metric:
test_accuracy — Percentage of correct predictions on the held-out 30% test set. 1.0 = 100% correct. This tells you how well the model performs on data it hasn't seen.
cv_mean_accuracy — Average accuracy across 5-fold cross-validation. The dataset is split into 5 parts; the model trains on 4 and tests on 1, rotating 5 times. This is more reliable than a single train/test split because it tests on every data point.
cv_std — Standard deviation across the 5 folds. Low std (e.g., 0.000) means performance is consistent regardless of which data is in the test set. High std (e.g., 0.05+) means the model is sensitive to which examples it sees.
precision — Of all accounts the model flagged as compromised, what percentage actually were? Precision = 1.0 means zero false positives: no legitimate accounts were wrongly flagged.
recall — Of all actually compromised accounts, what percentage did the model catch? Recall = 1.0 means zero false negatives: no compromised accounts slipped through.
f1-score — The harmonic mean of precision and recall. F1 = 1.0 means both precision and recall are perfect. For imbalanced datasets, F1 is usually the most informative single metric.
The critical finding: Both the 20-feature model and the 8-feature model achieve identical performance. This proves that 12 features are completely redundant. Fewer features = faster predictions, simpler model, easier to explain to stakeholders, less data to collect.
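The evaluation loop (stratified train/test split, classification report, 5-fold CV) is standard scikit-learn. A sketch on an imbalanced synthetic dataset, not the server's code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# 4:1 class imbalance, mirroring the mailbox case study
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=5)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=5)
model = RandomForestClassifier(random_state=5).fit(X_tr, y_tr)

# Per-class precision/recall/F1 on the held-out 30%
report = classification_report(y_te, model.predict(X_te), output_dict=True)

# 5-fold cross-validated accuracy on the full dataset
cv_scores = cross_val_score(RandomForestClassifier(random_state=5), X, y, cv=5)
```

`stratify=y` keeps the 4:1 ratio intact in both splits, so the minority-class recall in `report["1"]` is measured fairly.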
Step 12 — Classifier Comparison
random_forest 100.0%
logistic_regression 100.0%
gradient_boosting 99.9%
decision_tree 99.8%

How to interpret:
All classifiers score 99.8%+ — This means the signal in the data is so strong that even simple models can find it. When all models agree, the pattern is robust and not an artifact of any single algorithm.
Decision Tree at 99.8% — A single decision tree (no ensemble) nearly matches Random Forest. This means the decision boundary is simple enough to express as a few if/else rules. This is great for interpretability — you could explain the model to a non-technical auditor.
Logistic Regression at 100% — A linear model achieves perfect accuracy. This means the classes are linearly separable in the top-8 feature space. No complex non-linear model is needed.
Decision: For production, choose based on your priority:
Interpretability → Logistic Regression or Decision Tree
Robustness → Random Forest or Gradient Boosting
Speed → Decision Tree (single tree, fastest inference)
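A comparison like the one above is a short loop over estimators with a shared cross-validation call. A sketch on synthetic data, using the same four model families:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=8, n_informative=4,
                           random_state=6)

models = {
    "random_forest": RandomForestClassifier(random_state=6),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=6),
    "decision_tree": DecisionTreeClassifier(random_state=6),
}
# Mean 5-fold CV accuracy per classifier, all on identical folds
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
```

Using the same `cv=5` folds for every model keeps the comparison apples-to-apples.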
Step 13 — Feature Subset Comparison
Subset Features CV Accuracy
top_1 1 99.42%
top_2 2 99.84%
top_3 3 99.98%
top_5 5 100.0%
top_8 8 100.0%
top_20 20 100.0%

What it measures: This incrementally adds features (ranked by RF importance) and measures how accuracy changes. It answers: "How many features do I actually need?"
How to interpret:
top_1 = 99.42% — emails_sent_24h alone correctly classifies 99.42% of accounts. Only ~29 out of 5,000 accounts are misclassified. This single feature is extraordinarily powerful.
top_1 → top_2 = +0.42% — Adding external_recipients_24h fixes about half of the remaining errors. Meaningful improvement.
top_2 → top_3 = +0.14% — Adding emails_with_links_24h fixes most of the remaining errors. Smaller but still valuable.
top_3 → top_5 = +0.02% — Two more features push accuracy to 100%. Marginal but achieves perfection.
top_5 → top_20 = +0.00% — Adding 15 more features changes nothing. They are completely redundant.
How to find the "elbow" (optimal feature count):
Look for where accuracy gains become negligible. Here:
top_1 to top_3: each feature adds meaningful accuracy (+0.42%, +0.14%)
top_3 to top_5: small but reaches 100% (+0.02%)
top_5 to top_20: zero gain
The elbow is at 3-5 features. Beyond 5, you're adding complexity with no accuracy benefit.
Decision: In practice, choose the smallest subset that meets your accuracy requirement:
Need 99.4%+ accuracy? Use 1 feature.
Need 99.8%+ accuracy? Use 2 features.
Need 100% accuracy? Use 5 features.
Never need 20 features for this problem.
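The incremental-subset curve above can be reproduced with a loop: rank features by RF importance, then cross-validate on growing prefixes of that ranking. A sketch on synthetic data (not the server's implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=4)

# Rank features once by RF importance
rf = RandomForestClassifier(random_state=4).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]

# Cross-validate on growing top-K prefixes of the ranking
results = {}
for k in (1, 2, 3, 5, 8, 20):
    cols = order[:k]
    results[k] = cross_val_score(
        RandomForestClassifier(random_state=4), X[:, cols], y, cv=5).mean()
```

Plotting `results` against `k` makes the elbow visible: accuracy climbs steeply for the first few features and flattens once the informative ones are in.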
Putting It All Together — Cross-Method Consensus
The most reliable findings are those that multiple methods agree on. Here's the consensus view:
| Feature | RF Imp. | GB Imp. | Permutation | ANOVA F | MI | Target Corr | RFE | Verdict |
|---|---|---|---|---|---|---|---|---|
| emails_sent_24h | #1 | #1 | #1 | #1 | #1 | #1 | kept | Must-have |
| external_recipients_24h | #2 | #2 | #2 | #2 | #2 | #2 | kept | Must-have |
| emails_with_links_24h | #3 | #3 | #3 | #3 | #3 | #3 | kept | Must-have |
| send_spike_ratio | #4 | — | — | #5 | #4 | #5 | kept | Valuable |
| emails_with_attachments_24h | #5 | — | — | #4 | #5 | #4 | kept | Valuable |
| inbox_rules_changed_7d | #6 | — | — | #10 | #7 | #10 | kept | Moderate |
| mailbox_size_mb | #18 | — | — | #20 | #20 | #20 | eliminated | Drop |
| account_age_days | — | — | — | #19 | #19 | #19 | eliminated | Drop |
| department | #20 | — | — | #18 | #18 | #18 | eliminated | Drop |
Reading this table:
A feature ranked #1-5 across all methods is a reliable predictor.
A feature ranked #15-20 across all methods is confirmed noise.
A feature ranked high by some methods and low by others deserves investigation — it may be non-linearly useful or redundant with a stronger feature.
Key Takeaways for Mailbox Compromise Detection
Email sending patterns are the strongest signal — Volume, recipients, and links dominate all importance rankings. When an attacker takes over a mailbox, the first thing they do is send emails (phishing, spam, BEC scams), creating an unmistakable spike.
Login anomalies are secondary — New IPs, odd hours, and multiple countries are useful individually but redundant when email patterns are present. They help catch compromised accounts that haven't started sending yet.
Account metadata is noise — Mailbox size, account age, and department have zero predictive power. Attackers don't target based on these attributes.
Feature reduction works — Dropping 75% of features (20 → 5) loses zero accuracy while making the model faster, simpler, and easier to explain.
Simple models suffice — Even Logistic Regression achieves 100% with the right features. Complex deep learning models are unnecessary for this problem.
Correlated features tell a story — The 6 correlated email features all spike together during an attack, representing a single underlying event (mass mailing burst). Understanding this clustering helps build intuition about the threat.
Run the Case Study
source .venv/bin/activate
python generate_mailbox_dataset.py # creates data/mailbox_compromise.csv
python demo_mailbox_compromise.py # runs the full 15-step pipeline

Charts are saved to demo_charts/mailbox/ (10 PNG files including importance plots, correlation heatmap, and confusion matrices).
Project Structure
FeatureEngineering/
feature_eval_server.py # MCP server (13 tools)
generate_mailbox_dataset.py # Mailbox compromise dataset generator
demo_mailbox_compromise.py # Mailbox compromise case study demo
demo.py # Generic demo (Iris)
data/ # Generated datasets
mailbox_compromise.csv
demo_charts/ # PNG charts generated by demos
mailbox/ # Mailbox case study charts (10 PNGs)
requirements.txt # Python dependencies
.venv/ # Virtual environment
README.md

Requirements
Python 3.12+
Dependencies:
mcp, scikit-learn, pandas, numpy, matplotlib, seaborn