Enables comprehensive feature engineering for classification datasets with automated preprocessing and 13 specialized analysis tools. Supports feature importance calculation, correlation analysis, recursive feature elimination, and model evaluation with integrated visualization capabilities.
An MCP (Model Context Protocol) server that provides a complete toolkit for evaluating, selecting, and comparing features in classification datasets. Built with FastMCP, scikit-learn, pandas, and matplotlib.
```bash
cd FeatureEngineering
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python feature_eval_server.py
```
The server starts in stdio transport mode, ready for any MCP client.
Add this to your MCP config (~/.claude/claude_desktop_config.json or project .mcp.json):
```json
{
  "mcpServers": {
    "feature-eval": {
      "command": "/home/jai/MachineLearnin-Repo/FeatureEngineering/.venv/bin/python",
      "args": ["/home/jai/MachineLearnin-Repo/FeatureEngineering/feature_eval_server.py"]
    }
  }
}
```
| Tool | Description |
|---|---|
| `load_csv` | Load any CSV file with a specified target column |
| `load_sample_dataset` | Load a built-in dataset (iris, wine, breast_cancer) |
| `dataset_summary` | Descriptive statistics, missing values, and dtypes |
| `list_datasets` | Show all currently loaded datasets |
| Tool | Description |
|---|---|
| `feature_importance_tree` | Importance via Random Forest, Gradient Boosting, or Decision Tree |
| `permutation_importance_analysis` | Permutation-based importance on a held-out test set |
| `statistical_feature_scores` | Univariate scores: ANOVA F-test, Chi-squared, or Mutual Information |
| Tool | Description |
|---|---|
| `correlation_matrix` | Pairwise correlation heatmap with high-correlation pair detection |
| `target_correlation` | Each feature's correlation with the target variable |
| Tool | Description |
|---|---|
| `recursive_feature_elimination` | RFE using Logistic Regression or Decision Tree |
| `select_k_best` | Top-K univariate feature selection |
| Tool | Description |
|---|---|
| `evaluate_model` | Train/test split + cross-validation with full classification report |
| `compare_feature_subsets` | Compare CV accuracy across different feature subsets |
Every tool that produces a visualization returns a `chart_base64_png` field containing a PNG image encoded in base64.
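The server's own plotting code isn't shown here, but producing such a field is straightforward with matplotlib's in-memory rendering. A minimal sketch (the helper name `figure_to_base64_png` is illustrative, not part of the server's API):

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt


def figure_to_base64_png(fig) -> str:
    """Render a matplotlib figure to a base64-encoded PNG string."""
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode("ascii")


fig, ax = plt.subplots()
ax.bar(["emails_sent_24h", "department"], [0.3162, 0.0001])
chart_base64_png = figure_to_base64_png(fig)
```

Any MCP client can then decode the string and write it to a `.png` file.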
The server works with any tabular classification dataset, not just the built-in samples.
| Data issue | How it's handled |
|---|---|
| Categorical features (text columns) | Ordinal-encoded automatically — all tools see numeric data |
| Missing values | Imputed using median (numeric) at load time |
| Non-numeric target (e.g. "spam" / "ham") | Label-encoded to integers automatically |
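The same cleanup can be mirrored in a few lines of pandas. This is an illustrative sketch of the table's behavior, not the server's actual code, and the column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "emails_sent_24h": [12.0, None, 450.0, 8.0],             # numeric, one missing
    "department": ["Engineering", "Sales", "Sales", "HR"],   # text feature
    "compromised": ["legit", "legit", "hacked", "legit"],    # non-numeric target
})

# Missing numeric values: impute with the column median
for col in df.select_dtypes(include="number"):
    df[col] = df[col].fillna(df[col].median())

# Text columns (features and target alike): encode to integer codes
for col in ["department", "compromised"]:
    df[col] = df[col].astype("category").cat.codes
```

After this, every column is numeric, so any scikit-learn estimator can consume the frame directly.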
This section explains what each analysis technique measures, how to read the numbers, and what decisions to make based on the results. We use the Mailbox Compromise Detection case study as a running example.
We have 5,000 email accounts with 20 features each. The goal is to classify whether an account is compromised (hacked) or legitimate. The dataset is imbalanced: 4,000 legit vs 1,000 compromised (4:1 ratio).
The 20 features fall into categories:
The key question: Which of these 20 features actually matter for detecting compromise?
```python
load_csv("data/mailbox_compromise.csv", target_column="compromised", dataset_name="mailbox")
```

```json
{
  "shape": [5000, 21],
  "categorical_features_encoded": ["department"],
  "missing_values_imputed": 450,
  "class_distribution": {"0": 4000, "1": 1000}
}
```
How to interpret:
- `shape: [5000, 21]` — 5,000 rows (accounts) and 21 columns (20 features + 1 target). This is a reasonably sized dataset for ML.
- `categorical_features_encoded: ["department"]` — The department column contained text values like "Engineering", "Sales", etc. The server automatically converted these to numbers (0, 1, 2...) so ML algorithms can process them.
- `missing_values_imputed: 450` — 450 cells had missing data (NaN). These were filled with the median value of their column. This prevents models from crashing on NaN, but be aware that imputation adds artificial values.
- `class_distribution: {"0": 4000, "1": 1000}` — The classes are imbalanced (80% legit, 20% compromised). This means a model that always predicts "legit" would score 80% accuracy. Any useful model must beat this baseline significantly.

Decision: The 80/20 split means you should pay attention to recall for class 1 (compromised) — a model with 95% accuracy could still be missing a quarter of all compromised accounts.
```python
feature_importance_tree("mailbox", method="random_forest")
```

```
Rank  Feature                       Importance
 1.   emails_sent_24h               0.3162
 2.   external_recipients_24h       0.2096
 3.   emails_with_links_24h         0.1564
 4.   send_spike_ratio              0.1278
 5.   emails_with_attachments_24h   0.0726
 6.   inbox_rules_changed_7d        0.0348
 ...
18.   mailbox_size_mb               0.0002
19.   password_changed_24h          0.0002
20.   department                    0.0001
```
What it measures: Random Forest builds many decision trees and tracks how much each feature reduces classification error when used as a split point. Features that create the cleanest separations between classes get higher importance scores. All scores sum to 1.0.
How to interpret the numbers:
- A score of 0.3162 for `emails_sent_24h` means this single feature is responsible for ~32% of the model's decision-making power. It is the strongest individual predictor.
- `department` at 0.0001 — the department an employee works in has virtually no bearing on whether their account is compromised. Attackers don't care if you're in Sales or Engineering.

What the gap tells you: There's a clear "elbow" between feature #5 (0.0726) and feature #6 (0.0348) — importance drops by half. This suggests the natural boundary is around 5 features.
Caveat: RF importance is biased toward high-cardinality features (features with many unique values). A continuous feature like emails_sent_24h (many values) may score higher than a binary feature like forwarding_rule_added (only 0/1) even if both are equally useful. That's why we use multiple methods.
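The technique can be reproduced with plain scikit-learn. A sketch on synthetic data (a stand-in for the mailbox dataset, since the real CSV isn't included here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 20 features, only 5 of which carry real signal
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

importances = rf.feature_importances_     # one score per feature, summing to 1.0
ranked = np.argsort(importances)[::-1]    # feature indices, strongest first
```

As in the case study, the informative features bubble to the top of `ranked` while the noise features land near zero.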
```python
feature_importance_tree("mailbox", method="gradient_boosting")
```

```
1. emails_sent_24h           0.9687
2. external_recipients_24h   0.0150
3. emails_with_links_24h     0.0111
```
What it measures: Gradient Boosting builds trees sequentially — each tree corrects the mistakes of the previous one. Importance measures how much each feature contributed across all correction steps.
How to interpret:
- 0.9687 for `emails_sent_24h` — GB found that nearly all classification errors can be fixed by this single feature. It alone solves 97% of the problem.

Decision: When one method says "this feature is dominant" and another says "these 5 features are all important," it usually means the top feature is genuinely powerful, and the others provide overlapping information. Don't dismiss the others — they act as backup signals.
```python
permutation_importance_analysis("mailbox", n_repeats=10)
```

```
1. emails_sent_24h             mean_decrease=0.0441  std=0.0029
2. external_recipients_24h     mean_decrease=0.0009
3. emails_with_links_24h       mean_decrease=0.0006
...
4-20. (all remaining features) mean_decrease=0.0000
```
What it measures: After training a model, we randomly shuffle one feature's values and check how much accuracy drops. If accuracy drops a lot, the feature was important. If it doesn't change, the feature was irrelevant (or redundant with others).
How to interpret the numbers:
- `mean_decrease=0.0441` means that shuffling `emails_sent_24h` caused accuracy to drop by 4.41 percentage points (e.g., from 100% to 95.6%). This is a significant real-world impact.
- `std=0.0029` means that across 10 shuffles the drop was consistent (low variance). This result is reliable.
- `mean_decrease=0.0000` for features 4-20 does NOT necessarily mean these features are useless in isolation. It means that given the other features already in the model, removing them makes no difference. They are redundant — their information is already captured by the top features.

Key insight — Redundancy vs Uselessness:

- `login_time_deviation_hrs` shows 0.0 permutation importance here but scored well in ANOVA (F=1,965). It IS predictive on its own, but adds nothing when `emails_sent_24h` is already present.
- `mailbox_size_mb` shows 0.0 here AND near-zero everywhere else. It is genuinely useless.

Decision: Permutation importance on a test set is the most honest measure of real-world impact. Use it as the final arbiter when tree-importance and statistical tests disagree.
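The same measurement is available directly in scikit-learn. A sketch on synthetic data (not the mailbox CSV):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set; record the accuracy drop
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
drops = result.importances_mean   # mean accuracy decrease per feature
```

Note the importance is computed on the test split, so it reflects generalization, not training-set memorization.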
```python
statistical_feature_scores("mailbox", method="f_classif")
```

```
 1. emails_sent_24h           F=25,747  p≈0
 2. external_recipients_24h   F=16,418  p≈0
 ...
18. department                F=1.19    p=0.275
19. account_age_days          F=1.07    p=0.301
20. mailbox_size_mb           F=0.65    p=0.420
```
What it measures: ANOVA F-test asks: "Is the mean value of this feature significantly different between the classes?" For each feature, it compares the variance between classes to the variance within classes. A high F-score means the classes have very different distributions for that feature.
How to interpret the numbers:

- F=25,747 (p≈0) for `emails_sent_24h`: the between-class difference in means is enormous relative to the within-class spread, so the classes are nearly separable on this feature alone.
- F=0.65 (p=0.420) for `mailbox_size_mb`: any difference in class means is indistinguishable from chance (p is far above 0.05).
How to read the tiers:
| F-Score Range | Meaning in this dataset |
|---|---|
| > 10,000 | Strong predictor — clear class separation |
| 1,000 - 10,000 | Moderate predictor — useful but overlapping distributions |
| 100 - 1,000 | Weak predictor — some signal but lots of overlap |
| < 10 (p > 0.05) | Not significant — this feature is noise |
Caveat: ANOVA only measures linear separation. A feature that perfectly separates classes in a non-linear way (e.g., compromised accounts have EITHER very high OR very low values) might score poorly on F-test but well on Mutual Information.
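This caveat is easy to demonstrate. Below, class 1 sits in both tails of a feature, so the class means are equal (ANOVA sees nothing) while the distributions clearly differ (MI sees it). Synthetic data, illustrative only:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)

# Class 1 lives in BOTH tails: its mean equals class 0's, only the spread differs
x = np.where(y == 1,
             rng.choice([-3.0, 3.0], n) + rng.normal(0, 0.3, n),
             rng.normal(0, 0.3, n))
X = x.reshape(-1, 1)

f_score, p_value = f_classif(X, y)              # near zero: class means are equal
mi = mutual_info_classif(X, y, random_state=0)  # high: distributions differ
```

Running both tests, as the server's tools do, protects you from dismissing features like this one.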
```python
statistical_feature_scores("mailbox", method="mutual_info")
```

```
 1. emails_sent_24h              MI=0.487
 2. external_recipients_24h      MI=0.445
 3. emails_with_links_24h        MI=0.421
 4. send_spike_ratio             MI=0.417
 5. emails_with_attachments_24h  MI=0.371
 --- gap ---
 6. login_countries_7d           MI=0.146
 ...
18. department                   MI=0.002
19. account_age_days             MI=0.000
20. mailbox_size_mb              MI=0.000
```
What it measures: Mutual Information (MI) measures how much knowing a feature's value reduces uncertainty about the class. Unlike ANOVA, MI captures any kind of relationship — linear, non-linear, categorical, or complex. MI = 0 means the feature is completely independent of the target. Higher = more informative.
How to interpret the numbers:
- MI=0.487 means knowing `emails_sent_24h` eliminates about 49% of the uncertainty about whether an account is compromised. This is very high.
- MI=0.000 for `mailbox_size_mb` — knowing the mailbox size tells you absolutely nothing about whether the account is compromised. Zero information gained.

Why MI and ANOVA agree here: When relationships are mostly linear (which they are in this dataset), ANOVA and MI tend to agree. When they disagree, trust MI for the more complete picture.
```python
correlation_matrix("mailbox", threshold=0.7)
```

```
emails_sent_24h         <-> external_recipients_24h      r=0.80
emails_sent_24h         <-> send_spike_ratio             r=0.79
emails_sent_24h         <-> emails_with_links_24h        r=0.78
external_recipients_24h <-> emails_with_links_24h        r=0.75
emails_sent_24h         <-> emails_with_attachments_24h  r=0.74
external_recipients_24h <-> emails_with_attachments_24h  r=0.71
```
What it measures: Pearson correlation (r) measures the linear relationship between two features. r = +1 means they move together perfectly, r = -1 means they move in opposite directions, r = 0 means no linear relationship.
How to interpret the numbers:
| r value | Meaning |
|---|---|
| 0.9 - 1.0 | Near-duplicate features — one can replace the other |
| 0.7 - 0.9 | Strongly correlated — significant redundancy |
| 0.4 - 0.7 | Moderately correlated — some shared information |
| 0.0 - 0.4 | Weakly or not correlated — independent information |
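Pair detection above a threshold takes only a few lines of pandas. A sketch on synthetic columns (the feature names and the 0.8 relationship are constructed for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=500)

df = pd.DataFrame({
    "emails_sent_24h": base,
    "external_recipients_24h": 0.8 * base + rng.normal(0, 0.6, 500),  # r ≈ 0.8
    "mailbox_size_mb": rng.normal(size=500),                          # independent
})

corr = df.corr()  # pairwise Pearson r
threshold = 0.7
cols = corr.columns
pairs = [(cols[i], cols[j], round(corr.iloc[i, j], 2))
         for i in range(len(cols)) for j in range(i + 1, len(cols))
         if abs(corr.iloc[i, j]) >= threshold]
```

Only the constructed pair crosses the 0.7 threshold; the noise column correlates with nothing.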
What this tells us about the mailbox data:

- r=0.80 between `emails_sent_24h` and `external_recipients_24h` means the two features carry largely overlapping information (r² ≈ 0.64 of the variance is shared). Including both in a model adds little new information over using one alone.

Decision — What to do with correlated features:

- `emails_sent_24h` alone captures most of the signal.

```python
target_correlation("mailbox")
```
```
 1. emails_sent_24h              +0.9151
 2. external_recipients_24h      +0.8756
 3. emails_with_links_24h        +0.8533
 4. emails_with_attachments_24h  +0.8058
 5. send_spike_ratio             +0.7397
 ...
14. device_diversity_7d          +0.4227
15. oauth_consent_granted_7d     +0.3502
 ...
18. department                   -0.0155
19. account_age_days             +0.0146
20. mailbox_size_mb              -0.0114
```
What it measures: How strongly each individual feature correlates with the target variable (compromised = 0 or 1). This is the most direct measure of "does this feature move with the outcome?"
How to interpret the numbers:
- +0.9151 for `emails_sent_24h` — a near-perfect positive correlation. As emails sent increases, the probability of being compromised increases almost linearly. This is the single most direct indicator.
- +0.4227 for `device_diversity_7d` — a moderate positive correlation. Compromised accounts tend to show more device diversity, but there's substantial overlap with legitimate accounts (travelers, people with multiple devices).
- -0.0155 for `department` — essentially zero. The tiny negative sign is meaningless at this magnitude; it's just random noise.

How to read the tiers in this dataset:
| Correlation | Features | Interpretation |
|---|---|---|
| > 0.7 | emails_sent, external_recipients, links, attachments, spike_ratio | Primary indicators — individually sufficient for detection |
| 0.4 - 0.7 | login_time, failed_logins, countries, legacy_protocol, inbox_rules, new_IPs, forwarding, pct_external, device_diversity | Secondary indicators — useful for edge cases and model robustness |
| 0.3 - 0.4 | oauth_consent, mfa_disabled, password_changed | Weak indicators — some signal but too noisy to rely on alone |
| < 0.05 | department, account_age, mailbox_size | No signal — safe to drop |
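For a binary target, each feature's correlation with the class label (the point-biserial correlation) can be computed with pandas' `DataFrame.corrwith`. A synthetic sketch with one strong feature and one noise feature:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
y = pd.Series(rng.integers(0, 2, 1000), name="compromised")  # binary target

df = pd.DataFrame({
    "emails_sent_24h": y * 5 + rng.normal(0, 1, 1000),  # strong signal
    "mailbox_size_mb": rng.normal(0, 1, 1000),          # pure noise
})

# Point-biserial correlation of every feature with the target, strongest first
target_corr = df.corrwith(y).sort_values(key=abs, ascending=False)
```

Sorting by absolute value matters: a strong negative correlation is just as useful as a strong positive one.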
```python
recursive_feature_elimination("mailbox", n_features_to_select=8, estimator="logistic_regression")
```

```
Selected (rank 1): login_time_deviation_hrs, emails_sent_24h,
                   external_recipients_24h, pct_external_recipients,
                   emails_with_attachments_24h, emails_with_links_24h,
                   send_spike_ratio, inbox_rules_changed_7d

Eliminated: legacy_protocol (rank 2), login_countries (rank 3),
            ... mailbox_size_mb (rank 12), account_age_days (rank 13)
```
What it measures: RFE starts with all features, trains a model, and removes the least important feature. It repeats this process until only the desired number of features remain. The rank number tells you the order of elimination — rank 1 = kept, rank 13 = eliminated first.
How to interpret:
- Rank 13 (`account_age_days`) — this was the first feature eliminated, meaning it contributed the least to the Logistic Regression model's performance.
- RFE kept `pct_external_recipients` even though RF importance ranked it lower: RFE uses Logistic Regression, which weights features differently than Random Forest. LR benefits from `pct_external_recipients` because it captures a normalized ratio that helps the linear model.

Decision: RFE gives you a production-ready feature set. If you need exactly N features for a deployment, use RFE's selection. It accounts for feature interactions that univariate tests (ANOVA, MI) miss.
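scikit-learn's `RFE` implements this loop directly. A sketch using the built-in breast_cancer sample (standardization is included because Logistic Regression is scale-sensitive):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)  # scale so LR coefficients are comparable
y = data.target

# Drop the weakest feature one at a time until 8 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8).fit(X, y)

kept = [name for name, keep in zip(data.feature_names, rfe.support_) if keep]
# rfe.ranking_: 1 = kept; the highest rank was eliminated first
```

The `ranking_` array matches the convention in the output above: rank 1 for every kept feature, and increasing ranks for features eliminated progressively earlier.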
```python
select_k_best("mailbox", k=10, score_func="mutual_info")
```
What it measures: Ranks all features by their individual MI score and picks the top K. Unlike RFE, this is univariate — it evaluates each feature independently, ignoring interactions.
When to use SelectKBest vs RFE:
| Method | Pros | Cons |
|---|---|---|
| SelectKBest | Fast, no model training needed | Ignores feature interactions |
| RFE | Considers interactions, model-specific | Slower, sensitive to model choice |
Decision: Use SelectKBest for quick screening. Use RFE for final feature selection before deployment.
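A sketch of the univariate path with scikit-learn's `SelectKBest`, again on the built-in breast_cancer sample:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

data = load_breast_cancer()

# Score each feature independently by mutual information, keep the 10 best
selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(data.data, data.target)
top10 = list(data.feature_names[selector.get_support()])
```

Because every feature is scored in isolation, this runs in one pass, which is why it suits quick screening.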
```python
evaluate_model("mailbox", model="random_forest")                    # all 20 features
evaluate_model("mailbox", features=[top 8], model="random_forest")  # 8 features
```

All 20 features:

```json
{
  "test_accuracy": 1.0000,
  "cv_mean_accuracy": 1.0000,
  "cv_std": 0.0000,
  "Legit": {"precision": 1.0, "recall": 1.0, "f1": 1.0},
  "Compromised": {"precision": 1.0, "recall": 1.0, "f1": 1.0}
}
```

Top 8 features only:

```json
{
  "test_accuracy": 1.0000,
  "cv_mean_accuracy": 1.0000
}
```
How to interpret each metric:
test_accuracy — Percentage of correct predictions on the held-out 30% test set. 1.0 = 100% correct. This tells you how well the model performs on data it hasn't seen.
cv_mean_accuracy — Average accuracy across 5-fold cross-validation. The dataset is split into 5 parts; the model trains on 4 and tests on 1, rotating 5 times. This is more reliable than a single train/test split because it tests on every data point.
cv_std — Standard deviation across the 5 folds. Low std (e.g., 0.000) means performance is consistent regardless of which data is in the test set. High std (e.g., 0.05+) means the model is sensitive to which examples it sees.
precision — Of all accounts the model flagged as compromised, what percentage actually were? Precision = 1.0 means zero false positives — no legitimate accounts were wrongly flagged.
recall — Of all actually compromised accounts, what percentage did the model catch? Recall = 1.0 means zero false negatives — no compromised accounts slipped through.
f1-score — The harmonic mean of precision and recall. F1 = 1.0 means both precision and recall are perfect. F1 is the single best metric for imbalanced datasets.
The critical finding: Both the 20-feature model and the 8-feature model achieve identical performance. This proves that 12 features are completely redundant. Fewer features = faster predictions, simpler model, easier to explain to stakeholders, less data to collect.
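The full-versus-subset comparison is easy to replicate with `cross_val_score`. A sketch on synthetic data shaped like the case study (20 features, only 5 informative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 20 features but only 5 carry signal, mimicking the mailbox case study
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=7)
rf = RandomForestClassifier(n_estimators=100, random_state=7)

full = cross_val_score(rf, X, y, cv=5).mean()   # all 20 features

# Rank features by RF importance and re-evaluate on the top 8 only
top8 = np.argsort(rf.fit(X, y).feature_importances_)[::-1][:8]
subset = cross_val_score(rf, X[:, top8], y, cv=5).mean()
```

When the discarded features are truly redundant, `subset` matches `full` to within cross-validation noise.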
```
random_forest        100.0%
logistic_regression  100.0%
gradient_boosting     99.9%
decision_tree         99.8%
```
How to interpret:
Decision: For production, choose based on your priority:
```
Subset   Features   CV Accuracy
top_1    1          99.42%
top_2    2          99.84%
top_3    3          99.98%
top_5    5          100.0%
top_8    8          100.0%
top_20   20         100.0%
```
What it measures: This incrementally adds features (ranked by RF importance) and measures how accuracy changes. It answers: "How many features do I actually need?"
How to interpret:
- `emails_sent_24h` alone correctly classifies 99.42% of accounts. Only ~29 out of 5,000 accounts are misclassified. This single feature is extraordinarily powerful.
- Adding `external_recipients_24h` fixes about half of the remaining errors. Meaningful improvement.
- Adding `emails_with_links_24h` fixes most of the remaining errors. Smaller but still valuable.

How to find the "elbow" (optimal feature count):
Look for where accuracy gains become negligible. Here:
The elbow is at 3-5 features. Beyond 5, you're adding complexity with no accuracy benefit.
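The incremental curve behind this elbow analysis can be sketched with plain scikit-learn (synthetic stand-in data; the server's `compare_feature_subsets` tool presumably does something similar internally):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=3)

# Rank all features once by Random Forest importance
ranked = np.argsort(
    RandomForestClassifier(random_state=3).fit(X, y).feature_importances_)[::-1]

# Grow the feature set best-first and track cross-validated accuracy
scores = {}
for k in (1, 2, 3, 5, 8, 20):
    scores[k] = cross_val_score(RandomForestClassifier(random_state=3),
                                X[:, ranked[:k]], y, cv=5).mean()
```

Plotting `scores` against `k` makes the elbow visible: accuracy climbs steeply at first, then flattens once the informative features are in.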
Decision: In practice, choose the smallest subset that meets your accuracy requirement:
The most reliable findings are those that multiple methods agree on. Here's the consensus view:
| Feature | RF Imp. | GB Imp. | Permutation | ANOVA F | MI | Target Corr | RFE | Verdict |
|---|---|---|---|---|---|---|---|---|
| emails_sent_24h | #1 | #1 | #1 | #1 | #1 | #1 | kept | Must-have |
| external_recipients_24h | #2 | #2 | #2 | #2 | #2 | #2 | kept | Must-have |
| emails_with_links_24h | #3 | #3 | #3 | #3 | #3 | #3 | kept | Must-have |
| send_spike_ratio | #4 | — | — | #5 | #4 | #5 | kept | Valuable |
| emails_with_attachments_24h | #5 | — | — | #4 | #5 | #4 | kept | Valuable |
| inbox_rules_changed_7d | #6 | — | — | #10 | #7 | #10 | kept | Moderate |
| mailbox_size_mb | #18 | — | — | #20 | #20 | #20 | eliminated | Drop |
| account_age_days | — | — | — | #19 | #19 | #19 | eliminated | Drop |
| department | #20 | — | — | #18 | #18 | #18 | eliminated | Drop |
Reading this table:
Email sending patterns are the strongest signal — Volume, recipients, and links dominate all importance rankings. When an attacker takes over a mailbox, the first thing they do is send emails (phishing, spam, BEC scams), creating an unmistakable spike.
Login anomalies are secondary — New IPs, odd hours, and multiple countries are useful individually but redundant when email patterns are present. They help catch compromised accounts that haven't started sending yet.
Account metadata is noise — Mailbox size, account age, and department have zero predictive power. Attackers don't target based on these attributes.
Feature reduction works — Dropping 75% of features (20 → 5) loses zero accuracy while making the model faster, simpler, and easier to explain.
Simple models suffice — Even Logistic Regression achieves 100% with the right features. Complex deep learning models are unnecessary for this problem.
Correlated features tell a story — The 6 correlated email features all spike together during an attack, representing a single underlying event (mass mailing burst). Understanding this clustering helps build intuition about the threat.
```bash
source .venv/bin/activate
python generate_mailbox_dataset.py   # creates data/mailbox_compromise.csv
python demo_mailbox_compromise.py    # runs the full 15-step pipeline
```
Charts are saved to demo_charts/mailbox/ (10 PNG files including importance plots, correlation heatmap, and confusion matrices).
```
FeatureEngineering/
    feature_eval_server.py        # MCP server (13 tools)
    generate_mailbox_dataset.py   # Mailbox compromise dataset generator
    demo_mailbox_compromise.py    # Mailbox compromise case study demo
    demo.py                       # Generic demo (Iris)
    data/                         # Generated datasets
        mailbox_compromise.csv
    demo_charts/                  # PNG charts generated by demos
        mailbox/                  # Mailbox case study charts (10 PNGs)
    requirements.txt              # Python dependencies
    .venv/                        # Virtual environment
    README.md
```
mcp, scikit-learn, pandas, numpy, matplotlib, seaborn