Enables comprehensive feature engineering for classification datasets with automated preprocessing and 13 specialized analysis tools. Supports feature importance calculation, correlation analysis, recursive feature elimination, and model evaluation with integrated visualization capabilities.
An MCP (Model Context Protocol) server that provides a complete toolkit for evaluating, selecting, and comparing features in classification datasets. Built with FastMCP, scikit-learn, pandas, and matplotlib.
```bash
cd FeatureEngineering
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python feature_eval_server.py
```
The server starts in stdio transport mode, ready for any MCP client.
Add this to your MCP config (~/.claude/claude_desktop_config.json or project .mcp.json):
```json
{
  "mcpServers": {
    "feature-eval": {
      "command": "/home/jai/MachineLearnin-Repo/FeatureEngineering/.venv/bin/python",
      "args": ["/home/jai/MachineLearnin-Repo/FeatureEngineering/feature_eval_server.py"]
    }
  }
}
```
| Tool | Description |
|---|---|
| `load_csv` | Load any CSV file with a specified target column |
| `load_sample_dataset` | Load a built-in dataset (iris, wine, breast_cancer) |
| `dataset_summary` | Descriptive statistics, missing values, and dtypes |
| `list_datasets` | Show all currently loaded datasets |
| Tool | Description |
|---|---|
| `feature_importance_tree` | Importance via Random Forest, Gradient Boosting, or Decision Tree |
| `permutation_importance_analysis` | Permutation-based importance on a held-out test set |
| `statistical_feature_scores` | Univariate scores: ANOVA F-test, Chi-squared, or Mutual Information |
| Tool | Description |
|---|---|
| `correlation_matrix` | Pairwise correlation heatmap with high-correlation pair detection |
| `target_correlation` | Each feature's correlation with the target variable |
| Tool | Description |
|---|---|
| `recursive_feature_elimination` | RFE using Logistic Regression or Decision Tree |
| `select_k_best` | Top-K univariate feature selection |
| Tool | Description |
|---|---|
| `evaluate_model` | Train/test split + cross-validation with full classification report |
| `compare_feature_subsets` | Compare CV accuracy across different feature subsets |
Every tool that produces a visualization returns a `chart_base64_png` field containing a PNG image encoded in base64.
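The server's own plotting code isn't shown here, but producing such a field is straightforward with matplotlib's in-memory rendering. A minimal sketch (the helper name `figure_to_base64_png` is illustrative, not part of the server's API):

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt


def figure_to_base64_png(fig) -> str:
    """Render a matplotlib figure to a base64-encoded PNG string."""
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode("ascii")


fig, ax = plt.subplots()
ax.bar(["emails_sent_24h", "department"], [0.3162, 0.0001])
chart_base64_png = figure_to_base64_png(fig)
```

Any MCP client can then decode the string and write it to a `.png` file.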
The server works with any tabular classification dataset, not just the built-in samples.
| Data issue | How it's handled |
|---|---|
| Categorical features (text columns) | Ordinal-encoded automatically — all tools see numeric data |
| Missing values | Imputed using median (numeric) at load time |
| Non-numeric target (e.g. "spam" / "ham") | Label-encoded to integers automatically |
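The same cleanup can be mirrored in a few lines of pandas. This is an illustrative sketch of the table's behavior, not the server's actual code, and the column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "emails_sent_24h": [12.0, None, 450.0, 8.0],             # numeric, one missing
    "department": ["Engineering", "Sales", "Sales", "HR"],   # text feature
    "compromised": ["legit", "legit", "hacked", "legit"],    # non-numeric target
})

# Missing numeric values: impute with the column median
for col in df.select_dtypes(include="number"):
    df[col] = df[col].fillna(df[col].median())

# Text columns (features and target alike): encode to integer codes
for col in ["department", "compromised"]:
    df[col] = df[col].astype("category").cat.codes
```

After this, every column is numeric, so any scikit-learn estimator can consume the frame directly.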
This section explains what each analysis technique measures, how to read the numbers, and what decisions to make based on the results. We use the Mailbox Compromise Detection case study as a running example.
We have 5,000 email accounts with 20 features each. The goal is to classify whether an account is compromised (hacked) or legitimate. The dataset is imbalanced: 4,000 legit vs 1,000 compromised (4:1 ratio).
The 20 features fall into categories:
The key question: Which of these 20 features actually matter for detecting compromise?
```python
load_csv("data/mailbox_compromise.csv", target_column="compromised", dataset_name="mailbox")
```

```json
{
  "shape": [5000, 21],
  "categorical_features_encoded": ["department"],
  "missing_values_imputed": 450,
  "class_distribution": {"0": 4000, "1": 1000}
}
```
How to interpret:
- `shape: [5000, 21]` — 5,000 rows (accounts) and 21 columns (20 features + 1 target). This is a reasonably sized dataset for ML.
- `categorical_features_encoded: ["department"]` — The department column contained text values like "Engineering", "Sales", etc. The server automatically converted these to numbers (0, 1, 2...) so ML algorithms can process them.
- `missing_values_imputed: 450` — 450 cells had missing data (NaN). These were filled with the median value of their column. This prevents models from crashing on NaN, but be aware that imputation adds artificial values.
- `class_distribution: {"0": 4000, "1": 1000}` — The classes are imbalanced (80% legit, 20% compromised). This means a model that always predicts "legit" would score 80% accuracy. Any useful model must beat this baseline significantly.

Decision: The 80/20 split means you should pay attention to recall for class 1 (compromised) — a model with 95% accuracy could still be missing a quarter of all compromised accounts.
```python
feature_importance_tree("mailbox", method="random_forest")
```

```
Rank  Feature                       Importance
 1.   emails_sent_24h               0.3162
 2.   external_recipients_24h       0.2096
 3.   emails_with_links_24h         0.1564
 4.   send_spike_ratio              0.1278
 5.   emails_with_attachments_24h   0.0726
 6.   inbox_rules_changed_7d        0.0348
 ...
18.   mailbox_size_mb               0.0002
19.   password_changed_24h          0.0002
20.   department                    0.0001
```
What it measures: Random Forest builds many decision trees and tracks how much each feature reduces classification error when used as a split point. Features that create the cleanest separations between classes get higher importance scores. All scores sum to 1.0.
How to interpret the numbers:
- A score of 0.3162 for `emails_sent_24h` means this single feature is responsible for ~32% of the model's decision-making power. It is the strongest individual predictor.
- `department` at 0.0001 — the department an employee works in has virtually no bearing on whether their account is compromised. Attackers don't care if you're in Sales or Engineering.

What the gap tells you: There's a clear "elbow" between feature #5 (0.0726) and feature #6 (0.0348) — importance drops by half. This suggests the natural boundary is around 5 features.
Caveat: RF importance is biased toward high-cardinality features (features with many unique values). A continuous feature like emails_sent_24h (many values) may score higher than a binary feature like forwarding_rule_added (only 0/1) even if both are equally useful. That's why we use multiple methods.
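The technique can be reproduced with plain scikit-learn. A sketch on synthetic data (a stand-in for the mailbox dataset, since the real CSV isn't included here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 20 features, only 5 of which carry real signal
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

importances = rf.feature_importances_     # one score per feature, summing to 1.0
ranked = np.argsort(importances)[::-1]    # feature indices, strongest first
```

As in the case study, the informative features bubble to the top of `ranked` while the noise features land near zero.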
```python
feature_importance_tree("mailbox", method="gradient_boosting")
```

```
1. emails_sent_24h           0.9687
2. external_recipients_24h   0.0150
3. emails_with_links_24h     0.0111
```
What it measures: Gradient Boosting builds trees sequentially — each tree corrects the mistakes of the previous one. Importance measures how much each feature contributed across all correction steps.
How to interpret:
- 0.9687 for `emails_sent_24h` — GB found that nearly all classification errors can be fixed by this single feature. It alone solves 97% of the problem.

Decision: When one method says "this feature is dominant" and another says "these 5 features are all important," it usually means the top feature is genuinely powerful, and the others provide overlapping information. Don't dismiss the others — they act as backup signals.
```python
permutation_importance_analysis("mailbox", n_repeats=10)
```

```
1. emails_sent_24h             mean_decrease=0.0441  std=0.0029
2. external_recipients_24h     mean_decrease=0.0009
3. emails_with_links_24h       mean_decrease=0.0006
...
4-20. (all remaining features) mean_decrease=0.0000
```
What it measures: After training a model, we randomly shuffle one feature's values and check how much accuracy drops. If accuracy drops a lot, the feature was important. If it doesn't change, the feature was irrelevant (or redundant with others).
How to interpret the numbers:
- `mean_decrease=0.0441` means that shuffling `emails_sent_24h` caused accuracy to drop by 4.41 percentage points (e.g., from 100% to 95.6%). This is a significant real-world impact.
- `std=0.0029` means that across 10 shuffles the drop was consistent (low variance). This result is reliable.
- `mean_decrease=0.0000` for features 4-20 does NOT necessarily mean these features are useless in isolation. It means that given the other features already in the model, removing them makes no difference. They are redundant — their information is already captured by the top features.

Key insight — Redundancy vs Uselessness:

- `login_time_deviation_hrs` shows 0.0 permutation importance here but scored well in ANOVA (F=1,965). It IS predictive on its own, but adds nothing when `emails_sent_24h` is already present.
- `mailbox_size_mb` shows 0.0 here AND near-zero everywhere else. It is genuinely useless.

Decision: Permutation importance on a test set is the most honest measure of real-world impact. Use it as the final arbiter when tree-importance and statistical tests disagree.
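The same measurement is available directly in scikit-learn. A sketch on synthetic data (not the mailbox CSV):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set; record the accuracy drop
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
drops = result.importances_mean   # mean accuracy decrease per feature
```

Note the importance is computed on the test split, so it reflects generalization, not training-set memorization.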
```python
statistical_feature_scores("mailbox", method="f_classif")
```

```
 1. emails_sent_24h           F=25,747  p≈0
 2. external_recipients_24h   F=16,418  p≈0
 ...
18. department                F=1.19    p=0.275
19. account_age_days          F=1.07    p=0.301
20. mailbox_size_mb           F=0.65    p=0.420
```
What it measures: ANOVA F-test asks: "Is the mean value of this feature significantly different between the classes?" For each feature, it compares the variance between classes to the variance within classes. A high F-score means the classes have very different distributions for that feature.
How to interpret the numbers:

- F=25,747 (p≈0) for `emails_sent_24h`: the between-class difference in means is enormous relative to the within-class spread, so the classes are nearly separable on this feature alone.
- F=0.65 (p=0.420) for `mailbox_size_mb`: any difference in class means is indistinguishable from chance (p is far above 0.05).
How to read the tiers:
| F-Score Range | Meaning in this dataset |
|---|---|
| > 10,000 | Strong predictor — clear class separation |
| 1,000 - 10,000 | Moderate predictor — useful but overlapping distributions |
| 100 - 1,000 | Weak predictor — some signal but lots of overlap |
| < 10 (p > 0.05) | Not significant — this feature is noise |
Caveat: ANOVA only measures linear separation. A feature that perfectly separates classes in a non-linear way (e.g., compromised accounts have EITHER very high OR very low values) might score poorly on F-test but well on Mutual Information.
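This caveat is easy to demonstrate. Below, class 1 sits in both tails of a feature, so the class means are equal (ANOVA sees nothing) while the distributions clearly differ (MI sees it). Synthetic data, illustrative only:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)

# Class 1 lives in BOTH tails: its mean equals class 0's, only the spread differs
x = np.where(y == 1,
             rng.choice([-3.0, 3.0], n) + rng.normal(0, 0.3, n),
             rng.normal(0, 0.3, n))
X = x.reshape(-1, 1)

f_score, p_value = f_classif(X, y)              # near zero: class means are equal
mi = mutual_info_classif(X, y, random_state=0)  # high: distributions differ
```

Running both tests, as the server's tools do, protects you from dismissing features like this one.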
```python
statistical_feature_scores("mailbox", method="mutual_info")
```

```
 1. emails_sent_24h              MI=0.487
 2. external_recipients_24h      MI=0.445
 3. emails_with_links_24h        MI=0.421
 4. send_spike_ratio             MI=0.417
 5. emails_with_attachments_24h  MI=0.371
 --- gap ---
 6. login_countries_7d           MI=0.146
 ...
18. department                   MI=0.002
19. account_age_days             MI=0.000
20. mailbox_size_mb              MI=0.000
```
What it measures: Mutual Information (MI) measures how much knowing a feature's value reduces uncertainty about the class. Unlike ANOVA, MI captures any kind of relationship — linear, non-linear, categorical, or complex. MI = 0 means the feature is completely independent of the target. Higher = more informative.
How to interpret the numbers:
- MI=0.487 means knowing `emails_sent_24h` eliminates about 49% of the uncertainty about whether an account is compromised. This is very high.
- MI=0.000 for `mailbox_size_mb` — knowing the mailbox size tells you absolutely nothing about whether the account is compromised. Zero information gained.

Why MI and ANOVA agree here: When relationships are mostly linear (which they are in this dataset), ANOVA and MI tend to agree. When they disagree, trust MI for the more complete picture.
```python
correlation_matrix("mailbox", threshold=0.7)
```

```
emails_sent_24h         <-> external_recipients_24h      r=0.80
emails_sent_24h         <-> send_spike_ratio             r=0.79
emails_sent_24h         <-> emails_with_links_24h        r=0.78
external_recipients_24h <-> emails_with_links_24h        r=0.75
emails_sent_24h         <-> emails_with_attachments_24h  r=0.74
external_recipients_24h <-> emails_with_attachments_24h  r=0.71
```
What it measures: Pearson correlation (r) measures the linear relationship between two features. r = +1 means they move together perfectly, r = -1 means they move in opposite directions, r = 0 means no linear relationship.
How to interpret the numbers:
| r value | Meaning |
|---|---|
| 0.9 - 1.0 | Near-duplicate features — one can replace the other |
| 0.7 - 0.9 | Strongly correlated — significant redundancy |
| 0.4 - 0.7 | Moderately correlated — some shared information |
| 0.0 - 0.4 | Weakly or not correlated — independent information |
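Pair detection above a threshold takes only a few lines of pandas. A sketch on synthetic columns (the feature names and the 0.8 relationship are constructed for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=500)

df = pd.DataFrame({
    "emails_sent_24h": base,
    "external_recipients_24h": 0.8 * base + rng.normal(0, 0.6, 500),  # r ≈ 0.8
    "mailbox_size_mb": rng.normal(size=500),                          # independent
})

corr = df.corr()  # pairwise Pearson r
threshold = 0.7
cols = corr.columns
pairs = [(cols[i], cols[j], round(corr.iloc[i, j], 2))
         for i in range(len(cols)) for j in range(i + 1, len(cols))
         if abs(corr.iloc[i, j]) >= threshold]
```

Only the constructed pair crosses the 0.7 threshold; the noise column correlates with nothing.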
What this tells us about the mailbox data:

- r=0.80 between `emails_sent_24h` and `external_recipients_24h` means the two features carry largely overlapping information (r² ≈ 0.64 of the variance is shared). Including both in a model adds little new information over using one alone.

Decision — What to do with correlated features:

- `emails_sent_24h` alone captures most of the signal.

```python
target_correlation("mailbox")
```
```
 1. emails_sent_24h              +0.9151
 2. external_recipients_24h      +0.8756
 3. emails_with_links_24h        +0.8533
 4. emails_with_attachments_24h  +0.8058
 5. send_spike_ratio             +0.7397
 ...
14. device_diversity_7d          +0.4227
15. oauth_consent_granted_7d     +0.3502
 ...
18. department                   -0.0155
19. account_age_days             +0.0146
20. mailbox_size_mb              -0.0114
```
What it measures: How strongly each individual feature correlates with the target variable (compromised = 0 or 1). This is the most direct measure of "does this feature move with the outcome?"
How to interpret the numbers:
- +0.9151 for `emails_sent_24h` — a near-perfect positive correlation. As emails sent increases, the probability of being compromised increases almost linearly. This is the single most direct indicator.
- +0.4227 for `device_diversity_7d` — a moderate positive correlation. Compromised accounts tend to show more device diversity, but there's substantial overlap with legitimate accounts (travelers, people with multiple devices).
- -0.0155 for `department` — essentially zero. The tiny negative sign is meaningless at this magnitude; it's just random noise.

How to read the tiers in this dataset:
| Correlation | Features | Interpretation |
|---|---|---|
| > 0.7 | emails_sent, external_recipients, links, attachments, spike_ratio | Primary indicators — individually sufficient for detection |
| 0.4 - 0.7 | login_time, failed_logins, countries, legacy_protocol, inbox_rules, new_IPs, forwarding, pct_external, device_diversity | Secondary indicators — useful for edge cases and model robustness |
| 0.3 - 0.4 | oauth_consent, mfa_disabled, password_changed | Weak indicators — some signal but too noisy to rely on alone |
| < 0.05 | department, account_age, mailbox_size | No signal — safe to drop |
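For a binary target, each feature's correlation with the class label (the point-biserial correlation) can be computed with pandas' `DataFrame.corrwith`. A synthetic sketch with one strong feature and one noise feature:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
y = pd.Series(rng.integers(0, 2, 1000), name="compromised")  # binary target

df = pd.DataFrame({
    "emails_sent_24h": y * 5 + rng.normal(0, 1, 1000),  # strong signal
    "mailbox_size_mb": rng.normal(0, 1, 1000),          # pure noise
})

# Point-biserial correlation of every feature with the target, strongest first
target_corr = df.corrwith(y).sort_values(key=abs, ascending=False)
```

Sorting by absolute value matters: a strong negative correlation is just as useful as a strong positive one.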
```python
recursive_feature_elimination("mailbox", n_features_to_select=8, estimator="logistic_regression")
```

```
Selected (rank 1): login_time_deviation_hrs, emails_sent_24h,
                   external_recipients_24h, pct_external_recipients,
                   emails_with_attachments_24h, emails_with_links_24h,
                   send_spike_ratio, inbox_rules_changed_7d

Eliminated: legacy_protocol (rank 2), login_countries (rank 3),
            ... mailbox_size_mb (rank 12), account_age_days (rank 13)
```
What it measures: RFE starts with all features, trains a model, and removes the least important feature. It repeats this process until only the desired number of features remain. The rank number tells you the order of elimination — rank 1 = kept, rank 13 = eliminated first.
How to interpret:
- Rank 13 (`account_age_days`) — this was the first feature eliminated, meaning it contributed the least to the Logistic Regression model's performance.
- RFE kept `pct_external_recipients` even though RF importance ranked it lower: RFE uses Logistic Regression, which weights features differently than Random Forest. LR benefits from `pct_external_recipients` because it captures a normalized ratio that helps the linear model.

Decision: RFE gives you a production-ready feature set. If you need exactly N features for a deployment, use RFE's selection. It accounts for feature interactions that univariate tests (ANOVA, MI) miss.
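scikit-learn's `RFE` implements this loop directly. A sketch using the built-in breast_cancer sample (standardization is included because Logistic Regression is scale-sensitive):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)  # scale so LR coefficients are comparable
y = data.target

# Drop the weakest feature one at a time until 8 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8).fit(X, y)

kept = [name for name, keep in zip(data.feature_names, rfe.support_) if keep]
# rfe.ranking_: 1 = kept; the highest rank was eliminated first
```

The `ranking_` array matches the convention in the output above: rank 1 for every kept feature, and increasing ranks for features eliminated progressively earlier.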
```python
select_k_best("mailbox", k=10, score_func="mutual_info")
```
What it measures: Ranks all features by their individual MI score and picks the top K. Unlike RFE, this is univariate — it evaluates each feature independently, ignoring interactions.
When to use SelectKBest vs RFE:
| Method | Pros | Cons |
|---|---|---|
| SelectKBest | Fast, no model training needed | Ignores feature interactions |
| RFE | Considers interactions, model-specific | Slower, sensitive to model choice |
Decision: Use SelectKBest for quick screening. Use RFE for final feature selection before deployment.
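A sketch of the univariate path with scikit-learn's `SelectKBest`, again on the built-in breast_cancer sample:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

data = load_breast_cancer()

# Score each feature independently by mutual information, keep the 10 best
selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(data.data, data.target)
top10 = list(data.feature_names[selector.get_support()])
```

Because every feature is scored in isolation, this runs in one pass, which is why it suits quick screening.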
```python
evaluate_model("mailbox", model="random_forest")                    # all 20 features
evaluate_model("mailbox", features=[top 8], model="random_forest")  # 8 features
```

All 20 features:

```json
{
  "test_accuracy": 1.0000,
  "cv_mean_accuracy": 1.0000,
  "cv_std": 0.0000,
  "Legit": {"precision": 1.0, "recall": 1.0, "f1": 1.0},
  "Compromised": {"precision": 1.0, "recall": 1.0, "f1": 1.0}
}
```

Top 8 features only:

```json
{
  "test_accuracy": 1.0000,
  "cv_mean_accuracy": 1.0000
}
```
How to interpret each metric:
test_accuracy — Percentage of correct predictions on the held-out 30% test set. 1.0 = 100% correct. This tells you how well the model performs on data it hasn't seen.
cv_mean_accuracy — Average accuracy across 5-fold cross-validation. The dataset is split into 5 parts; the model trains on 4 and tests on 1, rotating 5 times. This is more reliable than a single train/test split because it tests on every data point.
cv_std — Standard deviation across the 5 folds. Low std (e.g., 0.000) means performance is consistent regardless of which data is in the test set. High std (e.g., 0.05+) means the model is sensitive to which examples it sees.
precision — Of all accounts the model flagged as compromised, what percentage actually were? Precision = 1.0 means zero false positives — no legitimate accounts were wrongly flagged.
recall — Of all actually compromised accounts, what percentage did the model catch? Recall = 1.0 means zero false negatives — no compromised accounts slipped through.
f1-score — The harmonic mean of precision and recall. F1 = 1.0 means both precision and recall are perfect. F1 is the single best metric for imbalanced datasets.
The critical finding: Both the 20-feature model and the 8-feature model achieve identical performance. This proves that 12 features are completely redundant. Fewer features = faster predictions, simpler model, easier to explain to stakeholders, less data to collect.
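The full-versus-subset comparison is easy to replicate with `cross_val_score`. A sketch on synthetic data shaped like the case study (20 features, only 5 informative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 20 features but only 5 carry signal, mimicking the mailbox case study
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=7)
rf = RandomForestClassifier(n_estimators=100, random_state=7)

full = cross_val_score(rf, X, y, cv=5).mean()   # all 20 features

# Rank features by RF importance and re-evaluate on the top 8 only
top8 = np.argsort(rf.fit(X, y).feature_importances_)[::-1][:8]
subset = cross_val_score(rf, X[:, top8], y, cv=5).mean()
```

When the discarded features are truly redundant, `subset` matches `full` to within cross-validation noise.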
```
random_forest        100.0%
logistic_regression  100.0%
gradient_boosting     99.9%
decision_tree         99.8%
```
How to interpret:
Decision: For production, choose based on your priority:
```
Subset   Features   CV Accuracy
top_1    1          99.42%
top_2    2          99.84%
top_3    3          99.98%
top_5    5          100.0%
top_8    8          100.0%
top_20   20         100.0%
```
What it measures: This incrementally adds features (ranked by RF importance) and measures how accuracy changes. It answers: "How many features do I actually need?"
How to interpret:
- `emails_sent_24h` alone correctly classifies 99.42% of accounts. Only ~29 out of 5,000 accounts are misclassified. This single feature is extraordinarily powerful.
- Adding `external_recipients_24h` fixes about half of the remaining errors. Meaningful improvement.
- Adding `emails_with_links_24h` fixes most of the remaining errors. Smaller but still valuable.

How to find the "elbow" (optimal feature count):
Look for where accuracy gains become negligible. Here:
The elbow is at 3-5 features. Beyond 5, you're adding complexity with no accuracy benefit.
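The incremental curve behind this elbow analysis can be sketched with plain scikit-learn (synthetic stand-in data; the server's `compare_feature_subsets` tool presumably does something similar internally):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=3)

# Rank all features once by Random Forest importance
ranked = np.argsort(
    RandomForestClassifier(random_state=3).fit(X, y).feature_importances_)[::-1]

# Grow the feature set best-first and track cross-validated accuracy
scores = {}
for k in (1, 2, 3, 5, 8, 20):
    scores[k] = cross_val_score(RandomForestClassifier(random_state=3),
                                X[:, ranked[:k]], y, cv=5).mean()
```

Plotting `scores` against `k` makes the elbow visible: accuracy climbs steeply at first, then flattens once the informative features are in.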
Decision: In practice, choose the smallest subset that meets your accuracy requirement:
The most reliable findings are those that multiple methods agree on. Here's the consensus view:
| Feature | RF Imp. | GB Imp. | Permutation | ANOVA F | MI | Target Corr | RFE | Verdict |
|---|---|---|---|---|---|---|---|---|
| emails_sent_24h | #1 | #1 | #1 | #1 | #1 | #1 | kept | Must-have |
| external_recipients_24h | #2 | #2 | #2 | #2 | #2 | #2 | kept | Must-have |
| emails_with_links_24h | #3 | #3 | #3 | #3 | #3 | #3 | kept | Must-have |
| send_spike_ratio | #4 | — | — | #5 | #4 | #5 | kept | Valuable |
| emails_with_attachments_24h | #5 | — | — | #4 | #5 | #4 | kept | Valuable |
| inbox_rules_changed_7d | #6 | — | — | #10 | #7 | #10 | kept | Moderate |
| mailbox_size_mb | #18 | — | — | #20 | #20 | #20 | eliminated | Drop |
| account_age_days | — | — | — | #19 | #19 | #19 | eliminated | Drop |
| department | #20 | — | — | #18 | #18 | #18 | eliminated | Drop |
Reading this table:
Email sending patterns are the strongest signal — Volume, recipients, and links dominate all importance rankings. When an attacker takes over a mailbox, the first thing they do is send emails (phishing, spam, BEC scams), creating an unmistakable spike.
Login anomalies are secondary — New IPs, odd hours, and multiple countries are useful individually but redundant when email patterns are present. They help catch compromised accounts that haven't started sending yet.
Account metadata is noise — Mailbox size, account age, and department have zero predictive power. Attackers don't target based on these attributes.
Feature reduction works — Dropping 75% of features (20 → 5) loses zero accuracy while making the model faster, simpler, and easier to explain.
Simple models suffice — Even Logistic Regression achieves 100% with the right features. Complex deep learning models are unnecessary for this problem.
Correlated features tell a story — The 6 correlated email features all spike together during an attack, representing a single underlying event (mass mailing burst). Understanding this clustering helps build intuition about the threat.
```bash
source .venv/bin/activate
python generate_mailbox_dataset.py   # creates data/mailbox_compromise.csv
python demo_mailbox_compromise.py    # runs the full 15-step pipeline
```
Charts are saved to demo_charts/mailbox/ (10 PNG files including importance plots, correlation heatmap, and confusion matrices).
```
FeatureEngineering/
    feature_eval_server.py        # MCP server (13 tools)
    generate_mailbox_dataset.py   # Mailbox compromise dataset generator
    demo_mailbox_compromise.py    # Mailbox compromise case study demo
    demo.py                       # Generic demo (Iris)
    data/                         # Generated datasets
        mailbox_compromise.csv
    demo_charts/                  # PNG charts generated by demos
        mailbox/                  # Mailbox case study charts (10 PNGs)
    requirements.txt              # Python dependencies
    .venv/                        # Virtual environment
    README.md
```
mcp, scikit-learn, pandas, numpy, matplotlib, seaborn