Persona MCP

data-science-workflow-2025.md•14.1 KiB

# Data Science Workflow 2025: From EDA to Production **Updated**: 2025-11-23 | **Stack**: Python, Pandas, Scikit-learn, XGBoost --- ## The Real DS Workflow (Not What Bootcamps Teach) ``` ❌ MYTH: 80% ML algorithms, 20% data prep ✅ REALITY: 80% data cleaning, 10% EDA, 10% modeling ACTUAL TIME BREAKDOWN: - Data collection & cleaning: 50% - Exploratory analysis: 20% - Feature engineering: 15% - Model training: 5% - Model tuning: 5% - Deployment & monitoring: 5% ``` --- ## Exploratory Data Analysis (EDA) ### First 10 Minutes with New Data ```python import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns # Load data df = pd.read_csv('customer_data.csv') # 1. Basic info print(df.shape) # (10000, 25) print(df.info()) # dtypes, nulls print(df.describe()) # summary stats # 2. Missing values missing = df.isnull().sum() print(missing[missing > 0].sort_values(ascending=False)) # Output: # income 1250 (12.5%) # age 450 (4.5%) # purchase_date 120 (1.2%) # 3. Duplicates print(f"Duplicates: {df.duplicated().sum()}") # 4. Target distribution df['churned'].value_counts(normalize=True) # False 0.75 ← Imbalanced! # True 0.25 # 5. Correlations corr = df.select_dtypes(include=[np.number]).corr() sns.heatmap(corr, annot=True, fmt='.2f') plt.show() ``` ### Advanced EDA Patterns ```python # Detect outliers (IQR method) def find_outliers(df, column): Q1 = df[column].quantile(0.25) Q3 = df[column].quantile(0.75) IQR = Q3 - Q1 lower_bound = Q1 - 1.5 * IQR upper_bound = Q3 + 1.5 * IQR outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)] return outliers outliers = find_outliers(df, 'income') print(f"Found {len(outliers)} outliers in income") # Distribution comparisons by target fig, axes = plt.subplots(2, 3, figsize=(15, 10)) numeric_cols = ['age', 'income', 'tenure', 'monthly_spend', 'support_tickets'] for ax, col in zip(axes.flatten(), numeric_cols): df.boxplot(column=col, by='churned', ax=ax) ax.set_title(f'{col} by Churn Status') plt.tight_layout() plt.show() # Key insight: Churned customers have >2x support tickets! ``` --- ## Data Cleaning & Preprocessing ### Handling Missing Values (The Right Way) ```python from sklearn.impute import SimpleImputer, KNNImputer # ❌ BAD: Drop all rows with ANY missing value df_clean = df.dropna() # Lost 60% of data! # ✅ GOOD: Strategic imputation # 1. Analyze missingness pattern import missingno as msno msno.matrix(df) msno.heatmap(df) # 2. Different strategies for different columns # Categorical: mode df['gender'].fillna(df['gender'].mode()[0], inplace=True) # Numeric (normally distributed): mean df['age'].fillna(df['age'].mean(), inplace=True) # Numeric (skewed): median df['income'].fillna(df['income'].median(), inplace=True) # Advanced: KNN imputation (uses similar rows) imputer = KNNImputer(n_neighbors=5) df[['age', 'income', 'tenure']] = imputer.fit_transform( df[['age', 'income', 'tenure']] ) # 3. Create missing indicator feature (sometimes missingness is informative!) df['income_was_missing'] = df['income'].isnull().astype(int) ``` ### Feature Engineering That Actually Works ```python # 1. Domain knowledge features df['customer_lifetime_months'] = ( pd.to_datetime('today') - pd.to_datetime(df['signup_date']) ).dt.days / 30 df['avg_monthly_spend'] = df['total_spend'] / df['customer_lifetime_months'] df['support_tickets_per_month'] = df['support_tickets'] / df['customer_lifetime_months'] # 2. Interaction features df['high_spend_low_satisfaction'] = ( (df['monthly_spend'] > df['monthly_spend'].quantile(0.75)) & (df['satisfaction_score'] < 3) ).astype(int) # 3. Polynomial features (use sparingly!) from sklearn.preprocessing import PolynomialFeatures poly = PolynomialFeatures(degree=2, include_bias=False) poly_features = poly.fit_transform(df[['age', 'income']]) # Creates: age, income, age², age×income, income² # 4. Binning continuous variables df['age_group'] = pd.cut( df['age'], bins=[0, 25, 35, 50, 65, 100], labels=['18-25', '26-35', '36-50', '51-65', '65+'] ) df['income_bracket'] = pd.qcut( df['income'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'] ) # 5. Time-based features df['signup_month'] = pd.to_datetime(df['signup_date']).dt.month df['signup_day_of_week'] = pd.to_datetime(df['signup_date']).dt.dayofweek df['is_weekend_signup'] = df['signup_day_of_week'].isin([5, 6]).astype(int) ``` --- ## Model Selection & Training ### The Scikit-learn Workflow ```python from sklearn.model_selection import train_test_split, cross_val_score from sklearn.preprocessing import StandardScaler from sklearn.ensemble import RandomForestClassifier from sklearn.linear_model import LogisticRegression from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix # 1. Prepare data X = df.drop(['churned', 'customer_id', 'signup_date'], axis=1) y = df['churned'] # 2. Handle categorical variables X = pd.get_dummies(X, drop_first=True) # 3. Train/test split (stratified for imbalanced data) X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42, stratify=y ) # 4. Scale features (important for logistic regression, SVM, neural nets) scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) # 5. Try multiple models models = { 'Logistic Regression': LogisticRegression(max_iter=1000, class_weight='balanced'), 'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced'), 'XGBoost': xgb.XGBClassifier(scale_pos_weight=3) # Handles imbalance } results = {} for name, model in models.items(): # Cross-validation cv_scores = cross_val_score( model, X_train_scaled, y_train, cv=5, scoring='roc_auc' ) # Train on full training set model.fit(X_train_scaled, y_train) # Evaluate y_pred = model.predict(X_test_scaled) y_proba = model.predict_proba(X_test_scaled)[:, 1] results[name] = { 'CV AUC': cv_scores.mean(), 'Test AUC': roc_auc_score(y_test, y_proba), 'Model': model } print(f"\n{name}") print(f"CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}") print(f"Test AUC: {roc_auc_score(y_test, y_proba):.3f}") print(classification_report(y_test, y_pred)) # Output: # Logistic Regression # CV AUC: 0.782 ± 0.012 # Test AUC: 0.791 # Random Forest # CV AUC: 0.856 ± 0.009 ← Best! # Test AUC: 0.863 # XGBoost # CV AUC: 0.849 ± 0.011 # Test AUC: 0.857 ``` ### Hyperparameter Tuning Done Right ```python from sklearn.model_selection import RandomizedSearchCV # Define parameter grid param_grid = { 'n_estimators': [100, 200, 300, 500], 'max_depth': [10, 20, 30, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'max_features': ['sqrt', 'log2', None] } # Randomized search (faster than GridSearchCV) rf = RandomForestClassifier(class_weight='balanced', random_state=42) random_search = RandomizedSearchCV( rf, param_distributions=param_grid, n_iter=50, # Try 50 random combinations cv=5, scoring='roc_auc', n_jobs=-1, # Use all CPU cores random_state=42, verbose=2 ) random_search.fit(X_train_scaled, y_train) print("Best parameters:", random_search.best_params_) print("Best CV AUC:", random_search.best_score_) # Best model best_model = random_search.best_estimator_ # Feature importance feature_importance = pd.DataFrame({ 'feature': X_train.columns, 'importance': best_model.feature_importances_ }).sort_values('importance', ascending=False) print(feature_importance.head(10)) # Output: # feature importance # support_tickets_per_month 0.234 ← Top predictor! # monthly_spend 0.156 # satisfaction_score 0.142 # tenure 0.098 ``` --- ## Handling Imbalanced Data ### Real-World Imbalance Problem ```python # Problem: 97% non-fraud, 3% fraud print(y_train.value_counts(normalize=True)) # 0 0.97 # 1 0.03 ← Severe imbalance! # ❌ BAD: Ignore imbalance model.fit(X_train, y_train) # Result: 97% accuracy by predicting all 0s! # ✅ GOOD: Multiple strategies # 1. Class weights (built into most sklearn models) from sklearn.ensemble import RandomForestClassifier model = RandomForestClassifier(class_weight='balanced') # 2. SMOTE (Synthetic Minority Over-sampling) from imblearn.over_sampling import SMOTE smote = SMOTE(random_state=42) X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train) print(y_train_balanced.value_counts()) # 0 9700 # 1 9700 ← Balanced! # 3. Undersampling majority class from imblearn.under_sampling import RandomUnderSampler rus = RandomUnderSampler(random_state=42) X_train_under, y_train_under = rus.fit_resample(X_train, y_train) # 4. Combination (SMOTE + Undersampling) from imblearn.combine import SMOTEENN smoteenn = SMOTEENN(random_state=42) X_train_combined, y_train_combined = smoteenn.fit_resample(X_train, y_train) # 5. Threshold adjustment (works well for probability-based models) y_proba = model.predict_proba(X_test)[:, 1] # Default threshold: 0.5 y_pred_default = (y_proba >= 0.5).astype(int) # Optimized threshold (maximize F1 score) from sklearn.metrics import f1_score best_threshold = 0.5 best_f1 = 0 for threshold in np.arange(0.1, 0.9, 0.05): y_pred = (y_proba >= threshold).astype(int) f1 = f1_score(y_test, y_pred) if f1 > best_f1: best_f1 = f1 best_threshold = threshold print(f"Best threshold: {best_threshold}") # e.g., 0.35 print(f"Best F1 score: {best_f1}") ``` --- ## Model Evaluation (Beyond Accuracy) ### Confusion Matrix Interpretation ```python from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay y_pred = model.predict(X_test) cm = confusion_matrix(y_test, y_pred) print(cm) # Predicted # 0 1 # Actual 0 950 50 ← 50 False Positives # 1 80 220 ← 80 False Negatives (BAD!) # Metrics: # Accuracy: (950+220)/1300 = 0.90 (misleading with imbalance!) # Precision: 220/(220+50) = 0.81 (of predicted positives, 81% correct) # Recall: 220/(220+80) = 0.73 (caught 73% of actual positives) # F1-Score: 2*(0.81*0.73)/(0.81+0.73) = 0.77 # Cost-sensitive evaluation cost_fp = 10 # False positive costs $10 cost_fn = 100 # False negative costs $100 (fraud missed!) total_cost = (50 * cost_fp) + (80 * cost_fn) print(f"Total cost: ${total_cost}") # $8,500 # With optimized threshold: # FP: 120, FN: 20 # Cost: (120*10) + (20*100) = $3,200 ← Much better! ``` ### ROC Curve & AUC ```python from sklearn.metrics import roc_curve, auc y_proba = model.predict_proba(X_test)[:, 1] fpr, tpr, thresholds = roc_curve(y_test, y_proba) roc_auc = auc(fpr, tpr) plt.figure(figsize=(8, 6)) plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})') plt.plot([0, 1], [0, 1], 'k--', label='Random') plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.title('ROC Curve') plt.legend() plt.show() # AUC interpretation: # 0.5: Random guessing # 0.7-0.8: Fair # 0.8-0.9: Good # 0.9-1.0: Excellent # 1.0: Perfect (suspicious - check for data leakage!) ``` --- ## Common Pitfalls & How to Avoid ### Data Leakage (Silent Killer) ```python # ❌ LEAKAGE EXAMPLE 1: Using future data # Problem: 'churned' column includes info from after prediction date df['total_purchases_last_30_days'] # Uses data from FUTURE! # ✅ FIX: Only use data available at prediction time df['total_purchases_before_date'] = df.apply( lambda row: df[ (df['customer_id'] == row['customer_id']) & (df['purchase_date'] < row['prediction_date']) ]['purchase_amount'].sum(), axis=1 ) # ❌ LEAKAGE EXAMPLE 2: Scaling before split scaler = StandardScaler() X_scaled = scaler.fit_transform(X) # LEAKAGE! Test data influenced scaling X_train, X_test = train_test_split(X_scaled) # ✅ FIX: Fit on train, transform on test X_train, X_test, y_train, y_test = train_test_split(X, y) scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) # Only transform, don't fit! # ❌ LEAKAGE EXAMPLE 3: Target encoding without cross-validation df['category_mean_target'] = df.groupby('category')['target'].transform('mean') # ✅ FIX: Use TargetEncoder with CV from category_encoders import TargetEncoder encoder = TargetEncoder() df['category_encoded'] = encoder.fit_transform(df['category'], df['target']) ``` --- ## Model Deployment ### Saving & Loading Models ```python import joblib # Save model + preprocessing pipeline model_artifacts = { 'model': best_model, 'scaler': scaler, 'feature_names': X_train.columns.tolist(), 'threshold': best_threshold } joblib.dump(model_artifacts, 'churn_model_v1.pkl') # Load in production artifacts = joblib.load('churn_model_v1.pkl') def predict_churn(customer_data): # Ensure same feature order X = customer_data[artifacts['feature_names']] # Scale X_scaled = artifacts['scaler'].transform(X) # Predict probability proba = artifacts['model'].predict_proba(X_scaled)[:, 1] # Apply optimized threshold prediction = (proba >= artifacts['threshold']).astype(int) return { 'will_churn': bool(prediction[0]), 'churn_probability': float(proba[0]), 'risk_level': 'high' if proba[0] > 0.7 else 'medium' if proba[0] > 0.4 else 'low' } ``` --- ## Key Takeaways 1. **80% of work is data cleaning** - Get comfortable with messy data 2. **Feature engineering > fancy algorithms** - Domain knowledge wins 3. **Always check for leakage** - Most common reason models fail in production 4. **Imbalanced data requires special handling** - Accuracy is misleading 5. **ROC AUC > Accuracy** - Better metric for most business problems 6. **Cross-validation is mandatory** - Single train/test split is not enough 7. **Interpretability matters** - Stakeholders need to trust your model --- ## References - "Hands-On Machine Learning" - Aurélien Géron - "Feature Engineering for Machine Learning" - Alice Zheng - Scikit-learn Documentation - Kaggle Competitions (best learning resource) **Related**: `deep-learning.md`, `feature-engineering.md`, `ml-deployment.md`

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/seanshin0214/persona-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

data-science-workflow-2025.md•14.1 KiB