Feature Engineering & Selection

Master the art and science of transforming raw data into powerful features that drive ML model performance.

35 min read · Intermediate

🔧 The Art and Science of Feature Engineering

Feature engineering is often the difference between a mediocre and an exceptional ML model. It's the process of transforming raw data into features that better represent the underlying problem. Good features can make simple models outperform complex ones, while poor features can cripple sophisticated algorithms.

Impact Reality: Data scientists spend 60-80% of their time on feature engineering. A well-engineered feature can improve model performance far more than hyperparameter tuning can, often by an order of magnitude.

80% of model performance · 1000+ features possible · 10x performance gain · 60% of project time

📊 Feature Engineering by Data Type

Numerical Features

Transform and engineer continuous numerical values

  • Scaling & Normalization
  • Log/Box-Cox Transformations
  • Rolling Statistics
  • Interaction Terms
  • Binning & Discretization
  • Outlier Treatment

Implementation

import pandas as pd
import numpy as np

class NumericalFeatureEngineer:
    def __init__(self):
        self.scalers = {}
        self.transformations = {}

    def engineer_features(self, df: pd.DataFrame, target_col: str = None):
        """Comprehensive numerical feature engineering"""
        features = df.copy()

        # Capture the original numeric columns once so that features
        # created inside the loop are not themselves re-processed
        numeric_cols = [
            col for col in features.select_dtypes(include=[np.number]).columns
            if col != target_col
        ]

        for col in numeric_cols:
            # 1. Rolling statistics (if time series)
            if 'timestamp' in features.columns:
                features[f'{col}_rolling_mean_7d'] = features[col].rolling(7).mean()
                features[f'{col}_rolling_std_7d'] = features[col].rolling(7).std()
                features[f'{col}_rolling_min_7d'] = features[col].rolling(7).min()
                features[f'{col}_rolling_max_7d'] = features[col].rolling(7).max()

            # Lag features (assumes rows are already time-ordered)
            features[f'{col}_lag_1'] = features[col].shift(1)
            features[f'{col}_lag_7'] = features[col].shift(7)

            # Rate of change
            features[f'{col}_pct_change'] = features[col].pct_change()
            features[f'{col}_diff'] = features[col].diff()

            # 2. Interaction features between pairs of original columns
            for other_col in numeric_cols:
                if other_col != col:
                    # Small epsilon avoids division by zero
                    features[f'{col}_{other_col}_ratio'] = features[col] / (features[other_col] + 1e-8)
                    features[f'{col}_{other_col}_product'] = features[col] * features[other_col]

        return features
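
The class above covers rolling statistics, lags, and interactions, but not the log/Box-Cox, binning, and outlier-treatment techniques listed earlier. A minimal sketch of those three steps, assuming a single positive-valued column whose name (amount) is hypothetical:

import numpy as np
import pandas as pd
from scipy import stats

def transform_numeric_column(df: pd.DataFrame, col: str = 'amount') -> pd.DataFrame:
    """Sketch of log/Box-Cox transforms, binning, and outlier clipping for one column."""
    out = df.copy()

    # Log transform: log1p handles zeros and reduces right skew
    out[f'{col}_log'] = np.log1p(out[col])

    # Box-Cox requires strictly positive values
    if (out[col] > 0).all():
        out[f'{col}_boxcox'], _ = stats.boxcox(out[col])

    # Binning / discretization: equal-frequency quartile bins
    out[f'{col}_bin'] = pd.qcut(out[col], q=4, labels=False, duplicates='drop')

    # Outlier treatment: clip to the 1st and 99th percentiles
    lower, upper = out[col].quantile([0.01, 0.99])
    out[f'{col}_clipped'] = out[col].clip(lower, upper)

    return out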

🎯 Feature Selection Methods

Filter Methods

Select features based on statistical properties independent of ML models

Fast computation · Model-agnostic · Good for initial screening

Implementation

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, chi2

def filter_feature_selection(X, y, method='f_score', k=50):
    """Filter-based feature selection"""

    if method == 'f_score':
        # ANOVA F-score for numerical features
        selector = SelectKBest(score_func=f_classif, k=k)
    elif method == 'mutual_info':
        # Mutual information for mixed data types
        selector = SelectKBest(score_func=mutual_info_classif, k=k)
    elif method == 'chi2':
        # Chi-square for categorical features (requires non-negative features)
        selector = SelectKBest(score_func=chi2, k=k)
    else:
        raise ValueError(f"Unknown method: {method}")

    X_selected = selector.fit_transform(X, y)
    selected_features = X.columns[selector.get_support()]
    feature_scores = selector.scores_

    # Create ranking
    feature_ranking = pd.DataFrame({
        'feature': X.columns,
        'score': feature_scores,
        'selected': selector.get_support()
    }).sort_values('score', ascending=False)

    return X_selected, selected_features, feature_ranking
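
Filter methods are one of three broad families; the key takeaways below also mention wrapper and embedded methods. As a hedged sketch, an embedded approach can reuse a model's own importance scores via SelectFromModel (the estimator and threshold here are illustrative choices, not part of the material above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

def embedded_feature_selection(X, y, threshold='median'):
    """Embedded selection: keep features whose tree importance exceeds the threshold."""
    estimator = RandomForestClassifier(n_estimators=200, random_state=42)
    selector = SelectFromModel(estimator, threshold=threshold)
    selector.fit(X, y)
    return X.columns[selector.get_support()]

# Example usage of the filter function above (X_train / y_train are hypothetical splits):
# X_selected, selected, ranking = filter_feature_selection(X_train, y_train, method='mutual_info', k=30)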

⚡ Transformation Techniques

Scaling & Normalization

Standardize features to similar scales

Implementation

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Standard Scaling (z-score normalization)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Min-Max Scaling (0-1 range)
minmax_scaler = MinMaxScaler()
X_minmax = minmax_scaler.fit_transform(X)

# Robust Scaling (median and IQR)
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)
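
Note that these snippets fit the scalers on the full matrix X for brevity. In practice, fit the scaler on the training split only and reuse its statistics on unseen data; a minimal sketch, assuming hypothetical X_train / X_test splits:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn mean and std from the training data only...
X_train_scaled = scaler.fit_transform(X_train)
# ...then apply the same statistics to unseen data
X_test_scaled = scaler.transform(X_test)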

✅ Feature Engineering Best Practices

Do's

  • ✓ Start with domain knowledge and EDA
  • ✓ Create features that make business sense
  • ✓ Handle missing values before engineering
  • ✓ Use cross-validation for feature selection (see the pipeline sketch after these lists)
  • ✓ Track feature importance and stability
  • ✓ Document feature creation logic
  • ✓ Test features on out-of-time data

Don'ts

  • ✗ Create features using future information
  • ✗ Use target for feature engineering (leakage)
  • ✗ Ignore feature scaling for distance-based models
  • ✗ Create too many correlated features
  • ✗ Over-engineer features (curse of dimensionality)
  • ✗ Forget to validate feature distributions
  • ✗ Use complex features without interpretation
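
To make cross-validated feature selection and leakage prevention concrete, here is a minimal sketch (the estimator, k, and scoring choices are illustrative, and X / y stand for your feature matrix and target). Placing SelectKBest inside a Pipeline means selection is re-fit on each training fold only, so it never sees the corresponding validation fold:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Selection happens inside each CV fold, never on the held-out part
pipeline = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=20)),
    ('model', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"Mean CV AUC: {scores.mean():.3f}")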

🎯 Key Takeaways

Domain knowledge is king: Understanding the business context leads to the most impactful features

Match techniques to data types: Numerical, categorical, temporal, and text data require different approaches

Select features systematically: Use filter, wrapper, or embedded methods based on your constraints

Prevent data leakage: Never use future information or target-derived features

Validate thoroughly: Test features on out-of-time data and monitor stability in production

📝 Feature Engineering Mastery Check


What is the main purpose of cyclical encoding for temporal features?