Feature Transformation in Machine Learning: Why, When and How?

  1. Improves model performance
  • How? Features with very different magnitudes can be put on a comparable scale
    Extreme values lead to extreme estimates when modelling. A dataset whose features have very different magnitudes poses a problem for some machine learning algorithms, as a feature with larger magnitude may be given a higher weighting than features with smaller values. Distance-based algorithms like K-Nearest-Neighbours, SVM, and K-means are particularly susceptible, since distances are calculated from the raw values of the data points to determine their similarity. Feature transformation allows these algorithms to consider each feature equally (see the sketch after this list).
  • How? Non-linear relationships can be modelled better
    Given that the relationship between a continuous feature and the target variable may not be linear, applying some form of transformation can capture the relationship better and lead to an improvement in model performance.
  • Transformation can improve the numerical stability of models that use gradient-descent optimisation, such as linear regression, logistic regression and neural networks, and can also speed up the training process. So although the predictive performance of these models may be unaffected by scaling, there are other advantages to performing the transformation.
  • It is also important to note that the interpretation of model coefficients is affected by the scale of the features. In a dataset with Income and Age, Income ranges from $0 to, say, $100K, whereas Age is likely to fall between 0 and 100, so Income can be 1,000 times larger in value than Age. The parameters of a multivariate linear regression adjust to counteract the scale of each feature, so the model itself is largely unaffected: Income is simply compensated with a smaller coefficient than Age. As such, comparing raw coefficients to assess feature importance is not meaningful unless the features are on the same scale.
  • How? Variability of the feature due to extreme values could be reduced
    When we transform a feature, its skewness can be reduced, so the transformed values that go into the model are less extreme.
  • How? More resilient to changes in the underlying distribution of the feature
    In particular, grouping a numerical feature into a few ordinal buckets (method 3 below) promotes robustness, as the binned feature is more resilient to changes in the feature's distribution over time.
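To make the scale effect concrete, here is a minimal sketch (with made-up Income/Age values) of how standardisation changes the Euclidean distances that KNN, K-means and SVM rely on:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Made-up data: Income (dollars) and Age (years) on very different scales.
    X = np.array([[30_000, 25],
                  [32_000, 60],
                  [90_000, 27]], dtype=float)

    # Raw distances: Income dominates, so row 1 looks far closer to row 0
    # than row 2 does, even though its Age differs by 35 years.
    print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))

    # After standardisation, both features contribute on a comparable scale.
    Xs = StandardScaler().fit_transform(X)
    print(np.linalg.norm(Xs[0] - Xs[1]), np.linalg.norm(Xs[0] - Xs[2]))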
There are three common methods of transformation:

  1. Scaling of feature values (standardisation or normalisation; see the next two sections)
  2. Log-transformation of feature values
    - Typically used on continuous variables that are naturally skewed with a long tail of extreme values, e.g. Income. Log-transformation pulls a right-skewed distribution closer to normal.
  3. Categorisation of numerical features into dummy or ordinal variables
  • This method can be applied through three approaches: quantiles, meaningful/arbitrary binning, and automated categorisation through information gain (the first two are sketched below).
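A hedged sketch of methods 2 and 3 on a made-up Income column, using log1p for the log-transformation and pandas qcut/cut for the two binning approaches:

    import numpy as np
    import pandas as pd

    income = pd.Series([20_000, 35_000, 40_000, 55_000, 90_000, 1_000_000])

    # Method 2: log-transformation compresses the long right tail
    # (log1p also handles zero values safely).
    income_log = np.log1p(income)

    # Method 3, quantile approach: equal-frequency ordinal buckets.
    income_quartile = pd.qcut(income, q=4, labels=False)

    # Method 3, meaningful/arbitrary binning: hand-picked edges.
    income_band = pd.cut(income, bins=[0, 30_000, 60_000, np.inf],
                         labels=["low", "mid", "high"])

    print(pd.DataFrame({"raw": income, "log": income_log,
                        "quartile": income_quartile, "band": income_band}))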

Why and when do you standardise?
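Standardisation rescales each feature to zero mean and unit variance, which suits models that expect roughly centred, comparably scaled inputs. A minimal sketch with scikit-learn's StandardScaler (toy data):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[30_000, 25], [50_000, 40], [90_000, 60]], dtype=float)

    # (x - mean) / std, computed per column.
    X_std = StandardScaler().fit_transform(X)
    print(X_std.mean(axis=0), X_std.std(axis=0))  # approx. [0, 0] and [1, 1]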

Why and when do you normalise?
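Normalisation (min-max scaling) squeezes each feature into a fixed range, usually [0, 1], and is a common choice when the feature does not follow a Gaussian distribution. A minimal sketch with MinMaxScaler:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[30_000, 25], [50_000, 40], [90_000, 60]], dtype=float)

    # (x - min) / (max - min), computed per column.
    X_norm = MinMaxScaler().fit_transform(X)
    print(X_norm)  # every column now lies within [0, 1]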

Dealing with Extreme Values

  1. An estimate of the feature's cumulative distribution function (CDF) is computed.
  2. The CDF is used to map the original values onto a uniform distribution, so that frequent values are spread out evenly; a normal output distribution can be requested instead via a parameter of the function.
  3. The obtained values are then mapped onto the desired output distribution using the associated quantile function.
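These steps describe scikit-learn's QuantileTransformer; a minimal sketch on made-up right-skewed data:

    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    rng = np.random.default_rng(0)
    X = rng.lognormal(size=(1000, 1))  # heavily right-skewed sample

    # Map through the empirical CDF to a uniform output...
    X_uni = QuantileTransformer(output_distribution="uniform",
                                random_state=0).fit_transform(X)
    # ...or through the normal quantile function to a Gaussian output.
    X_gauss = QuantileTransformer(output_distribution="normal",
                                  random_state=0).fit_transform(X)

    print(X_uni.min(), X_uni.max())       # approx. 0 and 1
    print(X_gauss.mean(), X_gauss.std())  # approx. 0 and 1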

Illustration

[Figure: preview of the original data vs other transformation techniques]
[Figure: descriptive stats of the various transformation techniques]
[Figure: distribution plots of the various transformation techniques]
Summary

Other methods of transformation

  1. from sklearn.preprocessing import MaxAbsScaler - scales a feature by dividing its values by the feature's maximum absolute value.
  2. from sklearn.preprocessing import PowerTransformer - also changes the distribution of the feature to be more Gaussian.
  • It automates the choice between square-root/cube-root/log transformations by introducing a parameter lambda and finding the best value of lambda via the Box-Cox or Yeo-Johnson transformation.
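A short sketch of both scalers on made-up, strictly positive, right-skewed data (Box-Cox requires positive input; Yeo-Johnson also accepts zero and negative values):

    import numpy as np
    from sklearn.preprocessing import MaxAbsScaler, PowerTransformer

    rng = np.random.default_rng(0)
    X = rng.lognormal(size=(500, 1))  # strictly positive, right-skewed

    # MaxAbsScaler: divide each column by its maximum absolute value.
    X_maxabs = MaxAbsScaler().fit_transform(X)  # now bounded by [-1, 1]

    # PowerTransformer: fit the lambda that makes the output most Gaussian.
    pt = PowerTransformer(method="box-cox")
    X_gauss = pt.fit_transform(X)
    print(pt.lambdas_)  # the fitted lambda for each feature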

Final Notes

  1. General questions we could ask ourselves before deciding which transformation technique to apply (if needed at all):
  • What algorithm are we applying? Does it assume any distribution of the data?
    - Tree-based algorithms are insensitive to the scale of features, given that they are rule-based. A node of a decision tree splits by maximising information gain on a single feature and is not influenced by the other features; scaling that feature, or any of the others, changes neither the information gain nor the homogeneity of the node. Hence trees are invariant to the scale of features and robust to outliers.
    - Distance-based algorithms like K-Nearest-Neighbours, SVM, and K-means are likely to be influenced by extreme values. Feature transformation allows these algorithms to consider each feature equally.
    - For algorithms such as linear regression, logistic regression and neural networks that use gradient descent as the optimisation technique, scaling improves numerical stability and speeds up training.
    - Algorithms such as K-Nearest-Neighbours and neural networks do not assume any distribution, so normalisation can be applied safely.
  • Does the feature follow a Gaussian distribution?
    - Normalisation works well when the feature does not follow a Gaussian distribution.
  • Are there outliers/extreme values in the feature?
    - Robust scaling and the quantile transformer are great techniques in this case.
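As a closing sketch, compare StandardScaler with RobustScaler (median/IQR based) on toy data containing one extreme value:

    import numpy as np
    from sklearn.preprocessing import RobustScaler, StandardScaler

    X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one outlier

    # The outlier inflates the standard deviation, squashing the normal points.
    print(StandardScaler().fit_transform(X).ravel())

    # (x - median) / IQR: the bulk of the data keeps a sensible spread.
    print(RobustScaler().fit_transform(X).ravel())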
