Feature Engineering

Feature Generation

This is the process of transforming existing features into new ones that are more relevant to the target variable. The following techniques are used in feature generation:

  • Operations - A feature generation technique that generates new features based on mathematical operations on the existing numerical features.
  • AutoLearn - A regression-based feature generation algorithm. Features are generated by mining pairwise feature associations, identifying the linear or non-linear relationship between each pair, applying regression, and selecting those relationships that are stable and improve prediction performance.
  • ExploreKit - Generates a large set of candidate features by combining information in the original features, with the aim of maximising predictive performance according to user-selected criteria.
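As a minimal sketch of the first technique (operations), new features can be derived by applying arithmetic to existing numerical columns. The column names here are purely illustrative:

```python
import pandas as pd

# Hypothetical numeric dataset; column names are illustrative.
df = pd.DataFrame({
    "length": [2.0, 3.0, 5.0],
    "width":  [1.0, 4.0, 2.0],
})

# Generate new features via mathematical operations on existing ones.
df["area"] = df["length"] * df["width"]          # pairwise product
df["aspect_ratio"] = df["length"] / df["width"]  # pairwise ratio
df["length_sq"] = df["length"] ** 2              # polynomial term

print(df.columns.tolist())
```

Which operations are worth keeping is ultimately an empirical question; generated features should be validated against held-out model performance.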

Feature Selection

The techniques below help to decrease the dimensionality of the feature space, streamline the model, and enhance its generalisation performance by choosing a subset of relevant features from the original feature list in the dataset.

  • Embedded - A technique where feature selection is integrated into the process of training a machine learning model. The model itself decides which features are most relevant during training.
  • Filter - A technique that involves selecting the most relevant features based on their statistical properties or ranking scores.
  • Redundancy Elimination - A process of removing features from a dataset that provide similar or duplicate information.
  • Backward Feature Elimination - A technique that starts with all features in the dataset and iteratively removes the least significant features one at a time.
  • Exhaustive Feature Selection - A technique that evaluates all possible combinations of features to find the optimal subset that yields the best model performance.
  • Forward Selection - A technique that starts with an empty set of features and iteratively adds the most significant features one at a time.
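As a sketch of forward selection, scikit-learn's `SequentialFeatureSelector` starts from an empty feature set and greedily adds the feature that most improves cross-validated performance. The estimator and dataset here are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 4 numeric features, 3 classes

# Forward selection: begin with no features, iteratively add the one
# that most improves cross-validated accuracy until 2 are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask over the 4 features
```

Setting `direction="backward"` instead gives backward feature elimination, which starts from the full set and removes the least significant feature at each step.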

Feature Reduction

These techniques address the “curse of dimensionality”, which arises when an algorithm struggles to train an effective model because the dataset has a large number of features relative to the number of observations. The following techniques are commonly employed:

  • PCA - Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving as much of the original data’s variability as possible.
  • FA - Factor Analysis (FA) is a statistical technique used to uncover underlying latent variables (factors) that explain patterns of correlations among observed variables in a dataset. It is commonly employed for dimensionality reduction and to gain insights into the structure of complex data.
  • NMF - NMF (Non-Negative Matrix Factorization) is a dimensionality reduction and feature extraction technique that is particularly useful when dealing with non-negative data, such as text data or image data with pixel intensities.
  • ICA - ICA (Independent Component Analysis) is a technique used to separate a multivariate signal into statistically independent components, assuming that the observed data is a linear combination of non-Gaussian and independent source signals.
  • LDA - LDA (Linear Discriminant Analysis) is a supervised dimensionality reduction and classification technique used to find a linear combination of features that best separates two or more classes in the data.
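As a minimal sketch of PCA, the synthetic data below is deliberately constructed (an illustrative assumption) so that most of its variance lies along one direction, which the first principal component captures:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 3-D data where the third column is almost a linear
# function of the first, so the data is effectively ~2-dimensional.
X = rng.normal(size=(100, 3))
X[:, 2] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Project onto the top 2 principal components, preserving as much
# of the original variance as possible.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # variance retained per component
```

Inspecting `explained_variance_ratio_` is the usual way to choose how many components to keep: when the retained components account for most of the variance, the discarded dimensions carry little information.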

Last Updated 2023-10-08 10:48:45 +0530