Operations in QuickML

Data preprocessing is the step in which data is transformed, or encoded, into a form that the machine can parse. In other words, after preprocessing, the features of the data can be easily interpreted by the algorithm.

  1. Encoding
  2. Feature Engineering
  3. Imputation
  4. Normalization
  5. Transformers

Encoding

Encoding is a technique for converting categorical (discrete) variables into numerical values so that they can easily be fit to a machine-learning model.

  1. Ordinal Encoder

    Ordinal encoding maps each unique label to an integer value. This type of encoding is appropriate only when there is a known ordering relationship among the categories; if the data is ordered, we can use ordinal encoding.
    Example:
    For the temperature values Low, Normal, and High, we can use ordinal encoding. After encoding, the data will look like 0, 1, 2 (0 → Low, 1 → Normal, 2 → High). Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in; in that case, we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order, and integers are assigned arbitrarily.
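
    As a concrete illustration, here is a minimal sketch using the open-source category_encoders package; QuickML's internal implementation may differ:

      import pandas as pd
      import category_encoders as ce

      df = pd.DataFrame({'temperature': ['Low', 'Normal', 'High', 'Low']})
      # Pass the known order explicitly; without a mapping, integers are
      # assigned arbitrarily (by order of appearance).
      encoder = ce.OrdinalEncoder(mapping=[{'col': 'temperature',
                                            'mapping': {'Low': 0, 'Normal': 1, 'High': 2}}])
      print(encoder.fit_transform(df))  # temperature becomes 0, 1, 2, 0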

  2. One-Hot Encoding

    We use this categorical data-encoding technique when the features are nominal (have no intrinsic order). In one-hot encoding, we create a new binary variable for each level of the categorical feature, containing either 0 or 1: 0 represents the absence of that category, and 1 represents its presence. One-hot encoding can be applied effectively when the categorical feature is not ordinal and the number of categories is small.

    Sample input:

    color
    blue
    red
    green

    Sample output:

    color_blue  color_red  color_green
    1           0          0
    0           1          0
    0           0          1
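
    A minimal sketch that reproduces the sample above, again using category_encoders for illustration:

      import pandas as pd
      import category_encoders as ce

      df = pd.DataFrame({'color': ['blue', 'red', 'green']})
      # use_cat_names=True names the output columns color_blue, color_red, ...
      encoder = ce.OneHotEncoder(cols=['color'], use_cat_names=True)
      print(encoder.fit_transform(df))
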
  3. JamesStein Encoder

    For each feature value, the James-Stein estimator returns a weighted average of:

    1. The mean target value for the observed feature value.
    2. The mean target value (regardless of the feature value).
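
    A minimal sketch, again using category_encoders for illustration; the encoder needs the target column y to compute the two means:

      import pandas as pd
      import category_encoders as ce

      X = pd.DataFrame({'city': ['A', 'A', 'B', 'B', 'B']})
      y = pd.Series([1, 0, 1, 1, 0])
      # Each city is replaced by a weighted blend of its own mean target
      # value and the global mean target value.
      encoder = ce.JamesSteinEncoder(cols=['city'])
      print(encoder.fit_transform(X, y))
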
  4. Label Encoding

    Label encoding converts a categorical target column into a numerical column by assigning a unique integer label to each category. It’s important to note that this introduces an ordering over the categories, which may not be meaningful in every case; it is most appropriate for ordinal variables, where there is an inherent order or ranking among the categories.
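
    A minimal sketch using scikit-learn's LabelEncoder for illustration, which is intended for target columns:

      from sklearn.preprocessing import LabelEncoder

      target = ['high', 'low', 'medium', 'low']
      encoder = LabelEncoder()
      # Classes are sorted alphabetically: high -> 0, low -> 1, medium -> 2.
      print(encoder.fit_transform(target))  # [0 1 2 1]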

  5. LeaveOneOut Encoder

    Leave-one-out encoding calculates the mean of the target variable over all records that share the same value of the categorical feature in question. The algorithm differs slightly between the training and test data sets: for the training set, the record under consideration is left out of the mean, hence the name "leave one out".
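
    A minimal sketch with category_encoders for illustration; note how each training row's own target is excluded from its mean:

      import pandas as pd
      import category_encoders as ce

      X = pd.DataFrame({'shop': ['A', 'A', 'A', 'B', 'B']})
      y = pd.Series([1, 0, 1, 1, 0])
      encoder = ce.LeaveOneOutEncoder(cols=['shop'])
      # The first 'A' row is encoded as the mean of the *other* 'A'
      # targets: (0 + 1) / 2 = 0.5.
      print(encoder.fit_transform(X, y))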

  6. Target Encoding

    In target encoding, we calculate the mean of the target variable for each category and replace each categorical value with that mean. In the case of a categorical target variable, the posterior probability of the target replaces each category. Any non-categorical columns are automatically dropped by the target encoder model.
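
    A minimal illustrative sketch with category_encoders, whose TargetEncoder additionally smooths each category mean toward the global mean:

      import pandas as pd
      import category_encoders as ce

      X = pd.DataFrame({'city': ['A', 'A', 'B', 'B']})
      y = pd.Series([1, 0, 1, 1])
      encoder = ce.TargetEncoder(cols=['city'])
      # 'A' is replaced by a smoothed version of mean(1, 0) = 0.5, and
      # 'B' by a smoothed version of mean(1, 1) = 1.0.
      print(encoder.fit_transform(X, y))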

  7. Count Encoder

    Count encoding replaces each category with its count, computed on the training set. Counts may coincide for two categories, which results in a collision: both categories are encoded as the same value. Count encoding therefore works best when the category counts are distinct.

    Sample input:  10 10 20 30 30 30
    Sample output: 2 2 1 3 3 3
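
    A minimal sketch reproducing the sample above with category_encoders:

      import pandas as pd
      import category_encoders as ce

      X = pd.DataFrame({'value': [10, 10, 20, 30, 30, 30]})
      encoder = ce.CountEncoder(cols=['value'])
      # 10 appears twice, 20 once, 30 three times -> 2 2 1 3 3 3
      print(encoder.fit_transform(X))
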
  8. Backward Difference Encoding

    In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
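
    A minimal illustrative sketch with category_encoders; note that this encoder emits a set of contrast columns rather than a single integer column:

      import pandas as pd
      import category_encoders as ce

      X = pd.DataFrame({'grade': ['low', 'medium', 'high', 'medium']})
      encoder = ce.BackwardDifferenceEncoder(cols=['grade'])
      # For k levels this produces k-1 contrast columns (plus an
      # intercept column), one for each adjacent pair of levels.
      print(encoder.fit_transform(X))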

  9. Helmert Encoding

    The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. This comparison does not make much sense for a nominal variable, such as race.
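
    A minimal illustrative sketch with category_encoders:

      import pandas as pd
      import category_encoders as ce

      X = pd.DataFrame({'grade': ['low', 'medium', 'high', 'medium']})
      encoder = ce.HelmertEncoder(cols=['grade'])
      # Each contrast column compares one level with the mean over the
      # levels before it, as described above.
      print(encoder.fit_transform(X))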

  10. Catboost Encoding

    Catboost is a target-based categorical encoder. It replaces a categorical feature with the average target value for that category in the training dataset, combined with the target probability over the entire dataset. Computed naively, this introduces target leakage, because the target is used to predict the target; the Catboost scheme mitigates this by processing records in order and computing each record's encoding only from the records that precede it.
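
    A minimal illustrative sketch with category_encoders, whose CatBoostEncoder processes rows in order:

      import pandas as pd
      import category_encoders as ce

      X = pd.DataFrame({'city': ['A', 'A', 'B', 'A', 'B']})
      y = pd.Series([1, 0, 1, 1, 0])
      encoder = ce.CatBoostEncoder(cols=['city'])
      # Row encodings are computed from the target values of earlier
      # rows only, which limits the leakage described above.
      print(encoder.fit_transform(X, y))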

Last Updated 2023-10-08 10:48:45 +0530