Key Concepts

Before you read more about using AutoML, ensure that you understand the following concepts of Zia AutoML.

Model

A model is a set of computations generated as a result of training the input dataset using various machine learning algorithms. You can use the AutoML model to make predictions in the dataset for various conditions. A model is therefore a mathematical representation of a real-world process which you can perform an in-depth analysis on to test various hypotheses.

Once a model is generated in AutoML, you can provide a set of input values and generate a set of predictive output values based on the patterns observed in the dataset.

Dataset

An input training dataset is the collection of structured data that you provide for the model to analyze and learn from so that it can make predictions. In AutoML, you must provide the dataset as a CSV file that contains columns and rows of data. You can upload the CSV file directly from your computer or import it from the Catalyst File Store. You can learn more about this in the Implementation section.

Target

The target is the column whose value needs to be predicted after the model is trained with the dataset. The value prediction is based on the data type of the target column.

You can only choose a numerical or a categorical type column as the target in AutoML. Zia cannot predict the values of a string or date type column, as they do not hold calculable data. You will learn about the data types of a column in the next part.

Attributes of a Column

Zia determines six attributes for every column in a dataset that is uploaded. Various algorithms calculate and determine the values of these attributes before you select a target.

The following attributes are determined for the columns in a dataset:

  1. Type
    This is determined for every column in the dataset. AutoML supports the following data types:
    • Numerical: A column with only numerical values in it is classified as Numerical.
    • String: A column with a set of numerical, alphabetical, or any other characters as values is classified as a String. Any column that contains mixed values of various data types is also classified as a String.
    • Date: A column with only date-time values in it is classified as a Date. AutoML supports the following date formats:

      Format                                   Example
      YYYY-MM-DD                               ‘2019-02-12’
      YYYY/MM/DD                               ‘2008/07/28’
      YYYY/MM/DD hh:mm:ss                      ‘2011/03/17 23:58:30’
      DD-MM-YYYY                               ‘03-09-2016’
      DD/MM/YYYY                               ‘22/11/2018’
      DD-Month-YYYY                            ‘13-January-2012’
      YYYY-MM-DDThh:mm:ss.sTZD                 ‘2019-11-28T05:19:31.665523+00:00’
      YYYY.MM.DD                               ‘2020.01.24’
      Unix timestamp string in seconds         ‘1574918464’
      Unix timestamp string in milliseconds    ‘1574918464000’
      Unix timestamp string in microseconds    ‘1574918464000000’

    • Categorical: A column with a limited number of distinct values in it is classified as Categorical. There are two types of Categorical columns:
      • Binary-class: A binary-class column contains only two distinct values across all the records. For example, columns with values such as Yes/No or Win/Lose.
      • Multi-class: A multi-class column contains three or more, but a limited number of, distinct values across all the records. For example, a column that lists the states in a country, or a column that lists the graduate programs available in a university.

    The following table shows which columns can be used as the target or for training the model, based on their data types:

    Data Type                                    Target   Training
    Numerical                                    Yes      Yes
    String                                       No       No
    Date                                         No       Yes
    Categorical (both binary- and multi-class)   Yes      Yes

  2. Missing (in %)
    This represents the percentage of missing values in a column in the dataset. For example, in a dataset that contains 20 records, if the values of a column are empty for 10 records, the missing amount of data is 50%.

  3. Distinct Values
    This represents the number of distinct entries in the values of a column in the dataset. For example, if a column’s values contain only ‘Yes’, ‘No’, and ‘Maybe’ for all the records, the number of distinct values is three and the column is classified as the Multi-Class Categorical type.

  4. Mean
    This represents the mean of all values in the column. This is determined only for Numerical type columns.

  5. SD
    This represents the standard deviation of all values in the column. This is determined only for Numerical type columns.

  6. Correlation with Target
    This represents the correlation of a column with the target, ranging from 0 to 1, where 0 indicates no correlation and 1 indicates perfect correlation. The correlation of a column with the target is determined by the patterns observed in the column’s values with reference to the values in the target column.

    For example, suppose the target of a model is a column reporting the number of common flu cases. Another column depicting the months of the year will have a high correlation with the target, because the number of flu cases is generally higher during the winter months. Correlation is determined for every column in the dataset, except for String type columns.

The following table depicts how the various attributes are determined for columns, based on the data types:

Data Type                                    Missing   Distinct   Mean   SD    Correlation with Target
Numerical                                    Yes       Yes        Yes    Yes   Yes
String                                       Yes       Yes        No     No    No
Date                                         Yes       Yes        No     No    Yes
Categorical (both binary- and multi-class)   Yes       Yes        No     No    Yes
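
The Missing, Distinct Values, Mean, and SD attributes described above can be illustrated with a short Python sketch. This is not AutoML's implementation: the function name `column_attributes` is hypothetical, treating `None` and empty strings as missing is an assumption, and the sketch uses the population standard deviation, which may differ from what AutoML reports.

```python
import math


def column_attributes(values):
    """Summarise one column of a dataset (illustrative sketch only)."""
    if not values:
        raise ValueError("the column has no records")

    # Assumption: None and empty strings count as missing values.
    present = [v for v in values if v is not None and v != ""]

    attrs = {
        "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
        "distinct": len(set(present)),
    }

    # Mean and SD are only computed for purely numerical columns.
    if present and all(isinstance(v, (int, float)) for v in present):
        mean = sum(present) / len(present)
        variance = sum((v - mean) ** 2 for v in present) / len(present)
        attrs["mean"] = mean
        attrs["sd"] = math.sqrt(variance)  # population SD (assumption)
    return attrs


# 20 records, 10 of them empty, as in the Missing example above.
numeric_column = [10, 20] * 5 + [None] * 10
print(column_attributes(numeric_column))

# A column with three distinct values gets no mean or SD.
print(column_attributes(["Yes", "No", "Maybe", "Yes"]))
```

For the first column the sketch reports 50% missing data, matching the worked example in the Missing attribute.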

Input Feature Selection

AutoML allows you to select the columns to be used for training the model. This is based on a machine learning concept known as feature selection, which is the process of selecting a subset of relevant features to use to build a model. You can select the features that you think will contribute most to your prediction variable.

The columns that you select for training have a high impact on the accuracy of a model’s prediction. The accuracy is calculated and determined for the binary-class and multi-class classification models. You will learn about these in the next part.

It is a good practice to exclude the columns that are irrelevant or that have low correlations to the target, as they will affect the model’s learning by providing unnecessary patterns. You can also exclude columns based on the percentage of missing data in them, since columns with a high number of missing values can alter the accuracy of the model’s prediction.

A String type column cannot be used for training a model, as shown in the table earlier. This is because the String type does not contain quantifiable or calculable data.

Model Types

After you select a target for a model, it is classified into one of the following three types based on the data type of the target column you selected:

  • Regression: If the target column of a model is of the numerical type, then the model is classified as a regression model. This model predicts a numerical value.
  • Binary-Class Classification: If the target column of a model is of the binary-class categorical type, then the model is classified as a binary-class classification model. This model predicts a binary or a Boolean outcome.
  • Multi-Class Classification: If the target column of a model is of the multi-class categorical type, then the model is classified as a multi-class classification model. This model predicts one class from three or more discrete classes.

You can see a model’s type in its evaluation report.
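
The mapping from the target column's data type to the model type can be written as a small lookup. The following Python sketch is only an illustration of the rules listed above. The function name `model_type` and the string labels it accepts are hypothetical and are not part of AutoML.

```python
def model_type(target_type, distinct_values=None):
    """Return the model type implied by the target column (sketch)."""
    if target_type == "numerical":
        return "regression"
    if target_type == "categorical" and distinct_values is not None:
        if distinct_values == 2:
            return "binary-class classification"
        if distinct_values >= 3:
            return "multi-class classification"
    # String and Date columns cannot be chosen as the target.
    raise ValueError(f"unsupported target: {target_type!r}")


print(model_type("numerical"))
print(model_type("categorical", distinct_values=2))
print(model_type("categorical", distinct_values=5))
```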

Training a Model

AutoML runs machine learning algorithms to identify patterns, draw inferences, and build and train models by using 80% of the dataset that you provide. AutoML then uses the remaining 20% of the dataset to validate the model it has built. This entire process happens while the model training is in progress.
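
The 80/20 division can be pictured with the following Python sketch. The source does not say how AutoML chooses which records go into each part, so the shuffling, the fixed seed, and the function name `split_dataset` are assumptions made purely for illustration.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=0):
    """Split rows into a training part and a validation part (sketch)."""
    shuffled = list(rows)
    # Assumption: records are shuffled before splitting.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))
train_rows, validation_rows = split_dataset(rows)
print(len(train_rows), len(validation_rows))  # 80 20
```

Keeping the validation records out of training is what lets the statistics in the evaluation report reflect how the model behaves on data it has not seen.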

After a model is trained, AutoML provides various statistics that were calculated during the training process in the model’s evaluation report. The information provided differs based on the model type.

Evaluation Report for Binary-Class and Multi-Class Classification Models

AutoML provides percentage values for the following attributes of a binary-class classification model in the form of a confusion matrix:

  • True Positive (TP): A true positive is an outcome where the model correctly predicts the positive class.
  • True Negative (TN): A true negative is an outcome where the model correctly predicts the negative class.
  • False Positive (FP): A false positive is an outcome where the model incorrectly predicts the positive class.
  • False Negative (FN): A false negative is an outcome where the model incorrectly predicts the negative class.

The confusion matrix is a 2 x 2 matrix where the columns represent the predicted class and the rows represent the actual class.

                Predicted False    Predicted True
Actual False    TN                 FP
Actual True     FN                 TP

The positive class and negative class are characteristics of a binary-class classification where each class lies on either side of a boundary. For example, in a case where there are only two possible values for a column, Domestic and International, Domestic is assigned to the positive class when the classifier is looking for “Domestic” results. Every value that is not Domestic, which here means International, is assigned to the negative class.

The confusion matrix helps you understand the instances of misclassification, or wrongly assigning a value to a category, that occurred during the model’s training.

Note: AutoML only provides the confusion matrix for binary-class classification models, and not for the multi-class classification models.
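
As an illustration of how the four cells of the confusion matrix are counted, the following Python sketch compares actual and predicted values for the Domestic/International example. The function name `confusion_counts` and the sample data are hypothetical. AutoML reports these values as percentages, whereas the sketch returns raw counts.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, TN, FP, and FN for a binary-class target (sketch)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # The model predicted the positive class.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # The model predicted the negative class.
            counts["FN" if a == positive else "TN"] += 1
    return counts


actual = ["Domestic", "Domestic", "International", "International", "Domestic"]
predicted = ["Domestic", "International", "International", "Domestic", "Domestic"]
print(confusion_counts(actual, predicted, positive="Domestic"))
# {'TP': 2, 'TN': 1, 'FP': 1, 'FN': 1}
```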

The following information is provided for both binary-class and multi-class classification models in their evaluation reports:

  1. Accuracy
    The accuracy is the fraction of total predictions made by the model on the test data that were correct, as a percentage value.

    Accuracy = Number of Correct predictions/Number of Total predictions

    For a binary-class classification model, the accuracy can also be calculated as:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

    As discussed earlier, you can improve the accuracy of a model’s prediction by excluding irrelevant columns or columns with a high amount of missing data during the input feature selection. You can also improve it by ensuring that you provide correct and valid data.

  2. Precision
    The precision is the fraction of total positive predictions made by the model on the test data that were correct.

    Precision = TP / (TP+FP)

    The precision indicates how right a model’s positive prediction is.

  3. Recall
    The recall is the fraction of the true positive predictions made by the model, out of all true positives and false negatives.

    Recall = TP / (TP+FN)

    This is used to select the best model when there is a high cost associated with the false negatives. The recall is also known as the True Positive Rate.

  4. F1 Score
    The F1 score is the harmonic mean of the precision and recall.

    F1 score = 2 x (Precision x Recall) / (Precision + Recall)

    The F1 score is a useful metric if you are looking for a balance between precision and recall.

  5. Log Loss
    The log loss measures the uncertainty of a model’s prediction. A small log loss value indicates low uncertainty. Therefore, a high log loss value is not desirable.
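
The formulas above can be checked with a short Python sketch that works from the confusion matrix counts. The function names are hypothetical. The source does not give a formula for log loss, so `binary_log_loss` uses the standard binary cross-entropy definition as an assumption; AutoML's exact calculation may differ. The sketch returns fractions between 0 and 1, which can be multiplied by 100 to get percentages.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion counts (sketch)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def binary_log_loss(actual, predicted_prob, eps=1e-15):
    """Binary cross-entropy; actual values are 0 or 1 (assumed formula)."""
    total = 0.0
    for y, p in zip(actual, predicted_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(actual)


print(classification_metrics(tp=2, tn=1, fp=1, fn=1))

# Confident, correct probabilities give a lower log loss than hesitant ones.
print(binary_log_loss([1, 0], [0.9, 0.1]))
print(binary_log_loss([1, 0], [0.6, 0.4]))
```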

Evaluation Report for Regression Models

The statistics discussed in the previous part do not apply to regression models. AutoML provides the following statistics in a regression model’s evaluation report:

  1. Mean Absolute Error (MAE)
    The Mean Absolute Error is the average absolute difference between the target values and the predicted values. This metric ranges from zero to infinity, where a lower value indicates a higher quality model.
  2. Mean Squared Error (MSE)
    The Mean Squared Error is the average of the squared differences between the target values and the predicted values.
  3. Root Mean Squared Error (RMSE)
    The Root Mean Squared Error is the square root of the mean squared error.
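
The three regression statistics are closely related, as the following Python sketch shows. The function name `regression_errors` and the sample values are hypothetical and only illustrate the definitions above.

```python
import math


def regression_errors(actual, predicted):
    """Compute MAE, MSE, and RMSE for paired values (sketch)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
print(regression_errors(actual, predicted))
# MAE is 1.0 and MSE is 1.5 for these values; RMSE is the square root of 1.5.
```

Because the MSE squares each difference, a single large error raises it far more than it raises the MAE. The RMSE brings the result back to the same unit as the target column.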

Last Updated 2023-05-09 17:03:08 +0530