Create a data pipeline

Now that we have uploaded the dataset, we will proceed with creating a data pipeline with the dataset.

  1. Navigate to the Datasets component in the left menu and click the Zoho_CRM_Deal_Prediction_Sample dataset. All Datasets

  2. The dataset Details page will be displayed. Click on Create Pipeline in the top-right corner of the page. Create Pipeline Option

  3. Name the pipeline “Deal Prediction Data Pipeline” and click Create Pipeline. Pipeline Name

The pipeline builder interface will be opened as shown in the screenshot below.

Initial Pipeline

We will be performing the following set of data pre-processing operations in order to clean, refine, and transform the datasets, and then execute the data pipeline. Each of these operations involve individual data notes that are used to construct the pipeline.

Select fields for data pre-processing

First, we will select the required fields in the dataset to modify them further.

  1. Expand the Data Cleaning component in the Operations menu. Drag and drop the Select/Drop node in the pipeline builder and make a connection with the Source node. Select/drop

  2. In the Select/Drop section in the right panel, select the columns “Deal ID,” “Deal Name,” “Closing Date,” “Created Time,” and “Modified Time”, and choose the operation “Drop” to drop the columns from the dataset, then click Save. In our case, these columns are generic, serving no purpose to be used further so we are removing them.

Handle missing values

To enhance the quality of the data used for training, we will filter out the non-empty data for the required columns using the Filter node. This process eliminates irrelevant or incomplete data, ensuring only valuable information is used for model development.

Data Filter

Since the Lead Source is one of the key columns for our model training, we are adding a filter to that column to avoid empty cells in it. If you want to process the unmatched data from the filter, choose show unmatched records as a secondary output if you want to get another output for unmatched data.

Fill columns

As a part of data pre-processing, we will need to check if there are missing values in any of the columns in the datasets and fill them. We will be using the Fill Columns node for executing this operation.

  1. Expand the Data Cleaning component in the Operations menu. Drag and drop the Fill Columns node into the pipeline builder and make a connection with the previous Filter node as shown in the screenshot below. Fill Column 1

  2. From the dropdown named Select Column, choose “Type”. In the Fill with field, choose “Custom Value”. Update the Value field as “Not mentioned”, and select the criteria as “Type”, and “Is empty” in the dropdown, then click Save. This fills the empty values as “Not mentioned” in the Type column. Fill Column 2

As of now, our dataset is prepared and we have configured the required nodes for this tutorial. Finally, make a connection between the last configured node Fill Columns and the Destination node.

Click Execute.

Completed data pipeline

The data pipeline will start the execution and the status of the execution will be displayed on the pipeline Details page as shown in the screenshot below. Once the pipeline has completed the execution, the execution status will display “Success”.

Executed data pipeline

Click on Execution Stats to view more details about each stage of the execution in detail.

Execution stats for data pipeline

Now, we have prepared our dataset which can be used to develop the ML model. We will be discussing more about the ML pipeline creation in the next section.

Note : The data pipeline can be reused to create multiple ML experiments for varied use cases within your Catalyst project.

Last Updated 2024-04-05 23:41:34 +0530 +0530