Data Transformation

Data transformation is the process of converting data from one format or structure into another format or structure.

  1. Date time Transformation

    This is used to extract date-time features, such as date, year, and month, from a column that contains dates.

    Usecase 1: A retail company wants to track the sales of their products over time. DateTime transformation helps them to extract the date, month, and year from the list of sales transactions.

    Usecase 2: A transportation company can use DateTime Transformation to extract the date, time, and location from a list of past deliveries to optimize their delivery routes.

    Sample input:

    2021-11-29 11:52:59

    Sample output:

    dt_day_of_week 1
    dt_date_no 29
    dt_year_no 2021
    dt_month_no 11
    dt_business_day 1
    dt_week_no_year 48
    dt_day_of_year 333
    dt_AM_PM AM
    dt_quarter_year 4
  2. Email Transformation

    This is used to extract features like username, domain, and suffix from a column that contains emails.

    Usecase 1: A sales team can use Email Transformation to extract the username and domain name from a list of email addresses to personalize their email outreach to potential customers.

    Usecase 2: To identify phishing emails, a security team need features such as the domain name and suffix from a list of email addresses that have been flagged as suspicious

    Sample input:


    Sample output:

    mail_first mail_middle mail_last
    abc zylker com
  3. Extract Data

    Extract data component utilizes regex patterns to retrieve information from text columns. These patterns may encompass various date formats, emails, or numerical values embedded within the text columns

    In the car price prediction dataset, to extract the Year of manufacture of the car from the name column, we use the POSIX regex pattern /d{4} applied on the same column.

    Sample POSIX Regex patterns:

    • Date format: \d{4}-\d{2}-\d{2}
    • Email: [a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+
    • Number: \d+
  4. Format

    This is used to modify the column values in dataset based on the function chosen by the user. The Function will be displayed based on the data types of the column.

    In patient dataset, formatting age column using the absolute value function. The age column will no longer have any negative values in it.

  5. Group By

    This is used to partition rows into groups based on their values in one or several columns. It is mandatory to give one aggregate function for a column. Users can also able to set the having criteria for the dataset.

    In the employee salary dataset, users can be able to sum the salary given to department by group the department wise.

  6. Hash Generator

    This is used to hash one or several columns in the dataset. Users are also able to drop the source column once the hash is generated.

  7. Join

    This is used to join two datasets into single dataset based on the primary column in both tables. User can able to perform Left, Right, Inner and Outer joins.
    Note : This stage requires selecting a dataset node.


  8. Merge Column

    This is used to merge two or more columns into a single column using a separator specified by the user. User can also able to drop the source column after performing the merge operation.

    In passenger dataset, We can combine first_name, middle_nam,e and last_name as Name column using a space separator.

  9. Normalization

    Normalization is a statistical method that helps scale and compare variables on different scales or units. This node is used to normalize two or more columns.

  10. Outliner Handler

    Outliers are the data points that falls outside of the normal distribution of the data, which can distort the results of data analysis and leads to incorrect conclusions. Outlier Handler can be used to remove outliers from a dataset or replace them with more reasonable values, such as the cap, mean, or median values of the respective columns in the data.

  11. Sort

    This is used to sort the dataset by set of columns. Users are also able to choose the sorting order.

  12. Split Column

    This is used to split a single column into two or more columns based on the separator specified by the user. The number of output columns will be determined based on the output names given by the user.

    In passenger dataset, User can split Name column into three columns by specifying first_name, middle_name, and last_name in the output column text box.

  13. String Transformation

    Used to produce word embeddings from columns that contain text.

    Usecase 1: To provide recommended content to its users, a social media company can utilize the String Transformation to get the word embeddings from the text of the posts that users have liked and shared.

    Usecase 2: From the text of the tickets, a customer support team needs to classify customer support tickets by identifying common issues that their customers are experiencing by using word embeddings produced by string transformation.

  14. Type Conversion

    This is used to convert type of column into data type specified by the user. Users are able to handle the resultant dataset if an error occurs by using the On Error option in configuration. revert will throw an exception if an error occurs, while ignore will return the original object if an error occurs.

  15. URL Transformation

    This is used to extract features such as sub_domain, domain, and suffix from a column that contains ULs.

    Usecase 1: A marketing team needs to determine which websites are most frequently visited by their target market. They can utilize URL Transformation to extract the domain names from a list of URLs that users of their website have clicked.

    Usecase 2: A fraud detection team can use URL transformation to find out the list of domains, subdomains, and suffixes of malicious websites from a list of URLs that has been flagged as suspicious.

    Sample input:


    Sample output:

    link_url_protocol https
    link_url_path /library
    link_url_query fetch=query
    link_url_fragment fragment_part
    link_domain_tld in
    link_domain_country IN
    subdomain www
  16. Union

    This is used to combine two datasets into a single dataset. Users can able to drop duplicate rows after combining two datasets.
    Note: This stage requires selecting a dataset node.

  17. Windowing

    This is used to perform window average operation in number columns. User should need to specify the windowing range which should be greater than 0.Moving averages are a simple and common type of smoothing used in time series analysis and time series forecasting. This can be used for time series dataset.

Last Updated 2023-11-30 17:09:44 +0530 +0530