# QuickML

## Getting Started

--------------------------------------------------------------------------------
title: "Catalyst QuickML"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.559Z"
source: "https://docs.catalyst.zoho.com/en/quickml/getting-started/introduction/"
service: "QuickML"
--------------------------------------------------------------------------------

# Catalyst QuickML

### Introduction

QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines that deliver end-to-end machine learning solutions. QuickML helps you execute sequences of stages covering a wide range of data processing and machine learning subtasks needed to efficiently build, test, deploy, and monitor effective ML models for different business requirements.

Catalyst, as a whole, serves as a complete end-to-end development platform that provides you with services, components, and tools to build, code, test, deploy, and monitor web applications, mobile applications, and microservices. Catalyst offers a host of {{%link href="/en/cloud-scale/getting-started/introduction/" %}}backend services{{%/link%}}, {{%link href="/en/serverless/help/functions/introduction/" %}}FaaS components{{%/link%}}, {{%link href="/en/devops/getting-started/introduction/" %}}DevOps tools{{%/link%}}, powerful AI and ML microservices, and more. The setup and management of the underlying server resources utilized by these applications are handled entirely by Catalyst, completely eliminating infrastructure maintenance and its costs from your end. Catalyst services can either be used independently or be integrated with one another to build highly functional, robust applications and microservices.

You can access QuickML from the highly integrated Catalyst console, from where you can {{%link href="/en/getting-started/catalyst-projects/" %}}create a project and get started{{%/link%}}. You can {{%link href="/en/getting-started/set-up-a-catalyst-project/introduction/" %}}set up your Catalyst project{{%/link%}}, tailor it to your requirements, and access all the other services and components of Catalyst from the console as well.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} QuickML is currently not available to users accessing from the CA (Canada), JP (Japan), or SA (Saudi Arabia) data centers.{{%/note%}}

### What is a Machine Learning Pipeline?

A machine learning pipeline is a series of sequential steps, where each step is a particular process implemented by placing the respective components to develop, deploy, and monitor the machine-learning model. These sequential steps cover end-to-end processes like data collection, data validation, preprocessing, model training, analysis, and deployment.

When it comes to machine learning, iteration is the key to building an effective model with high accuracy. Breaking down complex solutions into smaller components is important, as smaller components are easier and quicker to iterate on. Having smaller components dedicated to individual roles also makes them easier to replace. The monolithic approach is not practically scalable in the long run: when tweaking a particular area or context, the whole flow needs to be re-processed, as there is no segmentation available.
Going through each step of a machine-learning pipeline manually is costly, time-consuming, and error-prone. Automated machine-learning pipelines help data scientists focus on new models without having to manually keep previously developed models up to date by preprocessing data or running deployment scripts. In the early days, machine-learning models themselves were the primary product in the market. However, now that machine learning is delivered as a service, the product in demand is actually the workflow defined by connecting components to achieve the desired machine-learning solution (i.e., the machine-learning pipeline).

--------------------------------------------------------------------------------
title: "Why QuickML?"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.559Z"
source: "https://docs.catalyst.zoho.com/en/quickml/getting-started/why-quickml/"
service: "QuickML"
--------------------------------------------------------------------------------

### Why QuickML?

In machine-learning solutions, there is a gap between the data scientists working on the different stages of machine learning and the developers working on the models. QuickML focuses on giving data scientists end-to-end control over the whole pipeline, making iterations much easier for improving accuracy.

Creating a machine-learning model that yields better predictions to help improve a business is an iterative process which involves dealing with many development and production difficulties, such as:

- Data handling
- Resource handling
- Model management strategies
- Model monitoring
- And other operational challenges

QuickML promises to be the service that tackles all these difficulties and takes advantage of the data available over the cloud, with zero operational overhead. QuickML helps in creating these data and machine-learning-based systems effectively by providing a no-code platform for developers, data analysts, data scientists, and others to get the most out of the data with minimal effort. It serves as a platform for developing, maintaining, and producing machine learning models.

We have categorized the activities of QuickML into two main modules:

- QuickML Machine Learning Pipelines
- QuickML Data Pipelines

### QuickML Machine Learning Pipelines

Machine-learning pipelines are the end-to-end execution of workflows for data and machine learning tasks, designed to orchestrate fully trained and accurate machine learning models that provide predictive intelligence for a wide range of business requirements. QuickML has a unique no-code pipeline builder platform in which machine learning pipelines are designed and executed. It focuses on easing the machine learning operations involved in developing machine-learning models that are ready for production.

![QuickML Machine Learning Pipelines](/images/help/quickml/introduction/ML_pipeline.webp)

Users can build a machine learning model or improve the quality of the data by using this pipeline builder interface. The interface provides a simple drag-and-drop UI for constructing the stages of a pipeline. Each of these stages can be further configured, with output previews in the platform, based on the requirements.
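For readers who think in code, the stage-chaining concept that QuickML's drag-and-drop builder performs visually is the same one that code-first libraries express programmatically. The following is a purely illustrative sketch using scikit-learn's Pipeline; none of these component names come from QuickML itself:

{{%code class="language-python"%}}# Illustrative only: a code-first analogue of the stage chaining that
# QuickML's drag-and-drop builder performs visually.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each stage is a small, replaceable component; swapping one stage
# does not require rewriting the rest of the flow.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),          # data cleaning stage
    ("scale", StandardScaler()),                         # transformation stage
    ("model", RandomForestClassifier(random_state=42)),  # training stage
])

pipeline.fit(X_train, y_train)
print("Hold-out accuracy:", pipeline.score(X_test, y_test)){{%/code%}}

Because each stage is isolated, replacing the scaler or the model is a one-line change, which is exactly the iteration benefit described above.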
QuickML specializes in providing a good variety of basic ML algorithms and artificial intelligence features integrated as atomic stages in the pipeline building flow. A pipeline execution flow can contain various combinations of data operations, machine-learning tasks, and algorithms, producing differently focused feature sets and resilient machine learning models from business data.

### QuickML Data Pipelines

Data pipelines are part of the **Machine-Learning Model Lifecycle**, which often requires various manipulations to be applied to the data before passing it to any machine-learning training process. QuickML data pipelines can be treated as independent data pipelines as well as part of ML pipelines, based on requirements. Data can be imported from other Zoho services, from external services like AWS S3/GCloud, or from the local file system. Once the data is imported into the system, users will have the basic details about the data, such as the data quality.

--------------------------------------------------------------------------------
title: "Use Cases"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.559Z"
source: "https://docs.catalyst.zoho.com/en/quickml/getting-started/use-cases/"
service: "QuickML"
--------------------------------------------------------------------------------

### Sales forecasting

Accurate sales forecasting is crucial for businesses to make informed decisions about inventory management, resource allocation, and overall business strategy. ML classification algorithms can help businesses forecast sales accurately by analyzing historical sales data, customer data, and other relevant variables.

### Sentiment Analysis

Sentiment analysis involves analyzing text data such as customer reviews, social media posts, and other sources of feedback to determine the sentiment towards a product or service. By using ML classification algorithms, businesses can gain insights into customer preferences, identify areas for improvement, and make informed decisions about their marketing and product development strategies.

Sentiment analysis can provide businesses with a better understanding of customer preferences and opinions. By analyzing the sentiment of customer reviews and social media posts, businesses can identify areas where they need to improve and take proactive steps to address customer concerns. It can also be used to monitor brand reputation and respond to negative feedback in real time.

### Fraud detection

Fraud detection is one of the most common business use cases that require ML classification algorithms as a solution. With the increase in online financial transactions, fraudulent activities have also increased significantly. Fraudulent activities such as identity theft, credit card fraud, and online banking fraud are some of the most common types of fraud that businesses need to be aware of.

ML classification algorithms can be used to analyze large volumes of financial transaction data and identify patterns that are indicative of fraudulent activities. This is achieved by training the ML model on past transaction data and identifying the common features present in fraudulent transactions. The model can then be used to predict the likelihood of a new transaction being fraudulent by analyzing its features and comparing them to those of past transactions.
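As a purely illustrative sketch of this approach (synthetic data and scikit-learn stand in for real transaction records and for whatever algorithm a QuickML pipeline would actually select):

{{%code class="language-python"%}}# Illustrative sketch: train a classifier on past transactions, then
# score a new one. Synthetic data stands in for real transaction records.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000
# Hypothetical features: amount, hour of day, merchant risk score
X = np.column_stack([
    rng.lognormal(3.0, 1.0, n),  # transaction amount
    rng.integers(0, 24, n),      # hour of day
    rng.random(n),               # merchant risk score
])
# Label a small fraction as fraudulent, correlated with the risk score
y = (X[:, 2] + rng.normal(0, 0.2, n) > 1.1).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# Likelihood that a new transaction is fraudulent
new_txn = np.array([[250.0, 3, 0.95]])
print("Fraud probability:", model.predict_proba(new_txn)[0, 1]){{%/code%}}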
### Churn Prediction

Churn refers to the rate at which customers or subscribers stop using a product or service over a given period of time. It is often used to measure customer attrition or loss, and it is an important metric for businesses to understand because it directly affects their revenue and growth. High churn rates can be a sign of problems with a business's products, customer service, or pricing, and can lead to decreased revenue and profitability over time.

To manage churn, businesses often focus on improving customer satisfaction and loyalty, addressing any issues that customers may have, and providing excellent customer service. They may also use data analytics and customer feedback to identify the reasons why customers are leaving and develop strategies to address these issues. Additionally, businesses may offer incentives or promotions to encourage customers to stay, such as loyalty programs or discounts. ML classification algorithms can be used to analyze user data and identify the customers who are about to churn.

### Inventory Forecasting

Inventory forecasting is the practice of using past data, trends, and known upcoming events to predict needed inventory levels for a future period. Inventory forecasting is crucial for businesses to ensure that they have sufficient stock to meet customer demand while avoiding overstocking, which can tie up capital and lead to wastage. ML algorithms can be used to analyze sales data and correctly identify the quantity of each product that needs to be restocked.

### Price Prediction

Price prediction aims to strike a balance between maximizing revenue and maintaining competitive pricing to attract customers. The goal is to set prices that will generate the highest possible profits while considering factors such as elasticity of demand, production costs, customer preferences, and market trends. Businesses collect and analyze data on customer behavior, market dynamics, historical sales, and competitor pricing. ML algorithms are then used to identify the optimal price for the products.

### Spam detection

Spam detection aims to differentiate between legitimate messages (ham) and unsolicited, irrelevant, or malicious messages (spam) that may contain advertisements, scams, phishing attempts, or other undesirable content. The goal is to automatically classify incoming messages as either spam or ham, allowing users to focus on genuine and important communications while minimizing the impact of spam. The labeled data is collected and analyzed. ML algorithms are then used to learn the patterns in the data to correctly identify spam.

## Help Components

--------------------------------------------------------------------------------
title: "Data Profiler and Viewer"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.560Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-profiler-and-viewer/"
service: "QuickML"
--------------------------------------------------------------------------------

# Data Profiler and Viewer

### What is Data Profiling?

Data profiling is the process of reviewing raw data; understanding its structure, content, and interrelationships; and identifying opportunities for analytical insights. It is crucial when performing data preprocessing, performing data visualization, and improving the data quality score.
It also helps to improve the analysis by providing overall intelligence about the data being used.

Examples of how data profiling can be used:

- **Data preprocessing:** To identify missing values, unique and duplicate values, and inconsistent data formats, which helps to clean up and prepare the data for analysis.
- **Data visualization:** Data profiling helps us to understand the actual distribution of data and identify any outliers, so that we can create more accurate and informative data visualizations.
- **Improving data quality score:** By identifying and fixing data quality issues, data profiling helps to improve the reliability of the data for further analysis.

Data profiling involves an assortment of activities, including:

- Collecting statistical details like min, max, mean, and more for numeric columns
- Gathering data types and identifying recurring patterns in data
- Finding redundancy and assessing the quality of data
- Performing inter-table analysis of tabular data

### Data Profiling in QuickML

The uploaded data in the QuickML dataset module automatically goes through the data profiling section, which provides a rich understanding of the data and valuable insights into it.

1. **Records count, unique values, and percentages:** QuickML can find the number of unique and duplicate values, and their respective percentages of the total, in each column of your data, which can be used to efficiently update the rows and columns.
2. **Data type and visualization:** The QuickML data profiler accurately identifies the data type of each feature in a dataset and visualizes the distribution to give clear insights about the data.
3. **Missing values:** The QuickML data profiler can fetch the count and percentage of missing values, such as blank and null values, and helps data scientists set up appropriate values.
4. **Statistical details:** The QuickML data profiler generates the following statistical data for numeric and categorical columns:

<table class="content-table" style="width:800px;">
<tbody>
<tr>
<th style="text-align:center">Numeric columns</th>
<td style="text-align:center">Sum, Min, Max, Mean, Median, Standard Deviation, Variance & low, medium, high percentiles</td>
</tr>
<tr>
<th style="text-align:center">Categorical columns</th>
<td style="text-align:center">Unique, Duplicates</td>
</tr>
</tbody>
</table>

5. **Dataset versioning & quality score:** QuickML can profile multiple versions of the same data and generate quality scores for each version. The data profile can be viewed at any time using the version option on the dataset details page.
6. **Correlation heatmap:** The data profiler generates a correlation heatmap for all versions of the dataset, which visualizes the interrelationships of the features and the extent to which they are related to each other and to the target feature.

<br />

### Data Viewer in QuickML

A data preview is generated on the dataset details page by sampling 250 records from the original dataset, which helps to identify the data types of each feature.

<br />

--------------------------------------------------------------------------------
title: "Zia Features"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.560Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/zia-features/"
service: "QuickML"
--------------------------------------------------------------------------------

# Zia Features

QuickML is integrated with Zia text analytics features that are indigenously developed at Zoho. These Zia capabilities can be used for a wide variety of tasks that involve understanding and processing text. Zia is a powerful tool that can help businesses improve their operations, make better decisions, and better understand their customers.

The key features in these categories are:

- **Zia Sentiment Analysis**
- **Zia Keyword Extraction**
- **Zia Language Detection**
- **Zia Emotion Detection**
- **Zia Intent Extraction**
- **Zia Activity Extraction**
- **Zia Commitment Classification**

#### Zia Sentiment Analysis:

Zia Sentiment Analysis is a part of text analytics that processes textual content to recognize the tone of the message and the sentiments it conveys. It analyzes each sentence to determine whether its tone is positive, negative, or neutral. It then determines the tone of the overall text as one of these three sentiments, based on the sentiments recognized in each sentence.

#### Zia Keyword Extraction:

Keyword extraction is a text analysis technique that involves extracting important and relevant terms from a piece of text, which provides an abstraction of the whole text. It works on the principles of text mining, information retrieval, and natural language processing, and its precision develops with more training on rich data sets. It uses simple statistical approaches like word frequencies and collocations, as well as advanced machine learning approaches.

#### Zia Language Detection:

In natural language processing, language identification or language detection is the technique of determining which natural language the given content is in. Computational approaches to this problem view it as a special case of text categorization. Language detection is a great use case for machine learning, more specifically for text classification. Given some text from an email, news article, output of speech-to-text capabilities, or anywhere else, a language detection model will tell you what language it is in.

#### Zia Emotion Detection:

Emotion detection is a subset of sentiment analysis, as it predicts a specific emotion rather than just stating positive, negative, or neutral. It is a technique for figuring out people's attitudes, emotions, and sentiments about a certain objective or thing. Zia's emotion predictor makes it easier to identify and analyze the emotions hidden in textual data. Happy, Enthusiasm, Discontentment, Frustration, Trust, Confusion, Gratitude, and Neutral are the 8 emotions that Zia's algorithm can anticipate with a good degree of accuracy. Given a sentence, Zia's emotion detection model will predict what kind of emotion is involved in that sentence.

#### Zia Intent Extraction:

The intent extraction model is a deep learning model using the DistilBERT Transformer architecture, trained on a large set of English sentences, that helps determine the kind of action being expressed in a phrase as well as all of its constituent elements. Intent detection is the process of analyzing text data to find the author's intention. Zia's intent extraction model can identify terms in a sentence that are relevant to complaints, requests, purchases, and queries.
#### Zia Activity Extraction:

Zia's activity extraction is a multi-class sequence classifier that recognizes activities such as an event, a call, or a task from a given sentence by utilizing a DistilBERT classifier.

#### Zia Commitment Classification:

Zia's commitment classification will predict and identify the commitment-related clauses in a phrase, such as the due date, the promise of something, and so on. Given any sentence, Zia's commitment classification model will detect and return the commitment statement.

--------------------------------------------------------------------------------
title: "Class Imbalance"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.560Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/class-imbalance/"
service: "QuickML"
--------------------------------------------------------------------------------

### What is Class Imbalance?

Class imbalance occurs when the number of samples in one class (called the majority class) significantly outnumbers those in another class (called the minority class) in classification problems. This imbalance can cause models to be biased towards predicting the majority class, resulting in poor detection of minority class cases, which are often the critical ones (e.g., detecting fraud or diagnosing rare diseases).

### Why is it Important?

If imbalance is not handled, models may:

- Achieve high overall accuracy, but fail to detect minority class cases
- Produce many false negatives (missed detections), which can be dangerous in fields like healthcare
- Have misleading evaluation metrics, because accuracy is dominated by the majority class
- Struggle to generalize well on new data

### Techniques to Handle Imbalance

Class imbalance handling refers to a set of techniques used to deal with classification problems where one class (the majority) has significantly more samples than another (the minority).

### Oversampling

Oversampling is used to address the issue of class imbalance in datasets, which is particularly common in binary classification problems where one class significantly outnumbers the other. In an imbalanced dataset, the majority class dominates, which can lead machine learning models to be biased toward predicting that class more often, ignoring the minority class entirely. This results in poor recall and precision for the minority class, which is often the class of interest.

**What does oversampling do?**

Oversampling increases the representation of the minority class by adding more samples, either by duplicating existing examples or generating new synthetic ones. The goal is to balance the class distribution so the model receives equal learning exposure to both classes.

**Common techniques:**

- **SMOTE (Synthetic Minority Over-sampling Technique)**: Creates synthetic minority class samples by interpolating between existing minority examples.
- **RandomOverSampler**: Simply duplicates minority class samples randomly.
- **BorderlineSMOTE**: Generates synthetic samples only near the decision boundary where classes overlap.
- **ADASYN**: Similar to SMOTE, but focuses more on creating samples for harder-to-learn minority cases.
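The following is a minimal sketch of how these techniques are applied in code, assuming the imbalanced-learn package (listed among the supported imports in the Custom Code section) is available; RandomUnderSampler, shown here for contrast, is covered in the undersampling section below:

{{%code class="language-python"%}}# Sketch using imbalanced-learn: balance a heavily skewed dataset by
# oversampling the minority class (SMOTE) or undersampling the majority
# class (RandomUnderSampler, covered in the next section).
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic stand-in for an imbalanced dataset: ~2% minority class
X, y = make_classification(n_samples=10200, n_features=10,
                           weights=[0.98, 0.02], random_state=42)
print("Original:", Counter(y))

# Oversampling: synthesize minority samples until the classes are balanced
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_over))

# Undersampling: drop majority samples until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("After undersampling:", Counter(y_under)){{%/code%}}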
**Example:** Suppose you have a fraud detection dataset:

<table class="content-table" style="width:400px;">
<thead>
<tr>
<th>Class</th>
<th>Number of Samples</th>
</tr>
</thead>
<tbody>
<tr>
<td>Legitimate</td>
<td>10,000</td>
</tr>
<tr>
<td>Fraud</td>
<td>200</td>
</tr>
</tbody>
</table>

Using **SMOTE**, you generate synthetic fraud cases to increase minority class samples from 200 to 10,000. This balanced dataset helps the model better learn fraud patterns, reducing missed fraud cases (false negatives) and improving detection rates.

### Undersampling

Undersampling is a technique used to handle class imbalance in machine learning datasets by reducing the size of the majority class so that it is comparable to the size of the minority class. This helps create a more balanced dataset, ensuring that the model pays equal attention to both classes during training.

In a typical imbalanced scenario, one class (usually the one we care less about) dominates the dataset. For example, in an email classification task, the number of "non-spam" emails might vastly outnumber the "spam" ones. Without balancing, a model might learn to predict only the majority class to optimize accuracy, while completely neglecting the minority class.

**What does undersampling do?**

Undersampling reduces the skewed distribution by randomly or strategically removing samples from the majority class, thus shrinking it to match or get closer to the size of the minority class. This forces the model to learn more equally from both classes, which can improve its performance on the minority class.

**Common techniques:**

- **RandomUnderSampler**: Randomly removes samples from the majority class.
- **TomekLinksUnderSampler**: Removes majority samples that are very close to minority samples (cleaning noisy overlaps).
- **EditedNNUnderSampler**: Removes majority samples misclassified by their nearest neighbors, reducing noise.
- **NearMissUnderSampler**: Keeps majority samples that are close to minority samples (focuses on difficult boundary cases).

**Example:** In a medical diagnosis dataset:

<table class="content-table" style="width:400px;">
<thead>
<tr>
<th>Class</th>
<th>Number of Samples</th>
</tr>
</thead>
<tbody>
<tr>
<td>Healthy</td>
<td>5,000</td>
</tr>
<tr>
<td>Disease</td>
<td>300</td>
</tr>
</tbody>
</table>

Using **RandomUnderSampler**, the healthy samples are reduced from 5,000 to 300 to match the disease samples. This helps the model avoid bias towards the healthy class and better detect the disease.

--------------------------------------------------------------------------------
title: "Periodic Sync"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.560Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/periodic-sync/"
service: "QuickML"
--------------------------------------------------------------------------------

# Periodic Sync

Periodic sync helps reduce the manual effort required to import data repeatedly from the same data connector at multiple intervals. By configuring sync, users can automate the data import process according to their chosen periodicity, minimizing the need for manual intervention. In QuickML, data sync can be configured using Sync Frequency, with various periodic interval options, including daily, weekly, monthly, yearly, and custom frequencies set in hours and minutes, from a minimum of 6 hours to a maximum of 23 hours and 59 minutes.
### Benefits of Using Periodic Sync in QuickML

- **Reduces Manual Interference:** Users do not need to be available during the import process.
- **Time-Saving:** Automating data import saves the time spent on configuring the data connectors each time data needs to be imported.
- **Consistency:** Ensures data is consistently updated within the QuickML platform, maintaining accuracy and reliability.
- **Better Model Performance:** Having the latest data available helps in periodically training the model; the latest data ensures better and more consistent model performance, resulting in reliable predictions.
- **Efficiency:** Reduces the risk of errors associated with manual data imports.
- **Flexible Interval Options:** QuickML offers a range of scheduling options to meet different data import and update needs.

### Where to configure the sync frequency

#### During data import:

You can set the periodic sync frequency during the data import from any of these data connectors: object storages, Zoho apps, and databases.

<br />

#### After the data import:

On the dataset details page, there is a sync frequency field that conveys the sync configuration chosen at the data import stage. However, the user can modify the configuration as needed using the drop-down options, as shown below.

<br />

#### Creating a data pipeline:

While creating a data pipeline for a periodic sync-enabled dataset, make sure to enable the Execute pipeline after periodic sync option. This ensures that after every sync, the data pipeline will rerun and produce the next version of the transformed data.

<br />

#### Creating an ML pipeline:

While creating an ML pipeline for a periodic sync-enabled dataset, make sure to enable the Retrain model when the dataset is updated option. This ensures that the model is automatically retrained whenever the dataset is updated.

<br />

--------------------------------------------------------------------------------
title: "Create Data Pipeline"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.560Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/create-data-pipeline/"
service: "QuickML"
--------------------------------------------------------------------------------

# Create your first Data Pipeline

Let's look at how to create a data pipeline using the QuickML platform.

<!-- **Resources:** Click {{%link href="https://workdrive.zohoexternal.com/external/153b694f054dabf868e5a93d22af4f97b8783975609b6f2ff2857f5f5865afda" %}}here{{%/link%}} to access some sample datasets to get started with. -->

## Creating a Data Pipeline using the pipeline builder interface

1. Go to the **Datasets** section.
2. Click {{%badge%}}Import Dataset{{%/badge%}} to upload a dataset to the QuickML platform using the available data connector options. <br />
3. Select the preferred data source to perform the import operation. <br /> After uploading the file from the given options, the data quality score will be displayed. <br />
4. To create a data pipeline, click {{%badge%}}Go to Data Cleaning{{%/badge%}}. The Create Pipeline popup will appear, asking you to provide a pipeline name; then click the {{%badge%}}Create Pipeline{{%/badge%}} button. <br />
5. You'll then be taken to the {{%link href="/en/quickml/help/pipeline-builder-interface/walkthrough/" %}}Pipeline Builder{{%/link%}} page, where you can perform the data preprocessing operations below simply by dragging and dropping components from the left pane and connecting them to the previous and subsequent components to create a pipeline.
   - {{%link href="/en/quickml/help/data-preprocessing/data-cleaning/" %}}Data Cleaning{{%/link%}}
   - {{%link href="/en/quickml/help/data-preprocessing/data-transformation/" %}}Data Transformation{{%/link%}}
   - {{%link href="/en/quickml/help/data-preprocessing/data-extraction/" %}}Data Extraction{{%/link%}}
   <br /> Components under each data preprocessing step in the left pane can be viewed by expanding them.
6. Every operation performed inside the builder can be seen applied to the sample data in the preview section. **Example:** Drag and drop the Select/Drop component under the Data Cleaning section, select the required columns, and select the Drop/Select operation. You will be able to confirm in the preview section that the columns have been dropped. <br />
7. In a similar manner, drag and drop all required components into the builder and connect the last component to the destination to finish building the pipeline. It will appear as follows: <br />
8. To run the pipeline, click the Execute button. Once the pipeline is executed, it will appear as follows: <br />
9. Clicking on the **car_price** dataset will take you to the car_prices dataset details page, where you can verify new versions and other relevant {{%link href="/en/quickml/help/data-profiler-and-viewer/" %}}dataset profile{{%/link%}} information. <br />

**Note:** QuickML uses Zoho IAM as its authentication manager. In order to use the service, the user must be signed in with a {{%link href="https://accounts.zoho.com/register?newsignup=true&servicename=AaaServer&serviceurl=https%3A%2F%2Faccounts.zoho.com%2Fhome%23profile%2Fpersonal" %}}Zoho account{{%/link%}}.

--------------------------------------------------------------------------------
title: "Create ML Pipeline"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.561Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/create-ml-pipeline/"
service: "QuickML"
--------------------------------------------------------------------------------

# Create your first ML Pipeline

Let's look at how to create an ML pipeline using the QuickML platform.

## Creating an ML Pipeline using the pipeline builder interface

1. Go to the Pipelines section.
2. Click {{%badge%}}Create-Pipeline{{%/badge%}} to create the ML pipeline in the QuickML platform using the uploaded dataset. <br />
3. Provide the ML pipeline name and model name, and choose the dataset with which you need to prepare the ML pipeline. In the Target column field, select the column for which you are going to perform the prediction, and click the Create Pipeline button. <br /> **Note:** When you create an ML pipeline from the same dataset as your data pipeline, the output data (the modified version of the dataset from the data pipeline) will be fetched automatically.
4. Once the pipeline builder interface is opened, you can drag and drop the required nodes to perform the ML operations. Example: In this picture, you can see the {{%badge%}}Ordinal Encoder{{%/badge%}} [node](/en/quickml/help/operations-in-quickml/encoding/#ordinal-encoder) has been connected to the source.
In the configuration, select all the columns except **year**, **mileage**, and **money**, as these are already in numeric form. <br /> Once the configuration is done, click {{%badge%}}Save{{%/badge%}} and proceed further.
5. Likewise, you can connect all the required nodes to perform the prediction and complete the pipeline, as shown below. <br />
6. Once the pipeline is complete, click {{%badge%}}Save{{%/badge%}}, and then click {{%badge%}}Execute{{%/badge%}}. <br /> Once the pipeline is successfully executed, you can see the execution status as Success, as shown above.
7. By clicking on the **model name** (Car Price Prediction ML Pipeline model), you will be redirected to the model details page, where you can see the model evaluation metrics data, as shown below. <br />

--------------------------------------------------------------------------------
title: "Create AutoML Pipeline"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.561Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/create-automl-pipeline/"
service: "QuickML"
--------------------------------------------------------------------------------

# AutoML Pipeline

### Steps to create an AutoML pipeline in QuickML

1. Navigate to the bottom-left of the page, click the +Create button, and select Pipeline.
2. The Create Pipeline page will appear. Fill in the details, then check the Create an Auto-generated pipeline using AutoML checkbox. <br />
3. Click Create Pipeline. This will create an automated pipeline with the necessary configuration, as shown below: <br />

The best-performing model will be finalized and generated with all the evaluation metrics, as shown below.

--------------------------------------------------------------------------------
title: "Custom Code"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.561Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/custom-code/"
service: "QuickML"
--------------------------------------------------------------------------------

### Custom code operations

{{%note%}}Custom code operations are available in {{%bold class="bold-primary"%}}early access{{%/bold%}} across all data centers. To use this feature, request access via support@zohocatalyst.com{{%/note%}}

The Custom Code operations in the QuickML pipeline allow developers to insert their own logic into the model training process. By implementing Python classes provided in templates, users can customize how data is transformed, how features are processed, and even define the machine learning algorithm used. This capability is divided into three distinct components:

- Custom Data Transformation
- Custom ML Transformation
- Custom Algorithm

Each component plays a unique role in the machine learning lifecycle and comes with predefined method signatures that must be implemented by the user.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} The implementation is mandatory for all methods in the template class.{{%/note%}}

### Custom Data Transformation

Custom data transformation is used to optimize the data by performing operations such as cleaning, transforming, and extracting raw data using custom logic.
It is particularly useful for preprocessing steps like column removal, format conversions, and scaling that must remain consistent across both the model training and prediction stages. The custom code handles complex requirements during data pre-processing and is implemented in the **transform()** method, which accepts and returns a DataFrame object and is executed during both training and inference.

In the sample code below, the custom data transformation node preprocesses data before it's used for model training or prediction. It removes the "Glucose" column from the dataset to ensure that irrelevant or potentially biased features do not influence the model's learning or predictions.

### Custom ML Transformation

In the ML model development life cycle, custom ML transformation operations are used during the pre-processing step to perform feature engineering tasks specifically tailored to the use case using custom logic. These transformations are ideal for handling operations such as missing value imputation, feature encoding, and normalization, where the logic must learn from the training data and consistently apply the same transformation during prediction. This is accomplished through a **fit()** method, which learns the necessary parameters from the training data, and a **transform()** method, which applies those learned parameters to preprocess new data. Both methods accept and return a DataFrame object.

In the sample code below, the custom ML transformation node handles missing value imputation as part of the data pre-processing step. During training, it computes the mean of each numeric column and stores it. Later, during prediction or inference, it fills missing values in those columns using the stored mean values, ensuring consistency between the training and prediction phases.

### Custom Algorithm

Custom algorithm operation refers to the application of custom algorithms tailored to specific business use cases. It allows users to plug in a custom ML model and define bespoke training, prediction, and evaluation logic, offering full control over which algorithm is used and how its performance is assessed. It can be seen as an enhancement of the standard pre-defined algorithms, enabling the solution of unique modeling problems using domain-specific rules.

The operation includes three key methods:

- **fit()**: executed during model training to learn from the data.
- **predict()**: executed during prediction to generate outputs.
- **get_evaluation_metrics()**: executed after training to compute and return the evaluation metrics.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} The metrics returned from get_evaluation_metrics() are displayed in the evaluation metrics section of the model details page, supporting customized model assessment workflows.{{%/note%}}

In the sample code below, the custom algorithm node builds an ML model using the LightGBM classifier.
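The platform's actual template class is not reproduced here; the following is a hedged sketch of what such an implementation could look like, with the class name and method signatures assumed from the description above:

{{%code class="language-python"%}}# Hedged sketch of a custom algorithm implementation; the real QuickML
# template class name and method signatures may differ. Assumes lightgbm
# and sklearn, which are among the supported imports (see the list below).
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

class CustomAlgorithm:
    def __init__(self):
        self.model = None
        self._X_test = None
        self._y_test = None

    def fit(self, df: pd.DataFrame, target_column: str):
        # Split the input dataset, train LightGBM on the training split,
        # and keep the held-out split for evaluation after training.
        X = df.drop(columns=[target_column])
        y = df[target_column]
        X_train, self._X_test, y_train, self._y_test = train_test_split(
            X, y, test_size=0.2, random_state=42)
        self.model = LGBMClassifier().fit(X_train, y_train)

    def predict(self, df: pd.DataFrame):
        # Executed during prediction to generate outputs for unseen data.
        return self.model.predict(df)

    def get_evaluation_metrics(self) -> dict:
        # Executed after training; the returned dictionary is what the
        # model details page surfaces as evaluation metrics.
        preds = self.model.predict(self._X_test)
        return {
            "accuracy": accuracy_score(self._y_test, preds),
            "precision": precision_score(self._y_test, preds, average="weighted"),
            "recall": recall_score(self._y_test, preds, average="weighted"),
            "f1_score": f1_score(self._y_test, preds, average="weighted"),
        }{{%/code%}}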
It splits the input dataset into training and testing sets, trains the model on the training data, and stores the trained model instance. During prediction, it uses the trained model to generate outputs for unseen data. After training, it evaluates the model using the accuracy, precision, recall, and F1-score metrics, which are computed on the test data and returned as a dictionary for performance analysis.

**List of Python libraries from which imports are supported:**

*numpy, scipy, pandas, xgboost, catboost, lightgbm, sklearn, tld, patsy, tensorflow, statsmodels, tldextract, huggingface_hub, sentence_transformers, imbalanced_learn, hyperopt, shap, lime, transformers, pmdarima, lightfm, LibRecommender, subseq*

--------------------------------------------------------------------------------
title: "Model Details"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.561Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/models-details/"
service: "QuickML"
--------------------------------------------------------------------------------

# Models

Once an ML pipeline is successfully executed, a respective ML model is created. This model view can be used to gain an understanding of the model's internal metrics. The list of models created can be viewed on the Models page, along with the status of each model, as seen below.

<br />

### Model Metrics

QuickML users have access to view the model metrics for each version, which provide valuable insights into the performance of the machine learning models. These metrics serve as essential indicators to assess the accuracy and effectiveness of the model in making predictions.

<br />

QuickML users have access to the following metrics.

### Confusion matrix

In machine learning, a confusion matrix is used to measure the performance of a classification model. In simple terms, a confusion matrix is a summary of the number of correct and incorrect predictions made by the machine learning model. The matrix displays the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), as shown below.

**TP:** True Positive is the count of instances where both the predicted and actual values are positive.

**TN:** True Negative is the count of instances where both the predicted and actual values are negative.

**FP:** False Positive is the count of instances that the model predicted as positive but whose actual values are negative.

**FN:** False Negative is the count of instances that the model predicted as negative but whose actual values are positive.

### Use case

Let's explain the confusion matrix with a use case: predicting an airline's passenger satisfaction with its flight service.

**Brief explanation:** An airline gathers information about its passengers, including their demographics, flight information, and survey responses regarding their satisfaction with the airline's service. The airline then applies this information to create a machine learning classification model that predicts whether a passenger is **satisfied** or **neutral/dissatisfied**.
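The evaluation walked through below can be reproduced in a few lines of code; this sketch simply recomputes the metrics from the confusion-matrix counts used in the example that follows:

{{%code class="language-python"%}}# Recomputing the metrics from the confusion-matrix counts in the
# airline example below (TP=2930, TN=2049, FP=35, FN=182).
tp, tn, fp, fn = 2930, 2049, 35, 182
total = tp + tn + fp + fn                           # 5,196 responses

accuracy = (tp + tn) / total                        # ~0.958
precision = tp / (tp + fp)                          # ~0.988
recall = tp / (tp + fn)                             # ~0.9415
f1 = 2 * precision * recall / (precision + recall)  # ~0.964

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}"){{%/code%}}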
Let's evaluate the performance of a classification model using a confusion matrix in QuickML, as shown below:

Values listed from the confusion matrix:

<table class="content-table" style="width:400px;">
<tbody>
<tr>
<th style="text-align:center">Total response count</th>
<td style="text-align:center">5,196</td>
</tr>
<tr>
<th style="text-align:center">True Positive (TP)</th>
<td style="text-align:center">2,930</td>
</tr>
<tr>
<th style="text-align:center">False Negative (FN)</th>
<td style="text-align:center">182</td>
</tr>
<tr>
<th style="text-align:center">False Positive (FP)</th>
<td style="text-align:center">35</td>
</tr>
<tr>
<th style="text-align:center">True Negative (TN)</th>
<td style="text-align:center">2,049</td>
</tr>
</tbody>
</table>

The commonly used performance metrics to evaluate any classification model are as follows.

**Accuracy score:** The proportion of correctly predicted instances among the total instances.

{{%code class="language-json"%}}Accuracy = (TP+TN)/Total Responses = (2,930+2,049)/5,196 = 0.958{{%/code%}}

**Precision score:** Precision is a measure of how accurate the model's positive predictions are. It is calculated as the ratio of true positive predictions to the sum of true positive and false positive predictions.

{{%code class="language-json"%}}Precision = TP/(TP+FP) = 2,930/(2,930+35) = 0.988{{%/code%}}

**Recall score:** Recall score, also known as sensitivity, is the percentage of actual positive cases that a model correctly predicts. It is calculated by dividing the number of true positive predictions by the sum of the true positive and false negative predictions. In simpler terms, a recall score measures how well a model can identify all of the positive cases in a dataset. A high recall score means that the model is good at finding all of the positive cases, while a low recall score means that the model is missing a lot of positive cases.

{{%code class="language-json"%}}Recall = TP/(TP+FN) = 2,930/(2,930+182) = 0.941{{%/code%}}

**F1 score:** The harmonic mean of precision and recall, providing a balanced assessment of the model's performance.

{{%code class="language-json"%}}F1 Score = 2*Recall*Precision/(Recall+Precision) = 2*0.941*0.988/(0.941+0.988) = 0.9639{{%/code%}}

From the above metrics, we can infer a few conclusions about the model.

1. The airline model accurately predicted 4,979 passengers' level of satisfaction, whereas 182 satisfied passengers were incorrectly predicted as neutral/dissatisfied (false negatives) and 35 neutral/dissatisfied passengers were incorrectly predicted as satisfied (false positives).
2. It has good performance, with an accuracy of 95.8%, a precision of 98.8%, and a recall of 94.1%. However, it has missed 182 satisfied-passenger predictions. Hence, the model should be fine-tuned to increase the recall score, which would then identify more of the satisfied passengers.

By examining these model metrics in QuickML, we can gain deeper insights into the performance of any machine learning model and make informed decisions on model selection and optimization. This empowers users to fine-tune their models and improve predictive accuracy.

### Evaluation metrics

QuickML shows the evaluation metrics below for the classification and regression models created.

1. #### Classification <br />
2. #### Regression <br />
3. #### {{%link href="/en/quickml/help/learning-center/recommendation/#evaluation-metrics-for-specific-algorithms" %}}Recommendation{{%/link%}}
4. #### {{%link href="/en/quickml/help/learning-center/time-series/#model-evaluation-metrics" %}}Time Series{{%/link%}}

### Cross validation metrics

Cross validation is a method for evaluating the performance of a machine learning model by splitting the training data into k folds, training the model on k-1 folds, and evaluating the model on the remaining fold. This process is repeated k times, and the average performance of the model across the k folds is used to evaluate its overall performance.

In simpler terms, cross validation works by training the model on a subset of the training data and then evaluating its performance on the remaining subset. This is repeated multiple times, and the average performance of the model on all of the subsets is used to evaluate its overall performance. It helps to ensure that the model is not overfitting the training data and that it will generalize well to new data.

QuickML provides a plethora of cross validation metrics to track performance for both classification and regression models. The list of metric types provided in cross validation is below:

1. #### Classification Model <br />

   **Metric Types:**

   - ROC AUC OVR
   - ROC AUC OVO
   - ROC AUC OVR weighted
   - ROC AUC OVO weighted
   - Balanced accuracy
   - Average precision
   - F1 score
   - F1 macro
   - F1 micro
   - F1 samples
   - F1 weighted

2. #### Regression Model <br />

   **Metric Types:**

   - Negative mean-squared error
   - Negative mean-squared log error
   - Negative root mean-squared error
   - Negative mean absolute error
   - Negative median absolute error
   - Negative mean Poisson deviance
   - Negative mean gamma deviance
   - Negative log loss
   - Negative Brier score
   - R2 score

### Model Versions

Model versioning is the process of tracking and managing different versions of a machine learning model.

<br />

This is important because it allows you to compare different versions of the model, track its performance, and select the best version for deployment. Model versioning can also help you roll back to a previous version of the model if necessary.

### Feature Importance

Identifying the importance of each input feature used in building a machine learning model can help you track how each one affects the quality of the model and its decision-making process. A feature importance score is calculated to quantify this importance; the higher the score for a feature, the larger the impact of that feature on the model that predicts the target variable. By quantifying the importance of each feature, it provides insights into the data and the model, allowing for better understanding, interpretation, and potential improvement of the model's performance. Feature importance can be calculated using various methods, such as permutation importance or mean decrease in impurity, or using model-specific techniques like the feature importances attribute in tree-based models.

#### Uses of Feature Importance

Feature importance is also used to understand the relationship between input features and target variables, similar to the correlation matrix. Highly relevant features that impact the model are identified using the importance score.

- **Dimensionality Reduction:** Feature importance assists in reducing the dimensionality of the model by keeping features with higher scores and removing those with lower scores from the training data. This simplification reduces bias, removes noise, and speeds up model training, ultimately generating a better-performing model.
- **Model Performance:** By focusing on the most relevant features, feature importance helps to optimize the model for better performance and improve its predictability.
- **Stakeholder Insights:** Feature importance provides stakeholders with insights into which features have the highest impact on the model's predictions and helps in understanding the model's behaviour.

In short, feature importance explains the significance of each feature to the model.

<br />

On the QuickML model details page, a bar chart is generated to visually depict the feature importance scores of the top 20 features in descending order; the rest are categorized as "others".

--------------------------------------------------------------------------------
title: "Pipeline Execution Metrics"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.562Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/pipeline-execution-metrics/"
service: "QuickML"
--------------------------------------------------------------------------------

# Pipeline Execution Metrics

Once any pipeline is published, it gets scheduled for execution in an independent environment. The pipeline executions are versioned for subsequent runs, and the execution details are monitored and gathered by the QuickML service. These execution details can be viewed from the **Pipeline Metrics viewer** integrated in QuickML. These metrics help in assessing the performance and resource usage of pipeline executions and are beneficial for the improvement and continuous development of ML pipelines.

<br />

The metrics viewer can be used to view information about the **full pipeline execution and stage execution**. The metrics listed below are available for monitoring:

- **Pipeline Execution Information**
  - Execution time
  - Average memory usage
  - Average disk usage
  - Average CPU usage
- **Runtime details**
  - Best pipeline run
  - Previous pipeline run
- **Resource usage charts**

### Pipeline Execution Information

From the pipeline details page, the execution info window can be opened; it contains the respective execution information, as shown below.

<br />

### Resource usage graphs

To keep track of time-based resource usage in pipeline execution, status charts and graphs are available in QuickML to give a better understanding of the stage-wise execution and its details. These charts can also be zoomed into chart-based editors for further analysis of the execution.

![Process_stats](/images/help/quickml/pipeline-metric-viewer/Process_stats.webp)

--------------------------------------------------------------------------------
title: "Pipeline Endpoints"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.562Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/pipeline-endpoints/"
service: "QuickML"
--------------------------------------------------------------------------------

# QuickML Pipeline Endpoints

# Endpoints

The primary outcome of machine learning (ML) pipelines is to **create endpoints** that can be used to make predictions on new data. These endpoints are created using trained ML models, and they can be used for continuous inference.
Once an ML model is **developed and trained**, QuickML users can create an endpoint with the latest version of the model. This endpoint can then be used to make predictions on new data. QuickML monitors all endpoints to identify areas where the model can be improved.

<br />

### Published Model Version

Once a machine learning model has been generated using the QuickML pipeline, a user can choose to **publish** it as an endpoint with a **specific model version**. This allows users to deploy and use the model to make accurate predictions. Models are versioned whenever there is a change in the pipeline stage configurations for the pipeline execution.

<br />

### Model tester

In order to check the accuracy of the model, a **Test the Model** component (with respect to model versions) has been integrated into the QuickML Endpoints module, in which live requests can be made within the web app using the Endpoint URL to test the model's inference.

<br />

### Model Explanation Chart

Each input feature in the data used for training the machine learning model has an impact on the model's predictions. However, these models exhibit black-box behaviour, making it harder to understand how they arrive at those predictions. Quantifying the contribution of each feature to the final prediction is challenging, which hinders our understanding of the model's behavior.

To address this issue, QuickML provides a Model Explanation Chart. This chart shows the effect on the prediction on the x-axis and the list of features on the y-axis. For each feature, the effect on the prediction is quantified by SHAP (SHapley Additive exPlanations) values, which explain the decision of the machine learning model. These values help to understand and interpret the model's decisions and increase transparency. The distribution of values on the x-axis indicates whether the respective feature has a positive or negative impact on the prediction, along with its score. This chart is located on the endpoint details page, as shown below.

<br />

In summary, the Model Explanation Chart in QuickML shows the influence of each feature on the final prediction result for a given input.

A quick reference video on the model explanation chart view:

<iframe src="https://workdrive.zohoexternal.com/embed/aw71qaca827e70c8a4c70917ac924336ca5be?toolbar=false&appearance=light&themecolor=green" scrolling="no" frameborder="0" allowfullscreen=true width="800" height="450" title="Embed code" ></iframe>

A quick overview of the model explanation chart located on the endpoint details page shows the impact score of each feature on the prediction in the response box, and a visual representation of the impact in the chart above.

## Endpoints Authentication

The live endpoints are exposed via REST APIs from the QuickML service, and they can be accessed in two modes:

- External OAuth2 authentication
- Internal authentication

### External OAuth2 authentication

The endpoints that are exposed as REST APIs are authenticated with the Zoho Accounts OAuth2 mechanism. Users can make prediction calls externally by generating tokens from the accounts [console](https://api-console.zoho.com/).
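As an illustration of such an external call, a prediction request from Python might look like the sketch below; every URL, payload field, and key value is a placeholder, and the required headers are specified in the prediction guidelines that follow:

{{%code class="language-python"%}}# Hypothetical external prediction call; all values are placeholders.
# The required headers come from the prediction guidelines below.
import requests

ENDPOINT_URL = "https://<deployment-url>"  # from the endpoint details page
headers = {
    "X-QUICKML-ENDPOINT-KEY": "<endpoint-key>",
    "Authorization": "Zoho-oauthtoken <access-token>",  # OAuth2 access token
    "CATALYST-ORG": "<org-id>",
    "Environment": "Development",  # or "Production"
}
# Assumed payload shape: feature name/value pairs for a single record
payload = {"year": 2018, "mileage": 42000, "fuel_type": "petrol"}

response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
print(response.status_code, response.json()){{%/code%}}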
- **Prediction guidelines:** To make the REST API calls, the user should provide the details given below.
  - **Request URL:** The deployment URL provided in the endpoint details
  - **HTTP Method:** POST
  - **Headers:** {{%code class="language-json"%}}{
  "X-QUICKML-ENDPOINT-KEY": "************************",
  "Authorization": "Zoho-oauthtoken <access-token>",
  "CATALYST-ORG": "********",
  "Environment": "Development/Production"
}{{%/code%}}
To generate an access token, please follow the steps in this document.
  - **Scope:** QuickML.deployment.READ

--------------------------------------------------------------------------------
title: "Stage Reconfiguration"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.562Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/stage-reconfiguration/"
service: "QuickML"
--------------------------------------------------------------------------------

### Understanding stage reconfiguration in QuickML pipelines

Building machine learning pipelines often involves chaining multiple operations, each dependent on the outcome of the previous one. In complex pipelines with 20+ stages, modifying a single stage may trigger unintended configuration changes in downstream stages. This can lead to tedious reconfiguration work, loss of previous settings, and disrupted data flow. To address this challenge and simplify the maintenance of complex workflows, QuickML introduces **Stage Reconfiguration**, an automated mechanism that runs in the background as you build pipelines.

### What is Stage Reconfiguration?

Stage Reconfiguration is a smart, user-guided mechanism in QuickML that identifies the downstream stages impacted by a change in a previous stage. Rather than automatically resetting the configurations of all successor stages, the system prompts users with the affected stages and options to reconfigure, skip, or reset only the necessary stages, enabling efficient and controlled pipeline management. This ensures:

- Reduced impact of configuration changes on subsequent stages, enabling a smoother pipeline-building process
- Transparency about which stages are affected
- Flexibility to handle changes as per business needs

Let's consider a scenario to understand the use of stage reconfiguration in real time. Imagine a retail company building a sales forecasting pipeline in QuickML to analyze product performance across different regions. The pipeline includes over 25 stages, covering everything from data ingestion to algorithm selection. In the initial pipeline version, the team selects the amount column in Stage 4 (Select Columns) and later applies a Type Conversion to it in Stage 7. During refinement, a team member decides to remove the amount column in Stage 4, believing it's no longer needed for analysis. QuickML's Stage Reconfiguration mechanism instantly detects that this change impacts Stage 7, which still depends on the amount column for type conversion. Instead of resetting all subsequent stages, the system pinpoints Stage 7 as affected and presents options. Since the column is no longer required, the team simply chooses to skip and merge Stage 7. This keeps the pipeline valid and avoids unnecessary disruptions, demonstrating how Stage Reconfiguration helps teams make precise changes without reworking the entire flow, as the sketch below illustrates.
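Conceptually, the impact check can be thought of as comparing the edited stage's new output schema against the columns each downstream stage depends on. The following toy sketch is purely illustrative (it is not QuickML's internal implementation) and mirrors the scenario above:

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    required_columns: set = field(default_factory=set)

def impacted_stages(new_schema, downstream):
    """Return downstream stages whose required columns are no longer
    present in the edited stage's output schema."""
    return [s for s in downstream if not s.required_columns <= new_schema]

# Removing 'amount' in Stage 4 (Select Columns) breaks Stage 7 (Type Conversion)
schema_after_edit = {"region", "product_id", "quantity"}
downstream = [
    Stage("Stage 5: Filter Rows", {"region"}),
    Stage("Stage 7: Type Conversion", {"amount"}),
]
for stage in impacted_stages(schema_after_edit, downstream):
    print(f"Affected: {stage.name}")  # -> Affected: Stage 7: Type Conversion
```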
### Pipeline Types and Stage Reconfiguration Compatibility

Stage Reconfiguration is implemented across various pipeline types in QuickML, depending on whether the pipeline supports editable, user-configurable stages. The mechanism is especially useful in Classic mode pipelines, where changes made in one stage can directly affect subsequent stages and require reconfiguration. Reconfiguration is typically triggered only when there is a schema change (such as adding, removing, or renaming columns) or a value change that affects how the data is processed in the following stages. In Smart mode pipelines, however, this feature is typically not required, because the stages do not involve manual configuration. Instead of resetting or reconfiguring, Smart mode handles updates through simple re-execution, since the logic remains consistent and controlled by the system.

The table below outlines which pipeline types support Stage Reconfiguration and explains the reason for its applicability or exclusion in each case:

<table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:25%;">Pipeline Type</th> <th style="text-align:left; width:25%;">Stage Reconfiguration Applicable</th> <th style="text-align:left; width:50%;">Reason</th> </tr> </thead> <tbody> <tr> <td style="text-align:left">Data Transformation</td> <td style="text-align:left">Yes</td> <td style="text-align:left">Stages are user-configurable; reconfiguration is supported</td> </tr> <tr> <td style="text-align:left">Prediction</td> <td style="text-align:left">Yes</td> <td style="text-align:left">Stages are user-configurable; reconfiguration is supported</td> </tr> <tr> <td style="text-align:left">Text Analytics</td> <td style="text-align:left">Only in Classic Mode</td> <td style="text-align:left">In Smart mode, no stage-level edits are required</td> </tr> <tr> <td style="text-align:left">Recommendation</td> <td style="text-align:left">Yes</td> <td style="text-align:left">Stages are user-configurable; reconfiguration is supported</td> </tr> <tr> <td style="text-align:left">Forecasting</td> <td style="text-align:left">No</td> <td style="text-align:left">Uses Smart mode by default; no stage-level edits required</td> </tr> <tr> <td style="text-align:left">Clustering</td> <td style="text-align:left">Yes</td> <td style="text-align:left">Stages are user-configurable; reconfiguration is supported</td> </tr> <tr> <td style="text-align:left">Anomaly Detection</td> <td style="text-align:left">No</td> <td style="text-align:left">Uses Smart mode by default; no stage-level edits required</td> </tr> </tbody> </table>

### Working process of stage reconfiguration

When you make a change in a pipeline stage, QuickML automatically checks whether this change impacts the configurations of the stages that follow. Depending on what you changed, there are two possible outcomes.

**Case 1: The change has no impact on subsequent stage configurations**

If your change doesn't alter the schema of the dataset used in subsequent stages, QuickML will recognize that those stages are still valid. However, to maintain consistency in the pipeline, the system will re-execute the subsequent stages. In this case, you'll see two options:

- **Ignore and Reset All** - This resets the configurations of all subsequent stages, even though they aren't affected by your change.
- **Proceed** - This re-executes the subsequent stages using their existing configurations without requiring any rework.
For example, suppose in Stage 2 you use a **Select/Drop Columns** operation to drop a column named **membership_category**, which is not used in any of the subsequent stages. Since none of the later stages rely on this column, your change doesn't impact their configuration. In this case, QuickML recognizes that the schema of the subsequent stages remains intact. You can simply choose to proceed, and the pipeline will re-execute without requiring any reconfiguration.

**Case 2: The change impacts the subsequent stage configurations**

If your change affects the schema of the dataset used in later stages, QuickML will highlight which stages are impacted. You'll then be given the set of options below to control how you want to handle the affected stages:

- **Ignore and Reset All** - This ignores the configurations of all impacted stages and resets them.
- **Configure Now** - This opens a pop-up that lists the affected stages and shows you exactly what has changed. Within the configuration pop-up, you can:
  - **Review and Edit** each impacted stage to update its configuration based on your recent change.
  - **Cancel** to undo the change you just made, taking the pipeline back to its previous state.
  - **Skip and Merge** a stage if it's no longer required in the new flow.
- **Discard** - This reverts the change made in the current stage itself and restores it to its original configuration, effectively undoing the edit you just applied. This is useful when you want to cancel the recent change due to its impact on the rest of the pipeline.

Here's an example: Suppose in Stage 2 you use a **Select Columns** operation to include the churn_risk_score column, and in Stage 4 you apply a **Type Conversion** to that column. If you remove the churn_risk_score column in Stage 2, Stage 4 will break because it depends on that column. QuickML will detect this and show you that Stage 4 is affected. You'll then have the choice to either re-add the column in Stage 2, update Stage 4 to use a different column, or skip Stage 4 entirely if it's no longer necessary.

This approach gives you full control over how to handle changes without unnecessarily losing previous work. It also helps you avoid time-consuming reconfiguration by focusing only on what truly needs your attention.

### Stage-Specific Edge Cases

**Behavior of Added Columns in the Select/Drop Stage**: Columns added during reconfiguration will not impact the configuration of subsequent stages. However, they will be listed in the Select Columns dropdown of any affected stages, allowing you to include them if needed.

**Behavior of Dropped Columns in the Merge Columns Stage**: If a dropped column was previously used in a Merge Columns operation, you must initiate reconfiguration starting from the Merge Columns stage to clear the dependency and update the configuration accordingly.

**Behavior of Dropped Columns in the Split Columns Stage**: If a dropped column is used in Split Columns and shared across two outputs, two separate reconfiguration layers will be shown, based on how the column was modified.

**Add Dataset Impact**: Changes in the Add Dataset stage can impact the entire pipeline.

<u>**Points to remember**</u>

To make the most of the stage reconfiguration feature, keep the following key points in mind:

- **Page Reload Handling**: Reloading the page during stage configuration will prompt you to either revert all changes or resume from where you left off.
- **Effect of Delete Actions**: The Delete Connection and Delete Stage actions reset the configurations of successor stages to their defaults.
- **Reconfiguration Cancellation**: Cancelling the reconfiguration during re-execution will revert all intermediate changes.

--------------------------------------------------------------------------------
title: "QuickML Limitations"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.563Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/quickml-limitations/"
service: "QuickML"
--------------------------------------------------------------------------------

# QuickML Limitations

1. ### Entity Limits

<table class="content-table" style="width:700px;"> <tbody> <tr> <th style="text-align:center">Maximum no. of datasets that can be created</th> <td style="text-align:center">25</td> </tr> <tr> <th style="text-align:center">Maximum no. of pipelines that can be created</th> <td style="text-align:center">25</td> </tr> <tr> <th style="text-align:center">Maximum no. of endpoints that can be created</th> <td style="text-align:center">10</td> </tr> <tr> <th style="text-align:center">No. of columns that can be configured in a single stage operation in the builder</th> <td style="text-align:center">25</td> </tr> </tbody> </table>

2. ### Dataset sample preview

**Dataset details preview section**

<table class="content-table" style="width:700px;"> <tbody> <tr> <th style="text-align:center">Number of columns that can be viewed</th> <td style="text-align:center">50</td> </tr> <tr> <th style="text-align:center">Number of records that can be viewed</th> <td style="text-align:center">250</td> </tr> </tbody> </table>

**Pipeline builder preview section**

<table class="content-table" style="width:700px;"> <tbody> <tr> <th style="text-align:center">Number of columns that can be viewed</th> <td style="text-align:center">50</td> </tr> <tr> <th style="text-align:center">Number of records that can be viewed</th> <td style="text-align:center">250</td> </tr> </tbody> </table>

3. ### Visualization Limits

<table class="content-table" style="width:900px;"> <tbody> <tr> <th style="text-align:center">Maximum number of visualizations that can be created</th> <td style="text-align:center">5 per dataset version</td> </tr> <tr> <th style="text-align:center">Max data points in chart preview</th> <td style="text-align:center">2000 records</td> </tr> <tr> <th style="text-align:center">Max data points in a chart</th> <td style="text-align:center">25000 records</td> </tr> <tr> <th style="text-align:center">Maximum number of categories displayed in a chart (e.g., composition charts: Category vs. Count)</th> <td style="text-align:center">The top 19 categories are displayed; the rest are grouped into a category called "Others" (applies to Pie, Funnel, and similar charts)</td> </tr> <tr> <th style="text-align:center">Maximum number of series or trend lines displayed in line charts</th> <td style="text-align:center">100</td> </tr> <tr> <th style="text-align:center">Maximum number of columns in a correlation heatmap</th> <td style="text-align:center">25</td> </tr> <tr> <th style="text-align:center">Maximum number of records used for stage visualization in the pipeline builder</th> <td style="text-align:center">250</td> </tr> <tr> <th style="text-align:center">Chart events unavailable</th> <td style="text-align:center">When any single series has more than 8000 points</td> </tr> </tbody> </table>

4.
### Data source Limitations

<table class="content-table" style="width:700px;"> <thead> <tr> <th style="text-align:center">Source</th> <th style="text-align:center">Restrictions</th> </tr> </thead> <tbody> <tr> <td style="text-align:center">Zoho apps</td> <td style="text-align:center">200K records</td> </tr> <tr> <td style="text-align:center">The periodic import capability for CRM, Bigin, and Recruit</td> <td style="text-align:center">One million records</td> </tr> <tr> <td style="text-align:center">Cloud Data Storages (Object Storages)</td> <td style="text-align:center">1 GB size</td> </tr> <tr> <td style="text-align:center">Databases</td> <td style="text-align:center">100K records</td> </tr> <tr> <td style="text-align:center">File Upload</td> <td style="text-align:center">1 GB size</td> </tr> </tbody> </table>

5. #### Points to remember

<table class="content-table" style="width:700px;"> <tbody> <tr> <th style="text-align:center">Local File System</th> <td style="text-align:center">Dataset features will be named col_0, col_1, col_2, etc., if uploaded without column names.</td> </tr> <tr> <th style="text-align:center">Rules in feature engineering</th> <td style="text-align:center">Feature names must not contain the characters [, ], <, or ></td> </tr> </tbody> </table>

### Advanced models

--------------------------------------------------------------------------------
title: "Recommendation"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.563Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/learning-center/recommendation/"
service: "QuickML"
--------------------------------------------------------------------------------

# Introduction to Recommendation System

A **recommendation system** is a branch of machine learning focused on building recommendation models that predict user preferences and generate personalized item suggestions by analyzing users' past interactions, behaviour, item attributes, and demographic data. These systems are widely used across various domains, such as e-commerce, media streaming, and social networking platforms, where they help users discover relevant products, movies, books, or other content based on their previous interactions. Through personalized item suggestions, recommender systems improve user satisfaction, boost engagement, and increase user retention and sales for businesses by presenting the most relevant options for each user.

### Sample Business Applications

Recommendation models are actively integrated into e-commerce platforms to solve multiple use cases, which drives business value.

#### 1. Sequential Recommendations in E-commerce Platforms:

Sequential recommendation can be applied in e-commerce platforms to suggest products based on users' browsing and purchase history. By analyzing the sequence of products viewed or bought by users, businesses can provide personalized recommendations, thereby increasing sales and customer satisfaction.

**Scenario:** In e-commerce platforms, analyzing the sequential patterns of user interactions can reveal insights into user preferences and purchasing habits. For example, if a user purchases running shoes followed by sports apparel, the platform can identify this sequence. Based on this information, the platform can recommend complementary products such as running socks, water bottles, or fitness trackers.
These personalized recommendations not only increase the likelihood of additional purchases, but also enhance the overall shopping experience for the user.

#### 2. Personalized Product Suggestions:

Personalized recommendations in e-commerce platforms involve suggesting products to users based on their browsing history, purchase behavior, and preferences. By analyzing user interactions and historical data, e-commerce platforms can deliver tailored product recommendations, thereby enhancing user experience and increasing sales.

**Scenario:** Consider an established e-commerce platform where users browse a vast array of products across different categories. By analyzing user behavior, such as past purchases, items added to the cart, and products viewed, the platform can generate personalized product suggestions. For example, if a user frequently purchases electronics and gadgets, the platform can identify this preference. Based on this insight, the platform can recommend similar products, accessories, or related items that align with the user's interests. These tailored recommendations not only improve user satisfaction but also increase the likelihood of conversion and repeat purchases.

#### 3. Consumer Goods and FMCG (Fast-Moving Consumer Goods):

In industries such as consumer goods and FMCG, where products are purchased frequently and regularly, optimizing repeat purchases is essential for businesses. The challenge lies in effectively recommending staple items or commonly purchased products to customers based on their purchase history and consumption patterns.

**Scenario:** For example, in the food and beverage sector, a grocery delivery service can use repeat purchase recommendation algorithms to suggest staple items or commonly purchased groceries based on a customer's purchase history. By analyzing past orders and consumption patterns, the platform can anticipate when certain items are likely to be depleted and recommend replenishment to streamline the ordering process for customers.

#### 4. Retail Inventory Management:

By predicting the next occurrence of product purchases, retailers can optimize their inventory levels and ensure sufficient stock availability. This enables them to prevent stockouts, minimize overstocking, and enhance overall supply chain efficiency.

**Scenario:** E-commerce platforms, and even retail vendors, offer subscription services with various renewal intervals, such as monthly or bimonthly. Additionally, by analyzing recurring purchase patterns, retail businesses can better manage inventory, minimizing stockouts and overstocking. By optimizing inventory levels, businesses can ensure stock availability, consistently serving users' needs.

#### 5. Subscription Services:

Subscription-based businesses, such as subscription boxes, meal kit delivery services, or subscription-based software platforms, rely on maintaining a steady stream of recurring revenue through subscription renewals. However, managing subscription renewals for plans that require manual reminders, especially when renewal interval patterns are complex, presents a challenge.

**Scenario:** For example, in the case of a newspaper subscription service, some customers may only renew their subscriptions during specific periods, such as visits to their vacation homes. In such cases, repeat purchase recommendation algorithms can play a crucial role in identifying subscription renewal dates and providing timely reminders to users.
By analyzing subscribers' renewal history and understanding complex renewal patterns, the recommendation system can enhance retention rates and ensure a consistent flow of revenue for the business.

Similar to e-commerce platforms, recommendation algorithms can be applied in various other industries to enhance user experience and engagement, and ultimately drive business value. A few examples:

- **Entertainment and Media:** Content streaming platforms recommend movies, music, and articles based on past consumption.
- **Education:** Generate course recommendations and customize learning paths based on the student's interests and learning style.
- **Social Media:** Suggest content, friends, and groups based on user interactions and interests.
- **Healthcare:** Provide medication reminders and healthcare plan suggestions based on patient medical history and health data.

These are just a few examples; many other industries are actively exploring ways to integrate recommendation models to drive value and enhance business outcomes.

## Recommendation Model Evaluation

These evaluation metrics collectively provide insights into different aspects of the performance of a recommendation model, including the diversity, quality, predictive accuracy, and overall correctness of the item suggestions the model generates. Let's explore and interpret the metrics used in evaluating recommendation models:

#### a. Coverage

**What it tells you:** Coverage measures the proportion of all possible items that the system can recommend.

**Intuition:**

- High coverage means the system can recommend a wide variety of items, making it more diverse and less biased toward popular items.
- Low coverage suggests that the recommendations are limited to a smaller set of items, which may not cater to all users.

**Example Inference:** If coverage is 90%, the system can recommend 90% of the items in the catalog, ensuring diverse options for users.

#### b. NDCG at K (Normalized Discounted Cumulative Gain at K)

**What it tells you:** NDCG measures how well the recommendations are ranked, focusing on the position of relevant items within the top K recommendations.

**Intuition:**

- High NDCG means relevant items appear near the top of the list, making the recommendations more useful to users.
- Low NDCG indicates that relevant items might be buried deeper in the list, reducing user satisfaction.

**Example Inference:** If NDCG@10 is high, users are more likely to see relevant recommendations within their top 10 suggestions, enhancing their experience.

#### c. Accuracy

**What it tells you:** Accuracy is the percentage of correct predictions (relevant vs. irrelevant recommendations) made by the system.

**Intuition:**

- High accuracy means the system is generally good at distinguishing relevant recommendations from irrelevant ones.
- However, it doesn't capture how well the system handles edge cases, like rare items or user-specific preferences.

**Example Inference:** If accuracy is 95%, most recommendations are likely correct, but it doesn't ensure that all relevant items are included.

#### d. ROC AUC (Receiver Operating Characteristic Area Under the Curve)

**What it tells you:** ROC AUC evaluates how well the system can separate relevant from irrelevant items across different thresholds.

**Intuition:**

- High AUC means the system can effectively rank relevant items higher than irrelevant ones, regardless of the cut-off threshold.
- Low AUC suggests that the system struggles to distinguish relevant items from irrelevant ones.

**Example Inference:** If AUC is 0.95, the system is very reliable at ranking relevant items consistently across all users.

#### e. Precision

**What it tells you:** Precision measures the proportion of recommended items that are actually relevant.

**Intuition:**

- High precision means most of the recommendations made are useful and relevant to the user.
- Low precision indicates that users are being shown many irrelevant items, which could reduce trust in the system.

**Example Inference:** If precision is 80%, 8 out of every 10 recommendations are likely to be relevant.

#### f. Recall

**What it tells you:** Recall measures the proportion of all relevant items that are successfully recommended.

**Intuition:**

- High recall means the system is good at identifying all the relevant items, even if it occasionally includes irrelevant ones.
- Low recall suggests that many relevant items are being missed.

**Example Inference:** If recall is 70%, the system is finding 70% of all the relevant items for a user, but some are still being left out.

#### g. Recurrence Rate

**Definition:** The proportion or frequency of recurrent events or patterns within a dataset.

**What it tells you intuitively:**

- A higher recurrence rate means the dataset contains frequently repeated events or patterns.
- A lower recurrence rate suggests fewer or rare repetitions in the data.

**Example Inference:** If the recurrence rate is 60%, it means 60% of the events in the dataset are part of some repeating pattern. This helps identify how cyclical or predictable the data is.

#### h. Mean Absolute Error (MAE)

**Definition:** The average absolute difference in days between the predicted and actual recurrence.

**What it tells you intuitively:** MAE shows how far off the predictions are, on average, from the true recurrence intervals, ignoring whether the difference is an overestimation or underestimation. Lower MAE means more accurate predictions.

**Example Inference:** If MAE is 3 days, the predicted recurrence intervals are, on average, 3 days away from the actual intervals. This gives a straightforward sense of how precise the predictions are.

#### i. Mean Squared Error (MSE)

**Definition:** The average squared difference in days between predicted and actual recurrence.

**What it tells you intuitively:**

- MSE penalizes larger errors more heavily because it squares the differences.
- Lower MSE means the model is generally accurate and avoids large mistakes in predictions.

**Example Inference:** If MSE is 16 days², it indicates that while most predictions may be close, any larger errors in the recurrence interval predictions significantly impact the score. This makes it useful for identifying models prone to big errors.
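For reference, most of these metrics can be computed in a few lines of Python. The sketch below uses made-up relevance labels and scores purely to show how precision, recall, ROC AUC, and NDCG@K are calculated; it is not QuickML output:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical relevance labels for 10 recommended items (1 = relevant)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
# Model scores used to rank the items, highest first
y_score = np.array([0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2])
y_pred = (y_score >= 0.5).astype(int)  # recommend items above a threshold

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))

def ndcg_at_k(relevance, k):
    """NDCG@k for a ranked list of relevance values."""
    rel = np.asarray(relevance)[:k]
    dcg = np.sum(rel / np.log2(np.arange(2, rel.size + 2)))
    ideal = np.sort(np.asarray(relevance))[::-1][:k]
    idcg = np.sum(ideal / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0

print("NDCG@5:   ", ndcg_at_k(y_true, 5))
```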
### Evaluation metrics for specific algorithms

Recommendation models vary in type, as previously mentioned. Hence, there is no single generic approach to evaluating them; instead, different metrics are used, specific to each algorithm. The evaluation metrics for information retrieval algorithms are as follows:

#### a. Information Retrieval

<table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Recommendation type</th> <th class="w15p">Algorithms</th> <th class="w30p">Evaluation Metrics</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}Sequential Recommendation{{%/badge%}}</td> <td>SubSeq</td> <td> <ul style="text-align: left;"><li><strong>Coverage</strong>: Coverage measures the proportion of all possible items that the system can recommend.</li><li><strong>NDCG at K (Normalized Discounted Cumulative Gain at K)</strong>: NDCG measures how well the recommendations are ranked, focusing on the position of relevant items within the top K recommendations.</li><li><strong>Accuracy</strong>: Accuracy is the percentage of correct predictions (relevant vs. irrelevant recommendations) made by the system.</li></ul> </td> </tr> <tr> <td>{{%badge%}}Personalized Recommendation{{%/badge%}}</td> <td>Light FM</td> <td> <ul style="text-align: left;"><li><strong>ROC AUC (Receiver Operating Characteristic Area Under the Curve)</strong>: ROC AUC evaluates how well the system can separate relevant from irrelevant items across different thresholds.</li><li><strong>Precision</strong>: Precision measures the proportion of recommended items that are actually relevant.</li><li><strong>Recall</strong>: Recall measures the proportion of all relevant items that are successfully recommended.</li></ul> </td> </tr> <tr> <td></td> <td>Pixie</td> <td> <ul style="text-align: left;"><li><strong>Precision</strong>: Precision measures the proportion of recommended items that are actually relevant.</li><li><strong>Recall</strong>: Recall measures the proportion of all relevant items that are successfully recommended.</li><li><strong>Coverage</strong>: Coverage measures the proportion of all possible items that the system can recommend.</li></ul> </td> </tr> <tr> <td>{{%badge%}}Recurring Purchase Recommendations{{%/badge%}}</td> <td>Recurrence Finder</td> <td> <ul style="text-align: left;"><li><strong>Recurrence Rate</strong>: The proportion or frequency of recurrent events or patterns within a dataset.</li><li><strong>Mean Absolute Error (MAE)</strong>: The average absolute difference in days between predicted and actual recurrence.</li><li><strong>Mean Squared Error (MSE)</strong>: The average squared difference in days between predicted and actual recurrence.</li></ul> </td> </tr> </tbody> </table>

Each of these evaluation metrics serves a specific purpose in assessing a given model, providing insights into its performance and reliability.

--------------------------------------------------------------------------------
title: "Time Series"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.564Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/learning-center/time-series/"
service: "QuickML"
--------------------------------------------------------------------------------

# Introduction to Time Series Data

Time series refers to a sequence of data points representing how one or more features change over time. These data points are recorded at regular intervals, allowing for the analysis of trends, seasonal changes, patterns, and anomalies over time.
Time series data can be categorized into two types: **univariate**, where only one feature is recorded, and **multivariate**, where multiple features are tracked simultaneously over time. The primary goal of time series analysis is often to predict the future values of a feature, which serves as the target variable in tasks such as forecasting or anomaly detection.

### Components of Time series

Time series components characterize the underlying patterns or behavior of the data series over time. These are the factors that affect the values of the observed data points. Understanding these components helps in creating better, more accurate models. The four components of time series data are:

- Trend
- Seasonal
- Cyclic
- Noise

#### 1. Trend

The trend component represents the long-term direction or movement in the time series. It indicates whether the data is generally increasing, decreasing, or staying constant over time.

**Characteristics**

- Reflects the underlying tendency of the series over a longer period.
- Trends can be linear or non-linear.
- Trends are not affected by short-term fluctuations or irregularities.

**Example** <br /> In a company's sales data, a steady increase in sales over several years indicates a positive trend.

#### 2. Seasonal

The seasonal component represents regular, repeating patterns in the time series that occur within a fixed period, such as daily, monthly, or yearly.

**Characteristics**

- Seasonality occurs at regular intervals and has a fixed period.
- It is often driven by external factors such as weather, holidays, or cultural events.
- Seasonal effects are predictable and recur at the same time each year, month, week, or day.

**Example** <br /> The example here shows the monthly passenger counts for each month from 1949 to 1960, where we can see the seasonal distribution of the passengers in the chart.

#### 3. Cyclic

The cyclic component refers to fluctuations in the time series that occur over longer periods, usually influenced by economic or business cycles.

**Characteristics**

- Cycles are usually irregular in period and amplitude.
- They can last for more than a year, often several years.
- Unlike seasonality, cyclic patterns are less predictable in their frequency and duration, and are influenced by broader economic factors.

**Example** <br /> Sales of new family houses fluctuate during certain periods, but these changes are not seasonal, as seen in the graph. These fluctuations could be influenced by periods of economic expansion and recession, which are less predictable than seasonal patterns.

#### 4. Noise

The noise component captures the random variation in the time series that cannot be attributed to trend, cyclic, or seasonal patterns. It represents the unpredictable, irregular fluctuations in the data.

**Characteristics**

- Noise is the residual part of the time series after removing the trend, cyclic, and seasonal components.
- It is random and unpredictable.
- Noise does not follow any specific pattern and can be caused by random factors or measurement errors.

**Example** <br /> As we can conclude from the graph, there is no specific trend, seasonality, or cyclical pattern in the changes in the company's stock price. The price fluctuations appear as random ups and downs with no predictable pattern.

### Stationarity

Stationarity refers to a characteristic of time series data where the statistical properties, such as mean, variance, and autocorrelation, remain constant over time.
- Mean is the average value of the observations over a certain time period.
- Variance is the measure of the spread or dispersion of the values around the mean.
- Autocorrelation is the correlation of the series with its previous values.

All three statistical properties of the time series data should remain constant over time to conclude that the series is stationary. Understanding whether the series is stationary or not is important, as it informs the type of preprocessing needed to build an effective time series model. When a series is stationary, the recorded values either do not change or stay within the same range over time. Let's look at example use cases where the recorded series is considered stationary or non-stationary.

#### a. Non-stationary series

A time series is considered non-stationary if the recorded values are affected at different times by the presence of trend, seasonality, cyclic patterns, or noise in the data.

**Example:**\
Airline passenger counts show an increasing trend with seasonal fluctuations every year, as shown in the plot. Hence, this series is considered non-stationary.

**The ADF test result to assess the presence of stationarity**

```
ADF Statistic: 0.81
p-value: 0.99
Critical Values: {'1%': -3.48, '5%': -2.88, '10%': -2.57}
Fail to reject the null hypothesis: The series is not stationary.
```

<br />

#### b. Stationary series

When a series is stationary, the recorded values either do not change or stay within the same range over time.

**Example:**\
After applying a differencing transformation of lag 2, the non-stationary series is converted into a stationary series. If we plot the transformed values in a chart, it appears as shown below: <br />

**ADF test results to assess the presence of stationarity after data transformation**

```
ADF Statistic: -2.96
p-value: 0.03
Critical Values: {'1%': -3.48, '5%': -2.88, '10%': -2.57}
Reject the null hypothesis: The series is stationary.
```

Time series models are typically built using stationary data. If a series is non-stationary, it should be transformed into a stationary series before training the model to ensure accurate future predictions. Two statistical tests, known as the ADF and KPSS tests, can be performed on the time series data to determine whether stationarity is present in the series.

#### Tests for Stationarity

**Stationarity** in a time series means that the statistical properties of the series do not change over time. In other words, if you look at different parts of the series, they should look statistically similar. For a series to be stationary, it should have **a constant mean, constant variance, and constant autocorrelation.**

There are two statistical methods that can be performed on the time series data to check for the presence of stationarity: the **ADF** test and the **KPSS** test. However, before going into the stationarity tests, let's understand a core concept of statistical testing: the hypothesis.

**Hypothesis:** Hypothesis testing is a statistical method used to make inferences or draw conclusions about a dataset using sample data drawn from it. It helps us decide whether there is enough evidence in our sample to support or reject a particular claim or hypothesis about the whole dataset.

- **Null Hypothesis (H₀)**: This is the assumption that there are no effects or differences. It represents the status quo.
- **Alternative Hypothesis (H₁)**: This is the opposite of the null hypothesis, representing a new claim or effect.
This is the claim you test for. Now that we understand what the hypotheses mean, let's look into the stationarity tests:

1. #### ADF Test

The Augmented Dickey-Fuller (ADF) test is a statistical test used to determine whether a time series has a unit root. The presence of a unit root indicates that the values are highly dependent on previous values, which means the series is non-stationary.

**Goal:** The ADF test is used to check whether the series contains a unit root.

**Null hypothesis**: The time series has a unit root; hence, it is non-stationary.

**Interpretation**: If the ADF test statistic is less than the critical value, the null hypothesis is rejected, concluding that the time series is stationary.

**Testing process**: The ADF test evaluates the null hypothesis that a unit root is present in the time series; failing to reject it indicates that the series is non-stationary.

2. #### KPSS Test

The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test is another statistical test used to check the stationarity of a time series. It differs from the ADF test in that it tests the null hypothesis that the time series is stationary around a mean or a deterministic trend.

**Goal**: The KPSS test is used to check whether the series is stationary.

**Null hypothesis**: The time series is stationary around a constant mean or deterministic trend.

**Interpretation**: If the KPSS test statistic is greater than the critical value, the null hypothesis is rejected, indicating that the series is non-stationary.

**Testing process**: The KPSS test examines whether the time series is stationary by testing for a deterministic trend or a mean around which it fluctuates.

#### How to infer the test results

The ADF and KPSS tests generate a few statistical values that help us classify whether the series is stationary or not. These values are:

- Test statistic
- p-value
- Critical values at the 1%, 5%, and 10% significance levels

#### Test Statistic

The test statistic is the value calculated in the tests to determine how likely the observed result would be if the null hypothesis were true. The test statistic is calculated from your sample data.

- For a z-test: z = (X̄ − μ) / (σ / √n) (statistic for the whole data)
- For a t-test: t = (X̄ − μ) / (s / √n) (statistic for sample data)

Here, X̄ is the sample mean, μ is the population mean, s is the sample standard deviation, σ is the population standard deviation, and n is the sample size. In QuickML, the ADF and KPSS test statistics are calculated from their respective tests on the whole input data. Both test statistic values provide a measure of the evidence against the null hypothesis. The strength of the evidence is assessed at three levels (strong, moderate, and weak) by comparing the test statistic to the critical values at the 1%, 5%, and 10% significance levels.

#### P-Value

**What is a p-value?** The p-value in both tests indicates the probability of obtaining a test statistic value as extreme as the one observed, assuming the null hypothesis is true. The p-value is compared against the chosen significance level to determine whether the null hypothesis should be rejected.

**How is it derived?**

1. Formulating and estimating the ADF/KPSS regression model.
2. Calculating the test statistic t𝛄 as mentioned above.
3. Comparing t𝛄 against critical values derived from simulation.
4. Interpolating to obtain an approximate p-value based on how the test statistic compares to critical values under the null hypothesis.
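Both tests are available in standard statistics libraries. The sketch below runs them on a synthetic trending series using statsmodels; it is illustrative only (QuickML runs these tests internally, without code):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

# Synthetic series with a linear trend -- expected to be non-stationary
rng = np.random.default_rng(0)
series = 0.5 * np.arange(200) + rng.normal(0, 5, size=200)

adf_stat, adf_p, _, _, adf_crit, _ = adfuller(series)
print(f"ADF statistic: {adf_stat:.2f}, p-value: {adf_p:.2f}")
print(f"ADF critical values: {adf_crit}")

kpss_stat, kpss_p, _, kpss_crit = kpss(series, regression="c", nlags="auto")
print(f"KPSS statistic: {kpss_stat:.2f}, p-value: {kpss_p:.2f}")
print(f"KPSS critical values: {kpss_crit}")

# First-order differencing removes the linear trend (see Outcome 4 below)
adf_p_diff = adfuller(np.diff(series))[1]
print(f"ADF p-value after differencing: {adf_p_diff:.3f}")
```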
#### In the ADF test

**Null hypothesis:** The time series has a unit root, meaning it is non-stationary.

**Interpretation:** Assuming the significance level is set at 5%:

1. **Low p-value** (≤ 0.05): A low p-value indicates that you can reject the null hypothesis of a unit root, suggesting that the series is stationary.
2. **High p-value** (> 0.05): A high p-value indicates that you fail to reject the null hypothesis of a unit root, suggesting that the series is non-stationary.

#### In the KPSS test

**Null hypothesis:** The time series is stationary around a constant mean or deterministic trend.

**Interpretation:** Assuming the significance level is 5%:

1. **Low p-value** (≤ 0.05): A low p-value suggests rejecting the null hypothesis, indicating that the series is non-stationary.
2. **High p-value** (> 0.05): A high p-value suggests failing to reject the null hypothesis of stationarity, indicating that the series is stationary.

#### Critical Values (at 1%, 5%, and 10% levels):

Critical values are predefined thresholds or benchmarks used in stationarity tests like the Augmented Dickey-Fuller (ADF) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests to determine whether to reject or fail to reject the null hypothesis. These values help assess the statistical significance of the test results and serve as benchmarks for decision-making in both the ADF and KPSS tests. The three levels differ in strictness:

- **Critical value at 1%**: A very strict criterion. If the test statistic is more extreme than the 1% critical value, there is strong evidence to reject the null hypothesis.
- **Critical value at 5%**: If the test statistic is more extreme than the 5% critical value, there is moderate evidence to reject the null hypothesis.
- **Critical value at 10%**: A more lenient criterion. If the test statistic is more extreme than the 10% critical value, there is weaker evidence to reject the null hypothesis.
**Example:**

<table class="content-table quickml-content-table">
<tr> <th>ADF Test</th> <th>KPSS Test</th> </tr>
<tr>
<td> <ul> <li><b>Null Hypothesis:</b> The series has a unit root; hence, it is non-stationary.</li> <li><b>Example 1:</b> <ul> <li>ADF statistic = -3.5</li> <li>Critical values: <ul> <li>CV at 1%: -3.9</li> <li>CV at 5%: -3.3</li> <li>CV at 10%: -3.0</li> </ul> </li> <li>Since -3.5 is more extreme than -3.3 (the 5% critical value), you reject the null hypothesis at the 5% level with moderate evidence.</li> <li>And if the p-value is 0.04, this confirms that you can reject the null hypothesis at the 5% significance level, since 0.04 < 0.05.</li> <li>The test concludes that the series is stationary.</li> </ul> </li> </ul> </td>
<td> <ul> <li><b>Null Hypothesis:</b> The series is stationary.</li> <li><b>Example 1:</b> <ul> <li>KPSS statistic = 0.6</li> <li>Critical values: <ul> <li>CV at 1%: 0.73</li> <li>CV at 5%: 0.46</li> <li>CV at 10%: 0.34</li> </ul> </li> <li>Since 0.6 is greater than the 5% critical value (0.46), you reject the null hypothesis of stationarity at the 5% level with moderate evidence.</li> <li>If the p-value is 0.02, you can reject the null hypothesis at the 5% level, since 0.02 < 0.05.</li> <li>The test concludes that the series is non-stationary.</li> </ul> </li> </ul> </td>
</tr>
<tr>
<td> <ul> <li><b>Example 2:</b> <ul> <li>ADF statistic: 3.14</li> <li>p-value: 1.0</li> <li>Critical values: <ul> <li>CV at 1%: -3.46</li> <li>CV at 5%: -2.87</li> <li>CV at 10%: -2.57</li> </ul> </li> <li>The p-value is greater than the significance level of 0.05, and the ADF statistic is far greater than all the critical values; hence, there is not sufficient evidence to reject the null hypothesis.</li> <li>This concludes that the series is non-stationary.</li> </ul> </li> </ul> </td>
<td> <ul> <li><b>Example 2:</b> <ul> <li>KPSS statistic: 1.65</li> <li>p-value: 0.01</li> <li>Critical values: <ul> <li>CV at 1%: 0.73</li> <li>CV at 5%: 0.46</li> <li>CV at 10%: 0.34</li> </ul> </li> <li>The test statistic (1.65) is greater than the critical value at the 5% level (0.46), and the p-value (0.01) is less than 0.05, so we reject the null hypothesis.</li> <li>This means we have strong evidence to conclude that the series is non-stationary.</li> </ul> </li> </ul> </td>
</tr>
</table>

#### Possible outcomes of the tests

The following are the possible outcomes of applying both tests:

**Outcome 1**: Both tests conclude that the given series is stationary - the series is stationary.

**Outcome 2**: Both tests conclude that the given series is non-stationary - the series is non-stationary.

**Outcome 3**: ADF concludes non-stationary, and KPSS concludes stationary - the series is trend stationary. To make the series strictly stationary, the trend needs to be removed in this case. The de-trended series is then checked for stationarity.

**Outcome 4**: ADF concludes stationary, and KPSS concludes non-stationary - the series is difference stationary. Differencing is used to make the series stationary, and the differenced series is then checked for stationarity.

#### How to address the stationarity present in columns

In QuickML, both the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, as explained above, are used to check the stationarity of each feature in the time series dataset. The QuickML platform offers a set of data transformation techniques that help transform non-stationary columns in the dataset.
By transforming these columns into stationary ones, QuickML ensures that the models generated from the transformed data are well suited to capture underlying patterns and trends. Two data transformation techniques are used primarily to address the stationarity present in columns:

#### Differencing

Differencing is done by computing the differences between consecutive observations. It can help stabilize the mean of a time series by removing changes in the level of the series, thereby eliminating (or reducing) trend and seasonality. The order of differencing indicates the number of times differencing needs to be performed on a non-stationary series to transform it into a stationary one. The maximum order of differencing that can be provided in QuickML is 5. If the series is still non-stationary even after the fifth order of differencing, any of the power transformation methods can be applied to stabilize the variance in the time series.

#### Power transformation

A power transform makes the probability distribution of a variable more Gaussian. This is often described as removing a skew in the distribution, although more generally it is described as stabilizing the variance of the distribution. Apart from the log transformation, we can use a generalized version of the transform that finds a parameter (lambda) that best transforms a variable to a Gaussian probability distribution. In QuickML, there are two types of power transformations available:

- **Box-Cox Transformation**\
The Box-Cox transformation aims to find the best power transformation of the data that reduces skewness and stabilises variance. It is effective when data exhibits heteroscedasticity (unequal variance across levels of predictors) and/or skewness.

- **Yeo-Johnson Transformation**\
The Yeo-Johnson transformation is a modification of the Box-Cox transformation that can handle both positive and negative values of the target variable. Like Box-Cox, Yeo-Johnson transforms data to stabilize variance and normalize distributions, but it is more flexible, as it can handle negative values. Yeo-Johnson is often preferred when the data includes zeros or negative values, which the original Box-Cox cannot handle. A parameter in the Yeo-Johnson transformation, often referred to as lambda, controls the nature of the transform. Different transformation techniques are selected based on the lambda value:

- lambda = -1.0 is a reciprocal transform.
- lambda = -0.5 is a reciprocal square root transform.
- lambda = 0.0 is a log transform.
- lambda = 0.5 is a square root transform.
- lambda = 1.0 is no transform.
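As a concrete reference, the sketch below applies first-order differencing, the Box-Cox transformation, and the Yeo-Johnson transformation to a synthetic non-stationary series using scipy. It is illustrative only; in QuickML these are configured as pipeline operations, not code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
t = np.arange(1, 201)
# Synthetic non-stationary series: upward trend with growing variance
series = 0.8 * t + rng.normal(0, 0.05 * t)

# First-order differencing removes the linear trend
diff1 = np.diff(series, n=1)

# Box-Cox requires strictly positive values; it searches for the lambda
# that best normalizes the distribution
boxcox_vals, bc_lambda = stats.boxcox(series - series.min() + 1.0)

# Yeo-Johnson also handles zeros and negative values directly
yj_vals, yj_lambda = stats.yeojohnson(series)

print(f"Box-Cox lambda: {bc_lambda:.2f}, Yeo-Johnson lambda: {yj_lambda:.2f}")
```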
### Model evaluation metrics

**Metrics for Forecasting**

When evaluating forecasts in time series analysis, several metrics are commonly used to assess the accuracy and performance of the model. Each metric provides different insights into how well the forecasted values align with the actual observed values.

**1. Mean Absolute Percentage Error (MAPE)**

MAPE expresses the average absolute percentage difference between predicted and actual values relative to the actual values. It provides insight into the relative accuracy of the forecasts and is particularly useful when comparing the accuracy of models across different datasets or scales.

**Interpretation:** For example, a MAPE value of 8% indicates that the model is relatively accurate, as its predictions deviate, on average, by 8% from the actual values. Generally, MAPE values below 10% indicate that the model is highly accurate, values between 10-20% are good, and values above 20% may indicate the need for model performance improvement.

**2. Symmetric Mean Absolute Percentage Error (SMAPE)**

SMAPE addresses the issue of asymmetry in MAPE by using the average of the absolute values of the actual and predicted values in the denominator. It is often preferred when dealing with small or zero values in the dataset, as it avoids division by zero and provides a more balanced measure of accuracy.

**Interpretation:** SMAPE offers a balanced perspective on error, particularly useful when actual values are small or zero. By using the average of the actual and predicted values in the denominator, SMAPE mitigates the impact of asymmetry. A lower SMAPE value indicates a smaller percentage difference between the actual and predicted values, signaling better model accuracy. It provides a symmetric measure that avoids the issues of extreme percentage errors present in other metrics like MAPE.

**3. Mean Square Error (MSE)**

MSE measures the average of the squares of the errors, giving more weight to large errors. It provides a more detailed insight into the spread of errors but can be heavily influenced by outliers due to the squaring operation. MSE is useful for penalising larger errors more significantly.

**Interpretation:** MSE highlights the variance of errors, with larger errors receiving more weight due to the squaring function. This metric is particularly sensitive to outliers, making it ideal for cases where we want to penalize large deviations from the actual value. A lower MSE indicates a model with fewer and smaller errors on average, though it can be disproportionately affected by outliers. It is useful for understanding the spread of error but may not be as interpretable in real-world units.

**4. Root Mean Square Error (RMSE)**

RMSE is the square root of the MSE and is more interpretable, being in the same units as the original data. Like MSE, RMSE gives a measure of the average magnitude of the error, with higher values indicating larger average errors. It is widely used and provides a good balance between sensitivity to error size and interpretability.

**Interpretation:** RMSE provides an average error magnitude, retaining the same units as the original data, which makes it more interpretable than MSE. RMSE is sensitive to large errors and is commonly used to understand typical prediction errors. A lower RMSE value indicates a model with lower error and better predictive accuracy. It is useful for models where interpreting errors in terms of the original data units is helpful, giving a clear sense of the "size" of errors.

**5. Mean Square Log Error (MSLE)**

MSLE calculates the mean of the squared differences between the natural logarithm of the predicted values plus one and the natural logarithm of the actual values plus one. It penalizes underestimates more heavily than overestimates.

**Interpretation:** MSLE focuses on the ratio of predicted to actual values, dampening the impact of large values by applying a log transformation. This metric penalizes under-predictions more than over-predictions, making it suitable for models where underestimates are more detrimental than overestimates. A lower MSLE value implies that the predicted values align closely with the actual values on a logarithmic scale, reducing the penalty for high variance in large values and emphasizing performance on a multiplicative scale.

**6. Root Mean Square Log Error (RMSLE)**

RMSLE is simply the square root of MSLE. It provides a more interpretable measure in the same units as the target variable. Like MSLE, RMSLE penalizes underestimates more heavily than overestimates due to the squared differences.

**Interpretation:** RMSLE is the square root of MSLE, preserving interpretability in terms of the original units, though on a logarithmic scale. Like MSLE, RMSLE penalizes under-predictions more heavily, providing a measure that prioritizes errors in the lower range. A lower RMSLE value suggests closer alignment between predicted and actual values on a log scale, which is ideal when large positive discrepancies are tolerable but underestimations need to be minimized.
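Each of these metrics can be computed directly from the predicted and actual values. The sketch below uses made-up numbers purely to show the formulas in action (not QuickML output):

```python
import numpy as np

# Hypothetical actual and forecasted values
y_true = np.array([100.0, 120.0, 130.0, 90.0, 110.0])
y_pred = np.array([110.0, 115.0, 125.0, 95.0, 100.0])

mape  = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
smape = np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred))) * 100
mse   = np.mean((y_true - y_pred) ** 2)
rmse  = np.sqrt(mse)
msle  = np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)  # log(x + 1)
rmsle = np.sqrt(msle)

print(f"MAPE: {mape:.1f}%  SMAPE: {smape:.1f}%  RMSE: {rmse:.1f}  RMSLE: {rmsle:.3f}")
```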
--------------------------------------------------------------------------------
title: "Text Analytics"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.565Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/learning-center/text-analytics/"
service: "QuickML"
--------------------------------------------------------------------------------

# Introduction to NLP in Text Analytics

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that gives machines the ability to understand, interpret, and generate human language in a way that is meaningful and useful. With the enormous amounts of textual data generated every day, NLP blends computational linguistics, machine learning, and deep learning to allow machines to process natural language data and extract valuable insights from it, in turn generating value for businesses. Natural language processing enhances the user experience by proactively addressing data-driven concerns and challenges faced by business users, thereby adding significant value to the business.

#### Why is it popular?

NLP's popularity is driven by its ability to make sense of complex human language and apply it to real-world problems, making interactions with technology more natural and intuitive. Key reasons for its increasing popularity include:

- **Extracting Insights from Data**: Businesses collect vast amounts of textual data from various sources like social media, articles, product reviews, feedback, legal documentation, etc. They need to understand what customers or users are saying about their products or services. NLP provides efficient ways to process, analyze, and extract valuable insights from this large body of unstructured data.
- **User-Centric Applications**: With the advent of NLP, businesses rely on virtual assistants, chatbots, and automated customer service systems to address user queries in a faster and more efficient manner.
- **Translation Capabilities**: NLP enables machine translation, facilitating communication across languages and helping to remove language barriers in society.
- **Summarization Capabilities**: Lengthy documents can be converted into short summaries for quick reading, saving time and effort.

These are just a few applications, with more innovations emerging every day.

### Business applications

NLP enables various applications in business environments by utilizing NLP tasks to help companies automate processes and extract actionable insights from large datasets. Let's look at a few NLP tasks and their real-time business applications.
#### NLP Tasks

Tasks refer to specific objectives or problems that an NLP model is designed to perform or solve. Each task is focused on addressing a particular aspect of processing or analyzing natural language.

**Examples of NLP Tasks**\
Tasks that can be performed using QuickML's Text Analytics builder include:

- **Spam detection**: Spam classification is a binary classification task whose purpose is to classify received emails as spam or not. The spam detection model takes the email body, subject, sender details, and so on as input and generates the probability of the email being spam. Based on a threshold, it classifies the email as spam or not spam. **Text classification tasks**: Various other tasks similar to spam classification, such as intent detection, commitment classification, emotion detection, and tonality identification, also fall under text classification.
- **Language detection**: Language detection is a fundamental step in various NLP applications like language translation, grammar correction, and text-to-speech. For example, detecting the language used to search for web content so that results can be returned in the same language, or enabling chatbots and translation tools to respond in the user's language.
- **Sentiment Analysis**: Determines the sentiment tone (positive, negative, or neutral) of a piece of text. In general, the input to a sentiment classification model is a piece of text, and the output is the sentiment expressed in that text. In a real-world scenario, sentiment classification helps businesses find out how customers feel about their products from product reviews and understand the areas with a negative impact.

Tasks that can be performed using Zia features in QuickML are listed in {{%link href="/en/quickml/help/zia-features/" %}}Zia Features{{%/link%}}.

The following tasks can be performed using {{%link href="https://console.catalyst.zoho.com/baas/825916373/project/2003000000101001/Development#/zia/face-analytics" %}}Catalyst Zia services{{%/link%}}:

- **Named Entity Recognition (NER)**: Detects proper nouns, such as names of people, places, and organizations.
- **Keyword extraction**: A technique in NLP to identify and extract the most relevant keywords from text data, helping capture the main topics and, in turn, aiding in summarizing information, indexing, and categorizing documents.

#### NLP applications

Applications refer to the practical, real-world use cases of NLP models. An application is often built using one or more NLP tasks to solve broader, end-user problems or provide services.

**Examples of NLP Applications**:

- **Sentiment Analysis Tools**: Use sentiment extraction, tonality, and emotion identification tasks to gauge public opinion. They are crucial for businesses and brands that place a large value on understanding customer experiences. They can be implemented for the following purposes:
  - Gauging and monitoring sentiments expressed by customers on social media platforms and in opinion surveys
  - Automating the analysis of customer feedback and reviews, reducing the manual workload involved
  - Identifying and addressing critical situations in real time through the automated monitoring of negative sentiments
- **Email Spam Filters**: Spam filters leverage tasks like spam classification to identify and filter unsolicited or unwanted emails, automatically moving them to the spam folder.
This improves the user experience by reducing inbox clutter, saving time, and minimizing the risk of exposure to phishing attempts, fraudulent schemes, and other harmful scams.
- **Customer Feedback analysis**: By leveraging multiple NLP tasks such as sentiment analysis and keyword extraction, businesses can gain insights into what customers are saying about their products in reviews or on social media. This enables them to address customer concerns or complaints effectively, ultimately enhancing customer satisfaction and improving the overall experience.
- **Enhance Customer support efficiency**: Tasks like emotion detection enable businesses to analyze customer interactions with their customer support staff via chats, calls, complaints, feedback or reviews, social media posts, and more. This helps identify the emotions their customers are going through, such as happiness, frustration, or anger, when using their products or services. This helps businesses better understand customer sentiment, allowing them to respond empathetically, tailor their services, and improve customer relationships. Detecting what their customers feel gives businesses a window to implement proactive measures to address concerns, resolve issues, and thereby enhance brand loyalty.
- **Customer support at scale**: Businesses leverage intent and activity classification tasks to interpret user queries accurately and provide responses accordingly. By automating the resolution of common or basic inquiries, organizations can significantly reduce response times and free up resources for more complex issues. This automation enables customer support to scale efficiently, extending the reach of its support services and ensuring consistent service delivery. Additionally, virtual assistants are employed to model and deliver responses, further enhancing the speed and quality of support while maintaining a personalized user experience.

An application is essentially a practical implementation where multiple tasks come together to create value for users. While tasks focus on specific language operations, applications focus on solving broader user-centric problems by combining several tasks. These are just a few examples of how NLP tasks are being applied in real-world businesses. With advancements in technology, a plethora of innovative applications continues to emerge, revolutionizing user experiences, streamlining processes, and fostering greater customer loyalty by addressing customer needs more effectively.

### Steps to build a pipeline

QuickML provides Classic mode and Smart mode pipeline builders to create text analytics models.

The **Classic mode** features a drag-and-drop pipeline builder interface where a list of data and machine learning operations is available for building NLP models. These nodes can be dragged and dropped into the builder to construct the machine learning pipeline, which, upon execution, generates an NLP model.

The **Smart mode** provides a prebuilt template for building NLP models, designed to simplify the model development process from data preprocessing and feature extraction to model selection. With this prebuilt template, stages are predefined, and users are presented with various parameters to configure the operations at each stage. This template removes the ambiguity of which operation to use and when, and streamlines the model-building process.
#### Build a Model Using Smart Builder

Building text-based machine learning models follows a similar process to other ML models, involving steps such as data pre-processing, algorithm selection, and model tuning. However, NLP models differ in the sequence in which specific text-based operations are applied. This sequence is crucial to building a model, as it ensures that the text data is transformed and processed in the most effective way before applying machine learning algorithms. In the Smart Mode of model building, the pipeline builder is streamlined into three main steps:

- **Pre-processing**: In this step, raw text data is cleaned and prepared for further analysis. This involves operations such as tokenization, case conversion, stemming and lemmatization, stopword and noise removal, and normalization.
- **Feature Extraction**: This step involves converting the text data into numerical features that can be understood by machine learning algorithms. Techniques like TF-IDF, word embeddings, and bag-of-words are commonly used.
- **Algorithm Selection**: In the final step, a suitable algorithm is chosen from the available supervised or unsupervised learning algorithms based on the use case. The model is then trained using the pre-processed data from the feature extraction step.

Each step in the Smart Mode pipeline builder comes with configurable stages/operations that you can tailor to your specific NLP problem. Let's deep dive into each step and explore the stages/operations available in them:

#### Stage 1: Preprocess

Text data is usually raw, so it's important to process the data before feeding it into any algorithm to improve performance. In our smart builder, the textual data pre-processing stage has seven operations: the first six work on the textual data, and the final operation, Label Encoding, is applied to the target column of labeled datasets. The output of this stage is passed on to the Feature Extraction stage, where the entire text data is converted to numerical format. Let's look at the operations in the preprocessing stage.

#### a. Case conversion

Converts raw text data with different cases to a common desired case (lower case in general). This helps in reducing sparsity (without common casing, "NLP", "Nlp", and "nlp" would be treated as three different words).

#### b. Tokenization

Tokenization is breaking textual data down into smaller units. For instance, paragraphs can be segmented into sentences, sentences into words, and words into individual characters, allowing you to uncover patterns or insights that are not visible in larger segments. In QuickML, tokenization of sentences can be performed at both the word and character levels. After tokenization is applied at the word level, subsequent operations (Stemming & Lemmatization, Stop Word Removal, Noise Removal, Normalization) can be applied to the data as needed. However, at the character level, only the Noise Removal operation can be applied to the tokenized data.

**Tokenization stages:** Text : Sentences ➔ words ➔ characters

#### c. Stemming and Lemmatization

Both techniques reduce words to their root forms, helping normalize word variations for more consistent analysis.
Stemming trims trailing characters to obtain a word's root form, which may not always be a proper word. Lemmatization, however, reduces words to their dictionary form by considering context, resulting in a proper word with accurate meaning. The key difference is that stemming is faster but less precise, while lemmatization is more accurate but slower, as it relies on vocabulary and grammar.

**Sample:**

<table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Words</th> <th class="w30p">Stem word</th> <th class="w15p">Lemmatized word</th> </tr> </thead> <tbody> <tr> <td>Studying, Studies,<br>Studied</td> <td>Studi</td> <td>Study</td> </tr> <tr> <td>Running, Runner, Runs</td> <td>Run or Runn</td> <td>Run</td> </tr> <tr> <td>Changing</td> <td>Chang</td> <td>Change</td> </tr> </tbody> </table>

#### d. Stop word removal

Languages have many filler words that provide no significant value during training. By removing them, we can make the model focus more on the important words. Articles, prepositions, and the like come under the stop-word category.

#### e. Noise removal

Noise removal removes unwanted words or parts of the text, depending on the task at hand. For example, sentiment classification does not require email addresses. It can be done using regex or a list of noise words. Additional spaces, special characters, and digits can also be considered noise.

#### f. Normalization

Normalization converts different formats of text into a standard format. For example, the words USA, usa, United states of America, The united states of America, and the usa can all be normalized to usa. This can be achieved using dictionary mapping. Noise removal can also be considered a part of text normalization.

<table class="content-table quickml-content-table"> <thead> <tr> <th class="w10p">Raw</th> <th class="w10p">Normalized</th> </tr> </thead> <tbody> <tr> <td> 2moro<br> 2mrrw<br> 2morrow<br> 2mrw<br> tomrw </td> <td>tomorrow</td> </tr> <tr> <td>b4</td> <td>before</td> </tr> <tr> <td>otw</td> <td>on the way</td> </tr> <tr> <td>:)<br>:-)<br>;)</td> <td>smile</td> </tr> </tbody> </table>

#### g. Label Encoding

Label encoding is a method used to transform categorical variables into numerical values by assigning a unique integer to each category. It helps machine learning algorithms process categorical data effectively. In the context of text data, the target label represents a categorical variable with various classes that can be converted into numeric form to train the model effectively.

Now, our text data is cleaned and ready to be transformed into word vectors using the vectorization techniques available in the feature extraction stage below.
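As a rough illustration of these operations outside QuickML, the sketch below chains a few of them in plain Python, using NLTK and scikit-learn as assumed stand-ins for the built-in nodes; the sample text and labels are hypothetical.

```python
# A minimal preprocessing sketch; NLTK/scikit-learn are stand-ins, not QuickML internals.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder

nltk.download("stopwords", quiet=True)  # one-time corpus downloads
nltk.download("wordnet", quiet=True)
from nltk.corpus import stopwords

text = "NLP is Solving Real Problems every day!!!"

tokens = text.lower().split()  # a. case conversion + b. word-level tokenization (simplified)
tokens = ["".join(ch for ch in t if ch.isalpha()) for t in tokens]  # e. noise removal: strip punctuation/digits
tokens = [t for t in tokens if t and t not in stopwords.words("english")]  # d. stop word removal

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # c. stemming: fast, may yield non-words ("solv", "everi")
print([lemmatizer.lemmatize(t) for t in tokens])  # c. lemmatization: proper dictionary forms

labels = ["spam", "not-spam", "spam"]        # g. label encoding on a hypothetical target column
print(LabelEncoder().fit_transform(labels))  # -> [1 0 1]
```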
## Stage 2: Feature Extraction

Textual data cannot be fed directly into an algorithm; it must first be converted into a numerical format. There are several methods for this conversion, known as vectorization techniques, which can also be used to extract additional features from the vectorized data. In this context, each word, sentence, or character presented to a model is treated as a feature. In QuickML, text-to-number conversion, the process of transforming textual data into a numerical format that captures the semantic meaning of words or phrases, is performed using the following techniques:

#### a. Bag Of Words (BOW)

Bag of Words creates a vocabulary of the unique words present in the whole text corpus. Each sentence is then represented as a vector based on the presence of these words, with the values representing the frequency of occurrence of each word within that sentence.

**Example:** Below is an example showcasing the vectorized representation of the Bag of Words model for the following sentences:

**S1.** Machine learning solves real world problems\
**S2.** NLP popularity is rising\
**S3.** Language translation is NLP task

**Bag of Words Table representation:** Rows represent the sentences provided, columns represent each word in these sentences, and values represent the frequency of occurrence of the specific word in each sentence.

**Vector Representation:** Each sentence is represented as a vector containing the word counts from the BoW table. The vector length is equal to the number of unique words in the vocabulary. Below is the vectorized representation of the three sentences above.

Sentence 1 → [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]\
Sentence 2 → [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]\
Sentence 3 → [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

The BoW model represents text data as vectors and each word in the text corpus as a feature. It simplifies the data for modelling but fails to capture the order of words and their contextual meaning.

#### b. TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF quantifies the importance of each word in the corpus and allows you to extract important features from the text data that are helpful in training the model for better performance. The higher the TF-IDF score, the more important the word is, and vice versa. It can be calculated in three steps.

**Step 1: Term Frequency (TF)** measures how frequently a term appears in a document relative to the total number of unique words in that document.

Term Frequency of a word (**TF**) = Number of times a word appears in a document / Total number of unique words in the document

**Step 2: Inverse Document Frequency (IDF)** measures how important a term is across the documents.

IDF of a word = log(N/n), where **N** is the total number of documents and **n** is the number of documents the word has appeared in.

**Step 3: Calculate the TF-IDF score**

**TF-IDF = TF * IDF**

The IDF of less commonly occurring words will be high, and that of common words will be low. Thus, words like 'the', 'is', and 'a' will have low importance, while higher importance is given to actual document-specific words.

**Example:** Let's use the same example from Bag of Words:

**S1.** Machine learning solves real world problems\
**S2.** NLP popularity is rising\
**S3.** Language translation is NLP task

**Step 1: Calculate Term Frequency (TF)**

**Step 2: Calculate the IDF score**

**Step 3: Calculate TF-IDF** by multiplying TF by IDF for each term in each sentence.

Key insights from the example are:

<u>**High TF-IDF**</u> values indicate that a term is important in a specific sentence and not common across all sentences.

<u>**Low TF-IDF**</u> values suggest that a term is either common across sentences or not unique to any single sentence.
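For intuition, here is a minimal scikit-learn sketch that vectorizes the three sample sentences above with Bag of Words and TF-IDF. Note that scikit-learn orders its vocabulary alphabetically and applies a smoothed IDF, so the exact numbers differ slightly from the hand-computed log(N/n) version; this is an illustration, not QuickML's implementation.

```python
# A minimal Bag of Words / TF-IDF sketch with scikit-learn; illustrative only.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = [
    "Machine learning solves real world problems",
    "NLP popularity is rising",
    "Language translation is NLP task",
]

bow = CountVectorizer()
X_bow = bow.fit_transform(sentences)
print(bow.get_feature_names_out())  # the learned vocabulary (columns, sorted alphabetically)
print(X_bow.toarray())              # one frequency vector per sentence

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(sentences)
print(X_tfidf.toarray().round(2))   # higher scores = more sentence-specific terms
```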
TF-IDF highlights the words that are important for understanding the context of the corpus. It showcases how some words or terms are more important in a specific text than others.

#### c. Word2Vec

Word2Vec is a pre-trained word embedding model that generates numerical vectors (embeddings) for words based on their semantic and contextual relationships. It is trained on a large text corpus to ensure that words with similar meanings or contexts have embedding vectors that are close in the vector space. These relationships can be quantified using cosine similarity, where a higher score indicates greater similarity.

**Example:** Words like “king” and “queen” will have vectors that are close to each other, reflecting their shared context of royalty. Similarly, “man” and “woman” will have vectors that capture their relationship and shared meaning, demonstrating Word2Vec’s ability to model semantic relationships effectively.

#### d. GloVe

GloVe (Global Vectors for Word Representation) is a pre-trained word embedding model similar to Word2Vec that generates embedding vectors for given words. It is a count-based model that produces embeddings by factorizing a word co-occurrence matrix derived from the entire corpus. GloVe captures global statistical information by analyzing how frequently words co-occur across the corpus. Unlike Word2Vec, which uses a predictive approach based on local context, GloVe relies on global co-occurrence statistics for training.

**Example:** In GloVe, words like “king” and “queen” will have similar vectors, reflecting their semantic relationship (e.g., royalty). Additionally, GloVe excels at capturing analogical relationships.

**For instance:** king - man + woman ≈ queen

This analogy represents the transformation: subtracting “man” from “king” and adding “woman” yields a vector close to “queen”. In other words, the vector difference between “king” and “man” is similar to the difference between “queen” and “woman”, illustrating GloVe’s ability to model both meaning and relationships effectively.

## Stage 3: Algorithms & Modelling

Text data is fed to algorithms in a vectorized form to generate an NLP model. NLP models can be broadly classified into supervised and unsupervised learning models. In QuickML, we have algorithms that use labelled data to build supervised learning models. The algorithms include:

- {{%link href="/en/quickml/help/ml-algorithms/text/#naive-bayes" %}}Naive Bayes{{%/link%}}
- {{%link href="/en/quickml/help/ml-algorithms/text/#svm" %}}SVM{{%/link%}}

### Model Evaluation metrics

NLP models also use common evaluation metrics like accuracy, precision, and F1 score. Two other important metrics are described below.

#### i. Log-Loss score

Log-loss, also known as logistic loss or cross-entropy loss, is a commonly used evaluation metric in natural language processing (NLP) models, especially for binary and multi-class classification tasks. Log-loss quantifies the difference between the predicted probabilities and the actual class labels. The prediction probability of a data record is the probability the model predicts for each class the record could be classified under. The actual class label is the true class that the data record belongs to. The log-loss score indicates how close the prediction probability is to the actual class label. The more the predicted probability values diverge from the actual values, the higher the log-loss score. For binary classification, it is calculated as:

Log Loss = −(1/N) × Σᵢ [ yᵢ × ln(pᵢ) + (1 − yᵢ) × ln(1 − pᵢ) ]

**where** i is the given observation/record, y is the actual/true value, p is the prediction probability, ln refers to the natural logarithm (logarithmic value using base e) of a number, and N is the number of observations.

Log-loss measures the performance of a classification model whose output is a probability value between 0 and 1. The lower the log-loss score, the better the model fits, and the higher its performance. A model with a log-loss score of 0 has perfect predictive skill.

<b>For example:</b> In spam classification, the actual class of an email is “Spam.” The predicted probability for the email being “Spam” is 0.78, and for being “Not-Spam,” it is 0.22. The negative natural logarithm of the predicted probability for the correct class (Spam) is 0.2485, which represents the log-loss score for that particular record. The log-loss score of the model is calculated as the average of the negative natural logarithm of the predicted probabilities for the correct classes across all records.
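As a quick check of the numbers in this example, the snippet below recomputes the per-record score with NumPy and with scikit-learn's `log_loss`; it is an illustration only, not QuickML's evaluation code.

```python
# Verifying the spam example above; illustrative only.
import numpy as np
from sklearn.metrics import log_loss

y_true = [1]             # the record's actual class: 1 = Spam, 0 = Not-Spam
y_prob = [[0.22, 0.78]]  # predicted probabilities for [Not-Spam, Spam]

print(round(-np.log(0.78), 4))                            # 0.2485, the per-record log loss
print(round(log_loss(y_true, y_prob, labels=[0, 1]), 4))  # same value, averaged over all records (here, just one)
```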
#### ii. AUC-ROC Curve

The Area under the Receiver Operating Characteristic (AUC-ROC) curve is a machine learning metric that measures how well a classification model is performing. AUC represents the area under the ROC curve, where the ROC is a graph that plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds. True Positive (TP) and False Positive (FP) are terms used in the {{%link href="/en/quickml/help/models-details/#confusion-matrix" %}}confusion matrix{{%/link%}} to evaluate the performance of classification models.

1. <u>**True Positive Rate (TPR)**</u>, also called sensitivity or recall, measures the proportion of actual positive records that are correctly identified by the model. <center> <font color="red">Recall score</font> = TP/(TP+FN) where FN represents False Negatives. </center>
2. <u>**False Positive Rate (FPR)**</u> measures the ratio of actual negative records that are incorrectly classified as positive. <center> FPR = FP/(FP + TN) where TN represents True Negatives. </center> <center> FPR is also calculated as: <font color="red">FPR = 1 − Specificity</font> </center> Specificity measures the proportion of actual negative instances that are correctly identified by the model.

A visual interpretation of the AUC-ROC curve is below. The ROC is the plot between TPR and FPR across all possible thresholds, whereas the AUC is the entire area under the ROC curve. The AUC (Area Under the Curve) score ranges from 0 to 1, with higher scores indicating better model performance.

1. <u>**An AUC of 1**</u> signifies a perfect model, where the ROC (Receiver Operating Characteristic) curve forms a complete right-angle path, achieving 100% sensitivity (True Positive Rate) with a 0% False Positive Rate.
2. <u>**An AUC of 0.75**</u> indicates that the model has a good ability to distinguish between positive and negative classes 75% of the time, showing better performance but still leaving room for improvement.
3. <u>**An AUC of 0.5**</u> suggests the model performs no better than random guessing, which is the least desirable outcome.

A higher AUC score always reflects better model discrimination between classes.
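Here is a minimal scikit-learn sketch of how the (FPR, TPR) pairs and the AUC score are computed from hypothetical predictions; it illustrates the metric itself, not QuickML's output.

```python
# ROC curve points and AUC score with scikit-learn; the labels/probabilities are made up.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                   # actual classes
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]  # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_prob)    # one (FPR, TPR) point per threshold
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")

print("AUC:", roc_auc_score(y_true, y_prob))        # 0.875 here: good, but not perfect, separation
```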
### Visualizations

#### Wordcloud

A Word Cloud is a visual representation of the most frequently occurring words in a text corpus. The size of each word in the word cloud reflects its frequency of occurrence. Word clouds are commonly used to quickly identify and highlight the most prominent keywords or phrases in large text datasets, making it easier to communicate the key terms or concepts being discussed.

**Example:** Let's take an excerpt from "The Chronicles of Narnia: The Lion, the Witch, and the Wardrobe" and analyze the word cloud generated for that text to identify the frequent and significant words used. With a word cloud, we can quickly extract important keywords, gain insights, and understand the overall context of the text in a visually engaging and time-efficient manner.

--------------------------------------------------------------------------------
title: "Anomaly Detection"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.594Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/learning-center/anomaly-detection/"
service: "QuickML"
--------------------------------------------------------------------------------

### Introduction

Anomaly detection is the process of identifying data points, events, or patterns that deviate significantly from the expected behavior in a dataset. These unusual observations, also called **outliers**, can indicate critical situations such as fraud, equipment malfunctions, cybersecurity threats, or unexpected shifts in consumer behavior.

In QuickML, we broadly use **unsupervised learning algorithms** for anomaly detection, where algorithms learn patterns of normality without relying on predefined labels. Instead of being explicitly told what constitutes an anomaly, the algorithm builds an understanding of normal behavior and highlights data points that do not fall within the usual expected range. This makes anomaly detection especially valuable when anomalies are rare, labels are unavailable, or normal patterns are highly dynamic.

Imagine a dataset of credit card transactions. Most transactions follow a typical pattern in terms of amount, frequency, and location. However, a sudden high-value purchase in a foreign country might be flagged as an anomaly. Detecting this anomaly in real time can prevent fraud, protect customers, and save costs for financial institutions.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} The Anomaly Detection feature is available in {{%bold class="bold-primary"%}}early access{{%/bold%}} across all the data centers. To use it, request access via support@zohocatalyst.com {{%/note%}}

{{%note%}} To know about the ML algorithms used in anomaly detection, kindly refer to the {{%link href="https://docs.catalyst.zoho.com/en/quickml/help/ml-algorithms/anomaly-detection/" %}}Anomaly Detection Machine Learning Algorithms help doc{{%/link%}}. {{%/note%}}

### Business Impact of Anomaly Detection

Anomaly detection is a business-critical process that enables organizations to protect assets, improve operations, and enhance decision-making. Below are some key reasons why anomaly detection is essential across industries:

- **Fraud detection**: In finance and e-commerce, anomalies often correspond to fraudulent activities. For example, an unusually high-value purchase made from a foreign location, when compared to a customer’s usual spending behavior, could be an early sign of credit card fraud.
Identifying these patterns in real time prevents financial loss and safeguards customer trust.
- **Predictive maintenance**: In manufacturing and industrial settings, machines continuously generate sensor data. Detecting anomalies in vibration, temperature, or pressure readings can indicate early signs of mechanical wear or failure. Predicting and addressing these anomalies before they lead to breakdowns significantly reduces downtime and maintenance costs.
- **Cybersecurity and network protection**: Anomaly detection helps identify irregularities in system logs, network traffic, or user activity that may indicate a cyberattack, data breach, or unauthorized access. For instance, a sudden surge in login attempts or data transfer volume can reveal security threats that require immediate attention.
- **Healthcare and life sciences**: In healthcare, anomaly detection can monitor patient vitals, lab results, or medical imaging data to flag potential health risks. For example, a sudden drop in oxygen saturation or an irregular heartbeat pattern can alert doctors to intervene before the condition becomes critical.
- **Operational insights and decision support**: Organizations can use anomaly detection to uncover hidden insights that improve efficiency and support decision-making. For example, unusual spikes in website traffic may reveal marketing campaign success or unexpected customer interests.

### What Causes Data Anomalies?

Data anomalies can emerge from a variety of underlying causes, often depending on the nature of the data and the environment in which it is collected. The main contributing factors include:

- **Human error**: Manual data entry mistakes, incorrect labeling, or improper configuration of systems can easily introduce irregularities. For example, a misplaced decimal point or an incorrect date format can significantly distort analysis results. Even simple oversights, such as forgetting to update a record or duplicating entries, can lead to anomalous data points.
- **System failures**: Hardware malfunctions, software glitches, or communication breakdowns between systems can corrupt or distort data. For example, a temporary network outage during data transmission might result in incomplete or duplicated entries, while a software bug could generate unexpected values or inconsistencies.
- **Fraudulent or malicious activity**: In areas like finance, cybersecurity, or e-commerce, anomalies often signal potential fraud or unauthorized access. Abnormally high-value transactions, unusual login patterns, or sudden shifts in user behavior may indicate deliberate attempts to exploit or manipulate the system.
- **Environmental or external changes**: Unexpected shifts in external conditions, such as economic fluctuations, market volatility, or seasonal changes, can alter normal data patterns. These events introduce new variables that cause temporary or permanent deviations from established trends.

### Types of Anomaly Detection

Anomalies can be categorized into several types depending on the nature and context of the data. The following sections explain each type in detail with corresponding visualization and conceptual examples.

1. **Point anomalies**

A point anomaly occurs when a single observation deviates significantly from the rest of the data. These are the most common anomalies and are typically easy to detect.
For example, in a dataset of credit card transactions, a sudden high-value transaction may be flagged as a point anomaly if it does not align with the user’s previous spending behavior.

Interpretation: The plot above illustrates how a few data points lie far away from the main data cluster. These isolated data points are clear examples of point anomalies because they differ substantially from the majority of normal data points.

2. **Contextual anomalies**

A contextual anomaly is an observation that is considered anomalous only within a specific context. This type of anomaly is especially common in datasets where the same value may be normal under certain conditions but abnormal under others. For example, a temperature reading of 10°C may be normal in winter but would be an anomaly in summer. Similarly, an online store may see a high number of purchases during holiday sales, but the same number on an ordinary day could signal suspicious activity.

Interpretation: The plot illustrates two distinct contexts, each represented by a separate cluster of data points. These clusters reflect different patterns or environments in which the data behaves normally. The highlighted point lies near one of the clusters, appearing normal when compared to the overall dataset or when evaluated in the context of the other cluster. However, within its own cluster (when analyzed relative to the specific characteristics and distribution of points in its local context), the point stands out as unusual. This situation highlights a contextual anomaly, where an observation is considered anomalous only within a particular context or under specific conditions, even though it may seem normal in a broader or different context.

3. **Collective anomalies**

A collective anomaly arises when a group of related data points together behave unusually, even if individual points appear normal. For example, in stock market data, a consistent drop in the stock price of a company over several consecutive days could represent a collective anomaly, especially if this pattern deviates from normal fluctuations.

Interpretation: The plot above depicts a small group of points that form a distinct cluster separate from the majority of data. While each individual point within this group may appear normal when evaluated on its own, their collective behavior reveals a pattern that deviates significantly from the overall distribution. This indicates the presence of a collective anomaly, where the anomaly arises not from individual data points, but from their relationship to one another and to the rest of the dataset.

### Categories of Anomaly Detection

In QuickML, there are two broad categories of anomaly detection: **Time series anomaly detection** and **Non-time series anomaly detection**. Each of these approaches focuses on identifying outliers in different types of datasets, depending on whether time plays a role in the data’s structure.
Here’s a comparison table showing the difference between Time-Series and Non-Time-Series Anomaly Detection:

<table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:25%;">Feature</th> <th style="text-align:left; width:37.5%;">Time-Series Anomaly Detection</th> <th style="text-align:left; width:37.5%;">Non-Time-Series Anomaly Detection</th> </tr> </thead> <tbody> <tr> <td>Data Structure</td> <td>Data points are indexed by time and follow a sequential order.</td> <td>Data points are not indexed by time and are treated as independent observations.</td> </tr> <tr> <td>Temporal Dependency</td> <td>Requires algorithms that account for temporal dependencies, trends, and seasonality.</td> <td>Assumes data points are independent with no temporal relationship.</td> </tr> <tr> <td>Detection Method</td> <td>Anomalies are detected by comparing actual values against historical patterns or forecasts.</td> <td>Anomalies are detected by evaluating similarity or distance between data points in feature space.</td> </tr> <tr> <td>Influencing Factors</td> <td>Influenced by seasonality, trends, and long-term dependencies in time.</td> <td>Influenced by feature relationships and data distribution across multiple variables.</td> </tr> <tr> <td>Common Algorithms</td> <td>ARIMA, LSTM Autoencoders, Prophet, Isolation Forest (time-based).</td> <td>Isolation Forest, One-Class SVM, DBSCAN, LOF.</td> </tr> <tr> <td>Example Use Cases</td> <td>Monitoring stock prices, machine sensor readings, network performance over time.</td> <td>Detecting fraudulent transactions, defective products, or abnormal customer behavior.</td> </tr> <tr> <td>Key Focus</td> <td>Focuses on pattern deviations over time.</td> <td>Focuses on feature-based outliers without considering time.</td> </tr> </tbody> </table>

Now that we understand the key differences between time-series and non-time-series anomaly detection, let’s explore each of these approaches in depth to gain a clearer and more comprehensive understanding.

### Time Series Anomaly Detection

Time series anomaly detection deals with data that is collected and organized in a sequential manner, where each data point is associated with a specific timestamp. This type of data reflects how **values change over time**, such as stock prices, server performance metrics, sensor readings, or website traffic. Because time is an inherent part of the dataset, detecting anomalies involves analyzing temporal patterns like trends, seasonality, and sudden deviations.

In time series anomaly detection, the goal is to identify moments when the behavior of a system changes unexpectedly compared to its historical pattern. These anomalies can appear as sudden spikes, sharp drops, or irregular fluctuations that break from the usual rhythm of the data. Time series anomaly detection methods often use forecasting models that learn from past trends to predict expected future values. If the actual observation deviates significantly from the predicted value, the system flags it as an anomaly.

#### Building Intuition Behind Time-Series Anomaly Detection

In time-series datasets, observations are ordered in time, meaning each data point may depend on previous values. Detecting anomalies involves understanding temporal patterns, trends, and seasonality, in addition to feature relationships. The goal is to identify points or periods that deviate from expected temporal behavior.
- **Temporal pattern perspective**: Each observation is part of a sequence, and normal points follow predictable patterns over time, including trends, seasonal cycles, and recurring fluctuations. Anomalies break these patterns and stand out as unusual points or sequences that deviate from expected behavior.
- **Trend and seasonality consideration**: Sudden spikes, drops, or shifts from established trends often indicate anomalies. It is essential to account for seasonal changes, because deviations may only be anomalous if they violate expected seasonal patterns. For example, a high sales value might be normal during a holiday season but anomalous during a typical off-season period. Properly modeling trends and seasonality allows the system to distinguish between expected variations and true anomalies.
- **Contextual dependencies**: Time-based context is critical because the same value may be normal at one point in time but anomalous at another, depending on the surrounding sequence. Additionally, relationships across multiple time-dependent variables can reveal anomalies that might not be apparent when analyzing a single series in isolation. Considering these multivariate temporal dependencies improves detection accuracy by capturing patterns that emerge only when multiple factors are observed together.
- **Density and distribution over time**: Anomalies often occur in regions of low probability within the expected temporal distribution. Points that lie far from the predicted range or historical baseline can be flagged as unusual, and probabilistic approaches such as Gaussian processes or historical distribution modeling can quantify these deviations. By understanding the expected density of observations over time, the model can assign anomaly scores that reflect how unlikely each point is relative to normal temporal behavior.

#### Key Success Criteria for Time-Series Anomaly Detection

Time-series anomaly detection focuses on sequences of data points over time. Success relies on capturing temporal patterns, trends, and seasonality while providing interpretable and actionable insights. Key criteria for success include:

- **Temporal data understanding**: Accurately model trends, seasonality, and short- and long-term dependencies.
- **Quality data preparation**: Handle missing timestamps, irregular intervals, and inconsistent sampling rates.
- **Effective feature engineering**: Include lag features, rolling statistics (mean, variance), differences, or Fourier transforms to capture temporal behavior.
- **Representative historical patterns**: Ensure training data reflects normal temporal variations, including rare seasonal patterns, to reduce false positives.
- **Robust model selection**: Choose algorithms capable of capturing temporal dependencies, trends, and seasonality. For example, ARIMA, AutoRegressor, Auto ARIMA, or SARIMA.

#### Business Applications of Time-Series Anomaly Detection

Time-series anomaly detection has wide-ranging applications across industries, helping organizations identify unusual patterns, prevent failures, and make data-driven decisions by monitoring how values evolve over time.

1) **Finance and banking**

Time-series anomaly detection in finance and banking is used to monitor sequences of transactions for unusual patterns that may indicate fraud or other irregular activities.
Sudden spikes in transfers, abnormal trading behaviors, or atypical account activities can be flagged in real time, allowing banks and financial institutions to investigate suspicious behavior promptly. By analyzing the temporal patterns of transactions, models can distinguish between legitimate seasonal or cyclical activity and genuine anomalies that require attention. 2) **IoT and industrial operations** In IoT and industrial operations, time-series anomaly detection plays a critical role in monitoring equipment and production processes. Sensor readings collected over time can reveal early signs of machine failure, deviations in performance, or inefficiencies in production lines. By identifying these anomalies early, organizations can schedule predictive maintenance before breakdowns occur, reduce downtime, and maintain optimal operational efficiency. Detecting irregular temporal patterns also helps prevent costly disruptions and ensures the longevity of equipment. 3) **Retail and E-commerce**   Time-series anomaly detection is also valuable in retail and e-commerce for monitoring sales, returns, and website activity over time. Sudden drops or spikes in these metrics may point to operational issues, shifts in customer behavior, or fraudulent activity. By analyzing temporal patterns, businesses can distinguish between normal seasonal fluctuations and true anomalies, enabling faster response to unexpected events, improving customer experience, and optimizing operational decisions. #### Univariate vs Multivariate Forecasting in Time Series    In time-series anomaly detection, the data being analyzed can either be **univariate** or **multivariate**, depending on the number of variables tracked over time. In QuickML, data is categorized this way to ensure that models are applied appropriately: univariate forecasting is simpler and computationally efficient for single-variable trends, while multivariate forecasting is necessary when anomalies arise from interactions between multiple features, which could be missed in a univariate approach. This distinction helps you to choose the right modeling strategy and feature representation for accurate anomaly detection, whether the focus is on a single metric or on complex patterns across several interdependent variables. It is important to understand the difference between univariate and multivariate data, as this choice directly impacts how anomalies are detected and interpreted in time-series analysis. 
<table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:25%;">Feature</th> <th style="text-align:left; width:37.5%;">Univariate Data</th> <th style="text-align:left; width:37.5%;">Multivariate Data</th> </tr> </thead> <tbody> <tr> <td>Definition</td> <td>Tracks a single variable over time.</td> <td>Tracks multiple interrelated variables over time and forecasts a single target variable while accounting for its dependencies on other related variables (multivariate forecasting), similar to how traditional predictive models operate.</td> </tr> <tr> <td>Focus</td> <td>Monitors trends, seasonality, and fluctuations of one metric.</td> <td>Monitors patterns and interactions among multiple metrics simultaneously.</td> </tr> <tr> <td>Anomaly Detection</td> <td>Flags deviations in the single variable from expected behavior.</td> <td>Flags anomalies based on irregularities in relationships between variables, capturing patterns that may not appear in isolation.</td> </tr> <tr> <td>Complexity</td> <td>Simpler and computationally efficient.</td> <td>More complex, requires modeling dependencies and interactions between variables.</td> </tr> <tr> <td>Use Cases</td> <td>Website traffic, daily sales, electricity consumption, revenue monitoring.</td> <td>Retail analytics (sales vs marketing spend), stock market analysis (price, volume, volatility), healthcare monitoring (heart rate, blood pressure, oxygen levels).</td> </tr> <tr> <td>Strengths</td> <td>Easy to implement and interpret; good for single-metric monitoring.</td> <td>Detects subtle anomalies that arise from variable interactions; captures complex patterns.</td> </tr> </tbody> </table>

#### How Time-Series Anomaly Detection Works

Let’s go over time-series anomaly detection using the ARIMA (AutoRegressive Integrated Moving Average) model, one of the most widely used forecasting-based methods.

1) **Understand the time-series data**

The process begins with collecting sequential data points, where each value is linked to a specific timestamp, for example, daily website visits, hourly temperature readings, or minute-by-minute stock prices. Because time is an essential factor, patterns such as trends (overall growth or decline) and seasonality (repeated cycles) are important to identify.

2) **Train the forecasting model**

ARIMA learns from the historical time-series data to understand its underlying patterns. It combines three components:

- **AR (AutoRegressive)**: Uses past values to predict future ones.
- **I (Integrated)**: Makes the data stationary by removing trends.
- **MA (Moving Average)**: Uses past forecast errors to refine predictions.

The model is trained to capture how the data usually behaves over time.

3) **Generate forecasted values**

After training, ARIMA predicts what the next data points should look like based on historical patterns. For each timestamp, the model produces an expected (forecasted) value and a confidence interval, which represents the normal range of variation.

4) **Compare actual vs. predicted values**

The actual observed value from the dataset is then compared to the model’s predicted value. If the observed value lies far outside the expected range (for example, much higher or lower than the forecasted confidence band), it is considered an anomalous point.

5) **Detect and report anomalies**

Once the entire time series is analyzed, the system flags timestamps where significant deviations occur. For example, a sudden spike in network traffic could indicate a cyberattack. The output typically highlights these timestamps and shows the number of detected anomalies for further investigation.

Example: Imagine a company monitoring its website’s hourly traffic. ARIMA predicts that the next hour should have around 1,200 ± 100 visits, but the actual count suddenly drops to 400. Since this value lies far outside the expected range, the system flags it as an anomaly, prompting the company to investigate potential causes like a server crash or connectivity issue.
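The sketch below mirrors these five steps with statsmodels' ARIMA on synthetic hourly-traffic data that reproduces the example above; it illustrates the general forecast-then-compare technique, not QuickML's pipeline.

```python
# Forecast-based anomaly detection with statsmodels ARIMA; synthetic data, illustrative only.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
index = pd.date_range("2025-01-01", periods=200, freq="h")
visits = pd.Series(1200 + rng.normal(0, 50, size=200), index=index)  # ~1,200 visits/hour
visits.iloc[150] = 400  # inject a sudden drop (e.g. a server crash)

model = ARIMA(visits, order=(1, 1, 1)).fit()  # steps 1-2: learn the series' normal behavior
pred = model.get_prediction(start=1)          # step 3: in-sample one-step forecasts
band = pred.conf_int(alpha=0.01)              # 99% confidence band = the "normal range"

actual = visits.iloc[1:]                      # step 4: compare actual vs. predicted range
outside = (actual < band.iloc[:, 0]) | (actual > band.iloc[:, 1])

print("anomalies detected:", int(outside.sum()))  # step 5: report flagged timestamps
print(actual[outside])                            # includes the 400-visit hour
```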
#### Steps to Build a Time Series Anomaly Detection Pipeline

In QuickML, time-series anomaly detection is implemented through the **Smart Builder**, which is designed specifically for datasets where observations are made over time. The Smart Builder structures this process into four main stages: **Source → Preprocessing → Algorithm → Destination**.

**Stage 1: Source**

The pipeline begins with ingesting the time-series dataset, such as stock prices, machine sensor readings, or daily sales. Since these datasets are sequential, selecting the correct time-based input ensures the system can detect anomalies relative to historical patterns.

**Stage 2: Preprocessing**

Smart Builder automatically prepares raw time-based data for anomaly detection.

**Frequency**: For time-series datasets, data is resampled to a consistent frequency (e.g., daily, weekly). This standardization ensures anomalies are detected across uniform intervals rather than distorted by irregular timestamps.

**Imputation**: Missing values in the dataset are handled through imputation. Without this step, gaps may be misinterpreted as anomalies or distort the model’s ability to recognize true patterns.

**Transformation**: Transformation ensures the dataset is adjusted so that anomalies stand out more clearly rather than being hidden by trends or scale effects. In Smart Builder, this step can be configured directly, and users have two main options: **Differencing** or **Power Transformation**.

- **Differencing**: Differencing works by calculating the difference between consecutive observations in a time series. This process helps stabilize the mean by removing shifts in the series’ level, effectively reducing or eliminating trends and seasonality. The order of differencing specifies how many times this operation is applied to transform a non-stationary series into a stationary one. In QuickML, you can specify a maximum order of differencing up to 5. If the series remains non-stationary even after the fifth order of differencing, you can apply one of the available power transformation methods to further stabilize the variance and make the series suitable for anomaly detection.
- **Power Transformation**: Stabilizes variance and normalizes skewed data distributions. This is especially useful when raw data values are spread across very different scales, making anomalies in smaller values harder to detect. The available options for power transformation in QuickML are:
  - **Log Transformation**: Applies a logarithmic scale to compress large values and spread smaller ones. This helps reveal anomalies that might otherwise be overshadowed by extreme outliers.
  - **Square Root Transformation**: Useful when dealing with moderate skewness; it reduces the impact of high values while retaining the relative differences across smaller values.
  - **Box-Cox Transformation**: A flexible option that automatically determines the best power parameter (λ) to stabilize variance and normalize the dataset. Particularly effective when values are strictly positive and the distribution is heavily skewed.
  - **Yeo-Johnson Transformation**: Similar to Box-Cox but also works with zero or negative values, making it more versatile for datasets that aren’t strictly positive.
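For a feel of what these transformations do, here is a small pandas/SciPy sketch; the library calls are generic stand-ins for QuickML's preprocessing options, and the series is made up.

```python
# Differencing and power transformations on a toy series; illustrative stand-ins only.
import numpy as np
import pandas as pd
from scipy import stats

sales = pd.Series([112.0, 118.0, 132.0, 129.0, 141.0, 148.0, 136.0, 155.0, 163.0, 172.0])

diff1 = sales.diff().dropna()          # first-order differencing: removes a linear trend
diff2 = sales.diff().diff().dropna()   # second-order differencing, and so on up to the chosen max order

log_t = np.log(sales)                  # log transformation (requires strictly positive values)
sqrt_t = np.sqrt(sales)                # square root transformation (moderate skew)
boxcox_t, lam = stats.boxcox(sales)    # Box-Cox: picks the best lambda automatically (positive values only)
yj_t, lam_yj = stats.yeojohnson(sales) # Yeo-Johnson: also handles zero and negative values

print(f"Box-Cox lambda={lam:.3f}  Yeo-Johnson lambda={lam_yj:.3f}")
```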
**Stage 3: Algorithm**

Once preprocessed, Smart Builder applies algorithms designed for time-series anomaly detection, such as Isolation Forest or One-Class SVM. These models learn normal temporal behavior and flag deviations as potential anomalies, whether they appear as sudden spikes, drops, or irregular sequences.

**Stage 4: Destination**

Finally, Smart Builder outputs anomalies along with supporting metrics and visualizations. These results allow users to validate flagged anomalies, interpret the context, and take timely action. The system ensures anomalies are presented in an accessible format for faster decision-making.

### Non-Time Series Anomaly Detection

Non-time series anomaly detection deals with data that does not have an inherent temporal order, meaning the individual data points are independent of time or sequence. Each observation is treated as a standalone instance rather than part of a continuous timeline. Examples include credit card transactions, medical records, network packets, product reviews, or manufacturing measurements.

In non-time series anomaly detection, the goal is to identify data points that behave differently from the majority of the dataset. Since there is no time component, detecting anomalies in non-sequential data focuses on spatial, relational, or statistical patterns rather than temporal ones. Methods analyze the distribution and structure of the data, measuring how far each point deviates from the normal population.

#### Building Intuition Behind Non-Time-Series Anomaly Detection

In non-time-series datasets, observations are static and independent of time. Detecting anomalies involves understanding relationships between multiple features. The goal is to define the boundary that separates normal behavior from abnormal behavior.

- **Feature space perspective**: Each observation can be visualized as a point in a high-dimensional space. Normal points cluster together, while anomalies lie far from these clusters.
- **Density-based thinking**: Anomalies often occur in low-density regions where few data points exist. Algorithms use this intuition to assign anomaly scores.
- **Multivariate dependencies**: Relationships between multiple features must be considered. A point might appear normal along individual dimensions but be anomalous when combined with other features.
- **Context awareness**: In categorical or grouped data, normality can vary between groups. Understanding this contextual variation is essential for accurate detection.

#### Key Success Criteria for Non-Time-Series Anomaly Detection

Non-time-series anomaly detection focuses on static or independent data points rather than sequences over time. Success depends on accurately identifying abnormal patterns within these static datasets, understanding relationships among features, and providing interpretable and actionable insights. Key criteria for success include:

- **Quality data preparation**: Ensure data is clean, consistent, and representative of normal behavior.
- **Effective feature engineering**: Choose features that capture meaningful variations in behavior. Combining numerical and categorical attributes provides a richer representation of normality.
- **Representative normal samples**: The training data should cover the full range of legitimate patterns to prevent misclassification of valid variations as anomalies.
- **Robust model selection**: The chosen algorithm must handle noise, outliers, and different data distributions.

#### Business Applications of Non-Time Series Anomaly Detection

Non-time series anomaly detection is applied across many industries to identify unusual data points or behaviors that deviate from normal patterns. Because this type of data is not dependent on time, it helps organizations detect rare or suspicious events and strengthen decision-making across a wide range of operations.

1) **Healthcare**

In healthcare, non-time series anomaly detection is used to find unusual patient records or diagnostic results that may indicate errors, rare diseases, or potential fraud in insurance claims. For example, an unusually high number of procedures for a single patient or medical test results that differ greatly from normal ranges can be flagged for further review. Detecting these irregularities helps improve patient safety, reduce billing fraud, and maintain data accuracy in medical systems.

2) **Human Resources and Recruitment**

In HR and recruitment, anomaly detection can help identify unusual employee data, such as unexpected salary levels, inconsistent performance metrics, or irregular application patterns. Detecting these outliers helps organizations maintain fairness, detect data errors, and prevent possible policy violations. It also assists in spotting exceptional talent or early signs of workforce issues.

3) **Cybersecurity**

In cybersecurity, anomaly detection helps identify suspicious network activities or system access patterns that differ from typical user behavior. For example, a login attempt from an unusual location, unauthorized data access, or abnormal file transfers can indicate potential cyberattacks or insider threats. By analyzing the relationships and characteristics of network data, security systems can detect and respond to threats quickly, even without relying on time-based sequences.

#### How Non-Time Series Anomaly Detection Works

Now, let’s understand non-time series anomaly detection using the Isolation Forest algorithm, a popular method that works well for high-dimensional, tabular data.

1) **Understand the dataset**

In non-time series data, each record is treated as an independent observation with multiple features (for example, customer age, income, and spending score). There’s no timestamp or sequential order; the goal is simply to find data points that look very different from most others.

2) **Build random trees**

Isolation Forest creates many random decision trees by repeatedly splitting the dataset into smaller sections. Each split randomly selects a feature and a split value. The idea is that normal data points need more splits to be isolated, while anomalies, being rare and different, get isolated quickly with fewer splits.

3) **Measure isolation depth**

For every data point, the algorithm calculates the average number of splits (path length) needed to isolate it across all trees.

- If a point is isolated quickly, it is likely an anomaly.
- If it takes many splits, it is considered normal.

4) **Compute anomaly scores**

An anomaly score is then calculated for each record based on its average path length. Values closer to 1 indicate stronger anomalies, while values near 0 suggest normal behavior.

5) **Detect and report anomalies**

Finally, the model classifies records with high anomaly scores as outliers and reports the number of anomalies detected in the dataset. These points can then be examined to understand why they differ from the norm, for example, fraudulent transactions or data-entry errors.

Example: Imagine a bank analyzing customer transactions. Most customers spend amounts between $50 and $500, but a few transactions suddenly appear at $10,000+. Isolation Forest isolates these high-value transactions with very few splits, flags them as anomalies, and reports how many such suspicious cases exist, helping the bank focus on potential fraud.
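The walkthrough above can be reproduced in miniature with scikit-learn's `IsolationForest`; the transaction amounts below are synthetic, and the code is an illustration of the algorithm rather than QuickML's implementation.

```python
# Isolation Forest on synthetic transaction amounts; illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.uniform(50, 500, size=(500, 1))             # typical transaction amounts
fraud = np.array([[10_000.0], [12_500.0], [15_000.0]])   # rare high-value transactions
amounts = np.vstack([normal, fraud])

forest = IsolationForest(n_estimators=100, contamination="auto", random_state=0).fit(amounts)
labels = forest.predict(amounts)  # -1 = anomaly, 1 = normal

print("anomalies detected:", int((labels == -1).sum()))
print("largest flagged amounts:", np.sort(amounts[labels == -1].ravel())[-3:])  # the injected outliers
```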
5) **Detect and report anomalies**

Finally, the model classifies records with high anomaly scores as outliers and reports the number of anomalies detected in the dataset. These points can then be examined to understand why they differ from the norm, for example, fraudulent transactions or data-entry errors.

**Example:** Imagine a bank analyzing customer transactions. Most customers spend amounts between $50 and $500, but a few transactions suddenly appear at $10,000+. Isolation Forest isolates these high-value transactions with very few splits, flags them as anomalies, and reports how many such suspicious cases exist, helping the bank focus on potential fraud.

#### Steps to Build a Non-Time Series Anomaly Detection Pipeline

For non-time series data, such as customer transactions, employee records, or survey responses, anomaly detection pipelines are built using **Classic Builder**. Here, the process skips frequency alignment and temporal transformations, since the data is not sequential. Instead, Classic Builder directly prepares tabular data, applies algorithms, and outputs anomalies that highlight unusual or suspicious records.

By separating Smart Builder for time series data and Classic Builder for non-time series data, QuickML ensures that each anomaly detection pipeline is tailored to the unique structure of the dataset, improving both accuracy and interpretability.

**Stage 1: Data ingestion**

The process begins with loading the dataset into QuickML, which serves as the foundation for the anomaly detection workflow. This step involves importing data from various sources such as CSV files, relational databases, APIs, or cloud storage systems. Since non-time series data does not depend on temporal order, each record is treated as an independent observation. Ensuring accurate and complete data ingestion is critical for reliable anomaly detection results.

**Stage 2: Data preprocessing**

Once the data is imported, it undergoes a preprocessing phase to ensure accuracy, consistency, and readiness for analysis. This step includes handling missing values through imputation or removal and encoding categorical variables using label or one-hot encoding. Effective preprocessing ensures that anomalies reflect true irregularities rather than data quality issues, improving the performance and interpretability of the detection model.

**Stage 3: Algorithm selection**

After preprocessing, QuickML allows users to select an appropriate anomaly detection algorithm based on the characteristics of the dataset and the nature of expected anomalies. Common approaches include Isolation Forest, One-Class SVM, and Local Outlier Factor (LOF). Each algorithm uses different principles to distinguish normal data from anomalies. Choosing the right method is crucial for accurately capturing the unique patterns and relationships within the dataset.

**Stage 4: Model training and detection**

In this step, the selected algorithm is trained on the processed data to learn the normal behavior or structure of the dataset. During training, the model identifies typical patterns, boundaries, or distributions and then evaluates each instance to detect those that deviate significantly from the learned norms.

**Stage 5: Anomaly reporting**

After the model identifies anomalies, QuickML summarizes the results by displaying the number of anomalies detected in the dataset. This output provides a clear overview of how many data points deviate from normal patterns based on the trained model.
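To make these stages concrete, here is a minimal, self-contained sketch of the same Isolation Forest workflow using scikit-learn. The dataset, feature values, and the `contamination` setting are illustrative assumptions for the bank-transaction example above, not QuickML internals.

{{%code class="language-python"%}}# Isolation Forest sketch for tabular (non-time series) anomaly detection.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical transaction amounts: mostly $50-$500, plus a few $10,000+ outliers.
normal = rng.uniform(50, 500, size=(995, 1))
outliers = rng.uniform(10_000, 12_000, size=(5, 1))
X = np.vstack([normal, outliers])

# contamination is the expected fraction of anomalies; tune it per dataset.
model = IsolationForest(n_estimators=100, contamination=0.005, random_state=42)
labels = model.fit_predict(X)      # -1 = anomaly, 1 = normal
scores = -model.score_samples(X)   # after negation, higher = more anomalous

print("Anomalies detected:", int((labels == -1).sum()))
print("Flagged values:", X[labels == -1].ravel().round(2)){{%/code%}}

Records labeled -1 correspond to the quickly isolated points described in the steps above; in this toy setup they are exactly the handful of $10,000+ transactions.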
### Evaluation Metrics in Anomaly Detection

The evaluation metrics below help assess how well the anomaly detection model is identifying deviations from normal behavior and how accurate its underlying predictions are. These metrics quantify prediction error, model reliability, and ultimately the trust you can place in flagged anomalies.

#### Mean Absolute Percentage Error (MAPE)

**What it tells you**: MAPE measures how far, on average, the model’s predicted values deviate from the actual observed values in percentage terms. It expresses the size of prediction errors relative to the magnitude of actual data, which makes it intuitive and easy to interpret across different datasets or scales. This metric is particularly useful when you need to understand how accurate your model’s predictions are in terms of percentage deviation rather than absolute difference.

**Intuition**:
- A lower MAPE indicates that the model’s predictions are closer to the true values.
- A very high MAPE usually signals poor model accuracy, or that actual values are very small, which inflates percentage errors.

**Example Inference**: If MAPE is **15%**, it means predictions deviate from actual values by 15% on average.

#### Symmetric Mean Absolute Percentage Error (SMAPE)

**What it tells you**: SMAPE improves on MAPE by treating over-predictions and under-predictions equally. This makes it a fairer and more balanced measure, especially when your dataset contains small or zero values that can distort traditional percentage errors. SMAPE expresses how large prediction errors are compared to the average of the predicted and actual values, ensuring the metric remains stable even when actual values are close to zero.

**Intuition**:
- SMAPE ranges between 0% (perfect prediction) and 200% (completely inaccurate).
- It is preferred over MAPE when your dataset contains very small or zero values to avoid disproportionately large errors.

**Example Inference**: A SMAPE of **18.68%** indicates that, on average, predictions deviate by 18.68% from actuals in a symmetric fashion, which is a reasonably acceptable level of error for many real-world anomaly detection tasks.

#### Mean Squared Error (MSE)

**What it tells you**: MSE calculates the average of squared differences between the predicted and actual values. By squaring the errors, it emphasizes large deviations, meaning big mistakes have a much stronger impact on the score. This makes MSE valuable for understanding how significantly prediction errors vary across the dataset and for identifying models that may be making large, costly errors.

**Intuition**:
- An MSE of 0 means perfect predictions.
- Higher MSE values indicate a higher level of prediction error.
- Because errors are squared, large deviations contribute disproportionately, making MSE sensitive to outliers.

**Example Inference**: If the reported MSE is **0**, the model’s predictions match the observed values exactly, which is an ideal case.

#### Root Mean Squared Error (RMSE)

**What it tells you**: RMSE is the square root of MSE and provides an error measure in the same unit as the original data, making it more interpretable. It represents the standard deviation of prediction errors, or how spread out the residuals are around the true values. RMSE is especially helpful when comparing models or assessing prediction consistency.

**Intuition**:
- RMSE is easier to interpret than MSE because it is expressed in the same unit as the predicted variable.
- Lower RMSE means better prediction accuracy. **Example Inference**: An RMSE of **0.0041** indicates that, on average, the prediction error is very small and close to the true values. #### Mean Squared Log Error (MSLE)   **What it tells you**: MSLE measures the squared difference between the logarithms of predicted and actual values. Instead of focusing on absolute differences, it emphasizes relative errors, making it ideal for data that spans multiple scales or involves exponential growth (e.g., population, sales, or web traffic). MSLE rewards predictions that are proportionally close to actual values and penalizes underestimation more strongly than overestimation, which is useful when missing a spike is more serious than over-predicting one. **Intuition**: - MSLE penalizes underestimation more than overestimation. - Best used when your data spans multiple orders of magnitude (e.g., exponential growth patterns). **Example Inference**: An MSLE of **0** indicates perfect alignment between predicted and actual values in log scale, meaning no significant under- or over-prediction. #### Root Mean Squared Log Error (RMSLE)   **What it tells you**: RMSLE is the square root of MSLE, making it easier to interpret while retaining its focus on relative prediction accuracy. It helps evaluate how much the model’s predictions differ from actual values in percentage-like terms without being overly influenced by large outliers. RMSLE is particularly useful for problems where relative growth or proportional differences are more important than exact numerical accuracy. **Intuition**: - RMSLE is less sensitive to large absolute errors but emphasizes relative accuracy. - Particularly useful when percentage differences matter more than raw differences. **Example Inference**: An RMSLE of **0.0041** indicates that, on a log scale, the prediction errors are extremely low, confirming high-quality forecasting. #### Number of Anomalies   **What it tells you**: This metric represents the count of data points flagged as anomalies by the model after processing the dataset. It provides a direct measure of how frequently the system detects deviations from normal behavior. **Intuition**: - A very high number of anomalies might indicate that the model is too sensitive (many false positives). - A very low number might mean the model is too strict and missing meaningful anomalies. **Example Inference**: If the model detects **42** anomalies, you can review them to ensure they correspond to meaningful events, like sudden demand spikes or sensor failures, and not random noise. **How to use these metrics together** No single metric tells the full story of model performance. Use them collectively to get a holistic view; prediction error metrics (MAPE, SMAPE, MSE, RMSE, MSLE, RMSLE) indicate how accurately the model forecasts normal behavior, while the number of anomalies shows how often the model flags deviations. **Example**: Suppose your model has an RMSE of 0.0041, a MAPE of 15%, and detects 42 anomalies in a dataset. The low RMSE and moderate MAPE indicate that predictions are very close to actual values, and the number of anomalies is reasonable, suggesting the model is accurate and appropriately sensitive. If instead the model flagged 300 anomalies, you might suspect it’s too sensitive and producing many false positives. Interpreting these metrics together helps balance prediction accuracy with meaningful anomaly detection, giving you confidence in actionable insights. 
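For reference, all of these error metrics can be computed directly from the actual and predicted series. The sketch below is a plain NumPy illustration with made-up numbers; the SMAPE variant shown uses the 0%-200% scale described above.

{{%code class="language-python"%}}# Computing the forecast-error metrics described above on toy data.
import numpy as np

actual = np.array([100.0, 120.0, 130.0, 150.0])
pred = np.array([90.0, 125.0, 128.0, 160.0])

mape = np.mean(np.abs((actual - pred) / actual)) * 100
smape = np.mean(2 * np.abs(pred - actual) / (np.abs(actual) + np.abs(pred))) * 100
mse = np.mean((actual - pred) ** 2)
rmse = np.sqrt(mse)
msle = np.mean((np.log1p(actual) - np.log1p(pred)) ** 2)  # log1p avoids log(0)
rmsle = np.sqrt(msle)

print(f"MAPE={mape:.2f}%  SMAPE={smape:.2f}%  MSE={mse:.2f}")
print(f"RMSE={rmse:.2f}  MSLE={msle:.5f}  RMSLE={rmsle:.5f}"){{%/code%}}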
{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} For non-time series anomaly detection, the evaluation focuses only on the number of anomalies detected, since there are no time-dependent predictions involved. Metrics like MAPE, MSE, or RMSE are not applicable because the model does not forecast sequential values. {{%/note%}}

### Visual Evaluation of Anomalies

Visual evaluation is a crucial step in validating the performance of a time-series anomaly detection model. It enables qualitative assessment of the model’s outputs by comparing predicted anomalies against the actual time-series behavior.

This visualization enables users to:
- Quickly validate whether detected anomalies align with real changes in the data.
- Identify false alarms or missed anomalies.
- Gain insight into the model’s reliability across different time segments and trends.

**What it tells you**: This visualization shows how the model identifies anomalies in a time-series dataset. The chart plots actual data values over time, highlights the expected (normal) range, and marks points that deviate significantly as anomalies.

**Intuition**:
- **Green line**: Represents the actual observed values over time.
- **Gray shaded area**: Indicates the expected or predicted range of normal behavior based on the model.
- **Red circles**: Highlight points detected as anomalies, where the actual values significantly deviate from the expected range.
- If the red anomaly points align with clear spikes, drops, or deviations outside the gray expected range, it indicates that the model is accurately capturing unusual patterns.
- Few, well-placed anomalies suggest the model is stable and precise.
- Frequent or random anomaly points within normal regions may indicate an overly sensitive threshold or noise in detection.

**Example**: If anomalies appear at sharp peaks in the series (e.g., sudden increases in values), it suggests genuine outliers or unexpected events. If anomalies appear during normal fluctuations, it may indicate that the model needs threshold adjustment to reduce false positives.

--------------------------------------------------------------------------------
title: "AutoML"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.596Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/learning-center/automl/"
service: "QuickML"
--------------------------------------------------------------------------------

### What is AutoML?

Automated Machine Learning, or AutoML, is an approach to building machine-learning models that automates the model development tasks which are performed manually in the traditional method. AutoML empowers users of varying skill levels, from data scientists and ML engineers to business owners and stakeholders, to build quality models without requiring any programming expertise or statistical knowledge. This approach shortens the time taken to move a model from idea to production compared to the traditional process.

### Behind the Scenes of an AutoML Pipeline

AutoML includes tasks like data preprocessing, feature engineering, algorithm selection, hyperparameter tuning, and model evaluation, which are done in one go without any manual intervention.
It performs the essential operations sequentially, optimizing and preparing the input data on its own, and executes the tasks automatically until the final goal, a high-performing model, is built.

The general stages of machine-learning model development that AutoML handles automatically are as follows:

1. **Data Preprocessing** - The first step AutoML performs is to clean and transform the raw data into a usable format. It automatically handles missing values, outliers, encoding, and data normalization to optimize the data for model input.
2. **Feature Engineering** - AutoML reduces the size and complexity of the data by removing noise and unwanted features, or creates new features when necessary, to help improve model performance and accuracy. Using the most relevant features in model building results in a high-performing model.
3. **Model Selection** - Various algorithms are tested to find the one that best fits the dataset. AutoML compares them and identifies the most suitable model based on its performance.
4. **Hyperparameter Tuning** - AutoML also performs the task that is the heart of building a high-performing model, hyperparameter tuning, to find the most effective combination. It applies multiple techniques like grid search, random search, or Bayesian optimization to improve model accuracy and reduce error rates.
5. **Model Evaluation** - Different evaluation metrics are used for different models to evaluate performance. After tuning, the model undergoes evaluation using a validation dataset, where key metrics like accuracy, precision, recall, F1 score, or mean squared error are evaluated. This stage helps determine the model’s quality and performance.
6. **Ensemble Methods** - AutoML also uses ensemble techniques, if required, combining multiple models to boost accuracy and performance. These methods can increase accuracy by blending the strengths of different algorithms.

### AutoML in QuickML

QuickML’s AutoML feature automatically generates a complete pipeline with essential stages, offering an end-to-end solution that mirrors the custom pipeline building mode. Currently, AutoML in QuickML only supports prediction models, enabling users to build various types of Classification, Regression, and Ensemble models tailored to the input data and specific business requirements. This streamlined AutoML process simplifies model development while ensuring high-quality, data-driven predictions, making it an effective tool for users looking to accelerate their model-building workflow.

#### Key benefits of QuickML's AutoML

QuickML offers significant advantages with its AutoML capabilities, providing a streamlined and accessible approach to building machine learning models.

- Easy interpretation of pipeline stages
- Triggering AutoML in a few clicks, making it user-friendly for non-experts
- Reduced resource dependency
- Editable pipelines for optimization
- Seamless endpoint creation for real-time predictions
- Consistency and reliability in model quality
- Time and cost efficiency
- Scalable and adaptable for various use cases (Classification, Regression, Ensemble models)

--------------------------------------------------------------------------------
title: "Clustering"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.596Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/learning-center/clustering/"
service: "QuickML"
--------------------------------------------------------------------------------

### Introduction

Clustering is a process of grouping similar data points into meaningful clusters, where items in the same group share common characteristics while differing from items in other groups. It falls under the domain of **unsupervised learning**, a branch of machine learning that works without predefined labels or outcomes. Instead of being told what the correct groupings are, the algorithm uncovers hidden structures within the data on its own. This makes clustering especially valuable when dealing with large, unorganized datasets, helping users detect patterns, discover natural groupings, and simplify complex information into more understandable segments.

At its core, clustering answers the question: “Which data points are similar to each other?” It can be applied across domains such as customer segmentation, image recognition, anomaly detection, and document categorization, among many others.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} The Clustering feature is available in {{%bold class="bold-primary"%}}early access{{%/bold%}} across all the data centers. To use this feature, request access via support@zohocatalyst.com. {{%/note%}}

### Building Intuition Behind Clustering

Before diving into the types of clustering and its algorithms, let's begin with an intuitive understanding of clustering.

Imagine you have a dataset of movie viewers, with each viewer described by multiple attributes such as the number of movies watched per genre, average rating given, and watch frequency. You don’t know in advance which viewers have similar tastes. By applying a clustering algorithm, each viewer is assigned to a cluster based on overall similarity across all features. At the end, you can visually see natural groupings of viewers: one cluster might contain fans of action and thriller movies, another could be comedy enthusiasts, and a third could represent viewers who prefer romance or drama. The algorithm has uncovered hidden patterns in viewer behavior without ever being told which viewers are similar.

### Key Success Criteria for Clustering Analysis

Clustering is inherently iterative and exploratory, requiring domain expertise and human judgment. Unlike supervised learning, there are no labeled outcomes, so you cannot use traditional metrics like accuracy or RMSE to evaluate performance. This makes assessing clustering models subjective and dependent on business objectives.

**Key criteria for success include:**
- **Interpretability**: Can you explain why points were grouped together?
- **Business usefulness**: Does the clustering output provide actionable insights?
- **Knowledge discovery**: Have you learned new patterns or uncovered hidden structures in the data?

Success in clustering often comes from combining algorithmic output with domain knowledge to refine clusters and extract meaningful insights.

### Business Applications of Clustering

Clustering finds applications in a wide range of industries, such as those listed below, because it helps reveal natural groupings in data that businesses can act upon.

**Retail and Marketing**

Clustering is widely used in retail and marketing to segment customers into groups, such as budget shoppers, seasonal buyers, or premium customers.
These insights allow businesses to run targeted campaigns, design loyalty programs, and even optimize store layouts by identifying products that are frequently purchased together.

**E-commerce and Online Platforms**

E-commerce platforms rely heavily on clustering for personalization. By analyzing browsing and purchase histories, clustering algorithms can group users with similar behavior and generate product recommendations tailored to each segment. Beyond customers, clustering also helps organize vast product catalogs into categories, while unusual browsing or review patterns can be flagged as suspicious activity.

**Finance and Banking**

In the financial sector, clustering plays a key role in fraud detection by distinguishing normal transaction clusters from rare, isolated outliers that may represent fraudulent activity. Banks also use clustering to segment clients based on spending patterns and income levels, enabling them to design customized credit cards, loan offers, or investment portfolios. Similarly, analysts use clustering to group stocks that behave alike in the market, aiding in portfolio diversification.

**Healthcare and Life Sciences**

Healthcare and life sciences also benefit from clustering techniques. Patient records can be grouped to identify disease subtypes or high-risk populations, which supports early intervention and personalized treatment plans. Genetic and clinical data, when clustered, often reveal hidden biological groupings that accelerate drug discovery. Hospitals also use clustering to categorize patients by treatment costs, hospital stays, or outcomes, helping improve operational efficiency.

**Telecom and Technology**

Telecom companies apply clustering to understand user behavior, such as grouping customers by calling or internet usage patterns, which then informs the creation of personalized data plans. It is also used to detect anomalies in network traffic or to streamline customer service by automatically grouping support tickets into categories like billing, login issues, or technical faults.

### How Clustering Works

Let’s understand clustering using the K-Means algorithm, one of the most widely used methods.

**Step 1: Choose the number of clusters (k)**

The process begins by deciding how many clusters you want the algorithm to identify. This number, denoted as k, can be determined using statistical techniques such as the Elbow Method, which helps estimate an optimal balance between accuracy and interpretability.

**Step 2: Initialize centroids**

Next, the algorithm randomly selects k data points from the dataset as the initial cluster centroids. These serve as the starting reference points around which clusters will begin to form. In some cases, improved initialization methods like K-Means++ are used to select better starting positions, which can help the algorithm converge faster and produce more reliable results.

**Step 3: Assign data points to clusters**

Once the centroids are initialized, each data point in the dataset is assigned to the nearest centroid based on a chosen distance metric, typically Euclidean distance, which measures how far apart two points are in space. The formula for Euclidean distance between two points (x1,y1) and (x2,y2) is:

{{%code%}}Distance = √((x1 - x2)² + (y1 - y2)²) {{%/code%}}

This step effectively divides the dataset into k groups, where each point belongs to the cluster with the closest centroid.
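As a quick illustration of Steps 2 and 3, the snippet below computes Euclidean distances from every point to k candidate centroids and assigns each point to its nearest one. The points and centroids are made-up values for the example.

{{%code class="language-python"%}}# K-Means assignment step: each point joins the cluster of its nearest centroid.
import numpy as np

points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])  # k = 2, chosen arbitrarily

# Pairwise Euclidean distances: sqrt((x1 - x2)^2 + (y1 - y2)^2), vectorized.
diffs = points[:, None, :] - centroids[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))  # shape: (n_points, k)

assignments = distances.argmin(axis=1)
print(assignments)  # [0 0 1 1]: first two points join cluster 0, last two cluster 1{{%/code%}}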
**Step 4: Update centroids**

After all points have been assigned, the algorithm recalculates the position of each centroid. This is done by taking the mean of all the data points that belong to the same cluster. The new mean value becomes the updated centroid location, representing the center of that cluster more accurately than before.

**Step 5: Repeat assignment and update**

The process of assigning points and updating centroids is repeated iteratively. With each iteration, the centroids adjust their positions slightly, and points may shift between clusters as the algorithm refines the boundaries. This cycle continues until the cluster assignments no longer change significantly or the movement of centroids becomes minimal.

**Step 6: Check for convergence**

K-Means is said to have converged when the centroids stabilize, meaning they stop moving appreciably between iterations. Alternatively, the process may stop when a predefined maximum number of iterations is reached. At this stage, the algorithm considers the clustering process complete.

**Step 7: Generate the final output**

In the final step, each data point is assigned a cluster label (such as 0, 1, 2, …, k–1) corresponding to the group it belongs to. The final centroid coordinates represent the centers of these clusters. These results can then be used for real-time use cases such as customer segmentation.

**Example**

Let's apply these steps to a customer segmentation example for a better understanding.

Imagine you run an online store and have a dataset of customers with multiple attributes, such as monthly spending, number of purchases, product preferences, and more. You don’t know in advance which customers behave similarly. After applying a clustering algorithm, such as K-Means, each customer is assigned to a cluster based on overall similarity across all features.

To visualize the results, the data is reduced to two components: Component 1 and Component 2. Each customer becomes a point on a 2D scatter plot:
- X-axis: Component 1
- Y-axis: Component 2

Points are color-coded according to their cluster. This visualization makes it easy to see natural groupings of customers.

From the above plot, we can conclude that each cluster formed represents a group of customers with similar attributes and interests. Businesses can then target these customer segments to increase revenue and profits.

K-Means is one algorithm that falls under centroid-based clustering. QuickML also supports various algorithms that fall under different types of clustering. Let's have a look at the types of clustering.

### Types of Clustering

Clustering methods can be broadly classified into four major types, each with its own approach and applications.

**1. Centroid-based clustering**: It relies on a central point, or centroid, to represent each cluster. Data points are assigned to the cluster whose centroid they are closest to, with clusters formed by minimizing the distance between data points and these centroids. A common example is customer segmentation, where shoppers are grouped into categories like budget, mid-range, or premium based on their average monthly spending.

In this visualization, the data points are grouped around centroids; the red X markers represent these centroids. Each color indicates a distinct cluster, and every point belongs to the cluster whose centroid it’s closest to. The goal is to minimize the total distance between points and their assigned centroids.
This method assumes clusters are roughly spherical and equally sized, which works well for simple, well-separated data distributions. Algorithms like K-Means and MiniBatchKMeans are examples of centroid-based clustering.

**2. Medoid-based clustering**: It is similar to centroid-based clustering, but instead of using calculated centroids, it chooses actual data points, called medoids, as cluster representatives. This makes the approach more robust to noise and outliers, since the clusters are anchored on real observations. For instance, telecom companies may use medoid-based clustering to group users by selecting actual representative customer profiles from call behavior data.

Here, clusters are represented by medoids: actual data points shown as red diamond markers. Unlike centroids (which are averages), medoids are real observations from the dataset that minimize the total dissimilarity to other points in their cluster. This makes the approach more robust to noise and outliers since extreme values have less influence on the cluster centers. The plot shows how clusters are formed around representative data points. Algorithms like K-Medoids, CLARA, and CLARANS follow this principle.

**3. Density-based clustering**: It takes a different path by identifying dense regions of data points and treating them as clusters, while areas of low density act as separators. One of its strengths is the natural detection of outliers, since points that don’t belong to any dense region are marked as noise. A typical application is in financial fraud detection, where dense regions correspond to normal spending behavior and sparse or isolated points indicate suspicious transactions.

In this plot, clusters are formed based on regions of high data density. The algorithm groups points that are close together and labels points in low-density regions as noise (often given the label -1 in DBSCAN and shown in a different color). You can see that this approach captures complex, non-spherical cluster shapes (like the curved “two moons” pattern), which centroid- or medoid-based methods struggle with. It’s ideal for datasets with irregular boundaries or when detecting outliers is important, for example in fraud or anomaly detection.

**4. Model-based clustering**: It assumes that data is generated from a mixture of underlying probability distributions, and it assigns probabilities of cluster membership rather than hard labels (where each data point is assigned to exactly one cluster, as in K-Means). This probabilistic framework makes it especially useful in complex domains such as speech recognition, where overlapping sound patterns can be modeled as a mixture of Gaussian distributions.

This visualization shows clusters formed by modeling the data as a mixture of Gaussian distributions. Each cluster corresponds to one Gaussian component, and data points have probabilistic membership, meaning a point can partially belong to multiple clusters. The colors indicate the most likely cluster assignment for each point. This method handles overlapping and elliptical clusters better than K-Means, making it suitable for more complex, continuous data distributions.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} Each clustering algorithm corresponds to one of the clustering types mentioned above, and you can refer to the details provided in the {{%link href="https://docs.catalyst.zoho.com/en/quickml/help/ml-algorithms/clustering/" %}}clustering algorithm help doc{{%/link%}} to identify their categorization. {{%/note%}}
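To get a feel for how model-based clustering differs from hard assignment, here is a small Gaussian Mixture sketch using scikit-learn. The two synthetic blobs are purely illustrative.

{{%code class="language-python"%}}# Model-based clustering: Gaussian Mixture with soft (probabilistic) membership.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.8, size=(100, 2))
blob_b = rng.normal(loc=[3.0, 3.0], scale=0.8, size=(100, 2))
X = np.vstack([blob_a, blob_b])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
hard_labels = gmm.predict(X)        # most likely component per point
soft_probs = gmm.predict_proba(X)   # membership probability per component

# A point midway between the blobs belongs partially to both clusters.
print(gmm.predict_proba([[1.5, 1.5]]).round(3)){{%/code%}}

The `predict_proba` output is what distinguishes this family of methods: instead of a single label, every point gets a probability for each component.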
### Steps to Build a Clustering Pipeline in QuickML

Clustering pipelines in QuickML are built using Classic Builder, which ensures accurate identification of clusters and actionable insights.

**Step 1: Data ingestion**

The process begins with loading the dataset into QuickML, which serves as the foundation for the clustering workflow. This step involves importing data from various sources such as CSV files, databases, or cloud storage systems.

**Step 2: Data preprocessing**

Once the data is ingested, it undergoes a preprocessing phase to ensure quality and consistency. This step includes handling missing values through techniques like imputation or removal, encoding categorical variables using methods like label or one-hot encoding, and applying data transformation techniques to improve data quality and model performance. Effective preprocessing reduces bias and improves the accuracy of clustering by ensuring that distance-based algorithms treat all features fairly.

**Step 3: Algorithm selection**

After preprocessing, QuickML allows users to select an appropriate clustering algorithm based on the data characteristics and desired outcome. Algorithms such as K-Means, DBSCAN, BIRCH, or Gaussian Mixture Models (GMM) can be applied depending on whether the dataset has well-separated, density-based, or probabilistic cluster structures. The selection of the right algorithm is crucial, as different methods capture different types of relationships and structures within the data.

**Step 4: Model training**

In this step, the selected algorithm is applied to the processed data to identify natural groupings. The model iteratively learns by assigning similar data points to the same cluster based on the defined similarity or distance measure. Key parameters, such as the number of clusters in K-Means or the epsilon value in DBSCAN, are fine-tuned to produce meaningful and stable results. The outcome of this phase is the identification of distinct data clusters that reveal hidden patterns and structures within the dataset.

**Step 5: Cluster evaluation**

After clusters are formed, QuickML evaluates their quality and validity using various statistical metrics. Commonly used metrics include the Silhouette Score, which measures how well each data point fits within its cluster; the Calinski-Harabasz Score, which assesses cluster compactness and separation; and the Davies-Bouldin Score, which evaluates average cluster similarity. These evaluation metrics help in determining whether the clustering results are both interpretable and reliable for practical applications.

### Clustering Evaluation Metrics

The evaluation metrics mentioned below collectively provide insights into different aspects of clustering model performance, such as how well the groups are separated, how compact they are, and whether the clusters reveal meaningful similarities in the data, rather than arbitrary groupings. Unlike supervised learning, clustering does not have predefined labels, so these metrics act as guiding measures to judge the quality of the clusters.

Let’s explore and interpret the metrics commonly used in evaluating clustering models in QuickML:

**1. Silhouette Score**

**What it tells you**: Measures how similar a data point is to its own cluster compared to other clusters.

**Intuition:**
- The silhouette score ranges from -1 to +1.
- A high score (close to +1) indicates that data points are well-clustered, with tight grouping and clear separation from other clusters.
- A score near 0 suggests overlapping clusters or points lying on the boundary between clusters.
- A negative score (close to -1) indicates poor clustering where points may be assigned to the wrong cluster or clusters overlap significantly.

**Example Inference:** If the silhouette score is 0.52, it means clusters are reasonably well-formed and separated, but there is still some overlap among data points.

**2. Calinski-Harabasz Score**

**What it tells you**: This score compares how far apart the clusters are (between-cluster dispersion) to how tight the points are within each cluster (within-cluster dispersion).

**Intuition:**
- The score is non-negative and unbounded above, ranging from 0 to +∞ (higher is better).
- A higher score indicates clusters are compact within themselves and well-separated from others.
- A lower score suggests that clusters are either spread out internally or not distinctly separated.

**Example Inference:** If the Calinski-Harabasz score is 950, it suggests that the clusters are relatively dense and distinctly partitioned, showing a good separation structure.

**3. Davies-Bouldin Score**

**What it tells you:** Measures the average similarity between clusters, based on the ratio of intra-cluster distance to inter-cluster separation.

**Intuition:**
- The score is non-negative, ranging from 0 to +∞ (lower is better).
- A lower score means clusters are more distinct and have less overlap.
- A higher score suggests significant similarity between clusters, implying poor separation.

**Example Inference:** If the Davies-Bouldin score is 0.60, it indicates the clusters are fairly well-separated, but not perfectly distinct; some overlap still exists.

**4. Number of Clusters**

Indicates how many distinct groups the algorithm has identified in the dataset.

**Example Inference:** If the model outputs 3 clusters, it suggests that the dataset can be meaningfully divided into three distinct groups, for example, three types of customer segments: budget, mid-range, and premium buyers.

### Visual Evaluation of Clusters

Apart from numerical scores, cluster evaluation can also be done through visualization in QuickML. Cluster distribution and cluster plots give an intuitive sense of balance, separation, and structure within the data.

**1. Cluster Distribution**

**What it tells you:** This shows how many data points fall into each cluster using a histogram chart.

**Intuition:**
- A balanced distribution means clusters are relatively even in size, which often indicates a natural grouping.
- An imbalanced distribution means one or more clusters dominate in size, which may suggest either real-world dominance (e.g., most customers belong to one segment) or a limitation of the clustering method.

**Example Inference:** If Cluster 0 has 600 points, Cluster 1 has 150 points, and Cluster 2 has 30 points, it suggests that the majority of data points belong to Cluster 0, while Clusters 1 and 2 represent smaller, niche groups.

**2. Cluster Plot**

**What it tells you:** Visualizes how clusters are separated in a lower-dimensional space using a scatter plot (usually using PCA or t-SNE for dimensionality reduction).

**Intuition:**
- Well-separated clusters in the plot indicate good performance of the clustering algorithm.
- Overlapping clusters suggest ambiguity in assignments, meaning the algorithm found it difficult to clearly separate groups.

**Example Inference:** If the plot shows clearly visible groups with distinct boundaries, it confirms the algorithm has identified meaningful clusters. If groups overlap heavily, it suggests the need to test a different algorithm (e.g., DBSCAN for irregular shapes).
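The three scores above are also available in scikit-learn, so you can sanity-check reported values on your own data. In this sketch the labels come from a throwaway K-Means run on synthetic blobs; everything here is illustrative.

{{%code class="language-python"%}}# Computing the cluster-quality scores described above with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

print("Silhouette:", round(silhouette_score(X, labels), 3))                 # higher is better
print("Calinski-Harabasz:", round(calinski_harabasz_score(X, labels), 1))  # higher is better
print("Davies-Bouldin:", round(davies_bouldin_score(X, labels), 3))        # lower is better{{%/code%}}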
**3. Download Model Predictions**

**What it tells you:** This allows you to export the model’s clustering results for the entire dataset. Each data point is labeled with the cluster it belongs to.

**Benefits:**
- Provides a direct mapping of data points to clusters, making it easy to analyze individual assignments outside QuickML.
- Enables further downstream analysis, such as profiling clusters, cross-referencing with external data, or performing targeted actions (e.g., marketing to a specific customer segment).
- Offers transparency and traceability: you can see exactly which data points were assigned to which cluster, which is useful for validation and reporting.

**To download model predictions**
1. Navigate to the desired model details page.
2. Scroll down to the **Visualizations** section.
3. Click the **View Model Predictions** button above the cluster plot chart. A Model Predictions pop-up window appears, from which you can download the predictions.

### Data Connectors

--------------------------------------------------------------------------------
title: "Zoho Apps"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.597Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-connectors/zoho-apps/"
service: "QuickML"
--------------------------------------------------------------------------------

# Data Connectors

Data is the most crucial part to handle in any analytics or machine learning pipeline platform. Hence, data connectors are provided to easily import data stored in the local file system or in internal and external cloud storage solutions into the QuickML service.

The data connectors act as a data ingestion engine, importing data into the QuickML service using **cloud-native approaches and asynchronous design**. Asynchronous data importing is handled by schedulers in the QuickML service. The data can be from any of the widely-used cloud storage services, such as Google Cloud, AWS S3 object storage, or Azure Blob. With proper access permissions, the data is imported into the QuickML service once the import is initiated.

The connectors download the requested data by streaming it and storing it in QuickML storage. This data can be used for model training or analytical purposes in QuickML. The supported data connector options are listed below.
<br />

## Internal data Connectors

The following are the in-house Zoho products that are available as data connectors in QuickML by default:

- [Zoho CRM](https://help.zoho.com/portal/en/kb/crm/getting-started/articles/get-started-introduction-zoho-crm#What_is_CRM)
- [Zoho Bigin](https://help.zoho.com/portal/en/kb/bigin/get-started-with-bigin/articles/about-bigin#What_is_Bigin)
- [Zoho Creator](https://www.zoho.com/creator/help/new-quickstart-guide.html)
- [Zoho Recruit](https://help.zoho.com/portal/en/kb/recruit/introduction-to-zoho-recruit/articles/zoho-recruit-introduction)
- [Catalyst Data Store](/en/cloud-scale/help/data-store/introduction/)

--------------------------------------------------------------------------------
title: "Cloud Storage"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.597Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-connectors/cloud-storage/"
service: "QuickML"
--------------------------------------------------------------------------------

## External data Connectors

1. ### Amazon AWS S3

Amazon Simple Storage Service (Amazon S3) is a scalable, high-speed, web-based cloud storage service designed for online backup and archiving of data and applications on Amazon Web Services. The AWS S3 data connector can be used to access objects stored in Amazon S3 buckets. The objects in S3 can be public as well as private. QuickML does not need any authentication keys for fetching public objects. If the bucket has restricted access, then QuickML requires an access key and a secret key.

#### Details required to import data:
- **Bucket name** : The bucket in which the data is stored.
- **Object name** : Name of the file that needs to be imported.
- **Region** : Data center region in which your bucket is stored.
- **Access Key** : The access key of the AWS IAM user.
- **Secret Key** : The secret key of the AWS IAM user.
- **File Type** : The format of the file to be imported.

#### To import data from S3:

You need to create an AWS IAM user as per this [link](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html). While creating a new IAM user, please make sure you provide the necessary permissions to access the S3 service. At the end of the IAM user creation process, you will get an access key and a secret key. Store the secret key in a secure place, as it cannot be retrieved again. If you already have an IAM user with permission to access S3 objects, you can get the access and secret keys for that user from the AWS console.

2. ### Microsoft Azure Object Storage

Microsoft Azure contains multiple storage solutions, such as SQL databases, NoSQL databases, and file storage solutions. File storage solutions are mainly used to store and retrieve unstructured data (text, CSV, etc.) using streaming. These include Blob Storage, Queue Storage, and Disk Storage.

#### Blob Storage:

Blob Storage is an object storage solution for the cloud that deals with data as blobs (binary large objects) in three different ways, as per the requirement: block blobs, page blobs, and append blobs. Among them, block blobs are the most commonly used way to store objects.

#### Details required to import data:
- **Blob Name** : File name in the Azure storage account container that needs to be imported.
- **Container Name** : The container in which the file is stored.
- **Connection String** : Authentication string to be passed to validate the account and user details with Microsoft Azure.
- **File Type** : Type of the file that is being imported.

#### To get Connection String from Azure Portal:
1. Log in to the [Azure Portal](http://portal.azure.com/).
2. Search for and select the storage account from which the data needs to be imported.
3. Under the access keys section, copy the value of one of the connection strings provided (key 1 or key 2).

3. ### Google Cloud Storage

Google Cloud Storage is a secure, high-performance object storage solution for archiving and storing files online. Using the connector provided, data stored in any of the supported formats can be imported into QuickML.

#### Details required to import data:
- **Blob Name** : File name in Google storage that needs to be imported.
- **Bucket Name** : The bucket in which the data is stored.
- **File Type** : Type of the file that is being imported.
- **Authentication JSON file** : JSON file containing the authentication details to access the object.

Google service account details are required in order to request the data from the storage server. The steps to get the authentication details are provided below.

#### Steps to get the Authentication Service json file:
1. In the IAM & Admin section in the [Google cloud portal](https://console.cloud.google.com/iam-admin), go to **Service Accounts.**
2. Create a service account if needed and provide access to the required projects.
3. In *Service Accounts*, navigate to the {{%badge%}}Actions{{%/badge%}} menu and select {{%badge%}}Create Key{{%/badge%}}.
4. Save the key as a JSON file. It can be used as the value of the authentication JSON file parameter mentioned above.

4. ### OneDrive

Microsoft’s OneDrive object storage solution is also available as a data connector in QuickML. It allows users to store and manage files of various formats securely. This integration empowers users to effortlessly incorporate OneDrive data into machine-learning workflows, enhancing the efficiency and versatility of analysis. Using the connector provided, data stored in any of the supported formats can be imported into QuickML.

#### Details required to import data:
- **Client Id**: This is the unique identifier assigned to your application by Microsoft when registering it in Azure Active Directory.
- **Tenant Id**: This is the unique identifier for your organization's Azure Active Directory tenant.
- **Client Secret**: A secret key associated with your registered application to ensure secure communication between the application and the service.
- **User Email**: The email associated with your OneDrive account for authentication and access permissions.
- **File Name**: The name of the file you wish to import from OneDrive.
- **Source File Type**: The format of the file you're importing.
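To relate the connector fields above to a concrete API, this is roughly the kind of object fetch an S3 import performs behind the scenes. It uses boto3 with placeholder names and credentials; it illustrates how the fields map to an SDK call and is not QuickML's actual implementation.

{{%code class="language-python"%}}# Roughly what an S3 import amounts to: fetch one object into local storage.
# All names and keys below are placeholders, not real credentials.
import boto3

s3 = boto3.client(
    "s3",
    region_name="us-east-1",              # Region
    aws_access_key_id="YOUR_ACCESS_KEY",  # Access Key of the IAM user
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Bucket name and object name identify the file to import.
s3.download_file("example-bucket", "datasets/sales.csv", "/tmp/sales.csv"){{%/code%}}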
--------------------------------------------------------------------------------
title: "External Databases"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.597Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-connectors/databases/"
service: "QuickML"
--------------------------------------------------------------------------------

## External Databases:

QuickML offers seamless integration with different databases, allowing you to fetch and utilize data directly from these sources for your analysis and machine learning projects.

Supported databases include:
- **MySQL**
- **PostgreSQL**
- **SQL Server**
- **Amazon Redshift**
- **Amazon Aurora**
- **Amazon RDS**

Alongside the above-mentioned options, you can upload local files from your devices as well.

--------------------------------------------------------------------------------
title: "Local File System"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.597Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-connectors/local-file-upload/"
service: "QuickML"
--------------------------------------------------------------------------------

## Local file system

QuickML supports dataset imports from the local file system of various client machines, such as laptops, workstations, and PCs, provided the user has a stable internet connection to ensure the file upload succeeds. Once the import process begins, a progress bar is displayed to show how far the data upload has progressed.

QuickML can import data from a variety of file formats, including CSV, JSON, ORC, Parquet, text, XLS, XLSX, and XLSM.

### Data Preprocessing

--------------------------------------------------------------------------------
title: "Data Cleaning"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.598Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-preprocessing/data-cleaning/"
service: "QuickML"
--------------------------------------------------------------------------------

# Data Preprocessing Techniques in QuickML

QuickML provides major data preprocessing techniques divided into three main categories.

- **Data Cleaning**
- **Data Transformation**
- **Dataset Extraction**

All the operations listed in the upcoming sections are available as stages in the pipeline building process, helping the user create better machine-learning models.

# Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. If the data is incorrect, outcomes and algorithms will be unreliable.

1. ### Fill Columns

This is used to change the values of a specific column based on the criteria set by the user. If no criteria are specified in the configuration, then all the values in that column will be replaced by the user-specified value or method.\
**Example:**\
For a country's population dataset that contains details like name, age, and address, we can fill the eligible-for-vote column with "yes" for all people by setting age greater than 18 as the criteria.

2. ### Filter

This is used to extract the data that we need to preprocess from the dataset by applying criteria in the configuration. Records that do not satisfy the criteria can also be used for preprocessing.\
**Example:**
- #### Single Output Filter: In a student dataset, if we only require CSE department data, we can set the criteria as {{%badge%}}dept=CSE{{%/badge%}}, which reduces the data that we need to preprocess.
- #### Double Output Filter: If it is required to preprocess both the boys' and girls' data of the student dataset in different flows, we can use the **"Show unmatched records as secondary output"** checkbox in the configuration.
This is helpful if you want to perform a special operation or process on the unmatched data from the filter node.

![Double Output Filter](/images/help/quickml/data-preprocessing/Double_Output_Filter.webp)

3. ### Remove duplicates

This is used to remove the duplicate rows in the dataset. We can also control how the duplicate rows are removed.\
**Example:**\
Consider a student dataset that has three duplicate rows with student ID 101, at indexes 1, 5, and 7. Keep preference option:\
&nbsp;&nbsp;&nbsp;&nbsp;**First** - The output dataset will have only the row at index 1; the rows at indexes 5 & 7 will be removed.\
&nbsp;&nbsp;&nbsp;&nbsp;**Last** - The output dataset will have only the row at index 7; the rows at indexes 1 & 5 will be removed.\
&nbsp;&nbsp;&nbsp;&nbsp;**None** - The output dataset will have no duplicate rows; the rows at indexes 1, 5 & 7 will all be removed.

4. ### Select or Drop

This is used to either select or drop columns in the dataset. If a user needs only two columns from the dataset, they can simply select the required two columns from the dropdown and choose the select operation. For dropping columns, choose the drop operation.

--------------------------------------------------------------------------------
title: "Data Transformation"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.598Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-preprocessing/data-transformation/"
service: "QuickML"
--------------------------------------------------------------------------------

# Data Transformation

Data transformation is the process of converting data from one format or structure into another format or structure.

1. ### Date time Transformation

This is used to extract date-time features, such as date, year, and month, from a column that contains dates.

**Usecase 1:** A retail company wants to track the sales of their products over time. DateTime transformation helps them to extract the date, month, and year from the list of sales transactions.

**Usecase 2:** A transportation company can use DateTime Transformation to extract the date, time, and location from a list of past deliveries to optimize their delivery routes.
**Sample input:**
<table class="content-table" style="width:300px;"> <thead> <tr> <th style="text-align:center">dt</th> </tr> </thead> <tbody> <tr> <td style="text-align:center">2021-11-29 11:52:59</td> </tr> </tbody> </table>

**Sample output:**
<table class="content-table" style="width:300px;"> <tbody> <tr> <th style="text-align:center">dt_day_of_week</th> <td style="text-align:center">1</td> </tr> <tr> <th style="text-align:center">dt_date_no</th> <td style="text-align:center">29</td> </tr> <tr> <th style="text-align:center">dt_year_no</th> <td style="text-align:center">2021</td> </tr> <tr> <th style="text-align:center">dt_month_no</th> <td style="text-align:center">11</td> </tr> <tr> <th style="text-align:center">dt_business_day</th> <td style="text-align:center">1</td> </tr> <tr> <th style="text-align:center">dt_week_no_year</th> <td style="text-align:center">48</td> </tr> <tr> <th style="text-align:center">dt_day_of_year</th> <td style="text-align:center">333</td> </tr> <tr> <th style="text-align:center">dt_AM_PM</th> <td style="text-align:center">AM</td> </tr> <tr> <th style="text-align:center">dt_quarter_year</th> <td style="text-align:center">4</td> </tr> </tbody> </table>

2. ### Email Transformation

This is used to extract features like username, domain, and suffix from a column that contains emails.

**Usecase 1:** A sales team can use Email Transformation to extract the username and domain name from a list of email addresses to personalize their email outreach to potential customers.

**Usecase 2:** To identify phishing emails, a security team needs features such as the domain name and suffix from a list of email addresses that have been flagged as suspicious.

**Sample input:**
<table class="content-table" style="width:450px;"> <thead> <tr> <th style="text-align:center">mail</th> </tr> </thead> <tbody> <tr> <td style="text-align:center">abc@zylker.com</td> </tr> </tbody> </table>

**Sample output:**
<table class="content-table" style="width:450px;"> <thead> <tr> <th style="text-align:center">mail_first</th> <th style="text-align:center">mail_middle</th> <th style="text-align:center">mail_last</th> </tr> </thead> <tbody> <tr> <td style="text-align:center">abc</td> <td style="text-align:center">zylker</td> <td style="text-align:center">com</td> </tr> </tbody> </table>

3. ### Extract Data

The Extract Data component utilizes regex patterns to retrieve information from text columns. These patterns may encompass various date formats, emails, or numerical values embedded within the text columns.

**Example:**\
In the car price prediction dataset, to extract the year of manufacture of the car from the name column, we apply the POSIX regex pattern \d{4} on that column.

**Sample POSIX Regex patterns:**
- **Date format:** \d{4}-\d{2}-\d{2}
- **Email:** [a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
- **Number:** \d+

4. ### Format

This is used to modify the column values in the dataset based on the function chosen by the user. The functions displayed depend on the data type of the column.

**Example:**\
In a patient dataset, format the age column using the absolute value function. The age column will no longer have any negative values in it.

5. ### Group By

This is used to partition rows into groups based on their values in one or more columns. It is mandatory to provide at least one **aggregate** function for a column. Users can also set the {{%badge%}}HAVING{{%/badge%}} criteria to filter the grouped dataset.
    **Example:**\
    In an employee salary dataset, users can sum the salaries paid in each department by grouping the data by department. This shows the total salary expenditure for each department. The {{%badge%}}HAVING{{%/badge%}} section can then filter these groups to include only departments where the total salary exceeds 50,000.

    {{< figure src="/images/help/quickml/data-preprocessing/groupby.webp" alt="Group By" width="1000px" >}}

6. ### Hash Generator

    This is used to hash one or more columns in the dataset. Users can also drop the source column once the hash is generated.

7. ### Join

    This is used to join two datasets into a single dataset based on the primary column in both tables. Users can perform Left, Right, Inner, and Outer joins.\
    **Note**: This stage requires selecting a dataset node.\
    **Example:**

    {{< figure src="/images/help/quickml/data-preprocessing/Join.webp" alt="Join" width="600px" >}}

8. ### Merge Column

    This is used to merge two or more columns into a single column using a separator specified by the user. Users can also drop the source columns after performing the merge operation.

    **Example:**\
    In a passenger dataset, we can combine first_name, middle_name, and last_name into a Name column using a space separator.

9. ### Normalization

    Normalization is a statistical method that helps scale and compare variables on different scales or units. This node is used to normalize two or more columns.

10. ### Outlier Handler

    Outliers are data points that fall outside the normal distribution of the data, which can distort the results of data analysis and lead to incorrect conclusions. The Outlier Handler can be used to remove outliers from a dataset or replace them with more reasonable values, such as the cap, mean, or median values of the respective columns in the data.

11. ### Sort

    This is used to sort the dataset by a set of columns. Users can also choose the sorting order.

12. ### Split Column

    This is used to split a single column into two or more columns based on the separator specified by the user. The number of output columns is determined by the output names given by the user.

    **Example:**\
    In a passenger dataset, users can split the Name column into three columns by specifying first_name, middle_name, and last_name in the output column text box.

13. ### String Transformation

    Used to produce word embeddings from columns that contain text.

    **Usecase 1:** To provide recommended content to its users, a social media company can utilize String Transformation to get the word embeddings from the text of the posts that users have liked and shared.

    **Usecase 2:** A customer support team needs to classify customer support tickets by identifying common issues their customers are experiencing, using word embeddings produced by String Transformation from the text of the tickets.

14. ### Type Conversion

    This is used to convert a column into the data type specified by the user. Users can control how errors in the resulting dataset are handled using the {{%badge%}}On Error{{%/badge%}} option in the configuration. {{%badge%}}Throw{{%/badge%}} throws an exception and stops the execution if an error occurs, while {{%badge%}}Nullify{{%/badge%}} returns the original record if an error occurs for that particular record.

15. ### URL Transformation

    This is used to extract features such as sub_domain, domain, and suffix from a column that contains URLs.
    **Usecase 1:** A marketing team needs to determine which websites are most frequently visited by their target market. They can utilize URL Transformation to extract the domain names from a list of URLs that users of their website have clicked.

    **Usecase 2:** A fraud detection team can use URL Transformation to find the domains, subdomains, and suffixes of malicious websites from a list of URLs that have been flagged as suspicious.

    **Sample input:**

    <table class="content-table" style="width:450px;">
    <thead> <tr> <th style="text-align:center">link</th> </tr> </thead>
    <tbody> <tr> <td style="text-align:center">https://www.google.in/library?fetch=query#fragment_part</td> </tr> </tbody>
    </table>

    **Sample output:**

    <table class="content-table" style="width:400px;">
    <tbody>
    <tr> <th style="text-align:center">link_url_protocol</th> <td style="text-align:center">https</td> </tr>
    <tr> <th style="text-align:center">link_url_domain</th> <td style="text-align:center">www.google.in</td> </tr>
    <tr> <th style="text-align:center">link_url_path</th> <td style="text-align:center">/library</td> </tr>
    <tr> <th style="text-align:center">link_url_query</th> <td style="text-align:center">fetch=query</td> </tr>
    <tr> <th style="text-align:center">link_url_fragment</th> <td style="text-align:center">fragment_part</td> </tr>
    <tr> <th style="text-align:center">link_domain_tld</th> <td style="text-align:center">in</td> </tr>
    <tr> <th style="text-align:center">link_domain_country</th> <td style="text-align:center">IN</td> </tr>
    <tr> <th style="text-align:center">subdomain</th> <td style="text-align:center">www</td> </tr>
    </tbody>
    </table>

16. ### Union

    This is used to combine two datasets into a single dataset. Users can drop duplicate rows after combining the two datasets.\
    **Note:** This stage requires selecting a dataset node.

17. ### Windowing

    This is used to perform a window average operation on numeric columns. Users must specify a windowing range greater than 0. Moving averages are a simple and common type of smoothing used in time series analysis and time series forecasting, so this node is well suited to time series datasets.

18. ### Fill Columns

    This is used to change the values of a specific column based on the criteria set by the user. If no criteria are specified in the configuration, all the values in that column are replaced by the user-specified value or method.

    **Example:**\
    For a population dataset of a country that contains details like name, age, address, and eligibility for voting, we can update the “eligible for vote” column to “yes” for all people whose age is greater than 18.

19. ### Rename

    This is used to rename any particular column in the dataset.

    **Example:**\
    For a dataset containing columns such as name, age, address, and eligible_for_vote, we can rename the eligible_for_vote column to voter_eligibility.

20. ### Custom Expression

    This allows users to create custom expressions to manipulate or calculate values in a dataset. Custom expressions can be used to derive new columns, transform existing data, or apply complex calculations based on the data (see the sketch after this list).

    **Example:**\
    For a dataset containing columns such as name, age, address, and salary, we can create a custom expression to calculate the yearly salary from the monthly salary.
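For intuition only, the sketch below shows rough pandas equivalents of three of the nodes above (Date time Transformation, Extract Data, and Custom Expression). QuickML performs these steps through its drag-and-drop interface without any code, and the column names here are made up for illustration.

```python
# Illustrative only: rough pandas equivalents of three transformation nodes.
# QuickML applies these through drag-and-drop stages; no code is required.
import pandas as pd

df = pd.DataFrame({
    "dt": pd.to_datetime(["2021-11-29 11:52:59"]),
    "name": ["Swift 2014"],        # free text containing a year of manufacture
    "monthly_salary": [50000],
})

# Date time Transformation: derive calendar features from a datetime column.
df["dt_year_no"] = df["dt"].dt.year
df["dt_month_no"] = df["dt"].dt.month
df["dt_day_of_week"] = df["dt"].dt.dayofweek

# Extract Data: pull a 4-digit year out of free text with the pattern \d{4}.
df["year_of_manufacture"] = df["name"].str.extract(r"(\d{4})", expand=False)

# Custom Expression: derive a new column from an existing one.
df["yearly_salary"] = df["monthly_salary"] * 12

print(df)
```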
--------------------------------------------------------------------------------
title: "Data Extraction"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.598Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-preprocessing/data-extraction/"
service: "QuickML"
--------------------------------------------------------------------------------

# Data Extraction

1. ### Add Dataset

    This option allows you to choose a dataset different from the current one to perform various operations such as joins, unions, and more.

2. ### Split Dataset

    This is used to divide a single dataset into multiple datasets, following the method specified by the user.

    1. #### Row Based Split:

        The dataset will be split based on the ratio provided by the user.\
        **Example:**\
        If the dataset contains 8 rows and the ratio is set as 1:3:4, the dataset will be split into three datasets with 1, 3, and 4 rows respectively.

    2. #### Column Based Split:

        The dataset will be divided based on the columns selected by the user.

### Data Visualization

--------------------------------------------------------------------------------
title: "Overview"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.598Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-visualization/overview/"
service: "QuickML"
--------------------------------------------------------------------------------

# Data Visualization

## Introduction to Data Visualization:

**Data Visualization** is the graphical representation of data in order to gain insights from complex datasets. It plays a crucial role in understanding the data: it helps identify correlations, patterns, and trends, aids in feature engineering, and more. It is very useful in building an efficient model.

**Examples include:**

- Scatter Plot: Used to visualize the relationship between two variables, such as input features and target outputs. Scatter plots help identify correlations, outliers, and data clusters, and facilitate feature selection.
- Heatmap: Used to evaluate feature-to-feature and feature-to-target correlations, anomalies, and patterns, visually represented with color grading.
- Pie chart: Used to show the proportion of each category of a feature in slices comprising a whole.

A short sketch of two of these chart types follows below.
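QuickML renders all of these charts in its console without code. Purely for intuition, here is a minimal matplotlib sketch (with made-up data) of what a scatter plot and a pie chart convey:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Scatter plot: relationship between two variables (e.g., feature vs. target).
hours = [1, 2, 3, 4, 5, 6]
scores = [35, 45, 50, 62, 70, 78]
ax1.scatter(hours, scores)
ax1.set_xlabel("Hours studied")
ax1.set_ylabel("Exam score")

# Pie chart: proportion of each category of a feature within the whole.
ax2.pie([55, 30, 15], labels=["Category A", "Category B", "Category C"], autopct="%1.0f%%")

plt.tight_layout()
plt.show()
```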
**Data Visualization charts are segregated into the following four categories:**

<table class="content-table quickml-content-table">
<thead> <tr> <th class="w15p">Visualization types</th> <th class="w35p">Brief use case</th> </tr> </thead>
<tbody>
<tr> <td>Composition charts</td> <td>Composition charts show how the parts of the data make up the whole.</td> </tr>
<tr> <td>Distribution charts</td> <td>Distribution charts visually display the distribution or spread of categories in a group.</td> </tr>
<tr> <td>Comparison charts</td> <td>Comparison charts allow you to compare two or more sets of data or variables.</td> </tr>
<tr> <td>Relationship charts</td> <td>Relationship charts visually illustrate the relationship between two or more variables and how they are correlated with each other.</td> </tr>
</tbody>
</table>

--------------------------------------------------------------------------------
title: "Composition Charts"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.598Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-visualization/composition-charts/"
service: "QuickML"
--------------------------------------------------------------------------------

# Composition Charts

Composition charts, also known as part-to-whole charts, are graphical representations used to show how individual parts contribute to the whole of a dataset. These charts help display the proportions, percentages, or absolute values of different categories within a larger entity or dataset. Composition charts are particularly effective for illustrating the distribution of data and emphasizing the relationships between different components.

1. ### Bar Chart

    A bar chart is a type of data visualization that represents categorical data with rectangular bars. Each bar's length or height is proportional to the value it represents. Bar charts are commonly used to compare the frequencies or values of different categories within a dataset. They are particularly useful for visualizing discrete data and understanding patterns or trends in the data.

    **Description:**
    - **Axes:** The horizontal axis (x-axis) usually represents the categories, and the vertical axis (y-axis) represents the value count.
    - **Shape:** Each category is represented by a separate bar.
    - **Sections:** Bars are drawn for each category, with their lengths or heights indicating the data values.

2. ### Pyramid Chart

    A pyramid chart, also known as a triangle diagram, is a visual tool used to showcase hierarchies, workflows, or simple datasets. It is shaped like a triangle divided into horizontal sections. These charts are effective for comparing data, displaying proportions, or illustrating hierarchies and workflows. They are commonly used in business presentations to show steps taken towards a task, market shares, hierarchy levels, or a sequence of actions.

    **Description:**
    - **Shape:** Resembles a three-dimensional triangle, forming a pyramid-like structure.
    - **Axes:** Pyramid charts don't have traditional x and y axes like bar charts. Instead, they rely on the hierarchical structure of the data to convey information.
    - **Hierarchy Representation:** The layers of the pyramid chart represent different levels of a hierarchy, workflow, or a sequence of actions. The widest layer is at the bottom, and the layers become narrower as they move upward.
    - **Sections:** The triangular shape is divided into horizontal sections or layers. Each layer corresponds to a category, step in a process, or level in a hierarchy.
    - **Width and Height:** The height of each layer indicates the proportion of its category. The width of the layers doesn't carry a specific meaning.

3. ### Pie Chart

    A pie chart is a circular visualization used to display categorical data as a set of “slices.” Each slice represents a specific category's proportion within a whole. The size of each slice corresponds to the percentage of data it represents. Pie charts are valuable for showcasing parts of a whole and comparing the relative contributions of different categories.

    **Description:**
    - **Circular Shape:** A pie chart is circular, resembling a full pie. The entire circle represents 100% of the data.
    - **Slices:** The circle is divided into slices, with each slice representing a specific category or data point.
    - **Proportional Representation:** The size of each slice is proportional to the percentage it contributes to the whole data set. Larger slices indicate higher proportions.

4. ### Stacked Bar Chart

    A stacked bar chart is a graphical representation used to show how different components contribute to a whole, while also comparing the contributions of multiple categories across different groups or segments. Each bar in the chart is divided into segments, with each segment representing a specific category's contribution. The total height of each bar indicates the combined contribution of all categories, while the segments show the distribution of contributions within each bar.

    **Description:**
    - **Segmented Bars:** Bars are divided into segments, each representing a category.
    - **Vertical Structure:** Bars are aligned vertically, with the y-axis representing value count and the x-axis showing different groups or segments.
    - **Contribution Comparison:** Stacked bars allow for comparing the contributions of different categories within each group and across multiple groups.

5. ### Stacked Area Chart

    A stacked area chart is a visual representation used to showcase the cumulative contributions of multiple categories or groups over time. It is built upon the same principles as a regular area chart, but with multiple areas stacked on top of each other. Each area represents a category, and the combined height of the areas at any point on the chart reveals the total value at that point.

    **Description:**
    - **Layered Areas:** Areas are layered on top of each other, with each layer representing a category or group.
    - **Axes:** The x-axis represents different categories or groups, and the y-axis represents the cumulative value of the data points within those categories.
    - **Cumulative Visualization:** The chart illustrates how the contributions of different categories accumulate to form the whole over time.

--------------------------------------------------------------------------------
title: "Distribution Charts"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.599Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-visualization/distribution-charts/"
service: "QuickML"
--------------------------------------------------------------------------------

# Distribution Charts

Distribution charts are graphical representations used to visualize the frequency or distribution of values within a dataset.
They provide insights into the spread, variability, and patterns of data points, allowing for a better understanding of how values are distributed across different ranges. Distribution charts are particularly useful for identifying central tendencies, outliers, and the overall shape of data distributions.

1. ### Scatter Plot

    A scatter plot is a graphical representation used to display the relationship between two variables or attributes. It comprises individual data points placed on a Cartesian plane, where each point represents a combination of values for the two variables. Scatter plots are particularly useful for revealing patterns, correlations, or outliers within data.

    **Description:**
    - **Individual Data Points:** Each data point on the chart represents a specific combination of values for the two variables.
    - **Cartesian Plane:** The chart is based on a Cartesian coordinate system, with one variable plotted along the x-axis and the other along the y-axis.
    - **Variable Relationships:** Scatter plots help visualize how changes in one variable correspond to changes in the other, indicating potential correlations or patterns.

2. ### BoxPlot

    A box plot, also known as a box-and-whisker plot, is a graphical representation used to visualize the distribution, spread, and outliers within a dataset. It provides a concise summary of key statistics, such as the median, quartiles, and potential outliers. The chart comprises a rectangular **“box”** and lines extending from it, known as **“whiskers.”**

    **Description:**
    - **Box:** The central rectangle (box) spans the interquartile range (IQR), showing where the middle 50% of the data is concentrated. The line inside the box represents the median.
    - **Whiskers:** Lines (whiskers) extend from the edges of the box to the minimum and maximum values within a defined range, excluding outliers.
    - **Outliers:** Points outside the whiskers are marked as individual data points and can indicate potential outliers.

3. ### Histogram

    A histogram is a graphical representation used to visualize the distribution of numerical data. It divides the data into bins or intervals along the x-axis and displays the frequency or count of data points falling within each bin on the y-axis. Histograms help you understand the shape, central tendency, and spread of a dataset.

    **Description:**
    - **Binning:** Data is divided into intervals (bins) along the x-axis, and the y-axis represents the frequency or count of data points within each bin.
    - **Bar Heights:** The height of each bar corresponds to the number of data points in the respective bin.
    - **Shape Interpretation:** Histograms provide insights into data distribution patterns, such as whether the data is skewed, symmetric, or bimodal.

--------------------------------------------------------------------------------
title: "Comparison Charts"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.599Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-visualization/comparison-charts/"
service: "QuickML"
--------------------------------------------------------------------------------

# Comparison Charts

Comparison charts are graphical representations used to compare data between different categories, groups, or time periods. They provide a visual way to highlight similarities, differences, trends, and relationships within datasets.
Comparison charts are particularly effective for making data-driven decisions, identifying patterns, and conveying insights through easy-to-understand visualizations.

1. ### Area Chart

    An area chart is a graphical representation used to showcase the change or distribution of a dataset over time or along a continuous axis. It resembles a line chart, but the area between the line and the x-axis is filled, providing a visual depiction of the cumulative values. Area charts are particularly effective for revealing trends, comparing multiple datasets, and emphasizing the total values over time.

    **Description:**
    - **Filled Area:** Represents data using filled areas bounded by lines, visually emphasizing the cumulative or individual values over time or another continuous axis.
    - **Axes:** The x-axis of an area chart represents the distinct categories, groups, or time periods, while the y-axis represents the contributions associated with each category or time period.
    - **Trend Visualization:** Area charts help display trends, changes, or distributions of data over time or along a continuous axis.

2. ### Grouped Bar Chart

    A grouped bar chart is a graphical representation used to compare values of multiple categories across different groups or segments. It displays bars for each category within each group, allowing for direct comparisons between categories within the same group and across various groups. Grouped bar charts are effective for illustrating patterns, differences, and relationships in dataset values.

    **Description:**
    - **Bar Groups:** Bars are grouped side by side for each category, and each group represents a distinct segment or category.
    - **Vertical Structure:** The chart is arranged vertically, with the y-axis representing values and the x-axis indicating the groups or segments.
    - **Comparison:** Grouped bars allow for easy visual comparison of values within categories across different groups.

--------------------------------------------------------------------------------
title: "Relationship Charts"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.599Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-visualization/relationship-charts/"
service: "QuickML"
--------------------------------------------------------------------------------

# Relationship Charts

Relationship charts are graphical representations used to visualize the connections, associations, or interactions between different entities or variables within a dataset. These charts help reveal patterns, dependencies, and correlations, making it easier to understand how various elements are related. Relationship charts are particularly useful for identifying trends, uncovering dependencies, and gaining insights into complex data structures.

1. ### Line Chart

    A line chart is a graphical representation used to display trends, changes, or relationships within a dataset over time or along a continuous axis. It connects individual data points with lines, providing a visual depiction of how values evolve over a given period. Line charts are particularly effective for showing patterns, fluctuations, and comparisons between different data series.

    **Description:**
    - **Connected Data Points:** Data points are connected by lines, forming a continuous line that represents the trend or change.
    - **Axes:** The x-axis typically represents distinct categories, groups, or another continuous variable. The y-axis represents the values associated with each category or group.
    - **Trend Visualization:** Line charts help visualize trends, fluctuations, or relationships within data over time or along a continuous axis.

2. ### Bubble Chart

    A bubble chart is a graphical representation used to display three dimensions of data simultaneously. It adds a third variable to a scatter plot, where each data point is represented as a circle (or bubble) with its position determined by two variables (usually on the x and y axes) and its size determined by the third variable. Bubble charts are effective for visualizing relationships, comparing multiple dimensions, and understanding patterns within data.

    **Description:**
    - **Bubbles:** Each data point is represented by a circle (bubble) on the chart.
    - **Position:** The position of the bubble along the x and y axes represents the values of two variables.
    - **Size:** The size of the bubble represents the value of a third variable.
    - **Color:** The color of the bubble represents group differences.

3. ### Heatmap

    A heatmap chart is a graphical representation used to visualize the intensity or magnitude of values within a dataset by using color variations. It presents data in a grid format, where each cell is colored based on the value it represents. Heatmaps are effective for identifying patterns, trends, and variations within large datasets and making data-driven decisions based on visualized patterns.

    **Description:**
    - **Color Mapping:** Each cell in the grid is colored according to a color scale, with the color intensity representing the value's magnitude.
    - **Two-Dimensional Grid:** The data is organized in rows and columns, forming a grid where each cell represents a specific value.
    - **Pattern Detection:** Heatmaps help detect patterns, correlations, and anomalies by visually highlighting areas of high or low values.

--------------------------------------------------------------------------------
title: "Chart Insights"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.599Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/data-visualization/chart-insights/"
service: "QuickML"
--------------------------------------------------------------------------------

## Chart Insights for Data Visualization

QuickML includes a Chart Insights feature that generates key insights in natural language based on visualizations created during data analysis. This helps you interpret and explore the underlying dataset by providing meaningful explanations for each chart.

These insights are generated using the Qwen 2.5 - 7B Vision Language model, a powerful 7-billion-parameter multimodal model designed to understand both text and images. It excels in reasoning tasks and is primarily used to generate accurate outputs based on images uploaded by the end user.

Clicking the Chart Insights icon next to any chart opens a side panel presenting key insights and explanations about the selected chart. Within this panel, you can also copy the generated explanation or regenerate it, letting you refine the insights as needed. This feature provides a deeper understanding of the visualized data without requiring additional manual analysis.
Chart Insights can be accessed in two places:

**While creating a chart**

During chart creation, you will find an option to Generate Key Insights. This displays the newly created chart along with its auto-generated explanation.

For example, in a displayed histogram for the DiabetesPedigreeFunction attribute, the chart shows the occurrence count against bins. A single bin, labeled around 0.03, exhibits a high occurrence count of nearly 760. No other bins are present, indicating a lack of variation in the data within the shown range. QuickML's Chart Explainer interprets this as a highly concentrated distribution around the value 0.03, suggesting that most records cluster tightly at this point with minimal spread across other values.

**After chart creation**

In the Visualization section, you can view explanations for previously created charts. Here, the model provides a breakdown of key points and insights related to the chart.

### Large Language Model

--------------------------------------------------------------------------------
title: "LLM Serving"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.600Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/generative-ai/llm-serving/"
service: "QuickML"
--------------------------------------------------------------------------------

## Introduction to Large Language Model (LLM)

A **Large Language Model (LLM)** is an advanced AI system trained to generate human-like text based on patterns learned from vast amounts of data such as books, websites, and articles. It leverages deep learning techniques, particularly the Transformer neural network architecture, which processes entire sequences of input in parallel—unlike earlier models like Recurrent Neural Networks (RNNs), which handled inputs sequentially. This parallelism enables the use of GPUs for efficient training, significantly accelerating the process.

This training typically involves unsupervised or self-supervised learning, where the model learns to predict the next word in a sentence by recognizing patterns, grammar, and context within the data.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} QuickML's LLM Serving feature is currently available in US, IN, and EU data centers. {{%/note%}}

### Mechanism of LLM

A fundamental aspect of how Large Language Models (LLMs) operate lies in their approach to representing words. Earlier machine learning models relied on simple numerical tables to represent individual words, which made it difficult to capture the relationships between words—particularly those with similar meanings. This limitation was addressed with the introduction of multi-dimensional vectors, known as word embeddings. These embeddings map words into a vector space where semantically or contextually similar words are positioned close to each other.

With these embeddings, transformers convert text into numerical formats using the encoder, allowing the model to grasp context, meaning, and linguistic relationships—such as synonyms or grammatical roles. This processed understanding is then used by the decoder to generate meaningful and coherent text, enabling LLMs to produce responses that reflect the structure and flow of natural language.

### Applications of LLM

LLMs enable real-time AI-powered solutions across various industries.
Here are key applications where LLMs are making a significant impact:

**Chatbots and virtual assistants**

LLMs power AI-driven chatbots and virtual assistants used in customer service, e-commerce, and enterprise support. These assistants can handle customer queries, provide automated responses, and assist with troubleshooting. Businesses use them to enhance user experience and reduce response time.

**Content generation and summarization**

Organizations leverage LLMs to generate high-quality content, such as articles, reports, and product descriptions. Additionally, LLMs summarize lengthy documents or news articles into concise and easy-to-read formats, saving time and effort for users.

**Code generation and debugging**

Developers benefit from LLMs that assist with writing, optimizing, and debugging code in multiple programming languages. These models help streamline software development by providing instant code suggestions and explanations, reducing development time.

**Language translation and localization**

LLMs enhance translation services by providing real-time, context-aware translations for businesses and individuals. This is particularly useful in global communication, enabling seamless interaction between people speaking different languages.

**Image and text analysis (Multimodal AI)**

Advanced multimodal LLMs, such as Qwen 2.5 - 7B Vision Language, process both text and images. These models can describe images, recognize objects, and answer visual questions. Industries such as healthcare, accessibility, and digital content moderation use these capabilities for enhanced automation and decision-making.

## Understanding LLM Serving

LLM Serving involves deploying and running large language models (LLMs) so they can handle real-time requests for predictions or responses. Once an LLM is trained, it can be used to perform a variety of tasks, such as generating text, answering questions, translating languages, or even understanding and analyzing large datasets.

### Purpose of LLM Serving

The primary purpose of LLM Serving is to bridge the gap between a trained model and its real-world usage. It enables organizations to:

- Operationalize AI models in production environments
- Ensure scalability and reliability for handling real-time, concurrent requests
- Offer seamless integration into products, tools, and business processes
- Transform LLMs from research tools into practical, usable systems that drive business outcomes

## Architecture of LLM Serving

A well-structured architecture ensures that an LLM-serving system runs efficiently and responds quickly. It typically consists of the following layers:

- **Client layer**: Receives requests from users or applications, such as questions or text inputs.
- **API layer**: Converts the requests into a format the LLM can understand and sends them to the model.
- **Model layer**: Runs the LLM, processes the request, and generates a response.
- **Data layer**: Handles input and output data, ensuring smooth data flow between the model and users.

### Unique features of LLM Serving in QuickML

QuickML makes it easy to use various large language models (LLMs) within a chat interface.
Users can select different models based on their needs and get real-time responses. Unlike many competing platforms that offer limited or rigid parameter tuning, QuickML empowers you with extensive customization options, ensuring greater flexibility and control.

Here's what sets QuickML apart:

- **Fine-tune responses effortlessly**: Adjust creativity, coherence, and output length directly within the user-friendly interface.
- **Seamless model switching**: Easily switch between multiple LLM models within a single chat interface to find the best fit for your needs.
- **Effortless integration**: Deploy models into third-party applications using the provided endpoint URL for smooth and scalable implementation.
- **Optimized performance & cost control**: Tailor responses to minimize unnecessary token usage, optimizing both speed and cost-efficiency.
- **Enhanced accessibility**: No complex setup required; QuickML makes advanced AI capabilities accessible even for non-technical users.

### Available models in LLM serving

Below are some of the models available in QuickML, along with their capabilities and use cases.

**Qwen 2.5 - 14B Instruct**

A lightweight yet efficient language model designed for general-purpose tasks, such as answering questions, summarizing text, and content generation.

**Qwen 2.5 - 14B Instruct model details**

To view the model details, go to the LLM Serving tab, select Models, and choose the Qwen 2.5 - 14B Instruct model. The model details include:

- **Model Size**: The model consists of 14 billion parameters, enabling high-level language understanding and generation.
- **Training Size**: It has been trained on a massive 18 trillion tokens, providing broad knowledge coverage across diverse domains.
- **Parameters**: The model uses 14 billion learnable weights to generate accurate and context-aware responses.
- **Input Token Limitations**: Supports inputs up to 128,000 tokens, allowing for processing of very large contexts and documents.
- **Endpoint URL**: The API address used to send prompts to the model.
- **OAuth Scope**: The permission level required to access the model.
- **Authentication**: Specifies OAuth as the method for verifying user identity.
- **HTTP Method**: Indicates that API requests must be made using the POST method.
- **Headers**: Requires metadata, including organization ID and OAuth token, for authorization.
- **Integration Section**: Provides sample code to connect your application with the model.
- **Sample Request 1**: A sample input JSON format showing how to send a prompt to the Qwen 2.5 - 14B Instruct model along with parameters like top_p, temperature, and max_tokens.
- **Sample Response 1**: The output JSON format containing the model's generated response based on the prompt.
- **Possible Error Responses**: Lists common HTTP status errors like 400 (Bad Request) and 500 (Internal Server Error).
- **Sample Error Response**: A structured error message that includes a code, message, and optional reason for debugging failed API calls.

You can refer to the *Integrate LLM into your application* [section](http://docs.catalyst.zoho.com/en/quickml/help/generative-ai/llm-serving/#integrate-llm-into-our-applications) for the steps to integrate the model into your application.

**Qwen 2.5 - 7B Coder**

A specialized model built for programming-related tasks, including code generation, debugging, and explanation.

**Qwen 2.5 - 7B Coder model details**

To view the model details, go to the LLM Serving tab, select Models, and choose the Qwen 2.5 - 7B Coder model.
The model details include:

- **Model Size**: The model consists of 7 billion parameters, offering strong performance in natural language understanding and generation.
- **Training Size**: It has been trained on 5.5 trillion tokens, enabling broad domain knowledge and contextual comprehension.
- **Parameters**: The model utilizes 7 billion learnable weights to generate intelligent, context-aware responses.
- **Input Token Limitations**: Supports inputs up to 128,000 tokens, making it suitable for handling lengthy code and complex instructions.
- **Additional Capabilities**: Equipped with features like code generation, reasoning, and extended context understanding.
- **Endpoint URL**: The API endpoint used to send requests and prompts to the deployed model.
- **OAuth Scope**: Defines the access level required to interact with the model (QuickML.deployment.READ).
- **Authentication**: Uses OAuth to securely verify and authorize user access.
- **HTTP Method**: Requires using the POST method to send data to the model.
- **Headers**: Contains necessary metadata like organization ID and OAuth token for authentication.
- **Integration Section**: Offers ready-to-use code snippets to help connect and use the model in your application.
- **Sample Request 1**: Demonstrates how to structure a prompt and configure parameters such as model name, temperature, and max tokens.
- **Sample Response 1**: Shows the structured output generated by the model in response to the sample request.
- **Possible Error Responses**: Includes typical errors like 400 (Bad Request) and 500 (Internal Server Error) indicating failed requests.
- **Sample Error Response**: Displays a JSON-formatted error message with fields like code, message, and reason for troubleshooting.

You can refer to the *Integrate LLM into your application* [section](http://docs.catalyst.zoho.com/en/quickml/help/generative-ai/llm-serving/#integrate-llm-into-our-applications) for the steps to integrate the model into your application.

**Qwen 2.5 - 7B Vision Language**

A 7-billion-parameter vision-language model that can understand both images and text. It is designed for tasks like image captioning, visual question answering, and multimodal reasoning.

To view the model details, go to the LLM Serving tab, select Models, and choose the Qwen 2.5 - 7B Vision Language model. The model details include:

- **Model Size**: The core language model consists of 7 billion parameters.
- **Visual Encoder**: Uses an optimized vision encoder with dynamic resolution and temporal-aware mRoPE.
- **Parameters**: A total of 7 billion trainable weights power the model.
- **Input Token Limitations**: Supports up to three images (≈6k tokens) plus text (≈3k tokens), for a total of ~9k tokens.
- **Additional Capabilities**: Provides multimodal reasoning, multilingual OCR, document & chart question answering, and structured outputs (JSON, bounding boxes, points).
- **Endpoint URL**: The API endpoint used to send text and image prompts to the deployed vision-language model.
- **OAuth Scope**: Requires the QuickML.deployment.READ permission scope to interact with the model.
- **Authentication**: Secured using OAuth authentication for authorized access.
- **HTTP Method**: Uses the POST method to transmit prompt and media data to the model.
- **Headers**: Includes mandatory metadata such as organization ID and access token for secure API requests.
- **Integration Section**: Provides ready-to-use sample code in Python, JavaScript, and other languages for integration.
- **Sample Request 1**: Demonstrates multimodal input (text + base64-encoded images), with configurable parameters like system_prompt, top_k, top_p, temperature, and max_tokens.
- **Sample Response 1**: Shows structured JSON output, extracting details like contact info, skills, education, and projects from the given document images.
- **Possible Error Responses**: Includes standard errors such as 400 (Bad Request) and 500 (Internal Server Error).
- **Sample Error Response**: Returns a JSON with code, message, and reason to help debug API issues.

You can refer to the *Integrate LLM into your application* [section](http://docs.catalyst.zoho.com/en/quickml/help/generative-ai/llm-serving/#integrate-llm-into-our-applications) for the steps to integrate the model into your application.

### Breakdown of Chat Interface

Before looking into how to access the LLM Serving feature in QuickML, let's take a brief look at how the chat interface is structured.

**Model selection**

Located at the top left under the “Chat” tab, this section allows you to choose from the available LLM models. It enables quick model switching within the same interface, making it easier to test different models for specific use cases without leaving the chat interface.

**View model details**

Next to the selected model name, the View Model Details option opens a pop-up window showing detailed information about the model. This includes aspects like model size, input token limits, training data, and integration options. It gives you deeper insights into what the model can do and how it can be used.

**Chat interface (Conversation panel)**

This central area is where all interactions take place. You can type prompts and view model-generated responses in a threaded format. Each response includes icons for actions like copying and regenerating individual responses.

**Parameters panel**

Found on the right side, this panel lets you fine-tune model behavior by adjusting parameters such as Temperature, Top-K, Top-P, and Max Tokens. There is also an instruction field to specify tone, format, or domain-specific requirements. These settings help align the output to enterprise or user-specific needs.

**Chatbox**

At the bottom of the screen lies the chat input box—the main area where you can type your prompts or questions to interact with the selected language model.

### Parameters in LLM serving

QuickML provides a robust and flexible LLM (Large Language Model) serving environment, enabling developers and businesses to fine-tune AI model behavior through a variety of customizable parameters. These settings allow you to control how the model interprets inputs, generates responses, and aligns its output with your desired tone, structure, and purpose. Whether you're building a legal advisor bot, a content creation tool, or a customer service assistant, QuickML's parameter configuration options ensure your application delivers responses that are not only relevant, but also contextually tailored.

Unlike other platforms that restrict user control, QuickML prioritizes transparency and adaptability—allowing you to balance creativity with precision, determine the ideal response length, and guide the model to behave in ways that best serve your domain-specific needs.

Here's a breakdown of the available parameters:

**Model name**

The model field allows you to copy the model name for easy reference and deployment.
**Instructions**

In the Instructions field, you can enter detailed instructions guiding the model's response style and content generation approach. This enhances output relevance and consistency for specific applications. For example, a legal firm can enter instructions like "Provide responses in a formal legal tone with citations where applicable.", ensuring AI-generated content aligns with compliance and professional standards.

**Temperature**

Controls the creativity level of the model's responses:

- **Lower values (e.g., 0.0 - 0.3)**: More deterministic and precise responses.
- **Higher values (e.g., 0.7 - 1.0)**: Increases variability and creativity, making responses more diverse and engaging.

For example, a financial institution setting the Temperature to 0.2 would get precise responses like "The Federal Reserve increased interest rates by 0.25%.", while setting it to 0.8 might generate "The Federal Reserve's recent rate hike of 0.25% aims to curb inflation, affecting mortgage and loan rates."

**Top-K**

Determines the number of top probable words considered for the next token prediction:

- **Lower values (e.g., 10 - 20)**: Produces more predictable and controlled responses.
- **Higher values (e.g., 50 - 100)**: Enhances diversity and variation in generated text.

For example, a corporate HR chatbot with Top-K set to 10 may generate standard responses like "We value diversity in our hiring process.", whereas 50 allows for richer, more engaging responses like "At [Company Name], diversity is at the core of our hiring process, fostering innovation and inclusivity."

**Top-P**

Top-P sampling, also known as nucleus sampling, generally ranges from 0.0 to 1.0. This parameter sets a cumulative probability threshold for the next token selection:

- **Lower values (e.g., 0.1 - 0.3)**: Produces highly deterministic responses.
- **Higher values (e.g., 0.8 - 0.9)**: Enables more diverse text generation while maintaining coherence.

For example, a customer support AI using a Top-P setting of 0.3 might generate straightforward responses like "Your order will arrive in 3 days.", while 0.9 could lead to "Your order is expected to arrive within 3 days. You'll receive a tracking update soon! Let us know if you need further assistance."

**Max tokens**

Defines the maximum number of tokens (between 1 and 4096) that the model can generate in a response. QuickML allows precise control over response length, optimizing cost and latency. For example, a compliance team setting Max Tokens to 50 ensures concise regulatory summaries, like "GDPR mandates data protection for EU citizens.", while 500 could generate a comprehensive legal analysis detailing key provisions and compliance measures.
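These same parameters apply when a served model is called over its REST endpoint (see *Integrate LLM into your Applications* below). The following is a minimal sketch only: the exact endpoint URL, the header name for the organization ID, and the payload field names should be copied from your model's Model Details page, and all values below are placeholders.

```python
# Minimal sketch of calling a served model over HTTP. The endpoint URL,
# organization-ID header name, and payload field names are placeholders;
# copy the real ones from the Model Details page of your model.
import requests

ENDPOINT_URL = "https://<endpoint-from-model-details>"
headers = {
    "CATALYST-ORG": "<your-org-id>",             # organization ID (header name assumed)
    "Authorization": "Zoho-oauthtoken <token>",  # OAuth token with the QuickML.deployment.READ scope
}
payload = {
    "prompt": "What are the eligibility criteria for a home loan?",
    "temperature": 0.2,  # low creativity: factual, deterministic wording
    "top_k": 20,         # consider only the 20 most probable next tokens
    "top_p": 0.3,        # tight nucleus: focused, predictable responses
    "max_tokens": 200,   # cap response length for cost and latency
}

response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
response.raise_for_status()   # surfaces the 400/500 errors described in Model Details
print(response.json())        # the model's generated response
```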
### Working of LLM Serving in QuickML

1. When a user submits a request in QuickML's LLM serving chat interface, the selected model (such as Qwen 2.5 - 14B Instruct) processes the input.
2. The model analyzes the context, intent, and meaning of the query to generate a relevant and coherent response.
3. The response is formulated based on the model's training data, ensuring accuracy and contextual relevance.
4. Users can refine their queries or adjust model parameters to influence the response style, tone, and level of detail.

**Key Points to be noted:**

1. The LLM serving feature is available to users who have access to the QuickML platform.
2. You can seamlessly switch between multiple LLM models within the QuickML chat interface.
3. At present, you cannot erase previous chat conversations or open a new chat window; all conversations remain within the same chat interface.
4. Chats are user-specific, ensuring that one user cannot access another user's conversation.
5. You can upload images and type your query using the Qwen 2.5 - 7B Vision Language model for generating responses.
6. QuickML's LLM serving does not use your data for model training. All responses are generated based on its pre-trained data.

### Accessing LLM Serving in QuickML

LLM Serving can be accessed in QuickML by following the steps below:

1. Log in to the {{%link href="https://catalyst.zoho.com/quickml.html" %}}QuickML{{%/link%}} platform.
2. Under the *Generative AI* section, select **LLM Serving**.
3. Navigate to the **Chat** tab.
4. Select the desired LLM model from the drop-down.
    - You can select either *Qwen 2.5 - 14B Instruct*, *Qwen 2.5 - 7B Coder*, or *Qwen 2.5 - 7B Vision Language*.
5. Start entering your queries in the chat interface.

**Note:** The chat interface generates responses based on the default parameter configuration. However, you can adjust the settings as needed to suit your requirements.

**To configure the parameter settings**

1. Under the *Generative AI* section, select the **LLM Serving** tab.
2. Navigate to the **Chat** tab.
3. Under *Parameters* in the right panel, configure the following:
    - Enter detailed instructions on how the model should perform.
    - Control the creativity level of the model's responses by adjusting the **Temperature**.
    - Adjust the number of top probable words considered for the next token prediction in the **Top-K** field.
    - Set the cumulative probability threshold for the next token selection in the **Top-P** field.
    - Define the maximum number of tokens the model generates per response in the **Max Tokens** field.
4. Click **Save**.

Let's consider a sample use case to understand how these parameters can be configured in a real-time business scenario.

**Configuring parameters for a customer support chatbot in a financial institution**

**Usecase:** A bank wants to assist customers with general banking inquiries while ensuring responses are accurate, concise, and aligned with regulatory guidelines.

**Step 1**: Adjusting Parameters for Optimal Performance

<table class="content-table quickml-content-table">
<thead> <tr> <th>Parameter</th> <th>Configuration</th> <th>Reason</th> </tr> </thead>
<tbody>
<tr> <td>Model</td> <td>Qwen 2.5 - 14B Instruct</td> <td>Selected for its balance of accuracy and efficiency.</td> </tr>
<tr> <td>Instructions</td> <td>Provide responses in a formal and professional tone. Ensure compliance with banking regulations. Keep answers concise but informative.</td> <td>Ensures consistency in customer interactions.</td> </tr>
<tr> <td>Temperature</td> <td>0.2</td> <td>Keeps responses factual and avoids unnecessary creativity.</td> </tr>
<tr> <td>Top-K</td> <td>20</td> <td>Limits word selection to the most relevant choices, reducing variability.</td> </tr>
<tr> <td>Top-P</td> <td>0.3</td> <td>Keeps responses focused and predictable, ensuring regulatory compliance.</td> </tr>
<tr> <td>Max Tokens</td> <td>200</td> <td>Prevents overly long responses while maintaining enough detail for clarity.</td> </tr>
</tbody>
</table>
**Step 2**: Engaging with the Chat Interface

Once the parameters are set, the chatbot is ready for use. Here's an example interaction:

User: "What are the eligibility criteria for a home loan?"

Chatbot Response (Qwen 2.5 - 14B Instruct, with adjusted parameters):

Eligibility criteria for a home loan typically include several key factors to ensure the borrower's ability to repay the loan. These criteria generally encompass:

- **Age**: Borrowers must be within a specific age range, usually between 21 and 65 years old at the time of loan maturity.
- **Income**: A stable and verifiable income is required to demonstrate the ability to repay the loan. Lenders may require proof of income through salary slips, bank statements, or tax returns.
- **Credit Score**: A good credit score is essential as it reflects the borrower's creditworthiness and repayment history.
- **Employment**: Borrowers should be employed with a stable job or have a consistent business income.
- **Loan-to-Value (LTV) Ratio**: This ratio compares the amount of the loan to the value of the property. Lenders have specific LTV limits based on the type of property and other factors.

### Integrate LLM into your Applications

Beyond utilizing LLM models within the QuickML platform, you can seamlessly integrate them into your own applications using the provided endpoint URL. This allows businesses to incorporate AI-powered responses into various business processes, such as customer support bots, content automation tools, and data analysis applications.

To enable secure and efficient integration, QuickML supports OAuth-based authentication for access token generation. You can refer to this {{%link href="https://www.zoho.com/accounts/protocol/oauth/self-client/overview.html"%}}documentation{{%/link%}} for details on the different types of OAuth applications and the steps required to generate and manage access tokens.

**To get the endpoint URL**

1. Navigate to the **Generative AI** section within {{%link href="https://catalyst.zoho.com/quickml.html" %}}QuickML{{%/link%}}.
2. Select the **LLM Serving** tab.
3. In the *Models* tab, choose the desired model.
4. In the *Model Details* pop-up window, scroll to the **API Details** section to get the endpoint URL.

--------------------------------------------------------------------------------
title: "RAG"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.601Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/generative-ai/rag/"
service: "QuickML"
--------------------------------------------------------------------------------

### What is Retrieval-Augmented Generation (RAG)?

To understand RAG, let's consider a simple analogy: A user contacts tech support with an error code—“7C05”—displayed on their printer.
Though experienced, the support agent hasn't encountered that specific code before. Instead of making an educated guess, the agent consults the official troubleshooting manual for that particular printer model to find an accurate solution. In this scenario, the agent represents the language model, and the manual represents the external knowledge source that RAG retrieves, allowing the agent to respond correctly without needing prior exposure to that exact issue.

In technical terms, RAG is an advanced technique that enhances the accuracy, relevance, and reliability of large language model (LLM) outputs by incorporating information retrieved from authoritative, external knowledge sources. Rather than relying solely on the data available during training, RAG enables a model to consult up-to-date and domain-specific content at the time of inference, grounding its responses in verifiable and contextually appropriate information.

LLMs are built upon neural network architectures and trained on vast volumes of textual data. Their performance is largely driven by billions of parameters, which capture generalized patterns in human language. This parameterized knowledge equips LLMs to perform a wide range of tasks, such as answering questions, translation, and text completion, with impressive fluency. However, these models can be limited when asked to generate responses that require specific, detailed, or time-sensitive information beyond their training scope.

RAG addresses this limitation by optimizing the generative process to first retrieve relevant content from a curated knowledge base—such as internal company documents, domain-specific databases, or trusted online sources—and then generate responses that are both informed and contextually grounded. This approach does not require any retraining of the base model, which makes it a cost-effective and scalable method for tailoring LLM capabilities to specific organizational needs or specialized domains.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} QuickML's RAG feature is currently available in US, IN, and EU data centers. {{%/note%}}

### Benefits of RAG

RAG offers a range of significant advantages that enhance the effectiveness, flexibility, and trustworthiness of generative AI solutions, particularly for organizations looking to implement domain-specific intelligence without the overhead of retraining large models.

**Cost-efficient deployment**

Most chatbot and AI application development begins with foundation models: large language models (LLMs) trained on extensive, generalized datasets and typically accessed via APIs. Tailoring these models through retraining to accommodate organizational or industry-specific content is often prohibitively expensive and resource-intensive. RAG provides a more scalable and economical alternative. By letting the model retrieve and reference external data during inference, RAG enables businesses to integrate specific knowledge without modifying the underlying model, which makes generative AI solutions more attainable and cost-effective.

**Access to timely, dynamic information**

Maintaining up-to-date responses is a major challenge for static models, as their training data quickly becomes outdated. RAG addresses this issue by allowing users to feed the model with continuously updated information sources. Whether it's the latest scientific research, breaking news, or real-time social media feeds, RAG equips generative AI models with access to current data.
This ensures responses remain relevant and accurate, even in rapidly evolving domains. **Improved transparency and user trust** One of the key strengths of RAG is its ability to provide source-based answers. By referencing external documents and including citations, it offers transparency into where the information originated. Users can review these sources to verify the content or explore further, enhancing their trust in the system. This traceability makes generative AI responses not only more credible but also more aligned with compliance or quality assurance standards in regulated industries.   ### Applications of RAG Retrieval-Augmented Generation (RAG) is increasingly being adopted across industries to enhance the capabilities of large language models (LLMs), especially in areas requiring domain-specific knowledge, real-time information, or transparent outputs. Below are several key application areas where RAG delivers significant impact: **Enterprise knowledge assistants** Enterprises are increasingly turning to RAG-based assistants to streamline internal knowledge access for employees. These AI-powered tools are designed to query structured and unstructured data from company-specific repositories, such as internal wikis, SOPs, HR guidelines, compliance checklists, IT documentation, and onboarding materials. Unlike static chatbots, which rely on predefined rules or outdated training data, RAG dynamically fetches the most relevant documents, ensuring employees receive accurate, up-to-date, and policy-aligned responses. This leads to higher productivity, reduced dependency on support teams, and a unified source of truth across departments. **Customer support automation** Customer service platforms can significantly elevate the user experience by integrating RAG to deliver more intelligent and context-aware interactions. By retrieving content directly from product manuals, troubleshooting guides, warranty documents, and helpdesk knowledge bases, RAG-powered chatbots can resolve customer queries more effectively—even those related to recently launched or rarely encountered products. This eliminates the constant need to retrain models on new content and enables scalable support operations with minimal human intervention. The inclusion of traceable sources also builds customer trust and increases customer satisfaction. **Legal and regulatory research** Legal professionals operate in a domain where accuracy, citation, and traceability are paramount. RAG-based tools empower legal teams by enabling direct retrieval of clauses from statutes, prior case law, government policies, or internal compliance documents. Instead of sifting through voluminous texts manually, users can obtain concise summaries, cross-referenced with the original documents for legal soundness. These applications are especially useful in drafting legal opinions, conducting regulatory audits, or preparing responses for compliance reviews, ensuring that every output is grounded in verifiable legal context. **Healthcare and medical decision support** In the healthcare sector, timely and evidence-based decision-making can directly impact patient outcomes. RAG systems support clinicians, researchers, and administrators by referring to data from medical sources such as treatment guidelines and institutional clinical records. 
By incorporating relevant information into its responses, a RAG model can assist with clinical decision support, patient-specific care recommendations, diagnostic differentials, or drug interactions, all while citing source material. This approach enhances confidence in AI-driven recommendations.

**Scientific research and technical writing**

RAG plays a vital role in accelerating research workflows and ensuring content precision in technical writing. Researchers can query expansive academic databases to summarize the state of the art, generate literature reviews, or validate hypotheses with current findings. Similarly, technical writers can use RAG to draft product documentation based on the latest data and user manuals. Whether summarizing scientific developments or drafting reports, RAG ensures the generated content is supported by reliable, domain-specific sources, thereby maintaining both accuracy and credibility.

### How does RAG work?

RAG integrates two key processes: retrieval and generation. Instead of relying solely on the knowledge encoded during a language model's training, RAG supplements it by dynamically pulling in relevant information from external sources (like databases, documents, or knowledge bases) at inference time. This approach enables the model to deliver responses that are more accurate, current, and contextually appropriate. Here's a simplified flow of how RAG works (a minimal code sketch follows the list):

- **User input/query**: The process begins when a user inputs a question or prompt.
- **Retrieval**: The system uses the query to search a large external knowledge source (e.g., vector database, document store, or search index) to retrieve the top-k most relevant pieces of information (often called “passages” or “contexts”). This is typically done using semantic search powered by embeddings.
- **Fusion/contextualization**: The retrieved passages are then passed along with the original query to a language model. These passages provide context that grounds the generation in factual or domain-specific knowledge.
- **Generation step**: The language model (e.g., a transformer-based LLM) takes both the query and the retrieved documents as input and generates a coherent and informed response.
- **Citation/traceability**: Since RAG models rely on actual documents for answers, they can provide traceable references or links to the source material, increasing transparency and trust.
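The toy sketch below walks through that flow end to end. It is purely illustrative and not QuickML's internal implementation: a real system would use learned embeddings, a vector database, and an LLM for generation, whereas this sketch substitutes simple word-overlap scoring and a stubbed generator so it stays self-contained and runnable.

```python
# Toy, self-contained illustration of the RAG flow above. NOT QuickML's
# internal implementation: word-overlap scoring stands in for embeddings,
# and the "generator" just assembles the augmented prompt with citations.
import math
from collections import Counter

DOCS = {
    "doc-1": "Refunds are processed within 14 days of the return request.",
    "doc-2": "Error 7C05 indicates a paper jam in the rear tray of the printer.",
    "doc-3": "Annual leave requests must be approved by a reporting manager.",
}

def score(query: str, text: str) -> float:
    """Cosine similarity over bag-of-words counts (a stand-in for embeddings)."""
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    dot = sum(q[w] * t[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in t.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Retrieval step: fetch the top-k most relevant passages for the query."""
    ranked = sorted(DOCS.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    return ranked[:k]

def generate(query: str, contexts: list[tuple[str, str]]) -> str:
    """Fusion + generation: build the augmented prompt a real pipeline would
    send to an LLM. Document IDs double as citations for traceability."""
    context_block = "\n".join(f"[{doc_id}] {text}" for doc_id, text in contexts)
    cited = ", ".join(doc_id for doc_id, _ in contexts)
    return f"Answer using only the context below.\n{context_block}\n\nQuestion: {query}\n(cites: {cited})"

query = "What does printer error 7C05 mean?"
print(generate(query, retrieve(query)))
```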
### What makes RAG in QuickML unique?

QuickML's RAG is designed to offer a seamless, secure, and transparent experience when generating responses powered by a knowledge base. Here's what sets RAG in QuickML apart:

- Once a response is generated, a detailed response breakdown becomes available. This breakdown indicates which parts of the retrieved documents were referenced during the generation process and clearly shows the origin of the supporting information. It enhances transparency by allowing you to see exactly which documents contributed to the final output.
- QuickML's RAG implementation leverages the Zoho ecosystem—such as WorkDrive and Zoho Learn—to import relevant documents into the knowledge base seamlessly. This integration ensures that the most accurate and context-specific information is always available to support your queries.

### Available models in QuickML's RAG

RAG in QuickML leverages the Qwen 2.5 14B Instruct model to deliver contextual and relevant responses. Qwen 2.5 14B Instruct is a highly capable AI model engineered to perform a diverse range of language tasks with consistency and precision. Trained on large-scale, high-quality datasets, it delivers reliable outcomes and demonstrates strong performance across multiple benchmarks. What sets Qwen 2.5 14B Instruct apart is its ability to quickly adapt and respond to dynamic real-world scenarios, making it exceptionally suited for enterprise and production-grade applications.

**Qwen 2.5 14B Instruct model details**

To view the model's details, go to the RAG tab and click the View API option in the top-right corner of the chat interface. The model details include:

- **Model size**: Comprises 14 billion parameters, enabling sophisticated comprehension and language generation capabilities.
- **Training size**: Trained on 18 trillion tokens to ensure extensive coverage across a wide range of domains.
- **Parameters**: Utilizes 14 billion trainable weights for generating nuanced and highly relevant responses.
- **Input token limitations**: Supports an input context length of 128K tokens, allowing for deep context and long document references during RAG execution.
- **Endpoint URL**: This is the URL used to send API requests.
- **OAuth Scope**: QuickML.deployment.READ; defines the access level required to use the deployment.
- **Authentication**: OAuth is used to verify the client identity securely.
- **HTTP method**: POST; all API calls must be made via the POST method.
- **Headers**: Include required metadata and the authorization token to authenticate requests.
- **Sample request**: A predefined JSON format shows how to structure your input prompt.
- **Sample response**: Displays the model's output, including contextually grounded text generated based on both the prompt and the retrieved information from KB documents.

You can refer to the *Integrating RAG into your applications* section for the steps to integrate the model into your application.

### Understanding the RAG interface in QuickML

The RAG feature in QuickML enables you to get document-based responses using Qwen 2.5 14B Instruct. The interface is designed to make it easy to upload documents, ask questions, and trace the source of each response. Here's a breakdown of how it looks.

**Model selection**

At the top of the chat panel, users can select the available model. Currently, Qwen 2.5 14B Instruct is supported for RAG-based conversations. This model is optimized for generating grounded and context-aware responses from uploaded documents.

**Chat interface (conversation panel)**

At the center is the core interaction space where you enter your queries and view responses. Each AI-generated answer is threaded beneath the user input and can include:

- Action icons for copying or regenerating the response
- A “View Response Breakdown” option, which gives insights into how the answer was generated and from which documents

**Response breakdown panel**

When you click on “View Response Breakdown”, a pop-up window appears, offering detailed insights:

- **Thought process**: Shows the specific content snippets referenced during answer generation, along with their corresponding sources and document IDs.
- **Citations**: Lists the documents used and highlights the exact sections that were used to generate the response.

**Document store**

On the right side of the interface, the document store lists all documents currently active for retrieval. Each document shows its name, format, and size. Users can include relevant documents for response generation from the knowledge base by selecting the Add Documents option.
**Adding documents**

Clicking **Add Documents** opens a panel labeled Add Documents From Knowledge Base. Here, users can either select existing documents or upload new ones. The panel includes a search bar to locate documents in the knowledge base quickly.

**Upload options**

When uploading a new document, three convenient methods are supported:

- **From desktop**: Supports .pdf, .docx, and .txt files, with a size limit of 500 KB per file.
- **From WorkDrive**: Allows users to import files directly from cloud storage.
- **Via Zoho Learn link**: Enables document import by pasting the URL of the desired Zoho Learn article.

Each uploaded file is assigned a unique ID and becomes available for retrieval during conversations.

**Uploading the documents to the knowledge base**

To ensure that documents appear in the Add Documents From Knowledge Base panel, the relevant files must first be uploaded to the Knowledge Base repository. Note that when you upload a file via **New Document** in the *Add Documents From Knowledge Base* panel, the document is automatically added to the knowledge base.

The Knowledge Base functions as a centralized document repository where users can upload and manage content critical for context-aware generation. Integration with Zoho ecosystems like WorkDrive and Zoho Learn enables seamless import of internal files, manuals, FAQs, and other resources. This setup ensures that the Knowledge Base remains current and comprehensive, supporting the model’s ability to provide source-backed responses tailored to user queries.

**To upload documents to the knowledge base**

1. Navigate to the *Knowledge Base* tab in the **QuickML** platform.
2. Click **Upload Document**.
3. Choose any of the following ways to upload the document:
   - Select **Upload a file** to upload the document from your desktop. Here, you have to enter the name and select the file to upload from your local system.
   - Select **Zoho WorkDrive** to import the document from WorkDrive. Here, you have to provide the document name and the WorkDrive link of your document.
   - Select **Zoho Learn** to import an article from Zoho Learn. Here, you have to enter the document name, select whether you're importing an article or a manual, and provide the article link.
4. Once uploaded, the document will appear in the knowledge base repository. From there, you can **delete** it or **copy** the document ID as needed.

**Notes:**

- Uploaded documents in the document store can be deleted by hovering over the desired document and clicking the Delete icon.
- When a query is submitted with specific knowledge base documents added to the document store, RAG limits its search to only those added documents. If no documents are added to the document store, RAG automatically searches across all active knowledge base documents. This ensures that the query is answered using the most relevant available content without requiring manual document selection.

**View API**

The chat interface also includes a View API option in the top-right corner. Selecting this opens a panel that displays detailed information about the current model, such as its size, token limits, endpoint URL, and authentication requirements.

### How RAG works in QuickML

Without RAG, a large language model (LLM) generates responses solely based on its training data. With RAG, the process is enhanced by introducing an external information retrieval step, allowing the model to access fresh, relevant data at the time of the query.
Here's how it works in brief:

**Ask a question**

The user begins by entering a natural-language query, for example, "What's our company's return policy?" or "Summarize the latest updates in our product documentation." This query acts as an initial input that triggers the RAG pipeline.

**Retrieve relevant information from your documents**

Instead of relying solely on the language model's pre-trained knowledge, RAG connects to your knowledge base (like internal files, PDFs, etc.). Using advanced embedding and semantic search techniques, the system identifies and fetches the most relevant pieces of information based on the user's question.

**Refine results with re-ranking**

Once relevant documents are retrieved, a re-ranking process is applied to better align the selected content with the user’s intent. This step evaluates multiple signals—such as semantic similarity and keyword presence—to reorder the results and surface the most contextually appropriate content for the query (a small code sketch of this idea follows these steps).

**Pass it to the LLM**

The retrieved content is then combined with the original query and sent as an augmented prompt to the LLM (Qwen 2.5 14B Instruct). This step allows the LLM to read the context-specific information and use it to generate an informed response.

**Generate context-aware answers**

With both the user query and supporting data in hand, the LLM crafts a response that's not only relevant but based on the actual source material. This approach helps ensure responses are based on verifiable information and aligned with the source material. It also allows users to trace answers back to original documents, which promotes transparency and builds confidence in the system.
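To make the re-ranking step concrete, here is a small illustrative sketch. The blend weights and the two signals used (a first-pass semantic score plus literal keyword overlap) are assumptions chosen for demonstration, not QuickML's actual ranking formula.

```python
# Illustrative re-ranking: retrieved passages are re-scored with a blend of
# signals before being passed to the LLM. The 0.7/0.3 weights and the two
# signals here are demonstration assumptions, not QuickML's real formula.

def keyword_overlap(query: str, passage: str) -> float:
    """Fraction of query terms that literally appear in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0

def rerank(query: str, candidates: list[tuple[str, float]]) -> list[str]:
    """candidates: (passage, semantic_score) pairs from the first-pass retriever.
    Returns the passages reordered by a weighted blend of both signals."""
    blended = [
        (passage, 0.7 * semantic + 0.3 * keyword_overlap(query, passage))
        for passage, semantic in candidates
    ]
    return [passage for passage, _ in sorted(blended, key=lambda x: x[1], reverse=True)]

candidates = [
    ("Our return policy allows refunds within 30 days.", 0.62),
    ("Shipping times vary by region and carrier.", 0.58),
]
print(rerank("What is the return policy?", candidates))
```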
**Important notes**

- The RAG feature is available to users who have access to the QuickML platform.
- Chats are user-specific, ensuring that one user cannot access any other user's conversation. Currently, chat history is not supported. Conversations remain visible until the page is refreshed; once refreshed, all chats are cleared.

### Accessing RAG in QuickML

RAG can be accessed within QuickML via these steps:

1. Log in to your **QuickML** account.
2. Under the *Generative AI* section, select **RAG**.
3. Under *Document Store* in the right panel, click **Add Documents**. This will open a panel labeled Add Documents From Knowledge Base.
4. In the *Add Documents From Knowledge Base* panel, select existing files or click **New Document** to upload from your desktop, WorkDrive, or Zoho Learn.
5. Enter your queries in the chat interface. The model will retrieve relevant information from the knowledge base and provide context-aware responses, including the detailed response breakdown.

**Note**: Responses are generated based on the documents stored in the knowledge base. You can add new documents as necessary to ensure the model has the required data to answer queries.

Let's consider some sample use cases to understand how RAG can be used in a real-time business scenario.

**Sample use case 1: Implementing RAG for employee policy assistance in an organization**

A company aiming to improve employee understanding of ethics and training policies begins by gathering all relevant documents—such as the Code of Conduct, Office Etiquette Policy, Anti-Harassment Policy, Workplace Behavior Guidelines, and Mandatory Training Manuals—and uploading them into QuickML's RAG knowledge base. From the RAG chat interface, the administrator accesses the **Add Documents From Knowledge Base** panel to import or select existing files. Once uploaded, the documents form a centralized, structured repository that Qwen 2.5 14B Instruct can reference during user interactions.

When an employee asks a question like *“What is the employee ethics policy?”*, QuickML’s RAG system conducts a semantic search across the uploaded policies, retrieves the most relevant information, and combines it with the query. This context is then sent to Qwen 2.5 14B Instruct, which generates a concise and contextually accurate answer. The user can also view a detailed breakdown of which documents and specific sections were used to generate the response—ensuring transparency and trust in the information provided.

**Sample use case 2: Enhancing customer support with RAG for a SaaS platform**

A SaaS company aiming to improve its customer support experience leverages RAG to build an intelligent help assistant. The support team collects all relevant resources—product FAQs, user manuals, troubleshooting guides, release notes, and API documentation—and uploads them into the RAG knowledge base within QuickML. Through the RAG chat interface, the admin uses the **Add Documents From Knowledge Base** panel to populate the document store with this content, ensuring it stays up to date with each product release.

Once set up, the RAG system becomes a central support repository. When a user submits a query like *“What are the possible error messages?”*, the system semantically searches the document store for the most relevant information, retrieves matching content, and forwards both the query and supporting context to the language model. The model then generates an accurate, easy-to-follow solution, referencing the precise documentation sections. Users can view the source material behind the response for transparency and further reading, reducing ticket volume and improving self-service effectiveness.

### Integrating RAG into your applications

You can integrate QuickML's RAG into your applications using the provided endpoint URL. This enables businesses to enhance customer support tools, internal chatbots, and document automation systems with context-rich AI capabilities. To enable secure and efficient integration, QuickML supports OAuth-based authentication for access token generation. You can refer to this {{%link href="https://www.zoho.com/accounts/protocol/oauth/self-client/overview.html"%}}documentation{{%/link%}} for details on different types of OAuth applications and the steps required to generate and manage access tokens.

**To get the endpoint URL**

1. Navigate to the **Generative AI** section in QuickML.
2. Select the **RAG** tab.
3. Click the **View API** option in the top right corner of the chat interface.
4. In the *Model Details* pop-up window, scroll to the **API Details** section to get the endpoint URL.
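With the endpoint URL in hand, a RAG query is a single POST request. The sketch below is illustrative only: the URL and the payload shape (including the `document_ids` field) are assumptions, and the authoritative request/response formats are the **Sample request** and **Sample response** shown under API Details.

```python
# Minimal sketch of a RAG API call. ASSUMPTIONS: the URL and the payload
# shape are placeholders; copy the real formats from View API > API Details.
import requests

ENDPOINT_URL = "https://<your-rag-endpoint>"  # placeholder: copy from API Details
ACCESS_TOKEN = "<oauth-access-token>"         # OAuth scope: QuickML.deployment.READ

payload = {
    "query": "What is the employee ethics policy?",
    # Hypothetical field: restrict retrieval to specific knowledge base
    # documents by ID (copied via Copy Document ID). As the docs note,
    # omitting a document selection searches all active KB documents.
    "document_ids": ["<document-id-1>", "<document-id-2>"],
}

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Zoho-oauthtoken {ACCESS_TOKEN}"},  # Zoho's usual OAuth header scheme
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())  # generated answer plus any citation/breakdown metadata
```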
-------------------------------------------------------------------------------- 
title: "Knowledge Base" 
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." 
last_updated: "2026-03-18T07:41:08.602Z" 
source: "https://docs.catalyst.zoho.com/en/quickml/help/generative-ai/knowledge-base/" 
service: "QuickML" 
--------------------------------------------------------------------------------

### Introduction to knowledge base

Organizations strive to manage and harness vast amounts of data and information. Whether it's for internal knowledge sharing, training materials, customer support, or feeding AI models, having a centralized repository of data is essential. This is where a knowledge base (KB) comes into play.

A knowledge base is a structured repository that allows organizations to store, manage, and retrieve information efficiently. It typically includes documents, FAQs, guides, standard operating procedures, and more. These documents are curated to ensure accessibility, reliability, and contextual relevance. A well-maintained knowledge base helps reduce repetitive queries, supports onboarding, and empowers users with instant access to critical content. For example, a support agent might refer to the knowledge base to resolve a client issue more quickly, while a chatbot trained on the same repository can provide instant answers to end-users. In the context of QuickML, this knowledge base serves an even more powerful purpose—it enriches AI models, making machine responses smarter, contextual, and more accurate.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} QuickML's Knowledge Base feature is currently available in US, IN, and EU data centers. {{%/note%}}

### The role of knowledge base in QuickML

In QuickML, the knowledge base is not just a document storage system; it is a foundational layer that interacts directly with AI capabilities. Once documents are uploaded, they are intelligently indexed and leveraged as contextual data sources. This becomes especially crucial for features like {{%link href="https://docs.catalyst.zoho.com/en/quickml/help/generative-ai/rag/" %}}Retrieval-Augmented Generation (RAG){{%/link%}}, where AI models tap into the knowledge base to retrieve and deliver accurate, context-rich answers in response to user queries. By enabling documents to be referenced by ML models dynamically, QuickML ensures that responses generated via APIs or chat interfaces are contextually relevant, factually accurate, and grounded in real-time organizational knowledge. This fusion of data and AI transforms passive documents into active, living inputs for intelligent decision-making systems.

### Accessing the knowledge base

In QuickML, you can access the **Knowledge Base** tab directly from the left navigation panel under the *GENERATIVE AI* section, alongside the LLM Serving and RAG options, highlighting its strategic relevance in AI pipeline development.

### Uploading documents to the knowledge base

To populate your knowledge base, QuickML provides three distinct upload methods. Each method offers flexibility based on your document’s source.

**Upload a File (From Local Desktop)**

This is the most straightforward method and is used to upload documents directly from your computer.

<u>**Fields required**</u>

- **Name**: A title for the document.
- **Document file**: Drag and drop or select a file (.pdf, .docx, .txt; Max 500 KB per file).

**Ideal for**: One-time static document uploads like HR policies, SOPs, or technical whitepapers.

**Zoho WorkDrive**

This option lets you import documents hosted on Zoho WorkDrive via URLs. It includes options for dynamic sync to keep documents updated.

<u>**Fields required**</u>

- **Name**: The document's title in the knowledge base.
- **Link**: A direct WorkDrive link to the document (.pdf, .docx, .txt; Max 500 KB per file).
- **Sync frequency**: Choose how frequently the system should fetch updates (e.g., None, Hourly, Daily, Weekly, or Custom Frequency).
**Ideal for**: Living documents like design specs, onboarding manuals, or internal wiki pages updated regularly.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} You must be part of an active Zoho WorkDrive account to use this option. If not, you will not be able to upload the documents through this option. {{%/note%}}

**Zoho Learn**

Use this to pull documentation from your Zoho Learn hub. This is useful if your organization already uses Zoho Learn for training or policy documentation.

<u>**Fields required**</u>

- **Name**: A custom name for internal reference.
- **Link**: A direct URL to the article or manual.
- **Import option**: Choose whether to "Import an article using the article link" or "Import from a manual".
- **Sync frequency**: Choose how frequently the system should fetch updates (e.g., None, Hourly, Daily, Weekly, or Custom Frequency).

**Important note on Zoho Learn document links:**

1. You must be part of a Zoho Learn hub to use this option. If not, you will not be able to upload the documents using this method.
2. When sharing or uploading document links in Zoho Learn, do not copy the URL directly from the address bar, as it may include additional path elements (like /team/ or specific project names) that can render the link non-functional for others.

<u>Correct vs. incorrect link formats:</u>

**OCR Article**

**Working**: https://learn.zoho.com/portal/zohocorp/manual/image/article/ocr-wiki

**Not Working**: https://learn.zoho.com/portal/zohocorp/team/crmzia-ml-dl/manual/image/article/ocr-wiki

**To upload documents to the knowledge base**

1. Log in to your {{%link href="https://catalyst.zoho.com/quickml.html" %}}QuickML{{%/link%}} account.
2. Under the *Generative AI* section, select the **Knowledge Base** tab.
3. Click **Upload Document**.
4. Choose the desired document upload method:
   - Choose **Upload a file** to upload a document directly from your local computer.
   - Choose **Zoho WorkDrive** to upload a document stored in your Zoho WorkDrive.
   - Choose **Zoho Learn** to either import an article or a manual from your Zoho Learn workspace.
5. The uploaded documents will appear in the knowledge base repository.

### Managing uploaded documents

Once your documents are uploaded, QuickML provides a clean card-style interface to interact with each document through certain actions in the knowledge base.

**Document actions available**

- **Delete**: Permanently remove the document from the knowledge base repository.
- **Copy Document ID**: Useful for referencing the document in your APIs or within RAG pipelines.
- **Meta description preview**: Upon hovering below the document title, you can view the first few lines of the document's contents as the meta description.

### Exploring the RAG API options

At the top-right corner, you’ll notice the RAG API button. This option is for technical users or developers looking to embed the RAG system into their applications using knowledge base documents. When configuring a RAG API call, several parameters can be set.

**OAuth Scope**

This defines the specific level of access that the API token (access token) will grant. In this case, the scope is QuickML.deployment.READ. This grants read-only access to the QuickML deployment, which includes querying knowledge base documents using the RAG (Retrieval-Augmented Generation) feature.

**Authentication**

Authentication ensures that only authorized users can make API calls. This RAG API uses OAuth, a secure authorization framework that enables applications to obtain limited access to user accounts.
**HTTP method**

The HTTP method used for this API call is POST. This method is used because the API is not just retrieving static data; it sends input (the query and documents) to the server to receive a generated answer based on dynamic reasoning.

**Endpoint URL**

The endpoint URL specifies the exact location of the RAG API service to which your application will send the POST request.

**Headers**

Headers are metadata sent along with the request. They help authenticate and route the request correctly.

**Request body**

The request body contains the actual payload sent to the API; in this case, the user’s query and the documents to search.

### Best practices for effective knowledge base usage

To maximize the value of the knowledge base tab:

1. Ensure documents are formatted cleanly with headers, bullet points, and readable structure.
2. Regularly review and update the documents to maintain relevance.
3. Always copy and store the Document ID securely when planning to use it with APIs.

### ML Algorithms

-------------------------------------------------------------------------------- 
title: "Classification Algorithms" 
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." 
last_updated: "2026-03-18T07:41:08.602Z" 
source: "https://docs.catalyst.zoho.com/en/quickml/help/ml-algorithms/classification-algorithms/" 
service: "QuickML" 
--------------------------------------------------------------------------------

<!-- ### ML Algorithms QuickML ML operations are categorized into multiple sections. All of these operations are available as stages in Pipeline Builder as mentioned and the main categories are: 1. **Algorithms** - Classification algorithms - Regression algorithms 2. **Encoding** 3. **Imputation** 4. **Binning** 5. **Transformers** 6. **Data type detection** 7. **DType transformation** 8. **Feature generation** All the operations listed above have detailed categories and their own use cases in the ML pipeline environment. Those can be checked in detail in upcoming pipeline articles. -->

# ML Algorithms in QuickML

QuickML mainly focuses on powering ML pipelines with machine-learning operations in an effortless manner, providing a smooth pipeline execution environment. Hence, it has been integrated with a wide range of ML algorithms and features to provide the best analytical results from your data. ML algorithms are programs that can learn from data and improve from experience, without any external intervention. The following algorithms and operations are all available in **QuickML as stages** that can be configured in one or more pipeline executions. The most widely used algorithms in the data science domain are:

1. **Classification algorithms**
2. **Regression algorithms**

# Classification Algorithms

Classification is the task of predicting a discrete class label. QuickML features the following classification algorithms:

1. ### AdaBoost Classification

AdaBoost is a machine-learning algorithm that builds a series of small, one-step (one level) decision trees, adapting each tree to predict difficult cases missed by the previous trees and combining all trees into a single model. This classification begins by fitting a classifier on the original dataset, followed by additional copies of the classifier on the same dataset. The weights of these instances are adjusted according to the error of the current prediction, so that subsequent classifiers focus more on difficult cases.
#### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w15p">Parameter</th> <th class="w35p">Description</th> <th class="w15p">Data Type</th> <th class="w20p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}base_estimator{{%/badge%}}</td> <td>The base estimator from which the boosted ensemble is built. If none, then the base estimator is {{%badge%}}DecisionTreeClassifier{{%/badge%}} initialized with {{%bold%}}max_depth=1{{%/bold%}}.</td> <td>object</td> <td>Any classification model except KNN Classification model</td> <td>None</td> </tr> <tr> <td>{{%badge%}}n_estimators{{%/badge%}} <br>(number of estimators)</td> <td>The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.</td> <td>int</td> <td>[1, 500]</td> <td>50</td> </tr> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier.</td> <td>float</td> <td>(0.0, +Inf)</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}algorithm{{%/badge%}}</td> <td>If ‘SAMME.R’ then use the SAMME.R real boosting algorithm. base_estimator must support calculation of class probabilities. If ‘SAMME’ then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.</td> <td>string</td> <td>{‘SAMME’, ‘SAMME.R’}</td> <td>’SAMME.R’</td> </tr> </tbody> </table> 2. ### CatBoost Classification CatBoost is based on gradient-boosted decision trees. During training, a set of decision trees is built consecutively. Each successive tree is built with reduced loss compared to the previous trees. The number of trees is controlled by the starting parameters. <br /> This classification has much **less prediction time** compared to others. #### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w25p">Description</th> <th class="w15p">Data Type</th> <th class="w20p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>Used for reducing the gradient step.</td> <td>float</td> <td>(0,1]</td> <td>0.03</td> </tr> <tr> <td>{{%badge%}}l2_leaf_reg{{%/badge%}} (l2_leaf_regularization)</td> <td>Coefficient at the L2 regularization term of the cost function.</td> <td>float</td> <td>[0,+inf)</td> <td>3.0</td> </tr> <tr> <td>{{%badge%}}rsm{{%/badge%}} (random subspace method)</td> <td>The percentage of features to use at each split selection, when features are selected over again at random</td> <td>float</td> <td>(0,1]</td> <td>None</td> </tr> <tr> <td>{{%badge%}}loss_function{{%/badge%}}</td> <td>The metric to use in training. The specified value also determines the machine learning problem to solve. 
Some metrics support optional parameters.</td> <td>string</td> <td>{&#39;Logloss&#39;, &#39;CrossEntropy&#39;, &#39;MultiClass&#39;, &#39;MultiClassOneVsAll&#39;}</td> <td>&#39;MultiClass&#39;</td> </tr> <tr> <td>{{%badge%}}nan_mode{{%/badge%}}</td> <td>The method for processing missing values in the input dataset.</td> <td>string</td> <td>{&#39;Forbidden&#39;, &#39;Min&#39;, &#39;Max&#39;}</td> <td>Min</td> </tr> <tr> <td>{{%badge%}}leaf_estimation_method{{%/badge%}}</td> <td>The method used to calculate the values in leaves.</td> <td>string</td> <td>{&quot;Newton&quot;, &quot;Gradient&quot;}</td> <td>None</td> </tr> <tr> <td>{{%badge%}}score_function{{%/badge%}}</td> <td>The score type used to select the next split during the tree construction.</td> <td>string</td> <td>{L2, Cosine}</td> <td>Cosine</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>Maximum depth of the tree.</td> <td>int</td> <td>[1,+Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}n_estimators{{%/badge%}}<br> (number of estimators)</td> <td>The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter.</td> <td>int</td> <td>[1, 500]</td> <td>None</td> </tr> </tbody> </table>

3. ### Decision-Tree Classification

Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets, while at the same time an associated decision tree is incrementally developed. Decision trees can handle **both categorical and numerical data**. When predicting the output value of a set of features, it will predict the output based on the subset that the set of features falls into.

#### Hyper Parameters:

<table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w15p">Data Type</th> <th class="w20p">Possible Values</th> <th class="w10p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}criterion{{%/badge%}}</td> <td>The function to measure the quality of a split.</td> <td>string</td> <td>{“gini”, “entropy”}</td> <td>”gini”</td> </tr> <tr> <td>{{%badge%}}splitter{{%/badge%}}</td> <td>The strategy used to choose the split at each node.</td> <td>string</td> <td>{“best”, “random”}</td> <td>”best”</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.</td> <td>int</td> <td>(0, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_samples_split{{%/badge%}}</td> <td>The minimum number of samples required to split an internal node.</td> <td>int or float</td> <td>[2, +Inf) or (0, 1.0]</td> <td>2</td> </tr> <tr> <td>{{%badge%}}min_samples_leaf{{%/badge%}}</td> <td>The minimum number of samples required to be at a leaf node.
A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.</td> <td>int or float</td> <td>[1, +Inf) or (0, 0.5]</td> <td>1</td> </tr> <tr> <td>{{%badge%}}min_weight_fraction_leaf{{%/badge%}}</td> <td>The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.</td> <td>float</td> <td>[0, 0.5]</td> <td>0</td> </tr> <tr> <td>{{%badge%}}max_features{{%/badge%}}</td> <td>The number of features to consider when looking for the best split</td> <td>int, float or string</td> <td>(0, n_features] or { “sqrt”, “log2”}</td> <td>None</td> </tr> <tr> <td>{{%badge%}}max_leaf_nodes{{%/badge%}}</td> <td>Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.</td> <td>int</td> <td>(1, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_impurity_decrease{{%/badge%}}</td> <td>A node will be split if this split induces a decrease of the impurity greater than or equal to this value.</td> <td>float</td> <td>[0, +Inf)</td> <td>0.0</td> </tr> </tbody> </table> 4. ### GB Classification Gradient-boosting classification calculates the difference between the current prediction and the known correct target value. This difference is called residual. After finding this value, the gradient-boosting classifier trains a weak model (Decision Tree) that maps features to that residual. This residual predicted by a weak model is added to the existing model input and thus this process nudges the model towards the correct target. Repeating this step multiple times improves the overall model prediction. #### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w15p">Data Type</th> <th class="w15p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}loss{{%/badge%}}</td> <td>The loss function to be optimized. ‘deviance’ refers to deviance (= logistic regression) for classification with probabilistic outputs.</td> <td>string</td> <td>{&#39;deviance&#39;, &#39;exponential&#39;}</td> <td>&#39;deviance&#39;</td> </tr> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>Learning rate shrinks the contribution of each tree by learning_rate. 
There is a trade-off between learning_rate and n_estimators.</td> <td>float</td> <td>(0.0, +Inf)</td> <td>0.1</td> </tr> <tr> <td>{{%badge%}}n_estimators{{%/badge%}}<br> (number of estimators)</td> <td>The number of boosting stages to perform.</td> <td>int</td> <td>[1, 500]</td> <td>100</td> </tr> <tr> <td>{{%badge%}}criterion{{%/badge%}}</td> <td>The function to measure the quality of a split.</td> <td>string</td> <td>{&#39;friedman_mse&#39;, &#39;mse&#39;, &#39;mae&#39;}</td> <td>’friedman_mse’</td> </tr> <tr> <td>{{%badge%}}subsample{{%/badge%}}</td> <td>The fraction of samples to be used for fitting the individual base learners.</td> <td>float</td> <td>(0.0, 1.0]</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>The maximum depth of the individual regression estimators.</td> <td>int</td> <td>(0, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_samples_split{{%/badge%}}</td> <td>The minimum number of samples required to split an internal node</td> <td>int or float</td> <td>[2, +Inf) or (0, 1.0]</td> <td>2</td> </tr> <tr> <td>{{%badge%}}min_samples_leaf{{%/badge%}}</td> <td>The minimum number of samples required to be at a leaf node.</td> <td>int or float</td> <td>[1, +Inf) or (0, 0.5]</td> <td>1</td> </tr> <tr> <td>{{%badge%}}min_weight_fraction_leaf{{%/badge%}}</td> <td>The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.</td> <td>float</td> <td>[0, 0.5]</td> <td>0</td> </tr> <tr> <td>{{%badge%}}max_features{{%/badge%}}</td> <td>The number of features to consider when looking for the best split</td> <td>int, float or string</td> <td>(0, n_features] or { “sqrt”, “log2”}</td> <td>None</td> </tr> <tr> <td>{{%badge%}}max_leaf_nodes{{%/badge%}}</td> <td>Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity.</td> <td>int</td> <td>(1, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_impurity_decrease{{%/badge%}}</td> <td>A node will be split if this split induces a decrease of the impurity greater than or equal to this value.</td> <td>float</td> <td>[0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}init{{%/badge%}}</td> <td>An estimator object that is used to compute the initial predictions.</td> <td>object or string</td> <td>estimator (Any classification model except SVM classification and catboost) or ‘zero’</td> <td>None</td> </tr> <tr> <td>{{%badge%}}warm_start{{%/badge%}}</td> <td>When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}tol{{%/badge%}} (tolerance)</td> <td>Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1e-4</td> </tr> </tbody> </table>

5. ### KNN Classification

KNN works by finding the distances between a query (data instance) and all the examples in the data, selecting the specified number of examples (K) closest to the query, then voting for the most frequent label in the neighbourhood.
#### Hyper Parameters:

<table class="content-table quickml-content-table"> <thead> <tr> <th class="w15p">Parameter</th> <th class="w35p">Description</th> <th class="w15p">Data Type</th> <th class="w20p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}n_neighbors{{%/badge%}}<br>(number of neighbours)</td> <td>Number of neighbors to use by default for kneighbors queries.</td> <td>int</td> <td>[1, n]<br> n = Total number of records in dataset</td> <td>5</td> </tr> <tr> <td>{{%badge%}}weights{{%/badge%}}</td> <td>Weight function used in prediction.</td> <td>string</td> <td>{‘uniform’, ‘distance’}</td> <td>’uniform’</td> </tr> <tr> <td>{{%badge%}}algorithm{{%/badge%}}</td> <td>Algorithm used to compute the nearest neighbors.</td> <td>string</td> <td>{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}</td> <td>’auto’</td> </tr> <tr> <td>{{%badge%}}leaf_size{{%/badge%}}</td> <td>Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree.</td> <td>int</td> <td> (1, +Inf)</td> <td>30</td> </tr> <tr> <td>{{%badge%}}p{{%/badge%}}</td> <td>Power parameter for the Minkowski metric.</td> <td>int</td> <td>[1,3]</td> <td>2</td> </tr> <tr> <td>{{%badge%}}metric{{%/badge%}}</td> <td>Metric to use for distance computation. Default is “minkowski”, which results in the standard Euclidean distance when p = 2.</td> <td>string</td> <td>{‘cityblock’, ‘cosine’, &#39;euclidean&#39;, &#39;l1&#39;, &#39;l2&#39;, &#39;manhattan&#39;, &#39;nan_euclidean&#39;, ’minkowski’}</td> <td>’minkowski’</td> </tr> </tbody> </table>

6. ### LGBM Classification

LGBM works by starting with an initial estimate that is updated using the output of each tree. The learning parameter controls the magnitude of this change in the estimates. It can be used on any data and provides a high degree of accuracy, as it contains many built-in preprocessing steps. The LightGBM algorithm grows vertically, meaning it grows leaf-wise, while other algorithms grow level-wise. LightGBM chooses the leaf with the largest loss to grow. When growing the same leaf, it can reduce loss more than a level-wise algorithm.

#### Hyper Parameters:

<table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w20p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}boosting_type{{%/badge%}}</td> <td>Method of boosting.</td> <td>string</td> <td>{‘gbdt’, ‘dart’, ‘goss’ }</td> <td>&#39;gbdt&#39;</td> </tr> <tr> <td>{{%badge%}}num_leaves{{%/badge%}}</td> <td>Maximum tree leaves for base learners.</td> <td>int</td> <td>(1, +Inf)</td> <td>31</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>Maximum tree depth for base learners, &lt;= 0 means no limit.</td> <td>int</td> <td>(-Inf, +Inf)</td> <td>-1</td> </tr> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>Boosting learning rate.
You can use the callbacks parameter of the fit method to shrink/adapt the learning rate during training using the reset_parameter callback.</td> <td>float</td> <td>(0.0, +Inf)</td> <td>0.1</td> </tr> <tr> <td>{{%badge%}}n_estimators{{%/badge%}}<br> (number of estimators)</td> <td>Number of boosted trees to fit.</td> <td>int</td> <td>[1, 500]</td> <td>100</td> </tr> <tr> <td>{{%badge%}}subsample_for_bin{{%/badge%}}</td> <td>Number of samples for constructing bins.</td> <td>int</td> <td>(0, +Inf)</td> <td>200000</td> </tr> <tr> <td>{{%badge%}}min_split_gain{{%/badge%}}</td> <td>Minimum loss reduction required to make a further partition on a leaf node of the tree.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}min_child_weight{{%/badge%}}</td> <td>Minimum sum of instance weight (Hessian) needed in a child (leaf).</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1e-3</td> </tr> <tr> <td>{{%badge%}}min_child_samples{{%/badge%}}</td> <td>Minimum number of data needed in a child (leaf).</td> <td>int</td> <td>[0, +Inf)</td> <td>20</td> </tr> <tr> <td>{{%badge%}}subsample{{%/badge%}}</td> <td>Subsample ratio of the training instance.</td> <td>float</td> <td>(0.0, 1.0]</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}subsample_freq{{%/badge%}} (subsample_frequency)</td> <td>Frequency of subsampling; &lt;= 0 disables it.</td> <td>int</td> <td>(-Inf, +Inf)</td> <td>0</td> </tr> <tr> <td>{{%badge%}}colsample_bytree{{%/badge%}} (column sample by tree)</td> <td>Subsample ratio of columns when constructing each tree.</td> <td>float</td> <td>(0.0, 1.0]</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}reg_alpha{{%/badge%}} (alpha)</td> <td>L1 regularization term on weights.</td> <td>float</td> <td>(0.0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}reg_lambda{{%/badge%}} (lambda)</td> <td>L2 regularization term on weights.</td> <td>float</td> <td>(0.0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}importance_type{{%/badge%}}</td> <td>The type of feature importance to be filled into feature_importances_. If ‘split’, result contains numbers of times the feature is used in a model. If ‘gain’, result contains total gains of splits which use the feature.</td> <td>string</td> <td>{ ‘gain’, &#39;split&#39;}</td> <td>&#39;split&#39;</td> </tr> </tbody> </table>

7. ### Logistic Regression

When the target is a **binary value**, we can use logistic classification. It maps values between 0 and 1.

#### Hyper Parameters:

<table class="content-table quickml-content-table"> <thead> <tr> <th class="w20p">Parameter</th> <th class="w35p">Description</th> <th class="w10p">Data Type</th> <th class="w20p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}penalty{{%/badge%}}</td> <td>Specify the norm of the penalty:<ul style="text-align: left;"><li><strong>&#39;none&#39;</strong>: no penalty is added;<li><strong>&#39;l2&#39;</strong>: add a L2 penalty term and it is the default choice;<li><strong>&#39;l1&#39;</strong>: add a L1 penalty term;<li><strong>&#39;elasticnet&#39;</strong>: both L1 and L2 penalty terms are added.</ul></td> <td>string</td> <td>{ ‘l1’, ‘l2’, ‘elasticnet’, ‘none’}</td> <td>’l2’</td> </tr> <tr> <td>{{%badge%}}dual{{%/badge%}}</td> <td>Dual or primal formulation.
Dual formulation is only implemented for l2 penalty with liblinear solver.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}tol{{%/badge%}} (tolerance)</td> <td>Tolerance for stopping criteria.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1e-4</td> </tr> <tr> <td>{{%badge%}}C{{%/badge%}}</td> <td>Inverse of regularization strength; must be a positive float.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}solver{{%/badge%}}</td> <td>Algorithm to use in the optimization problem.</td> <td>string</td> <td>{ ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’ }</td> <td>’lbfgs’</td> </tr> <tr> <td>{{%badge%}}fit_intercept{{%/badge%}}</td> <td>Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.</td> <td>bool</td> <td>True or False</td> <td>True</td> </tr> <tr> <td>{{%badge%}}l1_ratio{{%/badge%}}</td> <td>The Elastic-Net mixing parameter, with<br>0 &lt;= l1_ratio &lt;= 1. Only used if penalty=&#39;elasticnet&#39;.</td> <td>float</td> <td>[0, 1]</td> <td>None</td> </tr> <tr> <td>{{%badge%}}multi_class{{%/badge%}}</td> <td>If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary.</td> <td>string</td> <td>{&#39;auto&#39;, &#39;ovr&#39;, &#39;multinomial&#39;}</td> <td>&#39;auto&#39;</td> </tr> <tr> <td>{{%badge%}}intercept_scaling{{%/badge%}}</td> <td>Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. The intercept becomes intercept_scaling * synthetic_feature_weight.</td> <td>float</td> <td>(0, +Inf)</td> <td>1.0</td> </tr> </tbody> </table>

{{%note%}}{{%bold%}}Note:{{%/bold%}} Each value of the "solver" parameter supports only a few of the values in the "penalty" parameter. The penalties supported by each solver are listed below:
* ‘newton-cg’ - [‘l2’, ‘none’]
* ‘lbfgs’ - [‘l2’, ‘none’]
* ‘liblinear’ - [‘l1’, ‘l2’]
* ‘sag’ - [‘l2’, ‘none’]
* ‘saga’ - [‘elasticnet’, ‘l1’, ‘l2’, ‘none’]{{%/note%}}

8. ### Naive Bayes Classification

Naive Bayes is a classifier that uses the **Bayes Theorem**. It predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class. The class with the highest probability is considered the most likely class.

9. ### Random-Forest Classification

The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building individual trees to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Bagging is an ensemble meta-estimator that fits base classifiers/regressors on random subsets of the original dataset, then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.
#### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w20p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}n_estimators{{%/badge%}}<br> (number of estimators)</td> <td>The number of trees in the forest.</td> <td>int</td> <td>[1, 500]</td> <td>100</td> </tr> <tr> <td>{{%badge%}}criterion{{%/badge%}}</td> <td>The function to measure the quality of a split.</td> <td>string</td> <td>{“gini”, “entropy”}</td> <td>”gini”</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.</td> <td>int</td> <td>(0, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_samples_split{{%/badge%}}</td> <td>The minimum number of samples required to split an internal node</td> <td>int or float</td> <td>[2, +Inf) or (0, 1.0]</td> <td>2</td> </tr> <tr> <td>{{%badge%}}min_samples_leaf{{%/badge%}}</td> <td>The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.</td> <td>int or float</td> <td>[1, +Inf) or (0, 0.5]</td> <td>1</td> </tr> <tr> <td>{{%badge%}}min_weight_fraction_leaf{{%/badge%}}</td> <td>The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.</td> <td>float</td> <td>[0, 0.5]</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}max_features{{%/badge%}}</td> <td>The number of features to consider when looking for the best split</td> <td>int, float or string</td> <td>(0, n_features] or { “sqrt”}</td> <td>None</td> </tr> <tr> <td>{{%badge%}}max_leaf_nodes{{%/badge%}}</td> <td>Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity.</td> <td>int</td> <td>(1, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_impurity_decrease{{%/badge%}}</td> <td>A node will be split if this split induces a decrease of the impurity greater than or equal to this value.</td> <td>float</td> <td>[0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}bootstrap{{%/badge%}}</td> <td>Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.</td> <td>bool</td> <td>True or False</td> <td>True</td> </tr> <tr> <td>{{%badge%}}oob_score{{%/badge%}}(out of bag score)</td> <td>Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}warm_start{{%/badge%}}</td> <td>When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> </tbody> </table> 10. ### SVM Classification SVM, or Support Vector Machine, is a linear model for classification and regression problems. It can solve linear and non-linear problems and work well for many practical problems. The idea of SVM is simple: The algorithm creates a line or a hyperplane that separates the data into classes. 
#### Hyper Parameters:

<table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w35p">Description</th> <th class="w10p">Data Type</th> <th class="w15p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}C{{%/badge%}}</td> <td>Regularization parameter. The strength of the regularization is inversely proportional to C.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}kernel{{%/badge%}}</td> <td>Specifies the kernel type to be used in the algorithm. If none is given, ‘rbf’ will be used.</td> <td>string</td> <td>{‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}</td> <td>’rbf’</td> </tr> <tr> <td>{{%badge%}}degree{{%/badge%}}</td> <td>Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.</td> <td>int</td> <td>[0, +Inf)</td> <td>3</td> </tr> <tr> <td>{{%badge%}}gamma{{%/badge%}}</td> <td>Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.</td> <td>string or float</td> <td>{‘scale’, ‘auto’} or (0.0, +Inf)</td> <td>’scale’</td> </tr> <tr> <td>{{%badge%}}coef0{{%/badge%}}</td> <td>Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.</td> <td>float</td> <td>(-Inf, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}shrinking{{%/badge%}}</td> <td>Whether to use the shrinking heuristic.</td> <td>bool</td> <td>True or False</td> <td>True</td> </tr> <tr> <td>{{%badge%}}probability{{%/badge%}}</td> <td>Whether to enable probability estimates.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}tol{{%/badge%}} (tolerance)</td> <td>Tolerance for stopping criterion.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1e-3</td> </tr> <tr> <td>{{%badge%}}decision_function_shape{{%/badge%}}</td> <td>Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one (‘ovo’) decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2).</td> <td>string</td> <td>{‘ovo’, ‘ovr’}</td> <td>’ovr’</td> </tr> <tr> <td>{{%badge%}}break_ties{{%/badge%}}</td> <td>If true, decision_function_shape=&#39;ovr&#39;, and number of classes &gt; 2, predict will break ties according to the confidence values of decision_function; otherwise the first class among the tied classes is returned.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> </tbody> </table>

11. ### XGB Classification

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine-learning algorithms under the Gradient Boosting framework. It provides parallel tree boosting to solve many data science problems quickly and accurately. It uses L1 and L2 regularisation and is fast in training.

Each model has Fit (to train the model), predict (to predict on new data), get metrics (to get the model's accuracy and other metrics), and feature_importances (the importances of the input features for the prediction); a brief code sketch of these stages follows the XGB hyperparameters table below. The basic working principles of AdaBoost, CatBoost, Decision Tree, Gradient Boost (GB), LGBM, Random Forest, SVM, and XGB are almost identical for both regression and classification.
#### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}booster{{%/badge%}}</td> <td>Decides which booster to use.</td> <td>string</td> <td>{‘gbtree&#39;, &#39;gblinear&#39;, &#39;dart&#39;}</td> <td>’gbtree’</td> </tr> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>Step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.</td> <td>float</td> <td>[0,1]</td> <td>0.1</td> </tr> <tr> <td>{{%badge%}}n_estimators{{%/badge%}}<br> (number of estimators)</td> <td>Number of trees to fit.</td> <td>int</td> <td>[1, 500]</td> <td>100</td> </tr> <tr> <td>{{%badge%}}objective{{%/badge%}}</td> <td>The learning objective; the default performs logistic regression for binary classification.</td> <td>string</td> <td>Mentioned below the table.</td> <td>&quot;binary:logistic&quot;</td> </tr> <tr> <td>{{%badge%}}subsample{{%/badge%}}</td> <td>Subsample ratio of the training instances.</td> <td>float</td> <td>(0,1]</td> <td>1</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>Maximum depth of a tree.</td> <td>int</td> <td>(0, +Inf)</td> <td>3</td> </tr> <tr> <td>{{%badge%}}max_delta_step{{%/badge%}}</td> <td>If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when classes are extremely imbalanced.</td> <td>int or float</td> <td>[0, +Inf)</td> <td>0</td> </tr> <tr> <td>{{%badge%}}colsample_bytree{{%/badge%}} (column sample by tree)</td> <td>Subsample ratio of columns when constructing each tree.</td> <td>float</td> <td>(0.0, 1.0]</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}colsample_bylevel{{%/badge%}} (column sample by level)</td> <td>It is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.</td> <td>float</td> <td>(0.0, 1.0]</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}min_child_weight{{%/badge%}}</td> <td>Minimum sum of instance weight (Hessian) needed in a child.</td> <td>int</td> <td>[0, +Inf)</td> <td>1</td> </tr> <tr> <td>{{%badge%}}reg_alpha{{%/badge%}} (alpha)</td> <td>L1 regularization term on weights.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}reg_lambda{{%/badge%}} (lambda)</td> <td>L2 regularization term on weights.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}scale_pos_weight{{%/badge%}} (scale positive weight)</td> <td>Controls the balance of positive and negative weights; useful for unbalanced classes.</td> <td>int</td> <td>[0, +Inf)</td> <td>1</td> </tr> </tbody> </table> **POSSIBLE VALUES FOR THE "OBJECTIVE" PARAM:** {"binary:logistic", "binary:logitraw", "binary:hinge", "multi:softmax", "multi:softprob"} -------------------------------------------------------------------------------- title: "Regression Algorithms" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.603Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/ml-algorithms/regression-algorithms/" service: "QuickML" -------------------------------------------------------------------------------- # Regression Algorithms Regression is the task of predicting a continuous quantity. QuickML features the following regression algorithms: 1. ### AdaBoost Regression AdaBoost is a machine-learning algorithm that builds a series of small, one-step (one-level) decision trees, adapting each tree to predict difficult cases missed by the previous trees and combining all trees into a single model. This regression begins by fitting a regressor on the original dataset, followed by fitting additional copies of the regressor on the same dataset. The weights of these instances are adjusted according to the error of the current prediction, so that subsequent regressors focus more on difficult cases. Boosting in machine learning is a way of combining multiple simple models into a single composite model. This is also why boosting is known as an additive model, since simple models (also known as weak learners) are added one at a time, while keeping existing trees in the model unchanged. As we combine more and more simple models, the complete final model becomes a stronger predictor. #### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}base_estimator{{%/badge%}}</td> <td>The base estimator from which the boosted ensemble is built. If None, then the base estimator is DecisionTreeRegressor initialized with max_depth=3.</td> <td>object</td> <td>Any regression model</td> <td>None</td> </tr> <tr> <td>{{%badge%}}n_estimators{{%/badge%}} <br>(number of estimators)</td> <td>The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.</td> <td>int</td> <td>[1, 500]</td> <td>50</td> </tr> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>Weight applied to each regressor at each boosting iteration. A higher learning rate increases the contribution of each regressor.</td> <td>float</td> <td>(0.0, +Inf)</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}loss{{%/badge%}}</td> <td>The loss function to use when updating the weights after each boosting iteration.</td> <td>string</td> <td>{‘linear’, ‘square’, ‘exponential’}</td> <td>&quot;linear&quot;</td> </tr> </tbody> </table> 2. ### CatBoost Regression CatBoost is based on gradient-boosted decision trees. During training, a set of decision trees is built consecutively. Each successive tree is built with reduced loss compared to the previous trees. The number of trees is controlled by the starting parameters. It has a much lower prediction time compared to other algorithms.
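For intuition, here is a minimal sketch using the catboost Python package; it is an assumption for illustration only, since QuickML configures this stage without any code.

```python
# Minimal sketch of CatBoost regression, assuming the catboost package
# (illustrative only; not QuickML's internal implementation).
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

model = CatBoostRegressor(
    n_estimators=200,      # trees are built consecutively, as described above
    learning_rate=0.03,
    loss_function="RMSE",
    verbose=0,             # silence per-iteration training output
)
model.fit(X, y)
print(model.predict(X[:3]))
```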
#### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>The learning rate used for training.</td> <td>float</td> <td>(0,1]</td> <td>0.03</td> </tr> <tr> <td>{{%badge%}}l2_leaf_reg{{%/badge%}} (l2_leaf_regularization)</td> <td>Coefficient at the L2 regularization term of the cost function.</td> <td>float</td> <td>[0,+Inf)</td> <td>3.0</td> </tr> <tr> <td>{{%badge%}}rsm{{%/badge%}} (random subspace method)</td> <td>The percentage of features to use at each split selection, when features are selected over again at random.</td> <td>float</td> <td>(0,1]</td> <td>None</td> </tr> <tr> <td>{{%badge%}}loss_function{{%/badge%}}</td> <td>The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters.</td> <td>string</td> <td>{&#39;RMSE&#39;, &#39;MAE&#39;, &#39;Quantile:alpha=value&#39;, &#39;LogLinQuantile:alpha=value&#39;, &#39;Poisson&#39;, &#39;MAPE&#39;, &#39;Lq:q=value&#39;, &#39;SurvivalAft:dist=value;scale=value&#39;} Note: range of value = [0, 1]</td> <td>&#39;RMSE&#39;</td> </tr> <tr> <td>{{%badge%}}nan_mode{{%/badge%}}</td> <td>The method for processing missing values in the input dataset.</td> <td>string</td> <td>{&#39;Forbidden&#39;, &#39;Min&#39;, &#39;Max&#39;}</td> <td>&#39;Min&#39;</td> </tr> <tr> <td>{{%badge%}}leaf_estimation_method{{%/badge%}}</td> <td>The method used to calculate the values in leaves.</td> <td>string</td> <td>{&quot;Newton&quot;, &quot;Gradient&quot;}</td> <td>None</td> </tr> <tr> <td>{{%badge%}}score_function{{%/badge%}}</td> <td>The score type used to select the next split during the tree construction.</td> <td>string</td> <td>{&#39;L2&#39;, &#39;Cosine&#39;}</td> <td>&#39;Cosine&#39;</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>Maximum depth of the tree.</td> <td>int</td> <td>[1,+Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}n_estimators{{%/badge%}} <br>(number of estimators)</td> <td>The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter.</td> <td>int</td> <td>[1, 500]</td> <td>None</td> </tr> </tbody> </table> 3. ### Decision-Tree Regression Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. Decision trees can handle both categorical and numerical data. When predicting the output value for a set of features, it predicts the output based on the subset that the set of features falls into.
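The recursive subset-splitting described above can be pictured with scikit-learn's DecisionTreeRegressor; the hyperparameter names match the table that follows, but the dataset and values are illustrative assumptions.

```python
# Hedged sketch of decision-tree regression using scikit-learn
# (illustrative only; QuickML configures this stage in its UI).
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

tree = DecisionTreeRegressor(
    criterion="friedman_mse",  # function measuring the quality of a split
    splitter="best",           # strategy used to choose each split
    max_depth=5,               # stop expanding nodes beyond this depth
    min_samples_split=2,
)
tree.fit(X, y)
print(tree.predict(X[:3]))
```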
#### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}criterion{{%/badge%}}</td> <td>The function to measure the quality of a split.</td> <td>string</td> <td>{&quot;mse&quot;, &quot;friedman_mse&quot;, &quot;mae&quot;}</td> <td>&quot;mse&quot;</td> </tr> <tr> <td>{{%badge%}}splitter{{%/badge%}}</td> <td>The strategy used to choose the split at each node.</td> <td>string</td> <td>{“best”, “random”}</td> <td>”best”</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.</td> <td>int</td> <td>(0, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_samples_split{{%/badge%}}</td> <td>The minimum number of samples required to split an internal node.</td> <td>int or float</td> <td>[2, +Inf) or (0, 1.0]</td> <td>2</td> </tr> <tr> <td>{{%badge%}}min_samples_leaf{{%/badge%}}</td> <td>The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.</td> <td>int or float</td> <td>[1, +Inf) or (0, 0.5]</td> <td>1</td> </tr> <tr> <td>{{%badge%}}min_weight_fraction_leaf{{%/badge%}}</td> <td>The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.</td> <td>float</td> <td>[0, 0.5]</td> <td>0</td> </tr> <tr> <td>{{%badge%}}max_features{{%/badge%}}</td> <td>The number of features to consider when looking for the best split.</td> <td>int, float or string</td> <td>(0, n_features] or { “sqrt”, “log2”}</td> <td>None</td> </tr> <tr> <td>{{%badge%}}max_leaf_nodes{{%/badge%}}</td> <td>Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity.</td> <td>int</td> <td>(1, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_impurity_decrease{{%/badge%}}</td> <td>A node will be split if this split induces a decrease of the impurity greater than or equal to this value.</td> <td>float</td> <td>[0, +Inf)</td> <td>0.0</td> </tr> </tbody> </table> 4. ### ElasticNet Regression Elastic net is a popular type of regularized linear regression that combines two widely used penalties, specifically the L1 (Lasso Regression) and L2 (Ridge Regression) penalty functions. Elastic net is an extension of linear regression that adds regularization penalties to the loss function during training. Regularization is a technique to prevent the model from over-fitting by adding extra information to it. In this regularization technique, we reduce the magnitude of the coefficients while keeping the same number of features. Sometimes, lasso regression can cause a small bias (difference between predicted and actual value) in the model where the prediction is too dependent upon a particular variable. In these cases, elastic net proves to perform better by combining the regularization of both lasso and ridge regression.
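The L1/L2 blend can be sketched with scikit-learn's ElasticNet, where l1_ratio controls the mix exactly as in the table below. The dataset here is a synthetic assumption for illustration.

```python
# Minimal sketch of elastic net's combined L1/L2 penalty using scikit-learn
# (illustrative; QuickML tunes these values through its no-code UI).
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# l1_ratio blends the penalties: 0 -> pure ridge (L2), 1 -> pure lasso (L1).
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X, y)
print("nonzero coefficients:", (model.coef_ != 0).sum())
```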
#### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}alpha{{%/badge%}}</td> <td>Constant that multiplies the penalty terms.</td> <td>float</td> <td>(0, +Inf)</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}l1_ratio{{%/badge%}}</td> <td>The ElasticNet mixing parameter, with<br>0 &lt;= l1_ratio &lt;= 1.<br> For l1_ratio = 0 the penalty is an L2 penalty.<br> For l1_ratio = 1 it is an L1 penalty.<br> For 0 &lt; l1_ratio &lt; 1, the penalty is a combination of L1 and L2.</td> <td>float</td> <td>[0, 1]</td> <td>0.5</td> </tr> <tr> <td>{{%badge%}}fit_intercept{{%/badge%}}</td> <td>Whether the intercept should be estimated or not.</td> <td>bool</td> <td>True or False</td> <td>True</td> </tr> <tr> <td>{{%badge%}}normalize{{%/badge%}}</td> <td>This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}tol{{%/badge%}} (tolerance)</td> <td>The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1e-4</td> </tr> <tr> <td>{{%badge%}}warm_start{{%/badge%}}</td> <td>When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}positive{{%/badge%}}</td> <td>When set to True, it forces the coefficients to be positive.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}selection{{%/badge%}}</td> <td>If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default.</td> <td>string</td> <td>{&quot;cyclic&quot;, &quot;random&quot;}</td> <td>&quot;cyclic&quot;</td> </tr> </tbody> </table> 5. ### GB Regression Gradient-boosting regression calculates the difference between the current prediction and the known correct target value. This difference is called the residual. After obtaining this value, gradient-boosting regression trains a weak model (decision tree) that maps features to that residual. The residual predicted by the weak model is added to the existing model input, nudging the model towards the correct target. Repeating this step multiple times improves the overall model prediction. #### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}loss{{%/badge%}}</td> <td>Loss function to be optimized. ‘ls’ refers to least squares regression. ‘lad’ (least absolute deviation) is a highly robust loss function solely based on order information of the input variables. ‘huber’ is a combination of the two. ‘quantile’ allows quantile regression (use alpha to specify the quantile).</td> <td>string</td> <td>{&#39;ls&#39;, &#39;lad&#39;, &#39;huber&#39;, &#39;quantile&#39;}</td> <td>’ls’</td> </tr> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>Learning rate shrinks the contribution of each tree by learning_rate.</td> <td>float</td> <td>(0.0, +Inf)</td> <td>0.1</td> </tr> <tr> <td>{{%badge%}}n_estimators{{%/badge%}}<br> (number of estimators)</td> <td>The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.</td> <td>int</td> <td>[1, 500]</td> <td>100</td> </tr> <tr> <td>{{%badge%}}criterion{{%/badge%}}</td> <td>The function to measure the quality of a split.</td> <td>string</td> <td>{&#39;friedman_mse&#39;, &#39;mse&#39;, &#39;mae&#39;}</td> <td>’friedman_mse’</td> </tr> <tr> <td>{{%badge%}}subsample{{%/badge%}}</td> <td>The fraction of samples to be used for fitting the individual base learners.</td> <td>float</td> <td>(0.0, 1.0]</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree.</td> <td>int</td> <td>(0, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_samples_split{{%/badge%}}</td> <td>The minimum number of samples required to split an internal node.</td> <td>int or float</td> <td>[2, +Inf) or (0, 1.0]</td> <td>2</td> </tr> <tr> <td>{{%badge%}}min_samples_leaf{{%/badge%}}</td> <td>The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.</td> <td>int or float</td> <td>[1, +Inf) or (0, 0.5]</td> <td>1</td> </tr> <tr> <td>{{%badge%}}min_weight_fraction_leaf{{%/badge%}}</td> <td>The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.</td> <td>float</td> <td>[0, 0.5]</td> <td>0</td> </tr> <tr> <td>{{%badge%}}max_features{{%/badge%}}</td> <td>The number of features to consider when looking for the best split.</td> <td>int, float or string</td> <td>(0, n_features] or { “sqrt”, “log2”}</td> <td>None</td> </tr> <tr> <td>{{%badge%}}max_leaf_nodes{{%/badge%}}</td> <td>Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity.</td> <td>int</td> <td>(1, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_impurity_decrease{{%/badge%}}</td> <td>A node will be split if this split induces a decrease of the impurity greater than or equal to this value.</td> <td>float</td> <td>[0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}init{{%/badge%}}</td> <td>An estimator object that is used to compute the initial predictions. init has to provide fit and predict. If ‘zero’, the initial raw predictions are set to zero.</td> <td>object</td> <td>estimator (any regression model except CatBoost) or ‘zero’</td> <td>None</td> </tr> <tr> <td>{{%badge%}}warm_start{{%/badge%}}</td> <td>When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}tol{{%/badge%}} (tolerance)</td> <td>Tolerance for the early stopping.
When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1e-4</td> </tr> </tbody> </table> 6. ### KNN Regression KNN regression works by finding the distances between a query (data instance) and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then predicting the average of the observations in that neighbourhood. In other words, it approximates the association between the independent variables (input variables) and the continuous outcome (target) by averaging the observations in the same neighbourhood. #### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}n_neighbors{{%/badge%}} <br>(number of neighbours)</td> <td>Number of neighbors to use by default for kneighbors queries.</td> <td>int</td> <td>[1, n]<br> n = Total number of records in dataset</td> <td>5</td> </tr> <tr> <td>{{%badge%}}weights{{%/badge%}}</td> <td>Weight function used in prediction.<ul style="text-align: left;"><li><strong>‘uniform’</strong>: uniform weights. All points in each neighborhood are weighted equally.</li><li><strong>‘distance’</strong>: weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.</li></ul></td> <td>string</td> <td>{‘uniform’, ‘distance’}</td> <td>’uniform’</td> </tr> <tr> <td>{{%badge%}}algorithm{{%/badge%}}</td> <td>Algorithm used to compute the nearest neighbors.</td> <td>string</td> <td>{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}</td> <td>’auto’</td> </tr> <tr> <td>{{%badge%}}leaf_size{{%/badge%}}</td> <td>Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.</td> <td>int</td> <td>(1, +Inf)</td> <td>30</td> </tr> <tr> <td>{{%badge%}}p{{%/badge%}}</td> <td>Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.</td> <td>int</td> <td>[1,3]</td> <td>2</td> </tr> <tr> <td>{{%badge%}}metric{{%/badge%}}</td> <td>Metric to use for distance computation. Default is “minkowski”, which results in the standard Euclidean distance when p = 2.</td> <td>string</td> <td>{‘cityblock’, ‘cosine’, &#39;euclidean&#39;, &#39;l1&#39;, &#39;l2&#39;, &#39;manhattan&#39;, &#39;nan_euclidean&#39;, ’minkowski’}</td> <td>’minkowski’</td> </tr> </tbody> </table> 7. ### Kernel Regression This regression fits a smooth curve to a scatter plot. Kernel values are used to derive weights to predict outputs from given inputs. Kernel regression is a non-parametric technique to estimate the conditional expectation of a random variable. The objective is to find a non-linear relation between a pair of random variables X and Y.
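The hyperparameters in the table below match scikit-learn's KernelRidge, so a minimal sketch with that estimator gives a feel for the technique; treat the mapping and the synthetic data as assumptions for illustration.

```python
# Sketch of kernel regression fitting a non-linear relation; scikit-learn's
# KernelRidge is assumed here because its hyperparameters match the table below.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # noisy sine curve

model = KernelRidge(alpha=1.0, kernel="rbf", gamma=0.5)
model.fit(X, y)
print(model.predict([[0.0], [1.5]]))  # should be near sin(0) and sin(1.5)
```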
#### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}alpha{{%/badge%}}</td> <td>Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization.</td> <td>float</td> <td>[0, +Inf)</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}kernel{{%/badge%}}</td> <td>Kernel mapping used internally. This parameter is directly passed to pairwise_kernel. If kernel is a string, it must be one of the metrics in pairwise.PAIRWISE_KERNEL_FUNCTIONS or “precomputed”. If kernel is “precomputed”, X is assumed to be a kernel matrix.</td> <td>string</td> <td>{‘additive_chi2’, &#39;chi2&#39;, ‘linear’, ‘poly’, ‘polynomial’, ‘rbf’, ‘laplacian’, ‘sigmoid’, &#39;cosine’}</td> <td>”linear”</td> </tr> <tr> <td>{{%badge%}}gamma{{%/badge%}}</td> <td>Gamma parameter for the RBF, laplacian, polynomial, exponential, chi2 and sigmoid kernels. Interpretation of the default value is left to the kernel; see the documentation for sklearn.metrics.pairwise.</td> <td>float</td> <td>[0, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}degree{{%/badge%}}</td> <td>Degree of the polynomial kernel.</td> <td>float</td> <td>[0, +Inf)</td> <td>3</td> </tr> <tr> <td>{{%badge%}}coef0{{%/badge%}}</td> <td>Zero coefficient for polynomial and sigmoid kernels.</td> <td>float</td> <td>(-Inf, +Inf)</td> <td>1</td> </tr> </tbody> </table> 8. ### LGBM Regression LGBM works by starting with an initial estimate that is updated using the output of each tree. The learning parameter controls the magnitude of this change in the estimates. It can be used on any data and provides a high degree of accuracy, as it contains many built-in preprocessing steps. The LightGBM algorithm grows vertically, meaning it grows leaf-wise, while other algorithms grow level-wise. LightGBM chooses the leaf with the largest loss to grow, and can reduce loss more than a level-wise algorithm when growing the same leaf.
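A minimal sketch with the lightgbm package shows how the leaf-wise settings below fit together; the data and values are illustrative assumptions, not QuickML defaults.

```python
# Hedged sketch of leaf-wise gradient boosting using the lightgbm package
# (QuickML drives the equivalent stage without code).
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=8, random_state=0)

model = LGBMRegressor(
    boosting_type="gbdt",
    num_leaves=31,       # leaf-wise growth: the leaf with the largest loss splits
    learning_rate=0.1,   # controls the magnitude of each update to the estimate
    n_estimators=100,
)
model.fit(X, y)
print(model.predict(X[:3]))
```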
#### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}boosting_type{{%/badge%}}</td> <td>Method of boosting.</td> <td>string</td> <td>{‘gbdt’, ‘dart’, ‘goss’}</td> <td>&#39;gbdt&#39;</td> </tr> <tr> <td>{{%badge%}}num_leaves{{%/badge%}}</td> <td>Maximum tree leaves for base learners.</td> <td>int</td> <td>(1, +Inf)</td> <td>31</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>Maximum tree depth for base learners; &lt;= 0 means no limit.</td> <td>int</td> <td>(-Inf, +Inf)</td> <td>-1</td> </tr> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>Boosting learning rate.</td> <td>float</td> <td>(0.0, +Inf)</td> <td>0.1</td> </tr> <tr> <td>{{%badge%}}n_estimators{{%/badge%}}<br> (number of estimators)</td> <td>Number of boosted trees to fit.</td> <td>int</td> <td>[1, 500]</td> <td>100</td> </tr> <tr> <td>{{%badge%}}subsample_for_bin{{%/badge%}}</td> <td>Number of samples for constructing bins.</td> <td>int</td> <td>(0, +Inf)</td> <td>200000</td> </tr> <tr> <td>{{%badge%}}min_split_gain{{%/badge%}}</td> <td>Minimum loss reduction required to make a further partition on a leaf node of the tree.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}min_child_weight{{%/badge%}}</td> <td>Minimum sum of instance weight (Hessian) needed in a child (leaf).</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1e-3</td> </tr> <tr> <td>{{%badge%}}min_child_samples{{%/badge%}}</td> <td>Minimum number of data needed in a child (leaf).</td> <td>int</td> <td>[0, +Inf)</td> <td>20</td> </tr> <tr> <td>{{%badge%}}subsample{{%/badge%}}</td> <td>Subsample ratio of the training instances.</td> <td>float</td> <td>(0.0, 1.0]</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}subsample_freq{{%/badge%}} (subsample_frequency)</td> <td>Frequency of subsampling; &lt;= 0 means disabled.</td> <td>int</td> <td>(-Inf, +Inf)</td> <td>0</td> </tr> <tr> <td>{{%badge%}}colsample_bytree{{%/badge%}} (column sample by tree)</td> <td>Subsample ratio of columns when constructing each tree.</td> <td>float</td> <td>(0.0, 1.0]</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}reg_alpha{{%/badge%}} (alpha)</td> <td>L1 regularization term on weights.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}reg_lambda{{%/badge%}} (lambda)</td> <td>L2 regularization term on weights.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}importance_type{{%/badge%}}</td> <td>The type of feature importance to be filled into feature_importances_. If ‘split’, the result contains the number of times the feature is used in the model. If ‘gain’, the result contains the total gains of splits which use the feature.</td> <td>string</td> <td>{‘gain’, &#39;split&#39;}</td> <td>&#39;split&#39;</td> </tr> </tbody> </table> 9. ### Lasso Regression Lasso regression is a regularization technique, used over plain regression methods for more accurate prediction. Lasso regression is a type of linear regression that uses shrinkage, where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e., models with fewer parameters).
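The shrinkage-toward-sparsity behaviour is easy to see with scikit-learn's Lasso; the synthetic dataset below is an assumption chosen so that only a few features truly matter.

```python
# Minimal sketch of lasso shrinkage producing a sparse model
# (scikit-learn, for illustration only).
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 3 of the 20 features are truly informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)

model = Lasso(alpha=1.0)  # larger alpha -> stronger shrinkage, sparser model
model.fit(X, y)
print("features kept:", (model.coef_ != 0).sum(), "of", X.shape[1])
```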
#### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}alpha{{%/badge%}}</td> <td>Constant that multiplies the L1 term, controlling regularization strength. alpha must be a non-negative float.</td> <td>float</td> <td>(0, +Inf)</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}fit_intercept{{%/badge%}}</td> <td>Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations.</td> <td>bool</td> <td>True or False</td> <td>True</td> </tr> <tr> <td>{{%badge%}}normalize{{%/badge%}}</td> <td>This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}tol{{%/badge%}} (tolerance)</td> <td>The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1e-4</td> </tr> <tr> <td>{{%badge%}}warm_start{{%/badge%}}</td> <td>When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}positive{{%/badge%}}</td> <td>When set to True, forces the coefficients to be positive.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}selection{{%/badge%}}</td> <td>If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default.</td> <td>string</td> <td>{&quot;cyclic&quot;, &quot;random&quot;}</td> <td>&quot;cyclic&quot;</td> </tr> </tbody> </table> 10. ### Linear Regression Linear regression is a regression model that estimates the linear relationship between the independent variable (input) and the dependent variable (target) using a straight line. It is the basic algorithm for regression-type problems. #### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}fit_intercept{{%/badge%}}</td> <td>Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations.</td> <td>bool</td> <td>True or False</td> <td>True</td> </tr> <tr> <td>{{%badge%}}normalize{{%/badge%}}</td> <td>This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> </tbody> </table> 11. ### Random-Forest Regression The random forest is a classification and regression algorithm consisting of many decision trees. It uses bagging and feature randomness when building individual trees to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Bagging is an ensemble meta-estimator that fits base classifiers/regressors on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.
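The committee-of-trees idea, including the out-of-bag estimate listed in the table below, can be sketched with scikit-learn's RandomForestRegressor; the dataset and values here are illustrative assumptions.

```python
# Hedged sketch: many decision trees averaged by committee, via scikit-learn's
# RandomForestRegressor (illustrative; QuickML configures this without code).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, random_state=0)

forest = RandomForestRegressor(
    n_estimators=100,   # number of trees in the forest
    bootstrap=True,     # each tree sees a random bootstrap sample (bagging)
    oob_score=True,     # estimate generalization from out-of-bag samples
    random_state=0,
)
forest.fit(X, y)
print("OOB R^2:", forest.oob_score_)
```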
#### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}n_estimators{{%/badge%}}</td> <td>The number of trees in the forest.</td> <td>int</td> <td>[1, 500]</td> <td>100</td> </tr> <tr> <td>{{%badge%}}criterion{{%/badge%}}</td> <td>The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as a feature-selection criterion, and “mae” for the mean absolute error.</td> <td>string</td> <td>{&quot;mse&quot;, &quot;mae&quot;}</td> <td>”mse”</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.</td> <td>int</td> <td>(0, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_samples_split{{%/badge%}}</td> <td>The minimum number of samples required to split an internal node.</td> <td>int or float</td> <td>[2, +Inf) or (0, 1.0]</td> <td>2</td> </tr> <tr> <td>{{%badge%}}min_samples_leaf{{%/badge%}}</td> <td>The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.</td> <td>int or float</td> <td>[1, +Inf) or (0, 0.5]</td> <td>1</td> </tr> <tr> <td>{{%badge%}}min_weight_fraction_leaf{{%/badge%}}</td> <td>The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.</td> <td>float</td> <td>[0, 0.5]</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}max_features{{%/badge%}}</td> <td>The number of features to consider when looking for the best split.</td> <td>int, float or string</td> <td>(0, n_features], { “sqrt”, “log2”} or None</td> <td>None</td> </tr> <tr> <td>{{%badge%}}max_leaf_nodes{{%/badge%}}</td> <td>Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity.</td> <td>int</td> <td>(1, +Inf)</td> <td>None</td> </tr> <tr> <td>{{%badge%}}min_impurity_decrease{{%/badge%}}</td> <td>A node will be split if this split induces a decrease of the impurity greater than or equal to this value.</td> <td>float</td> <td>[0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}bootstrap{{%/badge%}}</td> <td>Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.</td> <td>bool</td> <td>True or False</td> <td>True</td> </tr> <tr> <td>{{%badge%}}oob_score{{%/badge%}} (out of bag score)</td> <td>Whether to use out-of-bag samples to estimate the generalization score.
Only available if bootstrap=True.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}warm_start{{%/badge%}}</td> <td>When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> </tbody> </table> 12. ### Ridge Regression Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It can be used when the input variables are highly correlated with one another. #### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}alpha{{%/badge%}}</td> <td>Constant that multiplies the L2 term, controlling regularization strength.</td> <td>float</td> <td>(0, +Inf)</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}fit_intercept{{%/badge%}}</td> <td>Whether to fit the intercept for this model. If set to False, no intercept will be used in calculations.</td> <td>bool</td> <td>True or False</td> <td>True</td> </tr> <tr> <td>{{%badge%}}normalize{{%/badge%}}</td> <td>This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.</td> <td>bool</td> <td>True or False</td> <td>False</td> </tr> <tr> <td>{{%badge%}}tol{{%/badge%}} (tolerance)</td> <td>Precision of the solution.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1e-4</td> </tr> <tr> <td>{{%badge%}}solver{{%/badge%}}</td> <td>Solver to use in the computational routines; the values are explained in the note below.</td> <td>string</td> <td>{‘auto’, ‘svd’, ‘cholesky’, ‘lsqr’, ‘sparse_cg’, ‘sag’, ‘saga’}</td> <td>’auto’</td> </tr> </tbody> </table> {{%note%}}{{%bold%}}Note:{{%/bold%}} Values of the {{%bold%}}solver{{%/bold%}} are: * '{{%bold%}}auto{{%/bold%}}' chooses the solver automatically based on the type of data. * '{{%bold%}}svd{{%/bold%}}' uses a Singular Value Decomposition of X to compute the Ridge coefficients. It is the most stable solver, in particular more stable for singular matrices than ‘cholesky’, at the cost of being slower. * '{{%bold%}}cholesky{{%/bold%}}' uses the standard scipy.linalg.solve function to obtain a closed-form solution. * '{{%bold%}}sparse_cg{{%/bold%}}' uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an iterative algorithm, this solver is more appropriate than ‘cholesky’ for large-scale data (possibility to set tol and max_iter). * '{{%bold%}}lsqr{{%/bold%}}' uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest and uses an iterative procedure. * '{{%bold%}}sag{{%/bold%}}' uses a Stochastic Average Gradient descent, and ‘saga’ uses its improved, unbiased version named SAGA. Both methods also use an iterative procedure, and are often faster than other solvers when both n_samples and n_features are large. Note that the fast convergence of ‘sag’ and ‘saga’ is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.{{%/note%}} 13. ### SVM Regression Support vector regression is used to predict continuous values. Support vector regression uses the same principle as SVMs.
The basic idea behind SVR is to find the best-fit line. In SVR, the best-fit line is the hyperplane that contains the maximum number of points within the epsilon margin. #### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}C{{%/badge%}}</td> <td>Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.</td> <td>float</td> <td>(0.0, +Inf)</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}kernel{{%/badge%}}</td> <td>Specifies the kernel type to be used in the algorithm. If none is given, {{%badge%}}rbf{{%/badge%}} will be used. If a callable is given it is used to precompute the kernel matrix.</td> <td>string</td> <td>{‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}</td> <td>’rbf’</td> </tr> <tr> <td>{{%badge%}}degree{{%/badge%}}</td> <td>Degree of the polynomial kernel function (‘poly’).</td> <td>int</td> <td>[0, +Inf)</td> <td>3</td> </tr> <tr> <td>{{%badge%}}gamma{{%/badge%}}</td> <td>Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.</td> <td>string or float</td> <td>{‘scale’, ‘auto’} or (0.0, +Inf)</td> <td>’scale’</td> </tr> <tr> <td>{{%badge%}}coef0{{%/badge%}}</td> <td>Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.</td> <td>float</td> <td>(-Inf, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}shrinking{{%/badge%}}</td> <td>Whether to use the shrinking heuristic.</td> <td>bool</td> <td>True or False</td> <td>True</td> </tr> <tr> <td>{{%badge%}}tol{{%/badge%}} (tolerance)</td> <td>Tolerance for stopping criterion.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>1e-3</td> </tr> <tr> <td>{{%badge%}}epsilon{{%/badge%}}</td> <td>Epsilon in the epsilon-SVM model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.</td> <td>float</td> <td>[0, +Inf)</td> <td>0.1</td> </tr> </tbody> </table> 14. ### XGB Regression XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine-learning algorithms under the gradient-boosting framework. It provides parallel tree boosting to solve many data science problems quickly and accurately. It uses L1 and L2 regularization and trains quickly. #### Hyper Parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w10p">Data Type</th> <th class="w25p">Possible Values</th> <th class="w20p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}booster{{%/badge%}}</td> <td>Decides which booster to use.</td> <td>string</td> <td>{‘gbtree&#39;, &#39;gblinear&#39;, &#39;dart&#39;}</td> <td>’gbtree’</td> </tr> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>Step size shrinkage used in updates to prevent overfitting.
After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.</td> <td>float</td> <td>[0,1]</td> <td>0.1</td> </tr> <tr> <td>{{%badge%}}n_estimators{{%/badge%}}<br>(number of estimators)</td> <td>Number of trees to fit.</td> <td>int</td> <td>[1, 500]</td> <td>100</td> </tr> <tr> <td>{{%badge%}}objective{{%/badge%}}</td> <td>The learning objective; the default performs linear regression.</td> <td>string</td> <td>Mentioned below the table.</td> <td>&quot;reg:linear&quot;</td> </tr> <tr> <td>{{%badge%}}subsample{{%/badge%}}</td> <td>Subsample ratio of the training instances.</td> <td>float</td> <td>(0,1]</td> <td>1</td> </tr> <tr> <td>{{%badge%}}max_depth{{%/badge%}}</td> <td>Maximum depth of a tree.</td> <td>int</td> <td>(0, +Inf)</td> <td>3</td> </tr> <tr> <td>{{%badge%}}max_delta_step{{%/badge%}}</td> <td>If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when classes are extremely imbalanced.</td> <td>int or float</td> <td>[0, +Inf)</td> <td>0</td> </tr> <tr> <td>{{%badge%}}colsample_bytree{{%/badge%}} (column sample by tree)</td> <td>Subsample ratio of columns when constructing each tree.</td> <td>float</td> <td>(0, 1]</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}colsample_bylevel{{%/badge%}} (column sample by level)</td> <td>It is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.</td> <td>float</td> <td>(0, 1]</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}min_child_weight{{%/badge%}}</td> <td>Minimum sum of instance weight (Hessian) needed in a child.</td> <td>int</td> <td>[0, +Inf)</td> <td>1</td> </tr> <tr> <td>{{%badge%}}reg_alpha{{%/badge%}} (alpha)</td> <td>L1 regularization term on weights.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}reg_lambda{{%/badge%}} (lambda)</td> <td>L2 regularization term on weights.</td> <td>float</td> <td>[0.0, +Inf)</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}scale_pos_weight{{%/badge%}} (scale positive weight)</td> <td>Controls the balance of positive and negative weights; useful for unbalanced classes.</td> <td>int</td> <td>[0, +Inf)</td> <td>1</td> </tr> </tbody> </table> **POSSIBLE VALUES FOR THE "OBJECTIVE" PARAM:** {"rank:pairwise", "reg:tweedie", "reg:gamma", "reg:linear", "count:poisson"} -------------------------------------------------------------------------------- title: "Ensembling" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.606Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/ml-algorithms/ensemble/" service: "QuickML" -------------------------------------------------------------------------------- # Ensembling Ensemble learning is a machine learning technique that enhances prediction accuracy and robustness by combining the outputs of multiple models. #### Ensembling techniques 1. **Classification Ensemble** - A classification model that combines the predictions of multiple other models using voting techniques, namely hard voting and soft voting.
In hard voting, the class that receives the majority of votes among all the models is selected as the outcome. In soft voting, the class probabilities predicted by all the base models are averaged, and the class with the highest average probability is selected as the final prediction. 2. **Regression Ensemble** - A regression model that combines the predictions of multiple other models by averaging techniques, namely weighted average and non-weighted average. 3. **Stacking Classification** - An ensemble learning classifier that combines the predictions of multiple base classifiers (or models) to create a more powerful and robust meta-classifier. This is particularly useful when you have several diverse classifiers with complementary strengths and weaknesses. 4. **Stacking Regression** - An ensemble learning regressor that combines the predictions of multiple base regressors (or models) to create a more powerful and robust meta-regressor. This is particularly useful when you have several diverse regressors with complementary strengths and weaknesses. -------------------------------------------------------------------------------- title: "Recommendation" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.606Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/ml-algorithms/recommendation/" service: "QuickML" -------------------------------------------------------------------------------- # Recommendation Algorithms Recommendation algorithms leverage various datasets, such as historical transactions or interaction data, item attributes, and user demographics, to analyze patterns in user behavior and generate highly personalized recommendations that align with user interests. They play a critical role in driving interactions, improving retention, and enhancing the user experience across various platforms. In QuickML, we currently support Information Retrieval algorithms designed to meet various business needs for building diverse recommendation models. ## Types of Recommendation Systems Types of recommendation models that are used in real-time environments include: - Sequential recommendation system - Personalized recommendation system - Recurrence cycle recommendation system ## a. Information retrieval A recommender system is an intelligent tool that analyzes past interactions, preferences, and behavior to suggest personalized products that are likely to be of interest to each user. It uses advanced algorithms to understand users' tastes and needs, making accurate recommendations that match their unique preferences. The information retrieval algorithms aim to generate accurate suggestions that enhance user experience and engagement by providing personalized recommendations. The algorithms used in each type of recommender system within QuickML are described below. 1. ## Sequential recommendation system Sequential recommendation algorithms utilize machine learning techniques to analyze historical interaction data and predict the next item or items likely to be consumed in a sequence. This algorithm considers the sequential order of a user's past interactions to suggest products that align with their buying patterns.
The algorithm used to build these models is: ### SubSequence **SuBSeq**, or Succinct BWT-Based Sequence Prediction, is a powerful algorithm that utilizes the Burrows-Wheeler Transform (BWT), designed specifically for sequential recommendation systems, leveraging advanced techniques to extract meaningful patterns from transaction/interaction data. It focuses on subsequence mining, enabling it to identify recurring sequences and intricate patterns within user behavior sequences. Through a combination of efficient data processing and pattern recognition, SuBSeq excels at capturing temporal dependencies, user preferences, and contextual nuances, ultimately leading to highly personalized recommendations tailored to individual behavior. It is particularly efficient in terms of memory usage and computational complexity. #### Hyper-parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w15p">Data Type</th> <th class="w15p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}transactions_column{{%/badge%}}</td> <td>Column name for itemsets in the input data.</td> <td>str</td> <td>Any valid column name in the dataset</td> <td>User-specified</td> </tr> <tr> <td>{{%badge%}}min_similar_sequence{{%/badge%}}</td> <td>Minimum number of similar sequences required.</td> <td>int</td> <td>Any positive integer</td> <td>2</td> </tr> </tbody> </table> #### Sample scenarios to choose this algorithm: An online media platform aims to enhance user engagement by predicting the sequences of items users might consume based on their past behavior and interactions. The goal is to understand and anticipate user preferences, enabling the platform to provide personalized content recommendations. For instance, if a user watches three comedy movies, then switches to historical documentaries, a sequential recommendation algorithm identifies this change in consumption pattern and suggests more documentaries or a mix of both genres. 2. ## Personalized Recommendation system Personalized recommendation algorithms aim to provide tailored recommendations to individual users based on their preferences, past interactions, product attributes, and demographic information. These algorithms leverage user-item interaction data to identify patterns and similarities among users and items. The algorithms used to build these models are: ### LightFM **LightFM** is a robust recommendation algorithm leveraging user profiles, product details, and interaction data to provide personalized recommendations. By merging collaborative filtering with matrix factorization techniques, LightFM captures user preferences and item features, ensuring accurate recommendations in the given domain. Analyzing user profiles and item characteristics, it delivers more relevant suggestions, simplifying the discovery of related products and adjusting recommendations promptly based on user behavior.
With insights into user demographics, preferences, and interaction history, LightFM enhances recommendation experiences across diverse industries and applications, fostering engagement and satisfaction. #### Hyper-parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w15p">Data Type</th> <th class="w15p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}no_components{{%/badge%}}</td> <td>The dimensionality of the latent feature embeddings. Higher values capture more complex relationships but may overfit if too large for the dataset.</td> <td>int</td> <td>Any positive integer</td> <td>10</td> </tr> <tr> <td>{{%badge%}}n{{%/badge%}}</td> <td>For k-OS training, this specifies the maximum number of positive samples considered for each user during each update step.</td> <td>int</td> <td>Any positive integer</td> <td>10</td> </tr> <tr> <td>{{%badge%}}learning_schedule{{%/badge%}}</td> <td>Determines the learning rate schedule to be used. Options include adaptive schedules like adagrad or adadelta, which adjust learning rates based on gradient information during training.</td> <td>str</td> <td>{'adagrad', 'adadelta'}</td> <td>'adagrad'</td> </tr> <tr> <td>{{%badge%}}loss{{%/badge%}}</td> <td>Specifies the loss function to optimize: <ul style="text-align: left;"><li><strong>&#39;logistic&#39;</strong>: useful when both positive and negative interactions are present;</li><li><strong>&#39;bpr&#39;</strong>: Bayesian Personalised Ranking pairwise loss;</li><li><strong>&#39;warp&#39;</strong>: Weighted Approximate-Rank Pairwise loss, useful when only positive interactions are present;</li><li><strong>&#39;warp-kos&#39;</strong>: k-th order statistic loss, a modification of WARP.</li></ul></td> <td>str</td> <td>{'logistic', 'bpr', 'warp', 'warp-kos'}</td> <td>'logistic'</td> </tr> <tr> <td>{{%badge%}}learning_rate{{%/badge%}}</td> <td>The initial learning rate for the gradient descent optimizer. Smaller values result in slower but potentially more stable convergence.</td> <td>float</td> <td>Any positive float</td> <td>0.05</td> </tr> <tr> <td>{{%badge%}}item_alpha{{%/badge%}}</td> <td>L2 regularization strength on item feature embeddings. Helps prevent overfitting by penalizing large weights. Higher values enforce stronger regularization.</td> <td>float</td> <td>Any non-negative float</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}user_alpha{{%/badge%}}</td> <td>L2 regularization strength on user feature embeddings. Helps prevent overfitting by penalizing large weights. Similar behavior to item_alpha.</td> <td>float</td> <td>Any non-negative float</td> <td>0.0</td> </tr> <tr> <td>{{%badge%}}train_split_ratio{{%/badge%}}</td> <td>The fraction of the data used for training. The remaining data is reserved for evaluation.
A higher ratio means more data for training but less for testing.</td> <td>float</td> <td>Between 0 and 1</td> <td>0.8</td> </tr> <tr> <td>{{%badge%}}td_uid_column{{%/badge%}}</td> <td>The name of the column in the transactions dataset representing unique user IDs.</td> <td>str</td> <td>Any valid column name in the dataset</td> <td>User-specified</td> </tr> <tr> <td>{{%badge%}}ud_uid_column{{%/badge%}}</td> <td>The name of the column in the user features dataset representing unique user IDs.</td> <td>str</td> <td>Any valid column name in the dataset</td> <td>User-specified</td> </tr> <tr> <td>{{%badge%}}pd_pid_column{{%/badge%}}</td> <td>The name of the column in the product features dataset representing unique product IDs.</td> <td>str</td> <td>Any valid column name in the dataset</td> <td>User-specified</td> </tr> </tbody> </table> #### Sample scenarios to choose this algorithm: - In e-commerce platforms, personalized recommendation algorithms suggest relevant items based on real-time user interactions and search activity. By analyzing user behavior, item characteristics, and transaction data, these models generate highly personalized suggestions that resonate with individual users. This approach not only enhances the user experience but also increases business value by promoting relevant and related products. - In e-learning platforms, these algorithms facilitate the discovery of related courses that complement users' learning interests and objectives. The models enhance discovery by analyzing course similarities and user interests, ultimately leading to increased exploration and engagement. ### Pixie Building systems that provide high-quality personalized recommendations presents a major challenge due to a massive pool of items and a large number of users. These recommendations should be generated on demand, in response to user actions. Pixie is a scalable real-time graph-based recommendation system that addresses this problem using a technique called random walks to explore a large graph of interconnected items and recommend relevant items to users. The graph comprises nodes and edges, where nodes represent items or users, and edges represent relationships or interactions between them. This algorithm is designed to be highly scalable, allowing it to operate efficiently in environments with a vast number of items and users.
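To make the random-walk idea concrete, here is a toy sketch in plain Python. It is not Pixie or QuickML's implementation; the graph, names, and counting scheme are hypothetical and exist only to show how repeated short walks surface related items.

```python
# Toy illustration of a Pixie-style random walk on a user-item graph.
# NOT QuickML's implementation; all names and data here are hypothetical.
import random
from collections import Counter

# Bipartite graph: items are connected through the users who interacted with them.
item_to_users = {"phone": ["u1", "u2"], "charger": ["u1", "u3"],
                 "case": ["u2"], "speaker": ["u3"]}
user_to_items = {"u1": ["phone", "charger"], "u2": ["phone", "case"],
                 "u3": ["charger", "speaker"]}

def random_walk(start_item, depth=10, n_walks=50, seed=0):
    """Count item visits over repeated short walks from a query item."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(n_walks):
        item = start_item
        for _ in range(depth):
            user = rng.choice(item_to_users[item])   # hop item -> user
            item = rng.choice(user_to_items[user])   # hop user -> item
            visits[item] += 1
    visits.pop(start_item, None)                     # don't recommend the query item
    return visits.most_common()

print(random_walk("phone"))  # items most often reached from "phone"
```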
#### Hyper-parameters: <table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w15p">Data Type</th> <th class="w15p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}user_id_column{{%/badge%}}</td> <td>The name of the unique user ID column in the user dataset.</td> <td>str</td> <td>Any valid column name in the dataset</td> <td>User-specified</td> </tr> <tr> <td>{{%badge%}}product_id_column{{%/badge%}}</td> <td>The name of the unique product ID column in the product dataset.</td> <td>str</td> <td>Any valid column name in the dataset</td> <td>User-specified</td> </tr> <tr> <td>{{%badge%}}depth{{%/badge%}}</td> <td>Number of steps in each random walk.</td> <td>int</td> <td>Any positive integer</td> <td>10</td> </tr> <tr> <td>{{%badge%}}n_epochs{{%/badge%}}</td> <td>Maximum number of steps for the random walk process.</td> <td>int</td> <td>Any positive integer</td> <td>50</td> </tr> <tr> <td>{{%badge%}}higher_weight{{%/badge%}}</td> <td>Weight for biasing the graph edge towards more likely connections.</td> <td>float</td> <td>A float between 0 and 1</td> <td>1.0</td> </tr> <tr> <td>{{%badge%}}lower_weight{{%/badge%}}</td> <td>Weight for biasing the graph edge towards less likely connections.</td> <td>float</td> <td>A float between 0 and 1</td> <td>0.0001</td> </tr> <tr> <td>{{%badge%}}recommendation_type{{%/badge%}}</td> <td>Type of prediction to be made.</td> <td>str</td> <td>'fbt', 'cwbab'</td> <td>'cwbab'</td> </tr> <tr> <td>{{%badge%}}with_feature_encoding{{%/badge%}}</td> <td>Whether to use encoding of user features.</td> <td>bool</td> <td>True, False</td> <td>False</td> </tr> </tbody> </table> #### Sample scenarios to choose this algorithm: Pixie can suggest products to users based on their browsing history and purchase behavior, leading to improved product discovery and higher sales conversion rates. #### For example: An e-commerce platform specializing in consumer electronics leverages Pixie to improve its recommendation system. When a user purchases a smartphone, the algorithm uses recommendation_type: **fbt (Frequently Bought Together)** to recommend necessary co-purchases, such as screen protectors or chargers, which are immediately relevant to the current purchase. At the same time, the algorithm can use recommendation_type: "**cwbab (Customers who bought also bought)**" to suggest complementary products, like smartwatches or Bluetooth speakers, which the customer might consider buying later. By differentiating between immediate and potential future needs, the platform not only increases the average order value but also builds long-term customer engagement and satisfaction. 3. ## Recurrence cycle recommendation system The Recurrence Cycle Recommendation Model is trained to identify and suggest items based on recurring patterns in user behavior. By analyzing historical data, it detects the recurring cycles or intervals at which users are likely to repeat specific interactions, such as purchases, subscriptions, or engagements. The Recurrence Finder algorithm predicts the items users are most likely to repurchase, simplifying the process of restocking products or re-engaging with preferred items. The algorithm used to build the recurring recommendation models is: ### Recurrence Finder **Recurrence Finder** identifies and predicts recurring events, including customer product purchases, event attendance, and daily alarm settings, leveraging historical timestamps to forecast future occurrences.
## 3. Recurrence Cycle Recommendation System

The Recurrence Cycle Recommendation Model is trained to identify and suggest items based on recurring patterns in user behavior. By analyzing historical data, it detects the recurring cycles or intervals at which users are likely to repeat specific interactions, such as purchases, subscriptions, or engagements. The Recurrence Finder algorithm predicts the items users are most likely to repurchase, simplifying the process of restocking products or re-engaging with preferred items. The algorithm used to build recurring recommendation models is:

### Recurrence Finder

**Recurrence Finder** identifies and predicts recurring events, including customer product purchases, event attendance, and daily alarm settings, leveraging historical timestamps to forecast future occurrences. By analyzing patterns from historical events, it forecasts future occurrences of events, aiding effective planning and decision-making. This model assists in optimizing strategies for customer retention, event management, and time management tools. With its ability to predict future occurrences, Recurrence Finder enhances efficiency and productivity across various domains, from e-commerce to healthcare.

#### Hyperparameters:

<table class="content-table quickml-content-table"> <thead> <tr> <th class="w25p">Parameter</th> <th class="w30p">Description</th> <th class="w15p">Data Type</th> <th class="w15p">Possible Values</th> <th class="w15p">Default Values</th> </tr> </thead> <tbody> <tr> <td>{{%badge%}}user{{%/badge%}}</td> <td>The name of the column in the dataset that contains unique user IDs. This column is essential for identifying the recurrence of transactions based on users.</td> <td>str</td> <td>Any valid column name in the dataset</td> <td>User-specified</td> </tr> <tr> <td>{{%badge%}}product{{%/badge%}}</td> <td>The name of the column in the dataset that contains unique product IDs. This column is used to track the recurrence of specific products in transactions.</td> <td>str</td> <td>Any valid column name in the dataset</td> <td>User-specified</td> </tr> <tr> <td>{{%badge%}}timestamp_column{{%/badge%}}</td> <td>The name of the column in the dataset that contains timestamps for the transactions. This column is critical for calculating the time intervals between recurring transactions.</td> <td>str</td> <td>Any valid column name in the dataset</td> <td>User-specified</td> </tr> <tr> <td>{{%badge%}}quantity{{%/badge%}}</td> <td>A boolean flag indicating whether the recurrence analysis should consider transaction quantities. If True, the quantity_column will be used to incorporate quantity-based recurrence patterns.</td> <td>bool</td> <td>True, False</td> <td>User-specified</td> </tr> <tr> <td>{{%badge%}}quantity_column{{%/badge%}}</td> <td>The name of the column in the dataset that contains the quantity of products purchased in each transaction. This is required if the quantity parameter is set to True.</td> <td>str</td> <td>Any valid column name in the dataset</td> <td>User-specified</td> </tr> </tbody> </table>

#### Sample scenarios to choose this algorithm:

- Subscription-based businesses can utilize recurrence predictions to forecast when customers are likely to renew their subscriptions. This allows them to implement targeted retention strategies, such as personalized offers or reminders, to increase renewal rates and customer loyalty.
- Service-based businesses, such as healthcare providers or salons, can use recurrence predictions to schedule appointments efficiently. By anticipating when clients are likely to book appointments, they can optimize staff schedules, minimize wait times, and enhance customer satisfaction.

## Data Validation criteria

Recommendation models are trained using three datasets:

1. Transactions/Interactions data
2. Users’ demographic data
3. Items’ attribute data

The **Transactions dataset** contains transaction details, such as transactionID, userID, itemID, order value, purchase date, timestamp, etc., capturing each purchase the user has made. The **Users dataset** holds demographic information about the users of the business, providing insights into their characteristics and preferences. The **Items/Product features** dataset contains attributes and characteristics of the items being purchased by users, such as category, brand, and price.

A validation check involving all three datasets is automatically performed by the algorithm before model training begins. If any of the following criteria are not met, the algorithm stops training and throws an error:

- No missing values should be present in the transactions, users, or items datasets.
- Any userID or itemID/productID present in the Transactions dataset must also exist in the respective Users or Items dataset.
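The sketch below illustrates what these two checks mean, using pandas. The column names (user_id, product_id) are placeholders; QuickML runs this validation automatically, so the code is only meant to make the criteria concrete.

```python
import pandas as pd

# Placeholder datasets with hypothetical column names.
transactions = pd.DataFrame({"user_id": [1, 2], "product_id": ["a", "b"], "quantity": [1, 3]})
users = pd.DataFrame({"user_id": [1, 2], "age": [31, 45]})
products = pd.DataFrame({"product_id": ["a", "b"], "category": ["toys", "books"]})

# Criterion 1: no missing values in any of the three datasets.
for name, df in [("transactions", transactions), ("users", users), ("products", products)]:
    assert not df.isnull().values.any(), f"{name} contains missing values"

# Criterion 2: every ID referenced in transactions must exist in its feature dataset.
assert set(transactions["user_id"]).issubset(users["user_id"]), "unknown user IDs"
assert set(transactions["product_id"]).issubset(products["product_id"]), "unknown product IDs"
print("Validation passed")
```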
-------------------------------------------------------------------------------- title: "Time Series" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.611Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/ml-algorithms/time-series/" service: "QuickML" --------------------------------------------------------------------------------

# Time Series Algorithms

Time series forecasting is a predictive analytics task that involves predicting future data points by utilizing historical data collected over time. Let's dive deep into how time series forecasting works and explore the different algorithms used to generate forecasting models.

### Forecasting

Forecasting involves using statistical methods to predict future values by examining the trends and patterns in past data. This analysis helps business owners make well-informed decisions about future courses of action based on the predictions generated by the model. Time series forecasting algorithms can be classified into two categories: univariate and multivariate, each leading to the creation of corresponding types of models.

#### 1. Univariate Forecasting

Univariate forecasting models are built using univariate datasets, which contain only a single feature or variable recorded over time. These models focus exclusively on the temporal patterns, trends, and seasonality of that one variable for future forecasts, without incorporating the influence of other factors. Algorithms used to build univariate forecasting models include:

- Moving Average (MA)
- Exponential Smoothing
- Holt-Winters' Method
- Automated Autoregressive Integrated Moving Average (Auto ARIMA)
- Seasonal Autoregressive Integrated Moving Average (SARIMA)
- Autoregressive Moving Average (ARMA)
- Auto Regressor (AR)

Let's understand how each of these algorithms works in detail.

**a. Moving Average**

The Moving Average (MA) model is a commonly used statistical technique for forecasting in time series analysis. Instead of calculating the average of past data points, it predicts future values based on a linear combination of past forecast errors (or residuals) over a fixed number of periods, known as the “window size.” This method smooths out random fluctuations, making it helpful for identifying short-term trends and patterns in the data. While the MA model is easy to apply and can be useful for various types of time series data, it relies solely on past errors and does not directly account for past values or other external factors that may affect the data. As a result, it is often recommended to combine the Moving Average model with other forecasting methods, like AutoRegressive models, to improve accuracy and capture more complex dynamics in the data. The number of past forecast errors used in a **Moving Average (MA)** model is defined by the MA order or lag (q).
The MA order indicates how many previous error terms (lags of the residuals) are included in the model to predict the current value. For example, in an MA(q) model:

- If q = 1, the model uses the previous error term (the first lag of the residual).
- If q = 2, it uses the previous two error terms (the first and second lags), and so on.

The choice of the MA order can significantly impact the model’s performance.

**Advantages**

- The Moving Average algorithm is well suited for short-term forecasting, such as predicting sales for the next few periods.
- It can help remove noise from time series data, making it easier to identify underlying patterns or trends by smoothing out random fluctuations.

**Disadvantages**

- The Moving Average (MA) model is not suitable for long-term forecasting, as it relies only on past forecast residuals (errors) and does not directly account for past values or other external factors that may affect the data in the long run.
- It may react slowly to sudden changes or trends in the data, depending on the window size of residuals used for predictions.
- MA models are primarily used for univariate time series forecasting, where the focus is on a single variable over time and its past errors. In multivariate time series forecasting, where multiple variables are involved, moving averages do not perform well.
- In multivariate time series analysis, the goal is often to understand the relationships between multiple variables and to leverage these correlations for better forecasting.
- Applying a simple moving average model to each variable independently in multivariate data fails to capture the potential interdependencies between the variables.

**b. Auto Regressive Model**

AutoRegressive (AR) models are a class of statistical models commonly used in time series analysis to forecast future values by using past values in the time series as input. They are particularly useful for capturing temporal dependencies in the data by assessing the correlation between preceding and succeeding values in the series. The model predicts future values as a linear combination of past observations, relying on the assumption that past values have a direct influence on future behavior. The AR order (p) is the number of past observations used in an AutoRegressive (AR) model. It indicates how many previous values (lags) of the time series are included in the model to predict the current value. For example, in an AR(p) model:

- If p = 1, the model uses the previous value (the first lag).
- If p = 2, it uses the previous two values (the first and second lags), and so on.

The choice of the AR order can significantly impact the model’s performance.

**Advantages**

- AR models are effective in capturing the temporal dependencies present in time series data. By regressing the current value on its own past values, they can capture patterns and trends in the data.
- The parameters of AR models have clear interpretations. Each autoregressive parameter represents the influence of a specific lagged value on the current value, making it easy to understand the relationship between past and present observations.
- AR models are robust to outliers and handle noisy data well. They can effectively filter out short-term fluctuations and focus on capturing long-term trends in the data.
- AR models are computationally efficient, particularly for small to moderate-sized datasets.
They require relatively fewer parameters compared to other time series models, making them suitable for quick analysis and forecasting.

**Disadvantages**

- AR models assume a linear relationship between past and present observations. This assumption may not hold for all time series data, particularly for non-linear relationships or complex patterns.
- AR models assume that the underlying time series is stationary, meaning that its statistical properties do not change over time. However, many real-world time series exhibit non-stationary behavior, which can limit the applicability of AR models.
- Choosing the appropriate order (AR lag (p)) for the AR model can be challenging and may require iterative experimentation or statistical diagnostics. Selecting an incorrect order can lead to poor model performance and inaccurate forecasts.
- AR models may struggle to provide accurate forecasts for long-term horizons, especially when the underlying data is highly volatile or subject to structural changes.

**c. ARMA Model**

ARMA (AutoRegressive Moving Average) is a time series forecasting method that combines both AutoRegressive and Moving Average components. It builds a statistical model by using past values (from the AutoRegressive component) and past residuals or errors (from the Moving Average component) to predict future values. This model is particularly useful for capturing short-term dependencies and patterns in stationary time series data.

- Autoregressive (AR) Component: In the autoregressive component, the current value of the time series is modelled as a linear combination of its previous values. The "auto" in autoregressive signifies that the current value is regressed on its own past values. Higher values of p capture more complex dependencies.
- Moving Average (MA) Component: In the moving average component, the current value of the time series is modelled as a linear combination of past forecast errors. The moving average parameters determine the weights assigned to past forecast errors in predicting the current value. Similar to autoregressive models, higher values of q capture more complex dependencies.

ARMA models combine both autoregressive and moving average components to capture the temporal dependencies and random fluctuations present in the data. An ARMA(p, q) model is represented as the sum of the AR(p) and MA(q) components.

**Advantages**

- ARMA models can capture a wide range of patterns and dynamics present in time series data, making them versatile for various applications.
- The parameters of ARMA models have clear interpretations, allowing analysts to understand the underlying relationships between past observations and future predictions.
- ARMA models are robust to outliers and can handle noisy data well, making them suitable for real-world datasets with irregularities.
- ARMA models are computationally efficient, particularly for small to moderate-sized datasets, allowing for quick model estimation and forecasting.

**Disadvantages**

- ARMA models assume that the underlying time series is stationary, meaning that its statistical properties do not change over time. However, many real-world time series exhibit non-stationary behavior, which can limit the applicability of ARMA models.
- ARMA models may struggle to provide accurate forecasts for long-term horizons, especially when the underlying data is highly volatile or subject to structural changes.
- Choosing the appropriate order (p and q) for the AR and MA components of an ARMA model can be challenging and may require iterative experimentation or statistical diagnostics.
- The performance of ARMA models can be sensitive to the initial parameter estimates, leading to potential convergence issues or suboptimal solutions, particularly in high-dimensional parameter spaces.
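As an illustration of how the p and q orders map onto code, here is a minimal sketch using the statsmodels library (not part of QuickML; the synthetic series and the order (2, 0, 1) are arbitrary choices for demonstration). In statsmodels, an ARMA(p, q) model is fitted as an ARIMA model with differencing order d = 0, so (p, 0, 0) gives a pure AR model and (0, 0, q) a pure MA model.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stationary series: a smoothed white-noise process.
rng = np.random.default_rng(42)
y = pd.Series(rng.normal(size=200)).rolling(3).mean().dropna()

# ARMA(2, 1) == ARIMA with p=2, d=0, q=1.
model = ARIMA(y, order=(2, 0, 1))
fitted = model.fit()

print(fitted.summary())          # estimated AR and MA coefficients
print(fitted.forecast(steps=5))  # forecast the next five points
```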
**d. ARIMA Model**

The ARIMA (AutoRegressive Integrated Moving Average) model is a popular statistical method for time series forecasting, designed to handle non-stationary data by incorporating a differencing component in addition to the AR and MA components of the ARMA model. It is a versatile and powerful method, as it can be applied to both stationary and non-stationary time series data. Here's a breakdown of its components:

- **AutoRegressive (AR) term**: Represents the relationship between an observation and a certain number of lagged observations (previous time steps). In an AR(p) model, the value of the series at time 't' depends linearly on the values at times 't-1', 't-2', ..., 't-p'.
- **Integrated (I) term**: Refers to the differencing of raw observations to make the time series stationary. Stationarity implies that the statistical properties of a time series, such as mean and variance, do not change over time. The order of differencing, denoted as 'd', indicates how many differences are required to achieve stationarity.
- **Moving Average (MA) term**: Accounts for the relationship between an observation and a residual error from a moving average model applied to lagged observations. In an MA(q) model, the value of the series at time 't' depends linearly on the error terms at times 't-1', 't-2', ..., 't-q'.

The ARIMA model is denoted as ARIMA(p, d, q), where:

- 'p' is the order of the autoregressive part.
- 'd' is the degree of differencing.
- 'q' is the order of the moving average part.

The ARIMA model makes predictions based on the linear combination of past observations, differencing to stabilise the series, and an error term that captures unexpected fluctuations not explained by the model.

**Auto ARIMA**

Auto ARIMA automatically searches through a range of possible ARIMA models, including different combinations of autoregressive (AR), integrated (I), and moving average (MA) components, to identify the model that best fits the data. It evaluates each model based on statistical criteria such as AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) to determine the optimal model.

**Advantages**

- ARIMA can handle a wide range of time series data, including economic, financial, and social data, making it applicable in various fields.
- The model parameters (p, d, q) can provide insights into the underlying dynamics of the time series, such as the lag effect and the impact of differencing.
- ARIMA is based on solid statistical principles, making it a reliable method for time series analysis.
- ARIMA can generate forecasts for future time periods, providing valuable insights for decision-making and planning.
- ARIMA models do not require additional external factors or covariates, making them relatively straightforward to implement and interpret.

**Disadvantages**

- ARIMA assumes that the time series is stationary or can be made stationary through differencing. In practice, achieving stationarity can be challenging for some datasets.
- Traditional ARIMA models are not well-suited for capturing seasonal patterns in the data. Seasonal ARIMA (SARIMA) or other methods are needed for seasonal data.
- They are linear and may not effectively capture complex non-linear relationships present in some time series data.
- ARIMA models require a sufficient amount of historical data to estimate the model parameters accurately. In cases of short or sparse data, ARIMA may not perform well.
- ARIMA assumes that observations are independent of each other, which may not hold true for all time series data, particularly in cases of autocorrelation or serial correlation.

**e. SARIMA Model**

Seasonal Autoregressive Integrated Moving Average (SARIMA) is an extension of the ARIMA model that incorporates seasonality into the analysis and forecasting of time series data. SARIMA models are particularly useful for data that exhibit both non-seasonal and seasonal patterns:

- **Seasonal Differencing**: SARIMA involves differencing the time series not only to remove trends but also to remove seasonal patterns. This is done by subtracting the observation at time 't' from the observation at time 't-s', where 's' represents the seasonal period.
- **Seasonal Autoregressive (SAR) Term**: The seasonal AR (lag) component accounts for the relationship between the current observation and past observations at seasonal intervals. It captures the seasonal patterns in the data.
- **Seasonal Moving Average (SMA) Term**: The seasonal MA (lag) component models the dependence between the current observation and past forecast errors at seasonal intervals.
- **Integration**: Like in ARIMA, integration is used to make the time series stationary by differencing. The order of integration (denoted by 'd') represents the number of non-seasonal differences needed to achieve stationarity.
- **Autoregressive (AR) and Moving Average (MA) Terms**: These components, similar to ARIMA, capture the non-seasonal dynamics of the time series.

**Advantages**

- SARIMA explicitly models seasonal patterns in the data, allowing for more accurate forecasts of time series with periodic fluctuations, such as monthly sales data or quarterly economic indicators.
- SARIMA models can handle a wide range of seasonal patterns, including multiplicative and additive seasonality, as well as irregular seasonal patterns.
- Like ARIMA, SARIMA models provide interpretable parameters (e.g., AR, MA, seasonal AR, seasonal MA) that can offer insights into the underlying dynamics of the time series.
- SARIMA models can be robust to changes in the underlying data patterns, provided that the appropriate seasonal period and model parameters are selected.

**Disadvantages**

- SARIMA models are more complex than non-seasonal ARIMA models, requiring additional parameters to capture seasonal dynamics. This complexity can make model estimation and interpretation more challenging.
- Selecting the appropriate seasonal period and determining the orders of the AR, MA, seasonal AR, and seasonal MA terms can be difficult and may require extensive model diagnostics and testing.
- Estimating SARIMA models, especially for large datasets or models with many parameters, can be computationally intensive and time-consuming.
- SARIMA assumes that the time series is stationary after differencing. Ensuring stationarity may require careful examination of the data and iterative model fitting.

Despite these challenges, SARIMA remains a powerful tool for time series forecasting of data that exhibit both non-seasonal and seasonal patterns. With careful model specification and parameter selection, SARIMA models can provide accurate forecasts and valuable insights into time series data.
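The sketch below shows how the non-seasonal (p, d, q) and seasonal (P, D, Q, s) orders translate into code using statsmodels' SARIMAX class, which covers plain ARIMA (by leaving the seasonal order at zero) as well as SARIMA. The orders used here are arbitrary illustrations; an Auto-ARIMA-style search, as described above, would select them automatically by minimizing a criterion such as AIC.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly series with trend and a yearly (s=12) seasonal pattern.
rng = np.random.default_rng(0)
t = np.arange(120)
y = pd.Series(10 + 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, 120))

# SARIMA(1,1,1)(1,1,1,12); with seasonal_order=(0,0,0,0) this reduces to ARIMA.
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fitted = model.fit(disp=False)

print(fitted.aic)                 # the criterion an automated search would minimize
print(fitted.forecast(steps=12))  # forecast one full season ahead
```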
**f. Exponential Smoothing Model**

Exponential smoothing is a time series forecasting method for univariate data that can be extended to support data with a systematic trend or seasonal component. Like ARIMA, it makes predictions based on past observations, but with a key difference in how it weighs those observations:

- **Exponentially Decreasing Weights**
  - Unlike ARIMA, where weights can be arbitrary, Exponential Smoothing assigns exponentially decreasing weights to past observations. This means recent observations have a much greater influence on the forecast than older ones.
  - Example: If you’re predicting today’s sales, yesterday’s sales will have a bigger impact on the forecast than sales from a week ago.
- **Types of Exponential Smoothing**
  - **Simple Exponential Smoothing (SES)**: Suitable for data without a trend or seasonality. It only uses past observations.
  - **Holt’s Linear Trend Model**: Extends SES to capture trends in the data.
  - **Holt-Winters Seasonal Model**: Further extends the method to capture seasonality (repeating patterns like monthly sales peaks).

**Advantages**

- Handles different patterns: it can be adapted to different types of time series data, including those with:
  - No trend (Simple Exponential Smoothing)
  - Linear trends (Holt’s Linear Trend Model), and
  - Seasonal patterns (Holt-Winters Seasonal Model)
- Requires relatively little historical data compared to more complex forecasting methods like ARIMA or machine learning models.
- Responsive to changes: because recent observations are given more weight, the method can quickly adapt to changes or shifts in the data.

**Disadvantages**

- Exponential smoothing may struggle to capture complex relationships or irregular fluctuations present in the data.
- The performance of exponential smoothing models can be sensitive to the choice of smoothing parameters, such as the smoothing factor (alpha).
- Exponential smoothing assumes that the underlying time series is stationary, meaning that its statistical properties remain constant over time. However, many real-world time series exhibit non-stationary behavior.
- Exponential smoothing models are sensitive to outliers or extreme values in the data.

**g. Holt-Winters' Seasonal Model**

Holt-Winters' method, also known as triple exponential smoothing, is a widely used technique for forecasting time series data, especially when dealing with data that exhibit trend and seasonality. It extends simple exponential smoothing to handle these components more effectively.

- **Level Component (lt)**: Represents the average value of the series over time. It is updated at each time step based on the observed value and the previous level. The updated level at time 't' is a combination of the observed value 'yt', the previous level 'lt-1', and the previous trend 'bt-1'. It is calculated using a smoothing parameter alpha.
- **Trend Component (bt)**: Captures the direction and rate of change in the series over time. It is updated to reflect the trend observed in recent data. The updated trend at time 't' is a combination of the difference between the current level and the previous level, and the previous trend. It is calculated using a smoothing parameter beta.
- **Seasonal Component (st)**: Accounts for seasonal variations or patterns that repeat at fixed intervals (e.g., daily, weekly, monthly). It is updated to reflect the seasonal behaviour observed in the data. The updated seasonal component at time 't' is a combination of the observed value and the corresponding seasonal component observed at the same time in previous seasons. It is calculated using a smoothing parameter gamma.

There are two main types of Holt-Winters models:

**Additive model**: The additive model is used when the seasonal variations are roughly constant through the series. It is suitable when the magnitude of the seasonal effect does not depend on the level of the time series.

```
Level:    Lt = α(Yt − St−m) + (1 − α)(Lt−1 + Tt−1)
Trend:    Tt = β(Lt − Lt−1) + (1 − β)Tt−1
Season:   St = γ(Yt − Lt) + (1 − γ)St−m
Forecast: Yt+h = Lt + h·Tt + St−m+h
```

**Multiplicative model**: The multiplicative model is used when the seasonal variations change proportionally with the level of the series. It is suitable when the magnitude of the seasonal effect varies with the level of the time series.

```
Level:    Lt = α(Yt / St−m) + (1 − α)(Lt−1 + Tt−1)
Trend:    Tt = β(Lt − Lt−1) + (1 − β)Tt−1
Season:   St = γ(Yt / Lt) + (1 − γ)St−m
Forecast: Yt+h = (Lt + h·Tt) · St−m+h
```

**Advantages**

- It's specifically designed to capture and forecast time series data with both trend and seasonality, making it suitable for a wide range of real-world applications.
- The method adapts to changes in the underlying data patterns over time, making it robust in dynamic environments where the data may exhibit evolving trends or seasonal patterns.
- The resulting forecasts can be easily interpreted, as they are based on the level, trend, and seasonal components, providing insights into the future behavior of the time series.
- While the method involves multiple components and parameters, it's relatively straightforward to implement compared to more complex forecasting techniques.

**Disadvantages**

- Selecting appropriate values for the smoothing parameters alpha, beta, and gamma can be challenging and may require expertise or extensive experimentation, especially for datasets with varying characteristics.
- The method is sensitive to outliers or sudden changes in the data, which can impact the accuracy of forecasts, particularly if these anomalies are not appropriately addressed.
- Like other exponential smoothing methods, Holt-Winters' method assumes linear relationships between the components, which may not adequately capture complex non-linear patterns present in some time series data.
- Estimating and updating the components of the Holt-Winters' method for large datasets or high-frequency data can be computationally intensive, especially if implemented without optimization techniques.
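Here is a minimal sketch of these variants using statsmodels (an illustration, not QuickML's internals); the synthetic monthly series, the additive trend/seasonal settings, and seasonal_periods=12 are assumptions for demonstration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

# Synthetic monthly series with trend and yearly seasonality.
rng = np.random.default_rng(1)
t = np.arange(96)
y = pd.Series(20 + 0.1 * t + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.4, 96))

# Simple Exponential Smoothing: no trend, no seasonality.
ses = SimpleExpSmoothing(y).fit()

# Holt-Winters additive model: level + trend + seasonal components.
hw = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()

print(ses.forecast(6))  # flat forecast from the last smoothed level
print(hw.forecast(6))   # forecast that extends the trend and season
```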
#### 2. Multivariate Forecasting Algorithms

**a. Vector Auto Regressor (VAR)**

The Vector Auto Regression (VAR) model is a multivariate time series algorithm used to capture the linear interdependencies among multiple time series. It generalizes the univariate autoregressive (AR) model to multivariate time series data. Each variable in a VAR model is a linear function of past lags of itself and past lags of all the other variables in the system.

**Advantages**

- VAR models are well-suited for capturing the dynamic relationships between multiple time series without requiring the specification of dependent and independent variables. This makes them flexible in modelling complex interdependencies among variables.
- All variables in a VAR model are treated as endogenous, meaning there is no need to categorise them as dependent or independent. This symmetry allows for a more comprehensive understanding of how the variables interact with each other over time.
- VAR models explicitly account for the lagged effects of each variable on itself and others, allowing for a detailed analysis of temporal dependencies.
- The VAR framework allows for Granger causality testing, which can identify whether one time series can predict another, providing insights into causal relationships between variables.
- VAR models facilitate impulse response analysis, which helps to understand the effect of a shock to one variable on the other variables in the system over time. This is particularly useful in policy analysis and economic forecasting.

**Disadvantages**

- VAR models require the estimation of a large number of parameters, especially when dealing with many variables and lags. This can lead to overfitting, making the model sensitive to noise and reducing its generalizability.
- The large number of parameters to estimate means that VAR models require a significant amount of data to achieve stable and reliable results. This can be a limitation when working with short time series or sparse data.
- Although VAR models are powerful, interpreting the relationships between variables can be challenging, especially when the model includes many variables and lags. The model’s output can be complex and may require advanced statistical knowledge to interpret correctly.
- VAR models assume linear relationships between variables. In reality, relationships between time series variables can be non-linear, which may limit the model's effectiveness in capturing the true dynamics.
- The choice of lag length p is crucial in VAR modelling. Too few lags can lead to model misspecification, while too many can cause overfitting. Selecting the optimal lag length is not always straightforward and often requires careful testing and validation.
- Unlike Structural VAR (SVAR) models, standard VAR models do not provide a structural interpretation of the relationships between variables. This can be a limitation in economic and policy analysis, where understanding the underlying mechanisms is important.
- When the variables in the VAR model are highly correlated, multicollinearity can become a problem, leading to unreliable estimates of the coefficients. This can make it difficult to discern the true impact of each variable on others.
- As the number of variables and lags increases, the computational burden of estimating the model parameters grows. This can make VAR models computationally intensive, especially for large datasets.
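To show how a VAR model consumes multiple series at once, here is a brief statsmodels sketch (illustrative only; the two synthetic columns and maxlags=5 are assumptions). The lag order is chosen by information criterion, echoing the lag-selection caveat above.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Two interdependent synthetic series: ad_spend depends on lagged sales.
rng = np.random.default_rng(7)
n = 200
x = np.zeros((n, 2))
for t in range(1, n):
    x[t, 0] = 0.6 * x[t - 1, 0] + rng.normal(scale=0.5)
    x[t, 1] = 0.3 * x[t - 1, 0] + 0.4 * x[t - 1, 1] + rng.normal(scale=0.5)
data = pd.DataFrame(x, columns=["sales", "ad_spend"])

# Fit a VAR, letting AIC pick the lag order up to maxlags.
results = VAR(data).fit(maxlags=5, ic="aic")
print(results.k_ar)  # selected lag order p

# Forecasting needs the last p observations as the starting window.
print(results.forecast(data.values[-results.k_ar:], steps=3))
```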
#### Cross Validation

Cross-validation in time series is a technique used to evaluate the performance of a model on a time series dataset. Unlike the typical k-fold cross-validation used in standard machine learning, time series data has a temporal order that must be preserved. Therefore, specific methods are used to handle this temporal dependency.

**i. Rolling Time Series Split**

- This method involves splitting the time series data into training and validation sets multiple times, in such a way that the training set always precedes the validation set.
- It simulates the way new data becomes available over time and ensures that the model is always tested on future data, relative to the training set.
- Refer to the image below to understand the data split. <br />

**ii. Blocking Time Series Split**

- Involves splitting the data into contiguous blocks or folds, while ensuring that there is no overlap between the training and testing sets and maintaining the temporal order of the data.
- Prevents the model from seeing future data during training, which would otherwise lead to overfitting.
- Ensures that the training set always precedes the validation set, respecting the time order. <br />

**iii. Day Forward Chaining**

- Day Forward-Chaining is based on a method called forward-chaining and rolling-origin-recalibration evaluation. Using this method, we successively consider each day as the test set and assign all previous data to the training set.
- This method produces many different train/test splits. The error on each split is averaged in order to compute a robust estimate of the model error. <br />

**iv. Regular Train Test Split - Default**

In a standard train-test split scenario, a time series dataset is split into two subsets: a training set and a test set. The training set, which contains historical data, is used to train the model, which is then tested on the test set, containing future unseen data. This allows for evaluating the model's performance and its generalization capabilities by testing its ability to predict future values based on past data. Visual representation of the train-test split of the data is as follows: <br />
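For reference, scikit-learn's TimeSeriesSplit implements the rolling / forward-chaining idea described above, always training on a past prefix and validating on the block after it. A minimal sketch (the data and n_splits=5 are placeholders; a blocked split would need a custom splitter):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations

# Each successive fold trains on a longer prefix and tests on the block after it.
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train={train_idx.min()}..{train_idx.max()}, "
          f"test={test_idx.min()}..{test_idx.max()}")
```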
### Build a pipeline

QuickML uses the smart mode pipeline builder to create time series models. The Smart Builder provides a prebuilt template for time series models, designed to simplify the model development process from data preprocessing to model selection. With these prebuilt templates, operations are predefined, and users are presented with various parameters to configure each stage. This template removes the ambiguity of choosing which stage to use when building a time series model and streamlines the model-building process.

### Visualizations

**Decomposition Chart**

A decomposition chart breaks down time series data into its key components: trend, seasonal, and residual (or noise). It helps validate the presence of these components and understand their contributions to the overall pattern of the data. Decomposition is particularly useful for analyzing time series data recorded at regular intervals over different periods. In QuickML, the decomposition chart uses the additive technique, where the original time series is represented as the sum of its components:

```Original Series = Trend + Seasonality + Residual``` <br />

This technique assumes that the magnitude of the seasonal and residual variations remains constant across the series.
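The additive decomposition shown above corresponds to what statsmodels' seasonal_decompose produces. The sketch below is illustrative only (the synthetic monthly series and period=12 are assumptions), not how QuickML renders its chart.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: trend + yearly seasonality + noise.
rng = np.random.default_rng(3)
t = np.arange(72)
series = pd.Series(
    50 + 0.2 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 72),
    index=pd.date_range("2020-01-01", periods=72, freq="MS"),
)

# Additive decomposition: series = trend + seasonal + residual.
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))  # one full seasonal cycle
print(result.resid.dropna().head())
```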
-------------------------------------------------------------------------------- title: "Text Analytics" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.613Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/ml-algorithms/text-analytics/" service: "QuickML" --------------------------------------------------------------------------------

# Text Analytics Algorithms

Text data is fed to algorithms in a vectorized form to generate an NLP model. NLP models can be broadly classified into supervised and unsupervised learning models. In QuickML, we have algorithms that use labelled data to build supervised learning models. The algorithms include:

- Naive Bayes
- Support vector machine (SVM)

### Naive Bayes

A classification algorithm based on Bayes' theorem, with the naive assumption of conditional independence between every pair of features considered. Bayes' theorem calculates the probability P(c|x), where c is a class among the possible target labels and x is the given instance to be classified, represented by a set of features.

<b>P(c|x) = P(x|c) * P(c) / P(x)</b>

<b>Hyperparameters:</b>

<u>Parameters:</u>

1. **priors: array-like of shape (n_classes,), default=None** Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
2. **var_smoothing: float, default=1e-9** Portion of the largest variance of all features that is added to variances for calculation stability.

<u>Attributes:</u>

1. **class_count_: ndarray of shape (n_classes,)** Number of training samples observed in each class.
2. **class_prior_: ndarray of shape (n_classes,)** Probability of each class.
3. **classes_: ndarray of shape (n_classes,)** Class labels known to the classifier.
4. **epsilon_: float** Absolute additive value to variances.
5. **n_features_in_: int** Number of features seen during fit.
6. **feature_names_in_: ndarray of shape (n_features_in_,)** Names of features seen during fit. Defined only when X has feature names that are all strings.
7. **var_: ndarray of shape (n_classes, n_features)** Variance of each feature per class.
8. **theta_: ndarray of shape (n_classes, n_features)** Mean of each feature per class.
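The parameters and attributes listed above match scikit-learn's GaussianNB. Here is a hedged sketch of vectorizing a handful of texts and fitting the classifier; the toy corpus and the TF-IDF choice are illustrative assumptions, since QuickML handles vectorization internally.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

texts = ["great product, loved it", "terrible, broke in a day",
         "works as expected", "awful quality, do not buy"]
labels = ["positive", "negative", "positive", "negative"]

# GaussianNB expects dense input, so convert the sparse TF-IDF matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()

clf = GaussianNB(var_smoothing=1e-9)  # default smoothing, shown explicitly
clf.fit(X, labels)

print(clf.classes_)       # class labels known to the classifier
print(clf.class_count_)   # training samples observed per class
print(clf.predict(vectorizer.transform(["loved the quality"]).toarray()))
```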
### Support vector machine (SVM)

SVM is another popular classification machine learning algorithm, which classifies data by determining the best hyperplane (decision boundary) between classes.

#### Hyperparameters:

1. **C: float, default=1.0** Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
2. **kernel: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable, default=‘rbf’** Specifies the kernel type to be used in the algorithm. If none is given, ‘rbf’ will be used. If a callable is given, it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
3. **degree: int, default=3** Degree of the polynomial kernel function (‘poly’). Must be non-negative. Ignored by all other kernels.
4. **gamma: {‘scale’, ‘auto’} or float, default=‘scale’** Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. - if gamma=‘scale’ (default) is passed, it uses 1 / (n_features * X.var()) as the value of gamma, - if ‘auto’, it uses 1 / n_features, - if float, it must be non-negative.
5. **coef0: float, default=0.0** Independent term in the kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
6. **shrinking: bool, default=True** Whether to use the shrinking heuristic.
7. **probability: bool, default=False** Whether to enable probability estimates. This must be enabled prior to calling fit; it will slow down that method, as it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict.
8. **tol: float, default=1e-3** Tolerance for the stopping criterion.
9. **cache_size: float, default=200** Specifies the size of the kernel cache (in MB).
10. **class_weight: dict or ‘balanced’, default=None** Sets the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
11. **verbose: bool, default=False** Enables verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.
12. **max_iter: int, default=-1** Hard limit on iterations within the solver, or -1 for no limit.
13. **decision_function_shape: {‘ovo’, ‘ovr’}, default=‘ovr’** Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes), as all other classifiers do, or the original one-vs-one (‘ovo’) decision function of libsvm, which has shape (n_samples, n_classes * (n_classes - 1) / 2). Note that internally, one-vs-one (‘ovo’) is always used as the multi-class strategy to train models; an ovr matrix is only constructed from the ovo matrix. The parameter is ignored for binary classification.
14. **break_ties: bool, default=False** If true, decision_function_shape=‘ovr’, and the number of classes > 2, predict will break ties according to the confidence values of decision_function; otherwise the first class among the tied classes is returned. Please note that breaking ties comes at a relatively high computational cost compared to a simple predict.
15. **random_state: int, RandomState instance or None, default=None** Controls the pseudo-random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls.
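These options map directly onto scikit-learn's SVC. A small sketch with an illustrative support-ticket corpus (the texts, labels, and chosen C/kernel values are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

texts = ["refund my order now", "how do I reset my password",
         "charge appeared twice", "cannot log in to my account"]
labels = ["billing", "account", "billing", "account"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # SVC accepts sparse matrices directly

clf = SVC(C=1.0, kernel="rbf", gamma="scale")
clf.fit(X, labels)

query = vectorizer.transform(["I was billed twice this month"])
print(clf.predict(query))            # predicted class
print(clf.decision_function(query))  # signed distance from the hyperplane
```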
-------------------------------------------------------------------------------- title: "Clustering" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.614Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/ml-algorithms/clustering/" service: "QuickML" --------------------------------------------------------------------------------

# Clustering Algorithms

Clustering is an unsupervised machine learning technique that focuses on grouping similar data points together based on their characteristics or patterns. Instead of relying on predefined labels, clustering uncovers natural structures in the data, making it especially useful for exploratory analysis and segmentation. Let’s dive deeper into how clustering works and explore the different types of clustering algorithms that can be used to build effective clustering models. Below are the clustering algorithms supported in Catalyst QuickML, along with their {{%link href="https://docs.catalyst.zoho.com/en/quickml/help/learning-center/clustering/#introduction" %}}Clustering type,{{%/link%}} explanation, real-world use cases, and key parameters.

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} Clustering is available in {{%bold class="bold-primary"%}}early access{{%/bold%}} across all the data centers. To use this, request access via support@zohocatalyst.com {{%/note%}}

# Centroid-Based Algorithms

### K-Means

**Explanation:** K-Means partitions data into k clusters by minimizing the variance within each cluster, measured as the sum of squared Euclidean distances between data points and their cluster centroids. It works by iteratively performing two steps: (1) assigning each data point to the nearest centroid (hard assignment), and (2) recalculating centroids as the mean of all points in the cluster. This process repeats until assignments no longer change or a maximum number of iterations is reached.

**Mathematical formula:** {{%code%}} min(C1, C2, ..., Ck) Σ(i=1 to k) Σ(x ∈ Ci) ||x − μi||² {{%/code%}}

Where:

- Ci = cluster i
- μi = centroid (mean) of cluster i
- ||x − μi||² = squared Euclidean distance between data point x and centroid μi

**Use case:** Retail companies use K-Means to segment customers into groups like “budget,” “regular,” and “premium” based on spending habits.

**Key parameters:**

{{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} In the parameter ranges provided, square brackets [ ] indicate that the boundary value is included, while parentheses ( ) indicate that the boundary value is excluded. For example, [1, ∞) means the range starts from 1 (included) and extends to infinity (excluded), whereas (0, 1] means 0 is excluded but 1 is included. {{%/note%}}

<table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td style="text-align:left;">n_clusters</td> <td style="text-align:left;"> Defines how many clusters the algorithm should create and, accordingly, how many centroids will be generated. Choosing the right value is critical, as too few clusters may oversimplify the data while too many may lead to overfitting. </td> <td style="text-align:left;">int</td> <td style="text-align:left;">[1, n_rows]</td> <td style="text-align:left;">8</td> </tr> <tr> <td style="text-align:left;">init</td> <td style="text-align:left;"> Specifies the method used to initialize the centroids before the clustering iterations begin. </td> <td style="text-align:left;">string</td> <td style="text-align:left;">{'k-means++', 'random'}</td> <td style="text-align:left;">'k-means++'</td> </tr> <tr> <td style="text-align:left;">max_iter</td> <td style="text-align:left;"> Determines the maximum number of iterations the K-Means algorithm will perform in a single run. </td> <td style="text-align:left;">int</td> <td style="text-align:left;">[1, ∞)</td> <td style="text-align:left;">300</td> </tr> <tr> <td style="text-align:left;">algorithm</td> <td style="text-align:left;"> Indicates the specific variant of the K-Means algorithm to use. </td> <td style="text-align:left;">string</td> <td style="text-align:left;">{'lloyd', 'elkan'}</td> <td style="text-align:left;">'lloyd'</td> </tr> </tbody> </table>
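A short scikit-learn sketch using the parameters from the table above (the toy data and n_clusters=3 are illustrative; algorithm='lloyd' assumes a recent scikit-learn version):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D spending data: three loose groups.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0, 3, 6)])

kmeans = KMeans(n_clusters=3, init="k-means++", max_iter=300,
                algorithm="lloyd", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.inertia_)          # the within-cluster sum of squares being minimized
```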
### MiniBatchKMeans

**Explanation:** MiniBatchKMeans is a faster and more memory-efficient version of K-Means that processes small random subsets (mini-batches) of the data at each iteration instead of the entire dataset. By using these mini-batches to update centroids, it converges much faster while giving results close to standard K-Means. This makes it especially useful for large datasets or streaming data, where running full K-Means would be too slow or resource-intensive.

**Mathematical formula:** {{%code%}} min(C1, C2, ..., Ck) Σ(i=1 to k) Σ(x ∈ Bt ∩ Ci) ||x − μi||² {{%/code%}}

Where Bt is a random mini-batch at iteration t.

**Use case:** E-commerce platforms with millions of users cluster browsing behavior in real time for product recommendations.

**Key parameters:**

<table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_clusters</td> <td>Defines the number of clusters to be formed as well as the number of centroids to be generated.</td> <td>int</td> <td>[1, n_rows]</td> <td>8</td> </tr> <tr> <td>init</td> <td>Specifies the method used to initialize the centroids.</td> <td>string</td> <td>{'k-means++', 'random'}</td> <td>'k-means++'</td> </tr> <tr> <td>max_iter</td> <td>Determines the maximum number of iterations the algorithm will perform in a single run.</td> <td>int</td> <td>[1, ∞)</td> <td>300</td> </tr> <tr> <td>batch_size</td> <td>Size of the mini batches.</td> <td>int</td> <td>[1, ∞)</td> <td>1024</td> </tr> </tbody> </table>

### Fuzzy C-Means

**Explanation:** Fuzzy C-Means is similar to K-Means but allows soft assignments, meaning each data point can belong to multiple clusters with different degrees of membership. Instead of assigning each point to a single cluster, Fuzzy C-Means calculates a membership probability for every cluster based on the point’s distance from each centroid. This is particularly useful for data where clusters overlap, as it provides a more nuanced view of membership rather than forcing a hard choice.

**Mathematical formula:** {{%code%}} J = Σ(i=1 to k) Σ(j=1 to n) uij^m ||xj − ci||² {{%/code%}}

Where:

- uij = degree of membership of data point xj in cluster i
- m = fuzziness parameter (m>1)
- ci = centroid of cluster i

**Use case:** Music apps group users whose tastes overlap across multiple genres (pop, rock, electronic).

**Key parameter:**

<table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_clusters</td> <td>Defines the number of centers that should be initialized.</td> <td>int</td> <td>[1, n_rows]</td> <td>8</td> </tr> </tbody> </table>

### K-Medians

**Explanation:** K-Medians works like K-Means but uses the median instead of the mean to compute cluster centers. This small change makes the algorithm more robust to outliers, since medians are less influenced by extreme values. The algorithm still alternates between assigning points to the nearest cluster center and recalculating centers, but by using medians, it produces clusters that better represent the typical data point in skewed or heavy-tailed distributions.

**Mathematical formula:** {{%code%}} min(C1, ..., Ck) Σ(i=1 to k) Σ(x ∈ Ci) ||x − mi|| {{%/code%}}

Where mi is the median of all points in cluster i.

**Use case:** Healthcare organizations use K-Medians to group patients based on median treatment costs or hospital stay durations, ensuring clusters are not skewed by a few patients with exceptionally high bills or unusually long stays.
**Key parameters:** <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_clusters</td> <td>Defines the number of clusters to be allocated.</td> <td>int</td> <td>[1, n_rows]</td> <td>8</td> </tr> </tbody> </table> ### K-Modes  **Explanation:** K-Modes is designed for clustering categorical data. It replaces means with modes (the most frequent category value in each cluster) and uses a simple matching dissimilarity measure to assign points to clusters. This makes it well suited for scenarios where numeric distance does not make sense, such as clustering customers by preferences or product categories. The algorithm iteratively updates cluster modes until they no longer change, producing groups defined by the most common category values. **Mathematical formula:** {{%code%}} min(C1, ..., Ck) Σ(i=1 to k) Σ(x ∈ Ci) d(x, θi) {{%/code%}} Where d(x,θi) is the number of mismatches between categorical attributes of x and mode θi. **Use case:** Retailers cluster customers using categorical data like “prefers online/offline,” “electronics/clothing,” “discount/luxury.” **Key parameters:** <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_clusters</td> <td>Defines the number of clusters to be formed as well as the number of centroids to be generated.</td> <td>int</td> <td>[1, n_rows]</td> <td>8</td> </tr> <tr> <td>max_iter</td> <td>Determines the maximum number of iterations of the k-modes algorithm for a single run.</td> <td>int</td> <td>[1, ∞)</td> <td>100</td> </tr> <tr> <td>init</td> <td>Specifies the method used for initialization.</td> <td>string</td> <td>{'Huang', 'Cao', 'random'}</td> <td>'Cao'</td> </tr> </tbody> </table> ### Affinity Propagation   **Explanation:** Affinity Propagation clusters data by exchanging similarity messages between pairs of points until some points are identified as “exemplars,” which act as cluster centers. Unlike K-Means, you do not need to specify the number of clusters beforehand — the algorithm automatically determines how many clusters to form based on the input similarities and a “preference” parameter that controls how likely each point is to become an exemplar. This makes it especially useful when the natural number of clusters is unknown. **Use case:** In customer service, it can group incoming support queries into clusters without prior knowledge of how many categories exist. 
**Key parameters:** <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>damping</td> <td>Specifies the damping factor.</td> <td>float</td> <td>[0.5, 1.0)</td> <td>0.5</td> </tr> <tr> <td>max_iter</td> <td>Determines the maximum number of iterations.</td> <td>int</td> <td>[1, ∞)</td> <td>200</td> </tr> <tr> <td>convergence_iter</td> <td>Specifies the number of consecutive iterations during which the estimated number of clusters remains unchanged before the algorithm declares convergence.</td> <td>int</td> <td>[1, ∞)</td> <td>15</td> </tr> <tr> <td>affinity</td> <td>Determines the similarity measure used between data points.</td> <td>string</td> <td>{'precomputed', 'euclidean'}</td> <td>'euclidean'</td> </tr> </tbody> </table> ### Birch – Balanced Iterative Reducing and Clustering using Hierarchies  **Explanation:** BIRCH is designed for large or streaming datasets. It incrementally builds a CF (Clustering Feature) tree, which stores compact summaries of data points, and uses these summaries to perform clustering. The CF tree allows BIRCH to quickly group incoming data without having to store or process the full dataset at once. After building the tree, a final global clustering step (often using K-Means) can refine the clusters. This approach makes BIRCH extremely memory-efficient and scalable. **Use case:** In IoT applications, Birch can cluster continuous streams of sensor data (like smart city traffic monitoring) into meaningful groups in real time. **Key parameters:** <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>threshold</td> <td>Defines the maximum radius of a subcluster when merging with the closest existing subcluster.</td> <td>float</td> <td>(0, 1.0)</td> <td>0.5</td> </tr> <tr> <td>branching_factor</td> <td>Determines the maximum number of CF subclusters in each node.</td> <td>int</td> <td>[1, ∞)</td> <td>50</td> </tr> <tr> <td>n_clusters</td> <td>Determines the final number of clusters to form after the CF tree is built.</td> <td>int, None</td> <td>[1, ∞)</td> <td>3</td> </tr> </tbody> </table> ### K-Prototypes   **Explanation:** K-Prototypes extends K-Means and K-Modes to handle datasets containing both numerical and categorical variables. It calculates cluster centers using means for numerical features and modes for categorical features, combining the two using a weighting parameter that balances their influence. This makes it possible to cluster mixed-type data without extensive preprocessing or conversion. **Mathematical formula:** {{%code%}} d(x, y) = Σ(j ∈ num) (xj − yj)² + γ Σ(j ∈ cat) δ(xj, yj) {{%/code%}} Where: - δ(xj,yj)=0 if equal, else 1 - γ = weight balancing numeric and categorical distances **Use case:** In HR analytics, companies can group employees based on mixed data — numerical (e.g., years of experience, salary) and categorical (e.g., department, role) — to identify workforce patterns. 
**Key parameters:**

<table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_clusters</td> <td>Specifies the number of clusters to form and the number of centroids to generate.</td> <td>int</td> <td>[1, n_rows]</td> <td>8</td> </tr> <tr> <td>max_iter</td> <td>Determines the maximum number of iterations of the algorithm for a single run.</td> <td>int</td> <td>[1, ∞)</td> <td>100</td> </tr> <tr> <td>init</td> <td>Specifies the method for initialization.</td> <td>string</td> <td>{'Huang', 'Cao', 'random'}</td> <td>'Cao'</td> </tr> <tr> <td>categorical</td> <td>Indicates the list of categorical columns.</td> <td>list, None</td> <td>List of columns</td> <td>None</td> </tr> </tbody> </table>

# Density-Based Algorithms

### MeanShift

**Explanation:** MeanShift finds clusters by treating data points as samples from a probability density function and iteratively shifting them toward the nearest region of highest density (mode). Points that converge to the same mode form a cluster. This approach can automatically determine the number of clusters and works well with arbitrarily shaped clusters, but its performance depends on the choice of bandwidth (the kernel size used for density estimation).

**Mathematical formula:** {{%code%}} m(x) = Σ(xi ∈ N(x)) K(xi − x) xi / Σ(xi ∈ N(x)) K(xi − x) {{%/code%}}

Where:

- K is the kernel function (e.g., Gaussian)
- N(x) is the neighborhood around x

**Use case:** Image segmentation where regions like “sky,” “water,” and “land” form clusters of similar pixels.

**Key parameters:**

<table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>bin_seeding</td> <td>When set to True, the algorithm uses a binned version of the input data to initialize kernel locations instead of using every data point. This reduces the number of initial seeds, making the algorithm faster, especially on large datasets.</td> <td>bool</td> <td>{False, True}</td> <td>False</td> </tr> <tr> <td>cluster_all</td> <td>Controls how unassigned points (orphans) are handled. If True, all points are forced into the nearest cluster. If False, orphan points that do not fall within any cluster are labeled as -1 (noise).</td> <td>bool</td> <td>{False, True}</td> <td>True</td> </tr> <tr> <td>max_iter</td> <td>Specifies the maximum number of iterations allowed for each seed point. If a seed point has not converged by this limit, the process stops for that seed.</td> <td>int</td> <td>[1, ∞)</td> <td>300</td> </tr> </tbody> </table>

### DBSCAN

**Explanation:** DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together (dense regions) into clusters and labels points that lie alone in low-density regions as noise. It does not require specifying the number of clusters in advance, making it ideal for datasets with irregularly shaped clusters or when you want to identify outliers explicitly. Its performance depends heavily on the eps (neighborhood radius) and min_samples (minimum number of points in a dense region) parameters.

**Mathematical formula:** {{%code%}} |Nε(p)| ≥ minPts {{%/code%}}

Where p is a core point and Nε(p) = set of points within distance ε of p.

**Use case:** Banking systems cluster normal transaction patterns, while suspicious isolated ones are flagged as fraud.

**Key parameters:**

<table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>eps</td> <td>Defines the maximum distance between two points for them to be considered neighbors. This parameter strongly influences the size and density of clusters, making it one of the most critical settings in DBSCAN.</td> <td>float</td> <td>(0, ∞)</td> <td>0.5</td> </tr> <tr> <td>min_samples</td> <td>Specifies the minimum number of neighboring samples required (including the point itself) for a data point to be classified as a core point. It controls how dense a region must be to form a cluster.</td> <td>int</td> <td>[1, ∞)</td> <td>5</td> </tr> <tr> <td>metric</td> <td>Determines the distance measure used to calculate similarity between data points.</td> <td>string</td> <td>{'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'minkowski'}</td> <td>'euclidean'</td> </tr> <tr> <td>algorithm</td> <td>Indicates the nearest neighbor search method used to find pointwise distances efficiently.</td> <td>string</td> <td>{'auto', 'ball_tree', 'kd_tree', 'brute'}</td> <td>'auto'</td> </tr> </tbody> </table>
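A brief scikit-learn sketch of DBSCAN with the parameters above (toy data; the eps and min_samples values are illustrative and usually need tuning per dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points.
rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 0.2, size=(40, 2)),
               rng.normal(4, 0.2, size=(40, 2)),
               rng.uniform(-2, 6, size=(4, 2))])

db = DBSCAN(eps=0.5, min_samples=5, metric="euclidean").fit(X)

print(set(db.labels_))                     # cluster ids; -1 marks noise points
print((db.labels_ == -1).sum(), "points flagged as noise")
```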
Its performance depends heavily on the eps (neighborhood radius) and min_samples (minimum number of points in a dense region) parameters. **Mathematical formula:** {{%code%}} |Nε(p)| ≥ minPts {{%/code%}} Where p is a core point and Nε(p) = set of points within distance ε of p. **Use case:** Banking systems cluster normal transaction patterns, while suspicious isolated ones are flagged as fraud. **Key parameters:** <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>eps</td> <td>Defines the maximum distance between two points for them to be considered neighbors. This parameter strongly influences the size and density of clusters, making it one of the most critical settings in DBSCAN.</td> <td>float</td> <td>(0, ∞)</td> <td>0.5</td> </tr> <tr> <td>min_samples</td> <td>Specifies the minimum number of neighboring samples required (including the point itself) for a data point to be classified as a core point. It controls how dense a region must be to form a cluster.</td> <td>int</td> <td>[1, ∞)</td> <td>5</td> </tr> <tr> <td>metric</td> <td>Determines the distance measure used to calculate similarity between data points.</td> <td>string</td> <td>{'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'minkowski'}</td> <td>'euclidean'</td> </tr> <tr> <td>algorithm</td> <td>Indicates the nearest neighbor search method used to find pointwise distances efficiently.</td> <td>string</td> <td>{'auto', 'ball_tree', 'kd_tree', 'brute'}</td> <td>'auto'</td> </tr> </tbody> </table> # Medoid-Based Algorithms    ### CLARA   **Explanation:** CLARA (Clustering Large Applications) is a scalable version of the Partitioning Around Medoids (PAM) algorithm that works by clustering multiple small random samples of the data rather than the entire dataset. It then evaluates the clustering quality on the full dataset and selects the best set of medoids. This sampling approach makes CLARA feasible for very large datasets where running PAM directly would be computationally expensive. **Mathematical formula:** Runs K-Medoids on small random samples St of the data: {{%code%}} min(M) Σ(x ∈ St) d(x, m(x)) {{%/code%}} Then evaluates total cost on full dataset to pick the best sample. **Use case:** Telecom operators cluster millions of call records into categories like “international callers” using data samples instead of the entire dataset. 
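QuickML runs CLARA without any code; as an illustration of its sample-and-evaluate idea only, the sketch below uses scikit-learn-extra's KMedoids (with method='pam') as a stand-in for the PAM step. The dataset, sample size, and number of samples are hypothetical choices.

{{%code class="language-python"%}}# Sketch of CLARA's sample-and-evaluate loop (not QuickML's internals):
# cluster several small random samples with PAM, then keep the medoids
# that give the lowest total cost on the FULL dataset.
import numpy as np
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 4))  # stand-in for a large dataset

best_cost, best_medoids = np.inf, None
for _ in range(5):  # number of random samples to try
    sample = X[rng.choice(len(X), size=200, replace=False)]
    km = KMedoids(n_clusters=8, method="pam", max_iter=300).fit(sample)
    # Evaluate the sampled medoids against the full dataset.
    _, dists = pairwise_distances_argmin_min(X, km.cluster_centers_)
    cost = dists.sum()
    if cost < best_cost:
        best_cost, best_medoids = cost, km.cluster_centers_

print(best_cost)  # cost of the best medoid set found{{%/code%}}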
**Key parameters:** <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_clusters</td> <td>Determines the number of clusters to form as well as the number of medoids to generate.</td> <td>int</td> <td>[1, n_rows]</td> <td>8</td> </tr> <tr> <td>max_iter</td> <td>Specifies the maximum number of iterations when fitting.</td> <td>int</td> <td>[1, ∞)</td> <td>300</td> </tr> </tbody> </table> ### CLARANS    **Explanation:** CLARANS (Clustering Large Applications based on Randomized Search) improves on CLARA by using a randomized search technique to explore different possible medoid configurations. Instead of checking all possible swaps, it explores only a subset of neighbors at each step, which strikes a balance between efficiency and clustering quality. This makes CLARANS better at finding near-optimal solutions for large datasets without exhaustive computation. **Mathematical concept:** Iteratively searches for better medoids using randomized local swaps: {{%code%}} min(M) Σ(x ∈ D) d(x, m(x)) {{%/code%}} CLARANS explores random neighbor solutions rather than all possibilities, balancing speed and accuracy.   **Use case:** City transport authorities group bus routes by travel patterns to optimize scheduling. **Key parameters:** <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_clusters</td> <td>Determines the number of clusters to form as well as the number of medoids to generate.</td> <td>int</td> <td>[1, n_rows]</td> <td>8</td> </tr> <tr> <td>maxneighbor</td> <td>Specifies the maximum number of neighbors examined.</td> <td>int</td> <td>[1, ∞)</td> <td>5</td> </tr> </tbody> </table> ### K-Medoids   **Explanation:** K-Medoids is similar to K-Means, but instead of using the mean of data points as the cluster center, it uses an actual data point called the medoid. This makes it more robust to noise and outliers since medoids are less influenced by extreme values compared to centroids. K-Medoids is often used when datasets contain many outliers or when interpretability of clusters through actual data points is important. **Mathematical formula:** {{%code%}} min(M) Σ(i=1 to k) Σ(x ∈ Ci) d(x, mi) {{%/code%}} Where: - M={m1,...,mk} are medoids (actual data points) - d(x,mi) is a distance metric (e.g., Euclidean, Manhattan) **Use case:** In healthcare, K-Medoids can cluster patient records based on symptoms and test results, using real patient profiles as representative medoids. This ensures that clusters are not skewed by patients with unusually rare or extreme medical conditions. 
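For intuition, the parameters in the table below mirror scikit-learn-extra's KMedoids estimator; a minimal sketch on a toy dataset (QuickML itself requires no code):

{{%code class="language-python"%}}# Minimal illustrative sketch with scikit-learn-extra's KMedoids,
# whose parameters match the table below.
import numpy as np
from sklearn_extra.cluster import KMedoids

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMedoids(n_clusters=2, init="heuristic", metric="euclidean",
              method="alternate", max_iter=300).fit(X)
print(km.labels_)           # cluster assignment per point
print(km.cluster_centers_)  # medoids are actual rows of X{{%/code%}}

Because the centers are real data points, each cluster can be summarized by one representative record, which is what makes medoid-based clustering easy to interpret.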
**Key parameters:** <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_clusters</td> <td>Determines the number of clusters to form as well as the number of medoids to generate.</td> <td>int</td> <td>[1, n_rows]</td> <td>8</td> </tr> <tr> <td>init</td> <td>Specifies the medoid initialization method.</td> <td>string</td> <td>{'random', 'heuristic', 'k-medoids++', 'build'}</td> <td>'heuristic'</td> </tr> <tr> <td>metric</td> <td>Specifies the distance function used to measure similarity or dissimilarity between data points.</td> <td>string</td> <td>{'cityblock', 'cosine', 'euclidean', 'haversine', 'l2', 'l1', 'manhattan', 'nan_euclidean'}</td> <td>'euclidean'</td> </tr> <tr> <td>method</td> <td>Determines which algorithm to use. 'alternate' is faster while 'pam' is more accurate.</td> <td>string</td> <td>{'pam', 'alternate'}</td> <td>'alternate'</td> </tr> <tr> <td>max_iter</td> <td>Defines the maximum number of iterations the algorithm will run over the entire dataset. If the clustering process does not converge before reaching this limit, it stops and returns the current cluster assignments.</td> <td>int</td> <td>[1, ∞)</td> <td>300</td> </tr> </tbody> </table> # Model-Based Algorithms ### Gaussian Mixture Model (GMM) **Explanation:** GMM assumes that the data are generated from a mixture of several Gaussian (normal) distributions, each representing a cluster. Unlike K-Means, which assigns each point to exactly one cluster, GMM assigns each point a probability of belonging to each cluster (soft clustering). The model parameters (means, covariances, and mixing weights) are optimized using the Expectation-Maximization (EM) algorithm. This allows GMM to capture clusters with different shapes, sizes, and orientations, making it more flexible than K-Means when data are not well separated or have elliptical distributions. **Mathematical formula:** {{%code%}} p(x) = Σ(i=1 to k) πi 𝓝(x | μi, Σi) {{%/code%}} Where: - πi = weight (probability) of component i - μi = mean vector - Σi = covariance matrix - 𝓝(x∣μi,Σi) = multivariate Gaussian distribution **Use case:** In marketing analytics, GMM is used to cluster customers into overlapping groups based on behavior and spending patterns, where each customer can have a probability of belonging to multiple segments.
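The parameters in the table below mirror scikit-learn's GaussianMixture; a minimal sketch on synthetic data, purely for intuition (QuickML configures the same settings without code):

{{%code class="language-python"%}}# Illustrative sketch with scikit-learn's GaussianMixture; parameter
# names match the table below. The two "customer segments" are synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping, hypothetical customer segments in 2-D feature space
X = np.vstack([rng.normal(0, 1.0, (100, 2)),
               rng.normal(3, 1.5, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      max_iter=100, init_params="kmeans").fit(X)
# Soft clustering: each row gives P(segment | customer)
print(gmm.predict_proba(X[:3])){{%/code%}}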
**Key parameters:** <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:40%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:15%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_components</td> <td>Specifies the number of mixture components.</td> <td>int</td> <td>[1, ∞)</td> <td>2</td> </tr> <tr> <td>covariance_type</td> <td>Defines the shape of the clusters by specifying the structure of the covariance matrix.</td> <td>string</td> <td>{'full', 'tied', 'diag', 'spherical'}</td> <td>'full'</td> </tr> <tr> <td>max_iter</td> <td>Sets the maximum number of iterations for the Expectation-Maximization (EM) algorithm.</td> <td>int</td> <td>[1, ∞)</td> <td>100</td> </tr> <tr> <td>init_params</td> <td>Determines how the initial parameters (weights, means, and covariances) are set before the EM algorithm starts.</td> <td>string</td> <td>{'kmeans', 'k-means++', 'random', 'random_from_data'}</td> <td>'kmeans'</td> </tr> </tbody> </table> -------------------------------------------------------------------------------- title: "Anomaly Detection" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.614Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/ml-algorithms/anomaly-detection/" service: "QuickML" -------------------------------------------------------------------------------- # Anomaly Detection Algorithms Anomaly detection is an **unsupervised machine learning** technique that focuses on identifying data points, events, or observations that deviate significantly from the majority of the data. These outliers can represent critical incidents such as fraud, network intrusions, equipment failures, or data quality issues. Instead of predicting predefined labels, anomaly detection models learn normal behavior within the data and flag deviations that may require attention. {{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} The Anomaly Detection feature is available in {{%bold class="bold-primary"%}}early access{{%/bold%}} across all the data centers. To use it, request access via support@zohocatalyst.com. {{%/note%}} {{%note%}} To have a conceptual understanding of anomaly detection, kindly refer to the {{%link href="https://docs.catalyst.zoho.com/en/quickml/help/learning-center/anomaly-detection/" %}}Anomaly Detection Help Doc{{%/link%}}. {{%/note%}} Anomaly detection in Catalyst QuickML is broadly categorized into Time Series and Non-Time Series, each offering its own set of algorithms and configurable parameters. # Time-Series Anomaly Detection Algorithms Within Catalyst QuickML, anomaly detection for time series data focuses on identifying abnormal patterns or deviations in sequential data points over time. These methods account for temporal dependencies and seasonality, enabling the detection of sudden spikes, drops, or trend changes. Before exploring the individual time-series anomaly detection algorithms, it’s important to understand the common parameters used for **Anomaly Sensitivity Control**, described below. These parameters are consistent across all time-series anomaly detection algorithms in Catalyst QuickML and are essential for controlling how anomalies are detected and interpreted.
They allow users to fine-tune the threshold, window size, and anomaly scoring method, ensuring that the system accurately differentiates between normal fluctuations and true anomalous behavior. **Common parameters**: - **Threshold**: Determines how sensitive the model is to deviations from expected behavior. A lower threshold increases sensitivity, flagging smaller fluctuations as anomalies, while a higher threshold reduces false positives. - **Window size**: Defines the rolling window of data points used for analysis. Applicable up to a maximum of 24 time steps. - **Anomaly scoring method**: Currently supports Z-Score, which measures how far a point deviates from the mean in terms of standard deviations. Below are the time series anomaly detection algorithms supported in Catalyst QuickML, along with their explanations, real-world use cases, and key parameters. ### Auto Regressor    **Explanation**: The Auto Regressor (AR) model predicts future values based on a linear combination of past observations. It assumes that past behavior directly influences the present, with coefficients representing the relationship strength between past and current values. QuickML extends this with flexible regression backends, including Linear Regressor, Random Forest Regressor, AdaBoost Regressor, and Gradient Boosting Regressor, enabling both linear and non-linear temporal modeling. **Mathematical intuition**: {{%code%}} Yₜ = c + ∑ᵢ₌₁ᵖ φᵢ Yₜ₋ᵢ + εₜ {{%/code%}} Where: - Yt = current value - c = intercept (constant term) - ϕi = autoregressive coefficients - Yt−i = lagged observations - εt = random noise **Use case**: Used in financial forecasting, sensor drift detection, and predictive process monitoring, where recent history significantly affects the next outcome — e.g., predicting short-term stock trends or detecting abnormal fluctuations in temperature sensors. **Key parameters**: {{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} In parameter ranges, square brackets [ ] include boundaries, and parentheses ( ) exclude them. {{%/note%}}  <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:20%;">Parameter</th> <th style="text-align:left; width:25%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:20%;">Possible Values</th> <th style="text-align:left; width:15%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>select_model_to_fit</td> <td>Chooses the regression model for fitting time series.</td> <td>string</td> <td>{‘Linear Regressor’, ‘Random-Forest Regressor’, ‘AdaBoost Regressor’, ‘Gradient Boosting Regressor’}</td> <td>‘Linear Regressor’</td> </tr> <tr> <td>max_lag</td> <td>Defines how many past time steps are used to predict the current value.</td> <td>int</td> <td>[1, ∞)</td> <td>1</td> </tr> </tbody> </table> ### Moving Average (MA)  **Explanation**: The Moving Average model focuses on modeling noise or random shocks within a time series. Instead of using past values directly, it predicts current observations as a weighted sum of past forecast errors. This approach helps smooth short-term volatility and is ideal for detecting unusual deviations from expected residual patterns. 
**Mathematical intuition**: {{%code%}} Yₜ = μ + ∑ᵢ₌₁^q θᵢ εₜ₋ᵢ + εₜ {{%/code%}} Where: - Yt = current observation - μ = mean of the series - θi = moving average coefficients - εt−i = past forecast errors **Use case**: Applied in stock price anomaly detection, manufacturing output monitoring, and short-term demand forecasting, where identifying irregular error patterns helps uncover transient anomalies or process instability. **Key parameters**: <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:15%;">Parameter</th> <th style="text-align:left; width:50%;">Description</th> <th style="text-align:left; width:10%;">Data Type</th> <th style="text-align:left; width:12.5%;">Possible Values</th> <th style="text-align:left; width:12.5%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>ma_lag (q)</td> <td>Specifies the number of past forecast errors used to predict future values.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> </tbody> </table> ### ARIMA (AutoRegressive Integrated Moving Average) **Explanation**: ARIMA integrates three components: - AR (AutoRegressive) – Uses past values - I (Integrated) – Removes trend and non-stationarity through differencing - MA (Moving Average) – Models residual noise By combining these, ARIMA captures both temporal dependence and trend shifts. It’s highly effective for stationary time series and detecting deviations that violate established statistical relationships over time. **Mathematical intuition**: {{%code%}} Y′ₜ = c + ∑ᵢ₌₁ᵖ φᵢ Y′ₜ₋ᵢ + ∑ⱼ₌₁^q θⱼ εₜ₋ⱼ + εₜ {{%/code%}} Where Y′ₜ is the differenced series after applying differencing of order d. **Use case**: Widely used in economic indicator monitoring, system log anomaly detection, and production line forecasting, where detecting shifts from long-term trends or cyclical stability is crucial. **Key parameters**: <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:15%;">Parameter</th> <th style="text-align:left; width:30%;">Description</th> <th style="text-align:left; width:15%;">Data Type</th> <th style="text-align:left; width:10%;">Possible Values</th> <th style="text-align:left; width:25%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>ar_lag (p)</td> <td>Number of lag observations included in the model.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> <tr> <td>ma_lag (q)</td> <td>Number of lagged forecast errors in the prediction equation.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> <tr> <td>integration (d)</td> <td>Number of times the data is differenced to achieve stationarity.</td> <td>int</td> <td>[0, 5]</td> <td>0</td> </tr> </tbody> </table> ### ARMA (AutoRegressive Moving Average) **Explanation**: ARMA combines the strengths of AR and MA models but assumes the series is stationary (no differencing). It captures both dependency on past observations and correlations in error terms, providing a balance between trend sensitivity and noise filtering. **Mathematical intuition**: {{%code%}} Yₜ = c + ∑ᵢ₌₁ᵖ φᵢ Yₜ₋ᵢ + ∑ⱼ₌₁^q θⱼ εₜ₋ⱼ + εₜ {{%/code%}} **Use case**: Used in network latency monitoring, server performance analysis, and stable process anomaly detection, where consistent behavior over time makes small deviations highly significant.
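As a sketch only: in the statsmodels library, an ARMA(p, q) fit is an ARIMA fit with d = 0, and its residuals can be z-scored in the spirit of the Z-Score anomaly scoring described under the common parameters above. The series and the injected anomaly here are synthetic.

{{%code class="language-python"%}}# Illustrative ARMA fit + z-score residual flagging (synthetic data);
# QuickML performs the equivalent steps without code.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = rng.normal(size=200).cumsum() * 0.01 + rng.normal(size=200)
y[150] += 8  # injected anomaly

res = ARIMA(y, order=(1, 0, 1)).fit()  # ar_lag p=1, d=0, ma_lag q=1
z = (res.resid - res.resid.mean()) / res.resid.std()
print(np.where(np.abs(z) > 3)[0])      # indices flagged as anomalous{{%/code%}}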
**Key parameters**: <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:15%;">Parameter</th> <th style="text-align:left; width:30%;">Description</th> <th style="text-align:left; width:15%;">Data Type</th> <th style="text-align:left; width:10%;">Possible Values</th> <th style="text-align:left; width:25%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>ar_lag (p)</td> <td>Number of lag observations used in the model.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> <tr> <td>ma_lag (q)</td> <td>Number of lagged forecast errors in the model.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> </tbody> </table> ### Auto ARIMA **Explanation**: Auto ARIMA automates ARIMA model selection by evaluating multiple combinations of parameters (p, d, q), and optionally seasonal parameters, based on information criteria (like AIC or BIC). This allows optimal fitting without manual parameter tuning and ensures that the model adapts to changing time dynamics. **Mathematical intuition**: Automatically selects parameters minimizing: {{%code%}} AIC = 2k − 2ln(L) {{%/code%}} Where k = number of parameters and L = the maximized likelihood. **Use case**: Ideal for seasonal retail forecasting, utility consumption anomaly tracking, and automated operational monitoring, where time series exhibit both periodic and non-periodic irregularities. **Key parameters**: <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:15%;">Parameter</th> <th style="text-align:left; width:30%;">Description</th> <th style="text-align:left; width:15%;">Data Type</th> <th style="text-align:left; width:10%;">Possible Values</th> <th style="text-align:left; width:25%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>seasonal</td> <td>Indicates whether the model should account for seasonality.</td> <td>bool</td> <td>{True, False}</td> <td>True</td> </tr> <tr> <td>ar_lag (p)</td> <td>Non-seasonal AR lag parameter.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> <tr> <td>ma_lag (q)</td> <td>Non-seasonal MA lag parameter.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> <tr> <td>integration (d)</td> <td>Non-seasonal differencing order.</td> <td>int</td> <td>[0, 5]</td> <td>0</td> </tr> <tr> <td>periodicity (s)</td> <td>Seasonal period length (e.g., 12 for monthly data).</td> <td>int</td> <td>[0, ∞)</td> <td>12</td> </tr> <tr> <td>integration (D)</td> <td>Seasonal differencing order.</td> <td>int</td> <td>[0, 5]</td> <td>0</td> </tr> <tr> <td>max_order</td> <td>Maximum total order of the model (optional).</td> <td>int</td> <td>[0, 5]</td> <td>5</td> </tr> </tbody> </table> ### SARIMA (Seasonal ARIMA) **Explanation**: SARIMA extends ARIMA by including seasonal autoregressive, differencing, and moving average components, allowing it to model periodic fluctuations (e.g., daily, monthly, yearly cycles). It excels in detecting anomalies that occur relative to seasonal expectations. **Mathematical intuition**: {{%code%}} Φ_P(Bˢ) φ_p(B) (1 − B)ᵈ (1 − Bˢ)ᴰ Yₜ = Θ_Q(Bˢ) θ_q(B) εₜ {{%/code%}} Where: - s = seasonal period - (p,d,q) = non-seasonal parameters - (P,D,Q) = seasonal parameters **Use case**: Applied in energy load anomaly detection, climate monitoring, and sales seasonality analysis, where deviations from expected seasonal behavior signal potential anomalies.
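For intuition only: the SARIMA parameters in the table below map onto statsmodels' SARIMAX `order=(p, d, q)` and `seasonal_order=(P, D, Q, s)` arguments. A minimal sketch on a synthetic monthly cycle:

{{%code class="language-python"%}}# Illustrative SARIMA fit via statsmodels' SARIMAX (synthetic data);
# large residuals against the seasonal forecast would signal anomalies.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
t = np.arange(240)
y = 10 * np.sin(2 * np.pi * t / 12) + rng.normal(size=240)  # monthly cycle

res = SARIMAX(y, order=(1, 0, 1), seasonal_order=(1, 0, 1, 12),
              enforce_stationarity=False,
              enforce_invertibility=False).fit(disp=False)
print(res.forecast(steps=12))  # expected seasonal path for next year{{%/code%}}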
**Key parameters**: <table class="content-table" style="width:100%;"> <thead> <tr> <th style="text-align:left; width:25%;">Parameter</th> <th style="text-align:left; width:20%;">Description</th> <th style="text-align:left; width:15%;">Data Type</th> <th style="text-align:left; width:10%;">Possible Values</th> <th style="text-align:left; width:25%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>ar_lag (p)</td> <td>Non-seasonal AR term.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> <tr> <td>ma_lag (q)</td> <td>Non-seasonal MA term.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> <tr> <td>integration (d)</td> <td>Non-seasonal differencing.</td> <td>int</td> <td>[0, 5]</td> <td>0</td> </tr> <tr> <td>seasonal_ar (P)</td> <td>Seasonal AR term.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> <tr> <td>seasonal_ma (Q)</td> <td>Seasonal MA term.</td> <td>int</td> <td>[0, 5]</td> <td>1</td> </tr> <tr> <td>seasonal_integration (D)</td> <td>Seasonal differencing.</td> <td>int</td> <td>[0, 5]</td> <td>0</td> </tr> <tr> <td>periodicity (s)</td> <td>Seasonal period (e.g., 12 for monthly seasonality).</td> <td>int</td> <td>[0, ∞)</td> <td>12</td> </tr> <tr> <td>enforce_stationarity</td> <td>Whether to enforce stationarity in the model.</td> <td>bool</td> <td>{True, False}</td> <td>False</td> </tr> <tr> <td>enforce_invertibility</td> <td>Whether to enforce invertibility of the model.</td> <td>bool</td> <td>{True, False}</td> <td>False</td> </tr> </tbody> </table> ### Exponential Smoothing **Explanation**: Exponential Smoothing predicts future values by giving exponentially decreasing weights to older observations. This emphasizes recent data while maintaining awareness of the overall trend. It’s highly responsive to sudden changes in trend or level. **Mathematical intuition**: {{%code%}} Ŷₜ₊₁ = αYₜ + (1 − α)Ŷₜ {{%/code%}} Where: - Y^t+1 = forecast - α = smoothing parameter (0–1) - Yt = actual value **Use case**: Common in inventory control, sales trend monitoring, and machine telemetry, where rapid detection of shifts or decays in recent behavior is vital. **Key parameters**: <table class="content-table" style="width:100%;"> <thead> <tr> <th style="width:25%;">Parameter</th> <th style="width:20%;">Description</th> <th style="width:15%;">Data Type</th> <th style="width:10%;">Possible Values</th> <th style="width:25%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>damped_trends</td> <td>Whether to apply damping to trend components.</td> <td>bool</td> <td>{True, False}</td> <td>False</td> </tr> <tr> <td>season</td> <td>Type of seasonality to apply.</td> <td>string</td> <td>{‘Add’, ‘Mul’}</td> <td>‘Add’</td> </tr> <tr> <td>seasonal_periods</td> <td>Number of periods in a full seasonal cycle.</td> <td>int</td> <td>[1, ∞)</td> <td>12</td> </tr> </tbody> </table> ### Holt-Winter’s Method    **Explanation**: The Holt-Winter’s method (Triple Exponential Smoothing) extends exponential smoothing by adding trend and seasonal components. It can adapt to level changes, upward or downward trends, and cyclical variations, providing robust anomaly detection for periodic time series. **Mathematical intuition**: {{%code%}}Lₜ = α (Yₜ / Sₜ₋ₛ) + (1 − α)(Lₜ₋₁ + Tₜ₋₁) Tₜ = β (Lₜ − Lₜ₋₁) + (1 − β)Tₜ₋₁ Sₜ = γ (Yₜ / Lₜ) + (1 − γ)Sₜ₋ₛ Ŷₜ₊ₘ = (Lₜ + mTₜ) Sₜ₋ₛ₊ₘ {{%/code%}} Where Lt = level, Tt = trend, St = seasonal component. 
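This level/trend/season recursion is what statsmodels' ExponentialSmoothing implements; a minimal sketch on a synthetic series, using multiplicative seasonality to match the Yₜ / Sₜ₋ₛ form of the equations above (the α and β values follow the defaults in the parameter table below):

{{%code class="language-python"%}}# Illustrative Holt-Winters (triple exponential smoothing) fit;
# the series with trend and a 12-step seasonal cycle is synthetic.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
t = np.arange(120)
y = ((50 + 0.3 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12))
     + rng.normal(scale=2, size=120))

fit = ExponentialSmoothing(
    y, trend="add", seasonal="mul", seasonal_periods=12
).fit(smoothing_level=0.8, smoothing_trend=0.2)  # α and β

print(fit.forecast(12))  # Ŷₜ₊ₘ for m = 1..12{{%/code%}}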
**Use case**: Widely used in retail demand forecasting, resource utilization monitoring, and temperature anomaly detection, where both seasonality and trend shifts must be captured for accurate anomaly identification. **Key parameters**: <table class="content-table" style="width:100%;"> <thead> <tr> <th style="width:25%;">Parameter</th> <th style="width:25%;">Description</th> <th style="width:15%;">Data Type</th> <th style="width:20%;">Possible Values</th> <th style="width:25%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>smoothing_level</td> <td>Smoothing factor for the level component.</td> <td>float</td> <td>[0, 1]</td> <td>0.8</td> </tr> <tr> <td>smoothing_trend</td> <td>Smoothing factor for the trend component.</td> <td>float</td> <td>[0, 1]</td> <td>0.2</td> </tr> <tr> <td>damping_trend</td> <td>Controls the damping of the trend component.</td> <td>bool</td> <td>{True, False}</td> <td>True</td> </tr> <tr> <td>optimise</td> <td>Specifies whether parameters should be optimized automatically.</td> <td>string</td> <td>{‘Select’, ‘Manual’}</td> <td>‘Select’</td> </tr> <tr> <td>exponential</td> <td>Whether to use exponential trend smoothing.</td> <td>bool</td> <td>{True, False}</td> <td>False</td> </tr> </tbody> </table> # Non-Time Series Anomaly Detection Algorithms ### One-Class SVM **Explanation**: One-Class SVM (Support Vector Machine) learns a decision boundary around the majority of (normal) data points in feature space. Points that fall outside this boundary are classified as anomalies. It works well in high-dimensional datasets and is effective when anomalies are rare and distinct from normal data. **Mathematical Intuition**: The algorithm finds a function f(x) that is positive for regions with high data density (normal points) and negative for low-density regions (anomalies). It aims to solve: {{%code%}}min (1/2) ||w||² + (1 / (νn)) Σ ξᵢ − ρ subject to: (w · φ(xᵢ)) ≥ ρ − ξᵢ , ξᵢ ≥ 0 {{%/code%}} Where: - ν: controls the upper bound of outliers - ξi: slack variables allowing soft margins - ϕ(x): kernel function mapping data to higher dimensions **Use case**: Used in fraud detection, network intrusion detection, or novelty detection in industrial systems where normal behavior is well-defined but anomalies are rare. **Key parameters**: {{%note%}}{{%bold class="bold-primary"%}}Note:{{%/bold%}} In parameter ranges, square brackets [ ] include the boundary, while parentheses ( ) exclude it. Example: [1, ∞) → includes 1 and extends to infinity; (0, 1] → excludes 0 but includes 1. {{%/note%}} <table class="content-table" style="width:100%;"> <thead> <tr> <th style="width:15%;">Parameter</th> <th style="width:30%;">Description</th> <th style="width:15%;">Data Type</th> <th style="width:15%;">Possible Values</th> <th style="width:25%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>kernel</td> <td>Specifies the kernel type to use. If none is given, 'rbf' is used.</td> <td>string</td> <td>{‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’}</td> <td>'rbf'</td> </tr> <tr> <td>degree</td> <td>Degree of the polynomial kernel function ('poly'). 
Ignored by other kernels.</td> <td>int</td> <td>[1, ∞)</td> <td>3</td> </tr> <tr> <td>gamma</td> <td>Kernel coefficient for 'rbf', 'poly', and 'sigmoid'.</td> <td>string</td> <td>{‘scale’, ‘auto’}</td> <td>'scale'</td> </tr> <tr> <td>coef0</td> <td>Independent term in kernel function, significant in 'poly' and 'sigmoid'.</td> <td>float</td> <td>[0, 1]</td> <td>0.0</td> </tr> <tr> <td>tol</td> <td>Tolerance for stopping criterion.</td> <td>float</td> <td>[0, 1]</td> <td>1e-3</td> </tr> <tr> <td>nu</td> <td>Upper bound on fraction of training errors and lower bound on fraction of support vectors.</td> <td>float</td> <td>(0, 1]</td> <td>0.5</td> </tr> <tr> <td>shrinking</td> <td>Whether to use the shrinking heuristic.</td> <td>bool</td> <td>{False, True}</td> <td>True</td> </tr> <tr> <td>max_iter</td> <td>Hard limit on solver iterations, or -1 for no limit.</td> <td>int</td> <td>[1, ∞), {-1}</td> <td>-1</td> </tr> </tbody> </table> ### Isolation Forest **Explanation**: Isolation Forest identifies anomalies by isolating observations instead of modeling normal data points. It randomly selects a feature and splits the data based on a random threshold. Since anomalies are few and different, they are easier to isolate and require fewer splits. The average path length of trees is shorter for anomalies and longer for normal points. **Mathematical intuition**: The anomaly score is computed as: {{%code%}} s(x, n) = 2^(− E(h(x)) / c(n)) {{%/code%}} Where: - E(h(x)): average path length of observation x - c(n): average path length of unsuccessful search in a Binary Search Tree - Scores close to 1 → anomalies; near 0.5 → normal **Use case**: Commonly used in fraud detection, network intrusion detection, manufacturing defect detection, and IoT sensor anomaly identification. In fraud detection, it helps uncover suspicious transactions that deviate from normal spending behavior. In network intrusion detection, it identifies abnormal access patterns or traffic spikes that may indicate a security breach. **Key parameters**: <table class="content-table" style="width:100%;"> <thead> <tr> <th style="width:15%;">Parameter</th> <th style="width:30%;">Description</th> <th style="width:15%;">Data Type</th> <th style="width:15%;">Possible Values</th> <th style="width:25%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_estimators</td> <td>Number of base estimators (trees) in the ensemble.</td> <td>int</td> <td>[1, ∞)</td> <td>100</td> </tr> <tr> <td>max_samples</td> <td>Number of samples to draw to train each estimator.</td> <td>string, int, float</td> <td>[1, ∞), [0, 1], {'auto'}</td> <td>'auto'</td> </tr> <tr> <td>contamination</td> <td>Proportion of outliers in the dataset.</td> <td>string, float</td> <td>{‘auto’}, (0, 0.5]</td> <td>'auto'</td> </tr> <tr> <td>max_features</td> <td>Number of features to draw for each estimator.</td> <td>int, float</td> <td>[1, ∞), [0, 1]</td> <td>1.0</td> </tr> <tr> <td>bootstrap</td> <td>Whether to sample training data with replacement.</td> <td>bool</td> <td>{False, True}</td> <td>False</td> </tr> </tbody> </table> ### Local Outlier Factor (LOF) **Explanation**: Local Outlier Factor detects anomalies by comparing the local density of a data point to that of its neighbors. If a point has a substantially lower density than its neighbors, it’s considered an anomaly. LOF is particularly effective in detecting local anomalies rather than global ones. **Mathematical intuition**: LOF is based on the concept of local reachability density (LRD). 
The LOF score of a data point A is given by: {{%code%}} LOFₖ(A) = ( Σ_{B ∈ Nₖ(A)} [LRDₖ(B) / LRDₖ(A)] ) / |Nₖ(A)| {{%/code%}} Where: - Nₖ(A): k-nearest neighbors of A - LRDₖ(A): local reachability density of A - Values ≈ 1 → normal, > 1 → outlier **Use case**: It is effectively utilized across diverse domains where identifying local deviations from normal patterns is essential. In fraud detection, it highlights customers or transactions that behave differently from their peer groups, helping uncover subtle or evolving fraud patterns. For customer segmentation, it detects outlier profiles within defined segments, such as unusually high-value or inactive customers, enabling more accurate targeting and retention strategies. **Key parameters**: <table class="content-table" style="width:100%;"> <thead> <tr> <th style="width:15%;">Parameter</th> <th style="width:30%;">Description</th> <th style="width:15%;">Data Type</th> <th style="width:15%;">Possible Values</th> <th style="width:25%;">Default Value</th> </tr> </thead> <tbody> <tr> <td>n_neighbors</td> <td>Number of neighbors to use for kneighbors queries.</td> <td>int</td> <td>[1, n_samples)</td> <td>20</td> </tr> <tr> <td>algorithm</td> <td>Algorithm used to compute nearest neighbors.</td> <td>string</td> <td>{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}</td> <td>'auto'</td> </tr> <tr> <td>leaf_size</td> <td>Leaf size for BallTree or KDTree, affecting speed and memory.</td> <td>int</td> <td>[1, ∞)</td> <td>20</td> </tr> <tr> <td>metric</td> <td>Metric for distance computation.</td> <td>string</td> <td>{'cityblock', 'cosine', 'euclidean', 'haversine', 'l1', 'l2', 'manhattan', 'minkowski', 'nan_euclidean'}</td> <td>'minkowski'</td> </tr> <tr> <td>p</td> <td>Parameter for the Minkowski metric (1 = Manhattan, 2 = Euclidean).</td> <td>float</td> <td>[1, ∞)</td> <td>2</td> </tr> <tr> <td>contamination</td> <td>Proportion of outliers in the dataset.</td> <td>float, string</td> <td>(0, 0.5], {'auto'}</td> <td>'auto'</td> </tr> </tbody> </table> ### Operations in QuickML -------------------------------------------------------------------------------- title: "Encoding" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.616Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/operations-in-quickml/encoding/" service: "QuickML" -------------------------------------------------------------------------------- # Operations in QuickML Data Preprocessing is the step in which data gets transformed, or encoded, to help the machine parse it. In other words, the features of the data can then be easily interpreted by the algorithm. 1. **Encoding** 2. **Feature Engineering** 3. **Imputation** 4. **Normalization** 5. **Transformers** # Encoding Encoding is a technique for converting categorical (discrete) variables into numerical (continuous) values so they can be easily fitted by a machine-learning model. 1. ### Ordinal Encoder An ordinal encoding involves mapping each unique label to an integer value. This type of encoding is really only appropriate if there is a known relationship between the categories. If the data is ordered, we can use ordinal encoding.\ **Example:**\ For the temperature values Low, Normal, and High, we can use ordinal encoding. After encoding, the data will look like 0, 1, 2 (0 → Low temp, 2 → High temp). Ordinal encoding uses a single column of integers to represent the classes.
An optional mapping dict can be passed in. In this case, we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random. 2. ### One-Hot Encoding We use this categorical data-encoding technique when the features are nominal (do not have any order). In one-hot encoding, for each level of a categorical feature, we create a new variable. Each category is mapped with a binary variable containing either 0 or 1. Here, 0 represents the absence, and 1 represents the presence of that category. If the categorical feature is not ordinal and the number of categories is small, one-hot encoding can be applied effectively. **Sample input:** <table class="content-table" style="width:200px;"> <thead> <tr> <th style="text-align:center">color</th> </tr> </thead> <tbody> <tr> <td style="text-align:center">blue</td> </tr> <tr> <td style="text-align:center">red</td> </tr> <tr> <td style="text-align:center">green</td> </tr> </tbody> </table> **Sample output:** <table class="content-table" style="width:500px;"> <thead> <tr> <th style="text-align:center">color_blue</th> <th style="text-align:center">color_red</th> <th style="text-align:center">color_green</th> </tr> </thead> <tbody> <tr> <td style="text-align:center">1</td> <td style="text-align:center">0</td> <td style="text-align:center">0</td> </tr> <tr> <td style="text-align:center">0</td> <td style="text-align:center">1</td> <td style="text-align:center">0</td> </tr> <tr> <td style="text-align:center">0</td> <td style="text-align:center">0</td> <td style="text-align:center">1</td> </tr> </tbody> </table> 3. ### JamesStein Encoder For each feature value, the James-Stein estimator returns a weighted average of: 1. The mean target value for the observed feature value. 2. The mean target value (regardless of the feature value). 4. ### Label Encoding This is used to convert a categorical target column into a numerical column by assigning a unique integer or numerical label to each category in the categorical variable. It's important to note that this encoding introduces an ordering to the categorical variables, which may not be useful in every case. It is appropriate for ordinal variables, where there is an inherent order or ranking among the categories. 5. ### LeaveOneOut Encoder Leave-one-out encoding essentially calculates the mean of the target variable for all the records containing the same value for the categorical feature variable in question. The encoding algorithm differs slightly between the training and test datasets. For the training dataset, the record under consideration is left out; hence the name leave-one-out. 6. ### Target Encoding In target encoding, we calculate the mean of the target variable for each category and replace the category variable with the mean value. In the case of categorical target variables, the posterior probability of the target replaces each category.\ Target encoding is the process of replacing a categorical value with the mean of the target variable. Any non-categorical columns are automatically dropped by the target encoder model. 7. ### Count Encoder Count encoding is based on replacing categories with their counts computed on the training set. Counts may be the same for some categories, which can result in a collision: two categories encoded as the same value. The count encoder is best used when category counts are distinct.
<table class="content-table" style="width:600px;"> <tr> <th style="text-align:center">Sample Input</th> <td style="text-align:center">10</td> <td style="text-align:center">10</td> <td style="text-align:center">20</td> <td style="text-align:center">30</td> <td style="text-align:center">30</td> <td style="text-align:center">30</td> </tr> <th style="text-align:center">Sample Output</th> <td style="text-align:center">2</td> <td style="text-align:center">2</td> <td style="text-align:center">1</td> <td style="text-align:center">3</td> <td style="text-align:center">3</td> <td style="text-align:center">3</td> </tr> <tbody> </tbody> </table> 8. ### Backward Difference Encoding In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable. 9. ### Helmert Encoding The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. This comparison does not make much sense for a nominal variable, such as race. 10. ### Catboost Encoding Catboost is a target-based categorical encoder. It replaces a categorical feature with the average target value for that category in the training dataset, combined with the target probability over the entire dataset. However, this introduces target leakage, because the target is used to derive the encoding. -------------------------------------------------------------------------------- title: "Feature Engineering" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.616Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/operations-in-quickml/feature-engineering/" service: "QuickML" -------------------------------------------------------------------------------- # Feature Engineering ### Feature Generation This is the process of transforming features that already exist into new ones to make them more relevant to the target feature. The following techniques are used in feature generation: - **Operations** - A feature generation technique that generates new features based on mathematical operations on the existing numerical features. - **Autolearn** - A regression-based feature generation algorithm. Features are generated by mining pairwise feature associations, identifying the linear or non-linear relationship between each pair, applying regression, and selecting those relationships that are stable and improve the prediction performance. - **Explorekit** - Generates a large set of candidate features by combining information in the original features, with the aim of maximising predictive performance according to user-selected criteria. ### Feature Selection The techniques below help to decrease the dimensionality of the feature space, streamline the model, and enhance the generalisation performance of the model by choosing a subset of relevant features from the original dataset. - **Embedded** - A technique where feature selection is integrated into the process of training a machine learning model. The model itself decides which features are most relevant during training. - **Filter** - A technique that involves selecting the most relevant features based on their statistical properties or ranking scores.
- **Redundancy Elimination** - A process of removing features from a dataset that provide similar or duplicate information. - **Backward Feature Elimination** - A technique that starts with all features in the dataset and iteratively removes the least significant features one at a time. - **Exhaustive Feature Engineering** - A technique that considers all possible combinations of features to find the optimal subset that results in the best model performance. - **Forward Selection** - A technique that starts with an empty set of features and iteratively adds the most significant features one at a time. ### Feature Reduction These techniques are used to address the “curse of dimensionality,” which arises when an algorithm struggles to train an effective model due to a large number of features in the dataset relative to the number of observations. The following effective techniques are employed: - **PCA** - Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving as much of the original data's variability as possible. - **FA** - Factor Analysis (FA) is a statistical technique used to uncover underlying latent variables (factors) that explain patterns of correlations among observed variables in a dataset. It is commonly employed for dimensionality reduction and to gain insights into the structure of complex data. - **NMF** - NMF (Non-Negative Matrix Factorization) is a dimensionality reduction and feature extraction technique that is particularly useful when dealing with non-negative data, such as text data or image data with pixel intensities. - **ICA** - ICA (Independent Component Analysis) is a technique used to separate a multivariate signal into statistically independent components, assuming that the observed data is a linear combination of non-Gaussian and independent source signals. - **LDA** - LDA (Linear Discriminant Analysis) is a supervised dimensionality reduction and classification technique used to find a linear combination of features that best separates two or more classes in the data. -------------------------------------------------------------------------------- title: "Imputers" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.617Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/operations-in-quickml/imputers/" service: "QuickML" -------------------------------------------------------------------------------- # Imputers Imputation is a technique for replacing missing data in a dataset with substitute values, to retain most of the data/information in the dataset. 1. ### KNN Imputation This imputation utilises the k-nearest neighbours method to replace missing values in the dataset with the mean value of the {{%badge%}}n_neighbors{{%/badge%}} nearest neighbours found in the training set. By default, the {{%badge%}}n_neighbors{{%/badge%}} parameter is set to 5, and the Euclidean distance metric is used to find the k nearest neighbours. 2. ### MissForest Imputation It initially imputes all missing data using the mean/mode. Then, for each variable with missing values, a random forest model is trained on the observed part and used to predict the missing part. 3. ### Mean Imputation Mean Imputation replaces the null values with the mean of that feature across the whole dataset. 4.
### Median Imputation Median Imputation replaces the null values with the median of that feature across the whole dataset. 5. ### Mode Imputation Mode Imputation replaces the null values with the mode of that feature across the whole dataset. 6. ### Group-By Imputation Group-by imputation takes the following 3 inputs for replacing the null values. * **Columns to Impute**: Columns that contains null values and that are needed to be replaced. * **Group-By Columns**: These columns will be grouped by while calculating values to replace null values. * **Aggregator**: Aggregate function like mean, median, minimum and maximum that need to be used. By using the above inputs, Group-By Imputation finds the required aggregate values by grouping the Group-By columns and fills the null values in the dataset with those aggregate values. -------------------------------------------------------------------------------- title: "Normalization" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.617Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/operations-in-quickml/normalization/" service: "QuickML" -------------------------------------------------------------------------------- # Normalization Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values or losing information. ### Min-Max Normalization Min-Max normalization is one of the most common ways to normalize data. For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1. **Process:** {{%code class="language-json"%}}x_normalized = (x−min(x))/ (max(x)-min(x)){{%/code%}} **Where:** - x_normalized is the normalized value of the feature. - x is the original value of the feature. - min(x) is the minimum value of the feature across the dataset. - max(x) is the maximum value of the feature across the dataset. **Example:** <table class="content-table" style="width:400px;"> <tr> <th style="text-align:center">Sample Input</th> <td style="text-align:center">10</td> <td style="text-align:center">25</td> <td style="text-align:center">30</td> </tr> <th style="text-align:center">Sample Output</th> <td style="text-align:center">0</td> <td style="text-align:center">0.75</td> <td style="text-align:center">1</td> </tr> <tbody> </tbody> </table> ### Unit Normalization Unit normalization consists of dividing every entry in a column (feature) by its magnitude to create a feature of length 1 known as the unit vector. **Process:** {{%code class="language-json"%}}x_normalized = x / ||x||{{%/code%}} **Where:** - x_normalized is the normalized value of the feature. - x is the original value of the feature. - ||x|| is the magnitude which is calculated as - ||x|| = sqrt(x1^2 + x2^2 + ....... xn^2) - x1, x2, x3......xn are the original values of the feature. 
**Example:** <table class="content-table" style="width:400px;"> <tr> <th style="text-align:center">Sample Input</th> <td style="text-align:center">10</td> <td style="text-align:center">25</td> <td style="text-align:center">30</td> </tr> <th style="text-align:center">Sample Output</th> <td style="text-align:center">0.248</td> <td style="text-align:center">0.620</td> <td style="text-align:center">0.744</td> </tr> <tbody> </tbody> </table> ### Mean Normalization This transformer centers the data on the mean, so that the transformed values sum to 0. **Process:** {{%code class="language-json"%}}x_normalized = (x - mean(x)) / (max(x) - min(x)){{%/code%}} **Where:** - x_normalized is the normalized value of the feature. - x is the original value of the feature. - mean(x) is the mean of the feature across the dataset. - min(x) is the minimum value of the feature across the dataset. - max(x) is the maximum value of the feature across the dataset. **Example:** <table class="content-table" style="width:400px;"> <tr> <th style="text-align:center">Sample Input</th> <td style="text-align:center">10</td> <td style="text-align:center">25</td> <td style="text-align:center">30</td> </tr> <th style="text-align:center">Sample Output</th> <td style="text-align:center">-0.583</td> <td style="text-align:center">0.166</td> <td style="text-align:center">0.416</td> </tr> <tbody> </tbody> </table> ### Mean-Std Normalization The data can be normalized by subtracting the mean (µ) of each feature and dividing by the standard deviation (σ). This way, each feature has a mean of 0 and a standard deviation of 1, which results in faster convergence. **Process:** {{%code class="language-json"%}}x_normalized = (x - mean(x)) / std(x){{%/code%}} **Where** - x_normalized is the normalized value of the feature. - x is the original value of the feature. - mean(x) is the mean of the feature across the dataset. - std(x) is the standard deviation of the feature across the dataset. **Example:** <table class="content-table" style="width:400px;"> <tr> <th style="text-align:center">Sample Input</th> <td style="text-align:center">10</td> <td style="text-align:center">25</td> <td style="text-align:center">30</td> </tr> <th style="text-align:center">Sample Output</th> <td style="text-align:center">-1.120</td> <td style="text-align:center">0.320</td> <td style="text-align:center">0.800</td> </tr> <tbody> </tbody> </table> ### Robust Normalization Robust Scaler is a normalization technique that scales features using the **median** and **interquartile range (IQR)**, making it less sensitive to outliers compared to standard scaling methods. It centers the normalised data around the median and scales it according to the IQR, which helps maintain the structure of data with extreme values without letting those values skew the scaling.
**Process:** {{%code class="language-json"%}}x_robust = (x − median(x)) / IQR(x){{%/code%}} **Where** - x = original feature value - median(x) = median of the feature values - IQR(x) = interquartile range (75th percentile - 25th percentile) - x_robust = normalized value after robust scaling **Example** Suppose the feature values are: [10, 25, 30, 1000 (outlier)] - Median = 27.5 - IQR = 20 (between 25th percentile = 15 and 75th percentile = 35) Normalized values would be: <table class="content-table" style="width:400px;"> <tr> <th style="text-align:center">Input Value</th> <td style="text-align:center">10</td> <td style="text-align:center">25</td> <td style="text-align:center">30</td> <td style="text-align:center">1000</td> </tr> <th style="text-align:center">Output (Robust)</th> <td style="text-align:center">-0.875</td> <td style="text-align:center">-0.125</td> <td style="text-align:center">0.125</td> <td style="text-align:center">48.625</td> </tr> <tbody> </tbody> </table> Notice that the outlier (1000) has a large scaled value, but the rest are kept in a reasonable range without distortion. **Real-Time Benefit of Robust Normalization** In real-world data, such as sensor readings, financial transactions, or health metrics, outliers are common. Using robust normalization prevents these extreme values from dominating model training, resulting in: - A more stable and reliable model - Better generalization to unseen data - Improved performance when the dataset has noisy or extreme values -------------------------------------------------------------------------------- title: "Transformer" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.617Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/operations-in-quickml/transformer/" service: "QuickML" -------------------------------------------------------------------------------- # Transformer Transformation is an ML operation that rescales the data so that all feature values are on the same scale. Many normalization techniques are used for the same purpose. 1. ### Square Transform This transforms the data into its square. <table class="content-table" style="width:400px;"> <tr> <th style="text-align:center">Sample Input</th> <td style="text-align:center">5</td> <td style="text-align:center">10</td> <td style="text-align:center">12</td> </tr> <th style="text-align:center">Sample Output</th> <td style="text-align:center">25</td> <td style="text-align:center">100</td> <td style="text-align:center">144</td> </tr> <tbody> </tbody> </table> 2. ### Cube Transform This transforms the data into its cube. <table class="content-table" style="width:400px;"> <tr> <th style="text-align:center">Sample Input</th> <td style="text-align:center">2</td> <td style="text-align:center">4</td> <td style="text-align:center">5</td> </tr> <th style="text-align:center">Sample Output</th> <td style="text-align:center">8</td> <td style="text-align:center">64</td> <td style="text-align:center">125</td> </tr> <tbody> </tbody> </table> 3. ### Inverse Transform This transforms the data into its inverse (1/x).
<table class="content-table" style="width:400px;"> <tr> <th style="text-align:center">Sample Input</th> <td style="text-align:center">10</td> <td style="text-align:center">20</td> <td style="text-align:center">30</td> </tr> <th style="text-align:center">Sample Output</th> <td style="text-align:center">0.1</td> <td style="text-align:center">0.05</td> <td style="text-align:center">0.03333</td> </tr> <tbody> </tbody> </table> 4. ### Root Transform This transforms data into its square root. <table class="content-table" style="width:400px;"> <tr> <th style="text-align:center">Sample Input</th> <td style="text-align:center">25</td> <td style="text-align:center">100</td> <td style="text-align:center">64</td> </tr> <th style="text-align:center">Sample Output</th> <td style="text-align:center">5</td> <td style="text-align:center">10</td> <td style="text-align:center">8</td> </tr> <tbody> </tbody> </table> 5. ### Log Transform This transforms data into its natural-log value. (The sample outputs below correspond to log(1 + x), a common variant that safely handles zero values.) <table class="content-table" style="width:400px;"> <tr> <th style="text-align:center">Sample Input</th> <td style="text-align:center">10</td> <td style="text-align:center">25</td> <td style="text-align:center">30</td> </tr> <th style="text-align:center">Sample Output</th> <td style="text-align:center">2.39789527</td> <td style="text-align:center">3.25809654</td> <td style="text-align:center">3.4339872</td> </tr> <tbody> </tbody> </table> ### Pipeline Builder Interface -------------------------------------------------------------------------------- title: "Walkthrough" description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions." last_updated: "2026-03-18T07:41:08.618Z" source: "https://docs.catalyst.zoho.com/en/quickml/help/pipeline-builder-interface/walkthrough/" service: "QuickML" -------------------------------------------------------------------------------- # Pipeline Builder Interface QuickML provides an easy-to-use, convenient drag-and-drop editor in which data and machine-learning capabilities are packaged as stages. ### Stage A stage is a single task or unit of function that provides particular results based on the nature of the computation assigned to it. A stage can be any data-related or ML-related task that can be sequentially executed in a workflow. For example, a stage can be a data filter that takes raw data as input and provides filtered data as output based on the filter type. ### Pipeline A pipeline is a sequential combination of stages intended to perform multiple operations and provide a collective result as output. Both data pipelines and ML pipelines can be created using the QuickML service. ### Pipeline Builder Interface The Pipeline Builder interface is a drag-and-drop, in-browser editor component provided in QuickML to construct pipelines with one or more stages of desired operations. This editor provides all the basic create, copy, edit, undo, redo, and delete operations in the pipeline-building process. Each data or ML module is segregated as a stage in the editor and listed under the data and ML categories. ![Pipeline Builder Interface](/images/help/quickml/pipeline-builder-interface/Pipeline_Builder_Interface.webp) Each pipeline must have a source stage and a destination stage; the other stages are customizable as per your pipeline requirements.
--------------------------------------------------------------------------------
title: "Walkthrough"
description: "QuickML is a fully no-code ML pipeline builder service in the Catalyst development platform for creating machine-learning pipelines with end-to-end solutions."
last_updated: "2026-03-18T07:41:08.618Z"
source: "https://docs.catalyst.zoho.com/en/quickml/help/pipeline-builder-interface/walkthrough/"
service: "QuickML"
--------------------------------------------------------------------------------

# Pipeline Builder Interface

QuickML provides an easy-to-use, convenient drag-and-drop editor in which data and machine-learning capabilities are provided as stages.

### Stage

A stage is a single task or unit of function that provides particular results based on the nature of the computation assigned to it. A stage can be any data-related or ML-related task that can be sequentially executed in a workflow. For example, a stage can be a data filter that takes raw data as input and provides filtered data as output based on the filter type.

### Pipeline

A pipeline is a sequential combination of stages that performs multiple operations and provides a collective result as output. Both data pipelines and ML pipelines can be created using the QuickML service and its provided functionalities.

### Pipeline Builder Interface

The Pipeline Builder interface is a drag-and-drop, in-browser editor component provided in QuickML to construct pipelines with one or more stages of the desired operations. The editor provides all the basic create, copy, edit, undo, redo, and delete operations in the pipeline building process. Each data or ML module is segregated as a stage in the editor and listed under the data and ML categories.

![Pipeline Builder Interface](/images/help/quickml/pipeline-builder-interface/Pipeline_Builder_Interface.webp)

Each pipeline has mandatory source and destination stages. The other stages are customizable as per your pipeline requirements.

Any stage added to the editor needs to be connected with the next or previous stages to form a pipeline execution flow. Once the stages are mapped, a preview is shown in the editor for a better understanding of the pipeline flow.

### Data Operations in the Editor

The data-related operations are listed in the pipeline builder for use in data pipeline construction. The details of each operation in the editor are provided in the data pipelines chapter of this manual. For each data operation available in the pipeline editor, a custom configuration panel opens where you can enter the configurations for that stage.

### Criteria Editor

A criteria editor component is provided in the builder to apply custom criteria over the resultant data of any stage, based on your requirements.

![Criteria Editor](/images/help/quickml/pipeline-builder-interface/Criteria_Editor.webp)

### ML Operations in the Editor

The ML operations are available in the builder when a model pipeline needs to be constructed. The ML pipeline builder contains both data-related operations and ML operations to form an execution pipeline. These options are categorized as algorithms and operations for ML tasks. ML stages can be any algorithms or Zia features that are integrated as stages into the builder, and they can be added to the pipeline by dragging them into the builder.

Once the pipeline is built, it can be saved and kept for future development, or published to execute in the background. These executions are versioned, and the execution results are stored to keep track of the status. Any version of the pipeline can be viewed at any time and promoted as a production-ready pipeline.

---

## SDK — Java — QuickML

--------------------------------------------------------------------------------
title: "Execute QuickML Endpoint"
description: "This page describes the method to execute QuickML endpoints in your Java application with a sample code snippet."
last_updated: "2026-03-18T07:41:08.655Z"
source: "https://docs.catalyst.zoho.com/en/sdk/java/v1/quickml/execute-quickml-endpoints/"
service: "QuickML"
related:
- QuickML (/en/quickml/)
- QuickML Pipeline Endpoints (/en/quickml/help/pipeline-endpoints/)
--------------------------------------------------------------------------------

# Catalyst QuickML

{{%link href="/en/quickml/" %}}Catalyst QuickML{{%/link%}} is a no-code machine learning pipeline builder service that lets you implement a host of pre-built ML algorithms, operations, and data preprocessing techniques, and connect with datasets to build and publish ML models. After you publish the data pipeline and ML pipeline, you can access the models you create with authenticated endpoints.

### Execute QuickML Endpoint

The code snippet given below allows you to pass input data to a published {{%link href="/en/quickml/help/pipeline-endpoints/" %}}QuickML endpoint{{%/link%}}, and predict the outcome based on the ML model's processing. The output returns the prediction of the values of the target column that is defined while creating the ML pipeline.

{{%note%}}{{%bold class="bold-primary" %}}Note:{{%/bold%}}
1. You will need to have the ML pipeline and the model's endpoint configured and published in your project using the Catalyst console before you execute the code snippet below to predict the outcome.
2. QuickML is currently not available to Catalyst users accessing from the JP, SA or CA data centers.
{{%/note%}}

The input data to the model's endpoint is passed as key-value pairs through a HashMap. The {{%badge%}}endpoint_key{{%/badge%}} mentioned below is the unique ID of the endpoint published for the ML model configured in your project. The endpoint key and the input data are passed to the {{%badge%}}predict(){{%/badge%}} method for execution.

#### Sample Code Snippet

<br>

{{% panel_without_adjustment header="Package Imports" class="language-java line-numbers" %}}import java.util.HashMap;
import com.zc.component.quickml.ZCQuickML;
import com.zc.component.quickml.ZCQuickMLDetail;{{% /panel_without_adjustment %}}

{{%code class="language-java" scroll="set-scroll"%}}// Input data
HashMap<String, String> map = new HashMap<>();

// Give column name and value as per your dataset
map.put("column_name1", "value1");
map.put("column_name2", "value2");
map.put("column_name3", "value3");

// Endpoint key copied from the Catalyst console
String endpointKey = "c8c7b4bfd8fdf4f29b9499fjhd9d03f5b2a79cd4792b302e26934a3db984214fd4cfc8049bba7acb15fe73eac0d15af59";

// Create a QuickML instance
ZCQuickML quickMlInstance = ZCQuickML.getInstance();

// Execute the predict method
ZCQuickMLDetail result = quickMlInstance.predict(endpointKey, map);

System.out.println(result.getStatus()); // prints the result status
System.out.print(result.getResult()); // prints the result data{{%/code%}}

---

## SDK — Node JS — QuickML

--------------------------------------------------------------------------------
title: "Execute QuickML Endpoint"
description: "This page describes the method to execute QuickML endpoints in your NodeJS application with a sample code snippet."
last_updated: "2026-03-18T07:41:08.657Z"
source: "https://docs.catalyst.zoho.com/en/sdk/nodejs/v2/quickml/execute-quickml-endpoints/"
service: "QuickML"
related:
- QuickML (/en/quickml/)
- QuickML Pipeline Endpoints (/en/quickml/help/pipeline-endpoints/)
--------------------------------------------------------------------------------

# Catalyst QuickML

{{%link href="/en/quickml/" %}}Catalyst QuickML{{%/link%}} is a no-code machine learning pipeline builder service that lets you implement a host of pre-built ML algorithms, operations, and data preprocessing techniques, and connect with datasets to build and publish ML models. After you publish the data pipeline and ML pipeline, you can access the models you create with authenticated endpoints.

### Execute QuickML Endpoint

The code snippet given below allows you to pass input data to a published {{%link href="/en/quickml/help/pipeline-endpoints/" %}}QuickML endpoint{{%/link%}}, and predict the outcome based on the ML model's processing. The output returns the prediction of the values of the target column that is defined while creating the ML pipeline.

{{%note%}}{{%bold class="bold-primary" %}}Note:{{%/bold%}}
1. You will need to have the ML pipeline and the model's endpoint configured and published in your project using the Catalyst console before you execute the code snippet below to predict the outcome.
2. QuickML is currently not available to Catalyst users accessing from the JP, SA or CA data centers.
{{%/note%}}

The {{%badge%}}quickml{{%/badge%}} component instance is created as shown below; creating the instance does not fire a server-side call. You can pass the input data to the model's endpoint as key-value pairs. The {{%badge%}}endpoint_key{{%/badge%}} mentioned below is the unique ID of the endpoint published for the ML model configured in your project.
The endpoint key and the input data are passed to the {{%badge%}}predict(){{%/badge%}} method for execution.

{{%code class="language-javascript" scroll="set-scroll" %}}// Input data
const input_data = {
    // Enter column name and value as per your dataset
    "column_name1": "value1",
    "column_name2": "value2",
    "column_name3": "value3"
}

// Create a quickml instance
const quickml = app.quickML();

// Execute the predict method (inside an async function)
const result = await quickml.predict("{endpoint_key}", input_data); // Replace {endpoint_key} with the endpoint key copied from the Catalyst console
console.log(result);{{%/code%}}

The syntax of the output received is shown below:

{{%code class="language-javascript" scroll="set-scroll" %}}{
    'status': 'success',
    'result': ["results....."]
}{{%/code%}}

---

## SDK — Python — QuickML

--------------------------------------------------------------------------------
title: "Execute QuickML Endpoint"
description: "This page describes the method to execute QuickML endpoints in your Python application with a sample code snippet."
last_updated: "2026-03-18T07:41:08.660Z"
source: "https://docs.catalyst.zoho.com/en/sdk/python/v1/quickml/execute-quickml-endpoints/"
service: "QuickML"
related:
- QuickML Help (/en/quickml/)
- QuickML Pipeline Endpoints (/en/quickml/help/pipeline-endpoints/)
- SDK Scopes (/en/sdk/python/v1/sdk-scopes)
--------------------------------------------------------------------------------

# Catalyst QuickML

{{%link href="/en/quickml/" %}}Catalyst QuickML{{%/link%}} is a no-code machine learning pipeline builder service that lets you implement a host of pre-built ML algorithms, operations, and data preprocessing techniques, and connect with datasets to build and publish ML models. After you publish the ML pipeline, you can access the models you created with authenticated endpoints.

### Execute QuickML Endpoint

The code snippet given below allows you to pass input data to a published {{%link href="/en/quickml/help/pipeline-endpoints/" %}}QuickML endpoint{{%/link%}}, and predict the outcome based on the ML model's processing. The output returns the prediction of the values of the target column that is defined while creating the ML pipeline.

{{%note%}}{{%bold class="bold-primary" %}}Note:{{%/bold%}}
1. You will need to have the ML pipeline and the model's endpoint configured and published in your project using the Catalyst console before you execute the code snippet below to predict the outcome.
2. QuickML is currently not available to Catalyst users accessing from the JP, SA or CA data centers.
{{%/note%}}

The {{%badge%}}quickml{{%/badge%}} component instance is created as shown below; creating the instance does not fire a server-side call. You will need to create a data dictionary through which you can pass the input data to the model's endpoint as key-value pairs. The {{%badge%}}endpoint_key{{%/badge%}} mentioned below is the unique ID of the endpoint published for the ML model configured in your project. The endpoint key and the input data are passed to the {{%badge%}}predict(){{%/badge%}} method for execution.

The {{%badge%}}app{{%/badge%}} reference used in the code below is the {{%link href="/en/sdk/python/v1/setup/#initializing-the-sdk" %}}Python object{{%/link%}} returned as a response during SDK initialization.

**Parameters Used**

<table class="content-table">
<thead>
<tr> <th class="w20p">Parameter Name</th> <th class="w60p">Definition</th> </tr>
</thead>
<tbody>
<tr> <td>{{%badge%}}endpoint_key{{%/badge%}}</td> <td>A {{%bold%}}Mandatory{{%/bold%}} parameter. Will store the unique ID of the endpoint published for the ML model.</td> </tr>
<tr> <td>{{%badge%}}input_data{{%/badge%}}</td> <td>A {{%bold%}}Mandatory{{%/bold%}} parameter. Will store the data to be published to the QuickML endpoint.</td> </tr>
</tbody>
</table>
{{%code class="language-python" scroll="set-scroll" %}}# Input data dictionary
input_data = {
    # Give column name and value as per your dataset
    "column_name1": "value1",
    "column_name2": "value2",
    "column_name3": "value3",
}

# Create a quickml instance
quickml = app.quick_ml()

# Execute the predict method
result = quickml.predict("{endpoint_key}", input_data) # Replace {endpoint_key} with the endpoint key copied from the Catalyst console
print(result)
{{%/code%}}

The syntax of the output received is shown below:

{{%code class="language-json" scroll="set-scroll" %}}{
    "status":"success",
    "result":[
        "results....."
    ]
}{{%/code%}}

{{%note%}}{{%bold%}}Info:{{%/bold%}} Refer to the {{%link href="/en/sdk/python/v1/sdk-scopes" %}}SDK Scopes table{{%/link%}} to determine the required permission level for performing the above operation.{{%/note%}}

---

## FAQ — QuickML

# General

{{% expand-container %}} {{% expand id="automl_quickml" header="What is the difference between AutoML and QuickML?" %}}
AutoML enables you to easily analyse a set of training data and generate predictive analytics on the dataset without requiring you to be involved in the complex ML training process that involves selecting the right ML algorithms to train the model, preprocessing or profiling the data, or managing the models. Catalyst implements the required model training, and automates the entire process for you.

QuickML, on the other hand, provides you more control in managing ML and data operations, and lets you build, test, deploy, and monitor effective ML models end-to-end. You will be able to perform a host of data preprocessing and transformation operations, pick the ML algorithms for training, and design the pipeline exactly as you need, all with no coding involved.
{{%/expand%}} {{%/expand-container%}}

{{% expand-container %}} {{% expand id="data_security" header="How safe is my data in QuickML? Will the data be used to increase the QuickML platform's quality?" %}}
User data is encrypted and stored safely by adhering to all the security measures applied in Zoho. The data uploaded will not be used to improve the accuracy of any of the QuickML platform's algorithms. It is only used to train the intended customer model that the user configures.
{{%/expand%}} {{%/expand-container%}}

{{% expand-container %}} {{% expand id="dataset_between_different_projects" header="Is it possible to access a dataset in one Catalyst project from another project?" %}}
Catalyst projects are intended to provide a clear isolation between the data they handle. It is not possible to access the data of one project from another.
{{%/expand%}} {{%/expand-container%}}

<!-- {{% expand-container %}} {{% expand id="quick_billing" header="How are billing prices calculated for model training and retraining?" %}} Model training cost is calculated based on the below metrics. **Training cost** = Memory cost + CPU usage cost\ **CPU usage cost** = Average CPU Usage per Second * Cost per vCPU second\ **Memory usage cost** = Memory used / GB-second #### Cost per unit: **vCPU second** = 0.03 Rupees\ **1 GB per second cost** = 0.01 Rupees {{%/expand%}} {{%/expand-container%}} -->

# Datasets

{{% expand-container %}} {{% expand id="data_quality" header="How is my data quality score calculated?" %}}
The data quality score is calculated using internal metrics that consider the invalid and missing values present in the dataset.
{{%/expand%}} {{%/expand-container%}}

{{% expand-container %}} {{% expand id="records_fetch" header="What is the maximum number of records that will be fetched from Zoho CRM?" %}}
Zoho CRM Bulk APIs are used to fetch a maximum of 600,000 (six lakh) records into QuickML, even if synchronization is configured.
{{%/expand%}} {{%/expand-container%}}

{{% expand-container %}} {{% expand id="dataset_pause_resume" header="Is it possible to pause and resume the synchronization for a particular dataset import?" %}}
You can edit the synchronization option of a particular dataset by choosing the None option in the Sync Frequency dropdown box on the dataset details page.
{{%/expand%}} {{%/expand-container%}}

# Pipelines

{{% expand-container %}} {{% expand id="multiple_datasets" header="Is it possible to use multiple datasets in a single pipeline?" %}}
Yes. You can use multiple datasets in a single pipeline by using the Add Dataset stage and configuring the required dataset in it.
{{%/expand%}} {{%/expand-container%}}

{{% expand-container %}} {{% expand id="data_ml_pipeline" header="What is the difference between Data Pipeline and ML Pipeline?" %}}
A data pipeline is intended to perform data pre-processing on the original dataset so that the processed dataset can be reused in future pipeline creations. Hence, a data pipeline only contains data operations, while an ML pipeline contains both data and ML operations.
{{%/expand%}} {{%/expand-container%}}

{{% expand-container %}} {{% expand id="dataset_pipeline_profile" header="Why is there a statistics difference between the dataset details page profile and the pipeline source stage profile?" %}}
The profile on the dataset details page represents the whole dataset, while the profile in the source stage represents data sampled from the original dataset.
{{%/expand%}} {{%/expand-container%}}

{{% expand-container %}} {{% expand id="target_column" header="Is it possible to change the target column of the ML pipeline?" %}}
No. Once the target column of the pipeline is saved, it cannot be changed. You will have to create a new ML pipeline with the new target column to experiment.
{{%/expand%}} {{%/expand-container%}}

{{% expand-container %}} {{% expand id="zia_features_and_customer_data" header="Are the Zia features in the QuickML pipeline builder generated using customer data?" %}}
No. Zia features are in-house ML/DL models that are pre-trained on large open-source datasets to solve common use cases, and are integrated into the builder to enhance data preprocessing capabilities.
{{%/expand%}} {{%/expand-container%}}

{{% expand-container %}} {{% expand id="long_operation_process" header="Why is my current operation processing for a long time?" %}}
All pipeline executions are queued and handled asynchronously inside QuickML, and execution takes place based on demand. In certain cases, the operation might be computationally expensive, so there may be a delay. However, the status of the pipeline execution will be updated once the execution succeeds or fails.
{{%/expand%}} {{%/expand-container%}}

# Models

{{% expand-container %}} {{% expand id="model_creation" header="How are the models created?" %}}
A model is created automatically upon the successful execution of the model pipeline. Once the models are created, you can view the details associated with each model and pipeline under the Models module.
{{%/expand%}} {{%/expand-container%}}

# Endpoints

{{% expand-container %}} {{% expand id="development_production_env" header="What is the difference between development and production environments in QuickML endpoints?" %}}
When you create an endpoint for a model, the QuickML platform automatically enables a REST API endpoint that is meant to test and verify the model's behavior, and it is free to use for 1000 invocations. After verification, the endpoint has to be published to access it in the production environment for production-grade integrations, and it is charged as per usage.
{{%/expand%}} {{%/expand-container%}}

<!-- {{% expand-container %}} {{% expand id="failed_api_billing" header="Is the failed API requests are accounted for billing?" %}} No. Only successful API invocations will be charged as per pricing tiers. {{%/expand%}} {{%/expand-container%}} -->

{{% expand-container %}} {{% expand id="modes_of_integration" header="What are the modes used to integrate QuickML endpoints with external applications?" %}}
There are two modes available to integrate QuickML endpoints.

**REST API:** Zoho OAuth authenticated [REST API](/en/quickml/help/pipeline-endpoints/) calls.\
**QuickML SDKs:** External calls can be made using the [Java](/en/sdk/java/v1/quickml/execute-quickml-endpoints/), [Python](/en/sdk/python/v1/quickml/execute-quickml-endpoints/), and [NodeJS](/en/sdk/nodejs/v2/quickml/execute-quickml-endpoints/) Catalyst SDKs.
{{%/expand%}} {{%/expand-container%}}
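For the REST mode, the sketch below illustrates what an OAuth-authenticated call could look like in Python using the requests library. This is only an illustration: the URL is a placeholder, and the exact request format and headers are defined on the Pipeline Endpoints page, so treat the request body shape shown here as an assumption.

{{%code class="language-python" scroll="set-scroll" %}}import requests

# Placeholder URL: copy the actual endpoint URL from the Catalyst console.
# The real request format is documented on the QuickML Pipeline Endpoints page.
url = "https://<your-quickml-endpoint-url>"

# Zoho APIs use the Zoho-oauthtoken authorization scheme
headers = {"Authorization": "Zoho-oauthtoken <access_token>"}

# Column names and values as per your dataset (illustrative body shape)
input_data = {
    "column_name1": "value1",
    "column_name2": "value2"
}

response = requests.post(url, json=input_data, headers=headers)
print(response.json())  # e.g. {"status": "success", "result": [...]}{{%/code%}}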