Clustering Algorithms 

Clustering is an unsupervised machine learning technique that groups similar data points together based on their characteristics or patterns. Instead of relying on predefined labels, clustering uncovers natural structures in the data, making it especially useful for exploratory analysis and segmentation. Let’s dive deeper into how clustering works and explore the different types of clustering algorithms that can be used to build effective clustering models.

Below are the clustering algorithms supported in Catalyst QuickML, along with their clustering type, an explanation, real-world use cases, and key parameters.

Note: Clustering is available as an early access feature across all data centers. To use it, request access via support@zohocatalyst.com

Centroid-Based Algorithms 

K-Means

Explanation: K-Means partitions data into k clusters by minimizing the variance within each cluster, measured as the sum of squared Euclidean distances between data points and their cluster centroids. It works by iteratively performing two steps: (1) assigning each data point to the nearest centroid (hard assignment), and (2) recalculating centroids as the mean of all points in the cluster. This process repeats until assignments no longer change or a maximum number of iterations is reached.

Mathematical formula:

 min(C1, C2, ..., Ck) Σ(i=1 to k) Σ(x ∈ Ci) ||x − μi||² 

Where:

  • Ci = cluster i
  • μi = centroid (mean) of cluster i
  • ||x − μi||² = squared Euclidean distance between data point x and centroid μi

Use case: Retail companies use K-Means to segment customers into groups like “budget,” “regular,” and “premium” based on spending habits.

Key parameters:

Note: In the parameter ranges provided, square brackets [ ] indicate that the boundary value is included, while parentheses ( ) indicate that the boundary value is excluded. Example: [1, ∞) means the range starts from 1 (included) and extends to infinity (excluded), whereas (0, 1] means 0 is excluded but 1 is included.
Parameter Description Data Type Possible Values Default Value
n_clusters Defines how many clusters the algorithm should create and, accordingly, how many centroids will be generated. Choosing the right value is critical, as too few clusters may oversimplify the data while too many may lead to overfitting. int [1, n_rows] 8
init Specifies the method used to initialize the centroids before the clustering iterations begin. string {'k-means++', 'random'} 'k-means++'
max_iter Determines the maximum number of iterations the K-Means algorithm will perform in a single run. int [1, ∞) 300
algorithm Indicates the specific variant of the K-Means algorithm to use. string {'lloyd', 'elkan'} 'lloyd'
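
QuickML configures these parameters through its own interface; purely as an offline illustration, the sketch below uses scikit-learn's KMeans, whose parameter names match the table above. The synthetic spending data and the choice of three clusters are assumptions for the example, not part of QuickML.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D spending data: three loosely separated customer groups (assumed example)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[20, 5], scale=2, size=(50, 2)),    # "budget"
    rng.normal(loc=[60, 30], scale=4, size=(50, 2)),   # "regular"
    rng.normal(loc=[120, 80], scale=6, size=(50, 2)),  # "premium"
])

# n_clusters, init, max_iter, and algorithm mirror the parameters listed above
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, algorithm='lloyd', n_init=10)
labels = kmeans.fit_predict(X)      # hard assignment: one cluster label per point
print(kmeans.cluster_centers_)      # centroids (cluster means)
print(kmeans.inertia_)              # the within-cluster sum of squares being minimized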

MiniBatchKMeans   

Explanation: MiniBatchKMeans is a faster and more memory-efficient version of K-Means that processes small random subsets (mini-batches) of the data at each iteration instead of the entire dataset. By using these mini-batches to update centroids, it converges much faster while giving results close to standard K-Means. This makes it especially useful for large datasets or streaming data, where running full K-Means would be too slow or resource-intensive.

Mathematical formula:

 min(C1, C2, ..., Ck) Σ(i=1 to k) Σ(x ∈ Bt ∩ Ci) ||x − μi||² 

Where Bt is a random mini-batch at iteration t.

Use case: E-commerce platforms with millions of users cluster browsing behavior in real time for product recommendations.

Key parameters:

Parameter Description Data Type Possible Values Default Value
n_clusters Defines the number of clusters to be formed as well as the number of centroids to be generated. int [1, n_rows] 8
init Specifies the method used to initialize the centroids. string {'k-means++', 'random'} 'k-means++'
max_iter Determines the maximum number of iterations the MiniBatchKMeans algorithm will perform in a single run. int [1, ∞) 300
batch_size Size of the mini batches. int [1, ∞) 1024
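
As an illustration (not QuickML's internal code), the sketch below uses scikit-learn's MiniBatchKMeans with the parameters listed above. The synthetic data is assumed, and partial_fit is shown only to demonstrate chunk-by-chunk updates for streaming scenarios.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 8))   # large synthetic dataset (assumed)

# batch_size controls how many rows feed each centroid update
mbk = MiniBatchKMeans(n_clusters=8, init='k-means++', max_iter=300,
                      batch_size=1024, n_init=3)
labels = mbk.fit_predict(X)

# For data that arrives in chunks, a fresh estimator can instead be updated incrementally
streaming = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3)
for chunk in np.array_split(X, 100):
    streaming.partial_fit(chunk)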

Fuzzy C-Means   

Explanation: Fuzzy C-Means is similar to K-Means but allows soft assignments, meaning each data point can belong to multiple clusters with different degrees of membership. Instead of assigning each point to a single cluster, Fuzzy C-Means calculates a membership probability for every cluster based on the point’s distance from each centroid. This is particularly useful for data where clusters overlap, as it provides a more nuanced view of membership rather than forcing a hard choice.

Mathematical formula:

 J = Σ(i=1 to k) Σ(j=1 to n) uij^m ||xj − ci||² 

Where:

  • uij = degree of membership of data point xj in cluster i
  • m = fuzziness parameter (m>1)
  • ci = centroid of cluster i

Use case: Music apps group users whose tastes overlap across multiple genres (pop, rock, electronic).

Key parameter:

Parameter Description Data Type Possible Values Default Value
n_clusters Defines the number of centers that should be initialized. int [1, n_rows] 8
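
Since only n_clusters is exposed here, the following is a minimal NumPy sketch of the Fuzzy C-Means update rules themselves (centroid update followed by membership update), not QuickML's implementation. The fuzziness value m = 2, the iteration count, and the synthetic data are assumptions.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))           # synthetic data (assumed)
k, m, n_iter = 3, 2.0, 100              # clusters, fuzziness (m > 1), iterations

# Random initial memberships, normalized so each row sums to 1
U = rng.random((X.shape[0], k))
U /= U.sum(axis=1, keepdims=True)

for _ in range(n_iter):
    # Centroid update: weighted mean of points with weights u_ij^m
    W = U ** m
    centers = (W.T @ X) / W.sum(axis=0)[:, None]
    # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
    U = 1.0 / (dist ** (2 / (m - 1)))
    U /= U.sum(axis=1, keepdims=True)

hard_labels = U.argmax(axis=1)          # optional hard assignment from the soft memberships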

K-Medians   

Explanation: K-Medians works like K-Means but uses the median instead of the mean to compute cluster centers. This small change makes the algorithm more robust to outliers, since medians are less influenced by extreme values. The algorithm still alternates between assigning points to the nearest cluster center and recalculating centers, but by using medians, it produces clusters that better represent the typical data point in skewed or heavy-tailed distributions.

Mathematical formula:

 min(C1, ..., Ck) Σ(i=1 to k) Σ(x ∈ Ci) ||x − mi|| 

Where mi is the median of all points in cluster i.

Use case: Healthcare organizations use K-Medians to group patients based on median treatment costs or hospital stay durations, ensuring clusters are not skewed by a few patients with exceptionally high bills or unusually long stays.

Key parameters:

Parameter Description Data Type Possible Values Default Value
n_clusters Defines the number of clusters to be allocated. int [1, n_rows] 8
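
The sketch below illustrates the two alternating K-Medians steps with NumPy. The Manhattan (L1) distance is used here because it pairs naturally with medians; that choice, the data, and the value of k are assumptions for the example, not QuickML's internal implementation.

import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(scale=1000, size=(300, 2))   # skewed synthetic cost data (assumed)
k, n_iter = 3, 50

centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial medians
for _ in range(n_iter):
    # Assignment step: each point goes to the nearest center by L1 distance
    d = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = d.argmin(axis=1)
    # Update step: coordinate-wise median of each cluster, robust to outliers
    new_centers = np.array([np.median(X[labels == i], axis=0) if np.any(labels == i)
                            else centers[i] for i in range(k)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers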

K-Modes 

Explanation: K-Modes is designed for clustering categorical data. It replaces means with modes (the most frequent category value in each cluster) and uses a simple matching dissimilarity measure to assign points to clusters. This makes it well suited for scenarios where numeric distance does not make sense, such as clustering customers by preferences or product categories. The algorithm iteratively updates cluster modes until they no longer change, producing groups defined by the most common category values.

Mathematical formula:

 min(C1, ..., Ck) Σ(i=1 to k) Σ(x ∈ Ci) d(x, θi) 

Where d(x,θi) is the number of mismatches between categorical attributes of x and mode θi.

Use case: Retailers cluster customers using categorical data like “prefers online/offline,” “electronics/clothing,” “discount/luxury.”

Key parameters:

Parameter Description Data Type Possible Values Default Value
n_clusters Defines the number of clusters to be formed as well as the number of centroids to be generated. int [1, n_rows] 8
max_iter Determines the maximum number of iterations of the k-modes algorithm for a single run. int [1, ∞) 100
init Specifies the method used for initialization. string {'Huang', 'Cao', 'random'} 'Cao'
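
For a rough offline equivalent, the open-source kmodes package exposes the same n_clusters, init, and max_iter parameters; the sketch below assumes that package and a toy categorical dataset invented for illustration.

import numpy as np
from kmodes.kmodes import KModes

# Toy categorical customer data: channel, category, price tier (assumed example)
X = np.array([
    ['online',  'electronics', 'discount'],
    ['online',  'electronics', 'luxury'],
    ['offline', 'clothing',    'discount'],
    ['offline', 'clothing',    'luxury'],
    ['online',  'clothing',    'discount'],
    ['offline', 'electronics', 'discount'],
])

km = KModes(n_clusters=2, init='Cao', max_iter=100)
labels = km.fit_predict(X)
print(labels)
print(km.cluster_centroids_)   # the mode (most frequent value) of each attribute per cluster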

Affinity Propagation  

Explanation: Affinity Propagation clusters data by exchanging similarity messages between pairs of points until some points are identified as “exemplars,” which act as cluster centers. Unlike K-Means, you do not need to specify the number of clusters beforehand — the algorithm automatically determines how many clusters to form based on the input similarities and a “preference” parameter that controls how likely each point is to become an exemplar. This makes it especially useful when the natural number of clusters is unknown.

Use case: In customer service, it can group incoming support queries into clusters without prior knowledge of how many categories exist.

Key parameters:

Parameter Description Data Type Possible Values Default Value
damping Specifies the damping factor. float [0.5, 1.0) 0.5
max_iter Determines the maximum number of iterations. int [1, ∞) 200
convergence_iter Specifies the number of consecutive iterations during which the estimated number of clusters remains unchanged before the algorithm declares convergence. int [1, ∞) 15
affinity Determines the similarity measure used between data points. string {'precomputed', 'euclidean'} 'euclidean'
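
A minimal sketch using scikit-learn's AffinityPropagation with the parameters listed above; the synthetic blob data is an assumption. Note that the number of clusters is an output of the algorithm, not an input.

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # synthetic data (assumed)

ap = AffinityPropagation(damping=0.5, max_iter=200,
                         convergence_iter=15, affinity='euclidean', random_state=0)
labels = ap.fit_predict(X)

# The number of clusters is discovered from the data and the preference setting
print(len(ap.cluster_centers_indices_), "clusters found")
print(ap.cluster_centers_indices_)     # indices of the exemplar points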

Birch – Balanced Iterative Reducing and Clustering using Hierarchies 

Explanation: BIRCH is designed for large or streaming datasets. It incrementally builds a CF (Clustering Feature) tree, which stores compact summaries of data points, and uses these summaries to perform clustering. The CF tree allows BIRCH to quickly group incoming data without having to store or process the full dataset at once. After building the tree, a final global clustering step (often using K-Means) can refine the clusters. This approach makes BIRCH extremely memory-efficient and scalable.

Use case: In IoT applications, Birch can cluster continuous streams of sensor data (like smart city traffic monitoring) into meaningful groups in real time.

Key parameters:

Parameter Description Data Type Possible Values Default Value
threshold Defines the maximum radius of a subcluster when merging with the closest existing subcluster. float (0, 1.0) 0.5
branching_factor Determines the maximum number of CF subclusters in each node. int [1, ∞) 50
n_clusters Determines the final number of clusters to form after the CF tree is built. int, None [1, ∞) 3
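
A minimal sketch using scikit-learn's Birch with the parameters above. The synthetic chunks are assumed; partial_fit is used only to mimic a streaming scenario in which the CF tree grows incrementally.

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(3)

# threshold, branching_factor, and n_clusters mirror the parameters above
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)

# partial_fit lets the CF tree grow chunk by chunk, without holding all data in memory
for _ in range(10):
    chunk = rng.normal(size=(1_000, 4))
    birch.partial_fit(chunk)

labels = birch.predict(rng.normal(size=(100, 4)))   # assign new points to the final clusters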

K-Prototypes  

Explanation: K-Prototypes extends K-Means and K-Modes to handle datasets containing both numerical and categorical variables. It calculates cluster centers using means for numerical features and modes for categorical features, combining the two using a weighting parameter that balances their influence. This makes it possible to cluster mixed-type data without extensive preprocessing or conversion.

Mathematical formula:

 d(x, y) = Σ(j ∈ num) (xj − yj)² + γ Σ(j ∈ cat) δ(xj, yj) 

Where:

  • δ(xj, yj) = 0 if xj = yj, else 1
  • γ = weight balancing numeric and categorical distances

Use case: In HR analytics, companies can group employees based on mixed data — numerical (e.g., years of experience, salary) and categorical (e.g., department, role) — to identify workforce patterns.

Key parameters:

Parameter Description Data Type Possible Values Default Value
n_clusters Specifies the number of clusters to form and the number of centroids to generate. int [1, n_rows] 8
max_iter Determines the maximum number of iterations of the k-prototypes algorithm for a single run. int [1, ∞) 100
init Specifies the method for initialization. string {'Huang', 'Cao', 'random'} 'Cao'
categorical Indicates the list of categorical columns. list, None List of columns None
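
As a hedged illustration, the open-source kmodes package provides a KPrototypes class with the same n_clusters, init, max_iter, and categorical parameters; the toy mixed-type employee data below is assumed for the example.

import numpy as np
from kmodes.kprototypes import KPrototypes

# Mixed employee data: [years_experience, salary, department, role] (assumed example)
X = np.array([
    [2.0,  45000.0, 'engineering', 'developer'],
    [10.0, 95000.0, 'engineering', 'manager'],
    [3.0,  40000.0, 'sales',       'representative'],
    [8.0,  80000.0, 'sales',       'manager'],
    [1.0,  38000.0, 'support',     'agent'],
    [12.0, 99000.0, 'engineering', 'manager'],
], dtype=object)

kp = KPrototypes(n_clusters=2, init='Cao', max_iter=100)
# `categorical` lists the column indices holding categorical features
labels = kp.fit_predict(X, categorical=[2, 3])
print(labels)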

Density-Based Algorithms   

MeanShift   

Explanation: MeanShift finds clusters by treating data points as samples from a probability density function and iteratively shifting them toward the nearest region of highest density (mode). Points that converge to the same mode form a cluster. This approach can automatically determine the number of clusters and works well with arbitrarily shaped clusters, but its performance depends on the choice of bandwidth (the kernel size used for density estimation).

Mathematical formula:

 m(x) = Σ(xi ∈ N(x)) K(xi − x) xi / Σ(xi ∈ N(x)) K(xi − x) 

Where:

  • K is the kernel function (e.g., Gaussian)
  • N(x) is the neighborhood around x

Use case: Image segmentation where regions like “sky,” “water,” and “land” form clusters of similar pixels.

Key parameters:

Parameter Description Data Type Possible Values Default Value
bin_seeding When set to True, the algorithm uses a binned version of the input data to initialize kernel locations instead of using every data point. This reduces the number of initial seeds, making the algorithm faster, especially on large datasets. bool {False, True} False
cluster_all Controls how unassigned points (orphans) are handled. If True, all points are forced into the nearest cluster. If False, orphan points that do not fall within any cluster are labeled as -1 (noise). bool {False, True} True
max_iter Specifies the maximum number of iterations allowed for each seed point. If a seed point has not converged by this limit, the process stops for that seed. int [1, ∞) 300
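
A minimal scikit-learn sketch with the parameters above. The synthetic data is assumed, and estimate_bandwidth is included only because bandwidth (the kernel size) is the main tuning knob mentioned in the explanation.

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)   # synthetic data (assumed)

# Bandwidth strongly affects the result; estimate_bandwidth gives a data-driven starting point
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True, cluster_all=True, max_iter=300)
labels = ms.fit_predict(X)
print(ms.cluster_centers_)          # the density modes found
print(np.unique(labels))            # the number of clusters is discovered automatically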

DBSCAN   

Explanation: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together (dense regions) into clusters and labels points that lie alone in low-density regions as noise. It does not require specifying the number of clusters in advance, making it ideal for datasets with irregularly shaped clusters or when you want to identify outliers explicitly. Its performance depends heavily on the eps (neighborhood radius) and min_samples (minimum number of points in a dense region) parameters.

Mathematical formula:

 |Nε(p)| ≥ minPts 

Where p is a core point and Nε(p) = set of points within distance ε of p.

Use case: Banking systems cluster normal transaction patterns, while suspicious isolated ones are flagged as fraud.

Key parameters:

Parameter Description Data Type Possible Values Default Value
eps Defines the maximum distance between two points for them to be considered neighbors. This parameter strongly influences the size and density of clusters, making it one of the most critical settings in DBSCAN. float (0, ∞) 0.5
min_samples Specifies the minimum number of neighboring samples required (including the point itself) for a data point to be classified as a core point. It controls how dense a region must be to form a cluster. int [1, ∞) 5
metric Determines the distance measure used to calculate similarity between data points. string {'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'minkowski'} 'euclidean'
algorithm Indicates the nearest neighbor search method used to find pointwise distances efficiently. string {'auto', 'ball_tree', 'kd_tree', 'brute'} 'auto'
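
A minimal scikit-learn sketch of DBSCAN. The two-moons data, the injected outliers, and the eps value (reduced from the default to suit that data's scale) are assumptions for the example.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Irregularly shaped clusters plus a few injected outliers (assumed example)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = np.vstack([X, [[3, 3], [-3, -3]]])      # points far from any dense region

db = DBSCAN(eps=0.3, min_samples=5, metric='euclidean', algorithm='auto')
labels = db.fit_predict(X)

# Points labelled -1 fell in low-density regions and are treated as noise/outliers
print("noise points:", np.sum(labels == -1))
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))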

Medoid-Based Algorithms   

CLARA  

Explanation: CLARA (Clustering Large Applications) is a scalable version of the Partitioning Around Medoids (PAM) algorithm that works by clustering multiple small random samples of the data rather than the entire dataset. It then evaluates the clustering quality on the full dataset and selects the best set of medoids. This sampling approach makes CLARA feasible for very large datasets where running PAM directly would be computationally expensive.

Mathematical formula:

Runs K-Medoids on small random samples St of the data:

 min(M) Σ(x ∈ St) d(x, m(x)) 

Then evaluates total cost on full dataset to pick the best sample.

Use case: Telecom operators cluster millions of call records into categories like “international callers” using data samples instead of the entire dataset.

Key parameters:

Parameter Description Data Type Possible Values Default Value
n_clusters Determines the number of clusters to form as well as the number of medoids to generate. int [1, n_rows] 8
max_iter Specifies the maximum number of iterations when fitting. int [1, ∞) 300
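
QuickML does not expose CLARA's internals; purely as a sketch of the sampling idea under stated assumptions (the KMedoids class from the scikit-learn-extra package and synthetic data), the code below clusters several random samples and keeps the medoid set with the lowest total cost on the full dataset.

import numpy as np
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(4)
X = rng.normal(size=(10_000, 5))        # synthetic "large" dataset (assumed)
k, n_samples, sample_size = 8, 5, 500   # CLARA draws several small samples

best_cost, best_medoids = np.inf, None
for _ in range(n_samples):
    idx = rng.choice(len(X), size=sample_size, replace=False)
    km = KMedoids(n_clusters=k, method='pam', max_iter=300).fit(X[idx])
    medoids = km.cluster_centers_
    # Evaluate this medoid set on the FULL dataset, not just the sample
    _, dist = pairwise_distances_argmin_min(X, medoids)
    cost = dist.sum()
    if cost < best_cost:
        best_cost, best_medoids = cost, medoids

labels, _ = pairwise_distances_argmin_min(X, best_medoids)   # final assignments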

CLARANS   

Explanation: CLARANS (Clustering Large Applications based on Randomized Search) improves on CLARA by using a randomized search technique to explore different possible medoid configurations. Instead of checking all possible swaps, it explores only a subset of neighbors at each step, which strikes a balance between efficiency and clustering quality. This makes CLARANS better at finding near-optimal solutions for large datasets without exhaustive computation.

Mathematical concept:

Iteratively searches for better medoids using randomized local swaps:

 min(M) Σ(x ∈ D) d(x, m(x)) 

CLARANS explores random neighbor solutions rather than all possibilities, balancing speed and accuracy.

Use case: City transport authorities group bus routes by travel patterns to optimize scheduling.

Key parameters:

Parameter Description Data Type Possible Values Default Value
n_clusters Determines the number of clusters to form as well as the number of medoids to generate. int [1, n_rows] 8
maxneighbor Specifies the maximum number of neighbors examined. int [1, ∞) 5
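
As a hedged illustration, the open-source pyclustering package ships a clarans implementation whose arguments include the number of clusters, the number of local search restarts, and maxneighbor; the snippet below assumes that package and synthetic data.

import numpy as np
from pyclustering.cluster.clarans import clarans

rng = np.random.default_rng(5)
data = rng.normal(size=(500, 2)).tolist()   # pyclustering expects a list of points

# Arguments: data, number of clusters, numlocal (local search restarts), maxneighbor
model = clarans(data, 3, 2, 5)
model.process()

clusters = model.get_clusters()    # lists of point indices per cluster
medoids = model.get_medoids()      # the chosen medoid points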

K-Medoids  

Explanation: K-Medoids is similar to K-Means, but instead of using the mean of data points as the cluster center, it uses an actual data point called the medoid. This makes it more robust to noise and outliers since medoids are less influenced by extreme values compared to centroids. K-Medoids is often used when datasets contain many outliers or when interpretability of clusters through actual data points is important.

Mathematical formula:

 min(M) Σ(i=1 to k) Σ(x ∈ Ci) d(x, mi) 

Where:

  • M={m1,…,mk} are medoids (actual data points)
  • d(x,mi) is a distance metric (e.g., Euclidean, Manhattan)

Use case: In healthcare, K-Medoids can cluster patient records based on symptoms and test results, using real patient profiles as representative medoids. This ensures that clusters are not skewed by patients with unusually rare or extreme medical conditions.

Key parameters:

Parameter Description Data Type Possible Values Default Value
n_clusters Determines the number of clusters to form as well as the number of medoids to generate. int [1, n_rows] 8
init Specifies the medoid initialization method. string {'random', 'heuristic', 'k-medoids++', 'build'} 'heuristic'
metric Specifies the distance function used to measure similarity or dissimilarity between data points. string {'cityblock', 'cosine', 'euclidean', 'haversine', 'l2', 'l1', 'manhattan', 'nan_euclidean'} 'euclidean'
method Determines which algorithm to use. 'alternate' is faster while 'pam' is more accurate. string {'pam', 'alternate'} 'alternate'
max_iter Defines the maximum number of iterations the algorithm will run over the entire dataset. If the clustering process does not converge before reaching this limit, it stops and returns the current cluster assignments. int [1, ∞) 300
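
A minimal sketch assuming the scikit-learn-extra package, whose KMedoids class uses the same init, metric, method, and max_iter parameters as the table above; the blob data and the injected outlier are assumptions for the example.

import numpy as np
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=2)
X = np.vstack([X, [[40, 40]]])     # one extreme outlier (assumed)

km = KMedoids(n_clusters=3, init='heuristic', metric='euclidean',
              method='alternate', max_iter=300)
labels = km.fit_predict(X)

# Medoids are actual rows of X, so each cluster is represented by a real data point
print(km.cluster_centers_)
print(km.medoid_indices_)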

Model-Based Algorithms   

Gaussian Mixture Model (GMM) 

Explanation: GMM assumes that the data are generated from a mixture of several Gaussian (normal) distributions, each representing a cluster. Unlike K-Means, which assigns each point to exactly one cluster, GMM assigns probabilities for each point to belong to each cluster (soft clustering). The model parameters (means, covariances, and mixing weights) are optimized using the Expectation-Maximization (EM) algorithm. This allows GMM to capture clusters with different shapes, sizes, and orientations, making it more flexible than K-Means when data are not well separated or have elliptical distributions.

Mathematical formula:

 p(x) = Σ(i=1 to k) πi 𝓝(x | μi, Σi) 

Where:

  • πi = weight (probability) of component i
  • μi = mean vector
  • Σi = covariance matrix
  • 𝓝(x | μi, Σi) = multivariate Gaussian distribution

Use case: In marketing analytics, GMM is used to cluster customers into overlapping groups based on behavior and spending patterns, where each customer can have a probability of belonging to multiple segments.

Key parameters:

Parameter Description Data Type Possible Values Default Value
n_components Specifies the number of mixture components. int (0, ∞) 2
covariance_type Defines the shape of the clusters by specifying the structure of the covariance matrix. string {'full', 'tied', 'diag', 'spherical'} 'full'
max_iter Sets the maximum number of iterations for the Expectation-Maximization (EM) algorithm. int [1, ∞) 100
init_params Determines how the initial parameters (weights, means, and covariances) are set before the EM algorithm starts. string {'kmeans', 'k-means++', 'random', 'random_from_data'} 'kmeans'
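
A minimal scikit-learn sketch with the parameters above (the synthetic blob data is assumed), showing both the hard labels and the per-component membership probabilities that make GMM a soft-clustering method.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=3)

gmm = GaussianMixture(n_components=3, covariance_type='full',
                      max_iter=100, init_params='kmeans', random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)          # most likely component per point
soft_probs = gmm.predict_proba(X)     # probability of each point belonging to each component
print(gmm.weights_)                   # mixing weights pi_i
print(soft_probs[:3].round(3))        # soft assignments for the first few points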

Last Updated 2025-10-24 19:35:40 +0530 IST