Clustering Algorithms
Clustering is an unsupervised machine learning technique that groups similar data points together based on their characteristics or patterns. Instead of relying on predefined labels, clustering uncovers natural structures in the data, making it especially useful for exploratory analysis and segmentation. Let’s dive deeper into how clustering works and explore the different types of clustering algorithms that can be used to build effective clustering models.
Below are the clustering algorithms supported in Catalyst QuickML, along with their clustering type, explanation, real-world use cases, and key parameters.
Centroid-Based Algorithms
K-Means
Explanation: K-Means partitions data into k clusters by minimizing the variance within each cluster, measured as the sum of squared Euclidean distances between data points and their cluster centroids. It works by iteratively performing two steps: (1) assigning each data point to the nearest centroid (hard assignment), and (2) recalculating centroids as the mean of all points in the cluster. This process repeats until assignments no longer change or a maximum number of iterations is reached.
Mathematical formula:
min(C1, C2, ..., Ck) Σ(i=1 to k) Σ(x ∈ Ci) ||x − μi||²
Where:
- Ci = cluster i
- μi = centroid (mean) of cluster i
- ||x − μi||² = squared Euclidean distance between data point x and centroid μi
Use case: Retail companies use K-Means to segment customers into groups like “budget,” “regular,” and “premium” based on spending habits.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| n_clusters | Defines how many clusters the algorithm should create and, accordingly, how many centroids will be generated. Choosing the right value is critical, as too few clusters may oversimplify the data while too many may lead to overfitting. | int | [1, n_rows] | 8 |
| init | Specifies the method used to initialize the centroids before the clustering iterations begin. | string | {'k-means++', 'random'} | 'k-means++' |
| max_iter | Determines the maximum number of iterations the K-Means algorithm will perform in a single run. | int | [1, ∞) | 300 |
| algorithm | Indicates the specific variant of the K-Means algorithm to use. | string | {'lloyd', 'elkan'} | 'lloyd' |
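For reference, here is a minimal illustrative sketch of K-Means using scikit-learn. The synthetic "spending" data and parameter choices are assumptions for demonstration only, not QuickML's internal implementation.

```python
# A minimal K-Means sketch on synthetic data (scikit-learn >= 1.1 for algorithm="lloyd").
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three artificial "spending" groups standing in for budget, regular, and premium customers.
X = np.vstack([
    rng.normal(loc=20, scale=5, size=(100, 2)),
    rng.normal(loc=60, scale=8, size=(100, 2)),
    rng.normal(loc=120, scale=10, size=(100, 2)),
])

km = KMeans(n_clusters=3, init="k-means++", max_iter=300, algorithm="lloyd", random_state=0)
labels = km.fit_predict(X)      # hard assignment of each point to exactly one cluster
print(km.cluster_centers_)      # one centroid (mean) per cluster
print(km.inertia_)              # the minimized within-cluster sum of squared distances
```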
MiniBatchKMeans
Explanation: MiniBatchKMeans is a faster and more memory-efficient version of K-Means that processes small random subsets (mini-batches) of the data at each iteration instead of the entire dataset. By using these mini-batches to update centroids, it converges much faster while giving results close to standard K-Means. This makes it especially useful for large datasets or streaming data, where running full K-Means would be too slow or resource-intensive.
Mathematical formula:
min(C1, C2, ..., Ck) Σ(i=1 to k) Σ(x ∈ Bt ∩ Ci) ||x − μi||²
Where Bt is a random mini-batch at iteration t.
Use case: E-commerce platforms with millions of users cluster browsing behavior in real time for product recommendations.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| n_clusters | Defines the number of clusters to be formed as well as the number of centroids to be generated. | int | [1, n_rows] | 8 |
| init | Specifies the method used to initialize the centroids. | string | {'k-means++', 'random'} | 'k-means++' |
| max_iter | Determines the maximum number of iterations the MiniBatchKMeans algorithm will perform in a single run. | int | [1, ∞) | 300 |
| batch_size | Size of the mini-batches. | int | [1, ∞) | 1024 |
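As an illustrative sketch, scikit-learn's MiniBatchKMeans can be fit in one call or fed batches incrementally; the random data and batch loop below are assumptions for demonstration.

```python
# MiniBatchKMeans updates centroids from small random subsets of the data.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))        # stand-in for a large dataset

mbk = MiniBatchKMeans(n_clusters=8, init="k-means++", max_iter=300,
                      batch_size=1024, random_state=0)
mbk.fit(X)                                # mini-batches are drawn internally

# Streaming-style usage: feed chunks as they arrive.
for start in range(0, len(X), 1024):
    mbk.partial_fit(X[start:start + 1024])
labels = mbk.predict(X)
```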
Fuzzy C-Means
Explanation: Fuzzy C-Means is similar to K-Means but allows soft assignments, meaning each data point can belong to multiple clusters with different degrees of membership. Instead of assigning each point to a single cluster, Fuzzy C-Means calculates a membership probability for every cluster based on the point’s distance from each centroid. This is particularly useful for data where clusters overlap, as it provides a more nuanced view of membership rather than forcing a hard choice.
Mathematical formula:
J = Σ(i=1 to k) Σ(j=1 to n) uij^m ||xj − ci||²
Where:
- uij = degree of membership of data point xj in cluster i
- m = fuzziness parameter (m>1)
- ci = centroid of cluster i
Use case: Music apps group users whose tastes overlap across multiple genres (pop, rock, electronic).
Key parameter:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| n_clusters | Defines the number of centers that should be initialized. | int | [1, n_rows] | 8 |
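The following is a from-scratch sketch of the two Fuzzy C-Means update steps (membership, then centroids), assuming Euclidean distance and fuzziness m > 1; it is for illustration rather than a library call.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, n_iter=100, seed=0):
    """Alternate membership and centroid updates for J = Σ u^m ||x − c||²."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # d[j, i] = distance of point j to centroid i (epsilon avoids division by zero).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # Membership: u[j, i] = 1 / Σ_l (d[j, i] / d[j, l])^(2/(m-1)); each row sums to 1.
        u = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        # Centroids: weighted means with weights u^m.
        w = u ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return centers, u

X = np.random.default_rng(0).normal(size=(300, 2))
centers, memberships = fuzzy_c_means(X, k=3)
print(memberships[:5].round(2))   # each row: degrees of membership across the 3 clusters
```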
K-Medians
Explanation: K-Medians works like K-Means but uses the median instead of the mean to compute cluster centers. This small change makes the algorithm more robust to outliers, since medians are less influenced by extreme values. The algorithm still alternates between assigning points to the nearest cluster center and recalculating centers, but by using medians, it produces clusters that better represent the typical data point in skewed or heavy-tailed distributions.
Mathematical formula:
min(C1, ..., Ck) Σ(i=1 to k) Σ(x ∈ Ci) ||x − mi||
Where mi is the median of all points in cluster i.
Use case: Healthcare organizations use K-Medians to group patients based on median treatment costs or hospital stay durations, ensuring clusters are not skewed by a few patients with exceptionally high bills or unusually long stays.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| n_clusters | Defines the number of clusters to be allocated. | int | [1, n_rows] | 8 |
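Below is a from-scratch sketch of the K-Medians loop, assuming Manhattan (L1) distance for assignment, which pairs naturally with coordinate-wise medians; the helper name and defaults are illustrative.

```python
import numpy as np

def k_medians(X, k, n_iter=100, seed=0):
    """Assign points to the nearest center, then recompute centers as medians."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Manhattan distances from every point to every center, shape (n, k).
        d = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            np.median(X[labels == i], axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):   # stop when centers no longer move
            break
        centers = new_centers
    return centers, labels
```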
K-Modes
Explanation: K-Modes is designed for clustering categorical data. It replaces means with modes (the most frequent category value in each cluster) and uses a simple matching dissimilarity measure to assign points to clusters. This makes it well suited for scenarios where numeric distance does not make sense, such as clustering customers by preferences or product categories. The algorithm iteratively updates cluster modes until they no longer change, producing groups defined by the most common category values.
Mathematical formula:
min(C1, ..., Ck) Σ(i=1 to k) Σ(x ∈ Ci) d(x, θi)
Where d(x,θi) is the number of mismatches between categorical attributes of x and mode θi.
Use case: Retailers cluster customers using categorical data like “prefers online/offline,” “electronics/clothing,” “discount/luxury.”
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| n_clusters | Defines the number of clusters to be formed as well as the number of centroids to be generated. | int | [1, n_rows] | 8 |
| max_iter | Determines the maximum number of iterations of the k-modes algorithm for a single run. | int | [1, ∞) | 100 |
| init | Specifies the method used for initialization. | string | {'Huang', 'Cao', 'random'} | 'Cao' |
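As an illustrative sketch, the open-source `kmodes` package provides a KModes estimator with the same init choices ('Huang', 'Cao', 'random') listed above; the toy customer data is assumed for demonstration.

```python
import numpy as np
from kmodes.kmodes import KModes

# Purely categorical customer attributes: channel, product category, price preference.
X = np.array([
    ["online",  "electronics", "discount"],
    ["offline", "clothing",    "luxury"],
    ["online",  "electronics", "luxury"],
    ["offline", "clothing",    "discount"],
])

km = KModes(n_clusters=2, init="Cao", max_iter=100)
labels = km.fit_predict(X)
print(labels)                 # cluster index per customer
print(km.cluster_centroids_)  # the mode (most frequent value) of each attribute per cluster
```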
Affinity Propagation
Explanation: Affinity Propagation clusters data by exchanging similarity messages between pairs of points until some points are identified as “exemplars,” which act as cluster centers. Unlike K-Means, you do not need to specify the number of clusters beforehand — the algorithm automatically determines how many clusters to form based on the input similarities and a “preference” parameter that controls how likely each point is to become an exemplar. This makes it especially useful when the natural number of clusters is unknown.
Use case: In customer service, it can group incoming support queries into clusters without prior knowledge of how many categories exist.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| damping | Specifies the damping factor. | float | [0.5, 1.0) | 0.5 |
| max_iter | Determines the maximum number of iterations. | int | [1, ∞) | 200 |
| convergence_iter | Specifies the number of consecutive iterations during which the estimated number of clusters remains unchanged before the algorithm declares convergence. | int | [1, ∞) | 15 |
| affinity | Determines the similarity measure used between data points. | string | {'precomputed', 'euclidean'} | 'euclidean' |
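For reference, a minimal sketch with scikit-learn's AffinityPropagation; note that no cluster count is passed in, and the synthetic data is assumed for demonstration.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2)),
               rng.normal(10, 0.5, (50, 2))])

ap = AffinityPropagation(damping=0.5, max_iter=200, convergence_iter=15,
                         affinity="euclidean", random_state=0)
labels = ap.fit_predict(X)
print(len(ap.cluster_centers_indices_))   # number of exemplars (clusters) found automatically
```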
BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies
Explanation: BIRCH is designed for large or streaming datasets. It incrementally builds a CF (Clustering Feature) tree, which stores compact summaries of data points, and uses these summaries to perform clustering. The CF tree allows BIRCH to quickly group incoming data without having to store or process the full dataset at once. After building the tree, a final global clustering step (often using K-Means) can refine the clusters. This approach makes BIRCH extremely memory-efficient and scalable.
Use case: In IoT applications, Birch can cluster continuous streams of sensor data (like smart city traffic monitoring) into meaningful groups in real time.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| threshold | Defines the maximum radius of a subcluster when merging with the closest existing subcluster. | float | (0, 1.0) | 0.5 |
| branching_factor | Determines the maximum number of CF subclusters in each node. | int | [1, ∞) | 50 |
| n_clusters | Determines the final number of clusters to form after the CF tree is built. | int, None | [1, ∞) | 3 |
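A minimal sketch with scikit-learn's Birch estimator; the chunked loop simulates a data stream being absorbed into the CF tree via partial_fit, and the data itself is an assumption for demonstration.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
stream = rng.normal(size=(10_000, 4))          # stand-in for streaming sensor readings

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
for start in range(0, len(stream), 1000):      # feed the "stream" chunk by chunk
    birch.partial_fit(stream[start:start + 1000])
labels = birch.predict(stream)                 # labels from the final global clustering step
```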
K-Prototypes
Explanation: K-Prototypes extends K-Means and K-Modes to handle datasets containing both numerical and categorical variables. It calculates cluster centers using means for numerical features and modes for categorical features, combining the two using a weighting parameter that balances their influence. This makes it possible to cluster mixed-type data without extensive preprocessing or conversion.
Mathematical formula:
d(x, y) = Σ(j ∈ num) (xj − yj)² + γ Σ(j ∈ cat) δ(xj, yj)
Where:
- δ(xj,yj)=0 if equal, else 1
- γ = weight balancing numeric and categorical distances
Use case: In HR analytics, companies can group employees based on mixed data — numerical (e.g., years of experience, salary) and categorical (e.g., department, role) — to identify workforce patterns.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| n_clusters | Specifies the number of clusters to form and the number of centroids to generate. | int | [1, n_rows] | 8 |
| max_iter | Determines the maximum number of iterations of the k-prototypes algorithm for a single run. | int | [1, ∞) | 100 |
| init | Specifies the method for initialization. | string | {'Huang', 'Cao', 'random'} | 'Cao' |
| categorical | Indicates the list of categorical columns. | list, None | List of columns | None |
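As an illustrative sketch, the open-source `kmodes` package's KPrototypes estimator accepts the indices of categorical columns via `categorical=`; the mixed HR-style data below is assumed for demonstration.

```python
import numpy as np
from kmodes.kprototypes import KPrototypes

# Mixed data: [years_of_experience, salary, department, role]
X = np.array([
    [2,  45000, "engineering", "junior"],
    [10, 90000, "engineering", "senior"],
    [3,  40000, "sales",       "junior"],
    [12, 85000, "sales",       "senior"],
], dtype=object)

kp = KPrototypes(n_clusters=2, init="Cao", max_iter=100)
labels = kp.fit_predict(X, categorical=[2, 3])   # columns 2 and 3 are categorical
print(labels)
```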
Density-Based Algorithms
MeanShift
Explanation: MeanShift finds clusters by treating data points as samples from a probability density function and iteratively shifting them toward the nearest region of highest density (mode). Points that converge to the same mode form a cluster. This approach can automatically determine the number of clusters and works well with arbitrarily shaped clusters, but its performance depends on the choice of bandwidth (the kernel size used for density estimation).
Mathematical formula:
m(x) = Σ(xi ∈ N(x)) K(xi − x) xi / Σ(xi ∈ N(x)) K(xi − x)
Where:
- K is the kernel function (e.g., Gaussian)
- N(x) is the neighborhood around x
Use case: Image segmentation where regions like “sky,” “water,” and “land” form clusters of similar pixels.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| bin_seeding | When set to True, the algorithm uses a binned version of the input data to initialize kernel locations instead of using every data point. This reduces the number of initial seeds, making the algorithm faster, especially on large datasets. | bool | {False, True} | False |
| cluster_all | Controls how unassigned points (orphans) are handled. If True, all points are forced into the nearest cluster. If False, orphan points that do not fall within any cluster are labeled as -1 (noise). | bool | {False, True} | True |
| max_iter | Specifies the maximum number of iterations allowed for each seed point. If a seed point has not converged by this limit, the process stops for that seed. | int | [1, ∞) | 300 |
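A minimal sketch with scikit-learn's MeanShift, where estimate_bandwidth picks the kernel size that the density estimate depends on; the synthetic data and quantile value are assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])

bandwidth = estimate_bandwidth(X, quantile=0.2)   # kernel size for density estimation
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True, cluster_all=True, max_iter=300)
labels = ms.fit_predict(X)
print(np.unique(labels))          # the number of clusters is discovered, not specified
```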
DBSCAN
Explanation: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together (dense regions) into clusters and labels points that lie alone in low-density regions as noise. It does not require specifying the number of clusters in advance, making it ideal for datasets with irregularly shaped clusters or when you want to identify outliers explicitly. Its performance depends heavily on the eps (neighborhood radius) and min_samples (minimum number of points in a dense region) parameters.
Mathematical formula:
|Nε(p)| ≥ minPts
Where p is a core point and Nε(p) = set of points within distance ε of p.
Use case: Banking systems cluster normal transaction patterns, while suspicious isolated ones are flagged as fraud.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| eps | Defines the maximum distance between two points for them to be considered neighbors. This parameter strongly influences the size and density of clusters, making it one of the most critical settings in DBSCAN. | float | (0, ∞) | 0.5 |
| min_samples | Specifies the minimum number of neighboring samples required (including the point itself) for a data point to be classified as a core point. It controls how dense a region must be to form a cluster. | int | [1, ∞) | 5 |
| metric | Determines the distance measure used to calculate similarity between data points. | string | {'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'minkowski'} | 'euclidean' |
| algorithm | Indicates the nearest neighbor search method used to find pointwise distances efficiently. | string | {'auto', 'ball_tree', 'kd_tree', 'brute'} | 'auto' |
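A minimal sketch with scikit-learn's DBSCAN on data containing deliberate outliers; points labeled -1 are noise. The dense blobs and uniform outliers are assumptions for demonstration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
outliers = rng.uniform(-10, 15, (10, 2))          # sparse, isolated points

X = np.vstack([dense, outliers])
db = DBSCAN(eps=0.5, min_samples=5, metric="euclidean", algorithm="auto")
labels = db.fit_predict(X)
print((labels == -1).sum())       # count of points flagged as noise/outliers
```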
Medoid-Based Algorithms
CLARA
Explanation: CLARA (Clustering Large Applications) is a scalable version of the Partitioning Around Medoids (PAM) algorithm that works by clustering multiple small random samples of the data rather than the entire dataset. It then evaluates the clustering quality on the full dataset and selects the best set of medoids. This sampling approach makes CLARA feasible for very large datasets where running PAM directly would be computationally expensive.
Mathematical formula:
Runs K-Medoids on small random samples St of the data:
min(M) Σ(x ∈ St) d(x, m(x))
Then evaluates total cost on full dataset to pick the best sample.
Use case: Telecom operators cluster millions of call records into categories like “international callers” using data samples instead of the entire dataset.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| n_clusters | Determines the number of clusters to form as well as the number of medoids to generate. | int | [1, n_rows] | 8 |
| max_iter | Specifies the maximum number of iterations when fitting. | int | [1, ∞) | 300 |
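The following is an illustrative sketch of CLARA's sample-then-evaluate idea, assuming the scikit-learn-extra KMedoids estimator for the per-sample PAM step; the helper name and sample sizes are assumptions.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

def clara(X, k, n_samples=5, sample_size=200, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        # Cluster a small random sample instead of the full dataset.
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        km = KMedoids(n_clusters=k, method="pam", random_state=0).fit(X[idx])
        medoids = km.cluster_centers_
        # Evaluate the candidate medoids on the FULL dataset and keep the best.
        cost = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```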
CLARANS
Explanation: CLARANS (Clustering Large Applications based on Randomized Search) improves on CLARA by using a randomized search technique to explore different possible medoid configurations. Instead of checking all possible swaps, it explores only a subset of neighbors at each step, which strikes a balance between efficiency and clustering quality. This makes CLARANS better at finding near-optimal solutions for large datasets without exhaustive computation.
Mathematical concept:
Iteratively searches for better medoids using randomized local swaps:
min(M) Σ(x ∈ D) d(x, m(x))
CLARANS explores random neighbor solutions rather than all possibilities, balancing speed and accuracy.
Use case: City transport authorities group bus routes by travel patterns to optimize scheduling.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| n_clusters | Determines the number of clusters to form as well as the number of medoids to generate. | int | [1, n_rows] | 8 |
| maxneighbor | Specifies the maximum number of neighbors examined. | int | [1, ∞) | 5 |
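Below is an illustrative from-scratch sketch of a CLARANS-style randomized medoid search (not a library call): only `maxneighbor` random swaps are examined before a solution is accepted, and a few random restarts guard against poor local optima.

```python
import numpy as np

def clarans_like(X, k, maxneighbor=5, numlocal=2, seed=0):
    rng = np.random.default_rng(seed)

    def cost(medoid_idx):
        # Total distance of every point to its nearest medoid.
        d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
        return d.min(axis=1).sum()

    best, best_cost = None, np.inf
    for _ in range(numlocal):                          # a few random restarts
        medoids = rng.choice(len(X), size=k, replace=False)
        current = cost(medoids)
        examined = 0
        while examined < maxneighbor:
            # Propose a random neighbor: swap one medoid for one non-medoid.
            i, candidate = rng.integers(k), rng.integers(len(X))
            if candidate in medoids:
                continue
            neighbor = medoids.copy()
            neighbor[i] = candidate
            c = cost(neighbor)
            if c < current:                            # accept improvement, reset the counter
                medoids, current, examined = neighbor, c, 0
            else:
                examined += 1
        if current < best_cost:
            best, best_cost = medoids, current
    return best, best_cost
```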
K-Medoids
Explanation: K-Medoids is similar to K-Means, but instead of using the mean of data points as the cluster center, it uses an actual data point called the medoid. This makes it more robust to noise and outliers since medoids are less influenced by extreme values compared to centroids. K-Medoids is often used when datasets contain many outliers or when interpretability of clusters through actual data points is important.
Mathematical formula:
min(M) Σ(i=1 to k) Σ(x ∈ Ci) d(x, mi)
Where:
- M={m1,…,mk} are medoids (actual data points)
- d(x,mi) is a distance metric (e.g., Euclidean, Manhattan)
Use case: In healthcare, K-Medoids can cluster patient records based on symptoms and test results, using real patient profiles as representative medoids. This ensures that clusters are not skewed by patients with unusually rare or extreme medical conditions.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| n_clusters | Determines the number of clusters to form as well as the number of medoids to generate. | int | [1, n_rows] | 8 |
| init | Specifies the medoid initialization method. | string | {'random', 'heuristic', 'k-medoids++', 'build'} | 'heuristic' |
| metric | Specifies the distance function used to measure similarity or dissimilarity between data points. | string | {'cityblock', 'cosine', 'euclidean', 'haversine', 'l2', 'l1', 'manhattan', 'nan_euclidean'} | 'euclidean' |
| method | Determines which algorithm to use. 'alternate' is faster while 'pam' is more accurate. | string | {'pam', 'alternate'} | 'alternate' |
| max_iter | Defines the maximum number of iterations the algorithm will run over the entire dataset. If the clustering process does not converge before reaching this limit, it stops and returns the current cluster assignments. | int | [1, ∞) | 300 |
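For reference, a minimal sketch assuming the scikit-learn-extra KMedoids estimator, which exposes init, metric, method, and max_iter options like those listed above; the data and the injected outlier are assumptions.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
X = np.vstack([X, [[50, 50]]])        # one extreme outlier

km = KMedoids(n_clusters=2, init="heuristic", metric="euclidean",
              method="alternate", max_iter=300)
labels = km.fit_predict(X)
print(km.cluster_centers_)            # medoids are actual rows of X, not averaged points
```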
Model-Based Algorithms
Gaussian Mixture Model (GMM)
Explanation: GMM assumes that the data are generated from a mixture of several Gaussian (normal) distributions, each representing a cluster. Unlike K-Means, which assigns each point to exactly one cluster, GMM assigns each point a probability of belonging to every cluster (soft clustering). The model parameters (means, covariances, and mixing weights) are optimized using the Expectation-Maximization (EM) algorithm. This allows GMM to capture clusters with different shapes, sizes, and orientations, making it more flexible than K-Means when data are not well separated or have elliptical distributions.
Mathematical formula:
p(x) = Σ(i=1 to k) πi 𝓝(x | μi, Σi)
Where:
- πi = weight (probability) of component i
- μi = mean vector
- Σi = covariance matrix
- 𝓝(x | μi, Σi) = multivariate Gaussian distribution of component i
Use case: In marketing analytics, GMM is used to cluster customers into overlapping groups based on behavior and spending patterns, where each customer can have a probability of belonging to multiple segments.
Key parameters:
| Parameter | Description | Data Type | Possible Values | Default Value |
|---|---|---|---|---|
| n_components | Specifies the number of mixture components. | int | [1, ∞) | 2 |
| covariance_type | Defines the shape of the clusters by specifying the structure of the covariance matrix. | string | {'full', 'tied', 'diag', 'spherical'} | 'full' |
| max_iter | Sets the maximum number of iterations for the Expectation-Maximization (EM) algorithm. | int | [1, ∞) | 100 |
| init_params | Determines how the initial parameters (weights, means, and covariances) are set before the EM algorithm starts. | string | {'kmeans', 'k-means++', 'random', 'random_from_data'} | 'kmeans' |
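A minimal sketch with scikit-learn's GaussianMixture, where predict_proba returns the soft, per-component membership probabilities described above; the synthetic data is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 2, (200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", max_iter=100,
                      init_params="kmeans", random_state=0)
gmm.fit(X)
probs = gmm.predict_proba(X)      # each row: probability of belonging to each component
hard = gmm.predict(X)             # argmax of the probabilities if a hard label is needed
print(probs[:3].round(3))
```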