ML-Cluster Distance Measure

Measuring the distance between two clusters

There are several ways to measure the distance between two clusters.

Each choice results in a different variation of the algorithm.

Single link
Complete link
Average link
Centroids

Single Link

Definition: Distance between two clusters is defined as the shortest distance between any two points in the clusters.
Formula: $d_{single}(A, B) = \min \{ d(x, y) : x \in A, y \in B \}$
Characteristics:
- Forms "chain-like" clusters, suitable for finding non-convex shapes.
- Disadvantage: Sensitive to noise and outliers.
Use Case: Suitable for datasets where the clusters are non-convex.
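As a minimal sketch of the single-link formula above, assuming Euclidean distance for $d$ and clusters stored as lists of coordinate tuples (the function name `single_link` and the sample points are illustrative, not part of the original notes):

```python
import math

def single_link(A, B):
    """Shortest Euclidean distance between any point in A and any point in B."""
    return min(math.dist(x, y) for x in A for y in B)

# Illustrative clusters on a line.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (9.0, 0.0)]

# The closest cross-cluster pair is (1, 0) and (4, 0).
print(single_link(A, B))
```

Because only the single closest pair matters, one stray point lying between two clusters is enough to pull them together, which is where the sensitivity to noise comes from.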

Complete Link

Definition: Distance between two clusters is defined as the longest distance between any two points in the clusters.
Formula: $d_{complete}(A, B) = \max \{ d(x, y) : x \in A, y \in B \}$
Characteristics:
- Creates compact clusters with limited spread.
- Disadvantage: May over-split clusters; not suitable for complex distributions.
Use Case: Preferred when tight clusters are required.
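The complete-link formula can be sketched the same way, again assuming Euclidean distance and clusters given as lists of coordinate tuples (`complete_link` and the sample points are illustrative names and data):

```python
import math

def complete_link(A, B):
    """Longest Euclidean distance between any point in A and any point in B."""
    return max(math.dist(x, y) for x in A for y in B)

# Illustrative clusters on a line.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (9.0, 0.0)]

# The farthest cross-cluster pair is (0, 0) and (9, 0).
print(complete_link(A, B))
```

Since the distance is set by the farthest pair, two clusters only merge when all of their points are reasonably close, which tends to keep clusters compact.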

Average Link

Definition: Distance between two clusters is defined as the average of all pairwise distances between points in the two clusters.
Formula: $d_{average} (A, B) = \frac{1}{| A | \cdot | B |} \sum_{x \in A} \sum_{y \in B} d (x, y)$
Characteristics:
- Balances single link and complete link approaches.
- Less prone to chaining than single link and less sensitive to outliers than complete link.
Use Case: Suitable for evenly distributed data and balanced clustering.
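A minimal sketch of the average-link formula, assuming Euclidean distance and non-empty clusters given as lists of coordinate tuples (`average_link` and the sample points are illustrative):

```python
import math

def average_link(A, B):
    """Mean Euclidean distance over all |A| * |B| cross-cluster pairs."""
    total = sum(math.dist(x, y) for x in A for y in B)
    return total / (len(A) * len(B))

# Illustrative clusters on a line.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (9.0, 0.0)]

# Pairwise distances are 4, 9, 3 and 8, so the mean falls between
# the single-link (minimum) and complete-link (maximum) values.
print(average_link(A, B))
```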

Centroids

Definition: Distance between two clusters is defined as the distance between their centroids (average points).
Formula: $d_{centroids}(A, B) = \| c_{A} - c_{B} \|$, where $c_{A}$ and $c_{B}$ are the centroids of clusters $A$ and $B$.
Characteristics:
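- Simple to compute but may cause "reversals": because centroids move after each merge, a later merge can occur at a smaller distance than an earlier one, so merge distances are not guaranteed to increase.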
- Disadvantage: Not suitable for complex shapes.
Use Case: Effective for spherical or isotropic clusters.
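The sketch below computes the centroid distance and then illustrates the "reversal" effect on three hand-picked points. It assumes Euclidean distance and clusters given as lists of coordinate tuples; the helper names and the data are illustrative only.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def centroid_distance(A, B):
    """Euclidean distance between the centroids of clusters A and B."""
    return math.dist(centroid(A), centroid(B))

# Three points forming a nearly equilateral triangle (illustrative data).
p1, p2, p3 = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# p1 and p2 are the closest pair, so they are merged first.
first_merge = centroid_distance([p1], [p2])
# The merged cluster's centroid (1, 0) is then compared with p3.
second_merge = centroid_distance([p1, p2], [p3])

print(first_merge, second_merge)
print(second_merge < first_merge)  # the later merge happens at a smaller distance
```

In this example the second merge distance is smaller than the first, which is the kind of non-monotonic behaviour that single, complete, and average link avoid.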

Comparison Table

Method	Advantage	Disadvantage	Suitable Use Case
Single Link	Detects non-convex clusters	Sensitive to noise and outliers	Chain-like, non-convex clusters
Complete Link	Creates tight clusters	Struggles with complex distributions	Compact, tight clustering
Average Link	Balances between single and complete	Higher computational cost	Balanced clustering
Centroids	Computationally efficient	Not suitable for irregular clusters	Spherical, fast computation

Time complexity

All the algorithms are at least $O(n^{2})$, where $n$ is the number of data points.

Single link can be done in $O(n^{2})$.
Complete and average links can be done in $O(n^{2} \log n)$.
Due to this complexity, the methods are hard to use on large data sets. Common remedies:
- Sampling
- Scale-up methods (e.g., BIRCH).
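To see where the quadratic lower bound comes from, the sketch below builds the table of all pairwise distances that these methods start from. It assumes Euclidean distance; `pairwise_distances` and the sample data are illustrative names, not part of the original notes.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Map each unordered index pair (i, j), i < j, to its Euclidean distance."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }

# 100 illustrative points on a line.
points = [(float(i), 0.0) for i in range(100)]
table = pairwise_distances(points)

# The table holds n * (n - 1) / 2 entries, which grows quadratically with n.
print(len(table))
```

Doubling the number of points roughly quadruples the table, which is why sampling or a scale-up method is typically needed for large data sets.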
