ML-Cluster Distance Measure
Measuring the distance of two clusters
A few ways to measure the distance of two clusters.
Results in different variations of the algorithm.
- Single link
- Complete link
- Average link
- Centroids
Single Link
- Definition: Distance between two clusters is defined as the shortest distance between any two points in the clusters.
- Formula:
- Characteristics:
- Forms "chain-like" clusters, suitable for finding non-convex shapes.
- Disadvantage: Sensitive to noise and outliers.
- Use Case: Suitable for datasets where the clusters are non-convex.
Complete Link
- Definition: Distance between two clusters is defined as the longest distance between any two points in the clusters.
- Formula:
- Characteristics:
- Creates compact clusters with limited spread.
- Disadvantage: May over-split clusters; not suitable for complex distributions.
- Use Case: Preferred when tight clusters are required.
Average Link
- Definition: Distance between two clusters is defined as the average of all pairwise distances between points in the two clusters.
- Formula:
- Characteristics:
- Balances single link and complete link approaches.
- Robust to noise compared to single link but less than complete link.
- Use Case: Suitable for evenly distributed data and balanced clustering.
Centroids
- Definition: Distance between two clusters is defined as the distance between their centroids (average points).
- Formula:
Where and are the centroids of clusters and . - Characteristics:
- Simple to compute but may cause "reversal" (merged clusters may separate due to centroid movement).
- Disadvantage: Not suitable for complex shapes.
- Use Case: Effective for spherical or isotropic clusters.
Comparison Table
Method | Advantage | Disadvantage | Suitable Use Case |
---|---|---|---|
Single Link | Detects non-convex clusters | Sensitive to noise and outliers | Chain-like, non-convex clusters |
Complete Link | Creates tight clusters | Struggles with complex distributions | Compact, tight clustering |
Average Link | Balances between single and complete | Higher computational cost | Balanced clustering |
Centroids | Computationally efficient | Not suitable for irregular clusters | Spherical, fast computation |
Time complexity
All the algorithms are at least
- Single link can be done in
. - Complete and average links can be done in
. - Due the complexity, hard to use for large data sets.
- Sampling
- Scale-up methods (e.g., BIRCH).