ML-Distance Measure
Distance measures are key to clustering. “Similarity” and “dissimilarity” are also commonly used terms.
There are numerous distance functions for:
- Different types of data
  - Numeric data
  - Nominal data
- Different specific applications
Minkowski Distance
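We denote the distance between two data points (vectors) $x_i$ and $x_j$ by $\text{dist}(x_i, x_j)$, where each point has $r$ attributes.
The most commonly used functions are the Euclidean distance and the Manhattan (city block) distance. Both are special cases of the Minkowski distance, where $h$ is a positive integer:
$$\text{dist}(x_i, x_j) = \left( \sum_{k=1}^{r} |x_{ik} - x_{jk}|^{h} \right)^{1/h}$$
- If $h = 1$, it is the Manhattan distance: $\text{dist}(x_i, x_j) = \sum_{k=1}^{r} |x_{ik} - x_{jk}|$
- If $h = 2$, it is the Euclidean distance: $\text{dist}(x_i, x_j) = \sqrt{\sum_{k=1}^{r} (x_{ik} - x_{jk})^2}$
- Weighted Euclidean distance: $\text{dist}(x_i, x_j) = \sqrt{\sum_{k=1}^{r} w_k (x_{ik} - x_{jk})^2}$, with one weight $w_k$ per attribute
- Chebychev distance: $\text{dist}(x_i, x_j) = \max_{1 \le k \le r} |x_{ik} - x_{jk}|$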
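The Minkowski family can be sketched in a few lines of plain Python. The function names and the sample vectors below are illustrative assumptions, not part of any particular library; the order parameter `h` follows the usual Minkowski definition (`h = 1` Manhattan, `h = 2` Euclidean).

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def weighted_euclidean(x, y, w):
    """Euclidean distance with one non-negative weight per attribute."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y)))

def chebychev(x, y):
    """Largest absolute difference over all attributes."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical sample points: attribute differences are 3, 4 and 0.
x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, 1))   # Manhattan: 3 + 4 + 0 = 7.0
print(minkowski(x, y, 2))   # Euclidean: sqrt(9 + 16) = 5.0
print(chebychev(x, y))      # max(3, 4, 0) = 4.0
```

With all weights set to 1, `weighted_euclidean` reduces to the plain Euclidean distance.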
Distance functions for binary attributes
Simple Matching Coefficient (SMC)
Definition: Measures the proportion of matching elements (both `1`s and `0`s) between two binary vectors.
Formula:
$$\text{SMC} = \frac{TP + TN}{TP + TN + FP + FN}$$
Where:
- $TP$: True Positives (both vectors are `1`).
- $TN$: True Negatives (both vectors are `0`).
- $FP$: False Positives (vector 1 is `1`, vector 2 is `0`).
- $FN$: False Negatives (vector 1 is `0`, vector 2 is `1`).
Range: $[0, 1]$, where `1` indicates perfect matching.
Use Case: Suitable when both `1`s and `0`s carry equal importance.
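A minimal sketch of the SMC, assuming the two vectors are equal-length, non-empty Python sequences of `0`/`1` values. The helper name `binary_counts` and the sample vectors are hypothetical.

```python
def binary_counts(a, b):
    """Return (tp, tn, fp, fn) for two equal-length sequences of 0/1 values."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    tp = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both are 1
    tn = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # both are 0
    fp = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # vector 1 is 1, vector 2 is 0
    fn = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # vector 1 is 0, vector 2 is 1
    return tp, tn, fp, fn

def smc(a, b):
    """Simple Matching Coefficient: share of positions where the vectors agree."""
    tp, tn, fp, fn = binary_counts(a, b)
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical vectors: they agree at positions 0 and 2 out of 4.
print(smc([1, 0, 0, 1], [1, 1, 0, 0]))  # 0.5
```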
Jaccard Similarity Coefficient
Definition: Measures the similarity between two binary vectors by considering only the matches for `1`s. Ignores matching `0`s.
Formula:
$$J = \frac{TP}{TP + FP + FN}$$
Where:
- $TP$: True Positives (both vectors are `1`).
- $FP$: False Positives (vector 1 is `1`, vector 2 is `0`).
- $FN$: False Negatives (vector 1 is `0`, vector 2 is `1`).
Range: $[0, 1]$, where `1` indicates perfect similarity.
Use Case: Ideal for sparse data or cases where `1`s are more significant than `0`s.
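One way to see why shared `0`s are ignored is to treat each vector as the set of positions holding a `1`: the Jaccard coefficient is then the size of the intersection divided by the size of the union. The sketch below takes that view. The value returned for two all-zero vectors is an assumption, since the coefficient is undefined there and conventions differ.

```python
def jaccard(a, b):
    """Jaccard similarity of two equal-length sequences of 0/1 values."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    ones_a = {i for i, v in enumerate(a) if v == 1}
    ones_b = {i for i, v in enumerate(b) if v == 1}
    union = ones_a | ones_b          # size = TP + FP + FN
    if not union:
        return 1.0                   # assumption: two all-zero vectors count as identical
    return len(ones_a & ones_b) / len(union)  # intersection size = TP

# Hypothetical vectors: one shared 1 (position 0), three positions with any 1.
print(jaccard([1, 0, 0, 1], [1, 1, 0, 0]))
```

The shared `0` at position 2 has no effect on the result, which is the difference from the Simple Matching Coefficient.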
Hamming Distance
Definition: Measures the total number of differing bits between two binary vectors. It counts mismatched positions.
Formula:
$$d_H(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Where $x_i$ and $y_i$ are the corresponding elements in the two vectors and $n$ is the vector length.
Range: $[0, n]$, where `0` indicates no differences (identical vectors).
Use Case: Suitable for measuring the difference between binary strings or vectors.
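A short sketch of the Hamming distance in two forms: one for sequences of `0`/`1` values and one for bit strings stored as non-negative integers, where XOR marks the differing bits. Both function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))

def hamming_bits(m, n):
    """Hamming distance between two non-negative ints viewed as bit strings."""
    return bin(m ^ n).count("1")  # XOR sets a 1 wherever the bits differ

# Hypothetical inputs: the same pair written as lists and as integers.
print(hamming([1, 0, 0, 1], [1, 1, 0, 0]))  # 2
print(hamming_bits(0b1001, 0b1100))         # 2
```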
Comparison Table
Measure | Formula | Focus | Range | Best Use Case |
---|---|---|---|---|
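SMC | $\frac{TP + TN}{TP + TN + FP + FN}$ | Matches for `1`s and `0`s | $[0, 1]$ | Equal importance for `1`s and `0`s |
Jaccard | $\frac{TP}{TP + FP + FN}$ | Matches for `1`s only | $[0, 1]$ | Sparse data, where `1`s matter more |
Hamming | $\sum_{i=1}^{n} \lvert x_i - y_i \rvert$ | Mismatched positions | $[0, n]$ | Binary strings or sequences |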
Example for Binary Vectors
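Given two binary vectors $A$ and $B$ of equal length:
Simple Matching Coefficient (SMC):
- Count $TP$, $TN$, $FP$, and $FN$ over all positions, then compute $\text{SMC} = \frac{TP + TN}{TP + TN + FP + FN}$.
Jaccard Similarity:
- Use only $TP$, $FP$, and $FN$: $J = \frac{TP}{TP + FP + FN}$.
Hamming Distance:
- Mismatched positions: index 2 and index 5.
- Number of mismatched positions: 2.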
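The three measures are easier to compare with concrete numbers. The worked calculation below uses two hypothetical six-element vectors, chosen only for illustration so that they differ at positions 2 and 5 (counting from 1). The names $A$, $B$, $J$, and $d_H$ are assumptions of this sketch.

```latex
% Hypothetical vectors, for illustration only.
\begin{align*}
A &= (1, 1, 0, 1, 0, 0), \qquad B = (1, 0, 0, 1, 1, 0) \\
% Positions 1 and 4 are (1,1); positions 3 and 6 are (0,0);
% position 2 is (1,0); position 5 is (0,1).
TP &= 2, \quad TN = 2, \quad FP = 1, \quad FN = 1 \\
\text{SMC} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{2 + 2}{6} \approx 0.667 \\
J &= \frac{TP}{TP + FP + FN} = \frac{2}{4} = 0.5 \\
d_H &= FP + FN = 1 + 1 = 2
\end{align*}
```

The SMC is higher than the Jaccard coefficient here because the two shared `0`s count as matches for SMC and are ignored by Jaccard.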
Distance functions for nominal attributes
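Nominal attributes: attributes with more than two states or values.
The commonly used distance measure is also based on the simple matching method.
Given two data points $x_i$ and $x_j$, let the number of attributes be $r$ and the number of attribute values that match between $x_i$ and $x_j$ be $q$:
$$\text{dist}(x_i, x_j) = \frac{r - q}{r}$$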
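A minimal sketch of a simple-matching distance for nominal data, assuming the usual form `(r - q) / r`, where `r` is the number of attributes and `q` is the number of attributes on which the two points agree. The sample records are hypothetical.

```python
def nominal_distance(x, y):
    """Simple-matching distance: fraction of attributes whose values differ."""
    if len(x) != len(y):
        raise ValueError("data points must have the same number of attributes")
    r = len(x)                                   # number of attributes
    q = sum(1 for a, b in zip(x, y) if a == b)   # number of matching values
    return (r - q) / r

# Hypothetical records with attributes (colour, size, shape).
p1 = ("red", "small", "round")
p2 = ("red", "large", "round")
print(nominal_distance(p1, p2))  # one of three attributes differs
```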
Distance Function for Text Documents
This section explains how text documents are represented and how distances or similarities between them are measured.
Representing Text Documents
- Definition:
  - A text document consists of a sequence of sentences, and each sentence is a sequence of words.
- Simplification:
  - To simplify, a document is usually represented as a Bag of Words (BOW) in document clustering.
  - In the Bag of Words model:
    - The sequence and position of words are ignored.
    - The focus is on word occurrence or frequency.
- Vector Representation:
  - A document is converted into a vector, where each dimension corresponds to a specific term (word), and the value represents the frequency or presence of the term.
Example:
Term | Document 1 | Document 2 |
---|---|---|
aid | 0 | 1 |
back | 1 | 0 |
dog | 1 | 0 |
men | 0 | 1 |
... | ... | ... |
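The conversion from text to term vectors can be sketched with the standard library alone. The two sample sentences are hypothetical, and tokenisation by lowercasing and splitting on whitespace is a simplifying assumption; real pipelines usually also strip punctuation and stop words.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring word order and position."""
    return Counter(text.lower().split())

def to_vectors(texts):
    """Return a sorted vocabulary and one term-frequency vector per document."""
    bags = [bag_of_words(t) for t in texts]
    vocab = sorted(set().union(*bags))            # one dimension per distinct term
    vectors = [[bag[term] for term in vocab] for bag in bags]  # missing terms count as 0
    return vocab, vectors

# Hypothetical documents, for illustration only.
docs = ["the dog came back", "men came to aid the men"]
vocab, vectors = to_vectors(docs)
print(vocab)       # ['aid', 'back', 'came', 'dog', 'men', 'the', 'to']
print(vectors[0])  # [0, 1, 1, 1, 0, 1, 0]
print(vectors[1])  # [1, 0, 1, 0, 2, 1, 1]
```

Replacing each count with `1` when it is positive gives the presence-only variant of the representation.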
Measuring Distance or Similarity
Similarity vs. Distance:
- Instead of using distance, it is common to use similarity to compare text documents.
- Most Common Similarity Measure:
- Cosine Similarity: Measures the cosine of the angle between two vectors.
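Cosine Similarity:
- Formula: $\cos(d_1, d_2) = \dfrac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
  - Where $d_1$ and $d_2$ are the vector representations of two documents, and $\|d\|$ is the Euclidean norm of vector $d$.
- Range: $[0, 1]$ for term vectors, whose entries are non-negative.
  - $1$: Documents are identical in their term proportions (the vectors point in the same direction).
  - $0$: Documents are completely dissimilar (they share no terms).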
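A minimal sketch of cosine similarity on plain Python lists, computed as the dot product divided by the product of the Euclidean norms. Returning `0.0` for an all-zero vector is an assumption made here to avoid division by zero; the measure is undefined in that case.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical term-frequency vectors.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])  # close to 1: same term proportions
no_overlap = cosine_similarity([1, 0], [0, 1])            # 0.0: no shared terms
print(same_direction, no_overlap)
```

Because only the angle matters, a document and a copy of it repeated twice score as identical, which is usually the desired behaviour when documents differ in length.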
Data Standardization
In the Euclidean space, standardization of attributes is recommended so that all attributes can have equal impact on the computation of distances.
Standardize attributes: to force the attributes to have a common value range
Interval-scaled attributes
Their values are real numbers following a linear scale.
There are two main approaches to standardizing interval-scaled attributes: range and z-score.
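Range Standardization
$$\text{range}(x_{if}) = \frac{x_{if} - \min(f)}{\max(f) - \min(f)}$$
where $x_{if}$ is the value of attribute $f$ for data point $i$, and $\min(f)$ and $\max(f)$ are the minimum and maximum values of $f$ in the data set. The transformed values lie in $[0, 1]$.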
Z-Score Standardization
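Z-score: transforms the attribute values so that they have a mean of zero and a mean absolute deviation of 1. The mean absolute deviation of attribute $f$, denoted $s_f$, is computed from the attribute mean $m_f$ over the $n$ data points:
$$m_f = \frac{1}{n} \sum_{i=1}^{n} x_{if}, \qquad s_f = \frac{1}{n} \sum_{i=1}^{n} |x_{if} - m_f|$$
Z-score:
$$z(x_{if}) = \frac{x_{if} - m_f}{s_f}$$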
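Both standardizations can be sketched for a single attribute column. The sketch assumes min-max scaling for the range approach and follows the text in dividing by the mean absolute deviation rather than the standard deviation for the z-score. The function names and the sample values are hypothetical.

```python
def range_standardize(values):
    """Rescale an attribute to [0, 1] using its minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("attribute is constant; range is zero")
    return [(v - lo) / (hi - lo) for v in values]

def z_score_mad(values):
    """Centre on the mean and divide by the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n                          # attribute mean
    s = sum(abs(v - m) for v in values) / n      # mean absolute deviation
    if s == 0:
        raise ValueError("attribute is constant; deviation is zero")
    return [(v - m) / s for v in values]

# Hypothetical attribute values: mean 35, mean absolute deviation 10.
ages = [20.0, 30.0, 40.0, 50.0]
print(range_standardize(ages))
print(z_score_mad(ages))  # [-1.5, -0.5, 0.5, 1.5]
```

After the z-score transform the values have mean 0 and mean absolute deviation 1, so attributes measured on very different scales contribute comparably to a Euclidean distance.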