ML-Distance Measure

Distance measures are key to clustering. "Similarity" and "dissimilarity" are also commonly used terms.

There are numerous distance functions, designed for:

- Different types of data
  - Numeric data
  - Nominal data
- Different specific applications

Minkowski Distance

We denote the distance between two data points (vectors) $x_{i}$ and $x_{j}$, each with $r$ attributes, by $dist(x_{i}, x_{j})$. The Minkowski distance of order $h$ is

$$dist(x_{i}, x_{j}) = \left( |x_{i1} - x_{j1}|^{h} + |x_{i2} - x_{j2}|^{h} + \dots + |x_{ir} - x_{jr}|^{h} \right)^{1/h}$$

The most commonly used special cases are the Euclidean distance and the Manhattan (city block) distance.

If $h = 1$ , it is the Manhattan distance

$$dist(x_{i}, x_{j}) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}|$$

If $h = 2$ , it is the Euclidean distance

$$dist(x_{i}, x_{j}) = \sqrt{(x_{i1} - x_{j1})^{2} + (x_{i2} - x_{j2})^{2} + \dots + (x_{ir} - x_{jr})^{2}}$$

Weighted Euclidean distance

$$dist(x_{i}, x_{j}) = \sqrt{w_{1} (x_{i1} - x_{j1})^{2} + w_{2} (x_{i2} - x_{j2})^{2} + \dots + w_{r} (x_{ir} - x_{jr})^{2}}$$

Chebychev distance:

$$dist(x_{i}, x_{j}) = \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \dots, |x_{ir} - x_{jr}|)$$
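The formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `minkowski`, `weighted_euclidean` and `chebyshev`, and the sample vectors, are assumptions made for this example.

```python
import math
from typing import Sequence


def minkowski(x: Sequence[float], y: Sequence[float], h: float) -> float:
    """Minkowski distance of order h between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def weighted_euclidean(x: Sequence[float], y: Sequence[float],
                       w: Sequence[float]) -> float:
    """Euclidean distance with one weight per attribute."""
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y)))


def chebyshev(x: Sequence[float], y: Sequence[float]) -> float:
    """Largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical sample points with r = 3 attributes.
x_i = [1.0, 2.0, 3.0]
x_j = [4.0, 6.0, 3.0]

manhattan = minkowski(x_i, x_j, 1)  # 3 + 4 + 0
euclidean = minkowski(x_i, x_j, 2)  # sqrt(9 + 16 + 0)
print(manhattan, euclidean, chebyshev(x_i, x_j))
```

With all weights set to 1, `weighted_euclidean` reduces to the ordinary Euclidean distance.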

Distance functions for binary attributes

Simple Matching Coefficient (SMC)

Definition: Measures the proportion of matching elements (both 1s and 0s) between two binary vectors.
Formula:
$SMC = \frac{TP + TN}{TP + TN + FP + FN}$
Where:
- $T P$ : True Positives (both vectors are 1).
- $T N$ : True Negatives (both vectors are 0).
- $F P$ : False Positives (vector 1 is 1, vector 2 is 0).
- $F N$ : False Negatives (vector 1 is 0, vector 2 is 1).
Range: $SMC \in [0, 1]$, where 1 indicates perfect matching.
Use Case: Suitable when both 1s and 0s carry equal importance.

Jaccard Similarity Coefficient

Definition: Measures the similarity between two binary vectors by considering only the matches for 1s. Ignores 0s.
Formula:
$Jaccard = \frac{TP}{TP + FP + FN}$
Where:
- $T P$ : True Positives (both vectors are 1).
- $F P$ : False Positives (vector 1 is 1, vector 2 is 0).
- $F N$ : False Negatives (vector 1 is 0, vector 2 is 1).
Range: $Jaccard \in [0, 1]$, where 1 indicates perfect similarity.
Use Case: Ideal for sparse data or cases where 1s are more significant than 0s.

Hamming Distance

Definition: Measures the total number of differing bits between two binary vectors. It counts mismatched positions.
Formula:
$HammingDistance = \sum_{i = 1}^{n} |x_{i1} - x_{i2}|$
Where $x_{i1}$ and $x_{i2}$ are the $i$-th elements of the first and second vector, and $n$ is the vector length.
Range: $Hamming \in [0, n]$, where 0 indicates no differences (identical vectors).
Use Case: Suitable for measuring the difference between binary strings or vectors.

Comparison Table

| Measure | Formula | Focus | Range | Best Use Case |
| --- | --- | --- | --- | --- |
| SMC | $\frac{TP + TN}{TP + TN + FP + FN}$ | Matches for `1`s and `0`s | $[0, 1]$ | Equal importance for `1`s and `0`s |
| Jaccard | $\frac{TP}{TP + FP + FN}$ | Matches for `1`s only | $[0, 1]$ | Sparse data, where `1`s matter more |
| Hamming | $\sum \lvert x_{i1} - x_{i2} \rvert$ | Mismatched positions | $[0, n]$ | Binary strings or sequences |

Example for Binary Vectors

Given two binary vectors:

$$x_{1} = [1, 1, 0, 1, 0], \quad x_{2} = [1, 0, 0, 1, 1]$$

Simple Matching Coefficient (SMC):
- $TP = 2$, $TN = 1$, $FP = 1$, $FN = 1$
- $SMC = \frac{TP + TN}{TP + TN + FP + FN} = \frac{2 + 1}{2 + 1 + 1 + 1} = 0.6$
Jaccard Similarity:
- $TP = 2$, $FP = 1$, $FN = 1$
- $Jaccard = \frac{TP}{TP + FP + FN} = \frac{2}{2 + 1 + 1} = 0.5$
Hamming Distance:
- Mismatched positions: index 2 and index 5 (one mismatch each).
- $HammingDistance = 2$
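The worked example can be checked with a short Python sketch. The helper names (`binary_counts`, `smc`, `jaccard`, `hamming`) are assumptions for illustration. Returning 1.0 from `jaccard` when both vectors are all zeros is also an assumption of this sketch, since the formula is undefined in that case.

```python
from typing import Sequence, Tuple


def binary_counts(a: Sequence[int], b: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for two equal-length binary vectors."""
    tp = sum(1 for p, q in zip(a, b) if p == 1 and q == 1)
    tn = sum(1 for p, q in zip(a, b) if p == 0 and q == 0)
    fp = sum(1 for p, q in zip(a, b) if p == 1 and q == 0)
    fn = sum(1 for p, q in zip(a, b) if p == 0 and q == 1)
    return tp, tn, fp, fn


def smc(a: Sequence[int], b: Sequence[int]) -> float:
    tp, tn, fp, fn = binary_counts(a, b)
    return (tp + tn) / (tp + tn + fp + fn)


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    tp, _, fp, fn = binary_counts(a, b)
    denom = tp + fp + fn
    # Assumption: two all-zero vectors are treated as identical.
    return tp / denom if denom else 1.0


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    return sum(abs(p - q) for p, q in zip(a, b))


x1 = [1, 1, 0, 1, 0]
x2 = [1, 0, 0, 1, 1]
print(smc(x1, x2), jaccard(x1, x2), hamming(x1, x2))  # 0.6 0.5 2
```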

Distance functions for nominal attributes

Nominal attributes are attributes with more than two states or values.

The commonly used distance measure for them is also based on the simple matching method.

Given two data points $x_{i}$ and $x_{j}$ , let the number of attributes be $r$ , and the number of values that match in $x_{i}$ and $x_{j}$ be $q$ .

$$dist(x_{i}, x_{j}) = \frac{r - q}{r}$$
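A minimal Python sketch of this formula follows. The function name `nominal_distance` and the sample attribute values are hypothetical, chosen only to illustrate the calculation.

```python
from typing import Hashable, Sequence


def nominal_distance(x_i: Sequence[Hashable], x_j: Sequence[Hashable]) -> float:
    """Simple-matching distance (r - q) / r for nominal attributes."""
    if len(x_i) != len(x_j):
        raise ValueError("data points must have the same number of attributes")
    r = len(x_i)                                     # number of attributes
    q = sum(1 for a, b in zip(x_i, x_j) if a == b)   # number of matching values
    return (r - q) / r


# Hypothetical data points with r = 4 nominal attributes.
p1 = ["red", "small", "round", "matte"]
p2 = ["red", "large", "round", "glossy"]

d = nominal_distance(p1, p2)  # q = 2 matches, so (4 - 2) / 4
print(d)  # 0.5
```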

Distance Function for Text Documents

This section explains how text documents are represented and how distances or similarities between them are measured.

Representing Text Documents

Definition:
- A text document consists of a sequence of sentences, and each sentence is a sequence of words.
Simplification:
- To simplify, a document is usually represented as a Bag of Words (BOW) in document clustering.
- In the Bag of Words model:
  - The sequence and position of words are ignored.
  - Focus is on word occurrence or frequency.
Vector Representation:
- A document is converted into a vector, where each dimension corresponds to a specific term (word), and the value represents the frequency or presence of the term.

Example:

| Term | Document 1 | Document 2 |
| --- | --- | --- |
| aid | 0 | 1 |
| back | 1 | 0 |
| dog | 1 | 0 |
| men | 0 | 1 |
| ... | ... | ... |
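As a rough illustration of the Bag of Words representation, the sketch below turns two short texts into term-count vectors over a fixed vocabulary. The texts, the vocabulary list and the function name `bag_of_words` are assumptions made up for this example; only the four terms come from the table above.

```python
from collections import Counter
from typing import List


def bag_of_words(text: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary term occurs; word order is ignored."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and documents.
vocabulary = ["aid", "back", "dog", "men"]
doc1 = "the dog came back"
doc2 = "men brought aid"

v1 = bag_of_words(doc1, vocabulary)
v2 = bag_of_words(doc2, vocabulary)
print(v1)  # [0, 1, 1, 0]
print(v2)  # [1, 0, 0, 1]
```

Words outside the vocabulary (such as "the") are simply dropped here. A real pipeline would usually also handle punctuation and stop words.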

Measuring Distance or Similarity

Similarity vs. Distance:

Instead of using distance, it is common to use similarity to compare text documents.
Most Common Similarity Measure:
- Cosine Similarity: Measures the cosine of the angle between two vectors.

Cosine Similarity:

Formula: $CosineSimilarity = \frac{\vec{v_{1}} \cdot \vec{v_{2}}}{\| \vec{v_{1}} \| \, \| \vec{v_{2}} \|}$
Where:
- $\vec{v_{1}}$ and $\vec{v_{2}}$ : Vector representations of two documents.
- $∥ \vec{v_{1}} ∥$ : The Euclidean norm of vector $\vec{v_{1}}$ .
Range:
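- $CosineSimilarity \in [0, 1]$ for term-frequency vectors, whose components are non-negative.
- $1$: The vectors point in the same direction (the documents have the same relative term frequencies).
- $0$: The vectors are orthogonal (the documents share no terms).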
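The cosine similarity formula can be sketched as follows. The function name `cosine_similarity` and the sample vectors are assumptions for illustration. Raising an error for an all-zero vector is a design choice of this sketch, since the formula is undefined there.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Hypothetical term-count vectors.
no_overlap = cosine_similarity([0, 1, 1, 0], [1, 0, 0, 1])  # no shared terms
same_mix = cosine_similarity([1, 2, 0], [2, 4, 0])          # same proportions
print(no_overlap, same_mix)
```

The second pair shows that a document and a copy twice as long score (up to rounding) 1, because only the direction of the vectors matters.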

Data Standardization

In the Euclidean space, standardization of attributes is recommended so that all attributes can have equal impact on the computation of distances.

Standardize attributes: to force the attributes to have a common value range

Interval-scaled attributes

Their values are real numbers following a linear scale.

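There are two main approaches to standardizing interval-scaled attributes: range and z-score. In the formulas below, $f$ is an attribute and $x_{if}$ is the value of attribute $f$ for data point $x_{i}$.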

Range Standardization

$$range(x_{if}) = \frac{x_{if} - \min(f)}{\max(f) - \min(f)}$$
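A minimal sketch of range standardization for a single attribute is shown below. The function name `range_standardize` and the sample values are hypothetical. Rejecting a constant attribute (where the maximum equals the minimum) is an assumption of this sketch, because the formula would divide by zero.

```python
from typing import List, Sequence


def range_standardize(values: Sequence[float]) -> List[float]:
    """Map the values of one attribute onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("attribute is constant; range standardization is undefined")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of one attribute f across four data points.
ages = [20.0, 30.0, 40.0, 60.0]
scaled = range_standardize(ages)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```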

Z-Score Standardization

Z-score: transforms the attribute values so that they have a mean of zero and a mean absolute deviation of 1. The mean absolute deviation of attribute $f$ , denoted by $S_{f}$ , is computed as follows.

$$S_{f} = \frac{1}{n} \left( |x_{1f} - m_{f}| + |x_{2f} - m_{f}| + \dots + |x_{nf} - m_{f}| \right)$$

$$m_{f} = \frac{1}{n} \left( x_{1f} + x_{2f} + \dots + x_{nf} \right)$$

Z-score:

$$z(x_{if}) = \frac{x_{if} - m_{f}}{S_{f}}$$
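The z-score defined here divides by the mean absolute deviation $S_{f}$ rather than by the standard deviation that many libraries use by default, so a hand-written sketch may help. The function name `z_score_mad` and the sample values are assumptions for illustration.

```python
from typing import List, Sequence


def z_score_mad(values: Sequence[float]) -> List[float]:
    """Z-scores of one attribute, using the mean absolute deviation S_f."""
    n = len(values)
    m_f = sum(values) / n                        # mean of attribute f
    s_f = sum(abs(v - m_f) for v in values) / n  # mean absolute deviation
    if s_f == 0:
        raise ValueError("attribute is constant; z-score is undefined")
    return [(v - m_f) / s_f for v in values]


# Hypothetical values of one attribute f: m_f = 5, S_f = 2.
values = [2.0, 4.0, 6.0, 8.0]
z = z_score_mad(values)
print(z)  # [-1.5, -0.5, 0.5, 1.5]
```

The standardized values have mean 0 and mean absolute deviation 1, as the definition requires.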
