BDA-Find Similar Items
For High dimensional data
- LSH
- Clustering
- Dimensionality Reduction
Finding Similar Items
- Similarity of documents
- Plagiarism
- Mirror pages
- Articles from the same source
- Collaborative filtering
- Online purchase
- Movie ratings
Distance Measures
Goal: Find near-neighbors in high-dim. space
We formally define “near neighbors” as points that are a “small distance” apart
For each application, we first need to define what “distance” means
Jaccard 相似度 (Jaccard Similarity)
The Jaccard Similarity measures the similarity between two sets and is defined as the size of the intersection divided by the size of the union of the sets.
Finding Similar Documents
Goal: Given a large number (𝑵 in the millions or billions) of documents, find “near duplicate” pairs
Applications:
- Mirror websites, or approximate mirrors
- Don’t want to show both in search results
- Similar news articles at many news sites
Problems:
- Many small pieces of one document can appear out of order in another
- Too many documents to compare all pairs
- Documents are so large or so many that they cannot fit in main memory
3 Essential Steps for Similar Docs
- Shingling: Convert documents to sets
- Min-Hashing: Convert large sets to short signatures, while preserving similarity
- Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
Candidate pairs!
Minhash/LSH
Convert large sets to short signatures, while preserving similarity.
Encoding Sets as Bit Vectors
Many similarity problems can be formalized as finding subsets that have significant intersection
- Encode sets using 0/1 (bit, boolean) vectors
- One dimension per element in the universal set
- Interpret set intersection as bitwise AND, and set union as bitwise OR
Example:
- Size of intersection = 3; size of union = 4,
- Jaccard similarity (not distance) = 3/4
- Distance: d(C1 ,C2 ) = 1 – (Jaccard similarity) = 1/4
From Sets to Boolean Matrices
Rows = elements (shingles)
Columns = sets (documents)
- 1 in row e and column s if and only if e is a member of s
- Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
Typical matrix is sparse!
Each document is a column:
Example:
So far:
- Documents
Sets of shingles - Represent sets as boolean vectors in a matrix
Next goal: Find similar columns while computing small signatures
- Similarity of columns == similarity of signatures