Similarity and Distance Measurements
Abstract
A summary of the distance measures and similarity measures commonly used in machine learning, and the connections between them.
Euclidean Space Distance
Euclidean distance
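For two points \(x, y \in \mathbb{R}^n\), the standard definition is

\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]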
Manhattan distance
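The sum of absolute coordinate differences:

\[ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]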
Minkowski distance
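A family of distances parameterized by \(p \ge 1\); \(p = 1\) recovers the Manhattan distance and \(p = 2\) the Euclidean distance:

\[ d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \]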
Chebyshev distance
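The limit of the Minkowski distance as \(p \to \infty\):

\[ d(x, y) = \max_{i} |x_i - y_i| \]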
Hamming distance
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
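As a minimal NumPy sketch of the distances above (the function names are my own, not from any particular library; SciPy offers equivalents in `scipy.spatial.distance`):

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def chebyshev(x, y):
    """Limiting case p -> infinity: the largest coordinate difference."""
    return np.max(np.abs(x - y))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(ai != bi for ai, bi in zip(a, b))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, p=1))           # Manhattan: 5.0
print(minkowski(x, y, p=2))           # Euclidean: ~3.606
print(chebyshev(x, y))                # 3.0
print(hamming("karolin", "kathrin"))  # 3
```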
Mahalanobis Distance
The basic idea of the Mahalanobis distance is to rotate the coordinate system of the original data so that the dimensions become as linearly uncorrelated as possible, then rescale each dimension so that its variance becomes 1; the Euclidean distance computed on the transformed data is the Mahalanobis distance.
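Equivalently, with \(\Sigma\) the covariance matrix of the data,

\[ d_M(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)} \]

A minimal NumPy sketch, assuming the rows of a hypothetical data matrix `X` are the observations and that the estimated covariance is invertible:

```python
import numpy as np

def mahalanobis(x, y, X):
    """Mahalanobis distance between x and y, with covariance estimated from X.

    X: (n_samples, n_features) data matrix whose rows are observations.
    Equivalent to whitening the data (rotate + rescale to unit variance)
    and then taking the Euclidean distance.
    """
    cov = np.cov(X, rowvar=False)    # (n_features, n_features)
    cov_inv = np.linalg.inv(cov)     # assumes cov is non-singular
    diff = x - y
    return np.sqrt(diff @ cov_inv @ diff)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
print(mahalanobis(X[0], X[1], X))
```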
Similarity
Cosine similarity
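For nonzero vectors, the cosine of the angle between them:

\[ S_C(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}} \]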
Pearson Correlation Coefficient
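\[ \rho_{x, y} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}} \]

which is exactly the cosine similarity of the mean-centered vectors.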
Jaccard similarity coefficient
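For two finite sets \(A\) and \(B\):

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]

A minimal NumPy sketch of the three similarity measures above (function names are my own):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two nonzero vectors; range [-1, 1]."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    """Pearson correlation = cosine similarity of the mean-centered vectors."""
    return cosine_similarity(x - np.mean(x), y - np.mean(y))

def jaccard(a, b):
    """Jaccard coefficient of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.1])
print(cosine_similarity(x, y))        # close to 1: nearly parallel vectors
print(pearson(x, y))                  # close to 1: nearly linear relationship
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2/4 = 0.5
```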
KL-divergence
KL divergence is a measure of how one probability distribution diverges from a second expected probability distribution.
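For discrete distributions \(P\) and \(Q\) over the same support:

\[ D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \]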
Note that the KL divergence is not symmetric, i.e. in general \(D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)\). We can define the symmetrized divergence

\[ D(P, Q) = \frac{1}{2}\left( D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P) \right) \]

so that it satisfies symmetry.
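A minimal NumPy sketch of both quantities (helper names are my own):

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(i) = 0 contribute 0 by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def symmetric_kl(p, q):
    """Symmetrized KL: average of the two directions."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.5, 0.2])
print(kl_divergence(p, q))  # differs from kl_divergence(q, p)
print(kl_divergence(q, p))
print(symmetric_kl(p, q))   # symmetric in its arguments
```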
Note
Distance measures the lack of similarity, while similarity measures resemblance. Some authors prefer the term ‘dissimilarity’ instead of distance.
A distance satisfies three conditions: reflexivity, symmetry, and the triangle inequality. (Consider three points a, b, and c describing a triangle in a 2D space.)
- reflexivity: \(d(a, a) = 0\)
- symmetry: \(d(a, b) = d(b, a)\)
- triangle inequality: \(d(a, c) \le d(a, b) + d(b, c)\)
Similarity is a measure of the resemblance between data sets. Similarity only satisfies the symmetry condition. The similarity of a vector to itself is 1, \(S(a, a) = 1\). Similarity can be negative, while distance only takes non-negative values. We can arithmetically average, add, or subtract distances to compute new distances, but we cannot do the same with similarities.
A similarity can be transformed into a distance (dissimilarity) using standard tricks, for example \(d = 1 - s\) for a similarity \(s \in [0, 1]\), or \(d = -\log s\) for \(s \in (0, 1]\); a small sketch follows.
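A minimal sketch of the first trick applied to cosine similarity (the function name is my own, not a library API):

```python
import numpy as np

def cosine_distance(x, y):
    """1 - cosine similarity: 0 for parallel vectors, 2 for opposite vectors."""
    sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - sim

print(cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 1.0
```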