Home

Welcome to the smart-match wiki!

Usage

Default method LE(Levenshtein): It is also called edit distance, which is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

import smart_match
print(smart_match.similarity('hello', 'hero'))
print(smart_match.dissimilarity('hello', 'hero'))
print(smart_match.distance('hello', 'hero'))

Output:

0.6
0.4
2
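To make the definition concrete, here is a minimal standalone sketch of the edit-distance recurrence. It is not smart_match's own code; the function names are hypothetical, and the normalization (1 minus distance over the longer length) is an assumption chosen because it reproduces the 0.6 shown above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("hello", "hero"))  # 2
print(round(levenshtein_similarity("hello", "hero"), 4))  # 0.6
```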

To switch to another method, call smart_match.use() with its name:

ED(EuclideanDistance): It calculates the Euclidean distance between the two strings.

smart_match.use('ED')
print(smart_match.distance('hello', 'hero'))

Output:

0.34921514788478913
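As a rough standalone illustration (not the library's code), the sketch below treats each string as a vector of character counts. The function names are hypothetical, and the normalization by sqrt(len(a)^2 + len(b)^2) is only an inference from the value shown above, not documented behavior.

```python
import math
from collections import Counter


def euclidean_distance(a: str, b: str) -> float:
    """Euclidean distance between the character-count vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    return math.sqrt(sum((ca[ch] - cb[ch]) ** 2 for ch in ca.keys() | cb.keys()))


def normalized_euclidean_distance(a: str, b: str) -> float:
    """Assumed normalization: divide by sqrt(len(a)^2 + len(b)^2)."""
    scale = math.sqrt(len(a) ** 2 + len(b) ** 2)
    return 0.0 if scale == 0 else euclidean_distance(a, b) / scale


# 'hello' and 'hero' differ by two 'l' and one 'r': sqrt(2^2 + 1^2) = sqrt(5)
print(round(euclidean_distance("hello", "hero"), 4))             # 2.2361
print(round(normalized_euclidean_distance("hello", "hero"), 4))  # 0.3492
```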

DL(Damerau Levenshtein): In addition to the Levenshtein edits, it counts the transposition of two adjacent characters as a single edit with cost 1.
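To illustrate the transposition rule, here is a minimal standalone sketch of the restricted variant (optimal string alignment). This is an assumption: the source does not say whether smart_match uses the restricted or the unrestricted form, and `osa_distance` is a hypothetical name.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("hello", "ehllo"))  # 1: 'he' -> 'eh' is a single transposition
```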

smart_match.use('DL')
print(smart_match.distance('hello', 'ehllo'))

Output:

BD(Block Distance): It focuses on the differences in the alphabet without considering the order.
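A minimal standalone sketch of the idea, assuming the "alphabet" means per-character counts and that the distance is the unnormalized L1 (city-block) difference between them. `block_distance` is a hypothetical name, and smart_match may normalize its result differently.

```python
from collections import Counter


def block_distance(a: str, b: str) -> int:
    """Sum of absolute differences between per-character counts (order ignored)."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[ch] - cb[ch]) for ch in ca.keys() | cb.keys())


print(block_distance("hello", "ehllo"))  # 0: same characters, different order
print(block_distance("hello", "hero"))   # 3: two extra 'l', one missing 'r'
```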

smart_match.use('BD')
print(smart_match.distance('hello', 'ehllo'))

Output:

cos(Cosine Similarity): It measures the cosine of the angle between two strings projected in a multi-dimensional space. Mathematically

$cos(X, Y) = \frac{X \cdot Y}{\|X\| \|Y\|}$

smart_match.use('cos')
print(smart_match.similarity('hello', 'hero'))

Output:

0.5669467095138409
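The formula can be sketched in plain Python by taking X and Y to be character-count vectors. That choice of vector is an assumption, made because it reproduces the value shown above; `cosine_similarity` is a hypothetical standalone helper, not the library's function.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the character-count vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# dot = 3 (h, e, o), |X| = sqrt(7), |Y| = 2, so the result is 3 / (2 * sqrt(7))
print(round(cosine_similarity("hello", "hero"), 4))  # 0.5669
```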

TC(TanimotoCoefficient): Tanimoto coefficient is similar to Cosine similarity, but the occurrence of an entry will be taken into consideration.

smart_match.use('TC')
print(smart_match.similarity('test', 'test string1'))

Output:

0.5773502691896257

dice(Dice Similarity): The similarity between two strings s1 and s2 is twice the number of character pairs that are common to both strings divided by the sum of the number of character pairs in the two strings. It is intended to be applied to discrete data, so the occurrence of an entry will be ignored. Mathematically

$dice(X, Y) = \frac{2|X \cap Y|}{|X|+|Y|}$

smart_match.use('dice')
print(smart_match.similarity('hello', 'hero'))

Output:

0.75
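The formula can be sketched as follows. Treating X and Y as sets of single characters is an assumption, chosen because it reproduces the 0.75 shown above; `dice_similarity` is a hypothetical standalone helper.

```python
def dice_similarity(a: str, b: str) -> float:
    """2 * |X ∩ Y| / (|X| + |Y|) over the sets of characters in a and b."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty strings are treated as identical
    return 2 * len(sa & sb) / (len(sa) + len(sb))


# X = {h, e, l, o}, Y = {h, e, r, o}: 2 * 3 / (4 + 4)
print(dice_similarity("hello", "hero"))  # 0.75
```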

simon(Simon White): The similarity between two strings s1 and s2 is twice the number of character pairs that are common to both strings divided by the sum of the number of character pairs in the two strings. The occurrence of an entry will be taken into consideration.

smart_match.use('simon')
print(smart_match.similarity('hello', 'hollow'))

Output:

0.7272727272727273
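A minimal sketch of the occurrence-aware variant, assuming the entries are single characters counted with multiplicity (this reproduces the value shown above). `simon_white_similarity` is a hypothetical standalone helper.

```python
from collections import Counter


def simon_white_similarity(a: str, b: str) -> float:
    """Dice coefficient over character multisets (repeated characters count)."""
    ca, cb = Counter(a), Counter(b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    common = sum((ca & cb).values())  # multiset intersection: min of counts
    return 2 * common / total


# common = h, l, l, o (4 entries); sizes are 5 and 6: 2 * 4 / 11
print(round(simon_white_similarity("hello", "hollow"), 4))  # 0.7273
```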

jac(Jaccard): The Jaccard coefficient is defined as the ratio between the intersection size and the union size of two strings/sets. Mathematically

$jaccard(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$

smart_match.use('jac')
print(smart_match.similarity('hello', 'helo'))
print(smart_match.similarity('hello', 'hero'))
print(smart_match.similarity('hello world', 'hello world hello world'))

Output:

1
0.6
1.0
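The three results above can be reproduced with a short standalone sketch, assuming X and Y are the sets of characters in each string. `jaccard_similarity` is a hypothetical helper, not the library's function.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """|X ∩ Y| / |X ∪ Y| over the sets of characters in a and b."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 if not union else len(sa & sb) / len(union)


print(jaccard_similarity("hello", "helo"))  # 1.0: both reduce to {h, e, l, o}
print(jaccard_similarity("hello", "hero"))  # 0.6: 3 shared out of 5 distinct
print(jaccard_similarity("hello world", "hello world hello world"))  # 1.0
```

Because sets discard repetitions, repeating a string does not change its score.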

gjac(GeneralizedJaccard): Like the Jaccard coefficient, it is the ratio between the intersection size and the union size of two strings/sets. Unlike the Jaccard method, it takes the number of occurrences of each entry into account.

smart_match.use('gjac')
print(smart_match.similarity('hello', 'helo'))
print(smart_match.similarity('hello', 'hero'))
print(smart_match.similarity('hello world', 'hello world hello world'))

Output:

0.8
0.5
0.4782608695652174
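A minimal sketch of the occurrence-aware version, assuming character multisets where the intersection takes the minimum count and the union the maximum count of each character. `generalized_jaccard` is a hypothetical standalone helper.

```python
from collections import Counter


def generalized_jaccard(a: str, b: str) -> float:
    """Multiset Jaccard: sum of min counts divided by sum of max counts."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())  # max of counts per character
    if union == 0:
        return 1.0  # two empty strings are treated as identical
    return sum((ca & cb).values()) / union  # min of counts per character


print(generalized_jaccard("hello", "helo"))  # 0.8: 4 shared out of 5
print(generalized_jaccard("hello", "hero"))  # 0.5: 3 shared out of 6
# 11 shared characters out of 23, roughly 0.478
print(round(generalized_jaccard("hello world", "hello world hello world"), 3))
```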

OC(OverlapCoefficient): The Overlap coefficient is a similarity measure that measures the overlap between two finite strings/sets. Mathematically

$overlap(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}$

smart_match.use('OC')
print(smart_match.similarity('hello', 'hero'))

Output:

0.75
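The formula can be sketched as follows, assuming X and Y are the sets of characters in each string (this reproduces the 0.75 shown above). `overlap_coefficient` is a hypothetical standalone helper.

```python
def overlap_coefficient(a: str, b: str) -> float:
    """|X ∩ Y| / min(|X|, |Y|) over the sets of characters in a and b."""
    sa, sb = set(a), set(b)
    smaller = min(len(sa), len(sb))
    if smaller == 0:
        return 1.0 if sa == sb else 0.0  # avoid dividing by zero
    return len(sa & sb) / smaller


# X = {h, e, l, o}, Y = {h, e, r, o}: 3 / min(4, 4)
print(overlap_coefficient("hello", "hero"))  # 0.75
```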

GOC(GeneralizedOverlapCoefficient): Like the Overlap coefficient, it measures the overlap between two finite strings/sets. Unlike the OverlapCoefficient method, it takes the number of occurrences of each entry into account.

smart_match.use('GOC')
print(smart_match.similarity('hello', 'hollow'))

Output:

0.8
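A minimal sketch of the occurrence-aware overlap, assuming character multisets (this reproduces the 0.8 shown above). `generalized_overlap` is a hypothetical standalone helper.

```python
from collections import Counter


def generalized_overlap(a: str, b: str) -> float:
    """Multiset overlap: shared characters divided by the shorter length."""
    ca, cb = Counter(a), Counter(b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 1.0 if a == b else 0.0  # avoid dividing by zero
    return sum((ca & cb).values()) / smaller


# shared = h, l, l, o (4 entries); shorter string has 5 characters
print(generalized_overlap("hello", "hollow"))  # 0.8
```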

LCST(LongestCommonSubstring): This similarity is based on the longest string that is a contiguous substring of both strings.

smart_match.use('LCST')
print(smart_match.similarity('hello', 'low'))

Output:

0.4
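A minimal dynamic-programming sketch of the idea. Dividing by the length of the longer string is an assumption, chosen because it reproduces the 0.4 shown above; both function names are hypothetical.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)  # run lengths ending at the previous char of a
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            run = previous[j - 1] + 1 if ca == cb else 0
            current.append(run)
            best = max(best, run)
        previous = current
    return best


def lcst_similarity(a: str, b: str) -> float:
    """Assumed normalization: divide by the length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else longest_common_substring(a, b) / longest


print(longest_common_substring("hello", "low"))  # 2 ('lo')
print(lcst_similarity("hello", "low"))           # 0.4
```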

LCSQ(LongestCommonSubsequence): This similarity is based on the longest sequence of characters that appears in both strings in the same order, not necessarily contiguously.

smart_match.use('LCSQ')
print(smart_match.similarity('hello', 'hill'))

Output:

0.6
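A minimal dynamic-programming sketch of the idea. Dividing by the length of the longer string is an assumption, chosen because it reproduces the 0.6 shown above; both function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)  # extend the subsequence
            else:
                current.append(max(previous[j], current[j - 1]))  # skip one char
        previous = current
    return previous[-1]


def lcsq_similarity(a: str, b: str) -> float:
    """Assumed normalization: divide by the length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


print(lcs_length("hello", "hill"))  # 3 ('hll')
print(round(lcsq_similarity("hello", "hill"), 4))  # 0.6
```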

HD(HammingDistance): Hamming distance is the number of positions at which the corresponding characters of two strings differ. The two strings must be the same length.
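As a standalone illustration of the definition (not the library's code, and `hamming_distance` is a hypothetical name), the distance can be computed by comparing the strings position by position:

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which a and b hold different characters."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


print(hamming_distance("12211", "11111"))  # 2: the second and third characters differ
```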

smart_match.use('HD')
print(smart_match.similarity('12211','11111'))

Output:

jaro(Jaro Similarity): The Jaro Similarity of two given strings is

$sim(x, y)=\begin{cases} 0 & \text{if m = 0}\\ \frac{1}{3}(\frac{m}{|x|} + \frac{m}{|y|} + \frac{m-t}{m}) & \text{otherwise} \end{cases}$

in which |x| is the length of string x, m is the number of matching characters, and t is half the number of transpositions.

smart_match.use('jaro')
print(smart_match.similarity('CRATE','TRACE'))

Output:

0.7333333333333334
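The formula can be sketched as follows. The matching window of max(|x|, |y|) // 2 - 1 positions is the conventional choice and an assumption here, since the source does not state it; `jaro_similarity` is a hypothetical standalone helper.

```python
def jaro_similarity(x: str, y: str) -> float:
    """Jaro similarity following the piecewise formula above."""
    if x == y:
        return 1.0
    if not x or not y:
        return 0.0
    window = max(max(len(x), len(y)) // 2 - 1, 0)  # assumed matching window
    x_hit = [False] * len(x)
    y_hit = [False] * len(y)
    m = 0
    for i, cx in enumerate(x):
        for j in range(max(0, i - window), min(len(y), i + window + 1)):
            if not y_hit[j] and y[j] == cx:
                x_hit[i] = y_hit[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    xs = [c for c, hit in zip(x, x_hit) if hit]
    ys = [c for c, hit in zip(y, y_hit) if hit]
    t = sum(cx != cy for cx, cy in zip(xs, ys)) / 2  # half the out-of-order matches
    return (m / len(x) + m / len(y) + (m - t) / m) / 3


# m = 3 (R, A, E), t = 0: (3/5 + 3/5 + 1) / 3
print(round(jaro_similarity("CRATE", "TRACE"), 4))  # 0.7333
```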

JW(JaroWinkler Similarity): The JaroWinkler Similarity raises the Jaro score of strings that share a common prefix. Here l is the length of the exactly matching prefix and p is the prefix scale that controls how much that prefix is rewarded. Mathematically

$JD(x, y) = jaro(x, y) + lp(1-jaro(x, y))$

smart_match.use('JW')
print(smart_match.similarity('TRATE', 'TRACE'))

Output:

0.9066666666666667
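A minimal standalone sketch of the prefix adjustment. The values p = 0.1 and a prefix cap of 4 characters are the conventional choices and are assumptions here (they reproduce the value shown above); the function names are hypothetical.

```python
def jaro(x: str, y: str) -> float:
    """Jaro similarity with the conventional matching window."""
    if x == y:
        return 1.0
    if not x or not y:
        return 0.0
    window = max(max(len(x), len(y)) // 2 - 1, 0)
    x_hit, y_hit = [False] * len(x), [False] * len(y)
    m = 0
    for i, cx in enumerate(x):
        for j in range(max(0, i - window), min(len(y), i + window + 1)):
            if not y_hit[j] and y[j] == cx:
                x_hit[i] = y_hit[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    xs = [c for c, hit in zip(x, x_hit) if hit]
    ys = [c for c, hit in zip(y, y_hit) if hit]
    t = sum(cx != cy for cx, cy in zip(xs, ys)) / 2
    return (m / len(x) + m / len(y) + (m - t) / m) / 3


def jaro_winkler(x: str, y: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """jaro + l * p * (1 - jaro), with l capped at max_prefix (assumed values)."""
    base = jaro(x, y)
    l = 0
    for cx, cy in zip(x[:max_prefix], y[:max_prefix]):
        if cx != cy:
            break
        l += 1
    return base + l * p * (1 - base)


# jaro = 13/15, shared prefix 'TRA' gives l = 3
print(round(jaro_winkler("TRATE", "TRACE"), 4))  # 0.9067
```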

NW(NeedlemanWunch): Applies the Needleman-Wunsch global alignment algorithm to calculate the similarity between two strings.

smart_match.use('NW')
print(smart_match.similarity('test string1', 'test string2'))

Output:

0.9583333333333334

SW(SmithWaterman): Applies the Smith-Waterman algorithm to calculate the similarity between two strings. Mathematically

$score_{i, j} = \max \begin{cases} 0 \\ score_{i-1, j-1} + compare(s_i, t_j) \\ \max_{1 \leq k \leq i}(score_{i-k, j} + gap^* + (k-1) \times gap) \\ \max_{1 \leq k \leq j}(score_{i, j-k} + gap^* + (k-1) \times gap) \\ \end{cases}$

in which

$compare(s_i, t_j) = \begin{cases} match & \text{if } s_i = t_j \\ mismatch & \text{otherwise} \end{cases}$

smart_match.use('SW')
print(smart_match.similarity('Web Aplications', 'Web Application Development With PHP'))

Output:

0.8666666666666667

SWG(SmithWatermanGotoh): Applies the Smith-Waterman algorithm to calculate the similarity between two strings. This implementation uses optimizations proposed by Osamu Gotoh. Mathematically

$score_{i, j} = \max \begin{cases} 0 \\ score_{i-1, j-1} + compare(s_i, t_j) \\ score_{i-1, j} + gap \\ score_{i, j-1} + gap \\ \end{cases}$

in which

$compare(s_i, t_j) = \begin{cases} match & \text{if } s_i = t_j \\ mismatch & \text{otherwise} \end{cases}$

smart_match.use('SWG')
print(smart_match.similarity('GGTTGACTA', 'TGTTACGG'))

Output:

0.3125
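The recurrence above can be sketched directly in Python. The scoring values (match = 3, mismatch = -3, gap = -2) are hypothetical, picked only for illustration; smart_match's own parameters and the normalization that yields 0.3125 are not stated in the source, so this sketch returns the raw best local-alignment score.

```python
def smith_waterman_score(s: str, t: str, match: float = 3.0,
                         mismatch: float = -3.0, gap: float = -2.0) -> float:
    """Best local alignment score under a linear gap penalty."""
    best = 0.0
    previous = [0.0] * (len(t) + 1)  # row score_{i-1, *}
    for cs in s:
        current = [0.0]              # column 0 is always 0
        for j, ct in enumerate(t, start=1):
            compare = match if cs == ct else mismatch
            cell = max(
                0.0,                        # start a new local alignment
                previous[j - 1] + compare,  # align s_i with t_j
                previous[j] + gap,          # gap in t
                current[j - 1] + gap,       # gap in s
            )
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# Best local alignment: GTTGAC vs GTT-AC (5 matches, 1 gap) = 5 * 3 - 2
print(smith_waterman_score("GGTTGACTA", "TGTTACGG"))  # 13.0
```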

ME(MongeElkan): The Monge-Elkan similarity measure is a hybrid similarity measure that combines the benefits of sequence-based and set-based methods. It takes two collections of strings and uses another similarity method as the inner method to score each pair of strings across the two collections. The inner method can be passed as the second argument to smart_match.use().

smart_match.use('ME')
print(smart_match.similarity(['Hello', 'world'], ['Hero', 'world']))
smart_match.use('ME', 'cos')
print(smart_match.similarity(['Hello', 'world'], ['Hero', 'world']))

Output:

0.8
0.7834733547569204
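A minimal standalone sketch of the scheme, assuming the usual Monge-Elkan definition: for each string in the first collection take its best inner score against the second collection, then average. The inner method here is a normalized Levenshtein similarity, which is an assumption that reproduces the 0.8 shown above; all names are hypothetical.

```python
from typing import Callable, Sequence


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - edit distance / length of the longer string (assumed normalization)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - previous[-1] / longest


def monge_elkan(xs: Sequence[str], ys: Sequence[str],
                inner: Callable[[str, str], float] = levenshtein_similarity) -> float:
    """Average, over xs, of the best inner similarity against any string in ys."""
    if not xs or not ys:
        return 0.0
    return sum(max(inner(x, y) for y in ys) for x in xs) / len(xs)


# 'Hello' best matches 'Hero' (0.6); 'world' matches 'world' (1.0)
print(round(monge_elkan(["Hello", "world"], ["Hero", "world"]), 4))  # 0.8
```

Passing a different `inner` callable plays the same role as the second argument in `smart_match.use('ME', 'cos')`.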

Authors

Jiaying Wang
Jing Shan
Kaiwei Li
Xiuzi Zhang
YuQiang Feng
XianFeng Du
Zifan Guo
JingLin Wu
Mingyang Shao
Yaxin Li
Xueqing Xin
