Home
Welcome to the smart-match wiki!
- Default method LE(Levenshtein): Also called edit distance, this is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
import smart_match
print(smart_match.similarity('hello', 'hero'))
print(smart_match.dissimilarity('hello', 'hero'))
print(smart_match.distance('hello', 'hero'))
Output:
0.6
0.4
2
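To make the definition concrete, here is a minimal pure-Python sketch of the edit distance. It does not use smart_match itself: the function names are hypothetical, and normalizing by the length of the longer string is an assumption chosen because it reproduces the outputs shown above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein('hello', 'hero'))             # 2
print(levenshtein_similarity('hello', 'hero'))  # 0.6
```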
- Change to the other methods:
ED(EuclideanDistance): It calculates the Euclidean distance between the two strings.
smart_match.use('ED')
print(smart_match.distance('hello', 'hero'))
Output:
0.34921514788478913
DL(Damerau Levenshtein): It extends the Levenshtein distance by counting the transposition of two adjacent characters as a single edit with cost 1.
smart_match.use('DL')
print(smart_match.distance('hello', 'ehllo'))
Output:
1
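The effect of the transposition rule can be sketched as follows. This is the "optimal string alignment" variant of Damerau-Levenshtein, which is an assumption: the wiki does not say which variant smart_match implements, and the function name is hypothetical.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance in which swapping two adjacent characters costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent transposition: "he" <-> "eh" counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(damerau_levenshtein('hello', 'ehllo'))  # 1
```

Plain Levenshtein would need two substitutions for the same pair, so the transposition rule is what brings the distance down to 1.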
BD(Block Distance): It focuses on the differences in which characters the two strings contain, without considering their order.
smart_match.use('BD')
print(smart_match.distance('hello', 'ehllo'))
Output:
0
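One common reading of block distance is the L1 (city block) distance between character-count vectors. The sketch below assumes that reading; the function name is hypothetical and the library's exact definition may differ.

```python
from collections import Counter


def block_distance(a: str, b: str) -> int:
    """Sum of absolute differences between per-character counts."""
    ca, cb = Counter(a), Counter(b)
    # A Counter returns 0 for characters it has not seen.
    return sum(abs(ca[ch] - cb[ch]) for ch in ca.keys() | cb.keys())


print(block_distance('hello', 'ehllo'))  # 0: same characters, different order
print(block_distance('hello', 'hero'))   # 3: two extra 'l', one missing 'r'
```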
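cos(Cosine Similarity): It measures the cosine of the angle between two strings projected into a multi-dimensional vector space. Mathematically, for the vectors A and B that represent the two strings,
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$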
smart_match.use('cos')
print(smart_match.similarity('hello', 'hero'))
Output:
0.5669467095138409
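As an illustration, the value above can be reproduced by treating each string as a vector of character counts. That tokenization is an assumption inferred from the output, not something the wiki states, and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the character-count vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = math.sqrt(sum(n * n for n in ca.values()) * sum(n * n for n in cb.values()))
    return dot / norm if norm else 0.0


print(round(cosine_similarity('hello', 'hero'), 4))  # 0.5669
```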
TC(TanimotoCoefficient): The Tanimoto coefficient is similar to cosine similarity, but the number of occurrences of each entry is taken into consideration.
smart_match.use('TC')
print(smart_match.similarity('test', 'test string1'))
Output:
0.5773502691896257
dice(Dice Similarity): The similarity between two strings s1 and s2 is twice the number of character pairs that are common to both strings divided by the sum of the number of character pairs in the two strings. It is intended to be applied to discrete data, so repeated occurrences of an entry are ignored. Mathematically, with X and Y the sets of entries of s1 and s2,
$$sim_{dice}(s_1, s_2) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
smart_match.use('dice')
print(smart_match.similarity('hello', 'hero'))
Output:
0.75
simon(Simon White): The similarity between two strings s1 and s2 is twice the number of character pairs that are common to both strings divided by the sum of the number of character pairs in the two strings. Unlike Dice, repeated occurrences of an entry are taken into consideration.
smart_match.use('simon')
print(smart_match.similarity('hello', 'hollow'))
Output:
0.7272727272727273
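The difference between Dice and Simon White is set versus multiset counting, which can be sketched as below. Note the assumption: the outputs listed above are reproduced when the entries are single characters, so that is what this sketch uses; the function names are hypothetical.

```python
from collections import Counter


def dice_similarity(a: str, b: str) -> float:
    """Set-based: each distinct character counts once."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 1.0


def simon_white_similarity(a: str, b: str) -> float:
    """Multiset-based: repeated characters count as often as they occur."""
    ca, cb = Counter(a), Counter(b)
    total = len(a) + len(b)
    common = sum((ca & cb).values())  # & keeps the minimum count per character
    return 2 * common / total if total else 1.0


print(dice_similarity('hello', 'hero'))                      # 0.75
print(round(simon_white_similarity('hello', 'hollow'), 4))   # 0.7273
```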
jac(Jaccard): The Jaccard coefficient is defined as the ratio between the intersection size and the union size of two strings/sets. Mathematically, for sets A and B,
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
smart_match.use('jac')
print(smart_match.similarity('hello', 'helo'))
print(smart_match.similarity('hello', 'hero'))
print(smart_match.similarity('hello world', 'hello world hello world'))
Output:
1
0.6
1.0
gjac(GeneralizedJaccard): The Jaccard coefficient is defined as the ratio between the intersection size and the union size of two strings/sets. Unlike the Jaccard method, the number of occurrences of each entry is taken into account.
smart_match.use('gjac')
print(smart_match.similarity('hello', 'helo'))
print(smart_match.similarity('hello', 'hero'))
print(smart_match.similarity('hello world', 'hello world hello world'))
Output:
0.8
0.5
0.4782608695652174
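The jac and gjac results above differ only in whether repeated characters are counted. A minimal sketch under the assumption that entries are single characters (function names are hypothetical):

```python
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Set-based: |A ∩ B| / |A ∪ B| over distinct characters."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def generalized_jaccard(a: str, b: str) -> float:
    """Multiset-based: sum of minimum counts / sum of maximum counts."""
    ca, cb = Counter(a), Counter(b)
    largest = sum((ca | cb).values())  # | keeps the maximum count per character
    return sum((ca & cb).values()) / largest if largest else 1.0


print(jaccard('hello', 'helo'))              # 1.0
print(generalized_jaccard('hello', 'helo'))  # 0.8
print(round(generalized_jaccard('hello world', 'hello world hello world'), 4))  # 0.4783
```

Repeating a string leaves its character set unchanged, which is why the set-based score stays at 1.0 while the generalized score drops.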
OC(OverlapCoefficient): The Overlap coefficient is a similarity measure that measures the overlap between two finite strings/sets. Mathematically, for sets A and B,
$$overlap(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$
smart_match.use('OC')
print(smart_match.similarity('hello', 'hero'))
Output:
0.75
GOC(GeneralizedOverlapCoefficient): The Overlap coefficient is a similarity measure that measures the overlap between two finite strings/sets. Unlike the OverlapCoefficient method, the number of occurrences of each entry is taken into account.
smart_match.use('GOC')
print(smart_match.similarity('hello', 'hollow'))
Output:
0.8
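Both overlap variants can be sketched in a few lines. As before, treating single characters as the entries is an assumption that happens to reproduce the outputs above, and the function names are hypothetical.

```python
from collections import Counter


def overlap_coefficient(a: str, b: str) -> float:
    """Set-based: shared distinct characters / size of the smaller set."""
    sa, sb = set(a), set(b)
    smallest = min(len(sa), len(sb))
    return len(sa & sb) / smallest if smallest else 0.0


def generalized_overlap_coefficient(a: str, b: str) -> float:
    """Multiset-based: shared characters with repeats / length of the shorter string."""
    shortest = min(len(a), len(b))
    common = sum((Counter(a) & Counter(b)).values())
    return common / shortest if shortest else 0.0


print(overlap_coefficient('hello', 'hero'))                # 0.75
print(generalized_overlap_coefficient('hello', 'hollow'))  # 0.8
```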
LCST(LongestCommonSubstring): The longest common substring similarity is based on finding the longest string that is a substring of both strings.
smart_match.use('LCST')
print(smart_match.similarity('hello', 'low'))
Output:
0.4
LCSQ(LongestCommonSubsequence): The longest common subsequence similarity is based on finding the longest sequence of characters that appears in both strings in the same order, though not necessarily contiguously.
smart_match.use('LCSQ')
print(smart_match.similarity('hello', 'hill'))
Output:
0.6
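The substring and subsequence variants differ only in whether the match must be contiguous. The sketch below computes both lengths with dynamic programming; dividing by the length of the longer string is an assumption that matches the two outputs above, and all names are hypothetical.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            run = previous[j - 1] + 1 if ca == cb else 0  # a mismatch resets the run
            current.append(run)
            best = max(best, run)
        previous = current
    return best


def longest_common_subsequence(a: str, b: str) -> int:
    """Length of the longest in-order (not necessarily contiguous) match."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))  # skip one character
        previous = current
    return previous[-1]


def normalized(length: int, a: str, b: str) -> float:
    """Assumed normalization: divide by the length of the longer string."""
    longest = max(len(a), len(b))
    return length / longest if longest else 1.0


print(normalized(longest_common_substring('hello', 'low'), 'hello', 'low'))      # 0.4
print(normalized(longest_common_subsequence('hello', 'hill'), 'hello', 'hill'))  # 0.6
```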
HD(HammingDistance): Hamming distance is the number of different characters in the corresponding positions of two strings. The two strings must be the same length.
smart_match.use('HD')
print(smart_match.distance('12211','11111'))
Output:
2
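The definition translates directly into a position-by-position comparison. In this sketch the function name is hypothetical, and raising ValueError for unequal lengths is an assumption about how the length requirement might be enforced.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which the two strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


print(hamming_distance('12211', '11111'))  # 2
```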
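jaro(Jaro Similarity): The Jaro Similarity of two given strings s1 and s2 is
$$sim_j = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise} \end{cases}$$
in which |x| represents the length of string x, m is the number of matching characters, and t is half the number of transpositions.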
smart_match.use('jaro')
print(smart_match.similarity('CRATE','TRACE'))
Output:
0.7333333333333334
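JW(JaroWinkler Similarity): The JaroWinkler Similarity uses a prefix scale p that gives more favorable ratings to strings that match from the beginning, where l is the length of the exactly matching prefix. Mathematically, with sim_j the Jaro similarity,
$$sim_w = sim_j + l \, p \, (1 - sim_j)$$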
smart_match.use('JW')
print(smart_match.similarity('TRATE', 'TRACE'))
Output:
0.9066666666666667
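The Jaro and Jaro-Winkler results above can be reproduced with the textbook formulation sketched below. The matching window, the prefix scale p = 0.1 and the prefix cap of 4 are the conventional defaults and are assumptions here, since the wiki does not state the values smart_match uses; the function names are hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: (m/|s1| + m/|s2| + (m - t)/m) / 3."""
    if s1 == s2:
        return 1.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i, ch in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if ch != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score by the length l of the common prefix."""
    sim_j = jaro(s1, s2)
    l = 0
    for c1, c2 in zip(s1[:max_prefix], s2[:max_prefix]):
        if c1 != c2:
            break
        l += 1
    return sim_j + l * p * (1 - sim_j)


print(round(jaro('CRATE', 'TRACE'), 4))          # 0.7333
print(round(jaro_winkler('TRATE', 'TRACE'), 4))  # 0.9067
```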
NW(NeedlemanWunch): Applies the Needleman-Wunsch algorithm to calculate the similarity between two strings.
smart_match.use('NW')
print(smart_match.similarity('test string1', 'test string2'))
Output:
0.9583333333333334
SW(SmithWaterman): Applies the Smith-Waterman algorithm to calculate the similarity between two strings.
smart_match.use('SW')
print(smart_match.similarity('Web Aplications', 'Web Application Development With PHP'))
Output:
0.8666666666666667
SWG(SmithWatermanGotoh): Applies the Smith-Waterman algorithm to calculate the similarity between two strings. This implementation uses optimizations proposed by Osamu Gotoh.
smart_match.use('SWG')
print(smart_match.similarity('GGTTGACTA', 'TGTTACGG'))
Output:
0.3125
ME(MongeElkan): The Monge-Elkan similarity measure is a type of hybrid similarity measure that combines the benefits of sequence-based and set-based methods. It uses another similarity method as an inner method to score each pair of strings drawn from the two string collections.
smart_match.use('ME')
print(smart_match.similarity(['Hello', 'world'], ['Hero', 'world']))
smart_match.use('ME', 'cos')
print(smart_match.similarity(['Hello', 'world'], ['Hero', 'world']))
Output:
0.8
0.7834733547569204
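The hybrid idea can be sketched as follows: for every string in the first collection, take its best inner score against the second collection, then average those best scores. This is the usual textbook formulation, assumed here rather than taken from the library; the two inner measures (normalized Levenshtein and character-count cosine) and all function names are hypothetical stand-ins.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def levenshtein_similarity(a: str, b: str) -> float:
    """Inner measure 1: 1 - edit distance / length of the longer string."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - previous[-1] / longest


def cosine_similarity(a: str, b: str) -> float:
    """Inner measure 2: cosine over character-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = math.sqrt(sum(n * n for n in ca.values()) * sum(n * n for n in cb.values()))
    return dot / norm if norm else 0.0


def monge_elkan(left: Sequence[str], right: Sequence[str],
                inner: Callable[[str, str], float]) -> float:
    """Average, over the left strings, of the best inner score against the right strings."""
    if not left or not right:
        return 0.0
    return sum(max(inner(x, y) for y in right) for x in left) / len(left)


a, b = ['Hello', 'world'], ['Hero', 'world']
print(round(monge_elkan(a, b, levenshtein_similarity), 4))  # 0.8
print(round(monge_elkan(a, b, cosine_similarity), 4))       # 0.7835
```

Because the average runs over the first collection only, this formulation is not symmetric in general: swapping the two arguments can change the score.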
- Jiaying Wang
- Jing Shan
- Kaiwei Li
- Xiuzi Zhang
- YuQiang Feng
- XianFeng Du
- Zifan Guo
- JingLin Wu
- Mingyang Shao
- Yaxin Li
- Xueqing Xin