This library find non-euclidean distance or similarity between two strings.
Jaro and Jaro-Winkler equations provides a score between two short words where errors are more prone at the end of the word. Jaro's equation measure is the weighted sum of percentage of equal and transposed characters from each strings. Winkler's factor adds a weight in Jaro's formula to increased the calculated measure when there are a sequance of characters (prefix) that matches between the compaired items.
Note
- Impact of the character prefix is limited to 4, as originally defined by Winkler.
- Input strings are not modified beyond leading or trailing whitespace stripping. In-word whitespace and characters case will optionally impact score.
- Returns a floating point number rounded to the desired decimals (defaults to
2
) using Python'sround()
. - Consider usual Floating Point Arithmetic characterisitcs.
The complexity of this algoritme reside in the calculation of matching
and transposed
characters.
- A character of the first string is
matching
if it's found in the second string within a specifieddistance
. A character in the first string cannot be matched multiple time to the same character in the second string. - Two characters are
transposed
if they match, but aren't matched at the same position. - The
limit
is calculated using the length of the longest string devided by two minus one.
----------------------------
P E N N C I S Y L V N I A
P 1 |
E 1 |
N 1 |
N 1 |
S 1 |
Y | 1 |
L | 1 |
V | 1 |
A | 1
N | 1
I | 1
A |
----------------------------
${sim_{j}=\left{{\begin{array}{l l}0&{\text{if }}m=0\{\frac {1}{3}}\left({\frac {m}{|s_{1}|}}+{\frac {m}{|s_{2}|}}+{\frac {m-t}{m}}\right)&{\text{otherwise}}\end{array}}\right.}$
The original implementation is based on the Jaro Winkler Similarity Algorithm article that can be found on Wikipedia. This version of is based on the original C implementation of strcmp95 library.
from pyjarowinkler import distance
distance.get_jaro_distance("hello", "haloa", decimals=2)
# 0.76
distance.get_jaro_winkler_distance("hello", "Haloa", scaling=0.1, ignore_case=False)
# 0.6
distance.get_jaro_winkler_distance("hello", "HaLoA", scaling=0.1, ignore_case=True)
# 0.73