SymSpell v6.7 : WordSegmentation improved
1. Fixed: WordSegmentation did not work correctly if the input string contained words in uppercase.
2. WordSegmentation now retains/preserves case.
3. WordSegmentation now keeps punctuation or apostrophe adjacent to previous word.
4. WordSegmentation now normalizes ligatures: "scientiﬁc" -> "scientific".
5. WordSegmentation now removes hyphens prior to word segmentation (as they might be caused by syllabification).
6. American English word forms added to dictionary in addition to British English e.g. favourable -> favorable.
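
As a rough illustration of items 2 to 5, the following stand-alone sketch mirrors the three small steps this commit adds to `WordSegmentation`. The class `SegmentationSketch` and its method names are hypothetical and are not part of SymSpell; only the .NET calls (`String.Normalize`, `char.IsUpper`, `char.IsPunctuation`) are real. It also compares the apostrophe by code point (U+2019) rather than with `StartsWith`, which is an assumption made for simplicity.

```csharp
using System;
using System.Text;

// Hypothetical helper class for illustration only; not part of SymSpell.
public static class SegmentationSketch
{
    // Items 4 and 5: NFKC normalization folds compatibility characters such as
    // the "fi" ligature (U+FB01) into plain letters; ASCII hyphens are then dropped.
    public static string NormalizeInput(string input)
    {
        return input.Normalize(NormalizationForm.FormKC).Replace("-", "");
    }

    // Item 2: if the original part started with an uppercase letter,
    // capitalize the first letter of the lowercase dictionary suggestion.
    public static string RestoreCase(string part, string suggestion)
    {
        if (part.Length == 0 || suggestion.Length == 0 || !char.IsUpper(part[0]))
            return suggestion;
        return char.ToUpper(suggestion[0]) + suggestion.Substring(1);
    }

    // Item 3: a single punctuation mark, or a two-character token that starts
    // with a right single quotation mark (U+2019), attaches to the previous word.
    public static bool AttachesToPreviousWord(string token)
    {
        return (token.Length == 1 && char.IsPunctuation(token[0]))
            || (token.Length == 2 && token[0] == '\u2019');
    }

    public static void Main()
    {
        Console.WriteLine(NormalizeInput("scienti\uFB01c co-operation")); // scientific cooperation
        Console.WriteLine(RestoreCase("Teh", "the"));                      // The
        Console.WriteLine(AttachesToPreviousWord(","));                    // True
    }
}
```

Note that dropping every hyphen also merges genuinely hyphenated compounds; the commit accepts this because the segmenter re-splits the text against the dictionary afterwards.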
wolfgarbe committed Aug 25, 2020
1 parent 3ab85f7 commit c9e4fed
Showing 9 changed files with 102 additions and 13 deletions.
15 changes: 12 additions & 3 deletions README.md
@@ -20,16 +20,16 @@ but SymSpell needs to generate **only 25 deletes** to cover them all, both at pr
<br>

```
Copyright (c) 2019 Wolf Garbe
Version: 6.5
Copyright (c) 2020 Wolf Garbe
Version: 6.7
Author: Wolf Garbe <[email protected]>
Maintainer: Wolf Garbe <[email protected]>
URL: https://github.com/wolfgarbe/symspell
Description: https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f
MIT License
Copyright (c) 2019 Wolf Garbe
Copyright (c) 2020 Wolf Garbe
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation
@@ -374,6 +374,15 @@ https://github.com/Archivus/SymSpell
2. Option to preserve case (upper/lower case) of input term.
3. Open source the code for creating custom frequency dictionaries in any language and size as intersection between Google Books Ngram data (Provides representative word frequencies) and SCOWL Spell Checker Oriented Word Lists (Ensures genuine English vocabulary).

#### Changes in v6.7

1. WordSegmentation did not work correctly if input string contained words in uppercase.<br>
2. WordSegmentation now retains/preserves case.<br>
3. WordSegmentation now keeps punctuation or apostrophe adjacent to previous word.<br>
4. WordSegmentation now normalizes ligatures: "scientiﬁc" -> "scientific".<br>
5. WordSegmentation now removes hyphens prior to word segmentation (as they might be caused by syllabification).<br>
6. American English word forms added to dictionary in addition to British English e.g. favourable -> favorable.<br>

#### Changes in v6.6

1. IMPROVEMENT: LoadDictionary and LoadBigramDictionary now have an optional separator parameter, which defines the separator characters (e.g. '\t') between term(s) and count. Default is defaultSeparatorChars=null for white space.<br>
@@ -8,13 +8,17 @@
".NETCoreApp,Version=v2.0": {
"SymSpell.CommandLine/1.0.0": {
"dependencies": {
"SymSpell": "6.5.0"
"Microsoft.NETFramework.ReferenceAssemblies": "1.0.0-preview.2",
"Microsoft.Net.Compilers.Toolset": "3.3.1",
"SymSpell": "6.7.0"
},
"runtime": {
"SymSpell.CommandLine.dll": {}
}
},
"SymSpell/6.5.0": {
"Microsoft.Net.Compilers.Toolset/3.3.1": {},
"Microsoft.NETFramework.ReferenceAssemblies/1.0.0-preview.2": {},
"SymSpell/6.7.0": {
"runtime": {
"SymSpell.dll": {}
}
@@ -27,7 +31,21 @@
"serviceable": false,
"sha512": ""
},
"SymSpell/6.5.0": {
"Microsoft.Net.Compilers.Toolset/3.3.1": {
"type": "package",
"serviceable": true,
"sha512": "sha512-2AjN0WJfTnazrp9iTDhZjcEao5+4/HkKQ3PqM/gaQyWd+zXW9bub8B7NDTLfXllYBl+pICODo2C2peNTtBLPag==",
"path": "microsoft.net.compilers.toolset/3.3.1",
"hashPath": "microsoft.net.compilers.toolset.3.3.1.nupkg.sha512"
},
"Microsoft.NETFramework.ReferenceAssemblies/1.0.0-preview.2": {
"type": "package",
"serviceable": true,
"sha512": "sha512-m+pJPEO7HyXvrOna5Sr3s77ewXonjYWJTNL6drh8xACnMNxnlqUDKx9HfGeSE9wmfY0lQwppaeZpFTPGaH7kZg==",
"path": "microsoft.netframework.referenceassemblies/1.0.0-preview.2",
"hashPath": "microsoft.netframework.referenceassemblies.1.0.0-preview.2.nupkg.sha512"
},
"SymSpell/6.7.0": {
"type": "project",
"serviceable": false,
"sha512": ""
Binary file not shown.
Binary file modified SymSpell.CommandLine/bin/Release/netcoreapp2.0/SymSpell.dll
Binary file not shown.
Binary file modified SymSpell.CommandLine/bin/Release/netcoreapp2.0/SymSpell.pdb
Binary file not shown.
@@ -787,6 +787,7 @@ adult 97583096
tickets 97561755
thing 97451660
centre 97258243
center 97258243
requirements 97233632
via 97167128
cheap 97049762
@@ -2588,6 +2589,7 @@ exclusive 29069478
seat 29066336
concerns 29065609
colour 29049269
color 29049269
vendor 29029083
originally 29016923
intel 28995654
@@ -4465,6 +4467,7 @@ athletic 14193383
thermal 14186960
essays 14183871
behaviour 14175567
behavior 14175567
vital 14166642
telling 14160780
fairly 14154477
@@ -4654,6 +4657,7 @@ sight 13360499
laid 13359320
clay 13357683
defence 13356231
defense 13356231
patches 13354086
weak 13347842
refund 13340961
@@ -8970,6 +8974,7 @@ wagon 5198530
barbie 5198277
dat 5197825
favour 5196953
favor 5196953
knock 5196013
urge 5195810
generates 5193442
@@ -9046,6 +9051,7 @@ johnston 5136263
terminology 5134603
gentleman 5134582
fibre 5134463
fiber 5134463
reproduce 5134246
convicted 5133903
shades 5133522
@@ -12821,6 +12827,7 @@ haul 2825285
acupuncture 2825171
workload 2824856
acknowledgement 2823891
acknowledgment 2823891
highlighting 2823564
duly 2823211
roasted 2822882
@@ -14738,6 +14745,7 @@ admire 2196307
westerns 2196151
dodgers 2195501
litre 2195204
liter 2195204
poured 2195144
usefulness 2194801
unsolicited 2194503
@@ -15209,6 +15217,7 @@ nguyen 2070173
meteorological 2070017
spit 2069895
labelled 2069889
labeled 2069889
darker 2069829
horsepower 2068922
globes 2068693
@@ -16554,6 +16563,7 @@ articulate 1767860
ecstasy 1767792
sweetheart 1767766
fulfil 1767616
fulfill 1767616
calcutta 1767575
thursdays 1767458
tenerife 1767294
@@ -16798,6 +16808,7 @@ weymouth 1721736
spherical 1721397
intracellular 1721340
favourable 1721108
favorable 1721108
informs 1720738
dramas 1720511
cher 1720213
@@ -17216,6 +17227,7 @@ sanction 1645521
dyer 1645507
effected 1645475
signalling 1645433
signaling 1645433
daycare 1645407
tubular 1645135
merriam 1644972
@@ -21970,6 +21982,7 @@ guilds 1013989
blatant 1013927
floss 1013783
favoured 1013666
favored 1013666
sarge 1013607
endnote 1013592
ridges 1013404
@@ -23528,6 +23541,7 @@ simulating 878491
coughing 878491
hiatus 878386
enrol 878274
enroll 878274
upholstered 878180
evangelist 878138
louvre 878086
@@ -37237,6 +37251,7 @@ betas 306327
brothels 306278
intraocular 306273
skilful 306269
skillful 306269
sprockets 306229
futurist 306200
invocations 306188
@@ -38311,6 +38326,7 @@ timbuktu 284407
nonnegative 284402
awakens 284351
amoeba 284346
ameba 284346
sonoran 284327
accentuate 284301
duvets 284281
40 changes: 35 additions & 5 deletions SymSpell/SymSpell.cs
@@ -11,15 +11,15 @@
// 2. mistakenly omitted space between two correct words led to one incorrect combined term
// 3. multiple independent input terms with/without spelling errors

// Copyright (C) 2019 Wolf Garbe
// Version: 6.5
// Copyright (C) 2020 Wolf Garbe
// Version: 6.7
// Author: Wolf Garbe [email protected]
// Maintainer: Wolf Garbe [email protected]
// URL: https://github.com/wolfgarbe/symspell
// Description: https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f
//
// MIT License
// Copyright (c) 2019 Wolf Garbe
// Copyright (c) 2020 Wolf Garbe
// Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
// documentation files (the "Software"), to deal in the Software without restriction, including without limitation
// the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
@@ -1074,6 +1074,12 @@ public List<SuggestItem> LookupCompound(string input, int editDistanceMax)
/// the Sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).</returns>
public (string segmentedString, string correctedString, int distanceSum, decimal probabilityLogSum) WordSegmentation(string input, int maxEditDistance, int maxSegmentationWordLength)
{
//v6.7
//normalize ligatures:
//"scientiﬁc"
//"scientiﬁc" "ﬁelds" "ﬁnal"
input = input.Normalize(System.Text.NormalizationForm.FormKC).Replace("\u002D", "");//.Replace("\uC2AD","");

int arraySize = Math.Min(maxSegmentationWordLength, input.Length);
(string segmentedString, string correctedString, int distanceSum, decimal probabilityLogSum)[] compositions = new(string segmentedString, string correctedString, int distanceSum, decimal probabilityLogSum)[arraySize];
int circularIndex = -1;
@@ -1110,10 +1116,21 @@ public List<SuggestItem> LookupCompound(string input, int editDistanceMax)
//add number of removed spaces to ed
topEd -= part.Length;

List<SymSpell.SuggestItem> results = this.Lookup(part, SymSpell.Verbosity.Top, maxEditDistance);
//v6.7
//Lookup against the lowercase term
List<SymSpell.SuggestItem> results = this.Lookup(part.ToLower(), SymSpell.Verbosity.Top, maxEditDistance);
if (results.Count > 0)
{
topResult = results[0].term;
//v6.7
//retain/preserve upper case
if (Char.IsUpper(part[0]))
{
char[] a = topResult.ToCharArray();
a[0] = char.ToUpper(topResult[0]);
topResult = new string(a);
}

topEd += results[0].distance;
//Naive Bayes Rule
//we assume the word probabilities of two words to be independent
@@ -1146,11 +1163,24 @@ public List<SuggestItem> LookupCompound(string input, int editDistanceMax)
//replace values if smaller edit distance
|| (compositions[circularIndex].distanceSum + separatorLength + topEd < compositions[destinationIndex].distanceSum))
{
compositions[destinationIndex] = (
//v6.7
//keep punctuation or apostrophe adjacent to previous word
if (((topResult.Length == 1) && char.IsPunctuation(topResult[0])) || ((topResult.Length == 2) && topResult.StartsWith("’")))
{
compositions[destinationIndex] = (
compositions[circularIndex].segmentedString + part,
compositions[circularIndex].correctedString + topResult,
compositions[circularIndex].distanceSum + topEd,
compositions[circularIndex].probabilityLogSum + topProbabilityLog);
}
else
{
compositions[destinationIndex] = (
compositions[circularIndex].segmentedString + " " + part,
compositions[circularIndex].correctedString + " " + topResult,
compositions[circularIndex].distanceSum + separatorLength + topEd,
compositions[circularIndex].probabilityLogSum + topProbabilityLog);
}
}
}
circularIndex++; if (circularIndex == arraySize) circularIndex = 0;
4 changes: 2 additions & 2 deletions SymSpell/SymSpell.csproj
@@ -9,13 +9,13 @@
<Company>Wolf Garbe &lt;[email protected]&gt;</Company>
<Product>SymSpell</Product>
<Description>Spelling correction &amp; Fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm</Description>
<Copyright>Copyright (C) 2019 Wolf Garbe</Copyright>
<Copyright>Copyright (C) 2020 Wolf Garbe</Copyright>
<PackageProjectUrl>https://github.com/wolfgarbe/symspell</PackageProjectUrl>
<RepositoryUrl>https://github.com/wolfgarbe</RepositoryUrl>
<RepositoryType>Git</RepositoryType>
<PackageTags>symspell, spelling-correction, spellcheck, spell-check, spelling, fuzzy-search, approximate-string-matching, edit-distance, levenshtein, levenshtein-distance, damerau-levenshtein, word segmentation</PackageTags>
<PackageReleaseNotes>Better correction quality for LookupCompound with existing single term dictionary by using Naive Bayes probability for selecting best word splitting AND even better correction quality, when using the optional bigram dictionary in order to use sentence level context information for selecting best spelling correction. \r\nEnglish bigram frequency dictionary is included in the release.</PackageReleaseNotes>
<Version>6.5</Version>
<Version>6.7</Version>
<PackageLicenseExpression>MIT</PackageLicenseExpression>
</PropertyGroup>
