SymSpell v6.7 : WordSegmentation improved
1. WordSegmentation did not work correctly if input string contained words in uppercase.
2. WordSegmentation now retains/preserves case.
3. WordSegmentation now keeps punctuation or apostrophe adjacent to previous word.
4. WordSegmentation now normalizes ligatures: "scientiﬁc" -> "scientific".
5. WordSegmentation now removes hyphens prior to word segmentation (as they might be caused by syllabification).
6. American English word forms added to dictionary in addition to British English e.g. favourable -> favorable.
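Items 2, 4 and 5 can be tried in isolation, without loading a dictionary. The following is a minimal sketch, not code from this commit: the class name `SegmentationPreprocessingSketch` and both method names are hypothetical, and only the `Normalize(NormalizationForm.FormKC)`, `Replace("\u002D", "")` and first-letter uppercasing steps are taken from the diff.

```csharp
using System;
using System.Text;

// Hypothetical helper class for illustration; it is not part of SymSpell.
public static class SegmentationPreprocessingSketch
{
    // Same preprocessing expression as in the v6.7 diff: NFKC folds compatibility
    // characters such as the "ﬁ" ligature (U+FB01) into "fi", then every
    // ASCII hyphen-minus (U+002D) is removed.
    public static string NormalizeInput(string input)
    {
        return input.Normalize(NormalizationForm.FormKC).Replace("\u002D", "");
    }

    // Same idea as the v6.7 case handling: the lookup runs on the lowercase term,
    // and the first letter of the suggestion is uppercased again if the original
    // part started with an uppercase letter. The empty-string guard is an addition
    // made for this sketch.
    public static string RestoreLeadingCase(string originalPart, string suggestion)
    {
        if (originalPart.Length == 0 || suggestion.Length == 0) return suggestion;
        if (!char.IsUpper(originalPart[0])) return suggestion;

        char[] chars = suggestion.ToCharArray();
        chars[0] = char.ToUpper(chars[0]);
        return new string(chars);
    }

    public static void Main()
    {
        // Input contains the U+FB01 ligature and a syllabification hyphen.
        Console.WriteLine(NormalizeInput("scienti\uFB01c syl-labification"));
        Console.WriteLine(RestoreLeadingCase("Favourable", "favorable"));
    }
}
```

Note that only the first character's case is restored, so an all-uppercase input word would come back capitalized rather than fully uppercase.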
Showing 9 changed files with 102 additions and 13 deletions.
@@ -20,16 +20,16 @@ but SymSpell needs to generate **only 25 deletes** to cover them all, both at pr
 <br>
 
 ```
-Copyright (c) 2019 Wolf Garbe
-Version: 6.5
+Copyright (c) 2020 Wolf Garbe
+Version: 6.7
 Author: Wolf Garbe <[email protected]>
 Maintainer: Wolf Garbe <[email protected]>
 URL: https://github.com/wolfgarbe/symspell
 Description: https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f
 MIT License
-Copyright (c) 2019 Wolf Garbe
+Copyright (c) 2020 Wolf Garbe
 Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
 documentation files (the "Software"), to deal in the Software without restriction, including without limitation
@@ -374,6 +374,15 @@ https://github.com/Archivus/SymSpell
 2. Option to preserve case (upper/lower case) of input term.
 3. Open source the code for creating custom frequency dictionaries in any language and size as intersection between Google Books Ngram data (Provides representative word frequencies) and SCOWL Spell Checker Oriented Word Lists (Ensures genuine English vocabulary).
 
+#### Changes in v6.7
+
+1. WordSegmentation did not work correctly if input string contained words in uppercase.<br>
+2. WordSegmentation now retains/preserves case.<br>
+3. WordSegmentation now keeps punctuation or apostrophe adjacent to previous word.<br>
+4. WordSegmentation now normalizes ligatures: "scientiﬁc" -> "scientific".<br>
+5. WordSegmentation now removes hyphens prior to word segmentation (as they might be caused by syllabification).<br>
+6. American English word forms added to dictionary in addition to British English e.g. favourable -> favorable.<br>
+
 #### Changes in v6.6
 
 1. IMPROVEMENT: LoadDictionary and LoadBigramDictionary now have an optional separator parameter, which defines the separator characters (e.g. '\t') between term(s) and count. Default is defaultSeparatorChars=null for white space.<br>
Binary file modified (+0 Bytes): SymSpell.CommandLine/bin/Release/netcoreapp2.0/SymSpell.CommandLine.dll
Binary file modified (+512 Bytes): SymSpell.CommandLine/bin/Release/netcoreapp2.0/SymSpell.dll
Binary file modified (+0 Bytes): SymSpell.CommandLine/bin/Release/netcoreapp2.0/SymSpell.pdb
@@ -11,15 +11,15 @@
 // 2. mistakenly omitted space between two correct words led to one incorrect combined term
 // 3. multiple independent input terms with/without spelling errors
 
-// Copyright (C) 2019 Wolf Garbe
-// Version: 6.5
+// Copyright (C) 2020 Wolf Garbe
+// Version: 6.7
 // Author: Wolf Garbe [email protected]
 // Maintainer: Wolf Garbe [email protected]
 // URL: https://github.com/wolfgarbe/symspell
 // Description: https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f
 //
 // MIT License
-// Copyright (c) 2019 Wolf Garbe
+// Copyright (c) 2020 Wolf Garbe
 // Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
 // documentation files (the "Software"), to deal in the Software without restriction, including without limitation
 // the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
@@ -1074,6 +1074,12 @@ public List<SuggestItem> LookupCompound(string input, int editDistanceMax)
     /// the Sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).</returns>
     public (string segmentedString, string correctedString, int distanceSum, decimal probabilityLogSum) WordSegmentation(string input, int maxEditDistance, int maxSegmentationWordLength)
     {
+        //v6.7
+        //normalize ligatures:
+        //"scientific"
+        //"scientiﬁc" "ﬁelds" "ﬁnal"
+        input = input.Normalize(System.Text.NormalizationForm.FormKC).Replace("\u002D", "");//.Replace("\uC2AD","");
+
         int arraySize = Math.Min(maxSegmentationWordLength, input.Length);
         (string segmentedString, string correctedString, int distanceSum, decimal probabilityLogSum)[] compositions = new(string segmentedString, string correctedString, int distanceSum, decimal probabilityLogSum)[arraySize];
         int circularIndex = -1;
@@ -1110,10 +1116,21 @@ public List<SuggestItem> LookupCompound(string input, int editDistanceMax)
     //add number of removed spaces to ed
     topEd -= part.Length;
 
-    List<SymSpell.SuggestItem> results = this.Lookup(part, SymSpell.Verbosity.Top, maxEditDistance);
+    //v6.7
+    //Lookup against the lowercase term
+    List<SymSpell.SuggestItem> results = this.Lookup(part.ToLower(), SymSpell.Verbosity.Top, maxEditDistance);
     if (results.Count > 0)
     {
         topResult = results[0].term;
+        //v6.7
+        //retain/preserve upper case
+        if (Char.IsUpper(part[0]))
+        {
+            char[] a = topResult.ToCharArray();
+            a[0] = char.ToUpper(topResult[0]);
+            topResult = new string(a);
+        }
+
         topEd += results[0].distance;
         //Naive Bayes Rule
         //we assume the word probabilities of two words to be independent
@@ -1146,11 +1163,24 @@ public List<SuggestItem> LookupCompound(string input, int editDistanceMax)
     //replace values if smaller edit distance
     || (compositions[circularIndex].distanceSum + separatorLength + topEd < compositions[destinationIndex].distanceSum))
 {
-    compositions[destinationIndex] = (
+    //v6.7
+    //keep punctuation or apostrophe adjacent to previous word
+    if (((topResult.Length == 1) && char.IsPunctuation(topResult[0])) || ((topResult.Length == 2) && topResult.StartsWith("’")))
+    {
+        compositions[destinationIndex] = (
+            compositions[circularIndex].segmentedString + part,
+            compositions[circularIndex].correctedString + topResult,
+            compositions[circularIndex].distanceSum + topEd,
+            compositions[circularIndex].probabilityLogSum + topProbabilityLog);
+    }
+    else
+    {
+        compositions[destinationIndex] = (
             compositions[circularIndex].segmentedString + " " + part,
             compositions[circularIndex].correctedString + " " + topResult,
             compositions[circularIndex].distanceSum + separatorLength + topEd,
             compositions[circularIndex].probabilityLogSum + topProbabilityLog);
+    }
 }
 }
 circularIndex++; if (circularIndex == arraySize) circularIndex = 0;
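The punctuation rule in the hunk above can be exercised on its own. The sketch below is an illustration under stated assumptions, not code from this commit: `PunctuationJoinSketch`, `AttachesToPreviousWord` and `Append` are hypothetical names, and the sketch uses an ordinal `StartsWith` comparison where the diff uses the default overload. The condition itself (a single punctuation character, or a two-character token starting with U+2019) follows the diff.

```csharp
using System;

// Hypothetical helper class for illustration; it is not part of SymSpell.
public static class PunctuationJoinSketch
{
    // Same condition as the v6.7 check: a lone punctuation character, or a
    // two-character token that starts with the right single quotation mark (U+2019).
    // Assumption: StringComparison.Ordinal is used here to keep the check culture-independent.
    public static bool AttachesToPreviousWord(string token)
    {
        return (token.Length == 1 && char.IsPunctuation(token[0]))
            || (token.Length == 2 && token.StartsWith("\u2019", StringComparison.Ordinal));
    }

    // Appends a token to the text composed so far, inserting a separating space
    // only when the token does not attach to the previous word.
    public static string Append(string composed, string token)
    {
        if (composed.Length == 0) return token;
        return AttachesToPreviousWord(token) ? composed + token : composed + " " + token;
    }

    public static void Main()
    {
        string composed = "";
        foreach (string token in new[] { "it", "\u2019s", "here", "." })
        {
            composed = Append(composed, token);
        }
        Console.WriteLine(composed); // prints "it’s here."
    }
}
```

Under this condition a two-character token starting with the ASCII apostrophe (U+0027) does not attach, because only U+2019 is tested for the two-character case.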
@@ -9,13 +9,13 @@
     <Company>Wolf Garbe <[email protected]></Company>
     <Product>SymSpell</Product>
     <Description>Spelling correction &amp; Fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm</Description>
-    <Copyright>Copyright (C) 2019 Wolf Garbe</Copyright>
+    <Copyright>Copyright (C) 2020 Wolf Garbe</Copyright>
     <PackageProjectUrl>https://github.com/wolfgarbe/symspell</PackageProjectUrl>
     <RepositoryUrl>https://github.com/wolfgarbe</RepositoryUrl>
     <RepositoryType>Git</RepositoryType>
     <PackageTags>symspell, spelling-correction, spellcheck, spell-check, spelling, fuzzy-search, approximate-string-matching, edit-distance, levenshtein, levenshtein-distance, damerau-levenshtein, word segmentation</PackageTags>
     <PackageReleaseNotes>Better correction quality for LookupCompound with existing single term dictionary by using Naive Bayes probability for selecting best word splitting AND even better correction quality, when using the optional bigram dictionary in order to use sentence level context information for selecting best spelling correction. \r\nEnglish bigram frequency dictionary is included in the release.</PackageReleaseNotes>
-    <Version>6.5</Version>
+    <Version>6.7</Version>
     <PackageLicenseExpression>MIT</PackageLicenseExpression>
   </PropertyGroup>
 