Run in the terminal: python Autochecker4Chinese.py
You will get the following result:
1. Make a detector
Construct a dict for detecting misspelled Chinese phrases: each key is a Chinese phrase, and each value is the frequency with which that phrase appears in the corpus.
You can build this dict by collecting a corpus from the internet, or take the easier route and load dicts that others have already created. Here we choose the second way and construct the dict from a file.
The detector works as follows: any phrase that does not appear in this dict is flagged as a misspelled phrase.
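The detector described above can be sketched as follows in Python 3. This is a minimal illustration rather than the project's own code: the helper names `load_phrase_freq` and `is_misspelled` are hypothetical, and the file format (one phrase and its frequency per line, separated by whitespace) is an assumption.

```python
import io


def load_phrase_freq(lines):
    """Build a {phrase: frequency} dict from an iterable of text lines.

    Assumed format (not confirmed by the source): "<phrase> <frequency>".
    """
    phrase_freq = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        phrase, freq = parts[0], parts[1]
        # Accumulate in case the same phrase appears more than once.
        phrase_freq[phrase] = phrase_freq.get(phrase, 0) + int(freq)
    return phrase_freq


def is_misspelled(phrase, phrase_freq):
    """A phrase is flagged as misspelled when it is absent from the dict."""
    return phrase not in phrase_freq


# Toy corpus standing in for a real dictionary file (frequencies are made up).
sample = io.StringIO("呕吐 120\n沙龙 85\n东方之珠 40\n")
phrase_freq = load_phrase_freq(sample)
print(is_misspelled("呕涂", phrase_freq))  # True
print(is_misspelled("呕吐", phrase_freq))  # False
```

With a real dictionary file, the same function could be fed `open(path, encoding="utf-8")` instead of the in-memory sample.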
2. Make an autocorrector
To correct a misspelled phrase, we generate every phrase within an edit distance of one from it and keep those found in the dict as the list of candidate corrections.
We sort the candidate list by the likelihood of each candidate being the correct phrase, using the following rules:
If the candidate's pinyin exactly matches the misspelled phrase's pinyin, we put the candidate in the first order, which holds the most likely corrections.
Otherwise, if the pinyin of the candidate's first character matches the pinyin of the misspelled phrase's first character, we put the candidate in the second order.
Otherwise, we put the candidate in the third order.
import pinyin

# list for chinese words
# read from the words.dic
def load_cn_words_dict( file_path ):
    cn_words_dict = ""
    with open(file_path, "r") as f:
        for word in f:
            cn_words_dict += word.strip().decode("utf-8")
    return cn_words_dict
# generate every phrase at edit distance one from the chinese phrase
def edits1(phrase, cn_words_dict):
    "All edits that are one edit away from `phrase`."
    # Python 2: decode the byte string so slicing works per character
    phrase = phrase.decode("utf-8")
    splits = [(phrase[:i], phrase[i:]) for i in range(len(phrase) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in cn_words_dict]
    inserts = [L + c + R for L, R in splits for c in cn_words_dict]
    return set(deletes + transposes + replaces + inserts)

# return the phrases that exist in phrase_freq
def known(phrases):
    return set(phrase for phrase in phrases if phrase.encode("utf-8") in phrase_freq)
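As a worked example of the candidate generation above, the following Python 3 sketch runs the same four edit operations over a deliberately tiny character set. The character set and frequencies are made up for illustration, and `known` takes the frequency dict as an explicit argument here, which is a small departure from the code above.

```python
def edits1(phrase, alphabet):
    """Return every string one delete, transpose, replace, or insert away."""
    splits = [(phrase[:i], phrase[i:]) for i in range(len(phrase) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def known(phrases, phrase_freq):
    """Keep only the phrases that appear in the frequency dict."""
    return {p for p in phrases if p in phrase_freq}


alphabet = "呕吐涂沙龙拢"  # toy character set; the real one is read from a file
phrase_freq = {"呕吐": 120, "沙龙": 85}  # toy frequencies

candidates = known(edits1("呕涂", alphabet), phrase_freq)
print(candidates)  # {'呕吐'}
```

Replacing the second character of "呕涂" with "吐" yields "呕吐", the only generated string that is also in the toy dict. "沙龙" is two replacements away, so it is not generated.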
# get the candidate phrases for the error phrase
# candidates are ranked by how closely their pinyin matches the error phrase:
#   first order:  the candidate's pinyin exactly matches the error phrase's pinyin
#   second order: only the pinyin of the first character matches
#   third order:  everything else
def get_candidates( error_phrase ):
    candidates_1st_order = []
    candidates_2nd_order = []
    candidates_3rd_order = []
    error_pinyin = pinyin.get(error_phrase, format="strip", delimiter="/").encode("utf-8")
    cn_words_dict = load_cn_words_dict( "./cn_dict.txt" )
    candidate_phrases = list( known(edits1(error_phrase, cn_words_dict)) )
    for candidate_phrase in candidate_phrases:
        candidate_pinyin = pinyin.get(candidate_phrase, format="strip", delimiter="/").encode("utf-8")
        if candidate_pinyin == error_pinyin:
            candidates_1st_order.append(candidate_phrase)
        elif candidate_pinyin.split("/")[0] == error_pinyin.split("/")[0]:
            candidates_2nd_order.append(candidate_phrase)
        else:
            candidates_3rd_order.append(candidate_phrase)
    return candidates_1st_order, candidates_2nd_order, candidates_3rd_order
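The tests that follow call `auto_correct`, whose definition does not appear in this section. One plausible way to turn the three ranked lists into a single answer, sketched here in Python 3, is to return the most frequent phrase from the highest non-empty order. The selection rule, the fallback to the original phrase, and the name `pick_best` are all assumptions, not the project's confirmed behavior.

```python
def pick_best(first, second, third, phrase_freq, fallback):
    """Return the most frequent candidate from the highest non-empty order."""
    for tier in (first, second, third):
        if tier:
            return max(tier, key=lambda p: phrase_freq.get(p, 0))
    return fallback  # no candidate at all: leave the phrase unchanged


phrase_freq = {"呕吐": 120, "呕出": 15, "涂料": 60}  # made-up frequencies
print(pick_best(["呕吐"], ["呕出"], ["涂料"], phrase_freq, "呕涂"))  # 呕吐
print(pick_best([], [], [], phrase_freq, "呕涂"))  # 呕涂
```

Checking the orders in sequence means a lower-order candidate is only considered when no better pinyin match exists, regardless of how frequent it is.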
# test for the auto_correct
error_phrase_1 = "呕涂"  # should be "呕吐"
error_phrase_2 = "东方之朱"  # should be "东方之珠"
error_phrase_3 = "沙拢"  # should be "沙龙"
print error_phrase_1, auto_correct( error_phrase_1 )
print error_phrase_2, auto_correct( error_phrase_2 )
print error_phrase_3, auto_correct( error_phrase_3 )
呕涂 呕吐
东方之朱 东方之珠
沙拢 沙龙
3. Correct the misspelled phrases in a sentence
For any given sentence, use jieba to segment it into phrases.
After segmentation, check whether each remaining (non-punctuation) phrase in the segment list exists in the phrase_freq dict; if it does not, it is treated as a misspelled phrase.
Use the auto_correct function to correct each misspelled phrase.
import jieba

# PUNCTUATION_LIST, phrase_freq and auto_correct are expected to be defined at module level
def auto_correct_sentence( error_sentence, verbose=True):
    # segment the sentence passed in (the original referenced an undefined name, err_test)
    jieba_cut = jieba.cut(error_sentence.decode("utf-8"), cut_all=False)
    seg_list = "\t".join(jieba_cut).split("\t")
    correct_sentence = ""
    for phrase in seg_list:
        correct_phrase = phrase
        # check if item is a punctuation
        if phrase not in PUNCTUATION_LIST.decode("utf-8"):
            # check if the phrase is in our dict, if not then it is a misspelled phrase
            if phrase.encode("utf-8") not in phrase_freq:
                correct_phrase = auto_correct(phrase.encode("utf-8"))
                if verbose:
                    print phrase, correct_phrase
        correct_sentence += correct_phrase
    if verbose:
        print correct_sentence
    return correct_sentence
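The same sentence-level flow can be sketched in Python 3 with the segmenter and the phrase corrector passed in as arguments, so the example runs without jieba or a real dictionary. The function name `correct_sentence`, the punctuation set, and the toy stand-ins below are hypothetical and only illustrate the control flow.

```python
PUNCTUATION = set("。，？！、；：,.?!;:")  # illustrative subset only


def correct_sentence(sentence, segment, phrase_freq, correct):
    """Segment a sentence and replace every unknown, non-punctuation phrase."""
    pieces = []
    for phrase in segment(sentence):
        if phrase not in PUNCTUATION and phrase not in phrase_freq:
            phrase = correct(phrase)
        pieces.append(phrase)
    return "".join(pieces)


# Stand-ins so the example is self-contained (frequencies are made up).
phrase_freq = {"他": 900, "一直": 300, "呕吐": 120}
fixes = {"呕涂": "呕吐"}
toy_segment = lambda s: ["他", "一直", "呕涂", "。"]
toy_correct = lambda p: fixes.get(p, p)

print(correct_sentence("他一直呕涂。", toy_segment, phrase_freq, toy_correct))
# 他一直呕吐。
```

In real use, `segment` could be `jieba.lcut` and `correct` the project's `auto_correct` function.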