Skip to content

訓練中文詞向量 Word2vec, Word2vec was created by a team of researchers led by Tomas Mikolov at Google.

Notifications You must be signed in to change notification settings

dongjing007/Word2vec

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Word2Vec

Word2vec 是基於非監督式學習,訓練集建議越大越好,語料涵蓋的越全面,訓練出來的結果相對比較好,當然也有可能 garbage input 進而得到 garbage output ,由於檔案所使用的資料集較大,所以每個過程中都請耐心等候。(ps: word2vec 如果把每種字當成一個維度,假設總共有 4000 個總字,那麼向量就會有 4000 維度。故可透過它來降低維度)

demo

WiKi簡介:
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

By the way :
Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words.

使用方式

Input:

1.download wiki data(請參考資料集)
2.進入 Word2Vec 資料夾
3.執行 python wiki_to_txt.py zhwiki-latest-pages-articles.xml.bz2(wiki xml 轉換成 wiki text)
4.執行 python segmentation.py(簡體轉繁體,在進行斷詞並同步過濾停用詞,由於檔案較大故斷詞較久)
5.執行 python train.py(訓練並產生 model ,時間上也會比較久)
5.執行 python main.py(使用 Model,輸入詞彙)
註:如果在 Windows cmd 下執行 python 時有編碼問題請下以下指令:chcp 65001(使用utf-8)

Output:

1.輸入一個詞彙會找出前5名相似
2.輸入兩個詞彙會算出兩者之間相似度
3.輸入三個詞彙爸爸之於老公,如媽媽之於老婆

輸入格式( Ex: 爸爸,媽媽,....註:最多三個詞彙)
老師
詞彙相似詞前 5 排序
班導,0.6360481977462769
班導師,0.6360464096069336
代課,0.6358826160430908
級任,0.6271134614944458
班主任,0.6270170211791992

輸入格式( Ex: 爸爸,媽媽,....註:最多三個詞彙)
爸爸,媽媽
計算兩個詞彙間 Cosine 相似度
0.780765200371

輸入格式( Ex: 爸爸,媽媽,....註:最多三個詞彙)
爸爸,老公,媽媽
爸爸之於老公,如媽媽之於
老婆,0.5401346683502197
蠢萌,0.5245970487594604
夠秤,0.5059393048286438
駁命,0.4888317286968231
孔爵,0.4857243597507477

資料集(wiki data)

主要以 pages-articles.xml.bz2 結尾之檔案類型,這邊使用 zhwiki-latest-pages-articles.xml.bz2。

開發環境

Python 3.5.2
pip install gensim
pip install jieba
pip install hanziconv
note: if your gensim can't install, please check your os then install correct gensim version.

About

訓練中文詞向量 Word2vec, Word2vec was created by a team of researchers led by Tomas Mikolov at Google.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Jupyter Notebook 99.2%
  • Python 0.8%