Commit: update docs
bab2min committed Apr 18, 2020
1 parent 32689d7 commit 03560c9
Showing 5 changed files with 112 additions and 3 deletions.
21 changes: 20 additions & 1 deletion README.kr.rst
@@ -204,9 +204,28 @@ The infer method can infer a single `tomotopy.Document` instance or a `t

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

Pinning Topics using Word Priors
--------------------------------
Since version 0.6.0, a new method `tomotopy.LDAModel.set_word_prior` has been added. It lets you adjust the prior distribution of specific words.
For example, as in the following code, you can set the weight of the word 'church' to 1.0 for Topic 0 and to 0.1 for the remaining topics.
This means that the word 'church' is 10 times more likely to be assigned to Topic 0 than to any other topic, so most occurrences of 'church' will be assigned to Topic 0.
As training proceeds, words related to 'church' will also gather in Topic 0, so in the end Topic 0 will become a topic about 'church'.
This lets you pin a topic about particular content to the topic number you want.

::

    import tomotopy as tp
    mdl = tp.LDAModel(k=20)
    # add documents into `mdl`

    # setting word prior
    mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])

See the `word_prior_example` function in `example.py` for details.

Examples
--------
---------
For Python 3 example code of tomotopy, see https://github.com/bab2min/tomotopy/blob/master/example.py .

The data files used in the example code can be downloaded from https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .
20 changes: 20 additions & 0 deletions README.rst
@@ -208,6 +208,26 @@ The following chart shows the speed difference between the two algorithms based

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

Pinning Topics using Word Priors
--------------------------------
Since version 0.6.0, a new method `tomotopy.LDAModel.set_word_prior` has been added. It allows you to control the word prior for each topic.
For example, the following code sets the weight of the word 'church' to 1.0 in topic 0 and to 0.1 in the rest of the topics.
This means that the word 'church' is 10 times more likely to be assigned to topic 0 than to any other topic.
As a result, most occurrences of 'church' are assigned to topic 0, and topic 0 comes to contain many words related to 'church'.
This allows you to pin a topic about particular content to a specific topic number.

::

    import tomotopy as tp
    mdl = tp.LDAModel(k=20)
    # add documents into `mdl`

    # setting word prior
    mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])
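
After training, you can verify the effect (a minimal sketch, assuming documents have been added to `mdl` as above) by printing the top words of topic 0:

::

    # a minimal sketch: train the model, then inspect topic 0
    mdl.train(1000)
    # 'church'-related words should now rank highly in topic 0
    for word, prob in mdl.get_topic_words(0, top_n=10):
        print(word, prob, sep='\t')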

See `word_prior_example` in `example.py` for more details.


Examples
--------
30 changes: 30 additions & 0 deletions example.py
@@ -48,6 +48,33 @@ def hdp_example(input_file, save_path):
        for word, prob in mdl.get_topic_words(k):
            print('\t', word, prob, sep='\t')

def word_prior_example(input_file):
    corpus = tp.utils.Corpus(tokenizer=tp.utils.SimpleTokenizer(), stopwords=['.'])
    # data_feeder yields a tuple of (raw string, user data) or a str (raw string)
    corpus.process(open(input_file, encoding='utf-8'))

    # make LDA model and train
    mdl = tp.LDAModel(k=20, min_cf=10, min_df=5, corpus=corpus)
    # The word 'church' is assigned to Topic 0 with a weight of 1.0 and to the remaining topics with a weight of 0.1.
    # Therefore, a topic related to 'church' can be pinned to Topic 0.
    mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])
    # Topic 1 for a topic related to 'softwar'
    mdl.set_word_prior('softwar', [1.0 if k == 1 else 0.1 for k in range(20)])
    # Topic 2 for a topic related to 'citi'
    mdl.set_word_prior('citi', [1.0 if k == 2 else 0.1 for k in range(20)])
    # train(0) initializes the model without running any training iterations
    mdl.train(0)
    print('Num docs:', len(mdl.docs), ', Vocab size:', mdl.num_vocabs, ', Num words:', mdl.num_words)
    print('Removed top words:', mdl.removed_top_words)
    for i in range(0, 1000, 10):
        mdl.train(10)
        print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

    for k in range(mdl.k):
        print("== Topic #{} ==".format(k))
        for word, prob in mdl.get_topic_words(k, top_n=10):
            print(word, prob, sep='\t')
        print()

def corpus_and_labeling_example(input_file):
    corpus = tp.utils.Corpus(tokenizer=tp.utils.SimpleTokenizer(), stopwords=['.'])
    # data_feeder yields a tuple of (raw string, user data) or a str (raw string)
@@ -115,6 +142,9 @@ def raw_corpus_and_labeling_example(input_file):
print('Running HDP')
hdp_example('enwiki-stemmed-1000.txt', 'test.hdp.bin')

print('Set Word Prior')
word_prior_example('enwiki-stemmed-1000.txt')

print('Running LDA and Labeling')
corpus_and_labeling_example('enwiki-stemmed-1000.txt')

24 changes: 22 additions & 2 deletions tomotopy/documentation.kr.rst
@@ -17,7 +17,7 @@ What is tomotopy?
* Correlated Topic Model (`tomotopy.CTModel`)
* Dynamic Topic Model (`tomotopy.DTModel`)

The latest version of tomotopy is 0.6.2.
The latest version of tomotopy is 0.7.0.

.. image:: https://badge.fury.io/py/tomotopy.svg

@@ -246,8 +246,28 @@ The infer method can infer a single `tomotopy.Document` instance or a `t

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

Pinning Topics using Word Priors
--------------------------------
Since version 0.6.0, a new method `tomotopy.LDAModel.set_word_prior` has been added. It lets you adjust the prior distribution of specific words.
For example, as in the following code, you can set the weight of the word 'church' to 1.0 for Topic 0 and to 0.1 for the remaining topics.
This means that the word 'church' is 10 times more likely to be assigned to Topic 0 than to any other topic, so most occurrences of 'church' will be assigned to Topic 0.
As training proceeds, words related to 'church' will also gather in Topic 0, so in the end Topic 0 will become a topic about 'church'.
This lets you pin a topic about particular content to the topic number you want.

::

    import tomotopy as tp
    mdl = tp.LDAModel(k=20)
    # add documents into `mdl`

    # setting word prior
    mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])

See the `word_prior_example` function in `example.py` for details.

Examples
--------
---------
For Python 3 example code of tomotopy, see https://github.com/bab2min/tomotopy/blob/master/example.py .

The data files used in the example code can be downloaded from https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .
20 changes: 20 additions & 0 deletions tomotopy/documentation.rst
@@ -248,6 +248,26 @@ The following chart shows the speed difference between the two algorithms based

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

Pinning Topics using Word Priors
--------------------------------
Since version 0.6.0, a new method `tomotopy.LDAModel.set_word_prior` has been added. It allows you to control the word prior for each topic.
For example, the following code sets the weight of the word 'church' to 1.0 in topic 0 and to 0.1 in the rest of the topics.
This means that the word 'church' is 10 times more likely to be assigned to topic 0 than to any other topic.
As a result, most occurrences of 'church' are assigned to topic 0, and topic 0 comes to contain many words related to 'church'.
This allows you to pin a topic about particular content to a specific topic number.

::

    import tomotopy as tp
    mdl = tp.LDAModel(k=20)
    # add documents into `mdl`

    # setting word prior
    mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])
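
After training, you can verify the effect (a minimal sketch, assuming documents have been added to `mdl` as above) by printing the top words of topic 0:

::

    # a minimal sketch: train the model, then inspect topic 0
    mdl.train(1000)
    # 'church'-related words should now rank highly in topic 0
    for word, prob in mdl.get_topic_words(0, top_n=10):
        print(word, prob, sep='\t')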

See `word_prior_example` in `example.py` for more details.

Examples
--------
You can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py .
