Cluster the document based on their topics and distribute the code task evenly by its clustered topic #175

YanLiang1102 · 2017-06-21T17:12:59Z

No description provided.


YanLiang1102 · 2017-06-21T17:18:48Z

Step 1: I am training a gensim LDA model on the data. If we set the number of topics to 20, then after training each document gets a 20-dimensional vector, where each element is the probability that the document belongs to the corresponding topic.

I then use this vector as the document's feature vector and feed it into the sklearn KMeans model. The resulting cluster index for each document is stored in a dictionary keyed by document ID.
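
A minimal sketch of step 1, not the code from the commits. The function names, the default of 5 clusters, and the pre-tokenized input are assumptions made for illustration. gensim returns topic probabilities as a sparse list of `(topic_id, probability)` pairs, so the helper expands them into a fixed-length vector before KMeans sees them.

```python
def dense_topic_vector(sparse_topics, num_topics):
    """Expand gensim's sparse [(topic_id, prob), ...] output to a fixed-length list."""
    vector = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        vector[topic_id] = float(prob)
    return vector


def cluster_documents(tokenized_docs, doc_ids, num_topics=20, num_clusters=5):
    """Return {doc_id: cluster_index} from LDA topic vectors clustered with KMeans."""
    # Imported lazily so the helper above works without these packages installed.
    from gensim import corpora
    from gensim.models import LdaModel
    from sklearn.cluster import KMeans

    # Bag-of-words corpus built from already tokenized documents (an assumption).
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=0)

    # minimum_probability=0.0 asks gensim to keep low-probability topics too.
    features = [
        dense_topic_vector(
            lda.get_document_topics(bow, minimum_probability=0.0), num_topics)
        for bow in corpus
    ]
    labels = KMeans(n_clusters=num_clusters, n_init=10,
                    random_state=0).fit_predict(features)
    return {doc_id: int(label) for doc_id, label in zip(doc_ids, labels)}
```

`num_clusters` does not have to equal `num_topics`; how many clusters make sense for our documents is something we would need to tune.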

Step 2: We need to store this information in our MongoDB and come up with an algorithm that distributes the documents evenly across coders by cluster.
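
One possible reading of "evenly", sketched under assumptions: deal each cluster's documents to coders round-robin, so every coder ends up with a similar number of documents and a similar topic mix. The function name, the coder list, and sortable document IDs are hypothetical, and the algorithm we settle on may differ.

```python
from collections import defaultdict
from itertools import cycle


def assign_evenly(doc_clusters, coders):
    """Deal documents to coders round-robin, one cluster at a time.

    doc_clusters: {doc_id: cluster_index}, as produced in step 1.
    coders: list of coder names (hypothetical).
    Returns {coder: [doc_id, ...]}.
    """
    # Group document IDs by their cluster index.
    by_cluster = defaultdict(list)
    for doc_id, cluster in doc_clusters.items():
        by_cluster[cluster].append(doc_id)

    assignments = {coder: [] for coder in coders}
    # The cycle keeps running across clusters, so total workloads differ by at most one.
    next_coder = cycle(coders)
    for cluster in sorted(by_cluster):
        for doc_id in sorted(by_cluster[cluster]):
            assignments[next(next_coder)].append(doc_id)
    return assignments
```

For example, five documents in two clusters split between two coders:

```python
print(assign_evenly({"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 1}, ["a", "b"]))
# {'a': ['d1', 'd3', 'd5'], 'b': ['d2', 'd4']}
```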

Something to think about: letting users choose from the interface which topic they want to code.
@ahalterman @cegme

YanLiang1102 · 2017-06-21T17:22:27Z

Training the model now; this is how heavily it is using the resources. :)
@cegme @ahalterman


YanLiang1102 · 2017-06-21T18:53:22Z

Set the default topic to 10000 for every document, from the mongo command line (the last two arguments are upsert=false and multi=true):
db.your_collection.update({},{$set : {"new_field":1}},false,true)

Steps to migrate a db collection from one server to another:
mongodump --db eventData --collection documents_arabic --port ... --out ...
scp -r ./eventData yan@server:/home/yan/dump/lexisnexis/eventData
sudo mongorestore -d dbname -c collectionname --port ... -u username -p password
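
For reference, the default-topic update above could also be run from Python. This is a sketch assuming pymongo; the field name `topic` is a placeholder, since the shell command only shows the generic `new_field`.

```python
def set_default_topic(collection, field="topic", default=10000):
    """Set `field` to `default` on every document in a pymongo collection.

    `field` is a placeholder name; use whatever field holds the topic index.
    """
    # An empty filter matches all documents; update_many plays the role of multi=true.
    result = collection.update_many({}, {"$set": {field: default}})
    return result.modified_count


# Usage (host and port are placeholders, not values from this thread):
# from pymongo import MongoClient
# docs = MongoClient("localhost", 27017)["eventData"]["documents_arabic"]
# print(set_default_topic(docs))
```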


YanLiang1102 · 2017-06-23T19:26:28Z

Training on the English documents took about 3 hours. I want to play with gensim's parallel training, just for fun, to see how much faster it can get.
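
A rough sketch of what the parallel experiment might look like, assuming gensim's `LdaMulticore` is the "parallel part" in question. The helper names are hypothetical. Note that `os.cpu_count()` reports logical cores, so the worker count here is only a starting point, and any speedup would have to be measured on our data.

```python
import os


def pick_workers():
    """Leave one core free for the main process; never go below one worker."""
    return max(1, (os.cpu_count() or 2) - 1)


def train_lda_parallel(corpus, dictionary, num_topics=20, workers=None):
    """Train LDA with gensim's multi-process implementation."""
    # Imported lazily so pick_workers() can be used without gensim installed.
    from gensim.models import LdaMulticore

    return LdaMulticore(corpus=corpus, id2word=dictionary,
                        num_topics=num_topics,
                        workers=workers or pick_workers())
```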


YanLiang1102 self-assigned this Jun 21, 2017

YanLiang1102 added the enhancement label Jun 21, 2017

YanLiang1102 added a commit that referenced this issue Jun 21, 2017

#175 add in code for training with gensim using lda model, then based…

4373dea

… on lda model result running sklearn kmeans to get our cluser result indexed by the documentIds

YanLiang1102 changed the title ~~Cluster the document based on their topics and distribute the code task evenly distributed by cluster~~ Cluster the document based on their topics and distribute the code task evenly by its clustered topic Jun 21, 2017

YanLiang1102 added a commit that referenced this issue Jun 21, 2017

Merge pull request #176 from oudalab/yan-dev

661f1fb

#175 add in code for training with gensim using lda model, then based…

YanLiang1102 added a commit that referenced this issue Jun 22, 2017

#175 make the topic model interface fucntional and need coders to giv…

c62f8d4

…e topic name

YanLiang1102 added a commit that referenced this issue Jun 23, 2017

#180 #175 building arabic actor dictionary by using existing english …

9cf9036

…dictionray and update teh lda-kemans code

YanLiang1102 added a commit that referenced this issue Jun 29, 2017

Merge pull request #177 from oudalab/yan-dev

fb5ceba

#175 make the topic model interface fucntional and need coders to giv…
