Cluster the document based on their topics and distribute the code task evenly by its clustered topic #175

Open
YanLiang1102 opened this issue Jun 21, 2017 · 4 comments
@YanLiang1102
Contributor

No description provided.

@YanLiang1102 YanLiang1102 self-assigned this Jun 21, 2017
YanLiang1102 added a commit that referenced this issue Jun 21, 2017
… on lda model result running sklearn kmeans to get our cluster result indexed by the documentIds
@YanLiang1102
Contributor Author

YanLiang1102 commented Jun 21, 2017

Step 1: I am training a gensim LDA model on the data. Say we set the number of topics to 20: after training, each document gets a 20-dimensional vector, and each element of that vector is the probability that the document belongs to the corresponding topic.

I then use this vector as the feature vector for each document and feed it into the sklearn KMeans model. The resulting cluster index is stored in a dictionary keyed by document ID.

Step 2:
We need to store this information in our MongoDB and come up with an algorithm to distribute the docs evenly by cluster.

Something to think about: let users choose from the interface which topic they want to code.
@ahalterman @cegme
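One possible way to do the even distribution in step 2, sketched under assumptions: the issue does not specify an algorithm, so this simply deals each cluster's documents to coders round-robin. The function name `distribute_by_cluster` and the coder names are hypothetical.

```python
from collections import defaultdict
from itertools import cycle

def distribute_by_cluster(cluster_by_doc, coders):
    """Assign documents to coders so each cluster is spread evenly."""
    # Group document IDs by their cluster index.
    by_cluster = defaultdict(list)
    for doc_id, cluster in sorted(cluster_by_doc.items()):
        by_cluster[cluster].append(doc_id)

    assignments = {coder: [] for coder in coders}
    # A single shared rotation: within any cluster, and overall,
    # per-coder counts differ by at most one.
    next_coder = cycle(coders)
    for cluster in sorted(by_cluster):
        for doc_id in by_cluster[cluster]:
            assignments[next(next_coder)].append(doc_id)
    return assignments

# Usage with made-up data: two clusters of three docs, three coders.
example_clusters = {"d1": 0, "d2": 0, "d3": 0, "d4": 1, "d5": 1, "d6": 1}
example_assignments = distribute_by_cluster(example_clusters, ["alice", "bob", "carol"])
```

Here each coder ends up with one document from each cluster.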

@YanLiang1102
Contributor Author

Training the model now; the image below shows how heavily it is using resources. :)
@cegme @ahalterman
[image]

@YanLiang1102 YanLiang1102 changed the title from "Cluster the document based on their topics and distribute the code task evenly distributed by cluster" to "Cluster the document based on their topics and distribute the code task evenly by its clustered topic" Jun 21, 2017
YanLiang1102 added a commit that referenced this issue Jun 21, 2017
#175 add in code for training with gensim using lda model, then based…
@YanLiang1102
Contributor Author

YanLiang1102 commented Jun 21, 2017

Set the default topic to 10000 for every document, from the mongo command line (generic form; substitute the collection, field name, and value):
db.your_collection.update({},{$set : {"new_field":1}},false,true)

Steps to migrate a db collection from one server to another:
mongodump --db eventData --collection documents_arabic --port ... --out ...
scp -r ./eventData yan@server:/home/yan/dump/lexisnexis/eventData
sudo mongorestore -d dbname -c collectionname --port ... -u username -p password
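For reference, the legacy shell call `update({}, {...}, false, true)` passes `upsert=false` and `multi=true`, so it corresponds to `update_many` in current drivers. A small sketch of the same default-topic update from Python follows; the helper name `set_default_topic` and the field name `topic` are assumptions, since the comment above only shows a `new_field` placeholder.

```python
def set_default_topic(collection, field="topic", default=10000):
    """Set `field` to `default` on every document in `collection`.

    `collection` is expected to behave like a pymongo Collection;
    the field name "topic" is a hypothetical choice.
    """
    # The empty filter {} matches all documents. update_many is the
    # multi=true, upsert=false equivalent of the legacy shell update.
    return collection.update_many({}, {"$set": {field: default}})

# Usage with pymongo (not run here; connection details are placeholders):
#   from pymongo import MongoClient
#   client = MongoClient("localhost", 27017)
#   set_default_topic(client["eventData"]["documents_arabic"])
```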

@YanLiang1102
Contributor Author

Training on the English documents took about 3 hours. I want to play with gensim's parallel training, just for fun, to see how much faster it can get.

YanLiang1102 added a commit that referenced this issue Jun 23, 2017
…dictionary and update the lda-kmeans code
YanLiang1102 added a commit that referenced this issue Jun 29, 2017
#175 make the topic model interface functional and need coders to giv…