Cluster the documents based on their topics and distribute the coding tasks evenly by clustered topic #175
Comments
… on lda model result running sklearn kmeans to get our cluster result indexed by the documentIds
Step 1: I am training a gensim LDA model on the data. If we set the number of topics to 20, then after training we get a 20-dimensional vector for each document, where each element is the probability that the document belongs to the corresponding topic. I then use that vector as the document's feature vector and feed it into sklearn's KMeans. The resulting cluster index is stored in a dictionary keyed by document ID.
Step 2: Something to think about: letting users choose, from the interface, which topic they want to code.
Training the model now; this is how heavily it is using the resources. :)
#175 add in code for training with gensim using lda model, then based…
Set the default topic to 10000 for each document, from the mongo command line.
Training on the English documents took about 3 hours. Need to think about playing with gensim's parallel training, just for fun, to see how much faster it can get.
…dictionary and update the lda-kmeans code
#175 make the topic model interface functional and need coders to giv…
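The thread does not spell out how the coding tasks get distributed evenly by clustered topic. A minimal sketch, assuming a simple round-robin over each cluster's documents; `assign_documents` and the coder names are hypothetical:

```python
from collections import defaultdict
from itertools import cycle


def assign_documents(clusters, coders):
    """Split each cluster's documents across coders in round-robin order.

    `clusters` maps document ID -> cluster index; `coders` is a list of names.
    Returns {coder: [document IDs]}.
    """
    if not coders:
        raise ValueError("at least one coder is required")

    # Group document IDs by their cluster index.
    by_cluster = defaultdict(list)
    for doc_id, label in clusters.items():
        by_cluster[label].append(doc_id)

    assignments = {coder: [] for coder in coders}
    # One rotation shared across clusters keeps both the per-cluster
    # and the overall counts within one document of each other.
    turn = cycle(coders)
    for label in sorted(by_cluster):
        for doc_id in by_cluster[label]:
            assignments[next(turn)].append(doc_id)
    return assignments
```

With two coders and two clusters of three documents each, every coder ends up with three documents, drawn from both clusters.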
No description provided.