LDAvis integration #136

Closed
paulthemagno opened this issue Feb 17, 2021 · 4 comments

paulthemagno commented Feb 17, 2021

A useful feature would be an integration with LDAvis to visualize the clusters.

For example, I'm trying to use pyLDAvis by passing values to its prepare function, and I would like to understand which values from Top2Vec to give it (a sketch of the full call follows the list below). It needs:

  • topic_term_dists : array-like, shape (n_topics, n_terms)
    Matrix of topic-term probabilities, where n_terms is len(vocab).
  • doc_topic_dists : array-like, shape (n_docs, n_topics)
    Matrix of document-topic probabilities.
  • doc_lengths : array-like, shape n_docs
    The length of each document, i.e. the number of words in each document. The order of the numbers should be consistent with the ordering of the docs in doc_topic_dists.
  • vocab : array-like, shape n_terms
    List of all the words in the corpus used to train the model.
  • term_frequency : array-like, shape n_terms
    The count of each particular term over the entire corpus. The ordering of these counts should correspond with vocab and topic_term_dists.

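For context, here is a minimal sketch of the full prepare call, assuming docs is the list of raw document strings the model was trained on and that topic_term_dists and doc_topic_dists are built as discussed below. The whitespace split is a naive stand-in; the same tokenizer that produced model.vocab should be used so the counts line up:

import pyLDAvis
from collections import Counter

vocab = model.vocab

# naive whitespace tokenization, swap in the tokenizer used for training
tokenized = [doc.split() for doc in docs]

# number of tokens per document, in the same order as doc_topic_dists
doc_lengths = [len(tokens) for tokens in tokenized]

# corpus-wide count of each vocabulary term, aligned with vocab
counts = Counter(token for tokens in tokenized for token in tokens)
term_frequency = [counts[w] for w in vocab]

vis = pyLDAvis.prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
pyLDAvis.display(vis)
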
For topic_term_dists I thought of doing something like this:

import numpy as np
from tqdm import tqdm

reduced = True
model = my_model

vocab = model.vocab
num_topics = model.get_num_topics(reduced=reduced)
topic_term_dists = np.empty([num_topics, len(vocab)])

for i, w in enumerate(tqdm(vocab, total=len(vocab))):
    # search_topics returns topics sorted by score, so use topic_nums
    # to place each score in the row of the topic it belongs to
    _, _, topic_scores, topic_nums = model.search_topics(keywords=[w], num_topics=num_topics, reduced=reduced)
    topic_term_dists[topic_nums, i] = topic_scores
print(topic_term_dists)
array([[ 0.        ,  0.        ,  0.        , ...,  0.01383832,
         0.03594964,  0.01575022],
       [ 0.14129441,  0.13531766,  0.1144332 , ...,  0.00259366,
         0.02056807,  0.01538985],
       [ 0.        ,  0.        ,  0.        , ...,  0.00084448,
         0.01941805, -0.00048536],
       ...,
       [ 0.02265248,  0.04201523,  0.05276711, ..., -0.04817701,
        -0.02513617, -0.04259174],
       [ 0.        ,  0.        ,  0.        , ..., -0.04851739,
        -0.03064794, -0.0465236 ],
       [ 0.05836846, -0.01507987, -0.02540753, ..., -0.05943062,
        -0.03319241, -0.04708139]])

So I pass each term of the vocab to the search_topics function as a single keyword to get the cosine similarity (as I read in the description of the method) for each term-topic pair. The problem is that some values are negative (isn't cosine similarity in a range between 0 and 1?), whereas I expected all positive values with a total sum of 1.

For the second parameter, doc_topic_dists, can I use search_documents_by_vector or the _calculate_documents_topic function, passing each topic vector to it?
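
In case it helps to make that concrete, here is a rough sketch of the search_documents_by_vector idea. Caveats: I'm assuming the method accepts return_documents=False and returns (doc_scores, doc_ids), that default integer document ids are in use, and that the reduced topic vectors live in topic_vectors_reduced; all of this should be checked against the installed Top2Vec version:

n_docs = len(model.document_ids)
n_topics = model.get_num_topics(reduced=reduced)
doc_topic_dists = np.zeros([n_docs, n_topics])

topic_vectors = model.topic_vectors_reduced if reduced else model.topic_vectors
for t, vec in enumerate(topic_vectors):
    # score every document against this topic's vector
    doc_scores, doc_ids = model.search_documents_by_vector(vec, num_docs=n_docs, return_documents=False)
    # with default ids, doc_ids double as row indices; custom ids would need mapping
    doc_topic_dists[doc_ids, t] = doc_scores

The rows would still need to be turned into probabilities before going into pyLDAvis.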

paulthemagno (Author) commented

@ddangelov any suggestions about this?

ddangelov (Owner) commented

For the cosine similarities, you could pass all of the values through a softmax; this will resolve the problem with negative values.
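
For what it's worth, a minimal NumPy sketch of that, applied row-wise so each topic's term distribution sums to 1, which is what pyLDAvis expects:

import numpy as np

def softmax(x, axis=-1):
    # subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# each row (topic) becomes a probability distribution over the vocabulary
topic_term_dists = softmax(topic_term_dists, axis=1)

The same call with axis=1 works for doc_topic_dists, so that each document's row sums to 1.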

For doc_topic_dists, the _calculate_documents_topic function should work.
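
A rough sketch of that route, with heavy hedging: _calculate_documents_topic is a private helper, so its exact signature may differ between versions. I'm assuming it takes topic vectors and document vectors and returns, for each document, the index of its closest topic and the corresponding similarity, and that _get_document_vectors exists to fetch the trained document vectors:

doc_top, doc_dist = model._calculate_documents_topic(model.topic_vectors, model._get_document_vectors())

n_docs = len(doc_top)
doc_topic_dists = np.zeros([n_docs, model.get_num_topics()])
# each document row gets a single non-zero entry: its closest topic
doc_topic_dists[np.arange(n_docs), doc_top] = doc_dist

Note this gives each document exactly one non-zero topic entry, which is the sparsity concern raised below.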

russelldc commented

@paulthemagno Did you get anywhere with this?

Correct me if I'm wrong, @ddangelov, but as far as I understand, we would need issue #141 to be resolved before we could use the _calculate_documents_topic() function for pyLDAvis. Otherwise, doc_topic_dists would be a very sparse matrix, where each "document row" in the matrix contains only a single non-zero element. To be honest, I'm not sure if that's actually a problem or not, but it seems like it would be...

ddangelov (Owner) commented

Yes, you would benefit from issue #141 being resolved.
