LDAvis integration #136

Closed
paulthemagno opened this issue Feb 17, 2021 · 4 comments

paulthemagno commented Feb 17, 2021

A useful feature would be an integration with LDAvis to visualize the clusters.

For example, I'm trying to use pyLDAvis by passing values to its prepare function, and I would like to understand which values from Top2Vec to give it (a sketch of the full call follows the list below). It needs:

  • topic_term_dists : array-like, shape (n_topics, n_terms)
    Matrix of topic-term probabilities, where n_terms is len(vocab).
  • doc_topic_dists : array-like, shape (n_docs, n_topics)
    Matrix of document-topic probabilities.
  • doc_lengths : array-like, shape n_docs
    The length of each document, i.e. the number of words in each document. The order of the numbers should be consistent with the ordering of the docs in doc_topic_dists.
  • vocab : array-like, shape n_terms
    List of all the words in the corpus used to train the model.
  • term_frequency : array-like, shape n_terms
    The count of each particular term over the entire corpus. The ordering of these counts should correspond with vocab and topic_term_dists.

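For context, here is a minimal sketch of the full prepare call, assuming docs is the list of raw document strings the model was trained on and that topic_term_dists and doc_topic_dists are built as discussed below. The whitespace split is a naive stand-in; the same tokenizer that produced model.vocab should be used so the counts line up:

import pyLDAvis
from collections import Counter

vocab = model.vocab

# naive whitespace tokenization, swap in the tokenizer used for training
tokenized = [doc.split() for doc in docs]

# number of tokens per document, in the same order as doc_topic_dists
doc_lengths = [len(tokens) for tokens in tokenized]

# corpus-wide count of each vocabulary term, aligned with vocab
counts = Counter(token for tokens in tokenized for token in tokens)
term_frequency = [counts[w] for w in vocab]

vis = pyLDAvis.prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
pyLDAvis.display(vis)
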
For topic_term_dists I thought of doing something like this:

import numpy as np
from tqdm import tqdm

reduced = True
model = my_model

vocab = model.vocab
num_topics = model.get_num_topics(reduced=reduced)
topic_term_dists = np.empty([num_topics, len(vocab)])

for i, w in enumerate(tqdm(vocab, total=len(vocab))):
    # search_topics returns topics sorted by score, so use topic_nums
    # to place each score in the row of the topic it belongs to
    _, _, topic_scores, topic_nums = model.search_topics(keywords=[w], num_topics=num_topics, reduced=reduced)
    topic_term_dists[topic_nums, i] = topic_scores
print(topic_term_dists)
array([[ 0.        ,  0.        ,  0.        , ...,  0.01383832,
         0.03594964,  0.01575022],
       [ 0.14129441,  0.13531766,  0.1144332 , ...,  0.00259366,
         0.02056807,  0.01538985],
       [ 0.        ,  0.        ,  0.        , ...,  0.00084448,
         0.01941805, -0.00048536],
       ...,
       [ 0.02265248,  0.04201523,  0.05276711, ..., -0.04817701,
        -0.02513617, -0.04259174],
       [ 0.        ,  0.        ,  0.        , ..., -0.04851739,
        -0.03064794, -0.0465236 ],
       [ 0.05836846, -0.01507987, -0.02540753, ..., -0.05943062,
        -0.03319241, -0.04708139]])

So I pass each term of the vocab to the search_topics function as a single keyword to get the cosine similarity (as I read in the description of the method) for each term-topic pair. The problem is that some values are negative (isn't cosine similarity in a range between 0 and 1?), whereas I expected all positive values with a total sum of 1.

For the second parameter, doc_topic_dists, can I use search_documents_by_vector or the _calculate_documents_topic function, passing each topic vector to it?
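
In case it helps to make that concrete, here is a rough sketch of the search_documents_by_vector idea. Caveats: I'm assuming the method accepts return_documents=False and returns (doc_scores, doc_ids), that default integer document ids are in use, and that the reduced topic vectors live in topic_vectors_reduced; all of this should be checked against the installed Top2Vec version:

n_docs = len(model.document_ids)
n_topics = model.get_num_topics(reduced=reduced)
doc_topic_dists = np.zeros([n_docs, n_topics])

topic_vectors = model.topic_vectors_reduced if reduced else model.topic_vectors
for t, vec in enumerate(topic_vectors):
    # score every document against this topic's vector
    doc_scores, doc_ids = model.search_documents_by_vector(vec, num_docs=n_docs, return_documents=False)
    # with default ids, doc_ids double as row indices; custom ids would need mapping
    doc_topic_dists[doc_ids, t] = doc_scores

The rows would still need to be turned into probabilities before going into pyLDAvis.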

paulthemagno (Author) commented

@ddangelov any suggestions about this?

ddangelov (Owner) commented

For the cosine similarities, you could pass all of the values through a softmax; this will resolve the problem with negative values.
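
For what it's worth, a minimal NumPy sketch of that, applied row-wise so each topic's term distribution sums to 1, which is what pyLDAvis expects:

import numpy as np

def softmax(x, axis=-1):
    # subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# each row (topic) becomes a probability distribution over the vocabulary
topic_term_dists = softmax(topic_term_dists, axis=1)

The same call with axis=1 works for doc_topic_dists, so that each document's row sums to 1.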

For doc_topic_dists, the _calculate_documents_topic function should work.
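
A rough sketch of that route, with heavy hedging: _calculate_documents_topic is a private helper, so its exact signature may differ between versions. I'm assuming it takes topic vectors and document vectors and returns, for each document, the index of its closest topic and the corresponding similarity, and that _get_document_vectors exists to fetch the trained document vectors:

doc_top, doc_dist = model._calculate_documents_topic(model.topic_vectors, model._get_document_vectors())

n_docs = len(doc_top)
doc_topic_dists = np.zeros([n_docs, model.get_num_topics()])
# each document row gets a single non-zero entry: its closest topic
doc_topic_dists[np.arange(n_docs), doc_top] = doc_dist

Note this gives each document exactly one non-zero topic entry, which is the sparsity concern raised below.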

russelldc commented

@paulthemagno Did you get anywhere with this?

Correct me if I'm wrong, @ddangelov, but as far as I understand, we would need issue #141 to be resolved before we could use the _calculate_documents_topic() function for pyLDAvis. Otherwise, doc_topic_dists would be a very sparse matrix, where each "document row" in the matrix contains only a single non-zero element. To be honest, I'm not sure if that's actually a problem or not, but it seems like it would be...

ddangelov (Owner) commented

Yes, you would benefit from issue #141 being resolved.
