Using the model for document predictions #543
Hi Maarten,

I know that this question has come up innumerable times, and I've been scanning through the issues, but I just want to make sure I'm not missing anything. If we want a rating of the dominant topic for each document, do we just use the matrix produced when `calculate_probabilities` is set to `True`? If I got that wrong, could you point me to the issue threads where this is addressed?

Also, somewhat related: in the last couple of months I remember reading a post (Medium?) where the author used BERTopic as part of a larger process to develop topic keywords and then used those words to produce document probabilities (represented by TF-IDF scores?) to categorize individual documents. Does any of that ring a bell? I can't find the link anywhere.

Thanks in advance!
Yes, that is definitely a common way of approaching this specific use case. By setting `calculate_probabilities=True`, you get a topic-document probability matrix from which you can read off the dominant topic for each document. Having said that, I believe that the most accurate way of doing this is by splitting up your documents into sentences. Although this does not hold for all sentences, a sentence typically holds a single topic. Thus, by splitting up the documents into sentences and passing those to BERTopic, we can simply count how often certain topics appear in each document by counting the related sentences.
Hmmm, that does not ring a bell, unfortunately. Do you have a use case in mind that you want to use it for?
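A minimal sketch of the sentence-splitting approach described above; the NLTK tokenizer and the counting scheme are illustrative assumptions, not anything prescribed in the thread:

```python
# Rough sketch: fit BERTopic on sentences, then count topics per document.
# Assumes NLTK for sentence splitting; any sentence splitter would work.
from collections import Counter

import nltk
from bertopic import BERTopic

nltk.download("punkt")

# Toy documents; BERTopic needs far more sentences than this to fit well.
docs = [
    "Some document about one theme. Another sentence on the same theme.",
    "A second document. It may mix several themes across its sentences.",
]

# Split documents into sentences while remembering their source document.
sentences, doc_ids = [], []
for doc_id, doc in enumerate(docs):
    for sentence in nltk.sent_tokenize(doc):
        sentences.append(sentence)
        doc_ids.append(doc_id)

# Fit on sentences instead of whole documents.
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(sentences)

# For each document, count how often each topic appears among its sentences.
topic_counts = [Counter() for _ in docs]
for doc_id, topic in zip(doc_ids, topics):
    topic_counts[doc_id][topic] += 1
```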
Thanks for confirming that!
Interesting. I suppose I should be able to call `transform` on the sentences? I thought that BERT tends not to work well on short text? As for the TF-IDF idea, I'm playing around with different ways of scoring text. However, at this point I think I've convinced myself that extracting the vocabulary built with BERTopic and then using TF-IDF to do the scoring doesn't make much sense. I'm pretty sure I saw an article asserting that this was a viable strategy.
Actually, the base model that is being used is based on sentence-transformers, which are trained specifically to produce accurate embeddings for short, sentence-level texts, so working with sentences should not be a problem.
The one thing that might be interesting is to use the fitted c-TF-IDF model on documents instead of the traditional TF-IDF model. That way, you can score individual documents whilst having some information regarding the topics. I have not tried it out myself extensively, apart from calculating covariates. It goes something like this:

```python
X = topic_model.vectorizer_model.transform(documents)
c_tf_idf = topic_model.transformer.transform(X)
```

Here, `X` is the bag-of-words representation of the documents, and `c_tf_idf` is a sparse documents-by-vocabulary matrix in which each word is weighted by the fitted c-TF-IDF model.
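One way to turn that document-level c-TF-IDF matrix into per-document topic scores, sketched here as an assumption rather than anything the thread prescribes, is to compare each row against the topic-level c-TF-IDF vectors. The attribute names `ctfidf_model` and `c_tf_idf_` follow recent BERTopic releases and may differ in older ones:

```python
# Hedged sketch: score documents against topics via cosine similarity
# between document-level and topic-level c-TF-IDF vectors.
from sklearn.metrics.pairwise import cosine_similarity

X = topic_model.vectorizer_model.transform(documents)
doc_c_tf_idf = topic_model.ctfidf_model.transform(X)

# `topic_model.c_tf_idf_` holds one c-TF-IDF vector per topic
# (the first row is the -1 outlier topic when it exists).
similarities = cosine_similarity(doc_c_tf_idf, topic_model.c_tf_idf_)
dominant_topic_rows = similarities.argmax(axis=1)
```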
As always, this is really interesting information. I don't think I've seen you refer to using the c-TF-IDF model like this. I will take a look and post here (hopefully within the week) with more questions/results.
I converted my docs to sentences and ran the two lines you suggested. What should I do with the resulting matrix?
That depends on what you want to do with the resulting feature matrix (`c_tf_idf`). What is the end goal you have in mind?
Might be more straightforward if I ask the question differently. You suggested using the fitted `vectorizer_model` and c-TF-IDF model to transform the documents. I broke all my documents into sentences, about 1.1M of them, and then ran those two lines. The result was an enormous sparse matrix. Is that what I should expect, and how would I use it?
Yes, that is what you can expect if you run a TF-IDF-like model. It generates a sparse matrix of size n x m, where n is the number of documents, in your case 1.1M, and m is the size of the vocabulary, which would be 1.5M words. Having said that, it might be worthwhile to use a custom vectorizer and set the `min_df` parameter. The `min_df` parameter sets the minimum document frequency a word needs before it is added to the vocabulary, which can shrink those 1.5M words considerably.
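For illustration, a custom vectorizer might be passed in like this; the specific values are placeholders, not recommendations:

```python
# Hedged sketch: shrink the vocabulary by requiring a minimum
# document frequency before a word enters the vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

vectorizer_model = CountVectorizer(min_df=10, stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```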
I was following your suggestion above to use the fitted c-TF-IDF model on the documents.

What I'm curious about at this point is whether or not the LDA approach of categorizing by dominant topic has a parallel with BERTopic? The obvious issue is that BERTopic tends to "classify" a relatively high percentage of the docs as -1. Of course, with LDA each document has a probability for each topic, but from what I've seen you still get a large number of documents with a very suspect relationship to their dominant topic. The method I first came up with, as I've mentioned previously, is to develop a vocabulary and then create TF-IDF scores based on it. But I'm less than convinced that this is a robust way of dealing with the issue.
You could use the sparse matrix directly as input for a supervised classification algorithm. Support vector machines in particular have worked well, at least in my experience, with sparse data (TF-IDF-like matrices).
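A minimal sketch of that idea, with `c_tf_idf` taken from the snippet earlier in the thread and `labels` as purely hypothetical per-document classes:

```python
# Hedged sketch: train a linear SVM on the sparse c-TF-IDF features.
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# `labels` is a hypothetical array of per-document classes.
X_train, X_test, y_train, y_test = train_test_split(
    c_tf_idf, labels, test_size=0.2
)

clf = LinearSVC()
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```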
There are several ways of reducing the number of outlier documents, but the most effective is either making use of the topic probabilities (with `calculate_probabilities=True`) to re-assign outliers to their most probable topic, or tuning the underlying HDBSCAN model (for example, `min_samples` and `min_cluster_size`) so that fewer documents end up as outliers in the first place.
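The probability-based route might look roughly like this; `docs` is the corpus from earlier in the thread, and the 0.05 threshold is an arbitrary illustration:

```python
# Hedged sketch: re-assign -1 documents to their most probable topic
# when that probability clears a threshold.
import numpy as np
from bertopic import BERTopic

topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

threshold = 0.05
new_topics = [
    int(np.argmax(p)) if topic == -1 and p.max() >= threshold else topic
    for topic, p in zip(topics, probs)
]
```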
Due to inactivity, this issue will be closed. Feel free to ping me if you want to re-open the issue!