How can I incorporate external quantitative information in the model? #1603
No problem! Let's start at the beginning: BERTopic is essentially a clustering task, as described here, which means that we aim to cluster the input data. Since BERTopic converts documents to embeddings, these are used as the main input. However, you could extend the input to anything you are interested in. For instance, instead of the embeddings you could use metadata to perform the clustering. Moreover, it might even be possible to concatenate the embeddings with the metadata. Do note, though, that this would require some sort of projection or scaling to make sure that the values fall in the same range and are comparable.
This would require training a separate classifier on top of the generated topics.
Doing this jointly in a clustering task depends solely on the input that you give it, which, as mentioned above, is the embeddings. You would need to enrich the embeddings with metadata in order to achieve this.
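As a rough illustration of that enrichment idea, the sketch below scales the sentence embeddings and some (here purely synthetic) numeric metadata into a comparable range, stacks them, and passes the result to BERTopic as pre-computed embeddings. The dataset, the random metadata, and the scaling choice are assumptions for illustration, not part of the original discussion.

```python
# Sketch: cluster on document embeddings concatenated with scaled numeric metadata.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import StandardScaler
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]

# Purely synthetic metadata (e.g. age, education level) just for illustration.
rng = np.random.default_rng(42)
metadata = rng.normal(size=(len(docs), 3))

# Embed the documents.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=False)

# Scale both parts so their values fall in a comparable range, then concatenate.
scaled_embeddings = StandardScaler().fit_transform(embeddings)
scaled_metadata = StandardScaler().fit_transform(metadata)
enriched = np.hstack([scaled_embeddings, scaled_metadata])

# Pass the enriched matrix as pre-computed embeddings; UMAP + HDBSCAN then
# cluster on the combined representation.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings=enriched)
```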
Hi @MaartenGr, thanks for the detailed response! I think it is a brilliant idea to concatenate the textual data with the metadata. To this end, are you aware of any project implementing this idea? What I want to achieve in my project is to detect deviations of the topic distribution from the distribution (potentially) inferred from the metadata. Based on your suggestions, I currently have two ideas in mind:
Which one sounds more reasonable to you? Thanks!
No, I am not aware of any project that works this way. There is, however, an issue detailing how to do something similar with document covariates (#360).
That seems like a reasonable approach. Do note, though, that cosine similarity tends to work better for high-dimensional data. Also, there is the possibility of extracting topics per class, as described here. I would advise trying out both options, since they both seem to represent the problem well.
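For the topics-per-class route, a minimal sketch might look like the following, reusing `topic_model`, `docs`, and `metadata` from the earlier snippet; the class labels here are an illustrative binning of the synthetic metadata, not something from the thread.

```python
# Sketch: compare topic frequencies across a categorical metadata field.
# The labels are derived from the synthetic metadata above, purely for illustration.
classes = ["high education" if value > 0 else "low education" for value in metadata[:, 0]]

# One row per (topic, class) with class-specific frequencies and top words.
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
print(topics_per_class.head())

# Optionally visualize how topic frequencies differ between the classes.
fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10)
fig.show()
```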
Thanks so much for the suggestions and for sharing this thread! It is exactly relevant to my issue. I used to approach my goal with the STM package in R, which did not give satisfactory results. I assumed there was a lot of measurement error, which made me want to switch to other topic models such as BERTopic, which leverages state-of-the-art transformers. I appreciate you taking the time to share your knowledge. I will read carefully through the thread and do some experiments myself. I will close this issue for now; if I run into further questions I cannot solve, I will follow up.
Hi there, thanks for the great project!
I'm wondering if there is a way to incorporate external quantitative information into the model. I assume it has something to do with Multimodal Topic Modeling, but it seems that only images are explicitly supported. On top of that, I would like to use the model to infer the topic distribution for a new document given only its quantitative information.
For instance, one task would be to infer the topics in a person's speech, given his/her age, race, major, education level, etc.
I think it might be better to jointly estimate both the topics themselves and the dependency of the topic distribution on the quantitative data, rather than building several classifiers/regressors to predict the topic distribution after a topic model has been fitted.
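For comparison, the two-stage alternative mentioned here (and in the reply above about training a separate classifier on top of the generated topics) could look roughly like the sketch below, reusing `topic_model`, `topics`, and `metadata` from the earlier snippet; the choice of logistic regression and the feature values are assumptions for illustration only.

```python
# Sketch of the two-stage alternative: fit BERTopic first, then train a
# separate classifier that predicts a document's topic from its metadata.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.asarray(metadata)   # quantitative features (age, education level, ...)
y = np.asarray(topics)     # topic assigned by BERTopic to each document

# Drop outlier documents (topic -1) for this illustration.
mask = y != -1
X_train, X_test, y_train, y_test = train_test_split(X[mask], y[mask], random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))

# Predicted topic probabilities for a "document" described only by metadata.
new_metadata = np.array([[0.5, -1.2, 0.3]])  # hypothetical feature values
print(dict(zip(clf.classes_, clf.predict_proba(new_metadata)[0])))
```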
I'm completely new to this model, so thanks in advance for bearing with me on this newbie question.