
How can I incorporate external quantitative information in the model? #1603

Closed

d-jiao opened this issue Oct 30, 2023 · 4 comments

@d-jiao

d-jiao commented Oct 30, 2023

Hi there, thanks for the great project!

I'm wondering if there is a way to incorporate external quantitative information into the model. I assume it has something to do with Multimodal Topic Modeling, but it seems that only images are explicitly supported. On top of this, I would like to use the model to infer the topic distribution for a new document given only its quantitative information.

For instance, one task would be to infer the topics in a person's speech, given his/her age, race, major, education level, etc.

I think it might be better to jointly estimate the dependency of the topic distribution on the quantitative data along with the topics themselves, rather than building several classifiers/regressors to predict the topic distribution after a topic model has been fitted.

I'm completely new to this model, so thanks in advance for bearing with me on this newbie question.

@MaartenGr
Owner

> I'm completely new to this model, so thanks in advance for bearing with me on this newbie question.

No problem! Let's start at the beginning. BERTopic is essentially a clustering task, as described here, which means that we aim to cluster input data. Since BERTopic converts documents to embeddings, these are used as the main input. However, you could extend the input to anything you are interested in. For instance, instead of using primarily embeddings, you could use metadata to perform the clustering. It might even be possible to concatenate the embeddings with the metadata. Do note, though, that this would require some sort of projection or scaling to make sure the values fall in the same range and are comparable.
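For concreteness, here is a minimal sketch of what that concatenation could look like. The stand-in data, the metadata columns, and the `meta_weight` weighting knob are assumptions for illustration, not BERTopic features; the only BERTopic-specific part is that `fit_transform` accepts precomputed embeddings.

```python
# Sketch: concatenate document embeddings with scaled metadata and pass
# the combined matrix to BERTopic as precomputed embeddings.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import StandardScaler
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Stand-in data; real use would supply your own documents and metadata
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:1000]
rng = np.random.default_rng(42)
metadata = rng.normal(size=(len(docs), 2))  # e.g., age, years of education

# Embed the documents as usual
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# Scale the metadata so its values are comparable to the embedding values
scaler = StandardScaler().fit(metadata)
scaled_meta = scaler.transform(metadata)

# Assumed knob: controls how strongly the metadata drives the clustering
meta_weight = 0.5
combined = np.hstack([embeddings, meta_weight * scaled_meta])

# BERTopic accepts precomputed embeddings, so the combined matrix is what
# gets reduced and clustered instead of the text embeddings alone
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings=combined)
```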

> For instance, one task would be to infer the topics in a person's speech, given his/her age, race, major, education level, etc.

This would require training a separate classifier on top of the generated topics.
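A sketch of that route, assuming scikit-learn's `LogisticRegression` as the separate classifier and reusing the `docs` and `metadata` arrays from the sketch above:

```python
# Sketch (assumed approach, not a BERTopic feature): fit BERTopic on the
# texts, then learn to predict its topic assignments from metadata alone.
import numpy as np
from sklearn.linear_model import LogisticRegression
from bertopic import BERTopic

# `docs` and `metadata` as in the previous sketch
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

# Map the metadata features to the topics BERTopic assigned
clf = LogisticRegression(max_iter=1000)
clf.fit(metadata, topics)

# For a new person with only metadata, infer a soft topic distribution
new_meta = np.array([[0.3, -1.2]])  # hypothetical feature values
topic_distribution = clf.predict_proba(new_meta)
```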

> I think it might be better to jointly estimate the dependency of the topic distribution on the quantitative data along with the topics themselves, rather than building several classifiers/regressors to predict the topic distribution after a topic model has been fitted.

Doing this jointly in a clustering task depends solely on the input you give it, which, as mentioned above, is the embeddings. You would need to enrich the embeddings with metadata to achieve this.

@d-jiao
Author

d-jiao commented Oct 30, 2023

Hi @MaartenGr,

Thanks for the detailed response! I think it's a brilliant idea to concatenate the textual data with the metadata. To this end, are you aware of any project implementing this idea?

What I want to achieve in my project is to detect the deviation of the actual topic distributions from the distribution (potentially) inferred from the metadata. So currently I have two ideas in mind, based on your suggestions:

  • Option 1: 1) estimate the topics with BERTopic using texts exclusively; 2) build a softmax-like classifier with the metadata on top of the output of BERTopic; 3) calculate the out-of-sample (OOS) deviation using, say, the Euclidean distance between the real distribution based on the texts and the likelihood inferred from the classifier.
  • Option 2: 1) estimate the topics with texts and metadata concatenated; 2) OOS: infer the topic distribution using the text and metadata; 3) OOS: infer the topic distribution using the metadata and the in-sample average of the text embeddings; 4) calculate, say, the Euclidean distance between the distributions from 2) and 3) (see the sketch after this list).
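A sketch of Option 2's out-of-sample steps, reusing the `docs`, `embeddings`, `scaler`, `meta_weight`, and `combined` names from the concatenation sketch in the previous comment; the new-document data is hypothetical, and `calculate_probabilities=True` is assumed so that `transform()` returns a full topic distribution per document:

```python
# Sketch of option 2 (assumed workflow): compare the topic distribution
# inferred from text + metadata with one where the text signal is neutralized.
import numpy as np
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# calculate_probabilities=True makes transform() return full distributions
topic_model = BERTopic(calculate_probabilities=True)
topic_model.fit(docs, embeddings=combined)

# Hypothetical out-of-sample observation
new_docs = ["an unseen speech ..."]
new_meta = np.array([[0.3, -1.2]])
new_embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(new_docs)

# 2) OOS distribution from text + metadata
new_combined = np.hstack([new_embeddings, meta_weight * scaler.transform(new_meta)])
_, probs_text = topic_model.transform(new_docs, embeddings=new_combined)

# 3) OOS distribution with the text part replaced by the in-sample mean embedding
mean_emb = np.tile(embeddings.mean(axis=0), (len(new_docs), 1))
meta_only = np.hstack([mean_emb, meta_weight * scaler.transform(new_meta)])
_, probs_meta = topic_model.transform(new_docs, embeddings=meta_only)

# 4) Per-document deviation, e.g. Euclidean distance between the distributions
deviation = np.linalg.norm(probs_text - probs_meta, axis=1)
```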

Which one would sound more reasonable to you?

Thanks,
djiao

@MaartenGr
Owner

> Thanks for the detailed response! I think it's a brilliant idea to concatenate the textual data with the metadata. To this end, are you aware of any project implementing this idea?

No, I am not aware of any project that works this way. There is, however, an issue detailing how to do something similar, but with document covariates (#360).

> Option 1: 1) estimate the topics with BERTopic using texts exclusively; 2) build a softmax-like classifier with the metadata on top of the output of BERTopic; 3) calculate the out-of-sample (OOS) deviation using, say, the Euclidean distance between the real distribution based on the texts and the likelihood inferred from the classifier.

That seems like a reasonable approach. Do note, though, that cosine similarity tends to work better for high-dimensional data. There is also the possibility to extract topics per class, as described here.
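For reference, a short sketch of both points: the topics-per-class computation is built into BERTopic, while the cosine comparison is shown on toy distribution vectors; the `classes` labels are stand-ins for your metadata categories.

```python
# Topics per class: compare topic representations across a metadata category,
# plus the cosine alternative to Euclidean distance mentioned above.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from bertopic import BERTopic

# `docs` as before; one (assumed) class label per document, e.g. education level
classes = ["bachelor" if i % 2 == 0 else "master" for i in range(len(docs))]

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

topics_per_class = topic_model.topics_per_class(docs, classes=classes)
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10)

# Cosine similarity between two topic distributions p and q (toy vectors here)
p, q = np.array([[0.7, 0.2, 0.1]]), np.array([[0.5, 0.3, 0.2]])
similarity = cosine_similarity(p, q)[0, 0]
```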

I would advise trying out both options here since they both seem to represent the problem well.

@d-jiao
Author

d-jiao commented Nov 1, 2023

Thanks so much for the suggestions and for sharing this thread! It is exactly relevant to my issue. I used to pursue this goal with the STM package in R, which did not provide satisfactory results. I suspected there was a lot of measurement error, which made me want to switch to other topic models, e.g., BERTopic, which leverages state-of-the-art transformers.

I appreciate you taking the time to share your knowledge. I will read carefully through the thread and do some experiments myself. For now, I will close this issue for your convenience and reopen it if I run into further questions I cannot solve.

@d-jiao d-jiao closed this as completed Nov 1, 2023