
How can I incorporate external quantitative information in the model? #1603

Closed

d-jiao opened this issue Oct 30, 2023 · 4 comments

@d-jiao

d-jiao commented Oct 30, 2023

Hi there, thanks for the great project!

I'm wondering if there is a way to incorporate external quantitative information into the model. I assume it has something to do with Multimodal Topic Modeling, but it seems that only images are explicitly supported. On top of this, I would like to use the model to infer the topic distribution for a new document given only its quantitative information.

For instance, one task would be to infer the topics in a person's speech, given his/her age, race, major, education level, etc.

I think it might be better to jointly estimate the dependency of the topic distribution on the quantitative data along with the topics themselves, rather than building several classifiers/regressors to predict the topic distribution after a topic model has been fitted.

I'm completely new to this model, so thanks in advance for bearing with me on this newbie question.

@MaartenGr
Owner

> I'm completely new to this model, so thanks in advance for bearing with me on this newbie question.

No problem! Let's start at the beginning. BERTopic is essentially a clustering task, as described here, which means that we aim to cluster input data. Since BERTopic converts documents to embeddings, these are used as the main input. However, you could extend the input to anything you are interested in. For instance, instead of using primarily embeddings, you could use metadata to perform the clustering. It might even be possible to concatenate the embeddings with the metadata. Do note, though, that this would require some sort of projection or scaling to make sure the values fall in the same range and are comparable.
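For concreteness, here is a minimal sketch of what that concatenation could look like. The stand-in data, the metadata columns, and the `meta_weight` weighting knob are assumptions for illustration, not BERTopic features; the only BERTopic-specific part is that `fit_transform` accepts precomputed embeddings.

```python
# Sketch: concatenate document embeddings with scaled metadata and pass
# the combined matrix to BERTopic as precomputed embeddings.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import StandardScaler
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Stand-in data; real use would supply your own documents and metadata
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:1000]
rng = np.random.default_rng(42)
metadata = rng.normal(size=(len(docs), 2))  # e.g., age, years of education

# Embed the documents as usual
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# Scale the metadata so its values are comparable to the embedding values
scaler = StandardScaler().fit(metadata)
scaled_meta = scaler.transform(metadata)

# Assumed knob: controls how strongly the metadata drives the clustering
meta_weight = 0.5
combined = np.hstack([embeddings, meta_weight * scaled_meta])

# BERTopic accepts precomputed embeddings, so the combined matrix is what
# gets reduced and clustered instead of the text embeddings alone
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings=combined)
```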

> For instance, one task would be to infer the topics in a person's speech, given his/her age, race, major, education level, etc.

This would require training a separate classifier on top of the generated topics.
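A sketch of that route, assuming scikit-learn's `LogisticRegression` as the separate classifier and reusing the `docs` and `metadata` arrays from the sketch above:

```python
# Sketch (assumed approach, not a BERTopic feature): fit BERTopic on the
# texts, then learn to predict its topic assignments from metadata alone.
import numpy as np
from sklearn.linear_model import LogisticRegression
from bertopic import BERTopic

# `docs` and `metadata` as in the previous sketch
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

# Map the metadata features to the topics BERTopic assigned
clf = LogisticRegression(max_iter=1000)
clf.fit(metadata, topics)

# For a new person with only metadata, infer a soft topic distribution
new_meta = np.array([[0.3, -1.2]])  # hypothetical feature values
topic_distribution = clf.predict_proba(new_meta)
```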

> I think it might be better to jointly estimate the dependency of the topic distribution on the quantitative data along with the topics themselves, rather than building several classifiers/regressors to predict the topic distribution after a topic model has been fitted.

Doing this jointly in a clustering task depends solely on the input you give it, which, as mentioned above, is the embeddings. You would need to enrich the embeddings with metadata to achieve this.

@d-jiao
Author

d-jiao commented Oct 30, 2023

Hi @MaartenGr,

Thanks for the detailed response! I think it's a brilliant idea to concatenate the textual data with the metadata. To this end, are you aware of any project implementing this idea?

What I want to achieve in my project is to detect the deviation of the actual topic distributions from the distribution (potentially) inferred from the metadata. So currently I have two ideas in mind, based on your suggestions:

  • Option 1: 1) estimate the topics with BERTopic using texts exclusively; 2) build a softmax-like classifier with the metadata on top of the output of BERTopic; 3) calculate the out-of-sample (OOS) deviation using, say, the Euclidean distance between the real distribution based on the texts and the likelihood inferred from the classifier.
  • Option 2: 1) estimate the topics with texts and metadata concatenated; 2) OOS: infer the topic distribution using the text and metadata; 3) OOS: infer the topic distribution using the metadata and the in-sample average of the text embeddings; 4) calculate, say, the Euclidean distance between the distributions from 2) and 3) (see the sketch after this list).
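A sketch of Option 2's out-of-sample steps, reusing the `docs`, `embeddings`, `scaler`, `meta_weight`, and `combined` names from the concatenation sketch in the previous comment; the new-document data is hypothetical, and `calculate_probabilities=True` is assumed so that `transform()` returns a full topic distribution per document:

```python
# Sketch of option 2 (assumed workflow): compare the topic distribution
# inferred from text + metadata with one where the text signal is neutralized.
import numpy as np
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# calculate_probabilities=True makes transform() return full distributions
topic_model = BERTopic(calculate_probabilities=True)
topic_model.fit(docs, embeddings=combined)

# Hypothetical out-of-sample observation
new_docs = ["an unseen speech ..."]
new_meta = np.array([[0.3, -1.2]])
new_embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(new_docs)

# 2) OOS distribution from text + metadata
new_combined = np.hstack([new_embeddings, meta_weight * scaler.transform(new_meta)])
_, probs_text = topic_model.transform(new_docs, embeddings=new_combined)

# 3) OOS distribution with the text part replaced by the in-sample mean embedding
mean_emb = np.tile(embeddings.mean(axis=0), (len(new_docs), 1))
meta_only = np.hstack([mean_emb, meta_weight * scaler.transform(new_meta)])
_, probs_meta = topic_model.transform(new_docs, embeddings=meta_only)

# 4) Per-document deviation, e.g. Euclidean distance between the distributions
deviation = np.linalg.norm(probs_text - probs_meta, axis=1)
```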

Which one would sound more reasonable to you?

Thanks,
djiao

@MaartenGr
Owner

> Thanks for the detailed response! I think it's a brilliant idea to concatenate the textual data with the metadata. To this end, are you aware of any project implementing this idea?

No, I am not aware of any project that works this way. There is, however, an issue detailing how to do something similar, but with document covariates (#360).

> Option 1: 1) estimate the topics with BERTopic using texts exclusively; 2) build a softmax-like classifier with the metadata on top of the output of BERTopic; 3) calculate the out-of-sample (OOS) deviation using, say, the Euclidean distance between the real distribution based on the texts and the likelihood inferred from the classifier.

That seems like a reasonable approach. Do note, though, that cosine similarity tends to work better for high-dimensional data. There is also the possibility to extract topics per class, as described here.
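For reference, a short sketch of both points: the topics-per-class computation is built into BERTopic, while the cosine comparison is shown on toy distribution vectors; the `classes` labels are stand-ins for your metadata categories.

```python
# Topics per class: compare topic representations across a metadata category,
# plus the cosine alternative to Euclidean distance mentioned above.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from bertopic import BERTopic

# `docs` as before; one (assumed) class label per document, e.g. education level
classes = ["bachelor" if i % 2 == 0 else "master" for i in range(len(docs))]

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

topics_per_class = topic_model.topics_per_class(docs, classes=classes)
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10)

# Cosine similarity between two topic distributions p and q (toy vectors here)
p, q = np.array([[0.7, 0.2, 0.1]]), np.array([[0.5, 0.3, 0.2]])
similarity = cosine_similarity(p, q)[0, 0]
```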

I would advise trying out both options here since they both seem to represent the problem well.

@d-jiao
Author

d-jiao commented Nov 1, 2023

Thanks so much for the suggestions and for sharing this thread! It is exactly relevant to my issue. I used to pursue this goal with the STM package in R, which did not provide satisfactory results. I suspected there was a lot of measurement error, which made me want to switch to other topic models, e.g., BERTopic, which leverages state-of-the-art transformers.

I appreciate you taking the time to share your knowledge. I will read carefully through the thread and do some experiments myself. For now, I will close this issue for your convenience and reopen it if I run into further questions I cannot solve.

@d-jiao d-jiao closed this as completed Nov 1, 2023