suggestion: incorporating document-level covariates #360
Comments
Thank you for the suggestion! I am not very familiar with document-level covariates as used in STM. However, reading through the STM documentation there seems to be some overlap in specific cases. For example, to explore the effect of certain document-level variables you can model topics per variable following this guide using Do you feel that the |
Thank you for your response. topics_per_class is very helpful, but in STM the user can investigate interactions between covariates and their relationship with topical prevalence. For example, you can model the interaction of gender with education on the topics written by different people, or include more than one covariate (as in a regression), so it is not just a difference in topics across categories. I think this link does a better job of explaining: https://scholar.princeton.edu/files/bstewart/files/stmnips2013.pdf
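To make the request above concrete, here is a minimal sketch of what "more than one covariate plus an interaction" means in regression terms. It uses ordinary least squares on synthetic data, which is not how STM estimates its effects; the covariates (gender, education), the coefficients, and the prevalence values are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical covariates: a binary indicator and a numeric score (illustrative only)
gender = rng.integers(0, 2, size=n).astype(float)
education = rng.uniform(0, 10, size=n)

# Synthetic "topic prevalence" generated with a known interaction effect.
# Order of coefficients: intercept, gender, education, gender:education
true_coefs = np.array([0.10, 0.05, 0.02, 0.03])
X = np.column_stack([np.ones(n), gender, education, gender * education])
prevalence = X @ true_coefs  # noise-free, so the fit recovers the coefficients

# Ordinary least squares on a design matrix that includes the product term
estimated, *_ = np.linalg.lstsq(X, prevalence, rcond=None)
names = ["intercept", "gender", "education", "gender:education"]
print(dict(zip(names, estimated.round(3))))
```

The last coefficient is the interaction: it describes how the effect of education on prevalence differs between the two groups, which a per-category comparison cannot express.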
Thank you for sharing the paper, will make sure to read it through! It does seem like it would definitely be an interesting and useful extension of sorts to BERTopic. I would love to implement it but it seems like it would take quite some time. Having said that, I'll make sure to put it on the list and see if I could implement a basic version of it. |
This looks great! Thank you so much for taking the time to work on this. It would be really useful for a more nuanced analysis of text. I don't know how I might be of help, but if there's anything you need help with, please reach out and I'll see if it's within my capabilities.
Although more experimentation might be needed, I think users can start testing the first version of covariate analyses within BERTopic. It is not ready to be added to BERTopic, as there are some assumptions in the statistical models that I am not entirely convinced of. In the future, I might also consider more elegant approaches, but for the time being this is something to experiment with.
Covariates
Within the structural topic model, the covariates and their impact on the topics are modeled during the creation of the topics. This is not the case with BERTopic, as it assumes that topics are generated independently from any covariates that might exist. Technically, we can generate embeddings based on the metadata, but I do not believe that to be necessary at this moment to improve the topic generation process. However, it is something to take into account, as STM does assume that covariates influence the topic generation process. Do note that both models assume that covariates might influence both topic content and prevalence.
Topic Prevalence
The topic prevalence is modeled using the document-topic probability matrix as a proxy. This means we assume that the higher the probability of a document belonging to topic t, the higher that topic's prevalence in the document. Again, this assumption does not necessarily hold true, but from some experiments I did, it seems like a strong proxy for the topic prevalence.
Topic Content
The topic content is a bit more difficult to implement as a dependent variable, in contrast to topic prevalence, where we can directly access the document-topic probability matrix. To do this, we calculate the c-TF-IDF representation of each document instead of the entire topic. It allows us to create a very localized representation of a topic. We then calculate the cosine similarity between the local c-TF-IDF representation of each document and the c-TF-IDF representation of the topic the document belongs to.
This results in a set of similarity scores that we can use to model the topic content. Then, we can calculate the effect of covariates on the topic content. We assume that when a covariate changes the way a topic is represented, the similarity scores will vary, which should be captured in the resulting statistical model.
Code
As mentioned before, I am not at the point of including this in BERTopic, but I am very curious whether this is something users are interested in and what their experience is using this extension. So, I will share the code here in a way that should make it easy to use on top of BERTopic.
Minimal Example
We start with a minimal example of how to measure the effect of covariates on topic prevalence and topic content. We are going to be using a corpus consisting of political blogs in 2008 (more info here about the data) with two possible covariates:
First, we need to pip install statsmodels:
import pandas as pd
from bertopic import BERTopic
# Load data
df = pd.read_csv("http://scholar.princeton.edu/sites/default/files/bstewart/files/poliblogs2008.csv")
docs = df.documents.tolist()
metadata = df.loc[:, ["rating", "day"]].copy()
# Fit BERTopic
topic_model = BERTopic(calculate_probabilities=True, min_topic_size=50)
topics, probs = topic_model.fit_transform(docs)
In the above example, nothing special is happening thus far except for one thing: we need to put all the metadata into a single dataframe. Here, we are only using
import numpy as np
import pandas as pd
from bertopic import BERTopic
from typing import Union, Callable, List, Mapping, Any
from sklearn.metrics.pairwise import cosine_similarity
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.base.wrapper as wrap
def estimate_effect(topic_model,
                    docs: List[str],
                    probs: np.ndarray,
                    topics: Union[int, List[int]],
                    metadata: pd.DataFrame,
                    y: str = "prevalence",
                    estimator: Union[str, Callable] = None,
                    estimator_kwargs: Mapping[str, Any] = None) -> List[wrap.ResultsWrapper]:
    """ Estimate the effect of metadata on topic prevalence and topic content

    Arguments:
        topic_model: A fitted BERTopic model
        docs: The original list of documents on which the model was trained
        probs: An m x n probability matrix, where *m* is the number of documents and
               *n* the number of topics. It represents the probabilities of all topics
               across all documents.
        topics: The topic(s) for which you want to estimate the effect of metadata
        metadata: The metadata in a dataframe. Make sure that the columns have the exact same
                  names as the elements in the estimator
        y: The target, either "prevalence" (topic prevalence) or "content" (topic content)
        estimator: Either the formula used in the estimator or a custom estimator.
                   When it is used as a formula, it follows R-style formulas, for example:
                       * 'prevalence ~ rating'
                       * 'prevalence ~ rating + day + rating:day'
                   Make sure that the target is either 'prevalence' or 'content'.
                   The custom estimator should come from `statsmodels.formula.api`;
                   `statsmodels.api` is currently not supported.
        estimator_kwargs: The arguments needed within the estimator, needs at
                          least a "formula" argument

    Returns:
        fitted_estimators: List of fitted estimators for either topic prevalence or topic content
    """
    # Copy so that the caller's metadata is left untouched
    data = metadata.copy()
    data["topics"] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
    estimator_kwargs = estimator_kwargs or {}
    fitted_estimators = []
    if isinstance(topics, int):
        topics = [topics]

    # As a proxy for the topic prevalence, we take the probability of a document
    # belonging to a specific topic. We assume that a higher probability of a document
    # belonging to that topic also results in that document talking more about that topic
    if y == "prevalence":
        for topic in topics:
            # Prepare topic prevalence,
            # Exclude probs == 1 as no zero-one inflated beta regressions are currently available
            data["prevalence"] = list(probs[:, topic])
            data_filtered = data.loc[data.prevalence < 1, :]
            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=data_filtered, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=data_filtered,
                              family=sm.families.Gamma(link=sm.families.links.log())).fit()
            fitted_estimators.append(est)

    # Topic content is modeled on a document-level by calculating the document cTFIDF
    # representation. Based on that representation, we calculate its cosine similarity
    # with its topic cTFIDF representation. The assumption here is that we expect different
    # similarity scores if a covariate changes the topic content.
    elif y == "content":
        # Document-level cTFIDF vectors and their similarity to every topic vector
        c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": docs}), fit=False)
        sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf)
        for topic in topics:
            # Extract topic content and prevalence for the documents assigned to this topic
            mask = (data.topics == topic).to_numpy()
            selected_data = data.loc[mask, :].copy()
            # The column offset of one assumes the outlier topic -1 is the first row of c_tf_idf
            selected_data["content"] = sim_matrix[mask, topic+1]
            selected_data["prevalence"] = probs[mask, topic]
            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=selected_data, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=selected_data,
                              family=sm.families.Gamma(link=sm.families.links.log())).fit()
            fitted_estimators.append(est)
    return fitted_estimators
The above is the main function for running our statistical models. The only interesting part is the arguments and their documentation, but we will get to that in the next step. To run the analyses, we can simply call the above function with the appropriate parameters:
ests = estimate_effect(topic_model=topic_model,
                       topics=[1, 2],
                       metadata=metadata,
                       docs=docs,
                       probs=probs,
                       estimator="prevalence ~ rating",
                       y="prevalence")
print([est.summary() for est in ests])
In the code above there are two parameters that are important, namely
To model the topic content, we can simply run:
ests = estimate_effect(topic_model=topic_model,
                       topics=[1, 2],
                       metadata=metadata,
                       docs=docs,
                       probs=probs,
                       estimator="content ~ rating",
                       y="content")
print([est.summary() for est in ests])
Note that the value of "rating" in
Custom Statistical Model
We can extend the above by defining a statistical model ourselves if we expect the data to follow a different distribution, or if you simply do not agree with any of the rough defaults I have set in the model:
estimator = smf.glm
estimator_kwargs = {"formula": 'prevalence ~ rating',
                    "family": sm.families.Gamma(link=sm.families.links.log())}
ests = estimate_effect(topic_model=topic_model,
                       topics=[1],
                       metadata=metadata,
                       docs=docs,
                       probs=probs,
                       y="prevalence",
                       estimator=estimator,
                       estimator_kwargs=estimator_kwargs)
print([est.summary() for est in ests])
In the example above, you can see that the
Feedback
This is, hopefully, a fairly straightforward example of analyzing the effects of covariates on topic content and prevalence. It should work on any data. I would advise following along with the above minimal example and perhaps looking at some of the variables to see how they work. As mentioned before, you can view this as a proof of concept that is nevertheless usable as it is right now. This does mean, however, that things might be subject to change and that this will be improved as more feedback comes in. In other words, any and all feedback is highly appreciated!
This is so exciting!! I'll test it as soon as possible and let you know if I have any comments. Thank you so much for your work on this. |
Very interesting. Thank you for the code example and for pinning this. |
I've started looking at the covariates code and really appreciate the willingness to extend the code base and move in this direction. One issue jumps out at me however. As has been pointed out many times, setting I'm bringing this up because I've essentially looked outside of BERTopic to address this whole issue (covariates). I am using BERTopic for an initial round of topic identification and then using vocabularies built upon that initial pass to arrive at cosine similarity scores to determine the impact of a covariate. In my case I identify a relevant vocabulary and then weigh it against party affiliation (Democrat, Republican). If I'm understanding correctly the solution being pursued here would provide a substitute path entirely within BERTopic. A great idea but the processing time issue seems like a practical limitation. |
@drob-xx You are completely right in stating that setting
There are two ways of circumventing this. First, you can use a smaller sample of your data to train BERTopic. This is often a valid approach, as millions of documents are typically not necessary to generate a global representation of the topics that exist in the data. However, if you are looking for very specific and small topics among those millions of documents, then sampling would not work. Second, as you suggested, you can look outside of BERTopic. The STM model works quite well with covariates and is often used in these kinds of use cases. The main downside of the STM model is that, as far as I know, there is currently no implementation in Python, although I am not entirely sure of that.
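The sampling route could look roughly like the sketch below. It only shows how to draw a random subset while keeping each document's metadata aligned, which matters for any later covariate analysis. The helper name `sample_docs_with_metadata`, the toy documents, and the sample size are assumptions for illustration, not part of BERTopic.

```python
import random

def sample_docs_with_metadata(docs, metadata_rows, sample_size, seed=42):
    """Draw a random subset of documents and keep each metadata row aligned."""
    if len(docs) != len(metadata_rows):
        raise ValueError("docs and metadata_rows must have the same length")
    rng = random.Random(seed)
    k = min(sample_size, len(docs))
    # Sample positions rather than documents so both lists are sliced identically
    indices = sorted(rng.sample(range(len(docs)), k))
    return [docs[i] for i in indices], [metadata_rows[i] for i in indices]

# Toy corpus with one covariate per document (made-up data)
docs = [f"document {i}" for i in range(1000)]
metadata_rows = [{"rating": "Liberal" if i % 2 else "Conservative"} for i in range(1000)]

sampled_docs, sampled_meta = sample_docs_with_metadata(docs, metadata_rows, sample_size=100)
print(len(sampled_docs), sampled_meta[0])
```

The sampled documents are what would then be passed to the topic model for fitting, with `sampled_meta` serving as the matching metadata.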
Hmmm, I have a bit of trouble wrapping my head around this. Which two variables are you comparing with those cosine similarity scores? Are you building those vocabularies on a document level? Just to clarify things, the probabilities are only necessary to calculate the effect of covariates on the topic prevalence. Here, we proxy the prevalence of a topic by looking at the probability distributions of topics in a document. We assume that a higher probability of Topic A in Document X would mean that Topic A appears more frequently and that a lower probability of Topic B in Document X would mean that Topic B appears less frequently in that document. The topic content, on the other hand, does not need the probabilities to be generated. We are calculating the similarity of the c-TF-IDF representation of each document with the c-TF-IDF representation of each topic. The resulting similarity scores are then the dependent variables (sliced by each topic). |
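As a rough illustration of the topic-content score described above, the sketch below computes the cosine similarity between a few document vectors and a single topic vector. The numbers are made-up term weights over a four-word vocabulary, not real c-TF-IDF output, and `cosine_to_topic` is a hypothetical helper rather than a BERTopic function.

```python
import numpy as np

def cosine_to_topic(doc_vectors: np.ndarray, topic_vector: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of doc_vectors and one topic vector."""
    dots = doc_vectors @ topic_vector
    denom = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(topic_vector)
    # Documents with an all-zero vector get similarity 0 instead of a division error
    return np.divide(dots, denom, out=np.zeros_like(dots, dtype=float), where=denom > 0)

# Toy term weights (illustrative only)
topic_vector = np.array([0.9, 0.4, 0.0, 0.1])
doc_vectors = np.array([
    [0.8, 0.5, 0.0, 0.0],  # uses mostly the topic's vocabulary
    [0.0, 0.1, 0.9, 0.3],  # uses mostly different words
    [0.0, 0.0, 0.0, 0.0],  # empty document
])

content_scores = cosine_to_topic(doc_vectors, topic_vector)
print(content_scores.round(3))
```

Scores like these, one per document assigned to a topic, are what would serve as the dependent variable when the target is "content".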
@MaartenGr Thanks as always for a quick response. My post was long and likely confusing. I'm not sure how much of my question has to do with your new code as opposed to how I'm approaching my own project. However, I'll keep going as it will likely result in a better understanding on my part of what I'm trying to accomplish. My project is not calculating a p-value to determine the impact of a covariate on topic relevance. However, I am interested in determining the effect of party identification on the use of vocabularies within US Congressional press releases. I am interested in your code as it would be another way of determining the relevance of party affiliation in press release language. Right now I am using BERTopic to identify relevant topics. I then use TF-IDF calculations to choose (by hand) vocabularies that are relevant to a topic but differ between parties. For example in press releases dealing with healthcare issues I identified two vocabularies:
Then I calculate TF-IDF scores for each press release. Since I of course have party affiliation (and other data), I can then look at those scores in relation to party affiliation. I've started to use your code, but as I wrote, the first issue that jumped out was the time needed to compute the probabilities. Calculating them increased the processing time for some 100K documents of about 300-1500 words from some 20-30 minutes to hours on a Colab+ account. Since I've gotten used to relatively fast processing times since moving off my desktop system, it reminded me of how CPU-intensive this work is. At base, I was wondering how long that would remain an issue, as it may affect my overall approach. |
@drob-xx If my understanding is correct, you are interested in whether the topic representation for a specific topic, for example, "business", may differ based on party affiliation, is that correct? In which case, it is indeed highly relevant to the code I shared with respect to calculating the covariates. But first, when you want to calculate the differences in vocabulary within a topic between party affiliations, I believe there is no need to calculate the probabilities. They do not seem to be relevant to your use case, so I would advise setting
And how about calculating a p-value to determine the impact of a covariate on topic content? As mentioned before, it seems that topic content (i.e., vocabulary) is exactly what you are describing, namely, the vocabulary used within a topic for different covariates (e.g., party affiliation). By calculating the p-value, you can explore in which topics party affiliation has a significant effect on the vocabulary used. This might help you find those topics instead of having to go through them manually.
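One way to screen topics by p-value, sketched under assumptions: the helper below gathers the `params` and `pvalues` attributes that fitted statsmodels results expose into a single table. The function name `summarize_covariate_effects` is hypothetical, and the demo uses stand-in objects with made-up numbers instead of real fitted models so that it runs on its own.

```python
from types import SimpleNamespace

import pandas as pd

def summarize_covariate_effects(fitted_estimators, topics, alpha=0.05):
    """Collect coefficients and p-values from fitted models into one table."""
    rows = []
    for topic, est in zip(topics, fitted_estimators):
        for term in est.params.index:
            if term == "Intercept":
                continue  # only the covariate terms are of interest here
            rows.append({"topic": topic, "term": term,
                         "coef": est.params[term], "p_value": est.pvalues[term]})
    table = pd.DataFrame(rows, columns=["topic", "term", "coef", "p_value"])
    table["significant"] = table["p_value"] < alpha
    return table.sort_values("p_value").reset_index(drop=True)

# Stand-ins mimicking the .params / .pvalues of fitted results (made-up numbers)
fake_results = [
    SimpleNamespace(params=pd.Series({"Intercept": -2.1, "rating[T.Liberal]": 0.40}),
                    pvalues=pd.Series({"Intercept": 0.000, "rating[T.Liberal]": 0.003})),
    SimpleNamespace(params=pd.Series({"Intercept": -1.7, "rating[T.Liberal]": 0.02}),
                    pvalues=pd.Series({"Intercept": 0.000, "rating[T.Liberal]": 0.610})),
]

effects = summarize_covariate_effects(fake_results, topics=[1, 2])
print(effects)
```

When many topics are screened this way, some p-values will fall below the threshold by chance, so a multiple-comparison correction is worth considering before drawing conclusions.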
I would advise against using classical TF-IDF calculations here if you are interested in the differences in vocabulary; c-TF-IDF is much better optimized for such tasks and typically generates better results.
Code
Based on all the above, the procedure for you would then be something as follows:
We first run BERTopic:
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
# Load data
df = pd.read_csv("http://scholar.princeton.edu/sites/default/files/bstewart/files/poliblogs2008.csv")
docs = df.documents.tolist()
metadata = df.loc[:, ["rating", "day"]].copy()
# Fit BERTopic and remove stopwords
vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(min_topic_size=25, vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
The original
import numpy as np
import pandas as pd
from bertopic import BERTopic
from typing import Union, Callable, List, Mapping, Any
from sklearn.metrics.pairwise import cosine_similarity
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.base.wrapper as wrap
def estimate_effect(topic_model,
                    docs: List[str],
                    topics: Union[int, List[int]],
                    metadata: pd.DataFrame,
                    y: str = "prevalence",
                    probs: np.ndarray = None,
                    estimator: Union[str, Callable] = None,
                    estimator_kwargs: Mapping[str, Any] = None) -> List[wrap.ResultsWrapper]:
    """ Estimate the effect of metadata on topic prevalence and topic content

    Arguments:
        topic_model: A fitted BERTopic model
        docs: The original list of documents on which the model was trained
        probs: An m x n probability matrix, where *m* is the number of documents and
               *n* the number of topics. It represents the probabilities of all topics
               across all documents. Only needed when y is "prevalence".
        topics: The topic(s) for which you want to estimate the effect of metadata
        metadata: The metadata in a dataframe. Make sure that the columns have the exact same
                  names as the elements in the estimator
        y: The target, either "prevalence" (topic prevalence) or "content" (topic content)
        estimator: Either the formula used in the estimator or a custom estimator.
                   When it is used as a formula, it follows R-style formulas, for example:
                       * 'prevalence ~ rating'
                       * 'prevalence ~ rating + day + rating:day'
                   Make sure that the target is either 'prevalence' or 'content'.
                   The custom estimator should come from `statsmodels.formula.api`;
                   `statsmodels.api` is currently not supported.
        estimator_kwargs: The arguments needed within the estimator, needs at
                          least a "formula" argument

    Returns:
        fitted_estimators: List of fitted estimators for either topic prevalence or topic content
    """
    # Copy so that the caller's metadata is left untouched
    data = metadata.copy()
    data["topics"] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
    data["docs"] = docs
    estimator_kwargs = estimator_kwargs or {}
    fitted_estimators = []
    if isinstance(topics, int):
        topics = [topics]

    # As a proxy for the topic prevalence, we take the probability of a document
    # belonging to a specific topic. We assume that a higher probability of a document
    # belonging to that topic also results in that document talking more about that topic
    if y == "prevalence":
        for topic in topics:
            # Prepare topic prevalence,
            # Exclude probs == 1 as no zero-one inflated beta regressions are currently available
            data["prevalence"] = list(probs[:, topic])
            data_filtered = data.loc[data.prevalence < 1, :]
            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=data_filtered, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=data_filtered,
                              family=sm.families.Gamma(link=sm.families.links.log())).fit()
            fitted_estimators.append(est)

    # Topic content is modeled on a document-level by calculating the document cTFIDF
    # representation. Based on that representation, we calculate its cosine similarity
    # with its topic cTFIDF representation. The assumption here is that we expect different
    # similarity scores if a covariate changes the topic content.
    elif y == "content":
        for topic in topics:
            # Extract topic content for the documents assigned to this topic
            # (copy to avoid writing into a slice of the full dataframe)
            selected_data = data.loc[data.topics == topic, :].copy()
            c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": selected_data.docs.tolist()}), fit=False)
            sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf)
            selected_data["content"] = sim_matrix[:, topic+1]
            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=selected_data, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=selected_data,
                              family=sm.families.Gamma(link=sm.families.links.log())).fit()  # perhaps remove the gamma + link?
            fitted_estimators.append(est)
    return fitted_estimators
Then, using the updated
ests = estimate_effect(topic_model=topic_model,
                       topics=[1, 2],
                       metadata=metadata,
                       docs=docs,
                       probs=None,
                       estimator="content ~ rating",
                       y="content")
print([est.summary() for est in ests])
Now, we can calculate the vocabularies for each party affiliation and each topic:
def calculate_ctfidf_representation(topic_model, df, rating):
    # Keep only the documents with the given rating
    selected_data = df.loc[df.rating == rating, :]
    # Join all documents of a topic into a single document per topic
    documents_per_topic = selected_data.groupby(["Topic"], as_index=False).agg({"Document": " ".join, "blog": "count"})
    ctfidf, words = topic_model._c_tf_idf(documents_per_topic, fit=False)
    labels = sorted(list(documents_per_topic.Topic.unique()))
    sliced_topics = topic_model._extract_words_per_topic(words=words, c_tf_idf=ctfidf, labels=labels)
    return sliced_topics
# Make sure that the original dataframe is in the correct format
df = pd.read_csv("http://scholar.princeton.edu/sites/default/files/bstewart/files/poliblogs2008.csv")
df["Topic"] = topics
df.rename({"documents": "Document"}, axis=1, inplace=True)
# Calculate topic vocabularies
conservative_topics = calculate_ctfidf_representation(topic_model, df, "Conservative")
liberal_topics = calculate_ctfidf_representation(topic_model, df, "Liberal")
The
EDIT: Forgot to add some processing
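To compare two such per-group vocabularies side by side, something like the following sketch might help. It assumes each result is a dictionary mapping a topic id to a list of (word, score) pairs ordered by score; that format is an assumption here, and the example input is made up rather than taken from the poliblogs data. The helper name `distinctive_words` is hypothetical.

```python
def distinctive_words(topics_a, topics_b, top_n=10):
    """For each topic present in both groups, split the top words into
    those unique to group A, unique to group B, and shared."""
    comparison = {}
    for topic in sorted(set(topics_a) & set(topics_b)):
        words_a = [word for word, _ in topics_a[topic][:top_n]]
        words_b = [word for word, _ in topics_b[topic][:top_n]]
        comparison[topic] = {
            "only_a": [w for w in words_a if w not in words_b],
            "only_b": [w for w in words_b if w not in words_a],
            "shared": [w for w in words_a if w in words_b],
        }
    return comparison

# Made-up example input in the assumed {topic: [(word, score), ...]} format
conservative_example = {0: [("tax", 0.05), ("obama", 0.04), ("spending", 0.03)]}
liberal_example = {0: [("tax", 0.05), ("mccain", 0.04), ("healthcare", 0.03)]}

result = distinctive_words(conservative_example, liberal_example)
print(result[0])
```

Topics that appear in only one group are skipped, since there is nothing to compare them against.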
This is all very interesting. I will continue to dig. Thanks very much, as always, for taking the time! |
@MaartenGr I finally got back to this and wanted to report my experience. Both approaches you have presented here, one for calculating the covariates and the other for "sub-selecting" topic vocabularies from subsets of documents are very cool and seem quite powerful. I am summarizing here to close the loop as well as to make sure I understand what these techniques are doing. My corpus is a near complete set of U.S. Congressional press releases from 2017 to 2020 (the 115th and 116th Congresses). I am interested in the overall topic composition as well as differences in the subject matter Republicans and Democrats talk about and differences in how they talk about the same topics. I have not come up with the final tuning of BERTopic that I want to use for this project but the settings I've used seem more than adequate for this stage. Here is what I ran:
I then ran the second version of
I did a comparison of
My understanding of how In these cases the first result for The second suggestion you had for extracting the cTFIDF results for subsets of the corpus (split by party) was very interesting. Here is the output from two different topics - the first line is the overall topic list, the second for Democrats and the third for Republicans:
As always thanks so much for this excellent package and the patience and dedication you show here. |
@drob-xx Thank you for sharing your experience and thoughts in such an extensive way! Definitely helps to understand how this is being used and what the potential bottlenecks are when using this. |
This thread has been extremely useful, so thanks to everyone who has contributed! |
@SoranHD It's been a while and I do not think I have that code around anymore. It should be reproducible though based on the code I have shared above for calculating and approaching topic prevalence. |
The above link isn't working for me but for anyone looking, I think this is the correct paper: https://projects.iq.harvard.edu/files/wcfia/files/stmnips2013.pdf |
Hi @MaartenGr, thanks very much for this feature; I found it super useful! Could you also share the code for the example, "the visualization of using the probabilities to see the differences in prevalence between democrats and republicans talking about American politics in 2008"? Thanks so much.
@calvinchengyx Have you checked my comment above?
Yes! The calculation is all reproducible, and thanks again for sharing it! I have already used the GLM table outputs as shared above. I was just wondering whether the code for the boxplot is still available, as it would be super helpful for the presentation.
My message above was in response to a user asking for code to reproduce that specific plot. Unfortunately, I do not have that code available. I believe it was simply some matplotlib code, so it should be straightforward to create. |
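For anyone who wants a starting point, here is a minimal sketch of such a boxplot. It is not a reconstruction of the original plot: the helper names are hypothetical, the probabilities are made up, and matplotlib is imported inside the plotting function so that the grouping step works without it.

```python
import pandas as pd

def prevalence_by_group(probs_for_topic, groups):
    """Return {group label: list of prevalence values} for one topic."""
    frame = pd.DataFrame({"prevalence": list(probs_for_topic), "group": list(groups)})
    return {label: sub["prevalence"].tolist() for label, sub in frame.groupby("group")}

def plot_prevalence_boxplot(grouped, topic_label="topic"):
    """Draw one box per group from the output of prevalence_by_group."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    labels = list(grouped)
    fig, ax = plt.subplots()
    ax.boxplot([grouped[label] for label in labels])
    # boxplot places the boxes at positions 1..N by default
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels(labels)
    ax.set_ylabel(f"Probability of {topic_label}")
    return fig, ax

# Made-up probabilities for a single topic (illustrative only)
grouped = prevalence_by_group(
    [0.10, 0.30, 0.05, 0.40],
    ["Liberal", "Conservative", "Liberal", "Conservative"],
)
print(grouped)
```

With the variables from the minimal example, the inputs would be a column of the probability matrix, such as `probs[:, 1]`, and the `rating` column of the metadata.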
Hi
I'd like to suggest adding the option of document-level covariates (similar to STM in R). Basically, it allows the user to investigate the relationship between document-level covariates, such as source (political affiliation, for example), country, etc., and topical prevalence. Is it feasible? It would make the package much more useful for social research.
Thank you!