
the 6 steps of BERTopic #2204

Open
TalaN1993 opened this issue Nov 4, 2024 · 6 comments

Comments

@TalaN1993

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

Hello,

I have a question. According to the documentation, I understand that BERTopic consists of six steps, with the representation tuning step being optional. I have read many articles in a specific field that used BERTopic, but not all of them include the remaining five steps. For example, some articles only mention embedding, dimensionality reduction, clustering, and the weighting scheme (c-TF-IDF). I'd like to know whether each step can be omitted, or whether all five remaining steps are strictly required.

Reproduction

from bertopic import BERTopic

BERTopic Version

0.16.3

@TalaN1993 added the "bug (Something isn't working)" label on Nov 4, 2024
@MaartenGr removed the "bug (Something isn't working)" label on Nov 5, 2024
@MaartenGr
Owner

For example, some articles only mention embedding, dimensionality reduction, clustering, and the weighting scheme (c-TF-IDF).

These are actually five steps:

  • Embedding
  • Dim reduction
  • Clustering
  • Tokenization
  • c-TF-IDF

Although tokenization isn't mentioned, it is definitely used.

Typically, you would see those five steps plus the optional representation step. If you want to remove a step, the only one you could potentially drop is the dimensionality reduction step; all the others are needed.
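For reference, here is a minimal sketch of how those steps map onto BERTopic's sub-model parameters; the specific models and settings are illustrative choices rather than requirements, and the last line shows how the dimensionality reduction step can be skipped by passing an empty model.

from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Each step is an explicit, swappable sub-model (choices below are illustrative).
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")   # 1. embedding
umap_model = UMAP(n_components=5, random_state=42)          # 2. dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=15)                # 3. clustering
vectorizer_model = CountVectorizer(stop_words="english")    # 4. tokenization
ctfidf_model = ClassTfidfTransformer()                      # 5. topic weighting (c-TF-IDF)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)

# To skip dimensionality reduction entirely, pass an "empty" model instead:
# topic_model = BERTopic(umap_model=BaseDimensionalityReduction())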

Many papers just implement the basic BERTopic functionality and compare against that, which is a shame considering the representation models often improve the output significantly. I can't speak to their reasoning, but I wish the representation step were included more often.
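As a sketch, enabling that optional sixth step is a single-parameter change; KeyBERTInspired and MaximalMarginalRelevance below are two of the built-in options.

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# 6. optional representation tuning, applied on top of the c-TF-IDF keywords
representation_model = KeyBERTInspired()
# or, for more diverse keywords:
# representation_model = MaximalMarginalRelevance(diversity=0.3)

topic_model = BERTopic(representation_model=representation_model)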

@TalaN1993
Author

Thank you so much for your help and guidance.

@TalaN1993
Author

TalaN1993 commented Nov 7, 2024

Hello MaartenGr,

In my case, I used all six steps with three different representation models (GPT-3.5, MMR, and KeyBERT), keeping the other five steps the same. I evaluated the results using OCTIS NPMI and topic diversity, but the results were somewhat different from what I expected. Do you think this makes sense?

with GPT-3.5: NPMI 0.1267, diversity 0.9851
with MMR: NPMI 0.2625, diversity 0.7263
with KeyBERT: NPMI 0.3027, diversity 0.6421

I had expected the NPMI value for GPT-3.5 to be higher.
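For context, a rough sketch of how such an evaluation is typically wired up with OCTIS (the corpus and topic word lists below are tiny placeholders; in practice they come from the preprocessed documents and the fitted topic model):

from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

# Placeholders: tokenized_docs is the preprocessed corpus (list of token lists)
# and topic_words holds the top-k words per topic, e.g. taken from the topic model.
tokenized_docs = [["neural", "network", "training"],
                  ["market", "stock", "price"],
                  ["neural", "network", "market"]]
topic_words = [["neural", "network", "training"],
               ["market", "stock", "price"]]

model_output = {"topics": topic_words}

npmi = Coherence(texts=tokenized_docs, topk=3, measure="c_npmi")
diversity = TopicDiversity(topk=3)

print("NPMI:", npmi.score(model_output))
print("Topic diversity:", diversity.score(model_output))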

@MaartenGr
Owner

It may be worthwhile to do a deep-dive into how topic coherence (and diversity) metrics work. They assume we have a list of keywords as the main representation for topics. This is true for MMR and KeyBERT but not for GPT-3.5 since that only generates a single label and not a mixture of words.

@TalaN1993
Author

Hello MaartenGr,
I have a question. I understand that in LDA, the input data is typically based on the Bag of Words (BoW) representation. My question is: if we change the vector representation from BoW to TF-IDF or SentenceTransformers and then feed it to the LDA model, does this approach make sense? I am looking for ways to improve the results of LDA.

@MaartenGr
Owner

@TalaN1993 LDA is quite a different method compared to BERTopic and I don't think it would work that easily with embeddings without any significant changes. I believe there is something called LDA2Vec or something similar that you could research.
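To make the contrast concrete, here is a minimal sketch of a standard LDA setup with scikit-learn (placeholder corpus): LDA models word counts, so it expects a document-term matrix from something like CountVectorizer rather than dense, real-valued sentence embeddings.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus; in practice, use your own documents.
docs = ["a placeholder document about topic modeling",
        "another placeholder document about word counts",
        "a third placeholder document about topics and words"]

# LDA expects a Bag-of-Words count matrix, not dense embeddings.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(counts)   # document-topic distributions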
