
Changing Results and Dominance Score #126

Open

Glorifier85 opened this issue Sep 17, 2020 · 2 comments

Comments

@Glorifier85

Hi community,

Whenever I run the same corpus with exactly the same parameters, I get different results, e.g. in the ranking of topics. I assume one could minimize that through thorough data cleansing? Besides that, are there additional methods to increase the reliability of the results?

My second question revolves around the dominance score, which is used to rank the topics. How exactly is each score calculated? I am asking because I have run corpora in the past, exported the results, and added up the numbers in the topic distribution spreadsheet for each topic (I assume these are the dominance scores; see the sketch below). My understanding is that these sums should match the visual ranking of the topics, but sometimes they did not for me. Shouldn't whatever topic ranks first in the graphic also have the highest combined dominance score across all documents, or is there more to it than just adding them up? Most of the time it matches, but sometimes it just doesn't.
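Roughly, this is what I did with the export (a sketch assuming one row per document and one column per topic; the file name is a placeholder):

```python
import pandas as pd

# Hypothetical export: rows = documents, columns = topics,
# cells = per-document dominance scores (topic probabilities).
dist = pd.read_csv("topic-document-distribution.csv", index_col=0)

# Sum each topic's scores over all documents and rank descending.
ranking = dist.sum(axis=0).sort_values(ascending=False)
print(ranking)
```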

Thanks!

@severinsimmler
Collaborator

Topic modeling is probabilistic. Two probability distributions are iteratively estimated:

  1. How likely is a word for a topic? The ten or so most likely words are then usually interpreted as a "topic".
  2. How likely is a topic for a document? This probability is the dominance score. I can recommend this paper by David Blei – it deals with the mathematical details in a rather comprehensible way. (Both distributions are sketched in code below.)
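A minimal sketch of both distributions, using scikit-learn's LDA purely as a stand-in (this is not the Topics Explorer's actual code, and the toy documents are made up):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cats chase mice", "dogs chase cats", "stocks and bonds fell"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2)
doc_topics = lda.fit_transform(X)  # 2. topic-per-document distribution ("dominance")
words = vectorizer.get_feature_names_out()

# 1. word-per-topic distribution: the most likely words form the "topic"
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")

print(doc_topics)  # one row per document, one probability per topic
```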

Both distributions are initialized randomly, and we do not set a random seed in the application. Because of this randomness, you will never get exactly the same (but still comparable) results with the same texts and the same parameters. You can more or less call this a bug.
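To illustrate (again with scikit-learn as a stand-in): fixing the seed makes repeated runs identical, while omitting it reproduces the behaviour you observed.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

X = CountVectorizer().fit_transform(
    ["cats and dogs", "stocks and bonds", "dogs chase cats"]
)

for run in range(2):
    # With random_state=42 both runs print identical matrices;
    # drop the argument and the random initialization differs per run.
    lda = LatentDirichletAllocation(n_components=2, random_state=42)
    print(run, lda.fit_transform(X).round(3))
```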

Historically, the Topics Explorer was developed for didactic purposes: to introduce newcomers to the method as quickly and straightforwardly as possible. But since you obviously have an advanced and complex use case, you could switch to MALLET, a command-line tool which is, in my opinion, also quite easy to use (but has no graphical interface). With MALLET you can explicitly set a random seed and get deterministic results. The output is similar to the Topics Explorer's text files with the topics and distributions.
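For instance, a seeded MALLET run could look like this, invoked from Python (this assumes MALLET is on your PATH and that `corpus.mallet` was already created with `mallet import-dir`; the file names are placeholders):

```python
import subprocess

# Train with a fixed seed so repeated runs give identical results.
subprocess.run(
    [
        "mallet", "train-topics",
        "--input", "corpus.mallet",               # pre-imported corpus (placeholder)
        "--num-topics", "20",
        "--random-seed", "42",                    # deterministic sampling
        "--output-doc-topics", "doc-topics.txt",  # topic-per-document distribution
        "--output-topic-keys", "topic-keys.txt",  # most likely words per topic
    ],
    check=True,
)
```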

@Glorifier85
Author

I could switch applications (and probably will), but I need one with a GUI. I am considering the Stanford Topic Modeling Toolbox, which I have heard positive things about.

Based on your explanations, the numbers I find in the topic distribution output are the dominance scores. But then why does the Topics Explorer not rank the topic with the highest numerical sum of dominance scores across all documents as the leading/most prominent topic? With my dataset, the topic with the highest summed dominance scores is placed 4th by the app. That still doesn't make sense to me...
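To make the puzzle concrete, here is a toy sketch (I don't know which rule the app actually uses) showing that ranking by summed dominance scores and ranking by the number of documents a topic dominates can disagree:

```python
import numpy as np

# Toy document-topic matrix: 5 documents x 2 topics (made-up numbers, rows sum to 1).
doc_topics = np.array([
    [0.95, 0.05],
    [0.95, 0.05],
    [0.40, 0.60],
    [0.40, 0.60],
    [0.40, 0.60],
])

by_sum = doc_topics.sum(axis=0)                                 # summed dominance
by_count = np.bincount(doc_topics.argmax(axis=1), minlength=2)  # docs dominated

print("summed scores: ", by_sum)    # topic 0 wins by sum (3.1 vs 1.9)
print("dominated docs:", by_count)  # topic 1 wins by count (3 vs 2)
```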

Thanks!
