Noise topic #346

ronirg · 2023-08-17T09:00:38Z

Hi,
According to the paper:
"HDBSCAN assigns a label to each dense cluster of document vectors and assigns a noise label to all document vectors that are not in a dense cluster."

If a document was assigned to a noise label, will it be in Topic -1 or Topic 0? I cannot find it in the documentation.
I don't get Topic -1 in my experiments.

Thanks

jacob-bayer · 2024-01-04T22:49:59Z

I had this question too. I think that topic 0 is noise, but I'm not entirely sure. Maybe @ddangelov could weigh in. I've found that if you look closely, there are lots of other clusters that could be categorized as "noise" as well, based on their top words. In my pipeline, I check each document against the top 5 words from the topic_words of its assigned topic: if it contains fewer than 2 of those 5 words and its confidence is below 0.4, I call it an outlier. Then I look at the proportion of outliers in each cluster, and if it's mostly outliers I call it a noise cluster. That works for my data. It might not work for yours.
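
A minimal sketch of one possible reading of the heuristic above, in plain Python. Everything here is an assumption for illustration: the function names `is_outlier` and `find_noise_topics`, the `(tokens, topic_id, confidence)` input shape, and the interpretation of "mostly outliers" as more than half of a cluster's documents. It does not call any library API, so the topic words, assignments, and confidence scores would have to be pulled from the model separately.

```python
from collections import defaultdict


def is_outlier(doc_tokens, top_words, confidence,
               top_n=5, min_hits=2, max_conf=0.4):
    """Flag a document that shares few top words with its topic
    and also has a low confidence score (thresholds from the comment above)."""
    hits = len(set(doc_tokens) & set(top_words[:top_n]))
    return hits < min_hits and confidence < max_conf


def find_noise_topics(docs, topic_top_words, majority=0.5):
    """Return the sorted ids of topics whose documents are mostly outliers.

    docs: iterable of (tokens, topic_id, confidence) tuples (hypothetical shape).
    topic_top_words: dict mapping topic_id -> list of top words, best first.
    majority: assumed cutoff for "mostly outliers".
    """
    totals = defaultdict(int)
    outliers = defaultdict(int)
    for tokens, topic_id, confidence in docs:
        totals[topic_id] += 1
        if is_outlier(tokens, topic_top_words[topic_id], confidence):
            outliers[topic_id] += 1
    return sorted(t for t in totals if outliers[t] / totals[t] > majority)
```

The thresholds (2 of 5 words, confidence 0.4) are the ones that worked for my data, so they are left as parameters to tune.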
