Exploring Historical Web Content / Internet Censorship Data #2248
NullPxl started this conversation in Show and tell
This is an awesome project, thank you for sharing!! As a quick tip, you can easily remove stopwords by using MMR, KeyBERTInspired or use the |
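For readers who have not met MMR (maximal marginal relevance) before, here is a minimal sketch of the idea in plain Python. It is not BERTopic's implementation (in BERTopic the corresponding representation model is `MaximalMarginalRelevance` in `bertopic.representation`). The `mmr` and `cosine` helpers and the two-dimensional toy vectors are made up purely for illustration. Each step picks the candidate word that is most relevant to the topic while penalising similarity to words already chosen.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(topic_vec, candidates, top_n=3, diversity=0.5):
    """Greedy MMR over a dict mapping word -> embedding (hypothetical helper).

    diversity=0.0 ranks purely by relevance to the topic;
    higher values penalise words similar to those already selected.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        best_word, best_score = None, -math.inf
        for word, vec in remaining.items():
            relevance = cosine(vec, topic_vec)
            # Similarity to the closest already-selected word (0 if none yet).
            redundancy = max(
                (cosine(vec, candidates[s]) for s in selected), default=0.0
            )
            score = (1 - diversity) * relevance - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
        del remaining[best_word]
    return selected


# Toy embeddings, invented for this example only.
topic_vec = [1.0, 1.0]
candidates = {
    "censorship": [1.0, 0.9],
    "censored": [1.0, 0.8],  # near-duplicate of "censorship"
    "vpn": [0.2, 1.0],
    "news": [1.0, 0.1],
}

relevance_only = mmr(topic_vec, candidates, top_n=2, diversity=0.0)
diversified = mmr(topic_vec, candidates, top_n=2, diversity=0.7)
print(relevance_only)
print(diversified)
```

With these toy vectors, the relevance-only ranking keeps both near-duplicate words, while the higher diversity setting replaces the second one with a less similar word. Whether this also pushes stopwords out of real topic labels depends on the embeddings and the data.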
Hi! I recently used BERTopic as a large part of an approach to discovering themes in multilingual web content. More specifically, the project looks at millions of tests/URLs submitted to an internet censorship measurement platform over 11 years.
You can explore the data and see more details here: https://whatscensored.peterwhiting.me/
It's not perfect: stopwords still sometimes appear in the labels, some topics overlap, and there are many outliers. That said, it still surfaces many interesting themes, and I believe it's a good example of how BERTopic can be applied. I thought anyone browsing the BERTopic discussions page might find it cool!