First off: great solution that has helped me a lot in the past. I am currently preparing to do topic modeling with Mallet and have finished pulling the raw datasets. Before I import them and start modeling, I need to take some steps to clean and streamline the texts. What I am a little fuzzy about is stemming and lemmatizing: not the concept itself, but rather what the best approach would be.
To be specific, here is what I need to do:
standardize inconsistencies in spelling, e.g. topicmodeling -> topic modeling
remove extra whitespace between words, e.g. two spaces in a row
stem and lemmatize
I realize that this is not exactly an issue with Mallet, but I was hoping that someone could recommend, based on experience, how best to tackle this.
Many thanks in advance!
Perhaps you might look at https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00099/43370/Comparing-Apples-to-Apple-The-Effects-of-Stemmers which not only has David on the author list, but also just has a terrific title!
My personal experience tends to confirm the broad conclusions there. I can’t stop my collaborators from stemming, though!
I would suggest that standardizing spelling may be more trouble than it’s worth. In the worst case, where one document category spells something differently (e.g. colour), I’ve found there are enough contextual clues for the model to realize that color and colour belong together, and to put them in the same topic.
Finally, extra whitespace won’t affect Mallet output (under the standard options).
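For anyone who still wants to do the light cleanup from the question before importing, here is a minimal Python sketch using only the standard library. It is an illustration, not a recommendation: the `SPELLING_VARIANTS` mapping and the `clean_text` function are hypothetical names made up for this example, the mapping has to be filled in by hand for your own corpus, and replaced words come out in the mapping's lowercase form. Stemming and lemmatizing are deliberately left out.

```python
import re

# Hypothetical, hand-maintained mapping from spelling variants to the
# preferred form. These two entries are only examples.
SPELLING_VARIANTS = {
    "topicmodeling": "topic modeling",
    "colour": "color",
}

# One pattern that matches any variant as a whole word, case-insensitively.
_VARIANT = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in SPELLING_VARIANTS) + r")\b",
    flags=re.IGNORECASE,
)

# Any run of whitespace (spaces, tabs, newlines).
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Replace known spelling variants, then collapse whitespace runs."""
    text = _VARIANT.sub(lambda m: SPELLING_VARIANTS[m.group(0).lower()], text)
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("Topicmodeling  is   fun "))
```

Because the pattern only matches whole words, a word such as "colourful" is left untouched; whether that is what you want depends on the corpus.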
Simon DeDeo
Carnegie Mellon University & the Santa Fe Institute
http://santafe.edu/~simon
On Jun 29, 2021, at 11:34 AM, Glorifier85 ***@***.***> wrote: