
Improved runtime of has_bad_word #19

Open · wants to merge 4 commits into master
Conversation


@aarashy commented May 26, 2019

These methods used to run in O(|profane words| × |input text|) time (possibly even slower once you account for the regex operations inside the censor method). Using censor to implement has_bad_word is fundamentally inefficient. I wanted to run ProfanityFilter on my large dataset (millions of YouTube comments) and it was prohibitively slow. My new implementation leverages a dictionary to run in linear time and quits as soon as it finds a profane word, rather than doing tons of unnecessary computation. The old implementation made little progress in an hour on my dataset, whereas mine processed the whole dataset in under 2 minutes.
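A minimal sketch of the dictionary-based approach described above (the class shape and attribute names here are illustrative, not the actual PR code):

```python
import re

class ProfanityFilter:
    def __init__(self, bad_words):
        # One-time setup: a set gives O(1) average-case membership tests.
        self._bad_words = {w.lower() for w in bad_words}

    def has_bad_word(self, text):
        # Normalize case and split on non-word characters so punctuation
        # can't hide a match, then quit early on the first hit.
        for token in re.split(r"\W+", text.lower()):
            if token in self._bad_words:
                return True
        return False

f = ProfanityFilter(["darn", "heck"])
assert f.has_bad_word("Well, DARN it!")
assert not f.has_bad_word("all clean here")
```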

aarashy added 4 commits May 26, 2019 14:03

aarashy commented May 26, 2019

Hmm, I suppose the weakness of my approach is that it doesn't play as nicely with your self._no_word_boundaries flag. I tried to make it insensitive to punctuation and case, but there may be room to do better. Do you have a suggestion for implementing the regex here while keeping the linear runtime of the dictionary approach?
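One possible way to keep word-boundary handling without per-word scans (a sketch under assumptions, not the library's actual API; build_matcher and its flag handling are hypothetical) is to precompile a single alternation of all the bad words, so each call makes one pass over the input. Python's backtracking regex engine still tries each alternative at every position, so this is fast in practice rather than strictly linear.

```python
import re

def build_matcher(bad_words, no_word_boundaries=False):
    # Escape each word and join them into one alternation, compiled once.
    boundary = "" if no_word_boundaries else r"\b"
    joined = "|".join(re.escape(w) for w in bad_words)
    return re.compile(f"{boundary}(?:{joined}){boundary}", re.IGNORECASE)

matcher = build_matcher(["darn", "heck"])
assert matcher.search("Well, DARN!") is not None
assert matcher.search("darning a sock") is None  # \b blocks sub-word hits
```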

@DonaldTsang

Is this still being updated to be made compatible?


aarashy commented Jan 17, 2020

> Is this still being updated to be made compatible?

Basically, my new implementation of has_bad_word runs in time linear in the size of the input string (after a one-time, linear pass over the bad-word list), whereas the old one is at least quadratic and impractical for sufficiently large datasets. However, my implementation doesn't match sub-words: for example, "asdfuckasdf" would be missed by my implementation but caught by yours.

My recommendation is to split this into two functions: keep yours the way it is, and add mine as a much faster but less sensitive check.
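A sketch of that split, assuming (as the opening comment describes) that the original has_bad_word is implemented via censor; the stand-in censor below is simplified and the method names are illustrative:

```python
import re

class ProfanityFilter:
    def __init__(self, bad_words):
        self._bad_words = {w.lower() for w in bad_words}

    def censor(self, text):
        # Simplified stand-in for the library's regex-based censor method.
        for w in self._bad_words:
            text = re.sub(re.escape(w), "*" * len(w), text, flags=re.IGNORECASE)
        return text

    def has_bad_word_fast(self, text):
        # Linear-time, whole-word check: misses sub-word hits such as
        # "asdfuckasdf", but exits on the first match.
        return any(t in self._bad_words for t in re.split(r"\W+", text.lower()))

    def has_bad_word(self, text):
        # Thorough path: also catches profanity embedded inside other words.
        return text != self.censor(text)

f = ProfanityFilter(["darn"])
assert f.has_bad_word_fast("darn!") and not f.has_bad_word_fast("asdarnsdf")
assert f.has_bad_word("asdarnsdf")  # sub-word still caught by the slow path
```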

@duttonw (Collaborator) commented Nov 25, 2024

Hi @aarashy,

Are you able to add a unit test for what you have suggested?

Regards,

@duttonw
