
Improved runtime of has_bad_word #19

Open · wants to merge 4 commits into master
Conversation


@aarashy commented May 26, 2019

These methods used to run in O(|profane words| × |input text|) time (possibly even slower once you account for the regex operations inside the censor method). Using censor to implement has_bad_word is fundamentally inefficient. I wanted to run ProfanityFilter on my large dataset (millions of YouTube comments) and it was prohibitively slow. My new implementation leverages a dictionary to run in linear time and quits as soon as it finds a profane word, rather than doing tons of unnecessary computation. The old implementation made little progress in an hour on my dataset, whereas mine processed the whole dataset in under 2 minutes.
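A minimal sketch of the dictionary-based approach described above (the class shape and attribute names here are illustrative, not the actual PR code):

```python
import re

class ProfanityFilter:
    def __init__(self, bad_words):
        # One-time setup: a set gives O(1) average-case membership tests.
        self._bad_words = {w.lower() for w in bad_words}

    def has_bad_word(self, text):
        # Normalize case and split on non-word characters so punctuation
        # can't hide a match, then quit early on the first hit.
        for token in re.split(r"\W+", text.lower()):
            if token in self._bad_words:
                return True
        return False

f = ProfanityFilter(["darn", "heck"])
assert f.has_bad_word("Well, DARN it!")
assert not f.has_bad_word("all clean here")
```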

aarashy added 4 commits May 26, 2019 14:03

aarashy commented May 26, 2019

Hmm, I suppose the weakness of my approach is that it doesn't play as nicely with your self._no_word_boundaries flag. I tried to make it insensitive to punctuation and case, but there may be room to do better. Do you have a suggestion for implementing the regex here while keeping the linear runtime of the dictionary approach?
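One possible way to keep word-boundary handling without per-word scans (a sketch under assumptions, not the library's actual API; build_matcher and its flag handling are hypothetical) is to precompile a single alternation of all the bad words, so each call makes one pass over the input. Python's backtracking regex engine still tries each alternative at every position, so this is fast in practice rather than strictly linear.

```python
import re

def build_matcher(bad_words, no_word_boundaries=False):
    # Escape each word and join them into one alternation, compiled once.
    boundary = "" if no_word_boundaries else r"\b"
    joined = "|".join(re.escape(w) for w in bad_words)
    return re.compile(f"{boundary}(?:{joined}){boundary}", re.IGNORECASE)

matcher = build_matcher(["darn", "heck"])
assert matcher.search("Well, DARN!") is not None
assert matcher.search("darning a sock") is None  # \b blocks sub-word hits
```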

@DonaldTsang

Is this still being updated to be made compatible?


aarashy commented Jan 17, 2020

> Is this still being updated to be made compatible?

Basically, my new implementation of has_bad_word runs in time linear in the size of the input string (after a one-time, linear pass over the bad-word list), whereas the old one is at least quadratic and impractical for sufficiently large datasets. However, my implementation doesn't match sub-words: for example, "asdfuckasdf" would be missed by my implementation but caught by yours.

My recommendation is to split this into two functions: keep yours the way it is, and add mine as a much faster but less sensitive check.
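A sketch of that split, assuming (as the opening comment describes) that the original has_bad_word is implemented via censor; the stand-in censor below is simplified and the method names are illustrative:

```python
import re

class ProfanityFilter:
    def __init__(self, bad_words):
        self._bad_words = {w.lower() for w in bad_words}

    def censor(self, text):
        # Simplified stand-in for the library's regex-based censor method.
        for w in self._bad_words:
            text = re.sub(re.escape(w), "*" * len(w), text, flags=re.IGNORECASE)
        return text

    def has_bad_word_fast(self, text):
        # Linear-time, whole-word check: misses sub-word hits such as
        # "asdfuckasdf", but exits on the first match.
        return any(t in self._bad_words for t in re.split(r"\W+", text.lower()))

    def has_bad_word(self, text):
        # Thorough path: also catches profanity embedded inside other words.
        return text != self.censor(text)

f = ProfanityFilter(["darn"])
assert f.has_bad_word_fast("darn!") and not f.has_bad_word_fast("asdarnsdf")
assert f.has_bad_word("asdarnsdf")  # sub-word still caught by the slow path
```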

@duttonw (Collaborator) commented Nov 25, 2024

Hi @aarashy,

Are you able to add a unit test for what you have suggested?

Regards,

@duttonw
