findspam.py: body_text_repeated(): phrase repeated at beginning of body #7002
base: master
@@ -794,6 +794,34 @@ def misleading_link(s, site):
    return False, ''


# noinspection PyUnusedLocal,PyMissingTypeHints,PyTypeChecker
@create_rule("text repeated in {}", title=False, body_summary=True, max_rep=10000, max_score=10000)
def body_text_repeated(s, site):
    """
    Do some hacks to reduce the need for regex backtracking for this rule
    """
    s = s.rstrip("\n")
    if s.startswith("<p>") and s.endswith("</p>"):
        s = s[3:-4]
    initial_words = regex.match(r"\A([^\W_]+)[\W_]+([^\W_]+)[\W_]+([^\W_]+)", s)
    if not initial_words:
        return False, ""
    escaped_initial_words = [regex.escape(x) for x in initial_words.groups()]
    period = regex.match(
        r"\A%s[\W_]+%s[\W_]+%s[\W_]+(.{1,40}?)%s[\W_]+%s[\W_]+%s(?=$|[\W_])" % (
            tuple(escaped_initial_words * 2)), s)
    if not period:
        return False, ""
    period_words = regex.split(r"[\W_]+", period.groups(0)[0])
    escaped_words = escaped_initial_words + [
        regex.escape(x) for x in period_words]
    repeats_regex = r"\A(" + r"[\W_]+".join(escaped_words) + r"[\W_]*){10,}"
    repeats = regex.match(repeats_regex, s)
This also touches on the larger issue of how we use the |
    if repeats:
        return True, "Body contains repeated phrase '%s'" % repeats.groups(0)[0]
    return False, ""


# noinspection PyUnusedLocal,PyMissingTypeHints,PyTypeChecker
@create_rule("repeating words in {}", max_rep=11, stripcodeblocks=True)
def has_repeating_words(s, site):
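
To try the new rule's three stages outside SmokeDetector, here is a minimal standalone sketch. It is not the PR's code as merged: it assumes the standard-library `re` module behaves the same as the `regex` package for these particular patterns, it drops the `@create_rule` decorator and the `site` argument, and the function name and sample strings are made up for illustration.

```python
import re


def body_text_repeated_sketch(s):
    """Standalone approximation of body_text_repeated() using stdlib re."""
    s = s.rstrip("\n")
    if s.startswith("<p>") and s.endswith("</p>"):
        s = s[3:-4]

    # Stage 1: grab the first three words of the body.
    initial_words = re.match(r"\A([^\W_]+)[\W_]+([^\W_]+)[\W_]+([^\W_]+)", s)
    if not initial_words:
        return False, ""
    escaped_initial_words = [re.escape(x) for x in initial_words.groups()]

    # Stage 2: find the shortest filler (1 to 40 characters) after which
    # the same three words appear again; that fixes the repeat period.
    period = re.match(
        r"\A%s[\W_]+%s[\W_]+%s[\W_]+(.{1,40}?)%s[\W_]+%s[\W_]+%s(?=$|[\W_])"
        % tuple(escaped_initial_words * 2), s)
    if not period:
        return False, ""

    # Stage 3: build a literal pattern for one full period and require
    # at least ten consecutive copies at the start of the body.
    period_words = re.split(r"[\W_]+", period.group(1))
    escaped_words = escaped_initial_words + [re.escape(x) for x in period_words]
    repeats_regex = r"\A(" + r"[\W_]+".join(escaped_words) + r"[\W_]*){10,}"
    repeats = re.match(repeats_regex, s)
    if repeats:
        return True, "Body contains repeated phrase '%s'" % repeats.group(1)
    return False, ""


# Hypothetical inputs: twelve copies of a phrase, five copies, and ordinary prose.
spam_body = "<p>" + "buy cheap pills online now " * 12 + "</p>\n"
short_body = "<p>" + "buy cheap pills online now " * 5 + "</p>\n"
normal_body = "<p>This is an ordinary question about parsing dates in Python.</p>\n"

print(body_text_repeated_sketch(spam_body))
print(body_text_repeated_sketch(short_body))
print(body_text_repeated_sketch(normal_body))
```

Only the third stage uses a repetition quantifier over the whole phrase, and by then every word in the pattern is a literal, which is presumably what the docstring means by reducing the need for backtracking.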
Please add regex.cache_all(False) prior to this regex.match() and regex.cache_all(True) after it to prevent the single-use regexes compiled here from filling the regex package's internal cache of compiled regular expressions.

Note: this is probably something which we should be doing in a substantial number of places in the code, but I was reminded when creating PR #7012 that it should be used here. Basically, we should be doing that wherever we use a regex which is unique per-post. We should also change things in individual functions/modules to separately store compiled regexes which are used repeatedly, in order to reduce the number which we rely on the regex package to cache.
I've struck out my earlier comments here and closed an open PR I had, as regex.cache_all() is the wrong way to handle this here, or anywhere in SD, due to threading. I'll be writing a helper function which can be used for explicit regex compiles to force the implicit regex cache to not be used. I'll update this again once I've created that (soon). I'll also create an issue in the regex package asking to expose an option to be able to not use the implicit cache on a per-compile basis.
I've opened a PR, #7025, which adds regex_compile_no_cache() to helpers.py and imports it to findspam.py. That function can be used to compile a regular expression and not have it placed in the regex package's implicit cache. Using that function, the code here could be something like: