Detection of string delimiters #3305
Comments
Indeed, the default word regex includes codespell/codespell_lib/_codespell.py Line 47 in a175a33
The intent is to catch the apostrophe as part of words, not the quotes. However, since |
I think that a regex like: """(["'])((?!\1)(?:\\\1|.)+?)(?=\1)""" could do to match strings and remove the surrounding quotes before matching words. |
Wouldn't the above match non-word characters? Regexes are really a pain to read. |
We need a regex that matches words; the above doesn't:
>>> import re
>>>
>>> word_regex_def = """(["'])((?!\1)(?:\\\1|.)+?)(?=\1)"""
>>> word_regex = re.compile(word_regex_def)
>>>
>>> word_regex.findall("""Some text with "errror" and 'errror'""")
[]
>>>
Compare with the default regex:
>>> import re
>>>
>>> word_regex_def = r"[\w\-'’]+"
>>> word_regex = re.compile(word_regex_def)
>>>
>>> word_regex.findall("""Some text with "errror" and 'errror'""")
['Some', 'text', 'with', 'errror', 'and', "'errror'"]
>>> |
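One likely reason for the empty result above: the pattern is written in a non-raw string literal, so Python turns `\1` into the character `\x01` before the regex engine ever sees a backreference. The sketch below compares the two spellings (the variable names are illustrative only). Even with a raw string, `findall` returns (quote, content) tuples for quoted strings rather than a list of words, so the pattern cannot simply replace the word regex.

```python
import re

text = """Some text with "errror" and 'errror'"""

# Non-raw literal, as typed in the session above: "\1" becomes chr(1).
non_raw = re.compile("""(["'])((?!\1)(?:\\\1|.)+?)(?=\1)""")
# Raw literal: "\1" reaches the regex engine as a backreference to group 1.
raw = re.compile(r"""(["'])((?!\1)(?:\\\1|.)+?)(?=\1)""")

print(non_raw.findall(text))  # []
print(raw.findall(text))      # [('"', 'errror'), ("'", 'errror')]
```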
The idea is to remove the quotes and then match the words. The expression matches a quote ('"'), then ensures that the same quote can occur inside the string only if it is escaped, and finishes on the same quote it started with. The only problem is text within comments: there could be two single quotes that are not in strings.
#!python3
import re
test_str = ("// This is a \"comment\" and should be ignored.\n"
"//This is also a 'comment' which should be ignored.\n"
"printf(\"This is text meant to be \\\"captured\\\", and can include any type of character.\\n\"); // But \" must be escaped with a backslash.\n"
"printf(\"This is the same as above, only with 'different' nested quotes.\\n\");\n"
"putchar('I');\n"
"putchar('\\'');\n"
"printf(\"\"); printf(\"m thinking.\"); // First printf is not very useful.\n"
"printf(\"\\\"OK!\\\"\");\n"
"printf(\"Hello\"); printf(\" world!\\n\");\n"
"printf(\"Hello\"); printf(\" world!\\n\");\n"
"printf(\"%d file%s found.\\n\",iFileCount,((iFileCount != 1) ? \"s\" : \"\");\n"
"printf(\"Result is: %s\\n\",sText); // sText is \"success\" or \"failure\".\n"
"return iReturnCode; // 0 ... \"success\", 1 ... \"error\"\n"
"return iReturnCode; // It's been nice. It's been great.\n"
)
pattern = re.compile(r"""(["'])((?:(?!\1)(?:\\\1|.))*?)\1""")
word_pattern = re.compile(r"[\w\-'’]+")
for line in test_str.splitlines():
    # Replace each string's delimiters with spaces, keeping its content.
    unquoted = pattern.sub(r" \2 ", line)
    print("ORG:" + line)
    print("NEW:" + unquoted)
    print(word_pattern.findall(unquoted))
Output, of which the text after "NEW:" would be the input to the word matching logic:
Edited: Add the word regex. |
Isn't it much simpler to |
That would apply to all the "isn't", "shouldn't", "hasn't", etc . |
How exactly do you fear it would apply? It seems to me that
>>> "isn't".strip("'")
"isn't"
>>>
>>> "'isn't'".strip("'")
"isn't"
>>> |
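For illustration, the stripping idea can be applied to every token returned by the default word regex quoted earlier in this thread. This is only a sketch: `words_stripped` is a hypothetical helper, not part of codespell.

```python
import re

# Default word regex as quoted earlier in this thread.
WORD_REGEX = re.compile(r"[\w\-'’]+")

def words_stripped(text):
    """Hypothetical helper: match words, then drop surrounding single quotes."""
    stripped = (word.strip("'") for word in WORD_REGEX.findall(text))
    # A token made only of quotes becomes empty after stripping; skip it.
    return [word for word in stripped if word]

print(words_stripped("""Some text with "errror" and 'errror', isn't it?"""))
```

Inner apostrophes such as the one in "isn't" are kept, while a trailing apostrophe, as in a plural possessive, would be stripped along with real closing quotes.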
I admit I didn't think it through entirely, but here are some examples that could break the checks:
but some will have their equivalent without the quote, and some will not, like "packges" and "were". |
You're right. We need to distinguish between:
Somehow I doubt we can achieve robust detection of single quotes in any file. |
Strings in code would be mostly properly detected. Exceptions would be rare multiline strings that would probably have double quotes anyway. Something like "He said 'just don't' ..." would be found only in comments. I think that:
Further, the regex can be updated to strip r"""(?<!\w)(["'])((?:(?!\1)(?:\\\1|.))*?)\1(?!\w)""" |
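A small comparison of the two patterns from this thread, with the lookbehind written as `(?<!\w)`. The names `UNGUARDED`, `GUARDED` and `words_after_unquoting` are made up for this sketch. It shows that the guarded pattern leaves apostrophes inside words alone while still unquoting string literals.

```python
import re

# Pattern from the earlier script: any pair of matching quotes.
UNGUARDED = re.compile(r"""(["'])((?:(?!\1)(?:\\\1|.))*?)\1""")
# Same pattern, but the quotes must not touch a word character on the outside.
GUARDED = re.compile(r"""(?<!\w)(["'])((?:(?!\1)(?:\\\1|.))*?)\1(?!\w)""")
WORDS = re.compile(r"[\w\-'’]+")

def words_after_unquoting(pattern, line):
    """Replace string delimiters with spaces, then match words."""
    return WORDS.findall(pattern.sub(r" \2 ", line))

comment = "return x; // It's been nice. It's been great."
print(words_after_unquoting(UNGUARDED, comment))  # "It's" is split into "It" and "s"
print(words_after_unquoting(GUARDED, comment))    # "It's" is kept whole
print(words_after_unquoting(GUARDED, "x = 'errror'"))
```

A limitation remains: in a line such as `He said 'just don't' now`, the guarded pattern finds no match at all, because the first candidate closing quote is followed by a word character.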
The thing is that we already have a regex to match words, which can be changed using option |
Seems quite challenging to me (the regex needs to match just the word, so as an extra challenge everything needs to go in lookbehind and lookahead expressions which have limitations). So if that is the requirement (i.e., propose a regex to match the words), then I suggest to close this issue. |
This just needs more effort to take into account backwards compatibility, and perhaps a bigger picture. For example, text processing might include multiple steps, not only the word splitting step followed by the URL detection step. A new first step might somehow filter/preprocess the full text before splitting it into words. We need to think about the first step and make sure its scope is generalised, taking into account other possible purposes. A default first step and an option to modify it might be useful. |
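One way to picture such a multi-step design is sketched below. Everything here is an assumption made for illustration: `extract_words`, the `steps` parameter and `blank_outer_single_quotes` are hypothetical names, not existing codespell functions or options. The default (no steps) keeps the current behaviour, which is what backwards compatibility would require.

```python
import re
from typing import Callable, Iterable, List

WORD_REGEX = re.compile(r"[\w\-'’]+")  # default word regex from this thread
Step = Callable[[str], str]            # a preprocessing step maps text to text

def blank_outer_single_quotes(text: str) -> str:
    """Example step: blank single quotes that are not between two word characters."""
    return re.sub(r"(?<!\w)'|'(?!\w)", " ", text)

def extract_words(text: str, steps: Iterable[Step] = ()) -> List[str]:
    """Run the optional preprocessing steps in order, then split into words."""
    for step in steps:
        text = step(text)
    return WORD_REGEX.findall(text)

line = "x = 'errror'; // isn't"
print(extract_words(line))                                     # current behaviour
print(extract_words(line, steps=[blank_outer_single_quotes]))  # with the extra step
```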
OK, I tried to find a regex to replace the word extraction, but it quickly becomes quite complex, long and time-consuming to fine-tune. And 're' has limited support for lookbehind, so I had to test with the regex module. |
The use case that inspired me to look into this: https://github.com/INTERSECT-SDK/python-sdk/pull/19/files/33da9ff31d6162caa0dfc1a1155f321e6d68b1cc#diff-10380fd6e5ecb84c1ae11e135982739946c5aff1a50499378db397cf5034f54e Then I found this issue (Close codespell-project#3305), although maybe I am missing the use cases/problems that @DimitriPapadopoulos and @mdeweerd discussed back then.
Currently, words in double-quoted strings are spellchecked, but words in single-quoted strings are not.
My suggestion is to add a feature that would let codespell determine what a string delimiter is, possibly limited to certain file patterns (extensions).
The detection would be on a line-by-line basis.
I understand that this could be somewhat tricky.
One way to implement it would be to have a regex that finds strings and replaces the delimiters of the string with spaces before looking for words.
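A minimal sketch of that proposal, assuming a hypothetical helper `blank_string_delimiters` and an illustrative default list of file patterns (neither exists in codespell). Because each delimiter is replaced by exactly one space, the line keeps its length, so the column positions of the words do not shift.

```python
import re
from fnmatch import fnmatch

# Matches a quoted string on a single line; group 2 is the string content.
STRING_RE = re.compile(r"""(["'])((?:(?!\1)(?:\\\1|.))*?)\1""")

def blank_string_delimiters(line, filename, patterns=("*.c", "*.h", "*.py")):
    """Replace string delimiters with spaces, only for matching file names."""
    if not any(fnmatch(filename, pattern) for pattern in patterns):
        return line  # file type not covered: leave the line untouched
    return STRING_RE.sub(r" \2 ", line)

line = """putchar('I'); printf("errror");"""
print(blank_string_delimiters(line, "main.c"))
print(blank_string_delimiters(line, "notes.txt"))
```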