Detection of string delimiters #3305

mdeweerd · 2024-01-23T18:51:23Z

Currently, words in double quoted strings are spellchecked, but not strings with simple quotes.

My suggestion is to add a feature that would let codespell determine what a string delimiter is on a line by line basis, possibly limited to certain file patterns (extensions).

The detection would be on a line by line basis.

I understand that this could be somewhat tricky.

One way to implement it would be to have a regex that finds strings and replaces the delimiters of the string with spaces before looking for words.

DimitriPapadopoulos · 2024-01-23T20:49:19Z

Indeed, the default word regex includes ' in the word characters:

codespell/codespell_lib/_codespell.py

Line 47 in a175a33

word_regex_def = r"[\w\-'’]+" # noqa: RUF001

The intent is to catch the apostrophe as part of words, not the quotes. However, since ' can be used as an apostrophe or a quote, processing quoted words fails. Perhaps we should strip("'") words, removing leading and trailing ' and keep occurrences inside the word.

mdeweerd · 2024-01-23T22:07:25Z

I think that a regex like:

"""(["'])((?!\1)(?:\\\1|.)+?)(?=\1)"""

could do to match strings and remove the surrounding quotes before matching words.

https://regex101.com/r/hwVS2p/1

DimitriPapadopoulos · 2024-01-24T20:34:30Z

Wouldn't the above match non-word characters? Regexes are really a pain to read.

DimitriPapadopoulos · 2024-01-24T20:48:51Z

We need a regex that matches words, the above doesn't:

>>> import re
>>> 
>>> word_regex_def = """(["'])((?!\1)(?:\\\1|.)+?)(?=\1)"""
>>> word_regex = re.compile(word_regex_def)
>>> 
>>> word_regex.findall("""Some text with "errror" and 'errror'""")
 []
>>>

Compare with the default regex r"[\w\-'’]+":

>>> import re
>>> 
>>> word_regex_def = r"[\w\-'’]+"
>>> word_regex = re.compile(word_regex_def)
>>> 
>>> word_regex.findall("""Some text with "errror" and 'errror'""")
 ['Some', 'text', 'with', 'errror', 'and', "'errror'"]
>>>

mdeweerd · 2024-01-24T21:34:18Z

The idea is to remove the quotes and then match the words.
Apparently the expression if found is flawed, I updated it.

The expression matches a quote ('"'), then ensures that if that quote can only occur in the string if it is escaped, and finishes on the same quote as it matched on.

The only thing is text within comments - there could be two single quotes that are not in strings.
I think the benefit outweighs that sideeffect and if it occurs there surely is a way to write the comment differently.
(Example in the last line of the test case).

#!python3
import re

test_str = ("// This is a \"comment\" and should be ignored.\n"
    "//This is also a 'comment' which should be ignored.\n"
    "printf(\"This is text meant to be \\\"captured\\\", and can include any type of character.\\n\");  // But \" must be escaped with a backslash.\n"
    "printf(\"This is the same as above, only with 'different' nested quotes.\\n\");\n"
    "putchar('I');\n"
    "putchar('\\'');\n"
    "printf(\"\"); printf(\"m thinking.\");  // First printf is not very useful.\n"
    "printf(\"\\\"OK!\\\"\");\n"
    "printf(\"Hello\"); printf(\" world!\\n\");\n"
    "printf(\"Hello\"); printf(\" world!\\n\");"
    "printf(\"%d file%s found.\\n\",iFileCount,((iFileCount != 1) ? \"s\" : \"\");\n"
    "printf(\"Result is: %s\\n\",sText);  // sText is \"success\" or \"failure\".\n"
    "return iReturnCode; // 0 ... \"success\", 1 ... \"error\"\n"
    "return iReturnCode; // It's been nice.  It's been great.\n"
)


pattern=re.compile(r"""(["'])((?:(?!\1)(?:\\\1|.))*?)\1""")

word_pattern = re.compile(r"[\w\-'’]+")

for line in test_str.splitlines():
    unquoted=pattern.sub(r" \2 ", line)
    print("ORG:"+line)
    print("NEW:"+unquoted)
    print(word_pattern.findall(unquoted))

Output which (the text after "NEW:") would be the input to the word matching logic:

ORG:// This is a "comment" and should be ignored.
NEW:// This is a  comment  and should be ignored.
['This', 'is', 'a', 'comment', 'and', 'should', 'be', 'ignored']
ORG://This is also a 'comment' which should be ignored.
NEW://This is also a  comment  which should be ignored.
['This', 'is', 'also', 'a', 'comment', 'which', 'should', 'be', 'ignored']
ORG:printf("This is text meant to be \"captured\", and can include any type of character.\n");  // But " must be escaped with a backslash.
NEW:printf( This is text meant to be \"captured\", and can include any type of character.\n );  // But " must be escaped with a backslash.
['printf', 'This', 'is', 'text', 'meant', 'to', 'be', 'captured', 'and', 'can', 'include', 'any', 'type', 'of', 'character', 'n', 'But', 'must', 'be', 'escaped', 'with', 'a', 'backslash']
ORG:printf("This is the same as above, only with 'different' nested quotes.\n");
NEW:printf( This is the same as above, only with 'different' nested quotes.\n );
['printf', 'This', 'is', 'the', 'same', 'as', 'above', 'only', 'with', "'different'", 'nested', 'quotes', 'n']
ORG:putchar('I');
NEW:putchar( I );
['putchar', 'I']
ORG:putchar('\'');
NEW:putchar( \' );
['putchar', "'"]
ORG:printf(""); printf("m thinking.");  // First printf is not very useful.
NEW:printf(  ); printf( m thinking. );  // First printf is not very useful.
['printf', 'printf', 'm', 'thinking', 'First', 'printf', 'is', 'not', 'very', 'useful']
ORG:printf("\"OK!\"");
NEW:printf( \"OK!\" );
['printf', 'OK']
ORG:printf("Hello"); printf(" world!\n");
NEW:printf( Hello ); printf(  world!\n );
['printf', 'Hello', 'printf', 'world', 'n']
ORG:printf("Hello"); printf(" world!\n");printf("%d file%s found.\n",iFileCount,((iFileCount != 1) ? "s" : "");
NEW:printf( Hello ); printf(  world!\n );printf( %d file%s found.\n ,iFileCount,((iFileCount != 1) ?  s  :   );
['printf', 'Hello', 'printf', 'world', 'n', 'printf', 'd', 'file', 's', 'found', 'n', 'iFileCount', 'iFileCount', '1', 's']
ORG:printf("Result is: %s\n",sText);  // sText is "success" or "failure".
NEW:printf( Result is: %s\n ,sText);  // sText is  success  or  failure .
['printf', 'Result', 'is', 's', 'n', 'sText', 'sText', 'is', 'success', 'or', 'failure']
ORG:return iReturnCode; // 0 ... "success", 1 ... "error"
NEW:return iReturnCode; // 0 ...  success , 1 ...  error 
['return', 'iReturnCode', '0', 'success', '1', 'error']
ORG:return iReturnCode; // It's been nice.  It's been great.
NEW:return iReturnCode; // It s been nice.  It s been great.
['return', 'iReturnCode', 'It', 's', 'been', 'nice', 'It', 's', 'been', 'great']

Edited: Add the word regex.

DimitriPapadopoulos · 2024-01-24T21:38:25Z

Isn't it much simpler to strip("'") the words matched by the default regex, asuggested in #3305 (comment)?

mdeweerd · 2024-01-24T21:49:28Z

Isn't it much simpler to strip("'") the words matched by the default regex, asuggested in #3305 (comment)?

That would apply to all the "isn't", "shouldn't", "hasn't", etc .

DimitriPapadopoulos · 2024-01-24T22:05:10Z

How exactly do you fear it would apply? It seems to me that strip("'") does the right thing:

>>> "isn't".strip("'")
"isn't"
>>> 
>>> "'isn't'".strip("'")
"isn't"
>>>

mdeweerd · 2024-01-24T23:14:12Z

I admit I didn't think it through entirely, but here are some examples that could break the checks:

dont'->don't
dosent'->doesn't
gaus'->Gauss'
guas'->Gauss'
guass'->Gauss'
hasnt'->hasn't
havent'->haven't
isnt'->isn't
packges'->packages'
shouldnt'->shouldn't
thats'->that's
wasnt'->wasn't
wouldnt'->wouldn't
were'->we're

but some will have their equivalent without the quote, and some will not like "packges" and "were".
So stripping on words that are correctly spelled is ok, stripping words that are not, isn't.

DimitriPapadopoulos · 2024-01-25T08:01:26Z

You're right. We need to distinguish between:

you dont' say (dont' → don't)
he said 'just dont' but he insisted (dont' → don't')

Somehow I doubt we can achieve robust detection of single quotes in any file.

mdeweerd · 2024-01-25T09:46:26Z

Strings in code would be mostly properly detected. Exceptions would be rare multiline strings that would probably have double quotes anyway.

Something like "He said 'just don't' ..." would be found only in comments.
( "... 'just dont' "would still be detected as a spelling issue ).

I think that:

It's rare to have multiple single quotes like that in comments;
It's incorrect to quote like that, so if detected, it can be corrected to "just don't";
It will be harder to robustly detect all cases, but currently a lot of strings with single quotes go unchecked.
I think we admit that codespell is not the perfect spelling checker. But by handling the strings, codespell can detect more.

Further, the regex can be updated to strip 'just don't' to just don't by adding a negative lookbehind an lookahead (I did not test the regex, showing the principle):

r"""(?!<\w)(["'])((?:(?!\1)(?:\\\1|.))*?)\1(?!\w)"""

DimitriPapadopoulos · 2024-01-25T15:24:46Z

The thing is that we already have a regex to match words, which can be changed using option --regex. I am OK to post-process matched words, but I am reluctant to pre-process input before it is fed to the regex. It can be confusing. So you would need to provide a new default regex that does the quote filtering and word matching in a single pass.

mdeweerd · 2024-01-25T17:16:29Z

So you would need to provide a new default regex that does the quote filtering and word matching in a single pass.

Seems quite challenging to me (the regex needs to match just the word, so as an extra challenge everything needs to go in lookbehind and lookahead expressions which have limitations).
Also, in that case it just needs to be passed to --regex .

So if that is the requirement (i.e., propose a regex to match the words), then I suggest to close this issue.

DimitriPapadopoulos · 2024-01-27T12:20:50Z

This just needs more effort to take into account backwards compatibility, and perhaps take into account a bigger picture. For example, text processing might include multiple steps, not only the word splitting step followed by the URL detection step. A new first step might somehow filter/preprocess the full text, before splitting into words. We need to think about the first step and make sure its scope is generalised, taken into account other possible purposes. A default first step and an option to modify it might be useful.

mdeweerd · 2024-01-27T13:25:53Z

Ok, I tried to find a regex to replace the word extraction but it becomes quickly quite complex, long and timeconsuming to finetune. And 're' has limited support for lookbehind so I had to test with the regex module.

@DimitriPapadopoulos

The inspired for me to look use case - https://github.com/INTERSECT-SDK/python-sdk/pull/19/files/33da9ff31d6162caa0dfc1a1155f321e6d68b1cc#diff-10380fd6e5ecb84c1ae11e135982739946c5aff1a50499378db397cf5034f54e And then I found the issue this - Close codespell-project#3305 Although may be I am missing the use-cases/problems @DimitriPapadopoulos and @mdeweerd discussed back then

DimitriPapadopoulos added the bug label Jan 23, 2024

DimitriPapadopoulos changed the title ~~Detection of string delimiters .~~ Detection of string delimiters Jan 24, 2024

commodis mentioned this issue Jan 30, 2024

Question: how can I create a space seperated word in a dictionary? #3342

Closed

This was referenced Nov 22, 2024

Do not allow for a word to start or end with punctuation symbols #3588

Open

Add codespell support (config, workflow to detect/not fix) and make it fix few typos INTERSECT-SDK/python-sdk#19

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detection of string delimiters #3305

Detection of string delimiters #3305

mdeweerd commented Jan 23, 2024 •

edited

Loading

DimitriPapadopoulos commented Jan 23, 2024 •

edited

Loading

mdeweerd commented Jan 23, 2024

DimitriPapadopoulos commented Jan 24, 2024

DimitriPapadopoulos commented Jan 24, 2024

mdeweerd commented Jan 24, 2024 •

edited

Loading

DimitriPapadopoulos commented Jan 24, 2024

mdeweerd commented Jan 24, 2024

DimitriPapadopoulos commented Jan 24, 2024

mdeweerd commented Jan 24, 2024

DimitriPapadopoulos commented Jan 25, 2024

mdeweerd commented Jan 25, 2024

DimitriPapadopoulos commented Jan 25, 2024

mdeweerd commented Jan 25, 2024

DimitriPapadopoulos commented Jan 27, 2024 •

edited

Loading

mdeweerd commented Jan 27, 2024

Detection of string delimiters #3305

Detection of string delimiters #3305

Comments

mdeweerd commented Jan 23, 2024 • edited Loading

DimitriPapadopoulos commented Jan 23, 2024 • edited Loading

mdeweerd commented Jan 23, 2024

DimitriPapadopoulos commented Jan 24, 2024

DimitriPapadopoulos commented Jan 24, 2024

mdeweerd commented Jan 24, 2024 • edited Loading

DimitriPapadopoulos commented Jan 24, 2024

mdeweerd commented Jan 24, 2024

DimitriPapadopoulos commented Jan 24, 2024

mdeweerd commented Jan 24, 2024

DimitriPapadopoulos commented Jan 25, 2024

mdeweerd commented Jan 25, 2024

DimitriPapadopoulos commented Jan 25, 2024

mdeweerd commented Jan 25, 2024

DimitriPapadopoulos commented Jan 27, 2024 • edited Loading

mdeweerd commented Jan 27, 2024

mdeweerd commented Jan 23, 2024 •

edited

Loading

DimitriPapadopoulos commented Jan 23, 2024 •

edited

Loading

mdeweerd commented Jan 24, 2024 •

edited

Loading

DimitriPapadopoulos commented Jan 27, 2024 •

edited

Loading