cksum: Accept non UTF-8 inputs #6603

RenjiSann · 2024-08-01T22:24:17Z

github-actions · 2024-08-02T09:25:42Z

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

RenjiSann · 2024-08-17T09:34:47Z

@BenWiederhake Could you take a look ? :)

sylvestre · 2024-08-17T10:16:23Z

src/uucore/src/lib/features/checksum.rs

+ if !last.is_ascii_whitespace() {
+ break;
+ }
+ line_trim = &line_trim[..line_trim.len() - 1];


could you please document what this line is doing, it isn't obvious

It is a simple implementation of the trim_ascii() function which is not available for MSRV 1.70.

The behavior is extracted in its own documented function in #6654
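
For readers following along, here is a minimal sketch of such a trimming helper. It assumes the goal is to drop leading and trailing ASCII whitespace from a byte slice on a toolchain where `trim_ascii()` is not yet available. The `*_compat` names are hypothetical and are not the functions from this PR or #6654.

```rust
/// Hypothetical helper: drop trailing ASCII whitespace from a byte slice.
/// Works on raw bytes, so the input does not need to be valid UTF-8.
fn trim_ascii_end_compat(mut bytes: &[u8]) -> &[u8] {
    // Peel off the last byte for as long as it is ASCII whitespace.
    while let Some((last, rest)) = bytes.split_last() {
        if !last.is_ascii_whitespace() {
            break;
        }
        bytes = rest;
    }
    bytes
}

/// Hypothetical helper: drop leading ASCII whitespace from a byte slice.
fn trim_ascii_start_compat(mut bytes: &[u8]) -> &[u8] {
    while let Some((first, rest)) = bytes.split_first() {
        if !first.is_ascii_whitespace() {
            break;
        }
        bytes = rest;
    }
    bytes
}

/// Hypothetical helper: trim both ends.
fn trim_ascii_compat(bytes: &[u8]) -> &[u8] {
    trim_ascii_end_compat(trim_ascii_start_compat(bytes))
}
```

Bytes such as `0xFF` are not ASCII whitespace, so they are left untouched, which is what makes this usable on non-UTF-8 lines.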

sylvestre · 2024-08-17T10:19:13Z

tests/by-util/test_cksum.rs

@@ -1277,3 +1277,27 @@ fn test_non_utf8_filename() {
 .stdout_is_bytes(b"SHA256 (funky\xffname) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\n")
 .no_stderr();
 }
+
+#[cfg(target_os = "linux")]


why linux only ?

could you please add a test that injects invalid UTF-8?

why linux only ?

This is somewhat related to this.

TLDR: With the current MSRV being 1.70, there is no way on Windows to translate between OsStr and &[u8] without going through str (this holds for the owned types as well).

could you please add a test that injects invalid UTF-8?

This is what the test should be doing: it uses non-UTF-8 characters that would make the old implementation fail, because that implementation would try to create Strings from them.

there is no way to translate between OsStr and &[u8]

Yes there is: os_str_as_bytes, just like you also use in the crate itself.

This is what the test should be doing.

No, the test does not check for correct handling of files with non-UTF-8 names.

I'd like to be sure I understand correctly. In the added test, I should check that the hashes variable gets written to a file with a non-UTF-8 name, is this correct?

If so, I might need to slightly refactor the test framework in order to be able to write to files with non-UTF-8 names.

RenjiSann · 2024-08-26T13:21:11Z

@sylvestre all your comments are addressed in #6654. If that's still an issue, please tell me, I will refactor the code accordingly.

BenWiederhake

I'm confused.

This PR has older code, doesn't address the points sylvestre already mentioned, and you also say that #6654 ("checksum: rework for improving checkum checking GNU behavior match") should be used instead.
The other PR (#6654) doesn't address sylvestre's points either (e.g. the linux-only issue), and seems to be a collection of many different PRs, which makes it unnecessarily difficult to review, and it's also in draft mode (which indicates to me that it's more of a "sneak peek", and not ready).

So which PR would you like us to review?

BenWiederhake · 2024-09-07T21:26:14Z

tests/by-util/test_cksum.rs

@@ -1277,3 +1277,27 @@ fn test_non_utf8_filename() {
 .stdout_is_bytes(b"SHA256 (funky\xffname) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\n")
 .no_stderr();
 }
+
+#[cfg(target_os = "linux")]


there is no way to translate between OsStr and &[u8]

Yes there is: os_str_as_bytes, just like you also use in the crate itself.

This is what the test should be doing.

No, the test does not check for correct handling of files with non-UTF-8 names.

RenjiSann · 2024-09-17T12:58:13Z

The other PR doesn't address sylvestre's points either (e.g. the linux-only issue)

The test is Linux-only because os_str_as_bytes will always fail on Windows when given non-UTF-8 characters.
This is because there is no safe, provided way to convert an OsStr to &[u8] on Windows, so we are required to go through an intermediate &str conversion.
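
To make the platform difference concrete, here is a rough sketch of the pattern being described. It assumes a helper that borrows the raw bytes on Unix and falls back to a `&str` round-trip elsewhere. The name `os_str_to_bytes_sketch` and the `Option` return type are illustrative assumptions, not the actual signature of os_str_as_bytes.

```rust
use std::ffi::OsStr;
#[cfg(unix)]
use std::os::unix::ffi::OsStrExt;

/// Hypothetical stand-in for a helper like `os_str_as_bytes`.
/// Returns `None` when the platform offers no lossless byte view.
fn os_str_to_bytes_sketch(s: &OsStr) -> Option<&[u8]> {
    // On Unix an OsStr is an arbitrary byte sequence, so this never fails.
    #[cfg(unix)]
    let bytes = Some(s.as_bytes());

    // Elsewhere (e.g. Windows) the only route used here goes through `&str`,
    // which is `None` for anything that is not valid Unicode.
    #[cfg(not(unix))]
    let bytes = s.to_str().map(str::as_bytes);

    bytes
}
```

With this shape, a name such as `funky\xffname` converts fine on Linux, while on Windows the fallback branch can only succeed for names that are valid Unicode, which is why a test built around invalid bytes is gated by platform.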

Improving the handling of Windows filesystem exceptions might be a further enhancement, but I don't think it should be the main point of this PR.

and seems to be a collection of many different PRs, which makes it unnecessarily difficult to review,

True, #6654 is a really big refactor, which solves many problems at once. I am thinking about reorganizing all the changes into smaller PRs, while trying to keep the reasoning clear.

RenjiSann · 2024-09-17T13:35:36Z

So which PR would you like us to review?

Let's review and merge this one first; I have now added the requested changes, which I had originally put in #6654.
Next, I will work on splitting #6654 into reasonably sized changes, and try to document things properly.

RenjiSann · 2024-09-17T16:42:49Z

As requested, I changed the code to handle non-UTF8 filenames, and I added a test for it.

Previously, I used String::from_utf8_lossy to display the filename on the terminal when needed. I realized this was wrong, because it inserts a REPLACEMENT CHARACTER (U+FFFD) for every invalid UTF-8 sequence, whereas we want to omit those sequences completely.
At first, I tried to copy-paste the implementation of String::from_utf8_lossy and adapt it, but it relies on unstabilized internals, so this could not be done without copy-pasting even more code from the stdlib.

The compromise I went with is to simply remove the U+FFFD chars from the output of String::from_utf8_lossy. This produces unwanted behavior if that character appears in the original filename, but I consider this unlikely enough that the fix can wait until the relevant String API is stable (MSRV 1.79 iirc).
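
A minimal sketch of that compromise, for illustration only. The function name `lossy_without_replacement` is hypothetical and is not taken from the PR.

```rust
/// Hypothetical helper: decode bytes for display, dropping invalid UTF-8
/// sequences instead of showing U+FFFD for them.
///
/// Caveat (as noted above): a genuine U+FFFD already present in the input
/// is removed as well, because it cannot be told apart after decoding.
fn lossy_without_replacement(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).replace(char::REPLACEMENT_CHARACTER, "")
}
```

For example, the bytes `funky\xffname` decode lossily to `funky\u{FFFD}name`, and this helper would turn them into `funkyname`.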

github-actions · 2024-09-17T16:58:20Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

github-actions · 2024-09-18T13:45:17Z

GNU testsuite comparison:

Skipping an intermittent issue tests/rm/rm1 (passes in this run but fails in the 'main' branch)

RenjiSann force-pushed the renji/utf8-cksum-comment branch 6 times, most recently from 8111dc4 to 9eecf25 Compare August 2, 2024 08:58

RenjiSann force-pushed the renji/utf8-cksum-comment branch 2 times, most recently from b4d46d2 to f2abbae Compare August 3, 2024 10:02

sylvestre reviewed Aug 17, 2024

sylvestre reviewed Aug 17, 2024

sylvestre reviewed Aug 17, 2024

RenjiSann mentioned this pull request Aug 17, 2024

checksum: rework for improving checkum checking GNU behavior match #6654

Draft

BenWiederhake requested changes Sep 7, 2024

sylvestre force-pushed the renji/utf8-cksum-comment branch from f2abbae to c2f43fa Compare September 16, 2024 07:25

RenjiSann force-pushed the renji/utf8-cksum-comment branch from c2f43fa to 5d68b3b Compare September 17, 2024 13:32

RenjiSann force-pushed the renji/utf8-cksum-comment branch 3 times, most recently from 8d2ea86 to 3533b6e Compare September 17, 2024 16:07

checksum: Allow for non UTF-8 content in input file

b4fc8d5

RenjiSann force-pushed the renji/utf8-cksum-comment branch from 3533b6e to 0415d72 Compare September 17, 2024 16:32

test(cksum): add non-UTF-8 handling tests

07da5d9

- add a test for non-UTF-8 chars in comments
- add a test for non-UTF-8 chars in filenames

RenjiSann force-pushed the renji/utf8-cksum-comment branch from 0415d72 to 07da5d9 Compare September 18, 2024 13:17
