fix: Fix handling invalid(?) unicode in filenames. #163
When running yr scan against some random files I have lying around, I noticed a crash. I think it is because there is invalid Unicode (at least I think it is invalid; I'm bad at understanding Unicode) in the filename of one of the files contained in 84523ddad722e205e2d52eedfb682026928b63f919a7bf1ce6f1ad4180d0f507 (available on VT). This changes the code so the string goes through the debug formatter, which I think handles invalid UTF-8 better. This seems to work fine on my machine.
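For context, here is a minimal sketch of how the two renderings differ for a file name that is not valid UTF-8. It is Unix-only because it builds the name from raw bytes, and `debug_render`, `lossy_render` and `invalid_utf8_path` are names made up for this illustration, not yara-x code.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Renders a path with the `Debug` formatter, which escapes control
/// characters (such as newlines) and bytes that are not valid UTF-8.
fn debug_render(path: &Path) -> String {
    format!("{:?}", path)
}

/// Renders a path lossily: each invalid UTF-8 sequence becomes U+FFFD.
fn lossy_render(path: &Path) -> String {
    path.to_string_lossy().into_owned()
}

/// Hypothetical test input: a file name containing the byte 0xFF,
/// which can never appear in valid UTF-8.
#[cfg(unix)]
fn invalid_utf8_path() -> &'static Path {
    use std::os::unix::ffi::OsStrExt;
    Path::new(OsStr::from_bytes(b"bad\xFFname.txt"))
}
```

Either rendering avoids a panic on the invalid byte; they differ in whether the output stays readable (lossy) or unambiguous (debug).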
I'm unable to reproduce this issue in macOS or Linux. I tried downloading and unzipping the file […] Notice however that […] Can you try using […]
BTW: the problem here is that the […] These are the characters considered whitespace: https://doc.rust-lang.org/reference/whitespace.html A more robust approach is writing our own function that replaces those characters with the ASCII 32 (space) character. It would be very straightforward: every character […]
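To see which characters such a function would touch, here is a small sketch built on `char::is_whitespace`. Note that the linked Reference page describes the whitespace accepted by the Rust lexer (the `Pattern_White_Space` property), while `char::is_whitespace` checks the broader Unicode `White_Space` property, so the two sets are not identical. `unusual_whitespace` is a hypothetical helper name.

```rust
/// Returns the characters in `s` that `char::is_whitespace` accepts
/// but that are not the plain ASCII space (U+0020).
fn unusual_whitespace(s: &str) -> Vec<char> {
    s.chars()
        .filter(|c| *c != ' ' && c.is_whitespace())
        .collect()
}
```

For example, tab, newline, U+00A0 (no-break space), U+2028 (line separator) and U+3000 (ideographic space) are all reported, while U+200B (zero width space) is not, because it lacks the `White_Space` property.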
I implemented a […] What I don't understand is that this only happens when outputting the files being scanned, not when outputting a match. I would have expected to have to alter the code around https://github.com/VirusTotal/yara-x/blob/main/cli/src/commands/scan.rs#L450 too, but for some reason that does not crash when outputting a match on the same file for me.
A more efficient implementation would be:
fn replace_whitespace(path: &PathBuf) -> Cow<str> {
    let mut s = path.to_string_lossy();
    for (i, c) in s.char_indices() {
        if c != ' ' && c.is_whitespace() {
            s.to_mut().replace_range(i..i, " ");
        }
    }
    s
}
With this implementation, […]
Also, […] This requires adapting […]
This mutates the string while you're iterating over it, which the borrow checker is unhappy with.
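The conflict is that `s.char_indices()` keeps a shared borrow of `s` alive for the whole loop, while `s.to_mut()` needs an exclusive one. One way around it, sketched here under assumptions, is to record the positions first and patch the string afterwards. This is a hypothetical variant, not the code that ended up in the PR; it takes `&Path` rather than `&PathBuf` and uses the made-up name `replace_whitespace_in_place`.

```rust
use std::borrow::Cow;
use std::path::Path;

/// Hypothetical variant: collects the byte ranges of unusual whitespace
/// first, then patches the string, so no iterator over `s` is alive
/// while `s` is mutated.
fn replace_whitespace_in_place(path: &Path) -> Cow<'_, str> {
    let mut s = path.to_string_lossy();
    // First pass: shared borrow only; it ends when `collect` returns.
    let hits: Vec<(usize, usize)> = s
        .char_indices()
        .filter(|(_, c)| *c != ' ' && c.is_whitespace())
        .map(|(i, c)| (i, c.len_utf8()))
        .collect();
    // Second pass: walk backwards so earlier byte offsets stay valid even
    // when a multi-byte character shrinks to a one-byte space.
    for (i, len) in hits.into_iter().rev() {
        // `i..i + len` covers the whole character. An empty range such as
        // `i..i` would insert a space without removing anything.
        s.to_mut().replace_range(i..i + len, " ");
    }
    s
}
```

When no unusual whitespace is found, `to_mut` is never called, so a borrowed `Cow` stays borrowed.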
Oh, that's right. Then something like this should work:
use std::borrow::Cow;
use std::path::PathBuf;

fn replace_whitespace(path: &PathBuf) -> Cow<str> {
    let mut s = path.to_string_lossy();
    // Only build a new string if some whitespace other than ' ' is present.
    if s.chars().any(|c| c != ' ' && c.is_whitespace()) {
        let mut r = String::with_capacity(s.len());
        for c in s.chars() {
            // Every whitespace character becomes a plain ASCII space.
            if c.is_whitespace() {
                r.push(' ')
            } else {
                r.push(c)
            }
        }
        s = Cow::Owned(r);
    }
    s
}
The allocation for the new string happens only for strings that contain an unusual whitespace character.
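The allocation claim can be checked by looking at which `Cow` variant comes back. The sketch below restates the same logic in iterator form so that it compiles on its own; taking `&Path` instead of `&PathBuf` and the `is_borrowed` helper are assumptions made for this illustration.

```rust
use std::borrow::Cow;
use std::path::Path;

/// Same idea as the function in the comment above: replace every
/// whitespace character with ' ', allocating only when needed.
fn replace_whitespace(path: &Path) -> Cow<'_, str> {
    let s = path.to_string_lossy();
    if s.chars().any(|c| c != ' ' && c.is_whitespace()) {
        // Slow path: build one new string with the replacements applied.
        let cleaned: String = s
            .chars()
            .map(|c| if c.is_whitespace() { ' ' } else { c })
            .collect();
        Cow::Owned(cleaned)
    } else {
        // Fast path: hand back whatever `to_string_lossy` produced.
        s
    }
}

/// Hypothetical helper: true when the value still borrows from the path,
/// meaning no new string was allocated.
fn is_borrowed(s: &Cow<'_, str>) -> bool {
    matches!(s, Cow::Borrowed(_))
}
```

One caveat: `to_string_lossy` itself returns an owned string when the path is not valid UTF-8, so "no allocation" only holds for valid UTF-8 names without unusual whitespace.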
Use Victor's suggestion to allocate a new string only when necessary. If there are no non-ASCII whitespace characters we don't allocate at all, and if there are we should allocate only once.