Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect hash when passing string into xxHash64.ComputeHash #26

Open
Crauzer opened this issue Feb 3, 2023 · 8 comments
Open

Incorrect hash when passing string into xxHash64.ComputeHash #26

Crauzer opened this issue Feb 3, 2023 · 8 comments

Comments

@Crauzer
Copy link

Crauzer commented Feb 3, 2023

Using ComputeHash with a string will yield an "incorrect" hash because the library is casting a string instance into an unsafe char* which can cause a different hash to be returned depending on the system it's running on.

public static unsafe ulong ComputeHash(string str, uint seed = 0)
{
Debug.Assert(str != null);
fixed (char* c = str)
{
byte* ptr = (byte*) c;
int length = str.Length * 2;
return UnsafeComputeHash(ptr, length, seed);
}
}

The official .NET documentation clearly specifies that the default encoding can very between systems and additionally that the string and char types use UTF-16 internally

The correct approach here would be to create a stack-allocated Span<byte> and then use Encoding.UTF8.GetBytes.
Additionally an optional encoding parameter could also be added, with the default being UTF8.

Example:
image

@sad1y
Copy link

sad1y commented Feb 13, 2023

I doubt this is a problem at all, because as you mentioned, char by itself takes up 16 bits, and you have to account for that before claiming it's wrong behavior. Under the hood, UTF-8 and UTF-16 use different memory sizes with different memory layouts, and just because you can treat it as the same object doesn't mean the objects are the same.

@Crauzer
Copy link
Author

Crauzer commented Feb 13, 2023

Pretty much every hash algo implementation in existence assumes that a string input is intended to be hashed using UTF8/ASCII so in that sense the behavior of this library makes no sense. At the very least you should be able to specify the encoding you intend to use for the hash, which in 99% of cases is going to be UTF8. There is no reason why you'd ever be hashing strings using UTF16.

Your comment is correct in a generic sense but wrong when you try to apply this thinking for hashing.

@Crauzer
Copy link
Author

Crauzer commented Feb 13, 2023

I was recently refactoring some of my code in my tools and that included replacing an old XXHash implementation with this library. I discovered this strange behavior after like an hour of debugging my whole tooling pipeline and at that point I decided to just code up my own XXHash64 implementation in my XXHash3 library so I don't have to deal with this.

@sad1y
Copy link

sad1y commented Feb 14, 2023

Well, can you show me, lets say, three examples of cpp libraries that assume that your std::string is UTF-8 string? The thing is that this library shouldn't contain overload for string at all and make you responsible for allocating, converting proper byte array for hashing, because that library is not about converting one type to another this library is about hashing. Another point which make my opinion stronger is that there is no overload method for stream and that's good because there is a plenty streams that can't be read in the same way, e.g. almost every stream in Kestrel is async-only so you not be able to read it synchronously

@sad1y
Copy link

sad1y commented Feb 14, 2023

I was recently refactoring some of my code in my tools and that included replacing an old XXHash implementation with this library. I discovered this strange behavior after like an hour of debugging my whole tooling pipeline and at that point I decided to just code up my own XXHash64 implementation in my XXHash3 library so I don't have to deal with this.

Is it really worth to yet another library why you don't contribute to that project?

@Crauzer
Copy link
Author

Crauzer commented Feb 14, 2023

because that library is not about converting one type to another this library is about hashing

You are correct, it is for hashing, and thus it should follow common hashing conventions. Since there is a public API method for hashing strings, you would assume that if you call the said method and an equivalent one from any other language implementation, you would get the same result, this assumption breaks as soon as you realize that it's trying to hash strings with UTF-16 encoding.

Lets see, C++'s std::string has no fixed encoding, so it can contain any kind of string and it is up to the user to handle that, majority will opt-in for UTF8 or ASCII. Rust's std::String uses UTF-8 encoding internally for example, and I can go on and on...

Anyhow, the overwhelming majority of software hashes strings as either UTF8 or ASCII, which is pretty much the same thing in this context, so when I call xxHash64.ComputeHash("abc") I'm absolutely not expecting to get a hashed UTF-16 string but an ASCII/UTF8 one.

@Crauzer
Copy link
Author

Crauzer commented Feb 14, 2023

This also isn't the only design flaw of this library's API, another major one is the fact that you have to pass in a length parameter for Span-based API's, which is total nonsense

@sad1y
Copy link

sad1y commented Feb 14, 2023

This also isn't the only design flaw of this library's API, another major one is the fact that you have to pass in a length parameter for Span-based API's, which is total nonsense

I totally agree with that.

I'm absolutely not expecting to get a hashed UTF-16 string but an ASCII/UTF8 one.

Actually you have to expect things like that because it is a part of the type system in that particular language. Are you okay with the fact that language like cpp don't care about whats the inside? And what about languages like JS and Java which are using utf-16 as coding method. Rust also have concept of unicode character so if you want to make iterator from it and feed it to hash function, well, you have to encode it first somehow.

In the end If you want hash something you have to feed bytes to hash function because bytes is the only thing that hash function working with, why are you push everyone to utf-8 I don't know, it feels like it is just convenient for you.

Moreover that the way that you convert string in your library is inefficient as it could be. Take a look at https://learn.microsoft.com/en-us/dotnet/api/system.text.unicode.utf8?view=net-7.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants