Incorrect hash when passing string into xxHash64.ComputeHash #26
Comments
I doubt this is a problem at all, because as you mentioned, |
Pretty much every hash algo implementation in existence assumes that a string input is intended to be hashed using UTF8/ASCII so in that sense the behavior of this library makes no sense. At the very least you should be able to specify the encoding you intend to use for the hash, which in 99% of cases is going to be UTF8. There is no reason why you'd ever be hashing strings using UTF16. Your comment is correct in a generic sense but wrong when you try to apply this thinking for hashing. |
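To make the encoding point concrete, here is a minimal C# sketch showing that the same string produces different byte sequences depending on whether it is encoded as UTF-8 or its in-memory UTF-16 code units are read as raw bytes. The class and method names (EncodingBytesDemo, Utf8Bytes, RawUtf16Bytes) are made up for illustration and are not part of the library under discussion; only standard .NET calls are used.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class EncodingBytesDemo
{
    // Bytes a hash function would see if the string is encoded as UTF-8.
    public static byte[] Utf8Bytes(string s) => Encoding.UTF8.GetBytes(s);

    // Bytes a hash function would see if the string's in-memory UTF-16
    // code units are reinterpreted as raw bytes (native byte order).
    public static byte[] RawUtf16Bytes(string s) =>
        MemoryMarshal.AsBytes(s.AsSpan()).ToArray();

    public static void Main()
    {
        const string input = "abc";

        // UTF-8: one byte per ASCII character, prints 61-62-63.
        Console.WriteLine(BitConverter.ToString(Utf8Bytes(input)));

        // Raw UTF-16: two bytes per character here, so six bytes in total.
        // On a little-endian machine this prints 61-00-62-00-63-00.
        Console.WriteLine(BitConverter.ToString(RawUtf16Bytes(input)));
    }
}
```

Any byte-oriented hash fed these two arrays sees different input, so the resulting hashes will generally not match an implementation that hashes the UTF-8 bytes.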
I was recently refactoring some of my code in my tools and that included replacing an old XXHash implementation with this library. I discovered this strange behavior after like an hour of debugging my whole tooling pipeline and at that point I decided to just code up my own XXHash64 implementation in my XXHash3 library so I don't have to deal with this. |
Well, can you show me, let's say, three examples of C++ libraries that assume your std::string is a UTF-8 string? The thing is that this library shouldn't contain an overload for string at all; it should make you responsible for allocating and converting to a proper byte array for hashing, because this library is about hashing, not about converting one type to another. Another point that strengthens my opinion is that there is no overload method for |
Is it really worth writing yet another library? Why don't you contribute to that project? |
You are correct, it is for hashing, and thus it should follow common hashing conventions. Since there is a public API method for hashing strings, you would assume that if you call the said method and an equivalent one from any other language implementation, you would get the same result. This assumption breaks as soon as you realize that it's trying to hash strings with UTF-16 encoding. Let's see, C++'s Anyhow, the overwhelming majority of software hashes strings as either UTF8 or ASCII, which is pretty much the same thing in this context, so when I call |
This also isn't the only design flaw of this library's API, another major one is the fact that you have to pass in a |
I totally agree with that.
Actually, you have to expect things like that because it is part of the type system in that particular language. Are you okay with the fact that a language like In the end, if you want to hash something you have to feed bytes to the hash function, because bytes are the only thing a hash function works with. Why you are pushing everyone to UTF-8 I don't know; it feels like it is just convenient for you. Moreover, the way that you convert |
Using ComputeHash with a string will yield an "incorrect" hash because the library is casting a string instance into an unsafe char*, which can cause a different hash to be returned depending on the system it's running on.
xxHash/src/Standart.Hash.xxHash/xxHash64.cs
Lines 243 to 254 in 6b20e7f
The official .NET documentation clearly specifies that the default encoding can vary between systems and additionally that the string and char types use UTF-16 internally.
The correct approach here would be to create a stack-allocated Span<byte> and then use Encoding.UTF8.GetBytes.
Additionally, an optional encoding parameter could also be added, with the default being UTF8.
Example:
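The original snippet is not preserved in this copy of the issue. What follows is only a minimal sketch of the approach described above, not the author's code and not the library's actual API. It encodes the string into a stack-allocated Span<byte> (falling back to a heap array for long inputs) with an optional encoding parameter that defaults to UTF-8. StringHashing, HashString, ByteHasher, MaxStackBytes and the FNV-1a stand-in hash are all hypothetical names introduced for illustration; a real fix would pass the bytes to the library's own byte-oriented hashing routine instead.

```csharp
using System;
using System.Text;

// Hypothetical sketch; none of these names come from the library.
public static class StringHashing
{
    // Stand-in for any byte-oriented hash routine.
    public delegate ulong ByteHasher(ReadOnlySpan<byte> data);

    // Largest encoded size we are willing to place on the stack (assumed value).
    private const int MaxStackBytes = 256;

    public static ulong HashString(string text, ByteHasher hasher, Encoding encoding = null)
    {
        // Default to UTF-8 when the caller does not specify an encoding.
        encoding = encoding ?? Encoding.UTF8;

        // Upper bound on the encoded size; use the stack for small inputs only.
        int maxBytes = encoding.GetMaxByteCount(text.Length);
        Span<byte> buffer = maxBytes <= MaxStackBytes
            ? stackalloc byte[MaxStackBytes]
            : new byte[maxBytes];

        // Encode explicitly, then hash exactly the bytes that were written.
        int written = encoding.GetBytes(text.AsSpan(), buffer);
        return hasher(buffer.Slice(0, written));
    }

    // FNV-1a (64-bit), used here only as a simple placeholder hash.
    public static ulong Fnv1a64(ReadOnlySpan<byte> data)
    {
        unchecked
        {
            ulong hash = 14695981039346656037UL;
            foreach (byte b in data)
            {
                hash ^= b;
                hash *= 1099511628211UL;
            }
            return hash;
        }
    }

    public static void Main()
    {
        // Same text, two encodings: the hashed bytes differ, so the hashes differ.
        Console.WriteLine(HashString("abc", Fnv1a64));
        Console.WriteLine(HashString("abc", Fnv1a64, Encoding.Unicode));
    }
}
```

With the encoding made explicit, the result no longer depends on how the runtime lays out string in memory, and callers who really want UTF-16 can still ask for it.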