Incorrect hash when passing `string` into `xxHash64.ComputeHash` #26

Crauzer · 2023-02-03T21:28:42Z

Using ComputeHash with a string will yield an "incorrect" hash because the library is casting a string instance into an unsafe char* which can cause a different hash to be returned depending on the system it's running on.

xxHash/src/Standart.Hash.xxHash/xxHash64.cs

Lines 243 to 254 in 6b20e7f

    
           public static unsafe ulong ComputeHash(string str, uint seed = 0) 
        
           { 
        
               Debug.Assert(str != null); 
        
               fixed (char* c = str) 
        
               { 
        
                   byte* ptr = (byte*) c; 
        
                   int length = str.Length * 2; 
        
                   return UnsafeComputeHash(ptr, length, seed); 
        
               } 
        
           }

The official .NET documentation clearly specifies that the default encoding can very between systems and additionally that the string and char types use UTF-16 internally

The correct approach here would be to create a stack-allocated Span<byte> and then use Encoding.UTF8.GetBytes.
Additionally an optional encoding parameter could also be added, with the default being UTF8.

Example:

The text was updated successfully, but these errors were encountered:

sad1y · 2023-02-13T19:09:35Z

I doubt this is a problem at all, because as you mentioned, char by itself takes up 16 bits, and you have to account for that before claiming it's wrong behavior. Under the hood, UTF-8 and UTF-16 use different memory sizes with different memory layouts, and just because you can treat it as the same object doesn't mean the objects are the same.

Crauzer · 2023-02-13T22:15:38Z

Pretty much every hash algo implementation in existence assumes that a string input is intended to be hashed using UTF8/ASCII so in that sense the behavior of this library makes no sense. At the very least you should be able to specify the encoding you intend to use for the hash, which in 99% of cases is going to be UTF8. There is no reason why you'd ever be hashing strings using UTF16.

Your comment is correct in a generic sense but wrong when you try to apply this thinking for hashing.

Crauzer · 2023-02-13T22:19:53Z

I was recently refactoring some of my code in my tools and that included replacing an old XXHash implementation with this library. I discovered this strange behavior after like an hour of debugging my whole tooling pipeline and at that point I decided to just code up my own XXHash64 implementation in my XXHash3 library so I don't have to deal with this.

sad1y · 2023-02-14T06:33:49Z

Well, can you show me, lets say, three examples of cpp libraries that assume that your std::string is UTF-8 string? The thing is that this library shouldn't contain overload for string at all and make you responsible for allocating, converting proper byte array for hashing, because that library is not about converting one type to another this library is about hashing. Another point which make my opinion stronger is that there is no overload method for stream and that's good because there is a plenty streams that can't be read in the same way, e.g. almost every stream in Kestrel is async-only so you not be able to read it synchronously

sad1y · 2023-02-14T06:34:51Z

I was recently refactoring some of my code in my tools and that included replacing an old XXHash implementation with this library. I discovered this strange behavior after like an hour of debugging my whole tooling pipeline and at that point I decided to just code up my own XXHash64 implementation in my XXHash3 library so I don't have to deal with this.

Is it really worth to yet another library why you don't contribute to that project?

Crauzer · 2023-02-14T10:26:37Z

because that library is not about converting one type to another this library is about hashing

You are correct, it is for hashing, and thus it should follow common hashing conventions. Since there is a public API method for hashing strings, you would assume that if you call the said method and an equivalent one from any other language implementation, you would get the same result, this assumption breaks as soon as you realize that it's trying to hash strings with UTF-16 encoding.

Lets see, C++'s std::string has no fixed encoding, so it can contain any kind of string and it is up to the user to handle that, majority will opt-in for UTF8 or ASCII. Rust's std::String uses UTF-8 encoding internally for example, and I can go on and on...

Anyhow, the overwhelming majority of software hashes strings as either UTF8 or ASCII, which is pretty much the same thing in this context, so when I call xxHash64.ComputeHash("abc") I'm absolutely not expecting to get a hashed UTF-16 string but an ASCII/UTF8 one.

Crauzer · 2023-02-14T10:28:08Z

This also isn't the only design flaw of this library's API, another major one is the fact that you have to pass in a length parameter for Span-based API's, which is total nonsense

sad1y · 2023-02-14T10:52:46Z

This also isn't the only design flaw of this library's API, another major one is the fact that you have to pass in a length parameter for Span-based API's, which is total nonsense

I totally agree with that.

I'm absolutely not expecting to get a hashed UTF-16 string but an ASCII/UTF8 one.

Actually you have to expect things like that because it is a part of the type system in that particular language. Are you okay with the fact that language like cpp don't care about whats the inside? And what about languages like JS and Java which are using utf-16 as coding method. Rust also have concept of unicode character so if you want to make iterator from it and feed it to hash function, well, you have to encode it first somehow.

In the end If you want hash something you have to feed bytes to hash function because bytes is the only thing that hash function working with, why are you push everyone to utf-8 I don't know, it feels like it is just convenient for you.

Moreover that the way that you convert string in your library is inefficient as it could be. Take a look at https://learn.microsoft.com/en-us/dotnet/api/system.text.unicode.utf8?view=net-7.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect hash when passing `string` into `xxHash64.ComputeHash` #26

Incorrect hash when passing `string` into `xxHash64.ComputeHash` #26

Crauzer commented Feb 3, 2023

sad1y commented Feb 13, 2023

Crauzer commented Feb 13, 2023 •

edited

Loading

Crauzer commented Feb 13, 2023

sad1y commented Feb 14, 2023

sad1y commented Feb 14, 2023 •

edited

Loading

Crauzer commented Feb 14, 2023

Crauzer commented Feb 14, 2023

sad1y commented Feb 14, 2023 •

edited

Loading

Incorrect hash when passing string into xxHash64.ComputeHash #26

Incorrect hash when passing string into xxHash64.ComputeHash #26

Comments

Crauzer commented Feb 3, 2023

sad1y commented Feb 13, 2023

Crauzer commented Feb 13, 2023 • edited Loading

Crauzer commented Feb 13, 2023

sad1y commented Feb 14, 2023

sad1y commented Feb 14, 2023 • edited Loading

Crauzer commented Feb 14, 2023

Crauzer commented Feb 14, 2023

sad1y commented Feb 14, 2023 • edited Loading

Incorrect hash when passing `string` into `xxHash64.ComputeHash` #26

Incorrect hash when passing `string` into `xxHash64.ComputeHash` #26

Crauzer commented Feb 13, 2023 •

edited

Loading

sad1y commented Feb 14, 2023 •

edited

Loading

sad1y commented Feb 14, 2023 •

edited

Loading