Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guid conversion doesn't take into account Microsoft's mixed-endian encoding #15

Open
kdawg14 opened this issue Feb 10, 2021 · 8 comments
Assignees

Comments

@kdawg14
Copy link

kdawg14 commented Feb 10, 2021

Hey, thanks for providing this library; it's been very useful to me.

I happened to notice that the Guids that were converted from certain Ulids (and vice-versa) didn't maintain the proper lexicographical sort order that I would have expected. I believe that this is because of Microsoft's screwy way of encoding the byte arrays (https://stackoverflow.com/a/16722909).

I forked and made a commit that appears to fix this at
kdawg14@c5ae7c8

I'm not certain exactly how you like to handle pull requests, or if this was something that was intentionally done the way that it was, so I figured I'd post this here and see if this looks like a good correction or if you maybe have any additional information on the subject.

Thanks!

@RobThree
Copy link
Owner

RobThree commented Feb 10, 2021

I'm not sure I agree. What MS does when representing GUID's as a string is irrelevant I think. There's no 'defined' way to convert a ULID to a GUID and vice versa - as long as it's consistent. Nowhere is it guaranteed that the GUID and ULID lexicographicallity is the same before/after conversion from one to the other.

If I understand correctly you're basing your findings on the lexicographical order of the string ("hex") representation of a GUID, but who's to say guid 00000000-0001-0000-0000-000000000000 comes before/after 00000000-0002-0000-0000-000000000000? It may, from a string representation point of view, but it may not from a binary point of view. And most systems (like RDBMS'es) will sort on the binary representation (assuming the GUID's are stored correctly and not as varchar/string/whatever).

So I'm not sure we should change this, but I'm open to discussion.

Having said that: I didn't know/realize MS did such hocus-pocus on the first few bytes, so, either way: Today I learned.

@RobThree RobThree self-assigned this Feb 10, 2021
@kdawg14
Copy link
Author

kdawg14 commented Feb 10, 2021

Thanks for the quick response. I'm not certain that I agree either, so thanks for the feedback.

However, I don't think that it's just the hex string representation of the GUID that's the issue, though that does show the problem the same way that the binary form would/does. I didn't do a close inspection of the binary, but I assume that the Guid.CompareTo method used in the tests is operating off that, and it shows results consistent with the lexicographical ordering of the hex string.

My understanding (which mostly consists of reading the StackOverflow and Wikipedia pages that I linked to in the comments, as well as the included tests, so it's likely far from perfect) is that when serializing/deserializing GUIDs, you'd need to correct the endianness in the source or result byte arrays to maintain the UUID as it would normally be represented in serialized form.

Since the ULID spec is a subset of the UUID spec, it seems reasonable to expect that it could be converted between different representations of a UUID while maintaining the same sort order, but I admit that is an assumption that I don't know is definitely true.

Thanks again!

@RobThree
Copy link
Owner

Since the ULID spec is a subset of the UUID spec

It's not?

@kdawg14
Copy link
Author

kdawg14 commented Feb 10, 2021

I guess more accurately, I meant that it has 128-bit compatibility with UUID, and so converting to/from ULID from/to another 128-bit UUID should be able to maintain the same sortability, unless there are other considerations that I'm not thinking of.

@WorldMaker
Copy link

https://github.com/rmalayter/ulid-mssql points out some reasons to follow the odd Microsoft byte order when converting to GUIDs for storage in MS SQL Server: the sort order in database queries matches the ULID sort order and possibly more importantly the sort order in indexes matches the ULID sort order. In microbenchmarks the "properly" sorted GUID versions of ULIDs result in better clustered indexes than baseline (v6) GUIDs.

Interoperability with ulid-mssql and C# code would be potentially extremely useful to me.

@RobThree
Copy link
Owner

RobThree commented Apr 22, 2021

One issue I have with changing this is that it breaks the current implementation. Whatever .ToGUID() returns will change as will creating the ULID's from GUID's. If anything I thing we should consider creating an overload with an enum that specifies which behaviour we want ("old" vs "new") and default to old behaviour.

Second thing I would change is the 3 calls to Array.Reverse(); I'm sure simply returning the bytes in the correct order like this performs much better:

private static byte[] FixGuidByteArrayEndianness(byte[] bytes)
{
    return new[] { bytes[3], bytes[2], bytes[1], bytes[0], bytes[5], bytes[4], ... } // Or whatever the order is...
}

I'll sit on it for a day or two and get back to this.

@WorldMaker
Copy link

I was reminded that Raymond Chen has a great overview of the multiple ways Microsoft sorts GUIDs: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=102450

One interesting bit there is the reminder that System.Data.SqlGuid exists as a separate struct. https://docs.microsoft.com/en-us/dotnet/api/system.data.sqltypes.sqlguid?view=net-5.0

Instead of an enum to define the two behaviors, you could use SqlGuid versus Guid. I'm not sure if there might be pitfalls there, but it might get the idea across that the behavior difference is primarily because of the difference between general use Guids and ones intended for SQL Server.

@WorldMaker
Copy link

I took a first stab at it as extension methods to explore the idea, especially with respect to EF Core:

    public static class UlidExtensions
    {
        public static SqlGuid ToSqlGuid(this Ulid ulid)
        {
            var bytes = ulid.ToByteArray();
            return new SqlGuid(new byte[] { bytes[6], bytes[7], bytes[8], bytes[9], bytes[10], bytes[11], bytes[12], bytes[13], bytes[14], bytes[15], bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5] });
        }

        public static Ulid ToUlid(this SqlGuid guid)
        {
            var bytes = guid.ToByteArray();
            return new Ulid(new byte[] { bytes[10], bytes[11], bytes[12], bytes[13], bytes[14], bytes[15], bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5], bytes[6], bytes[7], bytes[8], bytes[9] });
        }
    }

Apparently EF Core doesn't support SqlGuid as a data type, so in the EF Core Value Converter I have to ulid.ToSqlGuid().Value and new SqlGuid(guid).ToUlid(), so I'm not sure using the SqlGuid type as an intermediary is as strong of an idea as it first seemed now that I have tested it against EF Core. I haven't yet double checked that it round-trips with ulid.sql, but that's next for me to double check that I got the byte reordering correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants