Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return binary git paths, not potentially invalid UTF8 #936

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Commits on Aug 17, 2022

  1. Return binary git paths, not potentially invalid UTF8

    Git paths have no inherent encoding, they are opaque binary strings that
    can be in any encoding. The macro `rb_str_new_utf8` is used in places to
    convert raw C strings representing paths (and other things) into ruby
    strings tagged with UTF8 encoding.
    
    This is not safe, since the macro simply copies bytes and tags the
    string with the specified encoding; it does not do any validity checking
    or transcoding. Thus it can easily create an invalid string (i.e. one
    for which `String#valid_encoding?` is false) if the repo contains files
    whose paths are multibyte strings in encodings other than UTF8. These
    strings are poisoned and difficult to work with: they can't be compared
    safely because of the semantics of ruby strings and they often can't be
    concatenated to a larger output buffer for display (which will attempt
    to transcode to the output buffer's native encoding, usually UTF8). The
    comparison issue hit us a few times at GitHub.
    
    More detail: in ruby, string equality for multibyte strings is defined
    as bytewise equality *plus* encoding. For convenience, this is
    unfortunately not enforced for ASCII strings, so it often becomes a bomb
    that only gets triggered when you pass multibyte data through your
    application.
    
    So even though a binary-encoded "hello" is equal to the UTF8 equivalent,
    
        "\x68\x65\x6C\x6C\x6F".b.bytes == "hello".bytes
        # => true
        "\x68\x65\x6C\x6C\x6F".b == "hello"
        # => true
    
    the same is not true for non-ASCII text:
    
        "\xE6\x97\xA5\xE6\x9C\xAC".b.bytes == "日本".bytes
        # => true
        "\xE6\x^C\xA5\xE6\x9C\xAC".b == "日本"
        # => false
    
    One possible fix is to transcode to UTF-8. But this is a bad idea since
    the domain of possible git paths includes many binary strings that are
    not representable in UTF8 encoding. Better to acknowledge reality and
    use an encoding which matches the actual characteristics of this data so
    clients can handle it how they choose.
    
    Note that this PR could go farther and do the same thing for some other
    uses of this macro which are similarly invalid (refs), but I've opted to
    keep the fix relatively narrow and focus on paths.
    brasic committed Aug 17, 2022
    Configuration menu
    Copy the full SHA
    5866ba1 View commit details
    Browse the repository at this point in the history