Improve performance of ncodeunits(::Char) (JuliaLang#54001)
This improves the performance of `ncodeunits(::Char)` by counting the
number of non-zero bytes in the `Char`, with a special case for `\0`,
which is encoded as all-zero bytes. For a performance comparison, see
[this gist](https://gist.github.com/Seelengrab/ebb02d4b8d754700c2869de8daf88cad):
collections of `Char` see up to a 10x improvement, and a single `Char`
sees a minor improvement (with much smaller spread). The version in this
PR is called `nbytesencoded` in the benchmarks.
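
To make the byte-counting idea concrete, here is a minimal sketch of what it relies on: when a `Char` is reinterpreted as a `UInt32`, its UTF-8 bytes sit at the most significant end and the rest is zero padding. The helper name `nbytes_sketch` is hypothetical and used only for illustration; it mirrors the approach described above, not necessarily the exact code in the gist.

```julia
# Hypothetical helper for illustration: count the encoded bytes of a Char
# by subtracting the number of trailing all-zero bytes from 4.
function nbytes_sketch(c::Char)
    u = reinterpret(UInt32, c)
    n = sizeof(UInt32) - div(trailing_zeros(u), 8)
    # '\0' is all zero bits, so n is 0 here, yet it still occupies one byte
    return n + iszero(u)
end

# Print the raw bit pattern next to the computed byte count
for c in ('a', 'é', '€', '😀', '\0')
    u = reinterpret(UInt32, c)
    println(repr(c), " => 0x", string(u, base = 16, pad = 8),
            " => ", nbytes_sketch(c), " byte(s)")
end
```

For these five characters the counts should match `ncodeunits(string(c))`, which is the behaviour the docstring promises.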

Correctness has been verified with Supposition.jl, using the existing
implementation as an oracle:

```julia
julia> using Supposition

julia> const chars = Data.Characters()

julia> @check max_examples=1_000_000 function bytesenc(i=Data.Integers{UInt32}())
           c = reinterpret(Char, i)
           ncodeunits(c) == nbytesdiv(c)
       end;
Test Summary: | Pass  Total  Time
bytesenc      |    1      1  1.0s

julia> ncodeunits('\0') == nbytesencoded('\0')
true
```

Let's see if CI agrees!

Notably, neither the existing nor the new implementation checks whether
the given `Char` is valid, since the only thing that matters is
how many bytes are written out.
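
As a small illustration of that point, the sketch below builds an invalid `Char` from a raw bit pattern and compares the byte count against `write(devnull, c)`, the previous approach. The helper `count_bytes` is a hypothetical stand-in for the trailing-zero-byte computation, and the bit pattern `0xff000000` is just one assumed example of a malformed character.

```julia
# Hypothetical helper mirroring the trailing-zero-byte approach.
function count_bytes(c::Char)
    u = reinterpret(UInt32, c)
    return sizeof(UInt32) - div(trailing_zeros(u), 8) + iszero(u)
end

# 0xff never appears in well-formed UTF-8, so this Char is invalid.
bad = reinterpret(Char, 0xff000000)

@show isvalid(bad)          # validity is a separate question
@show count_bytes(bad)      # bytes occupied by the bit pattern
@show write(devnull, bad)   # bytes actually written out
```

Both counts should come out as 1 for this input even though `isvalid(bad)` is `false`: the count depends only on the bit pattern, not on whether it decodes to a valid code point.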

---------

Co-authored-by: Sukera <[email protected]>
Seelengrab and Seelengrab authored Apr 9, 2024
1 parent f870ea0 commit d183ee1
Showing 1 changed file with 8 additions and 1 deletion.
9 changes: 8 additions & 1 deletion base/char.jl
@@ -62,7 +62,14 @@ to an output stream, or `ncodeunits(string(c))` but computed efficiently.
This method requires at least Julia 1.1. In Julia 1.0 consider
using `ncodeunits(string(c))`.
"""
-ncodeunits(c::Char) = write(devnull, c) # this is surprisingly efficient
+function ncodeunits(c::Char)
+    u = reinterpret(UInt32, c)
+    # We care about how many trailing bytes are all zero
+    # subtract that from the total number of bytes
+    n_nonzero_bytes = sizeof(UInt32) - div(trailing_zeros(u), 0x8)
+    # Take care of '\0', which has an all-zero bitpattern
+    n_nonzero_bytes + iszero(u)
+end

"""
codepoint(c::AbstractChar) -> Integer