Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

searchsorted (and variants) do not check equality? #44102

Closed
mtanneau opened this issue Feb 10, 2022 · 8 comments · Fixed by #48387
Closed

searchsorted (and variants) do not check equality? #44102

mtanneau opened this issue Feb 10, 2022 · 8 comments · Fixed by #48387
Labels
docs This change adds or pertains to documentation search & find The find* family of functions

Comments

@mtanneau
Copy link

mtanneau commented Feb 10, 2022

The following behaviors have been causing bugs in some code of mine:

searchsortedlast([0.0, 1.0], +0.0)  # returns 1
searchsortedlast([0.0, 1.0], -0.0)  # returns 0

searchsorted([0.0], +0.0)  # returns 1:1
searchsorted([0.0], -0.0)  # returns 1:0

searchsorted([0.0], 0.0; lt=<=)  # returns 2:1

In all cases I encountered, I could get away with it by using searchsortedXXXX(args... ; lt=<).

However, assuming the above behavior is what's intended, I believe there's an inconsistency in the docs.
For instance, for searchsorted:

searchsorted(a, x; by=<transform>, lt=<comparison>, rev=false)

Return the range of indices of a which compare as equal to x (using binary search) according to
the order specified by the by, lt and rev keywords, assuming that a is already sorted in that order.
Return an empty range located at the insertion point if a does not contain values equal to x.

It's hard to believe that [0.0] does not contain any value equal to 0.0 😅

If it's a bug, I probably can't fix it, but if it's intended and clarifying the docs would be satisfactory, then I'm happy to do so.

@Moelf
Copy link
Contributor

Moelf commented Feb 10, 2022

searchsortedlast([0.0, 1.0], -0.0) # returns 0

just for posterity, the reason for this is that isless() over IEEE floats make a total order set, where < only forms a partial order:

julia> isless(-0.0, 0.0)
true

julia> -0.0 < 0.0
false

Clearly, we need total order for sorting an array otherwise we get an indeterministic result back.


searchsorted([0.0], 0.0; lt=<=) # returns 2:1
It's hard to believe that [0.0] does not contain any value equal to 0.0 sweat_smile

as I have said in Slack, <= is literally not "is less" in any sense:

X is less than X should always be false, but it's true for <=, which is why it messed up the result

@mtanneau
Copy link
Author

X is less than X should always be false

Did you mean "strict" order?
Total orders are reflexive, so if <= is a total order, you should have x <= x forall x.

In any case: when I read "which compare as equal to x" in the docs, I expect 0.0 to compare as equal to 0.0.
I had a similar confusion with "less than or equal to" and "greater than or equal to" in the docs of searchsortedlast and searchsortedfirst, respectively.

All I'm saying is that, if there's room to make the docs a little clearer, it may avoid other people running into a similar issue.

@nalimilan
Copy link
Member

I agree with @mtanneau that the mention of "or equal" in the docstring is incredibly confusing. This is made even worse by the fact that the default value for lt (i.e. isless) isn't even shown in the signature.

IIUC "less than or equal" actually means "not greater than according to lt", right? This should be mentioned IMO, together with the fact that this relies on a total order. We could also show an example with -0.0.

@mtanneau Would you feel like making a PR to propose clarifications?

This has been reported before at #35179. It also caused a bug in the StatsBase implementation of histograms (JuliaStats/StatsBase.jl#766).

@nalimilan nalimilan added docs This change adds or pertains to documentation search & find The find* family of functions labels Feb 20, 2022
@Moelf
Copy link
Contributor

Moelf commented Feb 20, 2022

Did you mean "strict" order?

I just mean it semantically, what ever "less than" definition you use (plug in different lt), lt(a, a) should always return false -- a number simply cannot be less than itself

But if you use <= as your lt, it fails this check, using <= as lt does not make sense to me, so personally I don't care what searchsortedXXXX does when lt = <=, it shouldn't be used as lt by any rate?

@petvana
Copy link
Member

petvana commented Feb 20, 2022

It seems to be related to (kind of duplicate of) #11429. The behavior is undefined for ill-behaved user-specified comparisons, and Julia may even crash during sorting.

@mtanneau
Copy link
Author

@mtanneau Would you feel like making a PR to propose clarifications?

Yes, I'll try and come up with something this week

@StefanKarpinski
Copy link
Member

I think that checking equality is a good idea.

@aplavin
Copy link
Contributor

aplavin commented Feb 5, 2023

This also came up in JuliaMath/IntervalSets.jl#111. The lt=< solution is only partial: it makes searchsorted much slower for ranges:

# doesn't find -0.0, but fast:
julia> @btime searchsorted(-1000.0:1:1000, -0.0)
  2.310 ns (0 allocations: 0 bytes)
1001:1000

# finds -0.0, but much slower:
julia> @btime searchsorted(-1000.0:1:1000, -0.0; lt=<)
  87.270 ns (0 allocations: 0 bytes)
1001:1001

Would be nice to check equality by default, or make the lt=< case as fast.

vtjnash pushed a commit that referenced this issue Sep 30, 2023
No behavior change, just a performance improvement: searchsorted(range,
lt=<) now performs exactly as fast as searchsorted(range).

Passing lt=< allows searchsorted to find floating point values such as
-0.0, so it's useful to make it fast - see
#44102 (comment).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs This change adds or pertains to documentation search & find The find* family of functions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants