-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: slicing projection with one-sided slices #479
Conversation
Just for clarification, it should be capped at size + 1? Should it be capped at -1 too? And should the capping happen in the Python layer, or in the C++ wrapper layer? In the C++ wrapper would keep the various UHI locators much simpler. |
Indices do not have to be capped, they are capped automatically in |
Ideally, the value to index conversion should happen in C++, so that it is guaranteed to be consistent. I think the current implementation always converts the value to an index before passing it to |
Yes, it does - so (this) problem is that |
Actually, no, this is ideal, I think.
|
Remembering more about the details: The problem is not what |
Sorry, I realize that you are talking about indices here, while I was talking about values.
Ok
That's apparently not how it is currently implemented, because I get the behavior I expected when I pass indices directly. I only see an inconsistency when I pass values instead of indices. Nevertheless, the spec should be updated then. |
We said at some point that a value range specified should select all bins which overlap with that value range. That should include the flow bins. |
It's not as clear as I thought, since it only refers to standard indexing: https://boost-histogram.readthedocs.io/en/latest/usage/indexing.html#slicing But triggering an infinitely wide bin by accident seems quite dangerous, especially when you can't do this accidentally by integer indexing (-1 is invalid), but you can via This is only an issue for slicing (or selecting), since all other operations just bring flow bins along for the ride. Either way, the current inconsistent behavior is wrong. :) |
The rule is: if the bin overlaps with the value range specified, it should be included. That also goes for bins which are infinitely wide and I don't think that this is anything to worry about. You can make sure that the overflow bin is excluded by picking the upper value inside the last bin of the axis up to the upper edge of that bin. |
Think about it carefully, making the best decision for Python users, then let me know. I worry if someone selects an upper edge of |
Everywhere else in boost-histogram the flow bins are either hidden (carried along), natural (sum over None endpoints includes everything, like |
It has to be documented properly, of course, but it is important to have simple rules that people can remember. The flow bins are special in that you normally don't see them, but they should be handled just like other bins as much as possible (from a logical point of view and also from a programming point of view). Note that what you worry about is only a problem when there are actually entries in the flow bins. It is hard to say whether the user wants to keep those or not when he/she uses an upper value beyond the axis range. Maybe keeping the overflow is exactly the intention? My mind is set on applying rules consistently that are as simple as possible, so behavior remains easy to remember. |
Note: I need to move fast with the patch in boostorg/histogram to get this still included in the upcoming release in two days, so there is some pressure. But since the feature was broken so far anyway, we actually have the opportunity to select a different behavior if we want. So you would prefer including the flow bins when values are passed only if the value passed is infinite, right? |
This is perfectly fine with Boost.Histogram, I'm just looking at the UHI slicing syntax when reducing. Let's pick the include syntax for now, but revisit it to see if it's more consistent/user friendly to make overflow/underflow always explicit when reducing via UHI reducer before a boost-histogram release. |
Yes, so these would be the only ways to apply a reducer to an overflow bin: h[start:None:sum] # More usually written as h[start::sum]
h[start:bh.loc(infty):sum] # Hadn't include this originally, but it could also make sense? At least that is what I think the original intention was, and I think it would be more explicit? |
I like the simple consistent rule, but I do see your point and I worry about one thing: I don't remember whether the current implementation of regular axis is consistent so that |
For rules, Also, this, and especially generalized reducers, |
I cannot say much about this, because it seems to address other points. Can we stick to the topic how |
I was just comparing to the existing rules to see if there was an inconstancy if we apply this explicit endpoint rule for reducers. I don't think so. Also this does include So the only question is for |
Naturally, With |
And they currently do. In fact, there is no difference in This affects two sided crops too, by the way, if |
My point was Edit on last comment: You want to introduce a difference for the upper boundary. I want to remove the distinction for the lower boundary. The current situation is of course an inconstant mess. I am assuming that whatever you decide, it will be applied to both endpoints consistently, and I just sometimes select one endpoint to discuss for clarity. |
Not sure if you are referring to my original statement that slices select bin ranges that overlap with (a,b) or something else. I am currently listening to some talks, but can we discuss this later via Zoom? |
As far as I can see, either way nothing changes for |
Collecting the discussion, I think the only thing we are considering changing is the following behavior. (This was broken before, and currently inconsistent, so "changing" might be the wrong term, "defining" might be better). h = bh.Histogram(bh.axis.Regular(1,1,2))
h.view(flow=True)[...] = 1
# Explicit endpoints
3 == h[::sum]
2 == h[:bh.loc(2.00001):sum]
2 == h[bh.loc(.99999)::sum]
1 == h[bh.loc(.99999):bh.loc(2.00001):sum]
1 == h[0:len:sum]
h[at(-1):]
h[at(-1)::sum]
h[:at(1):sum]
h[:at(2):sum]
# Possible benefit:
# h[a:b:sum] is finite width ^
# h[0:bh.loc(x):sum] never "jumps" to flow ^
# If edges are not exact, could be an issue?
h[loc(-1)]
# Flow inclusive endpoints
3 == h[::sum]
3 == h[:bh.loc(2.00001):sum]
3 == h[bh.loc(.99999)::sum]
3 == h[bh.loc(.99999):bh.loc(2.00001):sum]
1 == h[0:len:sum]
You can replace an empty endpoint with None (syntactically equivalent) or possibly Note that if you were to do this in two steps, doing the slice first before the reduction, there is no difference if including an a point in the edge bin vs. in the flow bin in the "normal" slicing. 1 == h[bh.loc(.99999):bh.loc(2.00001)][0:len:sum]
1 == h[bh.loc(1):bh.loc(2)][0:len:sum]
3 == h[bh.loc(.99999):bh.loc(2.00001)][::sum]
3 == h[bh.loc(1):bh.loc(2)][::sum] A few thoughts: Currently, UHI locators are simple callables that take an axes and return a boost-histogram-style index ( |
Outcome of meeting:
|
Does that sound useful? PS: In the current implementation, you just aren't supposed to use any non-None indexer for a reduction slice endpoint if you want flow, we could also adhere to that for all built-in indexers. But it would be nice to either be able to support this or throw an error, rather than doing the rather unintuitive thing and working but not including overflow. Also, |
Yes and also yes to the thoughts in the PS. |
How do we go about with this PR? @henryiii could you finish the work by committing to the branch that I opened? I think you know best how to implement the changes to |
Sure, I can do that. Fun fact: we need the extra information as described above even if we wanted to include endpoints by default or remove them by default - the reason that currently |
Also, this affects I remember describing the first behavior in PyHEP 2019, and then I think @HDembinski decided on the second behavior for Boost.Histogram, but since this part is controlled by normal indexes (since UHI converts all values to indexes, and then boost-histogram uniformly passes around indices behind the scenes), this behavior inherits the natural Python index behavior and never changed (technically from Boost.Histogram bin based indexing, which is also one past the end, like Python). It was unable to change - you have to detect which part of the slice you are in in order to add 1 to the value of loc. This would be a major, breaking change, all code using bh.loc as an endpoint to a slice would change. And might require some tricky behavior based on being exactly on a floating point edge. To me, this is significantly more natural in Python (as evidenced by the fact it couldn't even be done before); slicing is very well defined in Python (unlike C++, unless you count valarray's std::slice, which is also like Python) - but there really isn't a clear definition for what happens for floating point slices, which don't exist in Python, so it's only a generalization. On the other hand, including the bin is harder to make by hand with a locator, so if we only provide one, providing the harder to make one might be better. Also, from a practical standpoint, (more?) often you do want the bin that contains the endpoint - in PyHEP 2019, I recommended using |
Slicing with indices works in C++ in the same way as in Python. In a range a:b, b is not included. In Boost.Histogram that does not matter at all, because I don't use a construct like bh.loc. You pass a slice with the slice object, which then has normal slicing behavior. A slice can only use indices. To use values you need the crop or shrink object, which only accept values. The two concepts are well separated and there is no problem that one includes the bin that contains the value and the other does not. The design in Pyhton has become awkward because you force both values and indices to go through a Python slice. The behavior of the slice simply does not generalize to this case. |
Note that Jim's rule was born out of what the user most likely needs, we should not forget that. I agree that for consistency, bh.loc should not automatically add +1 when it is on the upper edge, because that is awkward. Instead, we should offer an alternative object that you can use instead of the slice, which keeps the bins that overlap. |
This is part of UHI, the locators control the conversion to integers, giving this power to users very powerful. We also can't directly use Boost.Histogram's implementation anyway, because it doesn't support mixing endpoint types; there is no way to do The question here is do we keep the current simpler behavior or convert to the more like Boost.Histogram's behavior? |
Okay, that's much less disruptive. And I do think it tends to be a common need, it's often mentally simpler to request a range that is inclusive when using data coords instead of bins, hence the immediate request for a +1. What would you like to call it? |
|
My proposal is to have another object for doing a slice in values. I don't care that you cannot mix values and integers then, because I don't believe that's what people want. For this PR, it does not matter how we solve that in future. Let's focus on this PR. For this PR, we do as you suggest, bh.loc gives you the index that corresponds to the value. If you want to include the upper bin you do bh.loc(value) + 1. |
In other words, I was happy with the result of our Zoom call, let's just implement that and worry about other ways of slicing later. The only thing important to me is that [:bh.loc(value):sum] includes the underflow bin and [bh.loc(value)::sum] includes the overflow bin, but that is still the case, as far as I can see. |
I wrote up what I had in mind for a value_slice here #482, so that we can focus on this PR, which is strictly about the behavior of bh.loc. |
I just was afraid that you thought the behavior of loc was different in the Zoom call (I didn't really think of it then, either). No problem, was not going to block or make too many changes in one PR anyway. I'll get to this late (the actual point of the PR, the one we talked about) today or tomorrow. |
This looks good, will merge. I'll make a followup PR to add the smarter UHI support, which will allow locators to detect which part of a slice they are in, allowing much more powerful locators to be written by users. |
Fixes #478
The core bug is fixed in boostorg/histogram, but there still seems to be a bug when
bh.loc
is used and the value is past the upper end of the axis. The computed index then should be size + 1 to make the crop work correctly.