Bug: Histogram2d result does not match that of the numpy function #10

jonathanishhorowicz · 2017-08-04T12:35:13Z

Thank you for this package, it is indeed very fast and I am looking forward to using it.

There is a discrepancy between your histogram2d function and the numpy equivalent, which is demonstrated below:

import numpy as np
from fast_histogram import histogram2d as fast_hist_2d

n_bins = 5	

x = np.array([3,2,5,4,6,2,3,5,3,2,5,3,6,7,3])
y = np.array([1,6,3,2,5,3,1,6,4,6,4,2,4,5,3])

xmin, xmax = np.min(x), np.max(x)
ymin, ymax = np.min(y), np.max(y)

h_fast = fast_hist_2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])


h_np = np.histogram2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])[0]

print "h_fast =\n", h_fast
print "\nh_np =\n", h_np

On my machine this outputs

h_fast =
[[ 0. 0. 1. 0. 0.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 0.]
[ 0. 0. 0. 1. 1.]]

h_np =
[[ 0. 0. 1. 0. 2.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 2.]]

The fast_histogram result only includes 11 of the 15 points. The 4 missing points are all from the final column

astrofrog · 2017-08-04T20:46:19Z

Thanks for reporting this - I think this is a literal 'edge' case where I think we must be treating points that lie on the exact edge differently, but I agree it'd probably be good to make them consistent.

astrofrog · 2017-08-04T20:51:50Z

The relevant line is here: https://github.com/astrofrog/fast-histogram/blob/master/fast_histogram/_histogram_core.c#L186 - specifically here we use < instead of <= for the upper bound, otherwise the index calculation returns an out of bounds index. We could change < to <= and then special case the case where the index needs to be reduced by 1, though this would add an additional 'if' statement in the loop which could further slow things down. I'll try and investigate early next week, but in the mean time the solution is to not use exactly the same value for xmax/ymax as some of the values in the arrays being binned (for example one could add a tiny value to each of the upper limits)

jonathanishhorowicz · 2017-08-06T16:09:24Z

Yes, adding a small value to the upper limits for x and y does solve the problem for this example (and many others that I tested).

Thank you for your quick response!

raphaelphys · 2017-10-30T20:32:50Z

Hi,

One comment: your problem of discrepancies seems to be solved if you add a small value, for example, 0.01 to range that is shared by both np.histogram and the fast_histogram. For example as in,

xmin, xmax = np.min(x), np.max(x) + 0.01 
ymin, ymax = np.min(y), np.max(y) + 0.01

This is shared by both the functions. However, I was trying to test by not modifying anything in the numpy side, that is by doing:

h_fast = fast_hist_2d(x,y,
	range=[[xmin, xmax + 0.01], [ymin, ymax + 0.01]],
	bins = [n_bins,n_bins])          

h_np = np.histogram2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])[0]

The result is not the same as shown below. Why do we have to pass the modified range to np.histogram as well?

h_fast =
[[ 3. 2. 1. 0. 2.]
[ 1. 0. 0. 0. 0.]
[ 0. 1. 1. 0. 1.]
[ 0. 0. 1. 1. 0.]
[ 0. 0. 0. 1. 0.]]

h_np =
[[ 0. 0. 1. 0. 2.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 2.]]

nr7s · 2019-03-13T13:57:41Z

Although increasing the range can act as a quick fix to the problem it doesn't really fix the problem.

Changing the range will affect the calculated edges (bin intervals), so numpy and fast_histogram will, in those examples return the same counts for each bin but with different edges (even though fast_histogram doesn't return the edges).
Depending on how much you increase the range and on the precision of your values this may still alter the bin counts.

atrettin · 2021-09-29T21:00:51Z

I believe that this issue can be closed now. We have decided to purposefully break with numpy on this particular point because the additional if statement to check for the edge case slows everything down by 20%. For the vast majority of use cases, the additional performance will be more appreciated than the adherence to numpy's convention.

astrofrog · 2021-09-29T21:28:47Z

I agree! Before we close this I think we might want to add an entry to the FAQ.

atrettin mentioned this issue Aug 5, 2021

Make behavior at upper bin edge consistent with between 1D, 2D and DD #55

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: Histogram2d result does not match that of the numpy function #10

Bug: Histogram2d result does not match that of the numpy function #10

jonathanishhorowicz commented Aug 4, 2017

astrofrog commented Aug 4, 2017

astrofrog commented Aug 4, 2017

jonathanishhorowicz commented Aug 6, 2017

raphaelphys commented Oct 30, 2017 •

edited

Loading

nr7s commented Mar 13, 2019

atrettin commented Sep 29, 2021

astrofrog commented Sep 29, 2021

Bug: Histogram2d result does not match that of the numpy function #10

Bug: Histogram2d result does not match that of the numpy function #10

Comments

jonathanishhorowicz commented Aug 4, 2017

astrofrog commented Aug 4, 2017

astrofrog commented Aug 4, 2017

jonathanishhorowicz commented Aug 6, 2017

raphaelphys commented Oct 30, 2017 • edited Loading

nr7s commented Mar 13, 2019

atrettin commented Sep 29, 2021

astrofrog commented Sep 29, 2021

raphaelphys commented Oct 30, 2017 •

edited

Loading