Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Histogram2d result does not match that of the numpy function #10

Open
jonathanishhorowicz opened this issue Aug 4, 2017 · 7 comments

Comments

@jonathanishhorowicz
Copy link

Thank you for this package, it is indeed very fast and I am looking forward to using it.

There is a discrepancy between your histogram2d function and the numpy equivalent, which is demonstrated below:

import numpy as np
from fast_histogram import histogram2d as fast_hist_2d

n_bins = 5	

x = np.array([3,2,5,4,6,2,3,5,3,2,5,3,6,7,3])
y = np.array([1,6,3,2,5,3,1,6,4,6,4,2,4,5,3])

xmin, xmax = np.min(x), np.max(x)
ymin, ymax = np.min(y), np.max(y)

h_fast = fast_hist_2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])


h_np = np.histogram2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])[0]

print "h_fast =\n", h_fast
print "\nh_np =\n", h_np

On my machine this outputs

h_fast =
[[ 0. 0. 1. 0. 0.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 0.]
[ 0. 0. 0. 1. 1.]]

h_np =
[[ 0. 0. 1. 0. 2.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 2.]]

The fast_histogram result only includes 11 of the 15 points. The 4 missing points are all from the final column

@astrofrog
Copy link
Owner

Thanks for reporting this - I think this is a literal 'edge' case where I think we must be treating points that lie on the exact edge differently, but I agree it'd probably be good to make them consistent.

@astrofrog
Copy link
Owner

The relevant line is here: https://github.com/astrofrog/fast-histogram/blob/master/fast_histogram/_histogram_core.c#L186 - specifically here we use < instead of <= for the upper bound, otherwise the index calculation returns an out of bounds index. We could change < to <= and then special case the case where the index needs to be reduced by 1, though this would add an additional 'if' statement in the loop which could further slow things down. I'll try and investigate early next week, but in the mean time the solution is to not use exactly the same value for xmax/ymax as some of the values in the arrays being binned (for example one could add a tiny value to each of the upper limits)

@jonathanishhorowicz
Copy link
Author

Yes, adding a small value to the upper limits for x and y does solve the problem for this example (and many others that I tested).

Thank you for your quick response!

@raphaelphys
Copy link

raphaelphys commented Oct 30, 2017

Hi,

One comment: your problem of discrepancies seems to be solved if you add a small value, for example, 0.01 to range that is shared by both np.histogram and the fast_histogram. For example as in,

xmin, xmax = np.min(x), np.max(x) + 0.01 
ymin, ymax = np.min(y), np.max(y) + 0.01

This is shared by both the functions. However, I was trying to test by not modifying anything in the numpy side, that is by doing:

h_fast = fast_hist_2d(x,y,
	range=[[xmin, xmax + 0.01], [ymin, ymax + 0.01]],
	bins = [n_bins,n_bins])          

h_np = np.histogram2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])[0] 

The result is not the same as shown below. Why do we have to pass the modified range to np.histogram as well?

h_fast =
[[ 3. 2. 1. 0. 2.]
[ 1. 0. 0. 0. 0.]
[ 0. 1. 1. 0. 1.]
[ 0. 0. 1. 1. 0.]
[ 0. 0. 0. 1. 0.]]

h_np =
[[ 0. 0. 1. 0. 2.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 2.]]

@nr7s
Copy link

nr7s commented Mar 13, 2019

Although increasing the range can act as a quick fix to the problem it doesn't really fix the problem.

Changing the range will affect the calculated edges (bin intervals), so numpy and fast_histogram will, in those examples return the same counts for each bin but with different edges (even though fast_histogram doesn't return the edges).
Depending on how much you increase the range and on the precision of your values this may still alter the bin counts.

@atrettin
Copy link
Contributor

I believe that this issue can be closed now. We have decided to purposefully break with numpy on this particular point because the additional if statement to check for the edge case slows everything down by 20%. For the vast majority of use cases, the additional performance will be more appreciated than the adherence to numpy's convention.

@astrofrog
Copy link
Owner

I agree! Before we close this I think we might want to add an entry to the FAQ.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants