Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

choice of returned histogram datatype #5

Open
manodeep opened this issue Jul 19, 2017 · 9 comments
Open

choice of returned histogram datatype #5

manodeep opened this issue Jul 19, 2017 · 9 comments

Comments

@manodeep
Copy link
Contributor

Thanks for writing this fast histogram package. A task that we astronomers tend to do often and repeatedly :)

Coming back to the package itself, it seems to me that the return datatype of the histograms are doubles. Conceptually, the histogram is simply a count in that bin, and should be of an integral type. Is the return datatype chosen to be double to reflect the input array type? Or is that in keeping with the numpy convention?

Changing the histogram increment to be an integral type (i.e., the postfix ++ operator) might also lead to better performance. (There are some opportunity for performance improvements -- e.g., pointer operations, reduction of scope for variables, unrolled loops + separate counters)

Separately, any particular reason to perform a bitwise and (&) rather than the logical and (&&) here?

@astrofrog
Copy link
Owner

I decided to return doubles for consistency with Numpy, but we could potentially have an option to choose the return type (at least maybe choosing between double and long?). All the other improvements sound good, and the use of && was a mistake :)

Would you be interested in submitting a pull request to implement/fix some of these as well as #4?

You mention reduction of scope for variables, but just to let you know, I found that all the variables had to be declared at the top to keep the Windows compilers happy (but maybe you know how to avoid this).

@manodeep
Copy link
Contributor Author

I sent up the trivial stuff in the PR #6.

I have never compiled for Windows. What's the error you get on Windows with these lines:

const double postfac = normx * fnx;
for(i = 0; i < n; i++) {
    if (*x >= xmin && *x < xmax) {
        const long ix = (*x - xmin) * postfac;
        count[ix] += 1.;
    }
   x++;
}

@astrofrog
Copy link
Owner

@manodeep - Thanks! I don't have Windows locally, but you can try and open a PR, and there will be an AppVeyor build in which you can check any error messages if the build fails.

@henryiii
Copy link
Contributor

henryiii commented Nov 1, 2018

FYI, item += 1 stops working if the numbers get large enough for floating point numbers. However, that number is very large for doubles.

@henryiii
Copy link
Contributor

henryiii commented Nov 1, 2018

Also, numpy returns int64s:

import numpy as np
a,b = np.histogram([1,2,.3])
print(a.dtype)
# prints int64

@astrofrog
Copy link
Owner

I wonder if this changed in Numpy since I wrote this package. I'd certainly be up for returning integer values from the plain histogram and doubles from the weighted one, though I need to check if that might have any undesired side effects on other packages.

@henryiii
Copy link
Contributor

henryiii commented Nov 2, 2018

I think this changed in 1.10, at least the 64 part of the int was added then. This is also when numpy got a high-speed 1D histogram algorithm. I wonder how it compares to fast-histogram now in 1D?

numpy/numpy@34b582a#diff-9d44603693b9544b900e135a203b3ae2R180, numpy/numpy#7629

Edit, per this issue numpy/numpy#7845, 2D/ND histograms still return a floating type.

@astrofrog
Copy link
Owner

The current benchmarks shown in the README are against a later version than 1.10, so this is still quite a bit faster than numpy. I'm not sure if you are familiar with the asv tool for benchmarks, but I keep a running benchmark here of how fast fast-histogram is compared to Numpy: https://astrofrog.github.io/fast-histogram/#benchmarks.Histogram1D.time_histogram1d?x-axis=commit

@henryiii
Copy link
Contributor

henryiii commented Nov 4, 2018

By the way, the current Numpy code is internally in int64s, and then casts at the end. There's one line that can be dropped to stop the casting. This allows Numpy to count past the point that floating point can no longer add 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants