Allow generators and iterators #194

Merged (26 commits) on Dec 18, 2020

@dkarrasch commented Dec 5, 2020
Member

This adds support for iterable objects as arguments to evaluate (EDIT: and pairwise and colwise). I haven't touched pair- and colwise stuff, because that is (partially/completely?) addressed by #188.

Closes #187. Closes #162. Closes #165. Closes #152. Closes #190. Closes #192. Closes #188.

@dkarrasch
Member Author

@mkborregaard You seemed to be interested in this.

@johnnychen94 left a comment
Contributor

I didn't look at every detail, but it generally looks good to me. I approve of the goal.

Might need a benchmark to detect potential regressions.

ImageDistances might even get simplified with this change.

@codecov-io commented Dec 5, 2020

Codecov Report

Merging #194 (1350d47) into master (be3a901) will increase coverage by 1.22%.
The diff coverage is 99.38%.


@@            Coverage Diff             @@
##           master     #194      +/-   ##
==========================================
+ Coverage   96.74%   97.97%   +1.22%     
==========================================
  Files           8        8              
  Lines         675      739      +64     
==========================================
+ Hits          653      724      +71     
+ Misses         22       15       -7     
Impacted Files          Coverage Δ
src/metrics.jl          97.73% <98.24%> (+0.99%) ⬆️
src/Distances.jl        100.00% <100.00%> (ø)
src/bhattacharyya.jl    100.00% <100.00%> (+13.04%) ⬆️
src/bregman.jl          100.00% <100.00%> (ø)
src/common.jl           94.52% <100.00%> (+0.23%) ⬆️
src/generic.jl          98.62% <100.00%> (+0.70%) ⬆️
src/haversine.jl        100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dkarrasch
Member Author

I haven't generalized *Mahalanobis, because that looks so linear-algebra-y. Otherwise, I think, all metrics now work with abstract iterables, and are tested for Iterators and Generators. I'll try to do some benchmarks. In a few cases, I have left both a specific AbstractVector version and an unconstrained version. I'll try to benchmark whether there is any benefit from the constrained versions. Any comments or concerns are welcome, of course.
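The coexistence of an unconstrained method and an AbstractVector method mentioned above can be sketched in plain Julia. This is only an illustration under assumptions: the name sqdist_sketch is hypothetical, and the code is not taken from Distances.jl.

```julia
# Hypothetical example, not the Distances.jl source.

# Unconstrained method: accepts any pair of iterables, including generators.
function sqdist_sketch(a, b)
    s = 0.0
    for (x, y) in zip(a, b)   # zip stops at the shorter input; no length check here
        s += abs2(x - y)
    end
    return s
end

# Constrained method: an indexing-based loop for vectors.
function sqdist_sketch(a::AbstractVector, b::AbstractVector)
    s = 0.0
    # eachindex(a, b) throws a DimensionMismatch if the axes differ,
    # which is what makes @inbounds safe in this loop.
    @inbounds for i in eachindex(a, b)
        s += abs2(a[i] - b[i])
    end
    return s
end

from_generators = sqdist_sketch((x for x in 1:3), (2y for y in 1:3))
from_vectors = sqdist_sketch([1, 2, 3], [2, 4, 6])
```

Dispatch picks the vector method whenever both arguments are AbstractVectors, so whether the constrained version pays off is purely a benchmarking question.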

@dkarrasch
Member Author

This is ready for a thorough review. Suggestions for additional use cases and tests would also be welcome.

@dkarrasch dkarrasch changed the title Allow generators and iterators in evaluate Allow generators and iterators Dec 12, 2020
@nalimilan
Member

I got a little obsessive about the "lazy evaluation" approach. If you look closer, then colwise is nothing but map(metric, zip(eachcol(A), eachcol(B))) (and similarly for iterators), and pairwise(..., dims=2) is nothing but map(metric, Iterators.product(eachcol(A), eachcol(B))) (or eachrow for dims=1 and similarly for iterators). Replacing map by Iterators.map makes the whole construct lazy, i.e., a generator. Does anybody see any value in exposing this lazy construct, which could be, by default, eagerly evaluated via collect? One naive use case I can imagine is computing k-nearest neighbors within the data set via brute force. That shouldn't require having the total distance matrix in memory?

Given that using Iterators.map manually is relatively easy, is it really worth providing something else? I guess we could add an argument to pairwise (say, lazy), but that can be done at any time if somebody requests it.
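The equivalences described in this exchange can be sketched with Base Julia alone. Everything here is an assumption made for illustration: sqdist is a hypothetical stand-in for a metric, and A and B are made-up data; none of it is the Distances.jl implementation. Note that zip and Iterators.product yield tuples, so the sketch destructures each tuple before calling the metric.

```julia
# Hypothetical stand-in metric (not the Distances.jl implementation).
sqdist(a, b) = sum(abs2(x - y) for (x, y) in zip(a, b))

A = [1.0 2.0 3.0; 0.0 0.0 0.0]   # three points stored as columns
B = [1.0 0.0 3.0; 1.0 0.0 4.0]

# "colwise": one distance per pair of corresponding columns.
colwise_eager = map(sqdist, eachcol(A), eachcol(B))

# "pairwise" over columns: entry (i, j) compares column i of A with column j of B.
pairwise_eager = map(((a, b),) -> sqdist(a, b),
                     Iterators.product(eachcol(A), eachcol(B)))

# Lazy variant: a generator that computes nothing until it is iterated.
pairwise_lazy = (sqdist(a, b) for (a, b) in Iterators.product(eachcol(A), eachcol(B)))

# Brute-force nearest neighbour of each column of A among the columns of B,
# holding only one row of distances at a time instead of the full matrix.
nearest = [argmin([sqdist(a, b) for b in eachcol(B)]) for a in eachcol(A)]
```

On Julia 1.6 and later, Iterators.map(f, itr) builds the same kind of lazy generator as the comprehension-style expression used here, and collect(pairwise_lazy) recovers the eager matrix.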

src/generic.jl Outdated
colwise(r::AbstractMatrix, metric::PreMetric,
a::AbstractMatrix, b::AbstractMatrix)

Compute distances between each corresponding columns of `a` and `b` according
Member

Something to note is that these methods are inconsistent with the ones treating a and b as iterators of columns: a matrix of vectors will be treated differently from a vector of the same vectors. That's probably OK in practice, but that's one of the reasons why I'd like to move to requiring explicitly writing pairwise(d, eachcol(a), eachcol(b)) in the longer term. That way we won't need dims anymore.

Member Author

That would be very nice, because it redirects the matrix-based method to the iterator-based method, and one could get rid of the matrix-based ones. The only issue I see is that for the specialized *Euclidean (and a few others) distances, where we do need the underlying matrix for performance reasons, I don't seem to be able to unwrap it from the eachcol generator.

Member

Fortunately JuliaLang/julia#32310 should allow us to retrieve the underlying matrix!

Member Author

Yes, I looked at that one a bit today. Will they let it go into v1.6? I wonder how the ecosystem is going to adapt to v1.6 being the new LTS, and how fast packages will really drop support for pre-1.6 versions. In many cases, there is no hard reason, only soft ones.
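The redirection discussed in this thread could look roughly like the following sketch. It rests on assumptions: pairwise_sketch and sqdist are hypothetical names, the code uses only Base Julia, and it does not reproduce how Distances.jl is actually organised (in particular it ignores the performance specialisations that need the underlying matrix).

```julia
# Hypothetical stand-in metric.
sqdist(p, q) = sum(abs2(x - y) for (x, y) in zip(p, q))

# Iterator-based method: xs and ys are iterables of points.
pairwise_sketch(f, xs, ys) = [f(x, y) for x in xs, y in ys]

# Matrix-based method: only translates dims into eachrow/eachcol and forwards.
function pairwise_sketch(f, a::AbstractMatrix, b::AbstractMatrix; dims::Integer)
    dims == 1 && return pairwise_sketch(f, eachrow(a), eachrow(b))
    dims == 2 && return pairwise_sketch(f, eachcol(a), eachcol(b))
    throw(ArgumentError("dims must be 1 or 2"))
end

A = [1.0 2.0 3.0; 0.0 0.0 0.0]
B = [1.0 0.0 3.0; 1.0 0.0 4.0]

by_columns = pairwise_sketch(sqdist, A, B; dims=2)
by_rows = pairwise_sketch(sqdist, A, B; dims=1)
```

With this layout, callers who write pairwise_sketch(sqdist, eachcol(A), eachcol(B)) directly reach the iterator-based method and never need the dims keyword.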

src/generic.jl Outdated
Comment on lines 301 to 302
Compute distances between each pair of rows (if `dims=1`) or columns (if `dims=2`)
in `a` and `b` according to distance `metric`. If a single matrix `a` is provided,
Member

Something to note is that these methods are inconsistent with the ones treating a and b as iterators of columns: a matrix of vectors will be treated differently from a vector of the same vectors. That's probably OK in practice, but that's one of the reasons why I'd like to move to requiring explicitly writing pairwise(d, eachcol(a), eachcol(b)) in the longer term. That way we won't need dims anymore.

Member Author

I'm not exactly sure I understand the inconsistency, actually. Could you please sketch an application case?

Member

For example, pairwise(d, [a, b, c, d]) vs. pairwise(d, reshape([a, b, c, d], 2, 2)) with a, b, c and d vectors of numbers.

Member Author

Ok, then I understood you correctly. Out of the two calls you mentioned, only the first one works. The second one fails because it treats a, b, c and d like numbers, but then calls like abs(a) (or whatever necessary) fail.

Member

Yes. But if we were fully consistent, the second call would be equivalent to the first one, since matrices are just one kind of iterator.
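The inconsistency discussed in this thread can be made concrete with Base Julia alone. The values of a, b, c and d are made up for illustration, and no Distances.jl call is involved; the sketch only contrasts plain iteration with column slicing.

```julia
a, b, c, d = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]

v = [a, b, c, d]        # vector of four points
m = reshape(v, 2, 2)    # 2×2 matrix holding the same four points

# Treated as plain iterators, both yield the same four points in the same order.
iterates_alike = length(v) == length(m) && all(p == q for (p, q) in zip(v, m))

# Treated as a matrix and sliced by columns, m yields only two "points",
# and the entries of each column are themselves vectors rather than numbers.
cols = collect(eachcol(m))
```

A metric applied to cols[1] and cols[2] would therefore have to operate on a and c (and on b and d) as if they were scalar coordinates, which is where calls such as abs(a) fail.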

@dkarrasch
Member Author

Alright, I better stop polishing now. This now fixes a whole bunch of issues and/or adds new features. I believe this is in good shape now. Comments and reviews are welcome. Shall we bump the patch number, or the minor version? It's rather difficult to propagate "next level versions" through the ecosystem, and since this doesn't break any existing functionality, maybe a patch is enough?

@nalimilan left a comment
Member

Looks good. Just two small things.

Comment on lines 720 to 726
@testset "CartesianIndex" begin
A = reshape(collect(1:9), 3, 3)
inds1 = findall(iseven, A)
inds2 = findall(isodd, A)
@test sum(pairwise(SqEuclidean(), inds1, inds2)) == 52
@test euclidean(inds1[1], inds1[1]) === 0.0
end
Member

Does this cover a real use case? I find it surprising that somebody would want to do that.

Member Author

There was a feature request in #177. But we could remove it for now and clarify there what the actual intention and the use are.

Member

Yes, I'd leave it out until we have a clearer view of why that would be useful. The fact that CartesianIndex isn't iterable seems to indicate that it's not intended to be used that way, and it would be absurd to have packages work around that everywhere...
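For reference, the numbers in the test under discussion can be reproduced without any CartesianIndex support in the metric by converting the indices to tuples first. This is a sketch in Base Julia under assumptions: sqdist is a hypothetical helper, not a Distances.jl function.

```julia
# Hypothetical stand-in for a squared Euclidean metric on iterables.
sqdist(p, q) = sum(abs2(x - y) for (x, y) in zip(p, q))

A = reshape(collect(1:9), 3, 3)

# Tuple(I) turns a CartesianIndex into an ordinary, iterable tuple.
evens = Tuple.(findall(iseven, A))
odds = Tuple.(findall(isodd, A))

# Sum of squared distances over every (even, odd) pair of index positions.
total = sum(sqdist(p, q) for p in evens, q in odds)
```

The conversion is explicit at the call site, which matches the view expressed above that packages should not each work around CartesianIndex being non-iterable.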

@nalimilan left a comment
Member

Looks good if you leave CartesianIndex out for now. Thanks!

@dkarrasch
Member Author

I'd merge and release today. Any opinions on how to bump the version number?

@dkarrasch dkarrasch merged commit f6ee353 into JuliaStats:master Dec 18, 2020
@dkarrasch dkarrasch deleted the dk/generators branch December 18, 2020 11:36
@dkarrasch
Member Author

I merged the PR as is. We may include some others and then decide about the version bump.

@Datseris

@dkarrasch thanks, you are a legend. One PR closed like 10 issues!

@dkarrasch
Member Author

Thanks, @Datseris. My next projects are world peace and healing cancer... in one PR! 🤣

@nalimilan
Member

Thanks! Regarding the release number, I think we have two options:

  • tag 0.10.1 to avoid forcing dependencies to update their version requirements (since this isn't breaking)
  • tag 1.0.0 so that we can more easily bump either the minor or patch version as appropriate in the future

@johnnychen94
Contributor

This PR gives so many possibilities and thus a new start, so I'm voting for 1.0.0 😆

@dkarrasch
Member Author

How about tagging 0.10.1, then have a v0.11 that deprecates the dims keyword and redirects to eachcol/eachrow, and then have a v0.12/v1 for a version that only contains iterator-based methods, except for array-based specializations for improved performance?

@nalimilan
Member

Yes, if we anticipate breaking changes soon, better not tag 1.0. Maybe file an issue so that we can discuss these changes?

5 participants