-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sparse kde #222
Sparse kde #222
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good. Most of the comments are fairly easy. The main points are that you should test of the user provides the correct size of the cell that you expect and raise an error otherwise. The second main point is instead of requiring strings for some parameters we can directly use executables.
Also, can you please update the CHANGELOG file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice improvements.
CHANGELOG
Outdated
@@ -11,8 +11,15 @@ The rules for CHANGELOG file: | |||
|
|||
.. inclusion-marker-changelog-start | |||
|
|||
0.3.0 (XXXX/XX/XX) | |||
0.2.1 (XXXX/XX/XX) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep the version 0.3.0
.
0.2.1 (XXXX/XX/XX) | |
0.3.0 (XXXX/XX/XX) |
In the pyproject.toml
we also use version 0.3.0. And we will decide once we relase if we do a major/minor or patch release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like not done?
examples/neighbors/sparse-kde.py
Outdated
labels: np.ndarray, | ||
probs: np.ndarray, | ||
normpks: float, | ||
metric: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think passing an instance would be even easier. You don't even have to keep a dictionary and users can use their own distance measures without changing the code.
src/skmatter/metrics/_pairwise.py
Outdated
X, Y = check_pairwise_arrays(X, Y) | ||
cov_inv = _mahalanobis_preprocess(cov_inv) | ||
dists = _mahalanobis(cell, X, Y, cov_inv) | ||
if not squared: | ||
dists **= 0.5 | ||
return dists | ||
|
||
|
||
def _check_dimension(X, cell): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do also want to add a test here that the cell is rectangular? Maybe you have it already and I just oversaw it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps not, because I do not give a place for the angle in the cell parameter, and every number in the cell will be interpreted as a side length of a rectangular box. I will add a note in the documentation to inform the user that it only supports the rectangular cell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay but if the cell is rectangular your format is an 1d array of length 3?
This should be tested and you write in your docs that this is the expected format and if anything else is given your raise a meaningful error message giving the actual type and the type that you expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, perhaps I should not say rectangular. The cell can be rectangular, cubic, 4-cube or n-cube. Here is not a limitation on the dimension. Thus I only test if the dimension of the cell matches the dimension of descriptors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, rectangular. I also commented this at another place in more details. Maybe it makes sense to rename cell
to cell_length
to make sure that what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. I think we are almost there from my side.
available. | ||
|
||
.. note:: | ||
Currently only rectangular cells are supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as we discussed above maybe also give the expected format here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool. Now I also get the "problem" with a cell. I made a suggestion to make clear what the code can handle and what not.
src/skmatter/metrics/_pairwise.py
Outdated
X : {array-like, sparse matrix} of shape (n_samples_X, n_components) | ||
An array where each row is a sample and each column is a component. | ||
|
||
Y : {array-like, sparse matrix} of shape (n_samples_Y, n_components), \ | ||
default=None | ||
An array where each row is a sample and each column is a component. | ||
If `None`, method uses `Y=X`. | ||
|
||
Y_norm_squared : array-like of shape (n_samples_Y,) or (n_samples_Y, 1) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need empty lines between the arguments. I don't why this was done in the past. I also removed them from other functions in #227.
Can you do this for your the doc strings as well?
X : {array-like, sparse matrix} of shape (n_samples_X, n_components) | |
An array where each row is a sample and each column is a component. | |
Y : {array-like, sparse matrix} of shape (n_samples_Y, n_components), \ | |
default=None | |
An array where each row is a sample and each column is a component. | |
If `None`, method uses `Y=X`. | |
Y_norm_squared : array-like of shape (n_samples_Y,) or (n_samples_Y, 1) \ | |
X : {array-like, sparse matrix} of shape (n_samples_X, n_components) | |
An array where each row is a sample and each column is a component. | |
Y : {array-like, sparse matrix} of shape (n_samples_Y, n_components), \ | |
default=None | |
An array where each row is a sample and each column is a component. | |
If `None`, method uses `Y=X`. | |
Y_norm_squared : array-like of shape (n_samples_Y,) or (n_samples_Y, 1) \ |
src/skmatter/metrics/_pairwise.py
Outdated
X, Y = check_pairwise_arrays(X, Y) | ||
cov_inv = _mahalanobis_preprocess(cov_inv) | ||
dists = _mahalanobis(cell, X, Y, cov_inv) | ||
if not squared: | ||
dists **= 0.5 | ||
return dists | ||
|
||
|
||
def _check_dimension(X, cell): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, rectangular. I also commented this at another place in more details. Maybe it makes sense to rename cell
to cell_length
to make sure that what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice. I a happy. I asked other developers for their review.
Hi @agoscinski, I think this PR is ready for further review. Thank you for your time! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good, only the properties that are not recomputed when refitted I think are concerning.
Co-authored-by: Alexander Goscinski <[email protected]>
Co-authored-by: Alexander Goscinski <[email protected]>
Co-authored-by: Alexander Goscinski <[email protected]>
Co-authored-by: Alexander Goscinski <[email protected]>
Co-authored-by: Alexander Goscinski <[email protected]>
Co-authored-by: Alexander Goscinski <[email protected]>
Co-authored-by: Alexander Goscinski <[email protected]>
Co-authored-by: Alexander Goscinski <[email protected]>
Co-authored-by: Alexander Goscinski <[email protected]>
Co-authored-by: Alexander Goscinski <[email protected]>
Returns the instance itself. | ||
""" | ||
self._bandwidth_inv_ = None | ||
self._normkernels_ = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more transparent to the reader of the code if you set self._bandwidth_inv = None
. Otherwise in the remaining code you don't use it anymore and it becomes only clear when reading the property.
For that you need a property setter
@_bandwith_inv.setter
def _bandwith_inv(self, value)
self._bandwith_inv_ = value
Alternatively, you could use https://docs.python.org/3/library/functools.html#functools.cached_property then you don't have two variables, less code and you can reset cached_properties with del self.property_name
.
Same for _normkernels_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After consideration, I think adding some comments at these lines might be better. There are two reasons why I did not take the two ways you mentioned. For the setter way, it does allow me to re-set the value of _bandwith_inv
much clearer, but it also exposes the logic of calculating it. For the cached_property
way, the value of these two properties won't be updated if the user do the fitting twice.
But I am not pretty sure if not exposing the logic of setting these two properties makes sense. If not, I think the setter way is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the cached_property way, the value of these two properties won't be updated if the user do the fitting twice.
If you set it to None at the beginning, as you do it anyway here, they can be updated. I think that would be the better solution here, but I am fine with the comment.
Co-authored-by: Alexander Goscinski <[email protected]>
Would squash merge to one commit with your PR message a bit adapted. Are you fine with it?
|
Totally okay, thank you for your time and consideration! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks also for the work and the patience!
This PR introduces SparseKDE:
SparseKDE
is located atsrc/skmatter/utils/_sparsekde.py
. It mitigates the high cost of doing KDE for large datasets by doing KDE for selected data points (e.g. grid points sampled by farthest point-sampling). This class takes the original dataset as a parameter and fits the model using the sampled grid points.SparseKDE
stored insrc/skmatter/utils/_sparsekde.py
.pairwise_euclidean_distances
andpairwise_mahalanobis_distances
, are realized and stored insrc/skmatter/metrics/pairwise.py
.SparseKDE
and some auxiliary functions are stored intests/test_neighbors.py
. Tests for distance metrics are stored intests/test_metrics.py
.I am not sure if the current API of
SparseKDE
is OK and if the auxiliary classes should be integrated intoSparseKDE
. Also,SparseKDE
seems to be too large and complex. Perhaps it needs to be decomposed into smaller parts, but I have not figured out how.Contributor (creator of PR) checklist
For Reviewer
📚 Documentation preview 📚: https://scikit-matter--222.org.readthedocs.build/en/222/