Can it be made a more transparent drop-in for ndarray? #21

Open
cboulay opened this issue May 14, 2020 · 4 comments

Comments


cboulay commented May 14, 2020

I'm trying to see how far I can take my ~50 GB hdf5 datasets through my processing pipeline before explicitly creating an ndarray. My pipeline uses a framework (Neuropype) that puts the ndarray in a container along with some metadata and makes extensive use of ndarray functions returning views. I think I could get a lot further in this framework with my h5 dataset if a wrapper class like DatasetViewh5py reimplemented some of those ndarray functions that return views.

Are there any downsides to renaming lazy_transpose to transpose?

Do you foresee any problems with a lazy implementation of reshape?

I'm also considering a custom implementation of squeeze.

NumPy users expect flatten() to return a copy, so probably not that one.

What about min, max, argmin, argmax, any and all when an axis is provided? Even though all of the data will have to be loaded into memory eventually, it can be done sequentially row-by-row (or column-by-column) so maybe this will help avoid out-of-memory errors. I am fairly new to processing data cached-on-disk so I'm hoping others with more experience can tell me if this is a bad idea from the outset.
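A minimal sketch of the block-by-block reduction described above, for min only. It assumes the dataset supports NumPy-style slicing and exposes `.shape` (as an `h5py.Dataset` does); the function name `chunked_min` and the `block` parameter are hypothetical and not part of this package. An in-memory array stands in for the on-disk dataset.

```python
import numpy as np


def chunked_min(dset, axis, block=1024):
    """Minimum along `axis`, reading `block` rows of axis 0 at a time."""
    axis = axis % len(dset.shape)  # normalise negative axes
    running = None  # running elementwise minimum, used when axis == 0
    parts = []      # per-block results, used when axis != 0
    for start in range(0, dset.shape[0], block):
        # Only this block is materialised in memory.
        chunk = np.asarray(dset[start:start + block])
        part = chunk.min(axis=axis)
        if axis == 0:
            running = part if running is None else np.minimum(running, part)
        else:
            parts.append(part)
    return running if axis == 0 else np.concatenate(parts, axis=0)


# Stand-in for a large on-disk dataset.
data = np.random.default_rng(0).normal(size=(10, 4, 3))
result_axis0 = chunked_min(data, axis=0, block=4)
result_axis2 = chunked_min(data, axis=2, block=4)
```

The same pattern extends to max, any and all by swapping the reduction and the combining function. argmin and argmax additionally need the block offset added to the per-block indices when reducing along axis 0.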

cboulay changed the title from "Why not make it a more transparent drop-in for ndarray?" to "Can it be made a more transparent drop-in for ndarray?" on May 14, 2020
Contributor

d-sot commented May 24, 2020

Hello, implementing squeeze is easy. Integer indexing already drops dimensions, so a dataset can be squeezed by indexing each length-1 axis with 0, e.g. dsetview.lazy_slice[:,0,:,:]. More generally: dsetview.lazy_slice[tuple(0 if n == 1 else slice(None) for n in dsetview.shape)]
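A small sketch of that index construction, checked against a plain NumPy array. The helper name `squeeze_index` is hypothetical. Passing the resulting tuple to a view's `lazy_slice[...]` is assumed to behave like the NumPy indexing shown here.

```python
import numpy as np


def squeeze_index(shape):
    """Index tuple that drops every length-1 axis and keeps the others whole."""
    return tuple(0 if n == 1 else slice(None) for n in shape)


arr = np.zeros((3, 1, 4, 1))
idx = squeeze_index(arr.shape)
squeezed = arr[idx]  # same shape as arr.squeeze()
```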

Since h5py datasets do not have a transpose method, you can alias dsetview.transpose to dsetview.lazy_transpose for your use case. If the underlying class is something else that already defines transpose, though, the alias would override that method.

I'm not too sure about integrating a general lazy reshape. It might be feasible in certain cases, such as reshaping at chunk boundaries. PRs are welcome, but a general reshape that works together with lazy transpose and slicing might get too complicated.
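To illustrate where the complication comes from, here is a sketch of the index translation a lazy reshape would need, assuming C-ordered data. The helper `source_index` is hypothetical. It uses NumPy's `ravel_multi_index` and `unravel_index` to map an index in the reshaped array back to the index in the original array.

```python
import numpy as np


def source_index(new_index, new_shape, old_shape):
    """Map an index in the reshaped array to the matching index in the source."""
    flat = np.ravel_multi_index(new_index, new_shape)  # position in C order
    return tuple(int(i) for i in np.unravel_index(flat, old_shape))


source = np.arange(24).reshape(2, 3, 4)
reshaped = source.reshape(6, 4)
mapped = source_index((4, 2), reshaped.shape, source.shape)
```

A single element maps cleanly, but a rectangular slice of the reshaped array generally does not map to a rectangular slice of the source, which is why composing reshape with lazy slicing and transposing is harder than either operation alone.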

To implement min and the other functions, the data has to be read, or perhaps processed sequentially in chunks. Have you looked into dask?
Also, for transposing data in place, fastremap may be of help if you are memory constrained.

Author

cboulay commented May 24, 2020

I spent much of the last week implementing lazy reshape. And yes, it was complicated. I ended up rewriting about 90% of the code. Though I got it working, and I tested quite a few combinations of transpose, reshape and slice, I'm sure there are some corner cases where it will fail.

After I've had more time to play with it I'll push my changes to my fork, but it's such a huge change that I doubt a PR is what you want. I'll post here again when I feel it's ready for other eyes and you can let me know how you feel.

I took a quick look at dask but it didn't seem to meet my use case. I should look again.
Cheers!

Contributor

bendichter commented May 24, 2020

@cboulay .T works as a lazy transpose. Let us know if you come up with anything for reshape. It sounds like a tough problem. We would be interested in incorporating it if it doesn't dramatically increase the difficulty of supporting this package.

Author

cboulay commented May 26, 2020

I'm not quite ready to say it's suitable for a PR, but I'll post the main commit for reference in case I get otherwise distracted and someone wants these features without waiting for me to clean it up more:
cboulay@1ef59e6

As I was working on it, I thought it would have been better to use strides as the object's state variable to manage transpose & reshape rather than the solution I came up with. I'm sure someone cleverer than myself could get it to work and it would be more elegant than my solution and probably much more flexible.
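A toy sketch of the strides idea, not the code in the linked commit. It assumes a flat, C-ordered buffer and keeps only a shape and element strides as state. The class name `StridedView` and its methods are hypothetical. Transposing becomes a permutation of both tuples, and no data is touched until an offset is looked up.

```python
import numpy as np


class StridedView:
    """Toy view state: shape plus element strides into a flat C-ordered buffer."""

    def __init__(self, shape, strides=None):
        self.shape = tuple(shape)
        if strides is None:
            # C-order strides in elements: the last axis varies fastest.
            strides, step = [], 1
            for n in reversed(self.shape):
                strides.append(step)
                step *= n
            strides.reverse()
        self.strides = tuple(strides)

    def transpose(self, axes):
        """Permute axes by permuting shape and strides; no data is read."""
        return StridedView([self.shape[a] for a in axes],
                           [self.strides[a] for a in axes])

    def offset(self, index):
        """Flat offset in the underlying buffer for a full integer index."""
        return sum(i * s for i, s in zip(index, self.strides))


base = np.arange(24).reshape(2, 3, 4)
view = StridedView(base.shape).transpose((2, 0, 1))
flat_offset = view.offset((3, 1, 2))
```

Reshape is the harder part under this scheme: not every reshape of a transposed array can be expressed with a single set of strides, which is why NumPy itself sometimes has to copy when reshaping.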

3 participants