
numpy.load() feature upgrade #644

Open
hamza-712 opened this issue Aug 9, 2023 · 12 comments
Labels
enhancement New feature or request

Comments

@hamza-712

hamza-712 commented Aug 9, 2023

numpy.load() feature upgrade
Hi,
Can you please add a very useful feature of using memmap to load only a part of numpy array from a file?
For example

  1. Let's create a memory-mapped array in write mode:
import numpy as np
nrows, ncols = 1000000, 100
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))
  2. Let's feed the array with random values, one column at a time, because our system's memory is limited:
for i in range(ncols):
    f[:, i] = np.random.rand(nrows)
x = f[:, -1]
del f

### READING

f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='r', shape=(nrows, ncols))
np.array_equal(f[:, -1], x)  # -> True
del f

Additional context
https://numpy.org/doc/stable/reference/generated/numpy.load.html
For example, supporting numpy.memmap inside numpy.load().
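For reference, CPython's numpy already supports this through load()'s mmap_mode argument: the returned object is a numpy.memmap, so slicing pulls only the requested rows off disk. A minimal sketch of the behaviour the request asks for (file name is illustrative):

```python
import numpy as np

# Write a small array to a standard .npy file.
arr = np.arange(12, dtype=np.float32).reshape(4, 3)
np.save('data.npy', arr)

# mmap_mode='r' maps the file instead of loading it; slicing the
# returned memmap reads only the requested rows from disk.
view = np.load('data.npy', mmap_mode='r')
block = np.array(view[1:3])  # copy just rows 1-2 into RAM
```
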

@hamza-712 hamza-712 added the enhancement New feature or request label Aug 9, 2023
@v923z
Owner

v923z commented Aug 9, 2023

I believe this is actually much more than just reading part of the file, at least, this is what I understand from this: https://numpy.org/doc/stable/reference/generated/numpy.memmap.html. Basically, you don't load anything with memmap, you just create a pointer to data on the disc, so if you take the method sum as an example, sum has to know how to handle data that are not stored in RAM, and that is highly non-trivial.

@hamza-712
Author

Could you implement a way to save numpy arrays in append mode? Similarly, a way to read a partial subarray of a numpy array with some kind of 'offset' variable.

@v923z
Owner

v923z commented Aug 13, 2023

Can you point to the relevant documentation?

@hamza-712
Author

For appending arrays there is a library, though it is not part of the official numpy docs:
https://pypi.org/project/npy-append-array/

For reading, I haven't seen any approach implemented other than h5py or numpy.memmap:
https://numpy.org/doc/stable/reference/generated/numpy.load.html

@v923z
Owner

v923z commented Aug 14, 2023

I feel that we're rapidly going off on a tangent, but still, here are a couple of comments:

  1. Dtype mod #327 implements more or less what you want. As I said, your request is not trivial, and we have to tread carefully here. It's no accident that it hasn't yet been merged, but we could dust it off.
  2. As you pointed out, npy-append-array is not part of numpy, which leads me to the question of whether what you would like could/should be implemented not at the C level, but in Python. If so, the next question is what you would need for that. Would it help if you had a method that simply lays bare the binary contents of an ndarray's pointer, which you could then write to a file from Python? We could extend the methods of https://github.com/v923z/micropython-ulab/blob/master/code/utils/utils.c and add one that gets you the ndarray's raw bytes. You would then manipulate the header of your .npy file from Python.
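A rough sketch of what such a Python-level append could look like, assuming the proposed utils method existed. Here CPython numpy's tobytes() stands in for it, and the two-uint32 header layout (append_rows, the file name) is purely illustrative, not numpy's .npy format:

```python
import os
import struct
import numpy as np

def append_rows(path, rows):
    # Illustrative header layout: two little-endian uint32 values
    # holding (nrows, ncols), followed by the raw array payload.
    data = rows.tobytes()  # in ulab, the proposed utils method would supply these bytes
    try:
        with open(path, 'r+b') as f:
            nrows, ncols = struct.unpack('<II', f.read(8))
            f.seek(0, 2)      # jump to the end of the file
            f.write(data)     # append the new rows
            f.seek(0)         # rewrite only the header
            f.write(struct.pack('<II', nrows + rows.shape[0], ncols))
    except OSError:           # file does not exist yet: create it
        with open(path, 'wb') as f:
            f.write(struct.pack('<II', rows.shape[0], rows.shape[1]))
            f.write(data)

if os.path.exists('append_demo.bin'):
    os.remove('append_demo.bin')
append_rows('append_demo.bin', np.ones((2, 3), dtype=np.float32))
append_rows('append_demo.bin', np.zeros((2, 3), dtype=np.float32))
with open('append_demo.bin', 'rb') as fh:
    total_rows, total_cols = struct.unpack('<II', fh.read(8))
```

Only the 8-byte header is rewritten on each append; the payload is written once, at the end of the file.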

@hamza-712
Author

  1. I need it for ndarray file reading. It's basically audio data that I'm loading from an SD card into an ESP32. I have implemented file handling in MicroPython, but it's slow compared to the normal numpy.load(): numpy.load() usually takes around 4 ms for this size of data, while my implementation takes around 50 ms.

  2. I have made a rudimentary Python-based implementation that puts a header at the beginning of a binary file. The header needs to be edited in append mode, which slows down the write process.

> a method that simply lays bare the binary contents of an ndarray's pointer, which you could then write to a file from python? We could turn the methods of https://github.com/v923z/micropython-ulab/blob/master/code/utils/utils.c, and add one that gets you the ndarray. You would then manipulate the header or your .npy file from python.

This would help a lot. Anything that lays out pointers to rows is good enough.

@v923z
Owner

v923z commented Aug 27, 2023

It's not quite clear to me what your vision for such a function would be. The way you describe it seems to indicate that you'd need access to data that is not contiguous. Is that the case?

@hamza-712
Author

Let's say I have my data stored as an ndarray of shape (1000, 7) in a file. I want to retrieve only a block of shape (10, 7) from the file without bringing the whole ndarray into memory.

The function should allow reading some block of rows. The function I have implemented in Python can only read contiguous rows from the file:

def filereader(rows_to_read=1, offset_index=0):
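A minimal sketch of such a contiguous-row reader, assuming float32 data, a known column count, and a headerless raw file (the function body, ITEMSIZE, and the file name are illustrative, not the implementation from the thread):

```python
import numpy as np

ITEMSIZE = 4  # bytes per float32 item

def filereader(path, rows_to_read=1, offset_index=0, ncols=7):
    """Read a contiguous block of rows from a raw float32 file.

    ncols is assumed known (e.g. from a header); offset_index is the
    first row to read. Only rows_to_read * ncols values are loaded.
    """
    with open(path, 'rb') as f:
        f.seek(offset_index * ncols * ITEMSIZE)  # skip preceding rows
        buf = f.read(rows_to_read * ncols * ITEMSIZE)
    return np.frombuffer(buf, dtype=np.float32).reshape(rows_to_read, ncols)

# Demo: write a (1000, 7) array raw, then read back rows 100-109 only.
data = np.arange(1000 * 7, dtype=np.float32).reshape(1000, 7)
data.tofile('rows.dat')
block = filereader('rows.dat', rows_to_read=10, offset_index=100)
```
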

@v923z
Owner

v923z commented Aug 28, 2023

OK, so one thing we could do is add the numpy-incompatible keywords offset and count to load, so that you could start from a particular place, and read a given number of values.

There might be an issue, and I don't quite know how to handle that: if you want to add offset and count, then you have to know beforehand what the shape in the file is, otherwise, you might request something that's not compatible with the contents of the file.
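For comparison, CPython numpy's fromfile() already takes exactly these two keywords, offset (in bytes) and count (in items), which could serve as a model for their semantics on load():

```python
import numpy as np

a = np.arange(20, dtype=np.float32)
a.tofile('raw.dat')  # raw binary dump, no .npy header

# Skip the first 5 items (5 * 4 bytes), then read 10 items.
part = np.fromfile('raw.dat', dtype=np.float32, offset=5 * 4, count=10)
```

Note that fromfile() sidesteps the shape-compatibility question by returning a flat array; load() would have to validate offset and count against the shape recorded in the .npy header.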

@hamza-712
Author

I have created a header struct in my Python implementation which keeps track of the array dtype and shape.

        header_format = "BBHH"
        header_data = ustruct.pack(header_format, byte_size, array.dtype, row_dimension, column_dimension)

Moreover, it's better if the write operation supports only overwrite mode, so that we don't have to edit the header again and again.
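For completeness, reading that header back is the mirror of the pack() call above; a sketch using CPython's struct (MicroPython's ustruct exposes the same API), assuming the dtype is stored as a one-byte character code:

```python
import struct  # MicroPython's ustruct has the same pack/unpack API

header_format = "BBHH"
header_size = struct.calcsize(header_format)  # native alignment may pad this

# Pack as in the snippet above, with 'f' (float32) as the dtype code
# (assumption: the dtype fits in one byte as a character code).
packed = struct.pack(header_format, 4, ord('f'), 1000, 7)

# A reader unpacks the header before touching the payload, so it knows
# the dtype and shape without loading the whole array.
byte_size, dtype_code, nrows, ncols = struct.unpack(header_format, packed)
```
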

@v923z
Owner

v923z commented Aug 29, 2023

What you're saying here doesn't address the issue I mentioned earlier. If we add a keyword or something like that to load, then we cannot rely on the fact that you know everything about the file that you're going to read. So, if the file contains data that were of the shape (4, 4, 4), which is 16 entries, but you're trying to read into a shape (2, 5), what should happen?

Also, the title of this thread is "numpy.load() feature upgrade", so we shouldn't talk about write operations here. Even memmap is about reading from a file, not writing to it. I have the feeling that we're dealing with feature creep here. Could you please define exactly what this new feature of the load function should do?

We might actually be better off adding the function to utils, if you really need it.

@jonnor

jonnor commented Aug 11, 2024

I wrote an implementation of .npy file loading/saving for MicroPython, which also supports streaming reading of data. The streaming API is different from the numpy.load() one, to allow accessing/validating the metadata/structure information before actually reading the data. https://github.com/jonnor/micropython-npyfile?tab=readme-ov-file#streaming-read
