Write references in compound datasets at the end #149

mavaylon1 · 2023-12-13T23:58:19Z

Motivation

What was the reasoning behind this change? Please explain the changes briefly.
Fix #144

Notes:
The fix at face value is the following:

for j, item in enumerate(data):
        new_item = list(item)
        for i in refs:
             new_item[i] = self.__get_ref(item[i], export_source=export_source)
        new_items.append(tuple(new_item))
        
        generated_dtype = []
        for item in builder.dtype:
              if item['dtype'] == type('string'):
                  item['dtype'] = 'U25'
              generated_dtype.append((item['name'], item['dtype']))

 dset[...] = np.array(new_items, dtype=generated_dtype)

TLDR: We want to write outside of the the loop. What does this lead to?

To keep our compound dataset in its original form (i.e not expanded) we can use a structured array an pull the dtypes from the builder. However, it is required that the "data" in the array be in tuples and not lists (I believe it can also be in np.arrays but I have not tested this).

Having these as tuples makes things a bit hard when retrieving data on read. For example, in the test test_read_reference_compound

def test_read_reference_compound(self):
        self.test_write_reference_compound()
        self.read()
        builder = self.createReferenceCompoundBuilder()['ref_dataset']
        read_builder = self.root['ref_dataset']

If I try to run read_builder['data'][0] I will get an error that *** TypeError: 'tuple' object does not support item assignment. Digging deeper it is because our get_item swaps/will reassign values in the tuple. This is not allowed because it is a tuple. Now I can convert that to a list on _get_item_ and that does work. However, it is a little hacky. It also doesn't cover when my index is ":" vs a single integer.

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

Did you update CHANGELOG.md with your changes?
Have you checked our Contributing document?
Have you ensured the PR clearly describes the problem and the solution?
Is your contribution compliant with our coding style? This can be checked running ruff from the source directory.
Have you checked to ensure that there aren't other open Pull Requests for the same change?
Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

mavaylon1 · 2023-12-14T00:08:49Z

@oruebel As I was going through the tests, I saw that the way I had to interact with the read data differed from before. Diving deeper it comes down to what I said above where we are now using tuples, which cannot change. The fix (even if it is a little hacky) is to convert to a list when trying to use index notation to "get". I wanted your thoughts on the matter and I also think this may need to go all the way up the parent hierarchy to HDMFDataset.

oruebel · 2023-12-14T00:18:23Z

Digging deeper it is because our get_item swaps/will reassign values in the tuple.

Could you please point me to the lines of code where the tuple is being modified on read so I can take a look to better understand what is going. I'm not sure right now why the tuples are being modified on read (maybe to resolve the references).

mavaylon1 · 2023-12-14T00:21:51Z

Digging deeper it is because our get_item swaps/will reassign values in the tuple.

Could you please point me to the lines of code where the tuple is being modified on read so I can take a look to better understand what is going. I'm not sure right now why the tuples are being modified on read (maybe to resolve the references).

Lines

oruebel · 2023-12-14T00:33:48Z

src/hdmf_zarr/zarr_utils.py

+        rows = list(copy(super().__getitem__(arg)))
+        # breakpoint()


If I understand what you are saying correctly, the issue appears when self.__swap_refs is being called because the individual rows are now represented as tuples. I.e, rows is either a tuple (if arg is an int) or a list of tuples (if arg selects multiple rows). If that is the issue, then I think the we could update self.__swap_refs to return a new instance of the row (rather than updating the existing one). Here is what I think might work:

def __getitem__(self, arg): rows = copy(super().__getitem__(arg)) if np.issubdtype(type(arg), np.integer): rows = self.__swap_refs(rows) else: rows = [self.__swap_refs(row) for row in rows] return rows def __swap_refs(self, row): updated_row = list(row) # convert tuple to a list so we can update it for i in self.__refgetters: getref = self.__refgetters[i] updated_row[i] = getref(row[i]) return updated_row # TODO if we want to keep these as tuples then we could convert them back here

sneakers-the-rat · 2023-12-14T01:30:15Z

I was just going to work on #146 and saw this parallel PR, should I work on this instead or what

src/hdmf_zarr/backend.py

sneakers-the-rat · 2023-12-14T03:08:00Z

Wrote this here: #146 (comment)

but that tuple problem is just from not setting the dtype in the zarr dataset, so it ignores the dtype you give to np.array and just stores as tuples. If you correctly set that then it behaves just fine.

So do this:

hdmf-zarr/src/hdmf_zarr/backend.py

Line 1044 in 8100341

dtype=dtype,

instead of this:

hdmf-zarr/src/hdmf_zarr/backend.py

Line 1016 in 005bfbc

dtype=object,

mavaylon1 · 2023-12-14T04:14:06Z

I was just going to work on #146 and saw this parallel PR, should I work on this instead or what

Yes please

mavaylon1 · 2023-12-14T04:17:10Z

read_builder['data'][0]

Interesting. Could you explain more?

mavaylon1 · 2023-12-14T04:25:18Z

Wrote this here: #146 (comment)

but that tuple problem is just from not setting the dtype in the zarr dataset, so it ignores the dtype you give to np.array and just stores as tuples. If you correctly set that then it behaves just fine.

So do this:

hdmf-zarr/src/hdmf_zarr/backend.py

Line 1044 in 8100341

dtype=dtype,

instead of this:

hdmf-zarr/src/hdmf_zarr/backend.py

Line 1016 in 005bfbc

dtype=object,

Oh it stores it as <class 'numpy.void'> when you provide a dtype instead of a tuple.

sneakers-the-rat · 2023-12-14T04:30:05Z

Yes please

Alright, lmk how - i don't have commit permissions on this repo so can't modify this branch. Usually bad etiquette to take the several hours of work i've done here and just rewrite it without credit, but whatevs.

Oh it stores it as <class 'numpy.void'> when you provide a dtype instead of a tuple.

right, and that's the desired behavior for storing a structured array, which is what I thought the goal was. As this PR is written, the constructed dtype is just ignored and the array is just stored as serialized tuples.

mavaylon1 · 2023-12-14T04:40:50Z

Yes please

Alright, lmk how - i don't have commit permissions on this repo so can't modify this branch. Usually bad etiquette to take the several hours of work i've done here and just rewrite it without credit, but whatevs.

Oh it stores it as <class 'numpy.void'> when you provide a dtype instead of a tuple.

right, and that's the desired behavior for storing a structured array, which is what I thought the goal was. As this PR is written, the constructed dtype is just ignored and the array is just stored as serialized tuples.

I believe we were working on this in parallel. We can just keep it to your branch on your fork to give you credit. The intent was for me to dig around from the last checkpoint I saw on yours. Let me know when you're done and I'll review it. I'll continue to play around here for educational purposes.

sneakers-the-rat · 2023-12-14T05:00:01Z

ya sorry i don't mean to be prickly but this has happened a number of times to me lol and usually i don't care, but when last looking for a job it actually was important to be able to demonstrate contribs. I think the patch is done except for tests and docs?

mavaylon1 · 2023-12-14T05:10:41Z

ya sorry i don't mean to be prickly but this has happened a number of times to me lol and usually i don't care, but when last looking for a job it actually was important to be able to demonstrate contribs. I think the patch is done except for tests and docs?

I can check the tests, but I will be out of town starting tomorrow. I'll go through it will most likely finalize this for a merge by Tuesday.

sneakers-the-rat · 2023-12-14T06:43:25Z

lmk what else needs to be written re: docs and tests, happy to do that! i try to be as labor-neutral as i can be when raising issues/PRs :). take ur time, no rush. also sorry for my tone earlier, frustrating day.

rly · 2024-03-15T14:35:42Z

@mavaylon1 can this be closed, now that #146 has been merged?

first

005bfbc

oruebel reviewed Dec 14, 2023

View reviewed changes

sneakers-the-rat reviewed Dec 14, 2023

View reviewed changes

src/hdmf_zarr/backend.py Outdated Show resolved Hide resolved

sneakers-the-rat mentioned this pull request Dec 14, 2023

Write references in compound datasets as an array #146

Merged

6 tasks

check

5654528

mavaylon1 closed this Mar 15, 2024

rly deleted the iter_ref branch November 26, 2024 06:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write references in compound datasets at the end #149

Write references in compound datasets at the end #149

mavaylon1 commented Dec 13, 2023 •

edited

Loading

mavaylon1 commented Dec 14, 2023

oruebel commented Dec 14, 2023

mavaylon1 commented Dec 14, 2023

oruebel Dec 14, 2023

sneakers-the-rat commented Dec 14, 2023

sneakers-the-rat commented Dec 14, 2023

mavaylon1 commented Dec 14, 2023

mavaylon1 commented Dec 14, 2023

mavaylon1 commented Dec 14, 2023

sneakers-the-rat commented Dec 14, 2023 •

edited

Loading

mavaylon1 commented Dec 14, 2023

sneakers-the-rat commented Dec 14, 2023

mavaylon1 commented Dec 14, 2023

sneakers-the-rat commented Dec 14, 2023

rly commented Mar 15, 2024

Write references in compound datasets at the end #149

Write references in compound datasets at the end #149

Conversation

mavaylon1 commented Dec 13, 2023 • edited Loading

Motivation

How to test the behavior?

Checklist

mavaylon1 commented Dec 14, 2023

oruebel commented Dec 14, 2023

mavaylon1 commented Dec 14, 2023

oruebel Dec 14, 2023

Choose a reason for hiding this comment

sneakers-the-rat commented Dec 14, 2023

sneakers-the-rat commented Dec 14, 2023

mavaylon1 commented Dec 14, 2023

mavaylon1 commented Dec 14, 2023

mavaylon1 commented Dec 14, 2023

sneakers-the-rat commented Dec 14, 2023 • edited Loading

mavaylon1 commented Dec 14, 2023

sneakers-the-rat commented Dec 14, 2023

mavaylon1 commented Dec 14, 2023

sneakers-the-rat commented Dec 14, 2023

rly commented Mar 15, 2024

mavaylon1 commented Dec 13, 2023 •

edited

Loading

sneakers-the-rat commented Dec 14, 2023 •

edited

Loading