Write references in compound datasets at the end #149

Closed · wants to merge 2 commits

Conversation

@mavaylon1 (Contributor) commented Dec 13, 2023

Motivation

What was the reasoning behind this change? Please explain the changes briefly.
Fix #144

Notes:
The fix at face value is the following:

    # resolve the references for each row up front, collecting the rewritten rows
    new_items = []
    for j, item in enumerate(data):
        new_item = list(item)
        for i in refs:
            new_item[i] = self.__get_ref(item[i], export_source=export_source)
        new_items.append(tuple(new_item))

    # build a structured dtype from the builder's dtype spec, mapping
    # Python str (type('string') is str) to a fixed-width string type
    generated_dtype = []
    for item in builder.dtype:
        if item['dtype'] == type('string'):
            item['dtype'] = 'U25'
        generated_dtype.append((item['name'], item['dtype']))

    # a single write, outside the loop
    dset[...] = np.array(new_items, dtype=generated_dtype)

TL;DR: We want to write outside of the loop. What does this lead to?

To keep our compound dataset in its original form (i.e., not expanded), we can use a structured array and pull the dtypes from the builder. However, the "data" in the array must be tuples, not lists (I believe np.arrays may also work, but I have not tested this).
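As a minimal standalone sketch of that constraint (the field names and values here are made up for illustration, not the builder's actual dtype), NumPy accepts a list of tuples for a structured dtype but not a list of lists:

    import numpy as np

    # hypothetical compound dtype mirroring generated_dtype above: an object
    # field for the resolved reference plus a fixed-width string field
    generated_dtype = np.dtype([('ref', 'O'), ('name', 'U25')])

    rows = [('ref0', 'a'), ('ref1', 'b')]        # one tuple per record
    arr = np.array(rows, dtype=generated_dtype)  # works
    print(arr['name'])                           # ['a' 'b']

    # np.array([['ref0', 'a'], ['ref1', 'b']], dtype=generated_dtype) does NOT
    # treat each inner list as a record: NumPy either errors or broadcasts each
    # scalar across all fields, yielding an array of the wrong shape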

Having these as tuples makes things a bit hard when retrieving data on read. For example, consider the test test_read_reference_compound:

    def test_read_reference_compound(self):
        self.test_write_reference_compound()
        self.read()
        builder = self.createReferenceCompoundBuilder()['ref_dataset']
        read_builder = self.root['ref_dataset']

If I try to run read_builder['data'][0], I get *** TypeError: 'tuple' object does not support item assignment. Digging deeper, it is because our __getitem__ swaps/reassigns values in the tuple, which tuples do not allow. I can convert the tuple to a list in __getitem__, and that does work; however, it is a little hacky, and it also doesn't cover the case where my index is ":" rather than a single integer.
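For reference, the failure is plain tuple immutability; a toy reproduction (values made up) of what __getitem__ runs into:

    row = ('<ref>', 'a')       # a compound row read back as a tuple
    row[0] = '<resolved ref>'  # TypeError: 'tuple' object does not support item assignment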

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked running ruff from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

@mavaylon1 (Contributor, Author)

@oruebel As I was going through the tests, I saw that the way I had to interact with the read data differed from before. Diving deeper, it comes down to what I said above: we are now using tuples, which cannot be changed. The fix (even if it is a little hacky) is to convert to a list when trying to use index notation to "get". I wanted your thoughts on the matter, and I also think this may need to go all the way up the parent hierarchy to HDMFDataset.

@oruebel (Contributor) commented Dec 14, 2023

> Digging deeper, it is because our __getitem__ swaps/reassigns values in the tuple.

Could you please point me to the lines of code where the tuple is being modified on read, so I can take a look and better understand what is going on. I'm not sure right now why the tuples are being modified on read (maybe to resolve the references).

@mavaylon1 (Contributor, Author)

> Digging deeper, it is because our __getitem__ swaps/reassigns values in the tuple.
>
> Could you please point me to the lines of code where the tuple is being modified on read, so I can take a look and better understand what is going on.

Comment on lines +154 to +155:

    rows = list(copy(super().__getitem__(arg)))
    # breakpoint()
@oruebel (Contributor)

If I understand what you are saying correctly, the issue appears when self.__swap_refs is being called, because the individual rows are now represented as tuples. I.e., rows is either a tuple (if arg is an int) or a list of tuples (if arg selects multiple rows). If that is the issue, then I think we could update self.__swap_refs to return a new instance of the row (rather than updating the existing one). Here is what I think might work:

    def __getitem__(self, arg):
        rows = copy(super().__getitem__(arg))
        if np.issubdtype(type(arg), np.integer):
            rows = self.__swap_refs(rows)
        else:
            rows = [self.__swap_refs(row) for row in rows]
        return rows

    def __swap_refs(self, row):
        updated_row = list(row)  # convert the tuple to a list so we can update it
        for i in self.__refgetters:
            getref = self.__refgetters[i]
            updated_row[i] = getref(row[i])
        return updated_row  # TODO: if we want to keep these as tuples, convert back here
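To see the proposed approach in isolation, here is a toy standalone sketch (resolve_ref and the refgetters dict are stand-ins invented for this example, not the actual class attributes):

    # stand-in for the per-column reference getters in self.__refgetters
    def resolve_ref(ref):
        return f"resolved({ref})"

    refgetters = {0: resolve_ref}  # column 0 holds object references

    def swap_refs(row):
        updated_row = list(row)  # copy: tuples cannot be updated in place
        for i, getref in refgetters.items():
            updated_row[i] = getref(row[i])
        return updated_row

    print(swap_refs(('ref0', 'a')))  # ['resolved(ref0)', 'a']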

@sneakers-the-rat (Contributor)

I was just going to work on #146 and saw this parallel PR, should I work on this instead or what

@sneakers-the-rat (Contributor)

Wrote this here: #146 (comment)

but that tuple problem is just from not setting the dtype in the zarr dataset, so Zarr ignores the dtype you give to np.array and just stores the rows as tuples. If you set the dtype correctly, it behaves just fine.

So do this:

    dtype=dtype,

instead of this:

    dtype=object,
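For context, a minimal sketch with plain zarr (not hdmf-zarr; this assumes the zarr v2 create_dataset API and an in-memory group) showing that a structured dtype survives the round trip:

    import numpy as np
    import zarr

    dtype = np.dtype([('ref', 'U25'), ('name', 'U25')])
    arr = np.array([('ref0', 'a'), ('ref1', 'b')], dtype=dtype)

    group = zarr.group()  # in-memory store
    dset = group.create_dataset('compound', shape=arr.shape, dtype=dtype)
    dset[...] = arr

    print(type(dset[0]))    # <class 'numpy.void'> -- the structured dtype survives
    print(dset[0]['name'])  # a

With dtype=object, zarr instead needs an object codec and each row round-trips as a plain Python object, which is where the serialized tuples come from.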

@mavaylon1 (Contributor, Author)

> I was just going to work on #146 and saw this parallel PR, should I work on this instead or what

Yes please

@mavaylon1 (Contributor, Author)

> read_builder['data'][0]

Interesting. Could you explain more?

@mavaylon1 (Contributor, Author)

> Wrote this here: #146 (comment)
>
> but that tuple problem is just from not setting the dtype in the zarr dataset, so Zarr ignores the dtype you give to np.array and just stores the rows as tuples. If you set the dtype correctly, it behaves just fine.
>
> So do this:
>
>     dtype=dtype,
>
> instead of this:
>
>     dtype=object,

Oh, it stores each row as <class 'numpy.void'> when you provide a dtype, instead of as a tuple.

@sneakers-the-rat (Contributor) commented Dec 14, 2023

> Yes please

Alright, lmk how - i don't have commit permissions on this repo so can't modify this branch. Usually bad etiquette to take the several hours of work i've done here and just rewrite it without credit, but whatevs.

> Oh, it stores each row as <class 'numpy.void'> when you provide a dtype, instead of as a tuple.

right, and that's the desired behavior for storing a structured array, which is what I thought the goal was. As this PR is written, the constructed dtype is just ignored and the array is just stored as serialized tuples.

@mavaylon1 (Contributor, Author)

>> Yes please
>
> Alright, lmk how - i don't have commit permissions on this repo so can't modify this branch. Usually bad etiquette to take the several hours of work i've done here and just rewrite it without credit, but whatevs.
>
>> Oh, it stores each row as <class 'numpy.void'> when you provide a dtype, instead of as a tuple.
>
> right, and that's the desired behavior for storing a structured array, which is what I thought the goal was. As this PR is written, the constructed dtype is just ignored and the array is just stored as serialized tuples.

I believe we were working on this in parallel. We can just keep it to your branch on your fork to give you credit. The intent was for me to dig around from the last checkpoint I saw on yours. Let me know when you're done and I'll review it. I'll continue to play around here for educational purposes.

@sneakers-the-rat (Contributor)

ya sorry i don't mean to be prickly but this has happened a number of times to me lol and usually i don't care, but when last looking for a job it actually was important to be able to demonstrate contribs. I think the patch is done except for tests and docs?

@mavaylon1 (Contributor, Author)

> ya sorry i don't mean to be prickly but this has happened a number of times to me lol and usually i don't care, but when last looking for a job it actually was important to be able to demonstrate contribs. I think the patch is done except for tests and docs?

I can check the tests, but I will be out of town starting tomorrow. I'll go through it and will most likely finalize this for a merge by Tuesday.

@sneakers-the-rat (Contributor)

lmk what else needs to be written re: docs and tests, happy to do that! i try to be as labor-neutral as i can be when raising issues/PRs :). take ur time, no rush. also sorry for my tone earlier, frustrating day.

@rly (Contributor) commented Mar 15, 2024

@mavaylon1 can this be closed, now that #146 has been merged?

@mavaylon1 mavaylon1 closed this Mar 15, 2024
@rly rly deleted the iter_ref branch November 26, 2024 06:02
Successfully merging this pull request may close these issues:

[Bug]: Iterative write for compound reference datasets takes 5ever