-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write references in compound datasets at the end #149
Conversation
@oruebel As I was going through the tests, I saw that the way I had to interact with the read data differed from before. Diving deeper it comes down to what I said above where we are now using tuples, which cannot change. The fix (even if it is a little hacky) is to convert to a list when trying to use index notation to "get". I wanted your thoughts on the matter and I also think this may need to go all the way up the parent hierarchy to HDMFDataset. |
Could you please point me to the lines of code where the tuple is being modified on read so I can take a look to better understand what is going. I'm not sure right now why the tuples are being modified on read (maybe to resolve the references). |
|
rows = list(copy(super().__getitem__(arg))) | ||
# breakpoint() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand what you are saying correctly, the issue appears when self.__swap_refs
is being called because the individual rows are now represented as tuples. I.e, rows
is either a tuple (if arg
is an int) or a list of tuples (if arg
selects multiple rows). If that is the issue, then I think the we could update self.__swap_refs
to return a new instance of the row (rather than updating the existing one). Here is what I think might work:
def __getitem__(self, arg):
rows = copy(super().__getitem__(arg))
if np.issubdtype(type(arg), np.integer):
rows = self.__swap_refs(rows)
else:
rows = [self.__swap_refs(row) for row in rows]
return rows
def __swap_refs(self, row):
updated_row = list(row) # convert tuple to a list so we can update it
for i in self.__refgetters:
getref = self.__refgetters[i]
updated_row[i] = getref(row[i])
return updated_row # TODO if we want to keep these as tuples then we could convert them back here
I was just going to work on #146 and saw this parallel PR, should I work on this instead or what |
Wrote this here: #146 (comment) but that tuple problem is just from not setting the dtype in the zarr dataset, so it ignores the dtype you give to So do this: hdmf-zarr/src/hdmf_zarr/backend.py Line 1044 in 8100341
instead of this: hdmf-zarr/src/hdmf_zarr/backend.py Line 1016 in 005bfbc
|
Yes please |
Interesting. Could you explain more? |
Oh it stores it as <class 'numpy.void'> when you provide a dtype instead of a tuple. |
Alright, lmk how - i don't have commit permissions on this repo so can't modify this branch. Usually bad etiquette to take the several hours of work i've done here and just rewrite it without credit, but whatevs.
right, and that's the desired behavior for storing a structured array, which is what I thought the goal was. As this PR is written, the constructed |
I believe we were working on this in parallel. We can just keep it to your branch on your fork to give you credit. The intent was for me to dig around from the last checkpoint I saw on yours. Let me know when you're done and I'll review it. I'll continue to play around here for educational purposes. |
ya sorry i don't mean to be prickly but this has happened a number of times to me lol and usually i don't care, but when last looking for a job it actually was important to be able to demonstrate contribs. I think the patch is done except for tests and docs? |
I can check the tests, but I will be out of town starting tomorrow. I'll go through it will most likely finalize this for a merge by Tuesday. |
lmk what else needs to be written re: docs and tests, happy to do that! i try to be as labor-neutral as i can be when raising issues/PRs :). take ur time, no rush. also sorry for my tone earlier, frustrating day. |
@mavaylon1 can this be closed, now that #146 has been merged? |
Motivation
What was the reasoning behind this change? Please explain the changes briefly.
Fix #144
Notes:
The fix at face value is the following:
TLDR: We want to write outside of the the loop. What does this lead to?
To keep our compound dataset in its original form (i.e not expanded) we can use a structured array an pull the dtypes from the builder. However, it is required that the "data" in the array be in tuples and not lists (I believe it can also be in np.arrays but I have not tested this).
Having these as tuples makes things a bit hard when retrieving data on read. For example, in the test
test_read_reference_compound
If I try to run
read_builder['data'][0]
I will get an error that*** TypeError: 'tuple' object does not support item assignment
. Digging deeper it is because ourget_item
swaps/will reassign values in the tuple. This is not allowed because it is a tuple. Now I can convert that to a list on_get_item_
and that does work. However, it is a little hacky. It also doesn't cover when my index is ":" vs a single integer.How to test the behavior?
Checklist
ruff
from the source directory.