[Feature]: Reusing Keys and Entities should be automatic #961

Closed
3 tasks done
mavaylon1 opened this issue Oct 5, 2023 · 1 comment
Assignees
Labels
category: enhancement improvements of code or code behavior priority: low alternative solution already working and/or relevant to only specific user(s)

Comments

@mavaylon1
Contributor

mavaylon1 commented Oct 5, 2023

What would you like to see added to HDMF?

When we first made HERD, it was not built for "bulk" adding. You can call add_ref in a loop; however, when reusing keys and entities you need to change the arguments passed to add_ref. To reuse a key, you need to call get_key and pass the resulting key object to add_ref. To reuse an entity, you need to omit the entity_uri parameter.

Say we want to take a DANDI set of NWB files and add references for subject and experimenter. It is not user friendly to require a try/except block that branches on whether the key or entity already exists.

from pynwb.resources import HERD
from pynwb import NWBHDF5IO, NWBFile
from glob import glob
from tqdm import tqdm

# Path to all the files
path = '/Users/mavaylon/Research/NWB/000015/sub*'

# Create HERD
herd = HERD()

# populate iteratively
folders = glob(path)
for folder in folders:
    for file in tqdm(glob(folder+'/*')):
        io = NWBHDF5IO(file, mode='r')
        read_file = io.read()
        #Add HERD for Subject
        try:
            entity = herd.get_entity(entity_id='NCBI_TAXON:10090')
            if entity is not None:
                raise ValueError()
            else:
                herd.add_ref(file=read_file,
                             container=read_file.subject,
                             key=read_file.subject.species,
                             entity_id = 'NCBI_TAXON:10090', # this assumes the same species for each file
                             entity_uri = 'https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=NCBI_TAXON:10090'
                             )
        except ValueError: # after the first use of an entity_id and key, you are required to reuse them
            herd.add_ref(file=read_file,
                         container=read_file.subject,
                         key=read_file.subject.species,
                         entity_id = 'NCBI_TAXON:10090'
                         )


        # Add HERD for Experimenter
        try:
            if len(read_file.experimenter)>1:
                breakpoint()
            entity = herd.get_entity(entity_id='0000-0001-6782-3819')  # assign the result; otherwise the stale subject entity is checked
            if entity is not None:
                raise ValueError()
            else:
                herd.add_ref(file=read_file,
                             container=read_file,
                             attribute="experimenter",
                             key=read_file.experimenter[0], # this assumes the experimenter is the same for each file
                             entity_id = '0000-0001-6782-3819',
                             entity_uri = 'https://orcid.org/0000-0001-6782-3819'
                             )

        except ValueError:
            herd.add_ref(file=read_file,
                          container=read_file,
                          attribute="experimenter",
                          key=read_file.experimenter[0],
                          entity_id = '0000-0001-6782-3819'
                          )
        io.close()  # close every file, not only those that reach the except branch

As of now, our "bulk" method is to use the TermSetWrapper, but we haven't actually tested it with duplicate data that would require reusing a key object. That case will fail when adding to HERD.

We need to address two things:

  1. Have add_ref resolve the right key itself when a key needs to be reused, rather than relying on a manual call to get_key by the user.
  2. Even though entity_id somewhat resolves on its own, an error is still raised telling the user to remove the URI when reusing an entity_id. This also hinders bulk adding, because the URI has to be removed manually. Should we make the strong assumption that the URI is always ignored when an existing entity_id is reused? I think this is fine.

Without some form of 1 and 2, we don't support a seamless bulk adding of references.
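
To illustrate point 2, here is a minimal sketch of what the workaround looks like once it is folded into a single helper. The function name add_ref_reusing_entity is hypothetical and not part of HDMF; it only assumes what the script above already relies on, namely that herd.get_entity returns None for an unknown entity_id and that add_ref accepts entity_id and entity_uri as keyword arguments.

```python
def add_ref_reusing_entity(herd, entity_id, entity_uri, **ref_kwargs):
    """Add a reference, sending entity_uri only the first time entity_id is seen.

    `herd` is any object exposing get_entity(entity_id=...) and add_ref(...),
    as used in the script above. This helper is hypothetical, not part of HDMF.
    """
    if herd.get_entity(entity_id=entity_id) is None:
        # First use of this entity: register it together with its URI.
        return herd.add_ref(entity_id=entity_id, entity_uri=entity_uri, **ref_kwargs)
    # Entity already registered: omit the URI so add_ref does not raise.
    return herd.add_ref(entity_id=entity_id, **ref_kwargs)
```

With a helper like this, the loop body could make one call per reference (passing file, container, key and so on through ref_kwargs) instead of a try/except pair. It does not address key reuse from point 1.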

Is your feature request related to a problem?

No response

What solution would you like?

Read Above.

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

@mavaylon1 mavaylon1 self-assigned this Oct 5, 2023
@oruebel
Contributor

oruebel commented Oct 5, 2023

  1. Have a way to modify add_ref to support resolving the right key if it needs to be reused and not rely on a manual call to get_key from the user.

Sounds reasonable. There may need to be some logic to specify the reuse behavior, e.g.: 1) reuse any matching key, 2) reuse a key only if the neurodata_type and relative path match, or 3) reuse a key only if the object_id matches.
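
As a rough sketch of how those three reuse options could be expressed, the snippet below uses a plain in-memory list of records rather than HERD's actual tables. KeyReuse, KeyRecord and find_reusable_key are hypothetical names introduced only for illustration; nothing here reflects how HDMF stores keys internally.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class KeyReuse(Enum):
    ANY_MATCH = "any"            # 1) reuse any key with the same text
    TYPE_AND_PATH = "type_path"  # 2) also require neurodata_type and relative path
    OBJECT_ID = "object_id"      # 3) also require the same object_id


@dataclass(frozen=True)
class KeyRecord:
    key: str
    object_id: str
    neurodata_type: str
    relative_path: str


def find_reusable_key(records: Iterable[KeyRecord],
                      candidate: KeyRecord,
                      policy: KeyReuse) -> Optional[KeyRecord]:
    """Return the first stored record that `candidate` may reuse, else None."""
    for rec in records:
        if rec.key != candidate.key:
            continue  # the key text must match under every policy
        if policy is KeyReuse.ANY_MATCH:
            return rec
        if policy is KeyReuse.TYPE_AND_PATH and (
            (rec.neurodata_type, rec.relative_path)
            == (candidate.neurodata_type, candidate.relative_path)
        ):
            return rec
        if policy is KeyReuse.OBJECT_ID and rec.object_id == candidate.object_id:
            return rec
    return None
```

A None result would mean a new key has to be created; any other result is the existing key to pass along. Which policy should be the default is left open here.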

  2. Even though entity_id somewhat resolves on its own, an error will still be raised to remove the "uri" if reusing the entity_id. Now this would also hinder bulk adding (having to manually remove). Should we make a strong assumption that when reusing an "id" to always ignore the URI? I think this is fine.

I think here we should raise a warning if the URI is different from what is already in HERD.
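
A minimal sketch of that warning behavior, assuming a plain dict that maps entity_id to the stored URI stands in for HERD's entity table. The function name resolve_entity_uri and the dict are hypothetical and only show the control flow, not HDMF code.

```python
import warnings


def resolve_entity_uri(known_uris, entity_id, entity_uri=None):
    """Return the URI to keep for entity_id, warning on a conflicting URI.

    `known_uris` is a dict {entity_id: uri} used here as a stand-in for the
    entities already stored; it is updated in place for new entities.
    """
    stored = known_uris.get(entity_id)
    if stored is None:
        if entity_uri is None:
            raise ValueError(f"New entity {entity_id!r} requires a URI.")
        known_uris[entity_id] = entity_uri
        return entity_uri
    if entity_uri is not None and entity_uri != stored:
        # Reusing an existing entity: keep the stored URI, but tell the user.
        warnings.warn(
            f"Ignoring URI {entity_uri!r} for existing entity {entity_id!r}; "
            f"keeping {stored!r}.",
            UserWarning,
            stacklevel=2,
        )
    return stored
```

Passing the same URI again, or no URI at all, stays silent under this sketch, so a bulk loop can always supply the URI without special-casing the first file.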

@mavaylon1 mavaylon1 mentioned this issue Oct 31, 2023
13 tasks
@rly rly added category: enhancement improvements of code or code behavior priority: low alternative solution already working and/or relevant to only specific user(s) labels Jan 24, 2024

3 participants