Proposal for IDless imports #123
Comments
Just decided in the meeting that this is worth a try; I will have a look.
I like this idea, Adrian! I think IDs are a huge source of error (and also take a lot of mental energy to think through), so it would be awesome to solve this "once" rather than solving it anew for every dataset. Happy to help look at this!
Note: Put age into this and calculate it from lab_age and lab_age_units.
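A minimal sketch of what that age calculation could look like. The unit names and the conversion factors are assumptions on my part; the actual unit vocabulary in the import format may differ.

```python
# Hypothetical conversion of lab_age + lab_age_units to a canonical
# age in months. Unit names here are assumed, not from the spec.
MONTHS_PER_UNIT = {
    "months": 1.0,
    "days": 12.0 / 365.25,
    "years": 12.0,
}

def age_in_months(lab_age: float, lab_age_units: str) -> float:
    """Return age in months, raising on unrecognized units."""
    try:
        return lab_age * MONTHS_PER_UNIT[lab_age_units.lower()]
    except KeyError:
        raise ValueError(f"unknown age unit: {lab_age_units!r}")

# age_in_months(18, "months") -> 18.0
```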
First version is done, still needs:
Since IDs (and linking them) have been the biggest source of confusion, errors, code duplication, and bloat when importing datasets into the peekbank format, we could think about a future approach that forgoes IDs completely on the importer side and handles them via peekds. Since IDs are uniquely defined by properties within each table, they can be inferred.
In that version, all of the data would be put into a big dataframe with one row per xy/aoi timepoint, and all other data (stimulus info, subject info, etc.) would be placed redundantly in every row. This table is then fed into peekds, which infers the IDs and splits the data into the separate tables. This approach would result in a somewhat larger intermediate dataset. However, most of the datasets we get arrive as single large unnormalized tables anyway, so adding some of the peekbank variables to that should not add much bloat.
The productivity gains in the MB2 imports were striking, so I think this approach is promising.
From a software upkeep/error reduction perspective: right now, every single import script has to encode the rules for which variables uniquely identify rows within a table when creating IDs. The "huge table into peekds" approach would put these rules in one centralized location, leaving the import scripts to deal only with dataset-specific functionality.
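One error-reduction benefit of centralizing those rules: before minting IDs, the shared code could verify that the declared key columns really do identify rows, i.e. that every other column is constant within each key combination. This is a hypothetical sketch; the table name and columns are made up for illustration.

```python
import pandas as pd

# Central registry of uniqueness rules (illustrative names).
UNIQUE_KEYS = {"administrations": ["lab_subject_id", "session_order"]}

def check_keys(df: pd.DataFrame, table: str) -> None:
    """Raise if the declared key columns do not uniquely determine
    the remaining columns of the table."""
    keys = UNIQUE_KEYS[table]
    non_key = [c for c in df.columns if c not in keys]
    # Each key combination must map to exactly one value per other column.
    counts = df.groupby(keys)[non_key].nunique()
    if (counts > 1).any().any():
        raise ValueError(
            f"{table}: key columns {keys} do not uniquely identify rows"
        )
```

A check like this would catch the duplicated-ID bugs centrally, once, instead of each import script rediscovering them.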
Happy to have more opinions on this before I go and write the code!