Proposal for IDless imports #123
Comments
Just decided in the meeting that this is worth a try; I will have a look.
I like this idea, Adrian! I think IDs are a huge source of error (and also take a lot of mental energy to think through), so it would be awesome to solve this "once" rather than solving it anew for every dataset. Happy to help look at this!
Note: Put age into this and calculate it from lab_age and lab_age_units.
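A minimal sketch of what that age calculation could look like. The unit names and the conversion factors are assumptions on my part; the actual unit vocabulary in the import format may differ.

```python
# Hypothetical conversion of lab_age + lab_age_units to a canonical
# age in months. Unit names here are assumed, not from the spec.
MONTHS_PER_UNIT = {
    "months": 1.0,
    "days": 12.0 / 365.25,
    "years": 12.0,
}

def age_in_months(lab_age: float, lab_age_units: str) -> float:
    """Return age in months, raising on unrecognized units."""
    try:
        return lab_age * MONTHS_PER_UNIT[lab_age_units.lower()]
    except KeyError:
        raise ValueError(f"unknown age unit: {lab_age_units!r}")

# age_in_months(18, "months") -> 18.0
```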
First version is done, still needs:
Since IDs (and linking them) have been the biggest source of confusion, errors, code duplication, and bloat when importing datasets into the peekbank format, we could think about a future approach that forgoes IDs completely on the importer side and handles them via peekds. Since IDs are uniquely defined by properties within each table, they can be inferred.
In that version, all of the data would be put into a big dataframe with one row per xy/aoi timepoint, and all other data (stimulus info, subject info, etc.) would be placed redundantly in every row. This table is then fed into peekds, which infers the IDs and splits the data into the separate tables. This approach would result in a somewhat larger intermediate dataset. However, most of the datasets we get arrive as single large unnormalized tables anyway, so adding some of the peekbank variables to that should not add much bloat.
The productivity gains in the MB2 imports were striking, so I think this approach is promising.
From a software upkeep/error reduction perspective: right now, every single import script has to encode the rules for which variables uniquely identify rows within a table when creating IDs. The "huge table into peekds" approach would put these rules in one centralized location, leaving the import scripts to deal only with dataset-specific functionality.
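One error-reduction benefit of centralizing those rules: before minting IDs, the shared code could verify that the declared key columns really do identify rows, i.e. that every other column is constant within each key combination. This is a hypothetical sketch; the table name and columns are made up for illustration.

```python
import pandas as pd

# Central registry of uniqueness rules (illustrative names).
UNIQUE_KEYS = {"administrations": ["lab_subject_id", "session_order"]}

def check_keys(df: pd.DataFrame, table: str) -> None:
    """Raise if the declared key columns do not uniquely determine
    the remaining columns of the table."""
    keys = UNIQUE_KEYS[table]
    non_key = [c for c in df.columns if c not in keys]
    # Each key combination must map to exactly one value per other column.
    counts = df.groupby(keys)[non_key].nunique()
    if (counts > 1).any().any():
        raise ValueError(
            f"{table}: key columns {keys} do not uniquely identify rows"
        )
```

A check like this would catch the duplicated-ID bugs centrally, once, instead of each import script rediscovering them.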
Happy to have more opinions on this before I go and write the code!