
Proposal for IDless imports #123

Open
adriansteffan opened this issue May 24, 2024 · 4 comments
Labels: import (An issue regarding an import script), pipeline (An issue regarding the general data pipeline from import to database)

Comments


adriansteffan commented May 24, 2024

Since IDs (and linking them) have been the biggest source of confusion, errors, code duplication, and bloat when importing datasets into the peekbank format, we could think about a future approach that forgoes IDs completely on the importer side and handles them via peekds. Since each ID is uniquely determined by properties within its table, IDs can be inferred automatically.

In that version, all of the data would be put into one big dataframe with one row per xy/aoi timepoint, and all other data (stimulus info, subject info, etc.) would be placed redundantly into every row. This table is then fed into peekds, which infers the IDs and sorts the data into the separate tables. This approach would result in a slightly larger intermediate dataset. However, most of the datasets we get arrive as single large unnormalized tables anyway, so adding some of the peekbank variables to those should not add much bloat.
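To make the idea concrete, here is a minimal sketch of the "infer IDs from a wide table" step in Python/pandas. The column names (`lab_subject_id`, `stimulus_label`, `t_norm`) and the helper `infer_ids` are illustrative assumptions, not the actual peekbank schema or the peekds API:

```python
import pandas as pd

# Hypothetical wide table: one row per looking timepoint, with subject and
# stimulus properties repeated redundantly on every row.
wide = pd.DataFrame({
    "lab_subject_id": ["s1", "s1", "s2", "s2"],
    "stimulus_label": ["dog", "cat", "dog", "cat"],
    "t_norm":         [0, 100, 0, 100],
})

def infer_ids(df, cols, id_name):
    """Assign a stable integer ID to each unique combination of `cols`."""
    key = df[cols].apply(tuple, axis=1)
    df[id_name] = pd.factorize(key)[0]
    return df

# IDs are fully determined by the identifying columns of each table,
# so no import script ever has to construct them by hand.
wide = infer_ids(wide, ["lab_subject_id"], "subject_id")
wide = infer_ids(wide, ["stimulus_label"], "stimulus_id")

# Split the wide table into normalized tables, one row per unique ID.
subjects = wide[["subject_id", "lab_subject_id"]].drop_duplicates()
stimuli = wide[["stimulus_id", "stimulus_label"]].drop_duplicates()
```

The key property is that the ID-construction rules live in one function rather than being copied into every import script.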

The productivity gains in the MB2 imports were striking, so I think this approach is promising.

From a software upkeep/error reduction perspective: right now, every single import script has to encode the rules about which variables uniquely identify rows within a table when creating IDs. The "huge table into peekds" approach would centralize those rules, leaving the import scripts to deal only with dataset-specific functionality.

Happy to have more opinions on this before I go and write the code!

@adriansteffan added the import and pipeline labels and removed the info/input needed label on May 24, 2024
@adriansteffan (Contributor, Author)

We just decided in the meeting that this is worth a try; I will have a look.

@adriansteffan adriansteffan self-assigned this May 24, 2024
@mzettersten (Contributor)

I like this idea, Adrian!! I think IDs are a huge source of error (and also take a lot of mental energy to think through), so it would be awesome to solve this "once" rather than solving it each time for every dataset. Happy to help look at this!

@adriansteffan (Contributor, Author)

Note: include age in this and calculate it from lab_age and lab_age_units.
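A minimal sketch of that age normalization, assuming the digestion function derives a canonical age in months from the lab-reported value and unit. The unit names and the function name are assumptions, not the actual peekds API:

```python
# Average month length in days (365.25 / 12), used for day-to-month conversion.
DAYS_PER_MONTH = 365.25 / 12

def normalize_age(lab_age, lab_age_units):
    """Convert a lab-reported age into months."""
    if lab_age_units == "months":
        return lab_age
    if lab_age_units == "days":
        return lab_age / DAYS_PER_MONTH
    if lab_age_units == "years":
        return lab_age * 12
    raise ValueError(f"unknown age unit: {lab_age_units!r}")
```

Centralizing this conversion means importers only pass through the raw lab values.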

@adriansteffan (Contributor, Author)

adriansteffan commented Aug 22, 2024

The first version is done; it still needs:

  • testing with legacy imported datasets to check equivalence (IDs will differ, but the structure/content should be the same)
  • an automated testing framework for legacy comparisons (Stephan is working on it)
  • testing for datasets with xy coordinates
  • a CDI helper to facilitate CDI data import
  • documentation on how to use it (only the spreadsheet left)
  • validation that helps importers with common errors
  • a marker in the dataset table recording which import method each dataset uses (legacy vs. idless)
  • integration into peekds (once it is more mature; maybe also think about combining peekbankr and peekds)
  • quality-of-life additions (e.g. accepting L, l, r, R as inputs for target_side and transforming them in the function, same for other common inputs)
  • the current version does not support multiple AOI region sets per dataset; this should be easy to fix, though (?)
  • automatic AOI generation for datasets that provide xy data (we trust our AOIs more than theirs)
  • exposing the column list from the idless import, to give importers an easy way to filter down their work-in-progress wide table and make the idless import easier on memory
  • a better aux data nesting helper function (maybe move this into a separate issue, as it might be a bigger thing)
  • making language entry more permissive by mapping languages to ISO codes in the digestion function
  • relaxing the "space after comma for multiple languages in ISO format" rule by handling that in the digestion function
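The permissive-input items on this list (target_side variants, comma spacing in language lists) could be handled by small canonicalization helpers in the digestion function. A sketch under assumed names and mappings, not the actual peekds code:

```python
# Map common raw target_side spellings onto the canonical values.
# The accepted variants here are examples from the checklist above.
TARGET_SIDE_MAP = {"l": "left", "left": "left", "r": "right", "right": "right"}

def canonical_target_side(raw):
    """Accept L, l, r, R (and full words) and return 'left'/'right'."""
    try:
        return TARGET_SIDE_MAP[raw.strip().lower()]
    except KeyError:
        raise ValueError(f"unrecognized target_side value: {raw!r}")

def split_languages(raw):
    """Relax the 'space after comma' rule: split on commas, strip each code."""
    return [code.strip() for code in raw.split(",")]
```

Doing this once in the digestion step keeps each import script free of per-dataset cleanup boilerplate.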
