Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IMPORT kremin_2021 #133

Open
alvinwmtan opened this issue Aug 28, 2024 · 15 comments
Open

IMPORT kremin_2021 #133

alvinwmtan opened this issue Aug 28, 2024 · 15 comments

Comments

@alvinwmtan
Copy link
Contributor

alvinwmtan commented Aug 28, 2024

Kremin et al. (2021) has two data subsets: Eng–Fra bilingual and Eng–Spa bilingual. The former is included in montat_2022, but the latter is not; we will import this whole dataset separately.

@alvinwmtan
Copy link
Contributor Author

IDless import complete; ready for review

@alvinwmtan
Copy link
Contributor Author

alvinwmtan commented Aug 30, 2024

Checklist for code review v2024

To start:

  • Git pull this repo to get the latest version
  • Update your peekds and peekbankr to the latest version
    • Be sure to restart your R Session to apply these updates
  • Get the latest version of the dataset from osf (delete your raw_data, so that the script automatically downloads the data)
  • Run the import script
  • Does it run into issues due to missing libraries? during restructuring, import statements for libraries like janitor might have been lost in some datasets - re-add them if necessary
  • Does the validator complain about the processed data? Complain to Adrian (or fix the listed issues if you feel like it)

Common issues to check:

Trials

  • Are trials now unique between administrations?
  • Is exclusion info handled correctly? (look closely at how exclusion information is marked)

Trial Types

  • Check if the trial type IDs are created independently of administrations/subjects
  • Is vanilla_trial coded appropriately?

Stimuli

  • If the images are on osf, make sure the image path contains the image path on osf
  • Make sure each row represents a label-image association
    • the labels should be the words that the participants hear. For example, "apple" is okay, "red_apple_little" is wrong and was probably erroneously extracted from the file name
  • Are there items in the imported dataset not mentioned in the paper?
  • Are distractors represented correctly?
    • Special explanation for distractors: If an item only ever appeared in distractor position, it still gets its own row. The label is typically the label given to the image in the experiment description (e.g., "the distractor was an image of a chair"). If there is no obvious label provided in the experiment design, leave label blank.

Subjects

  • Does CDI data follow the new aux_data format?
  • Age rounded correctly? (decision: we do no rounding)

General

  • Double-check the citation and update it in the dataset table and make sure it’s consistent with the peekbank datasets google sheet: peekbank datasets
  • Are there any TODOs left in the code - resolve/double check
  • Review (or add) a ReadME (example)
    • Make sure any TO-DOs or other decision points in the comments of the code are documented in the ReadMe AND removed from the code to prevent ambiguity
  • General data sanity-checking (summary output helps here)
    • is there are the general numbers (e.g. # of participants, # of stimuli, average trials per administratoin) in the summary consistent with the paper? aoi_timepoints are hard to gague, but a super small number is probably bad
    • is the subject summary (age, sex distribution) approximately consistent with the paper? (note that it is not surprising if it is not identical - often we have a slightly different dataset and are not trying reproduce the exact numbers)
    • is the target side distribution skewed towards one side?
    • any weird trial durations?
    • do the cdi rawscore numbers match the instrument and measure?
    • is the exclusion % and the exclusion reasons sensible? (bearing in mind that we only have exclusion info for some datasets)
    • Inspect the timecourse and accuracy plots/output at the end of the import:
      • Compare timecourse patterns with paper (as best as possible)
      • Does the timing seem right? (accuracy spike later than PoD might be sensible, earlier is suspicious)
      • (if multiple conditions) Does the number conditions make sense in the context of the paper?
      • (if multiple conditions) Are the overall accuracies for conditions vastly different in a way not explained by the paper?
      • Any odd item-level patterns?
      • Any odd subject-level patterns?
    • Any large (unexpected) discrepancies between data reported in paper vs. data in the imported dataset?
  • After checking everything and rerunning the script: Upload the output to osf

@vboyce
Copy link
Contributor

vboyce commented Sep 10, 2024

@alvinwmtan Is the raw data for kremlin on the peekbank osf? (because I'm not able to download it and not seeing it there?)

@alvinwmtan
Copy link
Contributor Author

sorry, just uploaded. should be there now

@vboyce
Copy link
Contributor

vboyce commented Sep 10, 2024

Thanks! Might I also have "demo_comp.Rda"?

@alvinwmtan
Copy link
Contributor Author

ah yes sorry, forgot i had to copy it over from sander-montant_2022

@vboyce
Copy link
Contributor

vboyce commented Sep 10, 2024

sorry to keep asking for files (@alvinwmtan), but I can't find
target-distractor-pairs.csv
trial_info_fr.csv
trial_info_sp.csv
and also the images if we have them

@alvinwmtan
Copy link
Contributor Author

alvinwmtan commented Sep 10, 2024

my bad; added them. there are wmv files that could be screenshotted to grab the images but i haven't done so—feel free to!

note also that trial_info_fr.csv and trial_info_sp.csv were constructed by me rather than from the original raw_data; it is possible but just somewhat annoying to programmatically pull together all the different pieces of info needed.

@vboyce
Copy link
Contributor

vboyce commented Sep 13, 2024

Images could be pulled via screenshot
readme mentions that CDI data for the montreal subset exists (but has not been imported yet) (and DVAP -- another vocab measure)

@vboyce
Copy link
Contributor

vboyce commented Sep 13, 2024

@alvinwmtan I'm getting a validation error because of an aoi region that is all NA's -- it looks like this is coming from the fact that one of the datasets has aoi coordinates and the other doesn't (hand coded) and NAs get added to that one in a bind_rows?
Does that sound right? / Do you have ideas for fixing?

@alvinwmtan
Copy link
Contributor Author

this is correct (montreal has AOIs and princeton doesn't). somehow i didn't get a validation error when i ran it though? not sure why this is popping up

@vboyce
Copy link
Contributor

vboyce commented Sep 16, 2024

so the validation issue seems to not be about the NAs and more be about that there are multiple regions in the aoi_region_set, but there's only 1 in the trial_types? which I think I've traced back to

 if(!is.na(data$l_x_max[[1]])){
    
    trial_types$aoi_region_set_id <- 0

in the digest_data function at
https://github.com/peekbank/peekbank-data-import/blob/33e750103cb1249b35daa01c6a7e9de1da5a6749/helper_functions/idless_draft.R#L284C1-L284C39.

Commenting out that line does seem to "fix" things, but I don't know what it's purpose is -- @adriansteffan what is this line supposed to be doing?

@adriansteffan
Copy link
Contributor

so the validation issue seems to not be about the NAs and more be about that there are multiple regions in the aoi_region_set, but there's only 1 in the trial_types? which I think I've traced back to

 if(!is.na(data$l_x_max[[1]])){
    
    trial_types$aoi_region_set_id <- 0

in the digest_data function at https://github.com/peekbank/peekbank-data-import/blob/33e750103cb1249b35daa01c6a7e9de1da5a6749/helper_functions/idless_draft.R#L284C1-L284C39.

Commenting out that line does seem to "fix" things, but I don't know what it's purpose is -- @adriansteffan what is this line supposed to be doing?

The line is supposed to remind me to read my code more carefully before committing. I removed it, thanks for the catch!

@adriansteffan
Copy link
Contributor

Stimulus screenshots are on osf and image paths updated. Will look into CDI/Lang Exposure/DVAP next

@adriansteffan
Copy link
Contributor

Imported lang exposure/dvap data, cdi data still needs some checking if the data actually matches our sample

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants