IMPORT sander-montant_2022 #115

Open
mzettersten opened this issue May 17, 2024 · 11 comments

@mzettersten
Contributor

No description provided.

@adriansteffan
Contributor

import not started, refer to #125 first

@alvinwmtan
Contributor

Go ahead with import; #125 has been resolved.

alvinwmtan changed the title from IMPORT montat_2022 to IMPORT sander-montant_2022 on Aug 29, 2024
@alvinwmtan
Contributor

First-pass IDless import complete. Ready for code review.

@alvinwmtan
Contributor

alvinwmtan commented Aug 29, 2024

Checklist for code review v2024

To start:

  • Git pull this repo to get the latest version
  • Update your peekds and peekbankr to the latest versions (see the setup sketch after this list)
    • Be sure to restart your R session to apply these updates
  • Get the latest version of the dataset from osf (delete your raw_data so that the script automatically downloads the data)
  • Run the import script
  • Does it run into issues due to missing libraries? During restructuring, import statements for libraries like janitor may have been lost from some datasets - re-add them if necessary
  • Does the validator complain about the processed data? Complain to Adrian (or fix the listed issues if you feel like it)
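
For convenience, a minimal setup sketch; the GitHub paths for peekds and peekbankr and the import script location are assumptions, so check the actual repos before running:

```r
# Assumed install locations; verify the actual repos before running.
# install.packages("remotes")
remotes::install_github("langcog/peekds")
remotes::install_github("langcog/peekbankr")

# Restart the R session, then run the import script for this dataset
# (path is illustrative):
source("data/sander-montant_2022/import.R")
```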

Common issues to check:

Trials

  • Are trials now unique between administrations? (see the sketch after this list)
  • Is exclusion info handled correctly? (look closely at how exclusion information is marked)
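
A minimal sketch of the uniqueness check, assuming the processed tables follow the peekbank schema (that aoi_timepoints links administration_id and trial_id is an assumption):

```r
library(dplyr)

# Each trial_id should belong to exactly one administration;
# any rows returned here mean trials are shared across administrations.
aoi_timepoints %>%
  distinct(administration_id, trial_id) %>%
  count(trial_id) %>%
  filter(n > 1)
```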

Trial Types

  • Check if the trial type IDs are created independently of administrations/subjects (see the sketch after this list)
  • Is vanilla_trial coded appropriately?
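
A minimal sketch of one way to check this, assuming trial_types follows the peekbank schema (target_id, distractor_id, target_side, and condition are assumed column names):

```r
library(dplyr)

# The same stimulus/condition combination should not get multiple
# trial_type_ids; duplicates suggest IDs were created per administration.
trial_types %>%
  count(target_id, distractor_id, target_side, condition) %>%
  filter(n > 1)
```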

Stimuli

  • If the images are on osf, make sure the image path in the dataset matches the image's path on osf
  • Make sure each row represents a label-image association
    • the labels should be the words that the participants hear. For example, "apple" is okay; "red_apple_little" is wrong and was probably erroneously extracted from the file name (see the sketch after this list)
  • Are there items in the imported dataset not mentioned in the paper?
  • Are distractors represented correctly?
    • Special explanation for distractors: If an item only ever appeared in distractor position, it still gets its own row. The label is typically the label given to the image in the experiment description (e.g., "the distractor was an image of a chair"). If there is no obvious label provided in the experiment design, leave label blank.
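
A quick heuristic for catching file-name-like labels, assuming a stimuli table with an english_stimulus_label column (the name is an assumption based on the peekbank schema):

```r
library(dplyr)

# Flag labels that look like file names rather than heard words:
# underscores, dots, or digits are all suspicious in a spoken label.
stimuli %>%
  filter(grepl("[_.]|[0-9]", english_stimulus_label)) %>%
  distinct(english_stimulus_label)
```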

Subjects

  • Does CDI data follow the new aux_data format? (see the sketch after this list)
  • Is age rounded correctly? (decision: we do no rounding)
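
A minimal sketch of both checks, assuming ages (in months) live in administrations and that aux_data is stored as a JSON string in subjects$subject_aux_data (all names are assumptions based on the peekbank schema):

```r
library(dplyr)

# If every age is integer-valued, rounding probably happened somewhere.
administrations %>%
  summarize(frac_integer_ages = mean(age == round(age), na.rm = TRUE))

# Eyeball one subject's aux_data to confirm it follows the new CDI format.
jsonlite::fromJSON(subjects$subject_aux_data[[1]])
```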

General

  • Double-check the citation, update it in the dataset table, and make sure it's consistent with the peekbank datasets google sheet
  • Are there any TODOs left in the code? Resolve/double-check them
  • Review (or add) a README (example)
    • Make sure any TODOs or other decision points in the comments of the code are documented in the README AND removed from the code to prevent ambiguity
  • General data sanity-checking (summary output helps here; see the sketch after this checklist)
    • are the general numbers (e.g. # of participants, # of stimuli, average trials per administration) in the summary consistent with the paper? aoi_timepoints are hard to gauge, but a super small number is probably bad
    • is the subject summary (age, sex distribution) approximately consistent with the paper? (note that it is not surprising if it is not identical - often we have a slightly different dataset and are not trying to reproduce the exact numbers)
    • is the target side distribution skewed towards one side?
    • any weird trial durations?
    • do the cdi rawscore numbers match the instrument and measure?
    • is the exclusion % and the exclusion reasons sensible? (bearing in mind that we only have exclusion info for some datasets)
    • Inspect the timecourse and accuracy plots/output at the end of the import:
      • Compare timecourse patterns with paper (as best as possible)
      • Does the timing seem right? (an accuracy spike later than the point of disambiguation (PoD) might be sensible; earlier is suspicious)
      • (if multiple conditions) Does the number of conditions make sense in the context of the paper?
      • (if multiple conditions) Are the overall accuracies for conditions vastly different in a way not explained by the paper?
      • Any odd item-level patterns?
      • Any odd subject-level patterns?
    • Any large (unexpected) discrepancies between data reported in paper vs. data in the imported dataset?
  • After checking everything and rerunning the script: Upload the output to osf
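
A minimal sketch of the headline sanity numbers, assuming the processed tables follow the peekbank schema (all table and column names are assumptions):

```r
library(dplyr)

n_distinct(subjects$subject_id)    # number of participants
n_distinct(stimuli$stimulus_id)    # number of stimuli

# Average trials per administration, to compare against the paper.
aoi_timepoints %>%
  distinct(administration_id, trial_id) %>%
  count(administration_id) %>%
  summarize(mean_trials = mean(n))

# Target side distribution; a strong skew toward one side is suspicious.
trials %>%
  left_join(trial_types, by = "trial_type_id") %>%
  count(target_side)
```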

@vboyce
Contributor

vboyce commented Sep 11, 2024

Exclusions look to be participant-level only, not trial-level (probably all we have).

full_phrase is missing for some of the data (presumably because we don't have it?) (Alvin confirms we don't have it)

[resolved] Looks like trial_type info is coming from a trial_info csv, possibly coded by someone (Alvin?) off of the raw stimuli. I was wondering why there are trials that are marked vanilla but have condition mispronounced; the condition column of trial_info was miscoded (very understandably) and has been corrected.

@vboyce
Contributor

vboyce commented Sep 13, 2024

I think everything is good except I couldn't track down the images.

From the readme, it sounds like the unpublished study (bh2017 but with younger kids) should have videos that could be screenshotted somewhere, but I didn't find them in a cursory look through osf repos.

The schott osf is still private, and the corresponding github repo (https://github.com/e-schott/CrossLanguagePhonologicalOverlap) doesn't seem to have stimuli. Idk if it's worth asking for the stimuli.

@alvinwmtan
Contributor

@vboyce bh2017 + unpub video files can be found here: https://osf.io/htn9j/ (it would also be good to update bh2017 if you get the images; ref #97)

@mzettersten
Contributor Author

@vboyce Should I reach out and ask? Maybe for images for all of the various projects, if that makes sense?

@vboyce
Contributor

vboyce commented Sep 16, 2024

@mzettersten reaching out for the various projects for image stimuli would be great!

and @alvinwmtan thanks, I can do the screenshotting for bh2017 and unpub and update both places!

@vboyce
Contributor

vboyce commented Sep 17, 2024

Hmm, do the bh2017 files open for you, @alvinwmtan? For me the two bouche ones do, but the others seem corrupted or to have the wrong extension, and I can't open them.

@adriansteffan
Contributor

the videos needed some convincing, but I beat them over the head with ffmpeg and they told me their secrets. I have extracted the stimuli and updated the file paths for both the unpublished sample of this dataset and for byers-heinlein_2017.
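
For future imports of this kind, a sketch of the sort of extraction this involves (pulling a still frame from each video), assuming mp4 inputs under raw_data/videos; the directory, extension, and timestamp are all illustrative:

```r
# Requires ffmpeg on the PATH; called from R via system2.
videos <- list.files("raw_data/videos", pattern = "\\.mp4$", full.names = TRUE)
for (v in videos) {
  out <- sub("\\.mp4$", ".png", basename(v))
  # Grab a single frame one second in as the stimulus image.
  system2("ffmpeg", c("-y", "-i", shQuote(v), "-ss", "00:00:01",
                      "-frames:v", "1", shQuote(out)))
}
```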

That should conclude the review! The only thing we could still do is try to get the missing images that did not overlap with bh2017.
