documentation for creating import scripts
adriansteffan committed Aug 30, 2024

## Running the existing import scripts

You can (re-)run individual import scripts inside their respective folders. Each script automatically downloads the specific dataset from OSF, runs the import, runs validation, and (if the flag at the bottom is set) uploads the processed data to OSF.

To run and validate all of the import scripts (and download/upload all data), run the `run_all()` function in `pipeline.R`. Its parameters control various aspects of the process (e.g. whether all data should be re-downloaded, whether previously processed data should be wiped, and whether the data should be uploaded).
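
For example, a full pipeline run might look like the sketch below; the argument names are illustrative assumptions, so check the `run_all()` definition in `pipeline.R` for the actual parameters.

```
# Sketch only: the argument names below are assumptions, not the actual signature.
source("pipeline.R")

run_all(
  redownload_raw = FALSE,  # re-download all raw data from OSF?
  wipe_processed = TRUE,   # delete previously processed data first?
  upload         = FALSE   # upload the processed data to OSF afterwards?
)
```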

## Creating a new import script

The following steps will get you up and running with creating an import script for a new dataset.

We use an [OSF repository](https://osf.io/pr6wu/) for storing both the raw input data and the processed output data.

When starting to import a new dataset, create a folder at the top level of the OSF repository named after the dataset identifier (e.g. `weaver_zettersten_2024`). If no identifier exists yet, refer to the other datasets to get an idea of the format.

Next, create a `raw_data` subdirectory in the OSF folder you just created and put the raw dataset into this folder. Usually, the corresponding paper has a GitHub or OSF repository with pointers to where the data can be found.
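
The resulting layout on OSF should look roughly like this:

```
weaver_zettersten_2024/   # top-level folder, named after the dataset identifier
└── raw_data/             # the unmodified raw dataset files go in here
```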

### Generating the import template

Open a command line in your local copy of this repository (or clone it and `cd` into it).
To generate the import script template, run
```
Rscript start_import.R
```

If you need to create a legacy import script instead, run

```
Rscript start_import.R --legacy
```

2. Copy over the `idless_template.R` from `helper_functions/` to the directory you just created and rename it to `import.R` (use `legacy_template.R` if you need to create a legacy import script)

3. If your data is already in an OSF folder named after the dataset, run the `import.R` script up to (and including) the `init()` function call to automatically download it. Otherwise, create a `raw_data` folder within your dataset-specific folder and move your raw data there.

## Import Guide

The underlying `peekds` data format stores the eyetracking gazepoints and metadata in a set of tables where relationships are expressed through linked IDs. As it can be unwieldy to directly turn your data into this format, we have created a set of templates and functions to streamline the data importing process. The next few paragraphs will go over how to prepare your data for import using these helpers.

If you run into issues with the provided methods (e.g. memory limitations, edge cases not sufficiently mapped in our helper functions), you can also manually generate the tables and IDs in a "legacy" import script. Sparser instructions and help for this can be found [here](#import-guide-legacy-method).

After following the steps above, check out your `import.R`. The template can be divided into 5 steps.

### 1. Initial Setup

The first part of the script loads the required libraries/helper functions and downloads the raw dataset from OSF (if it has not been downloaded already). If you want to re-download the raw data, delete the `raw_data/` folder and rerun the `init()` function. Put any libraries you use at the top of the script.
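
As a rough orientation, the setup block might look like the sketch below; the generated template contains the authoritative version, and the sourced file path and `init()` call here are assumptions based on the names used in this guide.

```
# Sketch of the setup block -- the generated template is authoritative;
# the sourced path and the init() call are assumptions.
library(tidyverse)                             # plus whatever libraries your import needs
source("../helper_functions/idless_draft.R")   # assumed relative path to the helpers

dataset_name <- "weaver_zettersten_2024"
init(dataset_name)   # downloads raw_data/ from OSF if it is not present yet
# data_path (set up by the template) points at the dataset's raw_data/ folder
```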

### 2. Creating a wide.table

First, read in all relevant data from `raw_data` (the folder path is specified in `data_path`).
The main portion of the import script consists of transforming this raw data into one big table that we will later feed into the digestion function.


This table should have one row per recorded gazepoint, with all of the associated metadata present (subject, trial information, stimuli, general metadata) in each row. This format has a high level of redundancy but is easy to reason about and can be easily split by our helper function.


Refer to the generated template for an overview of the columns that need to be present in this table. This [specification of the wide.table fields](https://docs.google.com/spreadsheets/d/1Z24n9vfrAmEk6_HpoTSh58LnkmSGmNJagguRpI1jwdo/edit?gid=516411759#gid=516411759) explains the contents of each column. Some columns are optional, so depending on your dataset, you might not have to specify all of them.
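
Putting this together, building the wide table usually amounts to reading the raw files and joining them so that every gazepoint row carries its metadata. The sketch below is purely illustrative: the file names, join keys, and columns are invented for the example, and the authoritative column list is in the template and the linked spec.

```
# Illustrative sketch -- file names, join keys, and columns are invented here.
library(tidyverse)

gaze     <- read_csv(file.path(data_path, "gaze_samples.csv"))
trials   <- read_csv(file.path(data_path, "trial_info.csv"))
subjects <- read_csv(file.path(data_path, "subject_info.csv"))

wide.table <- gaze %>%
  left_join(trials,   by = "trial_id") %>%
  left_join(subjects, by = "subject_id") %>%
  mutate(dataset_name = "weaver_zettersten_2024")   # plus the remaining required columns
```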


### 3. Digest the wide.table

Pass the `wide.table` to the digestion function along with the `dataset_name`.
Also specify:
* `lab_dataset_id`: the original authors' internal name for the dataset; can be `NA`
* `cite`: full-length citation of the original publication containing the dataset (APA format)
* `shortcite`: author (year) - APA-style in-text citation
* `rezero`: `TRUE`, except if t=0 already marks the start of trials or t is centered around the target onset
* `normalize`: `TRUE`, except if t=0 is already centered on the target word onset
* `resample`: `TRUE` (until we find an edge case where this breaks things)

Explanation for rezero/normalize/resample:
Our final data format expects every trial to have t values centered around the point of disambiguation (target word onset) and timesteps resampled to 40Hz. To achieve this, the digest function runs the looking data through three manipulations:
* *Rezeroing*: Resets t to zero at the beginning of each trial
* *Normalization*: Subtracts the point of disambiguation from each timepoint
* *Resampling*: Resamples the timepoints to 40Hz

Normally you should leave all of these flags as `TRUE`, but they give you control over the time manipulation process in edge cases (e.g. your data is centered around the target onset already).
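
A toy example of what the three steps do to one trial's timestamps:

```
# Raw timestamps (ms) for one trial; assume the target word onset is at 1050 ms.
t_raw        <- c(1000, 1025, 1050, 1075)
t_rezeroed   <- t_raw - t_raw[1]             # 0, 25, 50, 75    (trial now starts at 0)
t_normalized <- t_rezeroed - (1050 - 1000)   # -50, -25, 0, 25  (0 = target word onset)
# Resampling then interpolates these values onto a 25 ms grid (40Hz).
```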

The function returns a list of 7 or 9 datasets containing the tables in `peekds` format. The specific tables can be accessed using
`dataset_list[["TABLE_NAME"]]`.
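
To make the shape of the call concrete, here is a hypothetical sketch: `digest_wide_table()` is a placeholder name (the actual digestion function and its full signature come from the template), and the citation strings are placeholders.

```
# Placeholder sketch -- digest_wide_table() stands in for the template's actual
# digestion function; only the parameters described above are grounded.
dataset_list <- digest_wide_table(
  wide.table,
  dataset_name   = "weaver_zettersten_2024",
  lab_dataset_id = NA,
  cite      = "Author, A., & Author, B. (2024). Title of the paper. Journal, 1(1), 1-10.",
  shortcite = "Author & Author (2024)",
  rezero    = TRUE,
  normalize = TRUE,
  resample  = TRUE
)

subjects_table <- dataset_list[["subjects"]]   # e.g. the subjects table
```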

### 4. Aux Data

To attach arbitrarily shaped data to the peekds data tables, the tables have special `*_aux_data` columns that house JSON strings. The output of the digestion function has `NA` values for all `*_aux_data` columns. If you have any aux data to add (CDI data, language exposures, other language measures), this is the place to do it.

As CDI is the most commonly imported aux data, there is a specific helper to ingest this type of data; it expects a long-format table with one CDI raw score per row. The template contains code that documents the usage of the helper function and the exact shape of the input table. If you have more data, or data other than CDI, refer to the next section.
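
For orientation, the helper's input looks roughly like the sketch below; the column names and values are illustrative, and the template documents the exact expected shape.

```
# Illustrative only -- the template specifies the exact columns the CDI helper expects.
library(tibble)

cdi_long <- tribble(
  ~lab_subject_id, ~instrument_type, ~measure, ~rawscore, ~age,
  "S01",           "ws",             "prod",   312,       24,
  "S02",           "wg",             "comp",   187,       18
)
```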

#### Passing in other aux data

A more generalized helper function to turn nested data into the correct JSON output is on the roadmap, but for now you will have to manually populate the JSON strings for the aux data columns.

The required shape of the aux data JSON objects can be found [here](https://docs.google.com/spreadsheets/d/1Z24n9vfrAmEk6_HpoTSh58LnkmSGmNJagguRpI1jwdo/edit?gid=556307379#gid=556307379). For reference, you can check other `import.R` scripts that imported aux data, or the CDI helper function at the bottom of `idless_draft.R`.
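
A minimal sketch of populating an aux data column by hand, assuming `jsonlite`; the column name `subject_aux_data` and the fields inside the JSON are illustrative and must be checked against the spec linked above.

```
# Sketch: the field names inside the JSON are illustrative; follow the linked spec.
library(jsonlite)
library(dplyr)

dataset_list[["subjects"]] <- dataset_list[["subjects"]] %>%
  rowwise() %>%   # rowwise so the JSON can vary per subject
  mutate(subject_aux_data = as.character(toJSON(
    list(lang_exposures = list(list(language = "English (American)", exposure = 100))),
    auto_unbox = TRUE
  ))) %>%
  ungroup()
```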

When joining data together, be mindful that the meanings of some fields have changed in the output, e.g. what you entered as `subject_id` is now called `lab_subject_id`, while the new `subject_id` refers to the internal IDs of the `peekds` format (the renamings were done for clarity on the digestion side).

### 5. Write and Validate the Data (+ Upload)

In `write_and_validate_list()`, set `cdi_expected` to indicate whether you are importing any CDI data.
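
A sketch of the call; only the arguments mentioned in this guide are shown, and whether the table list is passed positionally like this is an assumption (the template contains the actual call):

```
# Sketch -- see the template for the actual call and full argument list.
write_and_validate_list(
  dataset_list,
  cdi_expected = TRUE,   # TRUE if you imported CDI aux data
  upload       = FALSE   # switch to TRUE once validation and sanity checks pass
)
```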

TODO
After you have finished all of the steps, run the entire import script. Some automated validation steps will be run, so check whether the validator complains about any aspect of your data and adjust your import script accordingly. If some error messages are confusing, open a GitHub issue so we can improve the feedback provided by the validator.

After the validation, a summary along with multiple plots about the dataset will be printed. Use these to sanity-check the data and cross-reference it with the original paper.

Once you are happy with the state of the data, upload it by adding an `upload=TRUE` flag to the arguments of `write_and_validate_list()` and running the script again.
Finally, manually upload your `README.md` to the OSF folder of the dataset.

## Import Guide (Legacy Method)

Here's [an old, in-progress Google Doc guide](https://docs.google.com/document/d/1hQrbV33Zdl3SmbJAdTuzyfwCNF9nYpouQ7lQ8U0dQSw/edit) for importing.

Further information about the database schema is also linked in [Peekbank - START HERE](https://docs.google.com/document/d/1PrIrLg_A9VTITIp--ucf_wMN-C0VPODtirz8jowUL1Y/edit).
