# Run and Validate Prepare and Upload Scripts

## Using `prepare.py`

The `prepare.py` script runs filemapper on the data to be uploaded as
specified in the `lookup.csv` contained in the upload folder (i.e., the
destination folder for `prepare.py`). If any of the parent directories
are completely empty, it deletes those folders. It then runs `records.py` on all of
the file-mapped folders to create manifest JSONs and `records.csv`, which
contains a list of full paths to all files to upload.

When using
[prepare.py](https://github.com/DCAN-Labs/nda-bids-upload/blob/master/prepare.py),
there are four mandatory flags:

```
--source (-s): The directory under which all data desired
for upload is found. This is usually the output of a pipeline like
Dcm2Bids or abcd-hcp-pipeline. It is the directory your file mapper
JSONs will be mapping from.
--destination (-d): The upload directory you began in [step two](workingdirectory.md).
This directory is going to be where all of the data will be
organized after prepare.py has finished.
--subject-list: A list of subjects and session pairs within a .csv
file with column labels "bids_subject_id" and "bids_session_id,"
respectively.
--datatypes: A list of NDA data types within a .txt file you plan
to upload.
```
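
Putting the four flags together, a typical invocation might look like the sketch below. The paths and file names are placeholders, not outputs of any real run; substitute your own pipeline output directory, upload directory, subject list, and datatypes file.

```
python3 prepare.py \
    --source /path/to/pipeline/output \
    --destination /path/to/upload_dir \
    --subject-list subject_list.csv \
    --datatypes datatypes.txt
```

The `subject_list.csv` uses the two column labels given above; a minimal (hypothetical) example:

```
bids_subject_id,bids_session_id
sub-01,ses-baseline
sub-02,ses-baseline
```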

`prepare.py` should create the following:

* parent/child directories for each datatype
* `complete_folders.txt`: contains one line with the path to the location of each file prepped for upload
* `complete_records.csv`: contains all data for that subject pulled from `lookup.csv`
* `folders_1_500_1.txt`: `prepare.py` splits the files to upload into batches of 500. For example, if there were 1,400 files, there would be three batches: two with 500 files each and one with 400 files (see the example after this list)
* a companion .csv for each .txt, containing the manifest.txt information for the files being uploaded
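
To make the batching concrete: 1,400 files would produce three batch files like the listing below. Only `folders_1_500_1.txt` is a name confirmed above; the names for the later batches assume the `folders_<first>_<last>_<batch>` pattern generalizes.

```
folders_1_500_1.txt       # batch 1: files 1-500
folders_501_1000_2.txt    # batch 2: files 501-1000 (assumed naming)
folders_1001_1400_3.txt   # batch 3: files 1001-1400 (assumed naming)
```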


## Validating `prepare.py`

Once this script has been run you will want to spot check the results.

You can validate that all of the expected subjects are present in each
of the datatype folders by running the `validate-prepare.py` script with
your `subject_list.csv` and your working directory as the inputs. This script
loops through each datatype folder and compares the subject list to the
subject IDs present in the folder. It outputs a text file containing the
subject IDs that are in the subject list but not in the datatype folders.
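
The exact command-line interface of `validate-prepare.py` is not documented here, so the argument order below is an assumption; check the script itself before running.

```
# Assumed interface: subject list first, working directory second.
python3 validate-prepare.py subject_list.csv /path/to/upload_dir
```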

You should also validate the directory structure of the datatype folders
that were created. In the upload directory, you will find a parent/child directory setup.
You should have a parent directory for each of the JSON/YAML file pairs
where relevant files were found.
They should have the same name as their corresponding JSON/YAML files
without the extensions. Underneath you should find a child directory for
every subject (and session if used) that was found to have the relevant
files listed in the corresponding file mapper JSON. If there are no
child files under the parent directory then the script couldn't find any
files listed in the corresponding file mapper JSON.
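
For orientation, the overall layout has the general shape sketched below. The parent names are placeholders; real parent directories take the names of your JSON/YAML pairs, and the child naming convention is described in the next paragraph.

```
upload_dir/
├── <json_yaml_pair_name_1>/    # one parent per JSON/YAML pair with matched files
│   ├── <child dir for subject/session 1>
│   └── <child dir for subject/session 2>
└── <json_yaml_pair_name_2>/
    └── <child dir for subject/session 1>
```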

You will notice a common naming convention for the "child" directories
as well. At the child directory level the naming convention has four
[...]

mapper JSON files for proper formatting.

## Using `upload.py`

Before you begin the upload process, we recommend sending the NDA a courtesy email letting them know that you are about to start uploading data. If possible, provide an estimate of how much data is going to be uploaded.

The `upload.py` script uses `records.csv` (generated by `prepare.py` above) to
split the files to be uploaded into batches of 500. For each batch, the
script loops through each of the file paths to generate and run the
necessary upload command using NDA-tools.
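
For orientation, the NDA-tools command generated for a batch is a `vtcmd` call. A representative shape is sketched below with placeholder values; the exact flags `upload.py` passes are not shown in this document, so treat this as illustrative rather than the script's literal output.

```
# Illustrative vtcmd call for one batch; <nda_username> and
# <collection_id> are placeholders for your own values.
vtcmd folders_1_500_1.txt -b -u <nda_username> -c <collection_id> \
    -t "batch 1" -d "first batch of prepared files"
```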
[...] username and password, which is then stored in
**~/.NDATools/settings.cfg**. Find the upload logs, validation results,
and submission package here: **~/NDA/nda-tools/vtcmd**.

## Validate `upload.py`

After running `upload.py`, check that the submission was successful. First check the output logs for any errors. Then, follow the steps below to check the submission on the NDA:

1. Log into the NDA.

2. Navigate to your dashboard.

3. Click on "Collections (#)".

4. Click on the collection name you uploaded to.

5. Click on the "Submissions" tab and check the "Submission Loading Status".
