Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add corrected labels #248

Closed
jcohenadad opened this issue Jul 12, 2023 · 24 comments
Closed

Add corrected labels #248

jcohenadad opened this issue Jul 12, 2023 · 24 comments
Assignees
Labels
dataset: philadelphia-pediatric Pediatric HC data from Feroze Mohammed

Comments

@jcohenadad
Copy link
Member

jcohenadad commented Jul 12, 2023

Related to the processing done in neuropoly/template#53

Branch: nb/template_pediatric nb/add-labels

@jcohenadad
Copy link
Member Author

jcohenadad commented Jul 19, 2023

@mguaypaq can you please help @NadiaBlostein with uploading clean dataset on git-annex. The problem is that there is a bunch of processed files, eg from cd07588e3cfdaceb4440b3b5a7d0d6eff0f341e4:

ok                                
get derivatives/labels/qc/data_processed/sub-124/anat/sct_label_vertebrae/2023_02_14_164043.807521/overlay_img.png (from origin...)  --> THESE ARE QC DATA AND THEY SHOULD NOT BE UPLOADED
ok                                
get derivatives/labels/qc/data_processed/sub-124/anat/sct_label_vertebrae/2023_02_14_165254.616938/bkg_img.png (from origin...) 
ok                                
get derivatives/labels/qc/data_processed/sub-124/anat/sct_label_vertebrae/2023_02_14_165254.616938/overlay_img.png (from origin...) 
ok                                
get derivatives/labels/sub-101/anat/straight_ref.nii.gz (from origin...)     --> THESE ARE TEMPORARY DATA AND THEY SHOULD NOT BE UPLOADED

etc.

The only files that should be git-annexed (in addition to the files already present in the master branch) are:

  • SC seg
  • disc labels

@mguaypaq
Copy link
Member

mguaypaq commented Jul 20, 2023

Sure. To avoid confusion, I used a new branch (nb/add-labels) to do the re-org, and edited the issue description to strike out the old branch (nb/pediatric-template).

For the re-organization inside derivatives/labels/, I did this:

  • Removed the log/ and qc/ directories.
  • Moved the script process_data_deepseg.sh to a new code/ directory.
  • Removed the various temporary files:
    sub-*/anat/{straight_ref.nii.gz,warp_curve2straight.nii.gz,warp_straight2curve.nii.gz}
    
  • Dropped the modifications to the git-annex settings in .gitattributes, since we don't need to have any .png files after all.

I re-used @NadiaBlostein's original commit message for this revised commit.
I also added an extra commit to add some skeleton for metadata inside derivatives/labels/, which should be filled out:

  • README.md
  • dataset_description.json
  • .bids-validator-config.json

Some things to note:

  • It looks like sub-107 and sub-125 are present in the base dataset, but not in derivatives/labels/, is this normal?
  • bids-validator complains about the naming scheme of the files. I'll try to suggest something which conforms to the spec in another commit soon. Probably with some changes to the JSON sidecars as well.

@NadiaBlostein
Copy link

NadiaBlostein commented Jul 21, 2023

@mguaypaq Thank you very much, this is perfect. To address your notes:

About the missing labels:

  • as noted in participants.tsv, the data acquisition is incomplete for subjects sub-107, sub-110 and sub-125. Spinal cord and vertebral labeling was only performed on the re-composed T1w and T2w images (sub-???_rec-composed_{T1w, T2w}.nii.gz). These images are generated by "stitching" the top (sub-???_acq-top_run-{T1w, T2w}.nii.gz) and bottom (sub-???_acq-bottom_run-{T1w, T2w}.nii.gz) acquisitions with each other.
  • sub-110 is missing the T1w re-composed image, but still has its T2w re-composed image.
  • sub-107 and sub-125 are each missing a T1w and T2w re-composed image.

About the bids-validator: sounds good! I can look into this after OHBM as well.

Thank you!

@jcohenadad
Copy link
Member Author

Is this done?

@jcohenadad
Copy link
Member Author

Issue with 62fc613a79dba7e8c9ff02d25a76cdebf6988ef0 on nb/add-labels, there are naming inconsistencies:

sub-101_rec-composed_T1w_label-SC_seg.nii.gz --> MISSING JSON FILE
sub-101_rec-composed_T1w_label-disc-manual.json --> NOT SAME FILE NAME AS THE FILE BELOW
sub-101_rec-composed_T1w_label-disc.nii.gz
sub-101_rec-composed_T2w_label-disc-manual.json --> NOT SAME FILE NAME AS THE FILE BELOW
sub-101_rec-composed_T2w_label-disc.nii.gz
sub-101_rec-composed_T2w_label-disc_level.nii.gz --> WHAT IS THIS FILE? AND IT HAS NO JSON
sub-101_rec-composed_T2w_label-SC_seg.nii.gz --> MISSING JSON FILE

@mguaypaq
Copy link
Member

mguaypaq commented Aug 3, 2023

Sorry, I got side-tracked with submitting a pull request to the BIDS spec while figuring out the best name for these files. I'll do the rename today.

@mguaypaq
Copy link
Member

mguaypaq commented Aug 3, 2023

Ok, I renamed the files and did some other cleanup. I still have some questions, though.

Done:

  • sub-102 contained some random *-header.txt files, removed.
  • Added information to README.md about missing data, as per the comment above.
  • Standardized from "Note" to "Notes" in the JSON sidecars.
  • Fixed formatting of participants.tsv in the source dataset.
  • Renamed files to align with the BIDS spec.

Questions:

  • 2 subjects are missing the T1w spinal cord mask, even though they have corresponding disc labels, why?
    • sub-103
    • sub-109
  • 6 subjects are missing the T2w JSON sidecar:
    • sub-113
    • sub-115
    • sub-116
    • sub-117
    • sub-122
    • sub-123
  • Several subjects are missing one or both of the disc level files (originally named *_label-disc_level.nii.gz). My understanding is that these files are auto-generated by using the SC segmentation and the (manual) disc labels, to "color" the entire SC segmentation with the corresponding disc levels.
    • Do we even want these files in the dataset?
    • Maybe because they're autogenerated, these files don't have JSON sidecars.
    • List of subjects with one or both missing:
      • sub-101
      • sub-103
      • sub-109
      • sub-110
      • sub-114
      • sub-116
      • sub-117
      • sub-119
      • sub-123

@NadiaBlostein
Copy link

NadiaBlostein commented Aug 7, 2023

Great! I just pulled the latest updates from nb/add-labels:

git checkout nb/add-labels
git pull && git annex sync --no-content


I have the following responses to your questions (and some questions of my own):

  • About *-header.txt file removal: confirming that these files can be deleted.

  • About 2 subjects (sub-103, sub-109) missing the T1w spinal cord (SC) mask:

    • for sub-103 and sub-109 T1w images, the spinal cord masks were not great so I did not keep them (such that future users do not just blindly use those labels for their projects and is forced to either regenerate, QC and maybe manually correct the masks).
    • I did not need to manually fix the SC mask labels for these two subjects because sub-103 and sub-109 centerline files were used instead of the spinal cord masks to obtain disc labels.
    • The centerline files were not kept because they are automatically regenerated (and then not saved) in the template-generation pipeline here.
    • Moving forward: Do you want me to add the SC masks for these subjects anyway? Should I also generate the centerline NIFTI files for the T1w and T2w images of each subject?
  • About 6 subjects missing the T2w JSON sidecar: I am not sure what you mean here. I am able to see and catthe following files:

    • philadelphia-pediatric/sub-113/anat/sub-113_rec-composed_T2w.json
    • philadelphia-pediatric/sub-115/anat/sub-115_rec-composed_T2w.json
    • philadelphia-pediatric/sub-116/anat/sub-116_rec-composed_T2w.json
    • philadelphia-pediatric/sub-117/anat/sub-117_rec-composed_T2w.json
    • philadelphia-pediatric/sub-122/anat/sub-122_rec-composed_T2w.json
    • philadelphia-pediatric/sub-123/anat/sub-123_rec-composed_T2w.json
  • About several subjects missing one or both of the disc level files: I thought I had deleted these files altogether. They are not needed for the template generation pipeline and create a redundancy with the disc labels + SC mask.

  • Naming convention & moving forward: derivatives/labels/sub-XXX/anat/ do not seem to be following the convention that I thought should be used (based on the suffixes mentioned in the SCT manual-correction module, here). Is it okay if I rename files according to the following convention, and push the changes to nb/add-labels?

    • SC masks: philadelphia-pediatric/derivatives/labels/sub-XXX/anat/sub-XXX_rec-composed_label-SC_desc-{T1w, T2w}_mask.nii.gz --> philadelphia-pediatric/derivatives/labels/sub-XXX/anat/sub-XXX_rec-composed_{T1w, T2w}_label-SC_mask.nii.gz
    • Disc labels: philadelphia-pediatric/derivatives/labels/sub-XXX/anat/sub-XXX_rec-composed_desc-T2wdisc_dseg.nii.gz --> philadelphia-pediatric/derivatives/labels/sub-XXX/anat/_rec-composed_{T1w, T2w}_labels-disc.nii.gz
    • Disc levels: Delete remaining files.
    • Centerline: Delete remaining file (only 1 subject has a saved centerline file: philadelphia-pediatric/derivatives/labels/sub-103/anat/sub-103_rec-composed_T2w_label-centerline.nii.gz)

Thank you very much! I hope this was clear. Cheers.

@jcohenadad
Copy link
Member Author

Moving forward: Do you want me to add the SC masks for these subjects anyway? Should I also generate the centerline NIFTI files for the T1w and T2w images of each subject?

not if they are automatically generated by the pipeline, and if they are satisfactory

@mguaypaq
Copy link
Member

mguaypaq commented Aug 7, 2023

@NadiaBlostein Thanks! That's very clear yes.

  • About disc level files: I added a commit to the nb/add-labels branch remove them, since they're redundant as you said.

  • About the centerline file for sub-103: Thanks for catching this, I removed it in another extra commit.

  • About the naming convention: I was trying to be a little more BIDS-compliant with the naming scheme, but it's not clear to me that there's much benefit right now. If the manual-correction scripts are expecting the old naming scheme, that's a good argument for reverting to it. I added another commit to do this, to save you some time.

  • About SC masks for sub-103 and sub-109: Thanks for the clarification. I added a note to README.md about it.

  • About missing JSON sidecars: I should have specified that I meant inside the derivatives/labels/ folder, so for example this file doesn't exist (but I would have expected it to be present):

    philadelphia-pediatric/derivatives/labels/sub-113/anat/sub-113_rec-composed_T2w_labels-disc.json
    

    The corresponding files for other subjects list either you or Sandrine as author (along with a timestamp, and sometimes a textual note), so I couldn't guess what should go in each missing file.

I think we're nearly there! If you can add the missing JSON sidecars in another commit on nb/add-labels, I think this will be good enough to merge into master.

@NadiaBlostein
Copy link

This is terrific, thank you for updating all of these things! Updates on my end:


Changes to philadelphia-pediatric/derivatives/labels/code

  • data_processing.sh: step-by-step walk-through (dependencies & versions; Bash commands);
  • configuration-all_{T1w, T2w}_subj.json: configuration files needed to run data_processing.sh.

About the missing JSON sidecars

  • The subjects for which the .json sidecars are missing are the ones for which the disc labels output by sct_label_vertebrae were kept intact! This is the same labeling convention used by the SCT manual-correction module:
    • philadelphia-pediatric/derivatives/labels/sub-XXX/anat/_rec-composed_{T1w, T2w}_labels-disc.nii.gz: disc labels for sub-XXX, either obtained manually or via sct_label_vertebrae (no point in keeping both the manually corrected labels & the labels output by sct_label_vertebrae);
    • philadelphia-pediatric/derivatives/labels/sub-XXX/anat/_rec-composed_{T1w, T2w}_labels-disc.json: this file only exists for subjects whose disc labels (*_labels-disc.nii.gz) were manually corrected, and the .json file contains a time stamp as well as the name of the person who corrected the labels.

About SC masks for sub-103 and sub-109 T1w images (*_T1w_label-SC_mask.nii.gz)

  • I reran a newer version of the template generation pipeline on the pediatric dataset (details in philadelphia-pediatric/derivatives/labels/code) by including the SC masks for both subjects the pipeline did run correctly... I therefore:
    • added philadelphia-pediatric/derivatives/labels/sub-103/anat/sub-103-rec-composed_T1w_label-SC_mask.nii.gz;
    • added philadelphia-pediatric/derivatives/labels/sub-109/anat/sub-109-rec-composed_T1w_label-SC_mask.nii.gz;
    • removed the note in derivatives/labels/README.md about those labels missing.
  • For sub-109, re-manually correcting the disc labels (label 3 was originally misplaced) fixed downstream errors. The following files have therefore been changed:
    • philadelphia-pediatric/derivatives/labels/sub-109/anat/sub-109_rec-composed_T1w_labels-disc.nii.gz
    • philadelphia-pediatric/derivatives/labels/sub-109/anat/sub-109_rec-composed_T1w_labels-disc.json

Does this sound reasonable? Let me know if you have any questions. I will be adding the template pipeline derivatives gradually, as I receive the outputs and QC them.

@mguaypaq
Copy link
Member

That all sounds reasonable, thanks for the explanations and the extra commits.

All the git-annexed files from your most recent commit are available on the server, but it looks like some of the previous versions didn't get saved to the server. Could you run this command, which should upload the missing files to the server?

git annex copy --all --to=origin

@NadiaBlostein
Copy link

Okay!

More updates:

  • Modified philadelphia-pediatric/derivatives/code/data_processed.sh
  • Added philadelphia-pediatric/derivatives/sct_straighten_spinalcord_T1w
  • Added philadelphia-pediatric/derivatives/template_T1w

And just ran git annex copy --all --to=origin

Everything good? I have a couple more things to add :)

@mguaypaq
Copy link
Member

Great, it looks like git annex copy --all --to=origin worked, I now see the two previously missing files.

Also, all the new .nii.gz files from your latest commit are available through git annex get, so 👍

Looking at the new commit, I see that the large .mnc files and a .npz binary file are directly in git rather than git-annex. But that's a problem I can fix later, there's no missing data.

For now I would say, go ahead and add other things if you want.

@NadiaBlostein
Copy link

Great! More updates:

  • Added philadelphia-pediatric/derivatives/sct_straighten_spinalcord_T2w
  • Added philadelphia-pediatric/derivatives/template_T2w
  • ran git annex copy --all --to=origin again

Will add the templates after once they are ready (this is slow, may take up to a week).
Let me know if anything is missing / incorrect. Thank you!

@mguaypaq
Copy link
Member

I can confirm that your extra updates are on the data server, yes: the .mnc files are directly in git (I'll fix this later, once you're done with the updates), and the .nii.gz files are in git-annex.

@jcohenadad
Copy link
Member Author

jcohenadad commented Sep 22, 2023

After discussing with @NadiaBlostein, I realized that she put the generated template on git-annex.

We do not want that because

  • the purpose of git-annex is to host: (i) original NIfTI images and (ii) corrected labels

With the data from git-annex and the code to generate the template, we should be able to generate the template.

Then, the template should be stored as a github asset inside this repository: https://github.com/spinalcordtoolbox/pediatric-template

So, actions to do:

  • remove all the files that should not be under git-annex (if possible remove the history, maybe with a rebase?, because many GB were added)
  • please include me in the discussion if you have questions/concerns

thanks

@NadiaBlostein
Copy link

Dataset on git-annex

philadelphia-pediatric

Branch

nb/add-labels

Update

As discussed with @jcohenadad, I deleted all of the intermediary files output by the template-generation pipeline as they occupy a lot of space and are not needed.

The following directories were deleted:

  • philadelphia-pediatric/derivatives/template_T1w
  • philadelphia-pediatric/derivatives/template_T2w
  • philadelphia-pediatric/derivatives/sct_straighten_spinalcord_T1w
  • philadelphia-pediatric/derivatives/sct_straighten_spinalcord_T2w
  • philadelphia-pediatric/derivatives/model_nl_all_T2w

Currently

As discussed with @jcohenadad, the nb/add-labels branch of the philadelphia-pediatric dataset on git-annex only contains the following files (in BIDS format):

  • (i) original NIfTI images: philadelphia-pediatric/sub_*
  • (ii) corrected labels: philadelphia-pediatric/derivatives/labels/sub_*

Moving forward

@mguaypaq Is this clear and was everything done correctly? If so, do I have your okay to run git annex copy --all --to=origin? Or will you be the one to merge this with the master branch?

Thank you!

@jcohenadad
Copy link
Member Author

@NadiaBlostein before deleting the generate templates, can you please upload them here: https://github.com/spinalcordtoolbox/pediatric-template

@mguaypaq
Copy link
Member

@NadiaBlostein it's always OK to do git annex copy --all -to=origin, that command doesn't affect the master branch.

For the file deletions, I'll make sure everything is tidy once we have everything on the server, no need to worry. In the meantime, if you want to access the deleted files (for example, to upload the template in spinalcordtoolbox/pediatric-template#6), you can still get them by doing:

# check which branch you're on, and whether there are unsaved changes
git status

# if there are unsaved changes, save them first
git add .
git commit -m "work in progress"

# go back to the commit from before deleting various files
git checkout 8a89e4c9e078dd49851087fe91c7de70a38c5897

# once you're done looking at the deleted files, you can go back to your original branch
git checkout nb/add-labels

I see that you correctly deleted all the derivatives/ folders except derivatives/labels/, but I have one last question: the *-labels-disc.nii.gz files are clearly manually labelled, since you're listed as the author in the corresponding JSON files. But what about the *-label-SC_mask.nii.gz files? Are those auto-generated by the pipeline? If so, I might remove them too.

@NadiaBlostein
Copy link

Hi @mguaypaq ! Some updates & questions.

1. To answer your question:

  • Indeed, all of the *-label-SC_mask.nii.gz were automatically generated whereas most of the *-labels-disc.nii.gz files were manually corrected by either me or @sandrinebedard.

2. I am having trouble switching between dataset versions. I ran the following:

git annex copy --all --to origin
git checkout 8a89e4c9e078dd49851087fe91c7de70a38c5897

Currently, when I do git branch, I end up on a called edd688281916bec227c9198032adb333bc275a05. When I run git checkout nb/add-labels, I get the error below:

error: Vos modifications locales aux fichiers suivants seraient écrasées par l'extraction :
*nii.gz # long list of files

3. Finally, I would like to obtain the exact commit ID of the dataset after the changes described in the comment here. Has it been merged with the dataset master branch?

Thank you very much!

@mguaypaq
Copy link
Member

mguaypaq commented Oct 2, 2023

Hi @NadiaBlostein, sorry for the delay.

Thanks for the clarification about the *-label-SC_mask.nii.gz files.

For switching between dataset versions, it's hard to guess what exactly is going on, but the following commands should save your current state and let you switch to nb/add-labels:

# create a new branch to save the current state
git checkout -b nb/previous-state
git add .
git commit -m 'previous state'

# now that the local modifications are saved, you can switch branches
git checkout nb/add-labels

For the commit ID of the dataset described in that comment, it's edd688281916bec227c9198032adb333bc275a05. This is the commit that the server currently thinks is nb/add-labels, so it should show up as origin/nb/add-labels if you run git log edd688281916bec227c9198032adb333bc275a05. If you don't see origin/nb/add-labels for this commit, you can probably fix it by running git fetch origin.

After that, if you're still having trouble, I would be curious to see the copy-pasted output of these commands:

git status
git log --graph HEAD nb/add-labels

@NadiaBlostein
Copy link

NadiaBlostein commented Oct 3, 2023

It worked! Also, for some reason the sub-103 and sub-109 T1w SC masks were supposed to have been removed from the nb/add_labels branch (see above) but were not, which was causing downstream issues for @rohanbanerjee in rerunning the pipeline. I just deleted them and updated the README for the nb/add_labels branch (now commit bfe5ccef153ed127f2cb721114639889b70bbd7d). Is this okay?

Thank you!

@mguaypaq
Copy link
Member

mguaypaq commented Oct 3, 2023

Alright! Thank you for your patience. Everything looks good now, so I combined all the changes into a single new commit on the master branch (with your name on it). Since all the intermediate history is gone, the large files are gone and the downloads are quick: a simple git clone only gets 14 MB, and git annex get gets 2.2 GB.

(Just for safety, I still have a clone of the repository on joplin with all the history and intermediate files. I'll keep it for a couple of weeks in case we need anything from it, then delete it to free up 24 GB of disk space.)

@mguaypaq mguaypaq closed this as completed Oct 3, 2023
NadiaBlostein added a commit to spinalcordtoolbox/pediatric-template that referenced this issue Oct 5, 2023
Updated dataset version (now just master branch; updated commit ID), see [data-management closed issue 248](neuropoly/data-management#248 (comment))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dataset: philadelphia-pediatric Pediatric HC data from Feroze Mohammed
Projects
None yet
Development

No branches or pull requests

3 participants