-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Module/ichorcna/1.0 #178
Module/ichorcna/1.0 #178
Conversation
…o module/ichorcna/1.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One big thing I'd like to raise is why the source code for ichorcna is seemingly bundled with this module. This should auto-install rather than containing all the source code.
demo/Snakefile
Outdated
@@ -83,4 +84,5 @@ rule all: | |||
rules._strelka_all.input, | |||
rules._bwa_mem_all.input, | |||
rules._liftover_all.input, | |||
rules._controlfreec_all.input | |||
rules._controlfreec_all.input, | |||
rules._ichorcna_all.input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a newline after this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
demo/config.yaml
Outdated
ichorcna: | ||
inputs: | ||
sample_bam: "data/{sample_id}.bam" | ||
sample_bai: "data/{sample_id}.bam.bai" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a newline at the end if this is the last line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
readcounter: | ||
readCounterScript: "{MODSDIR}/src/readCounter" | ||
chrs: | ||
hg19: "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we wouldn't rely on the user to specify the chromosomes this way but I can live with this for now.
hs37d5: "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y" | ||
hg38: "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY" | ||
grch38: "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY" | ||
qual: 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain with a comment what this does.
e.g.
qual: 20 #set the minimum mapping quality (or whatever this actually means)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
"500000": "{MODSDIR}/src/inst/extdata/HD_ULP_PoN_{genome_build}_500kb_median_normAutosome_median.rds" | ||
# must use gc wig file corresponding to same binSize (required) | ||
ichorCNA_gcWig: | ||
"1000000": "{MODSDIR}/src/inst/extdata/gc_{genome_build}_1000kb.wig" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this genome_build naming match the one we use ? I assume it does since this was run in GAMBL, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, unfortunately ichorCNA's github repo is messy and inconsistent with their naming conventions. In my original version, I had to manually rename some of their reference files to fit this format. In the current version, there's one rule with a bunch of symlinks that renames the reference files so it would fit in downstream rules.
modules/ichorcna/1.0/ichorcna.smk
Outdated
|
||
### Directories ### | ||
|
||
resultsDir = "results/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a standard for LCR-modules. What is the purpose/benefit of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left them in by accident when I was initially crafting the module. They're not used at all in the module. They've been removed now
CFG["runs"]["binSize"] = str(CFG["options"]["readcounter"]["binSize"]) | ||
|
||
# Symlinks the input files into the module results directory (under '00-inputs/') | ||
rule _ichorcna_input_bam: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it work on crams?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, unfortunately readCounter does not work on CRAMs. readCounter (part of bamtools) never updated for CRAMs, though it seemed like a popular idea (pezmaster31/bamtools#149)
modules/ichorcna/1.0/ichorcna.smk
Outdated
rule _run_ichorcna: | ||
input: | ||
tum = CFG["dirs"]["readDepth"] + "{seq_type}--{genome_build}/{binSize}/{tumour_id}.bin{binSize}.wig", | ||
# norm = CFG["dirs"]["readDepth"] + "{seq_type}--{genome_build}/{binSize}/{normal_id}.bin{binSize}.wig" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete this and any other extraneous lines
seg = CFG["dirs"]["seg"] + "{seq_type}--{genome_build}/{binSize}/{tumour_id}--{normal_id}--{pair_status}/{tumour_id}.seg", | ||
plot = CFG["dirs"]["seg"] + "{seq_type}--{genome_build}/{binSize}/{tumour_id}--{normal_id}--{pair_status}/{tumour_id}/{tumour_id}_genomeWide.pdf", | ||
#rdata = "results/ichorCNA/{sample_id}/{sample_id}.RData" | ||
params: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you name your params identically to the variable name you want to use all of this is redundant because you can use unpacking to set them all (I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow here
modules/ichorcna/1.0/ichorcna.smk
Outdated
|
||
# Perform some clean-up tasks, including storing the module-specific | ||
# configuration on disk and deleting the `CFG` variable | ||
op.cleanup_module(CFG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add newline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
modules/ichorcna/1.0/ichorcna.smk
Outdated
bai = CFG["dirs"]["inputs"] + "bam/{seq_type}--{genome_build}/{sample_id}.bam.bai", # specific to readCounter | ||
crai = CFG["dirs"]["inputs"] + "bam/{seq_type}--{genome_build}/{sample_id}.bam.crai" | ||
run: | ||
op.relative_symlink(input.bam, output.bam) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we started to use the op.absolute_symlink
here instead of relative symlink. This will also need the check for oncopipe version, and you can find example in one of the recent modules (battenberg/1.1, pathseq/1.0 etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed bam inputs to op.absolute_symlink
modules/ichorcna/1.0/ichorcna.smk
Outdated
seg = CFG["dirs"]["outputs"] + "{seq_type}--{genome_build}/seg/{binSize}/{tumour_id}--{normal_id}--{pair_status}.seg", | ||
plot = CFG["dirs"]["outputs"] + "{seq_type}--{genome_build}/plot/{binSize}/{tumour_id}--{normal_id}--{pair_status}_genomeWide.pdf" | ||
run: | ||
op.relative_symlink(input.corrDepth, output.corrDepth) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the latest version of oncopipe also supports argument in_module=TRUE
, which creates "shallow" symlinks. You can add it here as well, and the recent modules have an example of this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added for all output relative_symlinks
modules/ichorcna/1.0/ichorcna.smk
Outdated
##### RULES ##### | ||
# ---------------------------------------------------------------------------- # | ||
|
||
CFG["runs"]["binSize"] = str(CFG["options"]["readcounter"]["binSize"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this do? I understand that it basically tells rule all the binsize for each sample. If it is the same for all samples, you can in the rule all just use str(CFG["options"]["readcounter"]["binSize"])
and then you can multiply itty the number of samples, for example *len(CFG["runs"]["tumour_sample_id"]
so it expands the length for each sample. There is example in the rule all of module liftover
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your interpretation is correct - I modified it to your method from liftOver.
…nloaded into the input directory
Modified version has been tested - it runs to completion for a grch37, hg19, hg38 sample. Since I had reference files I needed to link and modify from the github repo of ichorCNA, I decided to get snakemake to clone their repo into the CFG["dirs"]["inputs"] directory so I could access these files in downstream rules. The naming convention for ichorCNA's reference files is messy (ex. HD_ULP_PoN_1Mb_median_normAutosome_mapScoreFilteredmedian.rds for hg19, and HD_ULP_PoNhg38_1Mb_median_normAutosome_median.rds for hg38). I originally just renamed these files when I had them as part of the module, and symlinked reference files for related genome_builds (ex. hs37d5, hg19, grch37 would all have symlinks to the same reference file). I incorporated this symlinking step inside the module. Also added Chris's suggestion of tweaking the default parameters. |
@@ -0,0 +1,423 @@ | |||
# file: ichorCNA.R |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is now downloaded directly, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this one is the one that I had to modify to allow hs37d5 to be included. They hard-coded the available genomes
hg38: "{MODSDIR}/src/inst/extdata/GRCh38.GCA_000001405.2_centromere_acen.txt" | ||
ichorCNA_minMapScore: 0.75 | ||
ichorCNA_chrs: | ||
grch37: "c('1', '2', '3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','X')" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be moved to snakelike, instead of being in config? There, you can use file listing chromosomes generated by reference files (main_chromosomes.txt) or use function to generate chromosome names (there is example in sage module)
binSize: 1000000 # set window size to compute coverage | ||
# available binSizes are: 1000000, 500000, 50000, 10000 | ||
run: | ||
ichorCNA_libdir: "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all these files in inst are being auto downloaded? They are probably then not needed in the config since you set the naming convention and the path to these. files in the snakelike rule
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the chromosomes, I could use the function for the readCounter step since I need the command to look like:
"1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y"
but for the ichorCNA step, it needs to be in R-vector format:
"c('1', '2', '3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','X')"
Wouldn't it be simpler to keep it like this in the config? Otherwise, I'd need to make a function to include c( ), and ' '
'ichorCNA_libdir:' parameter is supposed to be set to where the github repo resides for ichorCNA (it'll use this to search for inst/extdata/ underneath that directory). I just left it in the config since it might be useful if someone needs to modify the path one day
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is that vector being given to ichorCNA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the command it would be --chrs "c('1', '2', '3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','X')"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could construct that in Python probably and then just add that as a "param" in your rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Pull Request Checklists
Important: When opening a pull request, keep only the applicable checklist and delete all other sections.
Checklist for New Module
Required
I used the cookiecutter template and updated the placeholder rules.
The snakemake rules follow the design guidelines.
rules
object (e.g. for input files) are wrapped withstr()
.Every rule in the module is either listed under
localrules
or has thethreads
andresources
directives.Input and output files are being symlinked into the
CFG["inputs"]
andCFG["outputs"]
subdirectories, respectively.I updated the final target rule (
*_all
) to include every output rule.I explained important module design decisions in
CHANGELOG.md
.I tested the module on real data for all supported
seq_type
values.I updated the
default.yaml
configuration file to provide default values for each rule in the module snakefile.I did not set any global wildcard constraints. Any/all wildcard constraints are set on a per-rule basis.
I ensured that all symbolic links are relative and self-contained (i.e. do not point outside of the repository).
I replaced every value that should (or might need to) be updated in the default configuration file with
__UPDATE__
.I recursively searched for all comments containing
TODO
to ensure none were left. For example:If applicable
I added more granular output subdirectories.
I added rules to the
reference_files
workflow to generate any new reference files.I added subdirectories with large intermediate files to the list of
scratch_subdirectories
in thedefault.yaml
configuration file.I updated the list of available wildcards for the input files in the
default.yaml
configuration file.Checklist for Updated Module
To be completed.