Tips for saving space
If your project is large (>20 samples), you may prefer to run the preprocessing steps in batches, producing the final GATK-processed BAMs and removing the BWA intermediates to free up space as each batch completes. To do this, subset the data config (i.e., path/to/dna_fastq_config.yaml), keeping all samples from a given patient in the same subset, and indicate only --preprocessing in the command:
module load perl
perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/fastq_config_partN.yaml \
--preprocessing \
-c slurm \
--remove
Since the variant calling steps are best run as a single cohort, be sure to combine all of the partial gatk_bam_config.yaml files prior to running variant calling:
cd /path/to/output/GATK/
cat gatk_bam_config*.yaml | awk 'NR <= 1 || !/^---/' > combined_gatk_bam_config.yaml
perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/combined_gatk_bam_config.yaml \
--variant_calling \
--create_report \
-c slurm \
--remove
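To see what the awk filter in the combine step does, here is a minimal sketch that wraps it in a function and runs it on toy input. The function name `combine_configs`, the `partN` file names and the YAML contents are hypothetical stand-ins, not the pipeline's actual config layout. The filter keeps the first line of the concatenated stream and drops every later `---` document marker, so the result parses as a single YAML document.

```shell
#!/usr/bin/env bash
# Concatenate partial configs, keeping only a leading '---' marker.
# combine_configs is a hypothetical helper name; the awk filter is the
# same one used in the command above.
combine_configs() {
    cat "$@" | awk 'NR <= 1 || !/^---/'
}

# Example (toy files, assumed names):
#   combine_configs gatk_bam_config_part1.yaml gatk_bam_config_part2.yaml \
#       > combined_gatk_bam_config.yaml
#   grep -c '^---' combined_gatk_bam_config.yaml   # expect at most one marker
```

A quick check that the combined file contains at most one `---` line is a cheap way to confirm the merge worked before launching variant calling.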
All SNV calling steps run vcf2maf as a final step. The final MAF files can be compressed (gzip) to reduce file size.
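One way to compress the MAF files in bulk is sketched below. It assumes the MAFs end in `.maf` and sit somewhere under a single output directory; the function name `compress_mafs` and the example path are hypothetical, so adjust them to your own layout.

```shell
#!/usr/bin/env bash
# Recursively gzip every *.maf file under a directory.
# gzip replaces each file.maf with file.maf.gz in place.
compress_mafs() {
    local dir="${1:-.}"
    find "$dir" -type f -name '*.maf' -exec gzip {} +
}

# Example (assumed path):
#   compress_mafs /path/to/output
```

Only do this once no downstream step still needs the uncompressed files, since tools that expect a plain `.maf` path will not find the `.maf.gz` automatically.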
vcf2maf also produces VERY large log files that consist primarily of unnecessary WARNINGs. Once a run is complete, please remove or trim these files to free up space. Depending on the tool, this can reduce the directory size by anywhere from 5% to 95% (e.g., from 165G down to 734M for a HaplotypeCaller run). For example:
for i in logs/run_vcf2maf_and_VEP_*/slurm/s*out; do
  # keep every line that does not contain WARNING, then replace the original log
  grep -v 'WARNING' "$i" > "$i.trim";
  mv "$i.trim" "$i";
done
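Before trimming, it can be useful to estimate how much of each log is WARNING lines. The sketch below is one possible dry run: it only reads the logs and prints the file name, total line count and WARNING line count as tab-separated columns. The function name `count_warnings` is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Report, per log file: <file> <total lines> <lines containing WARNING>.
# Read-only: nothing is modified or deleted.
count_warnings() {
    local f total warn
    for f in "$@"; do
        total=$(wc -l < "$f" | tr -d ' ')
        # grep -c prints 0 but exits non-zero when nothing matches
        warn=$(grep -c 'WARNING' "$f" || true)
        printf '%s\t%s\t%s\n' "$f" "$total" "$warn"
    done
}

# Example, using the same glob as the trimming loop:
#   count_warnings logs/run_vcf2maf_and_VEP_*/slurm/s*out
```

Logs where the third column is close to the second will shrink the most after trimming.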
A full germline-variant detection run proceeds in 3 steps: HaplotypeCaller on each BAM file, GenotypeGVCFs on the combined results, and CPSR to extract likely pathogenic variants.
- HaplotypeCaller outputs individual VCFs and then combines them into multi-sample VCFs. Once everything has completed, you can delete either (or both) of these as all the significant variants can be found in the recalibrated file.
- vcf2maf is run on the final CPSR outputs and, optionally, as part of GenotypeGVCFs using the recalibrated variants; clean up these log files (see above).
Pindel produces very large log files. Once complete, please remove these files to free up space! For example:
rm logs/run_pindel_*/slurm/s*out
Once all pipelines have been run, BAM files should be archived and removed from H4H. Converting your BAM files to CRAM format can reduce file size by 10-60% without any information loss. Run the command below using your GATK-processed BAM config file and the reference fasta used for your alignments, then delete your BAM files and archive your CRAM files.
module load perl
perl ~/git/pipeline-suite/scripts/convert_bam_to_cram.pl \
-d /path/to/processed_bam_config.yaml \
-r /path/to/reference/genome.fa \
-o /path/to/output/directory \
-c slurm \
--no-wait
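Before deleting any BAM, it is worth confirming that its CRAM is complete. The sketch below is one possible check, not part of pipeline-suite: it compares record counts from `samtools view -c` for the two files. It assumes samtools is on your PATH; the function name `verify_cram` and the example file names are hypothetical, and the names that convert_bam_to_cram.pl gives its outputs are not assumed here.

```shell
#!/usr/bin/env bash
# Compare record counts between a BAM and its CRAM.
# Returns 0 on a match, 1 on a mismatch, 2 if samtools fails.
verify_cram() {
    local bam="$1" cram="$2" ref="$3"
    local n_bam n_cram
    # -c prints the number of records instead of the records themselves
    n_bam=$(samtools view -c "$bam") || return 2
    # -T supplies the reference fasta used to decode the CRAM
    n_cram=$(samtools view -c -T "$ref" "$cram") || return 2
    if [ "$n_bam" = "$n_cram" ]; then
        echo "OK: $cram ($n_cram records)"
    else
        echo "MISMATCH: $bam=$n_bam $cram=$n_cram" >&2
        return 1
    fi
}

# Example (assumed file names); delete the BAM only if the check passes:
#   verify_cram sample.bam sample.cram /path/to/reference/genome.fa && rm sample.bam
```

Matching counts do not prove the files are identical record for record, but they catch truncated or failed conversions, which is the main risk before removing the originals.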