Tips for saving space

sprokopec edited this page Jul 7, 2021 · 4 revisions

Remove large intermediate files

If your project is large (>20 samples), you may prefer to run the preprocessing steps in batches, so that each batch produces its final GATK-processed BAMs and the BWA intermediates are removed to free up space before the next batch starts. To do this, split the data config (i.e., path/to/dna_fastq_config.yaml) into subsets, keeping all samples from the same patient together in one subset, and indicate only --preprocessing in the command:

module load perl
perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/fastq_config_partN.yaml \
--preprocessing \
-c slurm \
--remove 
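
One way to produce the per-batch configs is sketched below. It is only a sketch and rests on an assumption about the file layout: that each patient is a top-level YAML key (a line starting in the first column) with its samples indented beneath it, so that cutting only at top-level keys never separates samples from the same patient. The function name split_fastq_config and the output prefix are hypothetical; check the resulting files by hand before using them.

```shell
# Sketch: split a YAML data config into parts of N patients each.
# ASSUMPTION: every patient is a top-level key (a line starting in column 1),
# so all of a patient's samples stay in the same part.
split_fastq_config() {
  config=$1          # e.g. /path/to/dna_fastq_config.yaml
  per_part=$2        # number of patients per batch
  prefix=$3          # hypothetical output prefix, e.g. /path/to/fastq_config_part
  awk -v n="$per_part" -v prefix="$prefix" '
    /^---/ { next }                      # drop the header; one is written per part
    /^[^ \t#]/ {                         # top-level key: a new patient begins
      if (count % n == 0) {              # current part is full (or none open yet)
        if (out != "") close(out)
        part++
        out = prefix part ".yaml"
        print "---" > out
      }
      count++
    }
    out != "" { print > out }            # copy the line into the current part
  ' "$config"
}
```

For example, split_fastq_config /path/to/dna_fastq_config.yaml 10 /path/to/fastq_config_part would write fastq_config_part1.yaml, fastq_config_part2.yaml, and so on, with up to 10 patients in each.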

Since the variant calling steps are best run as a single cohort, be sure to combine all of the partial gatk_bam_config.yaml files prior to running variant calling:

cd /path/to/output/GATK/

cat gatk_bam_config*.yaml | awk 'NR <= 1 || !/^---/' > combined_gatk_bam_config.yaml

perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/combined_gatk_bam_config.yaml \
--variant_calling \
--create_report \
-c slurm \
--remove 
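
To see what the cat/awk combine step does, the following self-contained demo may help. It builds two small partial configs in a temporary directory and merges them with the same filter. The file names and their contents are made up for illustration and are not taken from the pipeline's real output.

```shell
# Demo with two made-up partial configs (names and contents are hypothetical).
demo_dir=$(mktemp -d)
printf '%s\n' '---' 'patientA:' '  normal:' '    patientA-N: /path/to/patientA-N.bam' \
  > "$demo_dir/gatk_bam_config_part1.yaml"
printf '%s\n' '---' 'patientB:' '  tumour:' '    patientB-T: /path/to/patientB-T.bam' \
  > "$demo_dir/gatk_bam_config_part2.yaml"

# Same filter as in the combine step: keep line 1 (the first ---),
# then drop every later line that starts with ---.
cat "$demo_dir"/gatk_bam_config*.yaml \
  | awk 'NR <= 1 || !/^---/' > "$demo_dir/combined_gatk_bam_config.yaml"

# Show the merged config.
cat "$demo_dir/combined_gatk_bam_config.yaml"
```

The combined file starts with a single --- header followed by the entries for both patients, which is what the pipeline needs to treat them as one cohort.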

VCF2MAF

Every SNV calling step runs vcf2maf as its final step, and vcf2maf produces very large log files. Once a run is complete, please remove or trim these files to free up space. For example:

for i in logs/run_vcf2maf_and_VEP_*/slurm/s*out; do
  # keep everything except WARNING lines in $i.trim, then delete the original log
  grep -v 'WARNING' "$i" > "$i.trim"; rm "$i"
done
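
Before deleting anything, it can be useful to check how much of each log is made up of WARNING lines. The helper below is a minimal sketch for that purpose. The name preview_trim is hypothetical, and it only reads the files it is given.

```shell
# Sketch: report how many lines of each log would be dropped by the trim above.
preview_trim() {
  for log in "$@"; do
    [ -f "$log" ] || continue                          # skip unmatched globs
    total=$(wc -l < "$log" | tr -d ' ')                # total lines in the log
    warnings=$(grep -c 'WARNING' "$log" || true)       # grep exits 1 on zero matches
    printf '%s\t%s of %s lines are WARNING\n' "$log" "$warnings" "$total"
  done
}
```

For example, preview_trim logs/run_vcf2maf_and_VEP_*/slurm/s*out prints one line per log file.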