Tips for saving space

sprokopec edited this page Jun 28, 2024 · 4 revisions

Split cohort for pre-processing and remove intermediate BAM files

If your project is large (>20 samples), you may prefer to run the preprocessing steps in batches, producing the final GATK-processed BAMs and removing the BWA intermediates after each batch to free up space. To do this, subset the data config (i.e., path/to/dna_fastq_config.yaml), keeping all samples from a given patient in the same batch, and indicate only --preprocessing in the command:

module load perl
perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/fastq_config_partN.yaml \
--preprocessing \
-c slurm \
--remove 

Since the variant calling steps are best run as a single cohort, be sure to combine all of the partial gatk_bam_config.yaml files prior to running variant calling:

cd /path/to/output/GATK/

cat gatk_bam_config*.yaml | awk 'NR <= 1 || !/^---/' > combined_gatk_bam_config.yaml

perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/combined_gatk_bam_config.yaml \
--variant_calling \
--create_report \
-c slurm \
--remove 
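To make the cat | awk step above concrete, here is a small sketch on two toy files. The file contents (patientA, sampleA and so on) are made-up placeholders, not the pipeline's real config schema, and combine_configs is a hypothetical helper name. The filter keeps the first line of the concatenated stream (the leading ---) and drops every later line that starts with ---, so the result is a single YAML document.

```shell
# Hypothetical helper wrapping the same cat | awk filter used above.
combine_configs() {
  # Keep line 1 of the concatenated stream, then drop any later '---' lines.
  cat "$@" | awk 'NR <= 1 || !/^---/'
}

# Two toy partial configs; keys and paths are illustrative only.
demo_dir=$(mktemp -d)
printf -- '---\npatientA:\n  sampleA: /path/to/A.bam\n' > "$demo_dir/gatk_bam_config_part1.yaml"
printf -- '---\npatientB:\n  sampleB: /path/to/B.bam\n' > "$demo_dir/gatk_bam_config_part2.yaml"

combined=$(combine_configs "$demo_dir"/gatk_bam_config_part*.yaml)
echo "$combined"
rm -r "$demo_dir"
```

Note that the output name combined_gatk_bam_config.yaml does not match the gatk_bam_config*.yaml glob, so re-running the command will not fold the combined file back into itself.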

VCF2MAF

All SNV calling steps run vcf2maf as a final step. The final MAF files can be compressed with gzip to reduce file size.
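As a minimal sketch of the compression step, the function below gzips every file ending in .maf under a directory. The .maf extension and the compress_mafs name are assumptions for illustration; check how your own outputs are named, and presumably only compress once no later step still needs the uncompressed file.

```shell
# Hypothetical helper: compress every uncompressed MAF below a directory.
compress_mafs() {
  # $1: top-level directory to search (e.g. a tool's output directory)
  # -type f skips directories; gzip replaces each file with <name>.maf.gz
  find "$1" -type f -name '*.maf' -exec gzip {} +
}

# Example call (placeholder path):
# compress_mafs /path/to/output
```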

VCF2MAF also produces VERY large log files that consist primarily of unnecessary WARNINGs. Once the run is complete, please remove or trim these files to free up space. Depending on the tool, this can reduce the directory size by anywhere from 5% to 95% (e.g., from 165G down to 734M for a HaplotypeCaller run). For example:

for i in logs/run_vcf2maf_and_VEP_*/slurm/s*out; do
  grep -v 'WARNING' "$i" > "$i.trim"
  mv "$i.trim" "$i"
done
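If you want to see how much space the trimming actually frees, the same grep can be wrapped in a function that reports the size before and after. This is only a sketch; trim_warnings is a hypothetical name, and the function rewrites the file in place exactly as the loop above does.

```shell
# Hypothetical helper: trim WARNING lines from one log and report the saving.
trim_warnings() {
  # $1: log file to trim in place
  before=$(( $(wc -c < "$1") ))
  # grep exits non-zero when every line contains WARNING; the empty result is still valid
  grep -v 'WARNING' "$1" > "$1.trim" || true
  mv "$1.trim" "$1"
  after=$(( $(wc -c < "$1") ))
  echo "$1: $before -> $after bytes"
}

# Example call over the same log files as above:
# for i in logs/run_vcf2maf_and_VEP_*/slurm/s*out; do trim_warnings "$i"; done
```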

HAPLOTYPECALLER

A full germline-variant detection run proceeds in 3 steps: HaplotypeCaller on each BAM file, GenotypeGVCFs on the combined results, and CPSR to extract likely pathogenic variants.

  • HaplotypeCaller outputs individual VCFs and then combines them into multi-sample VCFs. Once everything has completed, you can delete either (or both) of these, as all the significant variants can be found in the recalibrated file.
  • VCF2MAF is run on the final CPSR outputs and, optionally, as part of GenotypeGVCFs using the recalibrated variants; clean up these log files (see above).

PINDEL

Pindel produces very large log files. Once complete, please remove these files to free up space! For example:

rm logs/run_pindel_*/slurm/s*out

Convert your BAM files to CRAM format for archiving

Once all pipelines have been run, BAM files should be archived and removed from H4H. Converting your BAM files to CRAM format can reduce file size by 10-60% without any information loss. Run the command below using your GATK-processed BAM config file and the reference fasta used for your alignments, then delete your BAM files and archive your CRAM files.

module load perl
perl ~/git/pipeline-suite/scripts/convert_bam_to_cram.pl \
-d /path/to/processed_bam_config.yaml \
-r /path/to/reference/genome.fa \
-o /path/to/output/directory \
-c slurm \
--no-wait
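Before deleting a BAM, it may be worth confirming that its CRAM is intact. The sketch below is not part of pipeline-suite: it assumes samtools is available on your path, that you know which CRAM corresponds to which BAM (the script's output naming is not covered here), and verify_cram is a hypothetical helper name. It runs samtools quickcheck on the CRAM and compares record counts between the two files.

```shell
# Hypothetical helper: succeed only if the CRAM looks intact and has the
# same number of records as the original BAM.
verify_cram() {
  # $1: original BAM, $2: converted CRAM, $3: reference fasta used for alignment
  samtools quickcheck "$2" || return 1
  bam_n=$(samtools view -c "$1") || return 1
  cram_n=$(samtools view -c -T "$3" "$2") || return 1
  [ "$bam_n" -eq "$cram_n" ]
}

# Example call (placeholder paths); remove the BAM only if the check passes:
# verify_cram sample.bam sample.cram /path/to/reference/genome.fa && rm sample.bam
```

A matching record count is a sanity check rather than proof of a lossless conversion, so treat it as a minimum bar before removing the originals.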