Scripts for bioinformatics processing and analysis. A supplement to other, more useful tools such as bcftools (http://samtools.github.io/bcftools/) and vcftools (https://vcftools.github.io/index.html). Most scripts support "pipe-in" and "pipe-out".
All Perl scripts are written for Perl 5.
Add the path of the "MyPerl" folder to the PERL5LIB environment variable, e.g. "export PERL5LIB=${PATH_to_/site_perl}/MyPerl:$PERL5LIB"; alternatively, copy it to a site_perl folder already present in PERL5LIB, or simply copy it to the folder where you run the script
Simply type "perl certain_script.pl" or "perl certain_script.pl -h" for details of each option
Query, extract and process fasta sequences.
Note: some options can be combined but are applied in a fixed priority order. For example, extracting and sorting can be run in a single step, while sorting before extracting will not work; break the task into two or more steps in such situations.
-
Query a single gene
echo "${gene_id}" | fasta_process.pl --fasta all.seq --query - --rows 0 > gene.fa
-
Query multiple genes
fasta_process.pl --rows 0 --query query_genes.list --fasta all_genes.fasta > query_genes.fas
-
Extract a single region
echo "${query_region}" | fasta_process.pl --rows 0 1 2 --subset 1 2 --query - \
    --fasta genome.fasta > query.fa
-
Extract multiple sequences from the genome
fasta_process.pl --rows 0 1 2 --query regions.list --fasta genome.fasta --subset 1 2 > query_regions.fas
-
Extract flanking sequence of each variant site and replace the nucleotide at the variant site with the alternative allele
perl -ne 'next if (/\#/); my @line = split /\t/; my $ex_start = $line[1]-75; my $ex_end = $line[1]+75; print "$line[0]\t$ex_start\t$ex_end\t76\t$line[3]\t$line[4]\n";' example.vcf | \
    fasta_process.pl --query - --fasta genome.fasta --rows 0 1 2 3 4 5 --subset 1 2 \
    --replace 3,4,5 > vars.ex75.replaced.fas
-
Count triplet contents
fasta_process.pl --fasta genome.fasta --count-nucl triplet > triplets.csv
-
Count triplet contents in query sequences
fasta_process.pl --fasta genome.fasta --query query.regions \
    --rows 1 3 4 0 2 --subset 3 4 --count-nucl triplet > query.triplets.csv
-
Extract di-/tri-nucleotide contents in tabular format
awk 'BEGIN{OFS="\t"} !/\#/ {print $1,$2-1,$2;}' example.vcf | \
    fasta_process.pl --query - --fasta genome.fasta \
    --rows 0 1 2 --subset 1 2 --out-format tabular | sed 's/\_/\t/g' > nt2-1.csv
-
Translate nucleotides to proteins and remove final "*"
fasta_process.pl --fasta cds.fasta --translate --wordwrap 60 | \
    perl -ne 's/\*$//; next if(/^\s+$/); print;' > protein.fasta
-
Split multiple-sequences file into multiple single-sequence files
fasta_process.pl --fasta multiple.fasta --split
-
Sort a fasta file by a user-defined order; the fasta file can also be given from a pipe
cat *.fasta | fasta_process.pl --fasta - --out-order orders.list > sorted.fasta
-
Reverse complement sequences
fasta_process.pl --fasta example.fasta --reverse --complement > rc.fasta
-
Filter a fasta file by length
fasta_process.pl --fasta example.fasta --lower 100 --upper 2000 > len100_2000.fasta
-
Filter a fasta file by id
fasta_process.pl --fasta example.fasta --match "scaffold|contig" > chromosome.fasta
Rename sequence ids in a fasta format file
-
Rename sequence ids according to a list
rename_fasta.pl --fasta origin.fas --word-wrap 80 --refer rename_ids.txt > renamed.fasta
-
Example of the rename_ids.txt:

#old_id    new_id
seq1       human_seq1
seq2       human_seq2
...
Convert fastq encodings
-
Convert Sanger encoding to Illumina 1.5+
zcat reads.fq.gz | convert_fastq_quality.pl -i - -c sanger2illumina | gzip -c > reads.converted.fq.gz
VCF format file related processing.
This script does quite a lot of things, including filtering, combining, clustering, etc.; it seems I put too many functions here ... However, since the VCF format generated by different callers varies, this script was mainly tailored for VCF files generated by GATK (UnifiedGenotyper or HaplotypeCaller, http://www.broadinstitute.org/gatk/). Some functions require the AD (allele depth) field, so it may not perform very well on VCF files generated by other callers.
-
Filter by depth; this only marks samples failing the depth criteria as missing, and does not filter the whole locus
vcf_process.pl --vcf example.vcf.gz --min-sample-depth 10 --max-sample-depth 80 > depth_flt.vcf
-
Filter by depth and by the number of missing allele calls: depth is checked first, then all missing calls are counted, including those failing the depth criteria
vcf_process.pl --vcf example.vcf.gz --min-sample-depth 10 --max-sample-depth 80 --max-missing 8 > flt.vcf
-
Specify the depth for each sample in a file; the overall criteria still apply to samples not specified there
vcf_process.pl --vcf example.vcf.gz --min-sample-depth 10 --max-sample-depth 80 \
    --depth-file sample_depth.txt > depth_flt.vcf
-
Specify some samples as naturally homozygous (e.g. inbred lines); all others are treated as heterozygous. Filter heterozygous sites in homozygous samples (denoted "pseudo-heterozygosity" here, mostly arising from mapping errors due to duplications)
vcf_process.pl --vcf example.vcf.gz --homo-samples sample1 sample2 --max-pseudo-het 0 > flt.vcf
-
Filter by reference/non-reference sample counts, distinguishing homozygous and heterozygous samples
vcf_process.pl --vcf example.vcf.gz --homo-samples sample1 sample2 \
    --min-hom-ref 5 --min-het-ref 4 --max-hom-missing 5 > flt.vcf
-
Screen out rare alleles (alleles with a sample frequency less than the specified value)
vcf_process.pl --vcf example.vcf.gz --rare-only 3 > rare.vcf
Note: some filtering criteria have priority orders; do check the results after filtering!
vcf_process.pl uses the non-reference allele depth ratio (NRADR, reads supporting the reference allele / all reads covering the site) to test whether the initial genotyping was accurate; genotypes failing these criteria can be re-genotyped or set as missing. This requires the AD field (the NR,NV tags generated by callers like Platypus are also supported, but less tested)
-
For homozygous samples, no heterozygous genotypes are expected, so the NRADR should be near zero (reference homozygous) or near 100% (alternative homozygous); allowing for sequencing errors, a conservative range could be 5%~95% for high-coverage data (above 20x)
vcf_process.pl --vcf hc.vcf.gz --default-sample-type hom --regenotype-hom 0.05 \
    --gt-diff-as-missing > genotype.flt.vcf
-
For heterozygous samples, two values are needed: one for homozygous genotypes (same as used for homozygous samples), and another for heterozygous genotypes (usually around 50% is expected, though newly arisen mutations could vary); 30%~70% may be acceptable for a reliable heterozygous call
vcf_process.pl --vcf hc.vcf.gz --regenotype-het 0.05,0.3 --gt-diff-as-missing > genotype.flt.vcf
-
With both homozygous and heterozygous samples
vcf_process.pl --vcf hc.vcf.gz --homo-samples sample1 sample2 --regenotype-hom 0.05 \
    --regenotype-het 0.05,0.3 --gt-diff-as-missing > genotype.flt.vcf
-
Collect variant metrics, mainly designed for GATK callers
vcf_process.pl --vcf hc.vcf.gz --out-metrics \
    --metrics DP MQ MQ0 BaseQRankSum ClippingRankSum MQRankSum ReadPosRankSum InbreedingCoeff FS SOR \
    > hc.metrics.csv
-
Metrics after filtering
vcf_process.pl --vcf hc.vcf.gz \
    --quality 50 --min-alleles 2 --max-alleles 2 --min-sample-depth 10 --max-missing 14 | \
    vcf_process.pl --vcf - --out-metrics \
    --metrics DP MQ MQ0 BaseQRankSum ClippingRankSum MQRankSum ReadPosRankSum InbreedingCoeff FS SOR \
    > hc.flt.metrics.csv
-
Collect genotype infos; requires the AD field
vcf_process.pl --vcf hc.vcf.gz \
    --quality 50 --stats-only --out-genotype-stats --ref-depth --var-type snp > hc.snp.gts.csv
-
Generate statistics for each locus
vcf_process.pl --vcf snp.vcf.gz --stats-only --out-locus-stats > snp.locus_stats.csv
-
Count base changes for all bi-allelic heterozygous sites
vcf_process.pl --vcf example.vcf.gz \
    --min-alleles 2 --max-alleles 2 --base-changes --GT-types "0/1" > het.changes.csv
-
Generate statistics for distances between adjacent variants
vcf_process.pl --vcf snp.vcf.gz --stat-var-dist --source-tag GT > snp.dist.csv
-
Summary of results generated from GATK DiagnoseTargets (https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_diagnostics_diagnosetargets_DiagnoseTargets.php)
vcf_process.pl --vcf diagnose.vcf --sum-diagnose > diagnose.stats.csv
-
Get variant sequence context (experimental)
vcf_process.pl --vcf snp.vcf.gz --check-context --fasta genome.fasta > snp.context.vcf
Notes for context checking:
1) Only bi-allelic loci are supported when analyzing sequence context; multi-allelic loci need to be
broken down first;
2) The extension differs for SNPs and INDELs, e.g. upstream 5bp and downstream 5bp for SNPs,
while only downstream 10bp for INDELs; the INDELs are thus assumed to be already left-aligned
Use vcf_process.pl to cluster markers (genetically linked regions). The clustering function is used to identify genome blocks from a certain type of markers. This is done by first searching for reliable seeds (segments with consecutive markers of the same type that pass the criteria, the "seeding" stage), then merging adjacent seeds of the same type into blocks (the "extension" stage); the boundary between blocks of different types is determined from the markers present between the two blocks, or set to the middle point when no more markers are present. The "seeding-and-extension" algorithm was borrowed from "Wijnker, E. et al. The genomic landscape of meiotic crossovers and gene conversions in Arabidopsis thaliana. eLife 2, e01426 (2013)", where it was used to identify recombinant blocks.
-
Example of the input vcf file. The source type could be genotypes (using the GT field) or other user-defined types (e.g. ancestral status); an example of markers.vcf.gz could be:

#CHROM  POS   ID  REF  ALT  QUAL   FILTER  INFO  FORMAT  sample
chr01   161   .   C    A    54.12  .       .     GT:SC   0/0:A/A
chr01   431   .   C    T    44.2   .       .     GT:SC   0/0:A/A
chr01   1641  .   G    A    64.15  .       .     GT:SC   1/1:B/B
chr01   4165  .   C    A    34.31  .       .     GT:SC   1/1:B/B
...
-
No clustering; just output blocks of consecutive markers with the same source type
vcf_process.pl --vcf markers.vcf.gz --out-blocks --source-tag "SC" > markers.blocks.csv
-
Clustering
vcf_process.pl --vcf markers.vcf.gz --out-blocks --source-tag "SC" --fill-gaps \
    --min-frag-length 10000 --min-frag-markers 25 > markers.blocks.l10km25.csv
-
Use paintGenomeBlocks.pl to visually compare two results
awk 'BEGIN{OFS="\t";} !/\#/ {$1 = $1"-original"; print;}' markers.blocks.csv | \
    cat markers.blocks.l10km25.csv - | \
    paintGenomeBlocks.pl --input - \
    --width 1600 --height 3000 --thickness 10 --chrom-distance 20 --block-distance 2 \
    --output markers.blocks.cmp --length reference_genome.fasta.fai \
    --colors "type1:strong_red2;B:strong_blue2" --sort-blocks sample-original sample --format png
-
Combine two vcf files according to the "CHROM" and "POS" fields
vcf_process.pl --vcf hc.vcf --secondary-vcf ug.vcf --combine-rows 0 1 \
    --primary-tag HC --secondary-tag UG --intersect-tag "UG+HC" > combined.vcf
-
Combine two vcf files according to the "CHROM", "POS" and "ALT" fields; if the "ALT" fields differ, there will be two records in the combined vcf file
vcf_process.pl --vcf hc.vcf --secondary-vcf ug.vcf --combine-rows 0 1 4 \
    --primary-tag HC --secondary-tag UG --intersect-tag "UG+HC" > combined.vcf
-
Combine two vcf files according to the "CHROM" and "POS" fields, but if the "ALT" fields differ, write the "ALT" info of the secondary file into the "SDIFF" field
vcf_process.pl --vcf hc.vcf --secondary-vcf ug.vcf --combine-rows 0 1 --compare-row 4 \
    --primary-tag HC --secondary-tag UG --intersect-tag "UG+HC" > combined.vcf
Downsampling vcf file.
-
Randomly pick 100 sites from a VCF file
downsample_vcf.pl -i input.vcf.gz -s 100 -t 100 -o rand
Extract sequences (mRNA or CDS) from gff3 file to fasta file.
-
Extract mRNA sequences
gff2fasta.pl --gff example.gff --seqs example.fasta --features mRNA > example.mRNA.fasta
-
Extract CDS sequences
gff2fasta.pl --gff example.gff --seqs example.fasta --features cds > example.cds.fasta
Convert gff to tabular format
Convert intervals to bed format, e.g. chr01:1-1000 -> chr01 0 1000
Extract masked (lowercase and Ns) regions in bed format, quite slow ...
Convert SAM/BAM to FASTA/Q format
-
Extract unmapped reads from a BAM file and send them to blast
bam2fasta.pl -b example.bam -s "-f 4" | \
    blastall -p blastn -d genome.fasta -e 1e-70 -a 1 -m 8 -F F -n T -o unmapped.blast.csv
Convert SAM/BAM to FASTQ format
-
Extract paired reads
sam2fastq.pl -i example.sam -o prefix
Extract reads in pairs from SAM/BAM file
-
Extract reads mapped to region "chr03 4000 5000"
echo "chr03 4000 5000" | \
    extract_bam_pairs.pl --input - --extend 2000 --samtools "-f2 -F3852 -q20" \
    --min-insert 450 --max-insert 550 --max-clipping 2 --format fastq \
    --bam sample1.bam sample2.bam --output extracted
Note: differences between bam2fasta.pl, sam2fastq.pl and extract_bam_pairs.pl:
bam2fasta.pl ignores pairing info;
sam2fastq.pl converts reads in pairs;
extract_bam_pairs.pl extracts all reads overlapping certain regions and searches for their mates.
They could actually be integrated, but why three scripts? Because I forgot about the previous one each time I started writing a new one, and finally I got three ...
Extract multiple fragments of a sequence and join the fragments into a new sequence
-
Extract Pfam domains (use protein positions to extract CDS sequences); expected format for "pfam_domain_pos.csv": "seq_id chr1:start1-end1;chr1:start2-end2"
extract_split_seqs.pl --query pfam_domain_pos.csv --convert-pos pro2cds --fasta cds.fasta > extract.fasta
Calculate within- or between-group diversities from a vcf file, an alternative to "vcftools --window-pi".
-
Calculate pairwise diversity in non-overlapping windows
calc_vcf_diversity.pl --vcf test.vcf.gz --window-size 500000 --no-group --no-within > pairs_divs.csv
-
Calculate pairwise diversity in specified regions
calc_vcf_diversity.pl --vcf test.vcf.gz --intervals test.regions --no-within --no-group > pairs_divs.csv
-
Calculate within and between group diversity in non-overlapping windows
calc_vcf_diversity.pl --vcf test.vcf.gz --window-size 500000 \
    --group test.sample_panel --all-pairs --threads 2 > group_divs.csv
-
Calculate within and between group diversity in specified regions
calc_vcf_diversity.pl --vcf test.vcf.gz --intervals test.regions \
    --group test.sample_panel --all-pairs --threads 2 > group_divs.csv
Note: values calculated directly from a VCF file tend to underestimate the true diversity, as the divisor is all sites rather than informative sites; one workaround is a three-step calculation that also obtains the number of informative sites
-
First, count all informative sites (this requires that the VCF file also contains the non-variant site infos)
count_vcf_pairs.pl --vcf sites.vcf.gz --intervals example.regions > pair_infos.csv
-
Second, calculate the differences between each pair
calc_vcf_diversity.pl --vcf snp.vcf.gz --no-group --no-within --intervals example.regions > pair_diffs.csv
-
Finally, obtain the adjusted diversity as diffs / informative_sites
calc_pop_divs.pl --query pair_diffs.csv --subject pair_infos.csv --min-info-perc 50 > pairs_divs.mean.csv
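The final adjustment is plain arithmetic; as a toy example with a made-up column layout (pair id, differences, informative sites), the adjusted diversity for one pair would be:

```shell
# Toy illustration (hypothetical column layout, not the calc_pop_divs.pl format):
# divide pairwise differences by informative sites instead of window length.
printf 'sample1-sample2\t12\t4000\n' | \
    awk 'BEGIN{OFS="\t"} {print $1, $2 / $3}'   # 12 diffs / 4000 sites = 0.003
```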
Companion script to calc_vcf_diversity.pl to count informative sites
Post-calculation script of results from calc_vcf_diversity.pl and count_vcf_pairs.pl
Count by sliding windows
-
Count in 100kb windows, excluding scaffolds
slide_windows_count.pl -i example.csv -w 100000 -l genome.fa.fai -e scaffold > example.100k.csv
-
Count with different types
slide_windows_count.pl -i example.csv -w 100000 -l genome.fa.fai -t > example.100k.csv
Note: this script uses two different algorithms for counting non-overlapping vs. overlapping (when a step size is specified) windows. The first algorithm is relatively fast and does not require much memory, as it counts on the fly while reading the file; counting overlapping windows, however, requires first reading the entire file, and the subsequent counting relies on a grep-like lookup that is not very efficient, so it can be slow and memory-intensive on large files.
Plot blocks along each chromosome.
-
Plot recombinant blocks with centromeres
cat centromere.csv blocks.csv | \
    paintGenomeBlocks.pl --input - --rows 0 1 2 3 4 --split-pairs \
    --width 1600 --height 12000 --thickness 10 --chrom-distance 20 --block-distance 2 \
    --output blocks --length genome.fa.fai --format svg \
    --colors "A:strong_red2;B:strong_blue2;Centromere:grey" --sort-blocks Centromere sample1 sample2
-
Plot variants (just extend it to a "block")
awk 'BEGIN{OFS="\t"} !/\#/ {print "Example",$1,"SNP",$2-10,$2+10;}' example.vcf | \
    paintGenomeBlocks.pl --input - \
    --width 800 --height 600 --thickness 20 --chrom-distance 20 --block-distance 2 \
    --output example.snp --length genome.fa.fai --colors "SNP:red"
Plot distributions with ggplot2
-
Prepare examples
bgzip -dc example1.vcf.gz | slide_windows_count.pl --input - --window-size 200000 > example1.w200k.csv
bgzip -dc example2.vcf.gz | slide_windows_count.pl --input - --window-size 200000 > example2.w200k.csv
-
Reformat
cat example1.w200k.csv | \
    awk 'BEGIN{OFS="\t"} !/##/ {if(/#/){print "Sample\tCHROM\tINTERVAL\tBIN_ID\tCOUNT";} else{print "Example1",$1,$2,$3,$4;}}' \
    > example.w200k.txt
cat example2.w200k.csv | \
    awk 'BEGIN{OFS="\t"} !/#/ {print "Example2",$1,$2,$3,$4;}' \
    >> example.w200k.txt
-
Plot
Rscript ggplot-windows-count.R example.w200k.txt example.w200k.tiff
Alternative to ggplot-windows-count.R, designed for a different input format
-
Prepare examples
echo -e "chr01\t122\tSNP\nchr01\t142\tSNP\nchr01\t254\tINDEL\n" | \
    slide_windows_count.pl -i - -w 500 -t | cut -f 4 --complement | \
    perl -ne 'next if (/^\#\#/); s/\#//g; print;' \
    > counts.w500.txt
-
Plot
Rscript ggplot-windows-multi.R counts.w500.txt counts.w500.tiff
Linear fit with ggplot2
Scatter plot
Mapping with clustalw2 (http://www.clustal.org/clustal2/) or muscle (http://www.drive5.com/muscle/).
Align sequences to a reference sequence; this is done in two steps:
Step 1: align the first sequence to the reference sequence and obtain an expanded reference sequence with gaps inserted; then align the second sequence to the new reference sequence; iterate this process to generate a reference sequence expanded to cover all query sequences;
Step 2: build a consensus sequence with nucleotides replaced in the reference sequence, and re-align all sequences to the consensus sequence.
-
Align all subsequent sequences to the first sequence using clustalw2
reference_align.pl -i example.fa > example.aln.fas
-
Align with muscle
reference_align.pl -i example.fa -a muscle > example.aln.fas
Wrapper to run PAML (baseml) in parallel
-
Run all models in one shot
parallel_baseml.pl --input aln.fas --model 0 1 2 3 4 5 6 7 8 --align-length 100 > aln.csv
Wrapper to run PAML (codeml) in parallel
-
Run with multiple threads
parallel_codeml.pl --ref ref.tbl --samples samples.tbl --query query.tsv --threads 6 > query.codeml.csv
Convert ASCII characters to numbers
-
Mainly used to inspect Fastq qualities
ascii2num "AAAA"
Mapping records from subject file to query file
-
Map records based on the 1st column, merge duplicates to a single line
map_records.pl --rows1 0 --rows2 0 --merge-dups --query query.list --subject subject.txt > mapped.csv
These scripts are free software; you can redistribute and/or modify them under the same terms as Perl itself.