title | author | date |
---|---|---|
UCE characterization and concatenation |
CR Cardenas |
2023.09 |
- Check Synteny R functions and your own python/awk scripts for pairwise comparison of probe identities
- Clean up files
- organize github directory
- check for grammar & spelling
The goal is to map probes to genomes in order to realize the total overalp of all probes used n the Adephaga data for concatenation/merging multiple UCEs (as in Van Dam et al 2021) on the same gene and UCE characterization. In general, folks will only have one probeset to use. But because I am integrating anchored hybrid enrichment data, Adephaga UCE probesAdephaga UCE probes, and the original Coleoptera UCE probes I will need to create a merged dataset (nameed joined).
Two important assumptions being made about the genome and genes being used:
- Intergenic sequences close to genes are not being considered as promotors. This is mainly due to this information being unknown.
- There is no alternative splicing or overlapping genes (ex: overlaping genes).
- identify what probes are genetic or intergenic
- identify the overlap/intersect of probesets used between datasets
- create a new probeset that integrates all probes for use in Phyluce
- create a list of probes that should be concatenated in a partition file for phylogenetic analysis
- create a script to integrate it based on an existing partition file (e.g., output of Phyluce)
This pipeline will mainly use the Pterostichus madidus genome for testing and finalizing the workflow, but others will be used. Genomes with annotations (general feature files, GFF) can be found at the Darwin Tree of Life ensembl.
Ensure you have a GFF3 format. For details on GFF format, see here.
!!! genomes used are likely to change by final analysis, likely just the ones with GFF files Neb.brevicolis, Oph.ardosiacus, and Pte.madidus
Available genomes and their names names
code | name | ID |
---|---|---|
agrPla1 | Agrilus planipennis | GCA_000699045.1 |
* ampIns | Amphizoa insolens | DNA3784 |
anoGla1 | Anoplophora glabripennis | GCA_000390285.1 |
* bemHap1 | Bembidion haplogonum | DNA2544 |
* chlSer1 | Chlaenius sericeus | DNA4821 |
denPon1 | Dendroctonus ponderosae | GCA_000355655.1 |
lepDec | Leptinotarsa decemlineata | GCA_000500325.1 |
lioTuu1 | Lionepha | DNA3782 |
menMol1 | Mengenilla moldrzyki | GCA_000281935.1 |
! nebBrevi1 | Nebria brevicolis | GCA_944738965.1 |
! nebSali1 | Nebria salina | GCA_944039245.1 |
* omoHam1 | Omoglymmius hamatus | DNA3783 |
ontTau1 | Onthophagus taurus | GCA_000648695.1 |
! ophArdo1 | Ophonus ardosiacus | GCA_943142095.1 |
! pteMad2 | Pterostichus madidus | GCA_911728475.2 |
* pteMel1 | Pterostichus melenarius | DNA3787 |
* traGib1 | Trachypachus insolens | DNA3786 |
triCas2 | Tribolium castaneum | GCA_000002335.2 |
(* indicates Adephaga reference genome; ! indicates a genome feature file)
probe file names:
- vasilikopoulos_etal_ahe_probes.fasta
- Adephaga_2.9Kv1_UCE-Probes.fasta
- Coleoptera-UCE-1.1K-v1.fasta
Recomended installation procedure:
conda create --name characterization
conda activate characterization
conda install -c bioconda bedtools blast bwa samtools seqkit
conda install -c anaconda natsort
This pipeline should be able to run on a personal laptop (windows with linux subsystem and linux, uncertain about mac) with sufficient storage, memory and CPU available. Alternatively, you can .... [add json information]
The *.gff
and *.fna
files from ensemble or genbank do not immediately cooperate with this pipeline and require data processing. This should be done in your 0data directory with those files.
Steps to process data:
- clean up fna headers
- sort
*.gff
files - create genome file
- unrwap probe fasta files that need it
index.sh: line 29: 72366 Segmentation fault (core dumped)
BECAUSE YOU LOST ALL YOUR SEQUENCE DATA WITH YOU FASTA FILES WHEN EDITING THE HEADERS! CHECK THIS STEP AGAIN DINGUS!
check the scaffold/chromosome naming conventions in your *.fna
and *.gff3
files and adjust for replacement.
awk 'BEGIN { FS=" " } { if (/^>/) { print $1 } else { print $0 }}' genome.fa > gen1.fna
tr -s " " \\t < Pterostichus_madidus-GCA_911728475.2-2022_03-genes.gff | awk -F"\t" '{ if ($1~"##sequence-region") print $2 "\t" $4}' | natsort > pterMadi2.genomefile
bedtools sort -i Pterostichus_madidus-GCA_911728475.2-2022_03-genes.gff -g pterMadi2.genomefile > pterMadi2.sorted.gff
some probe files are wrapped, this pipeline assumes you have an unwrapped fasta file (see coleoptera UCE)
seqkit seq -w 0 Coleoptera-UCE-1.1K-v1.fasta > Coleoptera-UCE-1.1k.fasta
to simplify the pipeline and naming conventions I am shortening the probe files as well
.
├── 0data
│ ├── genomes
│ │ ├── pteMadi2.fna
│ │ ├── pteMadi2.genomefile
│ │ ├── pteMadi2.sorted.gff
│ │ ├── Pterostichus_madidus-GCA_911728475.2-2022_03-genes.gff
│ │ └── Pterostichus_madidus-GCA_911728475.2-softmasked.fa
│ └── probes
│ ├── Adephaga_2.9Kv1_UCE-Probes.fasta
│ ├── Adephaga.fasta
│ ├── Coleoptera.fasta
│ ├── Coleoptera-UCE-1.1k.fasta
│ ├── Coleoptera-UCE-1.1K-v1.fasta
│ ├── Vasil2021_AHE_probes.fasta
│ └── Vasil.fasta
└── workflow.md
The *.gff
files do not have all genome features of interest so they need to be identified, this can be done with simple awk
and bedtools
commands.
This script takes the users *.gff
file, uses the input name to seperate genic, intergenic, exon, and intron features from the sorted *.gff
file. Ensure your naming convention is consistent because it looks for that file in your ./0data/genomes
directory not in your currnet working diectory. (e.g., if your input *.gff
file is named pteMadi2.sorted.gff
your sorted.gff file pteMadi2 will be added too all output files based on the first field).
To run the following script create a directory 1genefeatures in the same directory as 0data:
bash genefeatures.sh pteMadi1.sorted.gff
this script can also be run with a for loop with a list if you have multiple gff files:
for i in $(cat mylist.txt); do
bash genefeatures.sh $i
done
#!/bin/bash
# take user input
GFFINPUT=$1
GFFNAME=$(echo $GFFINPUT | cut -d "." -f 1)
#activate environment
source /local/anaconda3/bin/activate
conda activate characterization
# extract gene features using awk into gff and bed format
printf "extracting gene features of ${GFFINPUT}\n"
awk '$3 == "gene" {print $0}' '0data/genomes/'${GFFNAME}'.sorted.gff' \
> '1genefeatures/'${GFFNAME}'.sorted.genes.gff'
awk '$3 == "gene" {print $1 "\t" $4-1 "\t" $5 "\t""gene"}' '1genefeatures/'${GFFNAME}'.sorted.genes.gff' \
> '1genefeatures/'${GFFNAME}'.sorted.genes.bed'
# extract intergenic features
printf "extracting intergenic features of ${GFFINPUT}\n"
bedtools complement -i '1genefeatures/'${GFFNAME}'.sorted.genes.gff' \
-g '0data/genomes/'${GFFNAME}'.genomefile' \
> '1genefeatures/'${GFFNAME}'.sorted.intergenic.tmp' # this line gives coordinates for all intergenic regions
awk '{print $0 "\t" "intergenic"}' '1genefeatures/'${GFFNAME}'.sorted.intergenic.tmp' \
> '1genefeatures/'${GFFNAME}'.sorted.intergenic.bed' # create a bedfile of just intergenic regions
awk '{print $1 "\t" "bedtools_complement" "\t" "intergenic" "\t" $2+1 "\t" $3 "\t" "." "\t" "." "\t" "." "\t" "description=intergenic-region-IDd-with-bedtools"}' '1genefeatures/'${GFFNAME}'.sorted.intergenic.bed' \
> '1genefeatures/'${GFFNAME}'.sorted.intergenic.gff' # create gff file of intergenic regions
# extract exon features
printf "extracting exon features of ${GFFINPUT}\n"
awk '$3 == "exon" {print $1 "\t" $4-1 "\t" $5 "\t" "exon"}' '0data/genomes/'${GFFNAME}'.sorted.gff' \
> '1genefeatures/'${GFFNAME}'.sorted.exons.bed' # generate a simple exon bed file, watch the count! GFF is 1-count and BED is 0-count
awk '$3 == "exon" {print $0}' '0data/genomes/'${GFFNAME}'.sorted.gff' \
> '1genefeatures/'${GFFNAME}'.sorted.exons.gff' # this ensures that we have a gff file to give intersect file exon names for a later step
# join and sort intergenic and exonic regions to extract intron
printf "extracting intron features of ${GFFINPUT}\n"
bedtools sort -i <(cat '1genefeatures/'${GFFNAME}'.sorted.intergenic.bed' '1genefeatures/'${GFFNAME}'.sorted.exons.bed') \
-g '0data/genomes/'${GFFNAME}'.genomefile' \
> '1genefeatures/'${GFFNAME}'.sorted.intergenic-and-exons.bed.tmp' # this sorts the input of the concatenated features, allows introns to be identified with bedtools
bedtools complement -i '1genefeatures/'${GFFNAME}'.sorted.intergenic-and-exons.bed.tmp' \
-g '0data/genomes/'${GFFNAME}'.genomefile' \
> '1genefeatures/'${GFFNAME}'.sorted.introns.bed' # create complement bed file
awk '{print $1 "\t" "bedtools_complement" "\t" "intron" "\t" $2+1 "\t" $3 "\t" "." "\t" "." "\t" "." "\t" "description=intron-region-IDd-with-bedtools"}' '1genefeatures/'${GFFNAME}'.sorted.introns.bed' \
> '1genefeatures/'${GFFNAME}'.sorted.introns.gff' # create gff file of intronic features
# create new gff files
printf "creating new GFF file for ${GFFNAME}\n"
cat '1genefeatures/'${GFFNAME}'.sorted.introns.gff' \
'1genefeatures/'${GFFNAME}'.sorted.exons.gff' \
'1genefeatures/'${GFFNAME}'.sorted.intergenic.gff' \
'1genefeatures/'${GFFNAME}'.sorted.genes.gff' \
> '1genefeatures/'${GFFNAME}'_simple.gff.tmp'
bedtools sort -i '1genefeatures/'${GFFNAME}'_simple.gff.tmp' \
-g '0data/genomes/'${GFFNAME}'.genomefile' \
> '1genefeatures/'${GFFNAME}'_simple.sorted.gff'
cat '0data/genomes/'${GFFNAME}'.sorted.gff' \
'1genefeatures/'${GFFNAME}'.sorted.introns.gff' \
> '1genefeatures/'${GFFNAME}'_detailed.gff.tmp'
bedtools sort -i '1genefeatures/'${GFFNAME}'_detailed.gff.tmp' \
-g '0data/genomes/'${GFFNAME}'.genomefile' \
> '1genefeatures/'${GFFNAME}'_detailed.sorted.gff'
# clean up tmp files
rm 1genefeatures/*.tmp
printf "output files in 1genefeatures:\n\t ${GFFNAME}_simple.sorted.gff is a new gff file with just intergenic, genic, exon, and intron information\n\t ${GFFNAME}_detailed.sorted.gff is a new gff file with intron inforation as well as original CDS, mrna, and 3' + 5' UTR info\n\n"
- Edit this shell script to properlly call your environment. On my server using
$ source genefeatures.sh
or$ bash -i genefeatures.sh
will still run this script. But it may throw a conda error, I'm not sure how to fix this. Keep this in mind if you run into errors with further scripts.
.
├── 0data
│ ├── genomes
│ │ ├── pteMadi2.fna
│ │ ├── pteMadi2.genomefile
| | └── ...
│ └── probes
│ ├── Adephaga_2.9Kv1_UCE-Probes.fasta
│ ├── Adephaga.fasta
| └── ...
├── 1genefeatures
│ ├── pteMadi2_detailed.sorted.gff
│ ├── pteMadi2_simple.sorted.gff
| └── ...
├── genefeatures.sh
└── workflow.md
Depending on the detail needed for downstream analysis or description you can use the simpe or detailed gff files to map your probes too. For simplicity, I will be using the *_simple.sorted.gff
file. There will be a few shell scripts that talk to each other that produce what probes intersect with gene features in the genome as well as what probes dont intersect.
To run this script first make two directories 2mapping
and 3intersect
in the same directory as 0data
and 1genefeatures
You can use a single genome and your probes of interest:
bash run.sh genome.fna probes1.fasta probes2.fasta probes3.fasta
Or you can use a list of the genomes you are interested in.Importantly, you need to modify run.sh
to include more than two probes!
for i in $(cat genomes\.list); do
bash run.sh $i \
probes1.fasta \
probes2.fasta \
probes3.fasta;
done
This script runs the pipeline, calling the index.sh
and then sam2bed.sh
scripts. These files should be in the same directory as the run.sh
. Ensure you input the genome fna file
#!/bin/bash
source /local/anaconda3/bin/activate
conda activate characterization
# turn off N probes (e.g., PROBE1:PROBEN) depending on probesets used for characterization
GEN=$1
PROBE1=$2
PROBE2=$3
PROBE3=$4
printf "running index script\n"
# index probes against genomes
bash index.sh $GEN $PROBE1;
bash index.sh $GEN $PROBE2;
bash index.sh $GEN $PROBE3;
printf "running sam2bed script\n"
# convert sam file to bam file
bash sam2bed.sh $GEN $PROBE1;
bash sam2bed.sh $GEN $PROBE2;
bash sam2bed.sh $GEN $PROBE3;
printf "ready to run intersect.sh\n"
This script first indexes the genome and maps using BWA and then does the same using using blastn.
#!/bin/bash
source /local/anaconda3/bin/activate
conda activate characterization
GEN=$(echo $1 | cut -d . -f 1)
PROBE=$(echo $2 | cut -d . -f 1)
# use BWA to map using mem
# check for indexed genome files and if none
# are present index them and then use
if [ -f '2mapping/'$PROBE'-'$GEN'.sam' ]; then
printf "$PROBE already mapped to -$GEN \n\n";
elif [ -f '0data/genomes/'$GEN'.fna.pac' ]; then
printf "mapping $PROBE to $GEN \n";
bwa mem '0data/genomes/'$GEN'.fna' '0data/probes/'$PROBE'.fasta' \
> '2mapping/bwa/'$PROBE'-'$GEN'.sam';
else
printf "indexing $GEN \n";
bwa index '0data/genomes/'$GEN'.fna';
printf "\n mapping $PROBE to $GEN \n";
bwa mem '0data/genomes/'$GEN'.fna' '0data/probes/'$PROBE'.fasta' \
> '2mapping/'$PROBE'-'$GEN'.sam';
printf " finished bwa indexing, output $PROBE-$GEN.sam in 2mapping directory \n\n";
fi
mkdir blast
The indexed *.sam
is converted to a bed file and their is a check for probes with multiple positions in the genome; potential paralogs. The output then returns a final *.bed
file, summary of mapping, and list of those probes/loci which should be excluded based on the sam file. The script creates a new probeset fasta file if there are probes (and the entire loci) to remove and re-maps them.
#!/bin/bash
source /local/anaconda3/bin/activate
conda activate characterization
GEN=$(echo $1 | cut -d . -f 1)
PROBE=$(echo $2 | cut -d . -f 1)
# convert sam to bam
if [ -f '2mapping/'$PROBE'-'$GEN'.mem.sorted.mapped.bed' ]; then
printf "BED file already created for $PROBE-$GEN\n"
else
printf "converting to bam file\n"
samtools view -h -b -S '2mapping/'$PROBE-$GEN'.sam' \
> '2mapping/'$PROBE'-'$GEN'.mem.bam'
printf "sorting reads\n"
# sort bam file
samtools sort '2mapping/'$PROBE'-'$GEN'.mem.bam' -o '2mapping/'$PROBE'-'$GEN'.mem.sorted.bam'
printf "extracting $PROBE probes that map to $GEN\n"
# extract sequences that map against the genome
samtools view -b -F 4 '2mapping/'$PROBE'-'$GEN'.mem.sorted.bam' \
> '2mapping/'$PROBE'-'$GEN'.mem.sorted.mapped.bam'
printf "creating BED file\n"
# create bed file
bedtools bamtobed -i '2mapping/'$PROBE-$GEN.mem.sorted.mapped.bam \
> '2mapping/'$PROBE'-'$GEN'.mem.sorted.mapped.bed'
printf "output $PROBE-$GEN.mem.sorted.mapped.bed in 2mapping/ directory\n\n"
fi
if [ -f '2mapping/'$PROBE'-'$GEN'.mem.sorted.mapped.bam' ]; then
rm 2mapping/*.bam
printf "Checking for potential loci/probes map to different regions in $PROBE-$GEN.sam\n"
else
printf "Checking for potential loci/probes map to different regions in $PROBE-$GEN.sam\n"
fi
# create simple summary file, create variables and push to variables or new file for: number of UCEs/Loci maped vs UCEs/Loci total, number of probes mapping to different regions.
#number of mapped loci
awk '{if ($3 != "*") print $0}' '2mapping/'$PROBE'-'$GEN'.sam' | \
grep -v "@" | \
awk '{print $1}' | \
cut -d "p" -f 1 | \
sort -u | \
grep "." -c \
> '2mapping/'tmp.uce.count
printf "Total number of $PROBE loci mapped against $GEN\t" && cat 2mapping/tmp.uce.count
# number of multiple mappings
# count multi mapping
cat '2mapping/'$PROBE'-'$GEN'.sam' |
grep -v "@" | \
grep "SA:" -c \
> 2mapping/tmp.multi-uce.count
printf "Total $PROBE probes with multiple mappings in $GEN\t" && cat 2mapping/tmp.multi-uce.count
# store names
cat '2mapping/'$PROBE'-'$GEN'.sam' | grep -v "@" | grep "SA:" > 2mapping/tmp.multi
# count UCE/Loci names
awk '{print $1}' 2mapping/tmp.multi | cut -d "p" -f 1 | sort -u | wc -l > 2mapping/tmp.multi-uce.count
# list of uce to remove
awk '{print $1}' 2mapping/tmp.multi | cut -d "p" -f 1 | sort -u > '2mapping/'$GEN'-'$PROBE'_to_remove.list'
# store summary
if [ -f 2mapping/summary.txt ]; then
printf "Writing mapping summary for $PROBE-$GEN\n"
else
printf "Writing mapping summary for $PROBE-$GEN \n"
printf "probe-genome\tmapped-loci\ttotal-loci\tNumber-multi-mapped-loci\n" > 2mapping/summary.txt
fi
NAME=$(printf "$PROBE-$GEN")
MAPPEDLOCI=$(cat 2mapping/tmp.uce.count)
TOTALLOCI=$(cat '0data/probes/'$PROBE'.fasta' | grep ">" | cut -d "|" -f 1 | cut -d _ -f1 | sort -u | grep "." -c)
MULTIMAPPING=$(cat 2mapping/tmp.multi-uce.count)
printf "$NAME\t$MAPPEDLOCI\t$TOTALLOCI\t$MULTIMAPPING\n" >> 2mapping/summary.txt
rm 2mapping/tmp.*
# if conditional statement is satisfied then remap with removed loci
if [ $(cat '2mapping/'$PROBE'-'$GEN'.sam' | grep -v "@" | grep "SA:" -c) == 0 ]; then
printf "$PROBE does not need loci/probes removed and re-mapped against $GEN \n";
else
printf "$PROBE needs re-mapped with removed loci/probes\n"
# remove loci in probe file based on a list
awk '{
if ((NR>1) && ($0~/^>/)) { printf("\n%s", $0); }
else if (NR==1) { printf("%s", $0); }
else { printf("\t%s", $0); }
}' '0data/probes/'$PROBE'.fasta' | \
grep -vf '2mapping/'$GEN'-'$PROBE'_to_remove.list' - | \
tr "\t" "\n" \
> '0data/probes/'$PROBE'_2.fasta';
# grep -A 1 -vf '2mapping/'$PROBE'_to_remove.list' '0data/probes/'$PROBE'.fasta' > '0data/probes/'$PROBE'_2.fasta'
printf "Probes removed from $PROBE\t" && grep "_" '2mapping/'$GEN'-'$PROBE'_to_remove.list' -c
# use BWA to remap using mem
# check for indexed genome files and if none are present index them and then map
if [ -f '2mapping/'$PROBE'-'$GEN'_2.sam' ]; then
printf "$PROBE already mapped to $GEN \n";
elif [ -f '0data/genomes/'$GEN'.fna.pac' ]; then
printf "mapping new $PROBE set to $GEN \n";
bwa mem '0data/genomes/'$GEN'.fna' '0data/probes/'$PROBE'_2.fasta' \
> '2mapping/'$PROBE'-'$GEN'_2.sam';
else
printf "indexing $GEN \n";
bwa index '0data/genomes/'$GEN'.fna';
printf "\n re mapping new $PROBE set to $GEN \n";
bwa mem '0data/genomes/'$GEN'.fna' '0data/probes/'$PROBE'_2.fasta' \
> '2mapping/'$PROBE'-'$GEN'_2.sam';
printf " new mapped file $PROBE-$GEN_2.sam in 2mapping directory \n";
fi
# convert sam to bam
if [ -f '2mapping/'$PROBE'-'$GEN'_2.mem.sorted.mapped.bed' ]; then
printf "BED file already created for new $PROBE-$GEN\n"
else
printf "converting to bam file\n"
samtools view -h -b -S '2mapping/'$PROBE'-'$GEN'_2.sam' \
> '2mapping/'$PROBE'-'$GEN'_2.mem.bam'
printf "sorting reads\n"
# sort bam file
samtools sort '2mapping/'$PROBE'-'$GEN'_2.mem.bam' -o '2mapping/'$PROBE'-'$GEN'_2.mem.sorted.bam'
printf "extracting new $PROBE probes that map to $GEN\n"
# extract sequences that map against the genome
samtools view -b -F 4 '2mapping/'$PROBE'-'$GEN'_2.mem.sorted.bam' \
> '2mapping/'$PROBE'-'$GEN'_2.mem.sorted.mapped.bam'
printf "creating BED file\n"
# create bed file
bedtools bamtobed -i '2mapping/'$PROBE'-'$GEN'_2.mem.sorted.mapped.bam' \
> '2mapping/'$PROBE'-'$GEN'_2.mem.sorted.mapped.bed'
printf "output $PROBE-$GEN_2.mem.sorted.mapped.bed in 2mapping/ directory\n"
fi
if [ -f '2mapping/'$PROBE'-'$GEN'_2.mem.sorted.mapped.bam' ]; then
rm 2mapping/*.bam
printf "Checking if loci/probes map to different regions in $PROBE-$GEN.sam\n"
else
printf "Checking if loci/probes map to different regions in $PROBE-$GEN.sam\n"
fi
# rewrite summary into new file
awk '{if ($3 != "*") print $0}' '2mapping/'$PROBE'-'$GEN'_2.sam' | \
grep -v "@" | \
awk '{print $1}' | \
cut -d "p" -f 1 | \
sort -u | \
grep "." -c \
> '2mapping/'tmp.uce.count
printf "Total of new $PROBE loci mapped against $GEN\t" && cat 2mapping/tmp.uce.count
# number of multiple mappings
# count multi mapping
cat '2mapping/'$PROBE'-'$GEN'_2.sam' | \
grep -v "@" | \
grep "SA:" -c \
> 2mapping/tmp.multi-uce.count
printf "Total $PROBE loci with multiple mappings in $GEN\t" && cat 2mapping/tmp.multi-uce.count
# store names
cat '2mapping/'$PROBE'-'$GEN'_2.sam' | \
grep -v "@" | \
grep "SA:" \
> 2mapping/tmp.multi
# count UCE/Loci names
awk '{print $1}' 2mapping/tmp.multi | \
cut -d "p" -f 1 | \
sort -u | \
wc -l \
> 2mapping/tmp.multi-loci.count
# list of uce to remove
awk '{print $1}' 2mapping/tmp.multi | \
cut -d "p" -f 1 | \
sort -u \
> '2mapping/'$PROBE'_to_remove_2.list'
# store summary
if [ -f 2mapping/summary.txt ]; then
printf "Writing mapping summary for $PROBE-$GEN""_2\n"
else
printf "Writing mapping summary for $PROBE-$GEN""_2\n"
printf "probe-genome\tmapped-loci\ttotal-loci\tNumber-multi-mapped-loci\n" > 2mapping/summary_2.txt
fi
NAME=$(printf "$PROBE-$GEN")
MAPPEDLOCI=$(cat 2mapping/tmp.uce.count)
TOTALLOCI=$(cat '0data/probes/'$PROBE'.fasta' | grep ">" | cut -d "|" -f 1 | cut -d _ -f1 | sort -u | grep "." -c)
MULTIMAPPING=$(cat 2mapping/tmp.multi-loci.count)
printf "$NAME\t$MAPPEDLOCI\t$TOTALLOCI\t$MULTIMAPPING\n" >> 2mapping/summary_2.txt
rm 2mapping/tmp.*
fi
printf "$PROBE-$GEN conversion and check complete\n\n"
exit
The intersectn.sh
expects an *.fna
genome file and the *.bed
files that were generated. Ensure you call the *_2.mem.sorted.mapped.bed
file if the probeset was altered.
e.g.,
bash intersect.sh genome.fna probes1-genome_2.mem.sorted.mapped.bed probes2-genome_2.mem.sorted.mapped.bed probes3-genome.mem.sorted.mapped.bed
Again, multiple genomes can be used if you want
for i in $(cat genomes\.list); do
GEN=$(echo $i | cut -d "." -f 1)
bash intersect.sh $i \
2mapping/Adephaga-${GEN}_2.mem.sorted.mapped.bed \
2mapping/Coleoptera-${GEN}_2.mem.sorted.mapped.bed \
2mapping/Vasil-${GEN}.mem.sorted.mapped.bed;
done
The example here is set up for 3 probe files as input, but you could alternatively set it up to work with more or less (see commented out lines in the script). The order of your *.bed
files is important here if using more than one. Use the probeset of most interest first, as any additional probes will be compared to this first probeset. In the example above, Adephaga is first.
Lastly, the mapped loci are checked for probes that do and do not intersect in the genome using bedtools
. The output returns an *.intersect
file for all probes compared to the first input probeset, and then for each additional combination that do not intersect (e.g. a nointersect
file for: probe1-probe2-probe3, probe2-probe1-probe3, probe3-probe1-probe2).
#!/bin/bash
source /local/anaconda3/bin/activate
conda activate characterization
# turn off N probes (e.g., PROBE1:PROBEN) depending on probesets used for characterization
# user bedfile input
BED1=$2
BED2=$3
BED3=$4
#BED#=$#
# create variable for naming probes and genomes
GEN=$(echo $1 | cut -d . -f 1)
PROBE1=$(echo $2 | cut -d . -f 1 | cut -d - -f 1)
PROBE2=$(echo $3 | cut -d . -f 1 | cut -d - -f 1)
PROBE3=$(echo $4 | cut -d . -f 1 | cut -d - -f 1)
#PROBE#=$(echo $# | cut -d . -f 1 | cut -d - -f 1)
printf "getting intersect of all probes with probe1\n"
# all probes intersect
bedtools intersect \
-a '1genefeatures/'$GEN'_simple.sorted.gff' \
-b '2mapping/'$BED1 \
'2mapping/'$BED2 \
'2mapping/'$BED3 \
-names $PROBE1 $PROBE2 $PROBE3 \
-wa -wb > '3intersect/'$GEN'-PROBES.intersect';
# turn off the number of -b inputs depending on the total number of probes you use
# e.g., using only two your names might look like: -names $PROBE1 $PROBE2 \
# summary
bedtools summary -i '3intersect/'$GEN'-PROBES.intersect' \
-g '0data/genomes/'$GEN'.genomefile' \
> '3intersect/'$GEN'-PROBES-intersect-summary.txt'
printf "getting which probes do not intersect with probe1\n"
# get all probes that do not intersect with probe1
bedtools intersect \
-a '2mapping/'$BED1 \
-b '2mapping/'$BED2 \
'2mapping/'$BED3 \
-names $PROBE2 $PROBE3 \
-v > '3intersect/'$GEN'-PROBES.nointersect';
# summary
bedtools summary -i '3intersect/'$GEN'-PROBES.nointersect' \
-g '0data/genomes/'$GEN'.genomefile' \
> '3intersect/'$GEN'-nointersect-summary.txt'
printf "getting which probes do not intersect with probe2\n"
# get all probes that do not intersect with probe1
bedtools intersect \
-a '2mapping/'$BED2 \
-b '2mapping/'$BED1 \
'2mapping/'$BED3 \
-names $PROBE1 $PROBE3 \
-v > '3intersect/'$GEN'-'$PROBE2'.nointersect';
# summary
bedtools summary -i '3intersect/'$GEN'-'$PROBE2'.nointersect' \
-g '0data/genomes/'$GEN'.genomefile' \
> '3intersect/'$GEN'-'$PROBE2'-nointersect-summary.txt'
printf "getting which probes do not intersect with probe3\n"
# get all probes that do not intersect with probe3
bedtools intersect \
-a '2mapping/'$BED3 \
-b '2mapping/'$BED2 \
'2mapping/'$BED1 \
-names $PROBE2 $PROBE1 \
-v > '3intersect/'$GEN'-'$PROBE3'.nointersect';
# summary
bedtools summary -i '3intersect/'$GEN'-'$PROBE3'.nointersect' \
-g '0data/genomes/'$GEN'.genomefile' \
> '3intersect/'$GEN'-'$PROBE3'-nointersect-summary.txt'
printf "intersect of probes complete\n"
You can reduce redunancy in the genome-PROBES.intersect
file given that probes mapped to genes will be reported twice, as genic and either exon or introns. You can create two files one with intergenic and genic mappings and one with intergenic, introns, and exons mappings. However, these can be filtered simply using R.
It may be necessary to change what is included and excluded based on whether the genome_detailed.sorted.gff
or genome_simple.sorted.gff
file was used. This example code is shown works with the *_simple.sorted.gff
and will produce a an tab delimited file with all probefiles. To subset by a condition simply select based on the column:
awk -F"\t" '($3 != "gene") {print $11"\t"$12"\t"$13"\t"$14"\t"$3"\t"$1"\t"$4"\t"$5"\t""probeset="$10";"$9}' file.intersect | awk -F";" '$1=$1' OFS="\t" > out.intersect
The intersect of the probe flies can now be analyized with the R script or the rest of the pipline can be followed
The directory structure should now look like this
.
├── 0data
│ ├── genomes
│ │ ├── pteMadi2.fna
│ │ ├── pteMadi2.fna.amb
│ │ ├── pteMadi2.fna.ann
│ │ └── ...
│ └── probes
│ ├── Adephaga_2.9Kv1_UCE-Probes.fasta
│ ├── Adephaga_2.fasta
│ ├── Adephaga.fasta
│ └── ...
├── 1genefeatures
│ ├── pteMadi2_detailed.sorted.gff
│ ├── pteMadi2_simple.sorted.gff
│ ├── pteMadi2.sorted.exons.bed
│ └── ...
├── 2mapping
│ ├── Adephaga-pteMadi2_2.mem.sorted.mapped.bed
│ ├── Adephaga-pteMadi2_2.sam
│ ├── Adephaga-pteMadi2.mem.sorted.mapped.bed
│ └── ...
├── 3intersect
│ ├── pteMadi2-Coleoptera.nointersect
│ ├── pteMadi2-Coleoptera-nointersect-summary.txt
│ ├── pteMadi2-nointersect-summary.txt
│ └── ...
├── genefeatures.sh
├── index.sh
├── intersect.sh
├── run.sh
├── sam2bed.sh
├── summary.txt
└── workflow.md
...
If you find it necessary to join multiple probesets follow the rest of this pipeline. Some steps of this pipeline will provide a summary of the probes that may be useful.
Given that Adephaga has ~300 UCE loci in common with Coleoptera by design, these will be removed from the Adephaga probekit. This pipeline is prioritizing Adephaga, but either probekit could be used to remove. Given that the original Coleoptera probekit is being integrated, those loci/probes will be removed from the Adephaga set.
This reduces the 2921 targeted loci from the previous steps to 2619:
# get list of probes to remove
grep "coleoptera" 0data/probes/Adephaga_2.fasta | cut -d "|" -f 1 | cut -d "p" -f 1 | sort -u > 0data/probes/tmp.list
# use list to remove those probes
awk '{ if ((NR>1) && ($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' 0data/probes/Adephaga_2.fasta | grep -vf 0data/probes/tmp.list - | tr "\t" "\n" > 0data/probes/Adephaga_3.fasta
order is important
if you have an probefile that isn't in the expected phyluce format (e.g., uce100_p10 | ...
) then you need to modify the cut in the last awk statements or do this step manually. This file takes input [....]
needs to be run twice;
bash blast.sh Adephaga_3.fasta Coleoptera_2.fasta
bash blast.sh Adephaga_3.fasta Vasil.fasta
#!/bin/bash
source /local/anaconda3/bin/activate
conda activate characterization
PROBEFILE1=$1
PROBEFILE2=$2
PROBEFILE1NAME=$(echo $1 | cut -d . -f 1)
PROBEFILE2NAME=$(echo $2 | cut -d . -f 1)
if [ -d '4probe-reduction/' ]; then
printf "Starting blast of $PROBEFILE2 against $PROBEFILE1\n"
else
mkdir 4probe-reduction
printf "Starting blast of $PROBEFILE2 against $PROBEFILE1\n"
fi
if [ -f '/4probe-reduction/'$PROBEFILE2NAME'-'$PROBEFILE1NAME'.blastn' ]; then
printf "$PROBEFILE2 already BLAST'd against $PROBEFILE1\n"
elif [ -f '0data/probes/'$PROBEFILE1'.nin' ]; then
cd 0data/probes
blastn -query $PROBEFILE2 \
-db $PROBEFILE1 \
-num_threads 3 \
-max_target_seqs 100000 \
-perc_identity 0.90 \
-outfmt "6 qseqid qstart qend sseqid sstart send gaps mismatch pident evalue bitscore" \
> '../../4probe-reduction/'$PROBEFILE2NAME'-'$PROBEFILE1NAME'.blastn';
cd ../../
printf "creating list of $PROBEFILE2 loci to remove\n"
awk '$7 == "0" {print $1}' \
'4probe-reduction/'$PROBEFILE2NAME'-'$PROBEFILE1NAME'.blastn' | \
cut -d "p" -f 1 | \
sort -u \
> '4probe-reduction/'$PROBEFILE2NAME'.to-remove'
else
printf "Creating $PROBEFILE1NAME database\n"
cd 0data/probes
makeblastdb -in $PROBEFILE1 -dbtype nucl
blastn -query $PROBEFILE2 \
-db $PROBEFILE1 \
-num_threads 3 \
-max_target_seqs 100000 \
-perc_identity 0.90 \
-outfmt "6 qseqid qstart qend sseqid sstart send gaps mismatch pident evalue bitscore" \
> '../../4probe-reduction/'$PROBEFILE2NAME'-'$PROBEFILE1NAME'.blastn';
cd ../../
printf "creating list of $PROBEFILE2 loci to remove\n"
awk '$7 == "0" {print $1}' \
'4probe-reduction/'$PROBEFILE2NAME'-'$PROBEFILE1NAME'.blastn' | \
cut -d "p" -f 1 | \
sort -u \
> '4probe-reduction/'$PROBEFILE2NAME'.to-remove'
fi
Next add to the *.to-remove
list based on intersect files generated in directory 3intersect.
awk '$10 == "Coleoptera" {print $14}' 3intersect/pteMadi2-PROBES.intersect | cut -d "p" -f 1 | sort -u >> 4probe-reduction/Coleoptera_2.to-remove
awk '{print $0}' 4probe-reduction/Coleoptera_2.to-remove | sort -u > 4probe-reduction/Coleoptera_2.to-remove.list
In this example, the Vasil probes are formated differently and for downstream purposes I am considering exons over the genes (note the change in cut)
awk '$7 == "Vasil" {print $11}' 3intersect/pteMadi2-PROBES.intersect | cut -d _ -f 1,2,3 | sort -u >> 4probe-reduction/Vasil.to-remove2
awk '{print $0}' 4probe-reduction/Vasil.to-remove2 | sort -u > 4probe-reduction/Vasil.to-remove.list
next we remove the duplicates using the generated lists
awk '{ if ((NR>1) && ($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' 0data/probes/Coleoptera_2.fasta | grep -vf 4probe-reduction/Coleoptera_2.to-remove.list - | tr "\t" "\n" > 4probe-reduction/Coleoptera.subset.fasta
awk '{ if ((NR>1) && ($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' 0data/probes/Vasil.fasta | grep -vf 4probe-reduction/Vasil.to-remove.list - | tr "\t" "\n" > 4probe-reduction/Vasil.subset.fasta
In order to use these probes for the phyluce pipeline, the probe ID's need renamed for non-adephaga probesets (Coleoptera and Vasil). First the original name is appended to the probe ID at the end of the header.
# ensure you have a new directory 5join-probes
awk 'BEGIN{RS=">";-F" |"} (NR>1) {print ">" $1 " " $2",original-probe-ID:"$1 "\n" $3}' 4probe-reduction/Coleoptera.subset.fasta > 5join-probes/Coleoptera.subset.tmp
awk 'BEGIN{RS=">";-F" |"} (NR>1) {print ">" $1 " " $2",original-probe-ID:"$1 "\n" $3}' 4probe-reduction/Vasil.subset.fasta > 5join-probes/Vasil.subset.tmp
grep ">" 5join-probes/Coleoptera.subset.tmp | cut -d"|" -f1 | tr -d ">" > 5join-probes/Coleoptera-rename.list
grep ">" 5join-probes/Vasil.subset.tmp | cut -d"|" -f1 | tr -d ">" > 5join-probes/Vasil-rename.list
Next a delimited file is created, with the old probe ID, new target loci, new probe number, and final renamed loci. Recall that the Vasil probes, TC######-PA_# is the "gene" and TC######-PA_#_# exon, and the rest, TC######-PA_#_#_#_#_# probe ID's. This may vary depending on the probes used.
touch 5join-probes/Coleoptera-rename.tmp.list
for i in $(cat 5join-probes/Coleoptera-rename.list); do
LOCI=$(echo $i | cut -d "p" -f 1);
echo -e $i "\t" $LOCI >> 5join-probes/Coleoptera-rename.tmp.list;
done
touch 5join-probes/Vasil-rename.tmp.list
for i in $(cat 5join-probes/Vasil-rename.list); do
LOCI=$(echo $i | cut -d _ -f 1,2,3);
echo -e $i "\t" $LOCI >> 5join-probes/Vasil-rename.tmp.list;
done
Next create a new ID for these probes based on their "loci" to make it compatable with phyluce
. This uses the new tmp.list file created previously to alter the fasta file.
awk 'BEGIN {count=1000000;probe[0]=""};
# if the probe is not loaded into the array; that probe gets +1
{if (!($2 in probe)) {probe[$2]=count++} ;
# print a new column with your new loci name
print $0, "\t", "uce-"count}' 5join-probes/Coleoptera-rename.tmp.list > 5join-probes/Coleoptera-rename.tmp.list2
# do the same, but for our newly created column
# compare the value from the first field stored in save
awk '$3 != save { counter = 1; save = $3 }
# if the values differ resets to one
{ print $0, "\t", "_p"counter++ }' 5join-probes/Coleoptera-rename.tmp.list2 > 5join-probes/Coleoptera-rename.tmp.list3
# and kiss
awk '{print ">"$1, "\t", ">"$3 $4}' 5join-probes/Coleoptera-rename.tmp.list3 > 5join-probes/Coleoptera.renamed-probe.list
#again for Vasil but at 2 million so there is no overlap
awk 'BEGIN {count=2000000;probe[0]=""};
{if(!($2 in probe)) {probe[$2]=count++};
print $0, "\t", "uce-"count}' 5join-probes/Vasil-rename.tmp.list > 5join-probes/Vasil-rename.tmp.list2
awk '$3 != save { counter = 1; save = $3 } { print $0, "\t", "_p" counter++ }' 5join-probes/Vasil-rename.tmp.list2 > 5join-probes/Vasil-rename.tmp.list3
awk '{print ">" $1, "\t", ">" $3 $4}' 5join-probes/Vasil-rename.tmp.list3 > 5join-probes/Vasil.renamed-probe.list
# clean up tmp files
#rm 5join-probes/*.tmp.list*
First we rename the probe files using the *.renamed-probe.list
and the *.subset.fasta
awk 'BEGIN {OFS="\t"} NR==FNR {a[$1]=$2; next} /^>/ {if (a[$1]) {$1=a[$1]}}1' 5join-probes/Coleoptera.renamed-probe.list 5join-probes/Coleoptera.subset.tmp | tr "\t" " " > 5join-probes/Coleoptera.renamed.tmp.probes
awk 'BEGIN {OFS="\t"} NR==FNR {a[$1]=$2; next} /^>/ {if (a[$1]) {$1=a[$1]}}1' 5join-probes/Vasil.renamed-probe.list 5join-probes/Vasil.subset.tmp | tr "\t" " " > 5join-probes/Vasil.renamed.tmp.probes
cat 0data 0data/probes/Adephaga_3.fasta 5join-probes/Coleoptera.renamed.tmp.probes 5join-probes/Vasil.renamed.tmp.probes > 5join-probes/joined.tmp.fasta
following this pipeline the resulting probeset contains 4255 loci comprised of 90130 probes. So the new probefile will follow the naming convention of other UCE probesets joined_probes4.2k.fasta
mv 5join-probes/joined.renamed.tmp.probes > 5join-probes/joined_probes4.2k.fasta
Here are the resulting changes for the probes
$ seqkit stats *.fasta
file | format | type | num_seqs | sum_len | min_len | avg_len | max_len |
---|---|---|---|---|---|---|---|
0data/probes/Adephaga.fasta | FASTA | DNA | 38,948 | 4,673,760 | 120 | 120 | 120 |
0data/probes/Adephaga_2.fasta | FASTA | DNA | 38,682 | 4,641,840 | 120 | 120 | 120 |
0data/probes/Adephaga_3.fasta | FASTA | DNA | 34,995 | 4,199,400 | 120 | 120 | 120 |
0data/probes/Coleoptera.fasta | FASTA | DNA | 13,674 | 1,640,880 | 120 | 120 | 120 |
0data/probes/Coleoptera_2.fasta | FASTA | DNA | 13,609 | 1,633,080 | 120 | 120 | 12 |
0data/probes/Vasil.fasta | FASTA | DNA | 49,786 | 5,974,320 | 120 | 120 | 120 |
5join-probes/Coleoptera.subset.tmp | FASTA | DNA | 9,033 | 1,083,960 | 120 | 120 | 120 |
5join-probes/Vasil.subset.tmp | FASTA | DNA | 46,102 | 5,532,240 | 120 | 120 | 120 |
5join-probes/joined_probes4.2k.fasta | FASTA | DNA | 90,130 | 10,815,600 | 120 | 120 | 120 |
.
├── 0data
│ ├── genomes
│ │ ├── pteMadi2.fna
│ │ ├── pteMadi2.fna.amb
│ │ ├── pteMadi2.fna.ann
│ │ └── ...
│ └── probes
│ ├── Adephaga_2.9Kv1_UCE-Probes.fasta
│ ├── Adephaga_2.fasta
│ ├── Adephaga_2.fasta.ndb
│ └── ...
├── 1genefeatures
│ ├── pteMadi2_detailed.sorted.gff
│ ├── pteMadi2_simple.sorted.gff
│ ├── pteMadi2.sorted.exons.bed
│ └── ...
├── 2mapping
│ ├── Adephaga-pteMadi2_2.mem.sorted.mapped.bed
│ ├── Adephaga-pteMadi2_2.sam
│ ├── Adephaga-pteMadi2.mem.sorted.mapped.bed
│ └── ...
├── 3intersect
│ ├── pteMadi2-Coleoptera.nointersect
│ ├── pteMadi2-Coleoptera-nointersect-summary.txt
│ ├── pteMadi2-nointersect-summary.txt
│ └── ...
├── 4probe-reduction
│ ├── Coleoptera_2-Adephaga_2.blastn
│ ├── Coleoptera_2.to-remove
│ ├── Coleoptera_2.to-remove.list
│ └── ...
├── 5join-probes
│ ├── Coleoptera.renamed-probe.list
│ ├── Coleoptera-rename.list
│ ├── Coleoptera.subset.tmp
│ └── ...
├── blast.sh
├── genefeatures.sh
├── index.sh
├── intersect.sh
├── run.sh
├── sam2bed.sh
└── workflow.md
The goal here is to ensure that we have selected orthologous probes. Adephaga probes have ~300 Coleoptera loci, so there should be approximately a 300 shared uces.
** notes **
##~#~#~#~#~#
#26.04.2023#
##~#~#~#~#~#
conversation with grey
check just Adephaga probes... new trimming: clipkit, relaxed gblocks trimming, bossert found a way to tweak overlap
check out untrimmed alignments in just Adephaga2.9k kit
pull loci using probes, treat dataset outside of phyluce, relaxed gblocks, gblocks, try clipkit, tweak mafft parameters think about reducing overlap in AHE probes...
AHE probes are pulling loci
Van Dam et al (2021) used taxa specific probes, but here we are using a broad number of taxa to ensure that overlap is consistent across taxa, not just one genome.
Two choices, concatenate loci that are found on the same gene OR concatenate loci based on their distance from one another. The assumption for by gene is that loci on the gene should have a similar evolutionary rate. However, genes can have varying rates of evolution within them. Creating a partition by length, assuming linkage between loci, it is possible to join loci that need joined. This, is mainly for demonstration.
- map and make bedfile
- find clusters using
bedtools cluster -d
and check a range from 0:1000 and examine the change in the number of "clusters" - write script to concatenate loci/uce's by the cluster statistic chosen
** notes **
cat 2mapping/Adephaga-pteMadi2_2.mem.sorted.mapped.bed 2mapping/Coleoptera-pteMadi2_2.mem.sorted.mapped.bed 2mapping/Vasil-pteMadi2.mem.sorted.mapped.bed | bedtools sort | bedtools cluster -d 1000 > tmp.cluster1000
cat 2mapping/Adephaga-pteMadi2_2.mem.sorted.mapped.bed 2mapping/Coleoptera-pteMadi2_2.mem.sorted.mapped.bed 2mapping/Vasil-pteMadi2.mem.sorted.mapped.bed | bedtools sort | bedtools cluster -d 500 > tmp.cluster500
This code takes in user input and maps the new probes against the genome. It produces a list of loci that group together on the same gene. For ease, it is best to simplify the created probefile to be succinct (joined_probes4.3k.fasta > joined.fasta)
bash group-by.sh pteMadi2.fna joined.fasta
#!/bin/bash
source /local/anaconda3/bin/activate
conda activate characterization
GEN=$(echo $1 | cut -d . -f 1)
PROBE=$(echo $2 | cut -d . -f 1)
# index genome and map probes
if [ -f '6group-by/'###'sam' ]; then
printf "$PROBE already mapped to -$GEN \n\n";
elif [ -f '0data/genomes/'$GEN'.fna.pac' ]; then
printf "Mapping $PROBE to $GEN \n";
bwa mem '0data/genomes/'$GEN'.fna' '5join-probes/'$PROBE'.fasta' \
> '6group-by/'$PROBE'-'$GEN'.sam';
else
printf "Indexing $GEN \n";
bwa index '0data/genomes/'$GEN'.fna';
printf "Mapping $PROBE to $GEN \n";
bwa mem '0data/genomes/'$GEN'.fna' '5join-probes/'$PROBE'.fasta' \
> '6group-by/'$PROBE'-'$GEN'.sam';
printf "finished bwa indexing and mapping\noutput $PROBE-$GEN.sam in 6group-by directory \n\n";
fi
# convert sam to bam
if [ -f '6groupby/'$PROBE'-'$GEN'.mem.sorted.mapped.bed' ]; then
printf "BED file already created for $PROBE-$GEN\n"
else
printf "converting to bam file\n"
samtools view -h -b -S '6group-by/'$PROBE'-'$GEN'.sam' \
> '6group-by/'$PROBE'-'$GEN'.mem.bam'
printf "sorting reads\n"
# sort bam file
samtools sort '6group-by/'${PROBE}'-'${GEN}'.mem.bam' \
-o '6group-by/'${PROBE}'-'${GEN}'.mem.sorted.bam'
printf "extracting $PROBE probes that map to $GEN\n"
# extract sequences that map against the genome
samtools view -b -F 4 \
'6group-by/'${PROBE}'-'${GEN}'.mem.sorted.bam' \
> '6group-by/'${PROBE}'-'${GEN}'.mem.sorted.mapped.bam'
printf "creating BED file\n"
# create bed file
bedtools bamtobed -i '6group-by/'${PROBE}'-'${GEN}'.mem.sorted.mapped.bam' \
> '6group-by/'${PROBE}'-'${GEN}'.mem.sorted.mapped.bed'
printf "output $PROBE-$GEN.mem.sorted.mapped.bed in 6group-by/ directory\n"
fi
echo -e "Find intersect of $PROBE and $GEN gene features \n"
bedtools intersect -a '1genefeatures/'${GEN}'.sorted.genes.gff' \
-b '6group-by/'${PROBE}'-'${GEN}'.mem.sorted.mapped.bed' \
-names ${PROBE} \
-wa -wb > '6group-by/'${PROBE}-${GEN}'.gene.intersect'
#here we group the GFF feature info and sumarize by UCE loci
# -g group by column or ranges (1-4; 1,2,3,4)
# -c summarize by this column
printf "group $PROBE mapped to $GEN \n"
bedtools groupby -i '6group-by/'${PROBE}-${GEN}'.gene.intersect' \
-g 9 -c 13 -o collapse > '6group-by/'${PROBE}'-'${GEN}'.grpby_gene'
# create output files
cat '6group-by/'${PROBE}'-'${GEN}'.grpby_gene' | tr -d " " > 6group-by/tmp
while read TMP; do
GENE=$(echo $TMP | cut -d ";" -f 1 | cut -d ":" -f 2)
UCELIST=$(echo $TMP | awk '{print $2}')
echo -e "$GENE \t $UCELIST"
done < 6group-by/tmp | tr -d " " > 6group-by/tmp2
GENOME=$(echo $GEN | cut -d"_" -f1,2)
awk 'BEGIN{FS=OFS="\t"} # set intput and output as tab delimited
{split($2, probes, ","); # split the second feild of each line into an array called probes, and as a comma delimited file
for (i in probes) {split(probes[i], loci, "_"); # loop through each element in the proe array, and split it according to the underscore
loci_array[loci[1]] = 1} delete probes; # while in the for loop save the first part of the loci as associative array called loci_array
for (j in loci_array) {loci_list = loci_list ? loci_list "," j : j}; # loop through each key in the loci array and append a string called "loci list". The string is built up using the ? : in the loop condition: if loci_list is not empty (if it does not contain a previous loci) the current loci is appended with a comma; otherwise the current loci is added to the string.
delete loci_array; # delete the loci array to free up memory for the next line
print "'$GENOME'", $1, loci_list; loci_list=""}' 6group-by/tmp2 > '6group-by/'${PROBE}'-'${GEN}'.short.grpby_gene' # print the first field ($1) followed by the loci_list string built earlier and then reset loci_list to start with the next line.
awk 'BEGIN{FS=OFS="\t"} {split($2, probes, ","); for (i in probes) {split(probes[i], loci, "_"); loci_array[loci[1]] = 1} delete probes; for (j in loci_array) {loci_list = loci_list ? loci_list "," j : j}; delete loci_array; print "'$GENOME'", $1, loci_list; loci_list=""}'
printf "finished grouping $PROBE by $GEN genes \n"
printf "\t Detailed grouped by file is ${PROBE}-${GEN}.grpby_gene\n"
printf "\t Succinct grouped by gene file is ${PROBE}-${GEN}.short.grpby_gene\n"
this gives a filethat looks like
ENSMPTG0000500001 uce-1,uce-2,uce-3,uce-7
ENSMPTG0000700105 uce-4,uce-5,uce-6
ENSMPTG0001400123 uce-9,uce-10
...
Using a nexus partition file like this:
#nexus
begin sets;
charset uce-1.nexus = 1-100;
charset uce-2.nexus = 101-200;
charset uce-3.nexus = 201-300;
charset uce-4.nexus = 301-400;
charset uce-5.nexus = 401-500;
charset uce-6.nexus = 500-1500;
charset uce-7.nexus = 1501-1700;
charset uce-8.nexus = 1700-10000;
charset uce-9.nexus = 10001-10100;
charset uce-10.nexus = 10100-10500;
done;
We can run a script that generates a new partition of our sequence data:
#nexus
begin sets;
charset ENSMPTG0000500001 = 1-100 101-200 201-300 1501-1700;
charset ENSMPTG0000700105 = 301-400 401-500 500-1500;
charset uce-8 = 1700-10000;
charset ENSMPTG0001400123 = 10001-10100 10100-10500;
done;
First make a directory to work from 7modify-partition
, and copy your phyluce partition file into it. Because phyluce uses the file name, and not just the loci/uce name, it needs the *.nexus
removed from the file. Additionally, we only want to rename UCEs if there is more than one found on a gene. Depending on the completeness of the matrix that gets put in, then many taxa might be missing one or more of this loci/probe
cat 7modify-partition/inttrim_30p_parti.nex | sed 's/.nexus//g' > 7modify-partition/tmp.nexus
grep "," 6group-by/joined-pteMadi2.short.grpby_gene > 7modify-partition/tmp.grpby_gene
to run this script use the *.short.grpby_gene
awk -f modify-partitions.awk 7modify-partition/tmp.grpby_gene 7modify-partition/tmp.nexus > 7modify-partition/new.nexus
#!/usr/bin/awk -f
# Read the gene mapping file into an array
FNR == NR {
split($2, uces, ",");
for (i in uces) {
gene_map[uces[i]] = $1;
}
next;
}
# Process the partition file
/^charset/ {
charset_name = $2;
charset_pos = substr($0, index($0, "=") + 2);
if (charset_name in gene_map) {
gene = gene_map[charset_name];
if (gene in new_charset) {
new_charset[gene] = new_charset[gene] " " charset_pos;
} else {
new_charset[gene] = charset_pos;
}
} else {
new_charset[charset_name] = charset_pos;
}
}
END {
print "#nexus";
print "begin sets;";
for (gene in new_charset) {
gsub(";", "", new_charset[gene]);
printf "charset %s = %s;\n", gene, new_charset[gene];
}
print "done;";
}
The new partition can be run in phyluce
Linux pyrgus 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
active environment : characterization
active env location : /home/cody/.conda/envs/characterization
shell level : 2
user config file : /home/cody/.condarc
populated config files : /home/cody/.condarc
conda version : 4.6.14
conda-build version : 3.17.6
python version : 3.7.1.final.0
base environment : /local/anaconda3 (read only)
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://conda.anaconda.org/bioconda/linux-64
https://conda.anaconda.org/bioconda/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/free/linux-64
https://repo.anaconda.com/pkgs/free/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /local/anaconda3/pkgs
/home/cody/.conda/pkgs
envs directories : /home/cody/.conda/envs
/local/anaconda3/envs
platform : linux-64
user-agent : conda/4.6.14 requests/2.21.0 CPython/3.7.1 Linux/4.4.0-119-generic ubuntu/16.04.4 glibc/2.23
UID:GID : 1028:1040
netrc file : None
offline mode : False
# packages in environment at /home/cody/.conda/envs/characterization:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
bedtools 2.31.0 hf5e1c6e_2 bioconda
blast 2.14.0 pl5321h6f7f691_2 bioconda
bwa 0.7.17 he4a0461_11 bioconda
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2023.01.10 h06a4308_0 anaconda
certifi 2022.12.7 pypi_0 pypi
curl 7.87.0 h5eee18b_0
entrez-direct 16.2 he881be0_1 bioconda
gettext 0.21.1 h27087fc_0 conda-forge
htslib 1.6 h42e7767_0 anaconda
keyutils 1.6.1 h166bdaf_0 conda-forge
krb5 1.19.3 h3790be6_0 conda-forge
ld_impl_linux-64 2.38 h1181459_1 anaconda
libcurl 7.87.0 h91b91d3_0
libdeflate 1.14 h166bdaf_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libffi 3.4.2 h6a678d5_6 anaconda
libgcc-ng 12.2.0 h65d4601_19 conda-forge
libgomp 12.2.0 h65d4601_19 conda-forge
libidn2 2.3.4 h166bdaf_0 conda-forge
libnghttp2 1.47.0 hdcd2b5c_1 conda-forge
libssh2 1.10.0 haa6b8db_3 conda-forge
libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
libunistring 0.9.10 h7f98852_0 conda-forge
libuuid 1.41.5 h5eee18b_0 anaconda
libzlib 1.2.13 h166bdaf_4 conda-forge
natsort 7.1.1 pyhd3eb1b0_0 anaconda
ncbi-vdb 2.11.0 h1b792b2_1 bioconda
ncurses 6.3 h27087fc_1 conda-forge
openssl 1.1.1s h7f8727e_0 anaconda
pcre 8.45 h9c3ff4c_0 conda-forge
perl 5.26.2 h36c2ea0_1008 conda-forge
perl-archive-tar 2.32 pl526_0 bioconda
perl-carp 1.38 pl526_3 bioconda
perl-common-sense 3.74 pl526_2 bioconda
perl-compress-raw-bzip2 2.087 pl526he1b5a44_0 bioconda
perl-compress-raw-zlib 2.087 pl526hc9558a2_0 bioconda
perl-exporter 5.72 pl526_1 bioconda
perl-extutils-makemaker 7.36 pl526_1 bioconda
perl-io-compress 2.087 pl526he1b5a44_0 bioconda
perl-io-zlib 1.10 pl526_2 bioconda
perl-json 4.02 pl526_0 bioconda
perl-json-xs 2.34 pl526_1 bioconda
perl-list-moreutils 0.15 pl526_1 bioconda
perl-pathtools 3.75 pl526h14c3975_1 bioconda
perl-scalar-list-utils 1.52 pl526h516909a_0 bioconda
pip 22.3.1 pypi_0 pypi
python 3.10.8 h7a1cb2a_1 anaconda
readline 8.2 h5eee18b_0 anaconda
samtools 1.6 hcd7b337_9 bioconda
seqkit 2.5.0 h9ee0642_0 bioconda
setuptools 65.6.3 pypi_0 pypi
sqlite 3.40.1 h5082296_0 anaconda
tk 8.6.12 h1ccaba5_0 anaconda
tzdata 2022a hda174b7_0 anaconda
wget 1.20.3 ha56f1ee_1 conda-forge
wheel 0.37.1 pyhd3eb1b0_0 anaconda
xz 5.2.6 h166bdaf_0 conda-forge
zlib 1.2.13 h166bdaf_4 conda-forge