Step-by-step workflow for composite transposon analysis based on ISEScan

Citing

@article {Gligorijevic2019,
	author = {Zhiqun Xie and Haixu Tang},
	title = {ISEScan: automated identification of Insertion Sequence Elements in prokaryotic genomes},
	year = {2017},
	doi = {10.1093/bioinformatics/btx433},
	URL = {https://academic.oup.com/bioinformatics/article/33/21/3340/3930124},
	journal = {Bioinformatics},
	volume = {33}
}

About The Project

This pipeline searches for genes that are transferred by transposition within the bacterial genome. The first step is to find insertion sequences (ISs), which are the flanking sequences of composite transposons, using the ISEScan tool.

Transposons are then built from the detected ISs by adding their coordinates to the original GenBank file. Next, the coding sequences that lie within the transposons are extracted in FASTA format. Visualization with Artemis is optional.
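
The exact rule for building a transposon from ISs is not described here, so the following is only a minimal sketch under stated assumptions: two neighbouring ISs on the same sequence are treated as a candidate composite transposon when the whole span is no longer than a hypothetical `max_length`, and a coding sequence counts as "within" the transposon when it lies entirely between the two flanking ISs. The names `Interval`, `candidate_transposons`, `cds_inside` and `max_length` are illustrative and are not taken from the pipeline scripts.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def candidate_transposons(is_elements, max_length=20000):
    """Pair each IS with the next IS downstream if the total span is short enough.

    max_length is a hypothetical cut-off, not a value used by the pipeline.
    """
    ordered = sorted(is_elements)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        if right.end - left.start + 1 <= max_length:
            pairs.append((left, right))
    return pairs


def cds_inside(pair, cds_list):
    """Return the CDS intervals lying entirely between the two flanking ISs."""
    left, right = pair
    return [c for c in cds_list if c.start > left.end and c.end < right.start]


# Toy coordinates for illustration only.
is_elements = [Interval(1000, 1800), Interval(5000, 5800), Interval(90000, 90800)]
cds_list = [Interval(2000, 3200), Interval(3300, 4900), Interval(6000, 7000)]

for pair in candidate_transposons(is_elements):
    print(pair[0], pair[1], cds_inside(pair, cds_list))
```

With the toy coordinates, only the first two ISs are close enough to be paired, and the two coding sequences between them are reported.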

In the next step, all extracted sequences are clustered with CD-HIT, and the clusters are filtered by a size threshold. Finally, one representative sequence is chosen from each cluster and searched with BLAST against the database of interest.

Built With

Installation

Local

Set up the conda environment

You can install ISEScan in a different location by changing the default miniconda3 install path in the Install Miniconda3 step. See the Bioconda recipe for ISEScan for more details.

Install ISEScan (automated install via Bioconda, recommended)

conda install -c bioconda isescan

Visit ISEScan Installation for more details.

Install CD-HIT

conda install -c bioconda cd-hit

Install Artemis (optional)

conda install -c bioconda artemis

Install BLAST

conda install -c bioconda blast

Docker

Create YOUR_DATA_ROOT directory on your local machine
```
mkdir /YOUR_DATA_ROOT
```
Run the container. The option -u $(id -u):$(id -g) ensures that all files created by the pipeline are accessible to your user.
```
docker run -it -u $(id -u):$(id -g) -v /YOUR_DATA_ROOT:/data dana162001/p_SA /bin/bash
```

Running pipeline

Download or prepare the genomes of interest in .gbff format

Run script to convert .gbff to .fasta format

convert_gb_to_fasta.py -i [Name_of_dir_with_gb_files] -o [Name_of_dir_with_fasta_files]
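
To illustrate what this conversion involves, here is a minimal standard-library sketch that reads the nucleotide sequence from the ORIGIN block of each GenBank record and prints it as FASTA. It is not the code of convert_gb_to_fasta.py, whose implementation is not shown here; the function name `genbank_to_fasta` and the demo record are assumptions made for the example. In practice, Biopython's `SeqIO.convert` is a common alternative for the same task.

```python
import io


def genbank_to_fasta(handle):
    """Yield (record_id, sequence) for each record in a GenBank flat file."""
    record_id, in_origin, chunks = None, False, []
    for line in handle:
        if line.startswith("LOCUS"):
            record_id = line.split()[1]
        elif line.startswith("VERSION") and len(line.split()) > 1:
            record_id = line.split()[1]  # prefer the versioned accession
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # "//" terminates a record: emit it and reset the state.
            yield record_id, "".join(chunks).upper()
            record_id, in_origin, chunks = None, False, []
        elif in_origin:
            # Sequence lines look like "        1 atgaccatga ttacgccaag";
            # the first field is the position counter and is dropped.
            chunks.extend(line.split()[1:])


# Made-up demo record, for illustration only.
EXAMPLE_GBFF = (
    "LOCUS       DEMO0001                  20 bp    DNA     linear   BCT 01-JAN-2000\n"
    "VERSION     DEMO0001.1\n"
    "ORIGIN\n"
    "        1 atgaccatga ttacgccaag\n"
    "//\n"
)

for name, seq in genbank_to_fasta(io.StringIO(EXAMPLE_GBFF)):
    print(f">{name}\n{seq}")
```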

Run ISEScan

isescan.py --seqfile seq_ID.fasta --output results --nthread 2

By default, ISEScan uses one CPU core; you can change this with the --nthread [num] option.
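
When several genomes have been converted, the command above has to be repeated once per FASTA file. The sketch below shows one possible way to do that from Python, using only the three options shown in the command. The helper names `isescan_command` and `run_all` are hypothetical, and the sketch assumes that isescan.py is on the PATH and that the FASTA files end in .fasta.

```python
import subprocess
from pathlib import Path


def isescan_command(seqfile, output="results", nthread=2):
    """Return the isescan.py call for one FASTA file as an argument list."""
    return [
        "isescan.py",
        "--seqfile", str(seqfile),
        "--output", str(output),
        "--nthread", str(nthread),
    ]


def run_all(fasta_dir, output="results", nthread=2):
    """Run ISEScan on every .fasta file in fasta_dir, one genome at a time."""
    for fasta in sorted(Path(fasta_dir).glob("*.fasta")):
        # check=True stops the loop if ISEScan exits with an error.
        subprocess.run(isescan_command(fasta, output, nthread), check=True)


# Show the command that would be run for a single file.
print(" ".join(isescan_command("seq_ID.fasta")))
```

Calling `run_all("fasta_files")` would then process a whole directory; the directory name is a placeholder.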

Run artemis_visualisation.py to add coordinates of new ISs to the original .gbff files

artemis_visualisation.py -i [dir_to_ISEScan_results_csv] -u [dir_to_original_gbff_files] -m [dir_to_modified_gbff_files]
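
As a rough illustration of what "adding coordinates" to a .gbff file means, the sketch below turns rows of an ISEScan-style CSV into GenBank feature-table lines. Several things here are assumptions rather than facts about artemis_visualisation.py: the column names (`isBegin`, `isEnd`, `family`, `strand`), the choice of the `mobile_element` feature key, and the helper name `is_feature_lines`. The script in this repository may use different columns or a different feature key.

```python
import csv
import io


def is_feature_lines(row):
    """Format one IS hit as GenBank feature-table lines.

    The feature key starts in column 6 and the location in column 22,
    as in the GenBank flat-file feature table layout.
    """
    location = f"{row['isBegin']}..{row['isEnd']}"
    if row.get("strand") == "-":
        location = f"complement({location})"
    return [
        " " * 5 + "mobile_element".ljust(16) + location,
        " " * 21 + f'/mobile_element_type="insertion sequence:{row["family"]}"',
    ]


# Made-up rows; the column names are assumed, not confirmed by this README.
EXAMPLE_CSV = (
    "seqID,family,isBegin,isEnd,strand\n"
    "NC_DEMO.1,IS3,1000,2250,+\n"
    "NC_DEMO.1,IS5,8000,9100,-\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
for row in rows:
    print("\n".join(is_feature_lines(row)))
```

Lines formatted this way would be inserted into the FEATURES block of the matching record, where Artemis can display them alongside the existing annotation.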

Run me_cds.py to extract coding sequences from modified .gbff files

me_cds.py -i [dir_to_modified_gbff_files] -o merged_cds_prot.fasta

Run cd_hit.py to cluster extracted sequences

cd_hit.py -i [dir_to_merged_cds_prot.fasta]  -o [output_path]

Default parameters used by cd_hit.py: -c 0.6 -aS 0.8 -n 4 -M 4000
-c: sequence identity threshold (CD-HIT default 0.9). This is CD-HIT's "global sequence identity", calculated as the number of identical amino acids in the alignment divided by the full length of the shorter sequence.
-aS: alignment coverage for the shorter sequence (CD-HIT default 0.0). If set to 0.9, the alignment must cover 90% of the sequence.
-n: word length; use 4 for identity thresholds of 0.6 ~ 0.7.
-M: maximum available memory in MB (CD-HIT default 400).
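
The relationship between these options can be sketched in a few lines. The example below computes the global identity as defined above, picks the word length that the CD-HIT user guide suggests for a given protein identity threshold, and assembles a cd-hit call with the pipeline defaults. The function names are hypothetical, and this is not the code of cd_hit.py; it only uses cd-hit options already listed here plus the standard -i and -o.

```python
def global_identity(identical, len_a, len_b):
    """Identical residues divided by the full length of the shorter sequence."""
    return identical / min(len_a, len_b)


def word_size_for(identity):
    """Word length (-n) suggested by the CD-HIT user guide for proteins."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("the CD-HIT guide lists no word length for identity below 0.4")


def cd_hit_command(fasta_in, out_prefix, identity=0.6, coverage_short=0.8, memory_mb=4000):
    """Build a cd-hit call with a word length consistent with the threshold."""
    return [
        "cd-hit", "-i", fasta_in, "-o", out_prefix,
        "-c", str(identity),
        "-aS", str(coverage_short),
        "-n", str(word_size_for(identity)),
        "-M", str(memory_mb),
    ]


# 54 identical residues between a 90 aa and a 120 aa protein.
print(global_identity(54, 90, 120))
print(" ".join(cd_hit_command("merged_cds_prot.fasta", "clusters")))
```

Deriving -n from -c in one place avoids the mismatch that occurs when the identity threshold is changed but the word length is left at its old value.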

Run clusters_histogram.py to make a histogram of the number of sequences per cluster (optional)

clusters_histogram.py -i [path_to_cd-hit_output.clstr]
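
For orientation, the distribution plotted in this step can be derived from the .clstr file by counting the member lines under each ">Cluster" header. The following is a minimal text-only sketch of that idea, not the implementation of clusters_histogram.py; the function name `cluster_sizes` and the example records are made up for illustration.

```python
from collections import Counter


def cluster_sizes(clstr_lines):
    """Return the number of member sequences in each cluster, in file order."""
    sizes = []
    for line in clstr_lines:
        if line.startswith(">Cluster"):
            sizes.append(0)          # a new cluster starts
        elif line.strip() and sizes:
            sizes[-1] += 1           # one member line of the current cluster
    return sizes


# Made-up .clstr content in the layout written by cd-hit.
EXAMPLE = """>Cluster 0
0\t120aa, >seqA... *
1\t118aa, >seqB... at 95.00%
>Cluster 1
0\t300aa, >seqC... *
>Cluster 2
0\t88aa, >seqD... *
""".splitlines()

histogram = Counter(cluster_sizes(EXAMPLE))
for size in sorted(histogram):
    # One '#' per cluster of that size.
    print(f"{size:>3} | {'#' * histogram[size]}")
```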

Run clustr_filter.py to filter out clusters that contain fewer than a threshold number of sequences

clustr_filter.py -i [path_to_cd-hit_output.clstr] -n [threshold_number_of_seq] -o [path_to_output.fasta]

The output is a FASTA file containing one representative sequence from each cluster that passes the threshold.
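
The filtering logic can be sketched as follows. In a .clstr file, cd-hit marks the representative of each cluster with a trailing `*`, so it is enough to count the members of each cluster and remember the marked one. This is only an illustration and not the repository's filter script: the function name `representatives` is hypothetical, and keeping clusters with at least `min_size` members is an assumption about how the threshold is applied.

```python
import re

# Member lines look like "0\t120aa, >seqA... *" or "1\t118aa, >seqB... at 95.00%".
MEMBER = re.compile(r">(\S+?)\.\.\.\s*(\*)?")


def representatives(clstr_lines, min_size):
    """Return representative IDs of clusters with at least min_size members."""
    clusters = []  # each entry: [representative_id, member_count]
    for line in clstr_lines:
        if line.startswith(">Cluster"):
            clusters.append([None, 0])
            continue
        match = MEMBER.search(line)
        if match and clusters:
            clusters[-1][1] += 1
            if match.group(2):  # the '*' marks the representative
                clusters[-1][0] = match.group(1)
    return [rep for rep, count in clusters if count >= min_size]


# Made-up .clstr content for illustration.
EXAMPLE_CLSTR = """>Cluster 0
0\t120aa, >seqA... *
1\t118aa, >seqB... at 95.00%
>Cluster 1
0\t300aa, >seqC... *
>Cluster 2
0\t88aa, >seqD... *
""".splitlines()

print(representatives(EXAMPLE_CLSTR, min_size=2))
```

The returned IDs can then be used to select the matching records from the FASTA file of representatives that cd-hit writes next to the .clstr file.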
