MMETSP CLEANUP
The algorithm identifies contigs that are putatively cross-contaminated by finding pairs of identical or nearly identical sequences. It then uses coverage information to distinguish between cross-contaminated and clean sequences. The algorithm therefore assumes that contaminating DNA and the resulting reads are always present in a smaller amount than correct DNA and reads.
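The coverage-based decision can be sketched in a few lines of Python. This is only an illustration of the idea, not the pipeline's code: the function name `looks_contaminated`, its arguments, and the threshold semantics (weaker coverage divided by stronger coverage) are assumptions made for this example.

```python
def looks_contaminated(cov_a, cov_b, max_ratio):
    """Return "a" or "b" for the contig that looks like a contaminant, or None.

    cov_a and cov_b are read coverages of two near-identical contigs from
    two different samples. max_ratio is a hypothetical threshold: the weaker
    contig is flagged only if weaker/stronger coverage is at most max_ratio.
    """
    if cov_a <= 0 or cov_b <= 0 or cov_a == cov_b:
        return None  # no usable coverage difference, so no call is made
    low, high = sorted((cov_a, cov_b))
    if low / high > max_ratio:
        return None  # coverages are too similar to tell the contigs apart
    # The contig with far fewer reads is assumed to be the contaminant.
    return "a" if cov_a < cov_b else "b"


print(looks_contaminated(3.0, 120.0, 0.1))   # contig "a" is poorly covered
print(looks_contaminated(80.0, 100.0, 0.1))  # similar coverage, no call
```

With this definition a smaller `max_ratio` demands a larger coverage gap, so fewer contigs are flagged.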
A deployed and prepared environment for running MMETSP cleanup inside a virtual machine.
- 150 GB of free space
- 20+ GB of RAM
- install virtualbox:
$ sudo apt-get install virtualbox
- install vagrant:
$ sudo apt-get install vagrant
- install vagrant disksize plugin:
$ vagrant plugin install vagrant-disksize
- Create an empty folder and download the preparation shell script into it:
$ wget http://kolisko-lab.bc.cas.cz/mmetsp_cleanup/download_box.sh
- Make the script executable and run it
$ chmod +x download_box.sh
$ ./download_box.sh
This script downloads all the files needed for the virtual machine (~25 GB).
- add the box to the vagrant environment
$ vagrant box add mmetsp_cleanup mmetsp_cleanup.box
- start the virtual machine
$ vagrant up
When the virtual machine is ready, you can connect to it using ssh:
$ vagrant ssh
Main configuration files for editing:
/home/vagrant/mmetsp_data/settings.yml
- coverage_ratio thresholds are defined there
- winston.hits_filtering.len_ratio: minimal qcovhsp for hits filtering
- winston.hits_filtering.len_minimum: minimal hit length for hits filtering
- winston.coverage_ratio.REGULAR: reads coverage ratio for the REGULAR dataset pair type (minimal difference between the coverage of the LEFT_ORG and RIGHT_ORG contigs to consider one of them contaminated; lower values make contamination prediction more strict, so fewer contaminations will be found)
- winston.coverage_ratio.CLOSE: reads coverage ratio for the CLOSE dataset pair type
- winston.coverage_ratio.LEFT_EATS_RIGHT: reads coverage ratio for the LEFT_EATS_RIGHT dataset pair type
- winston.coverage_ratio.RIGHT_EATS_LEFT: reads coverage ratio for the RIGHT_EATS_LEFT dataset pair type
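The dotted option names read naturally as paths into a nested configuration. The sketch below shows that mapping with a plain Python dictionary. It assumes that settings.yml is nested the same way as the dotted names suggest; every number in it is a placeholder invented for the example, and `get_setting` is a hypothetical helper, not part of the pipeline.

```python
from functools import reduce

# Placeholder values only; the real thresholds live in settings.yml.
settings = {
    "winston": {
        "hits_filtering": {"len_ratio": 70, "len_minimum": 100},
        "coverage_ratio": {
            "REGULAR": 1.1,
            "CLOSE": 0.04,
            "LEFT_EATS_RIGHT": 10,
            "RIGHT_EATS_LEFT": 0.1,
        },
    }
}


def get_setting(config, dotted_key):
    """Walk a nested dict with a dotted key such as 'winston.coverage_ratio.CLOSE'."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), config)


print(get_setting(settings, "winston.coverage_ratio.CLOSE"))
print(get_setting(settings, "winston.hits_filtering.len_minimum"))
```

A missing key raises `KeyError`, which makes a mistyped option name visible immediately.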
/home/vagrant/mmetsp_data/types.csv
- file with types and thresholds for datasets. It contains all possible combinations of dataset pairs.
The structure of file:
LEFT_ORG_ID,RIGHT_ORG_ID,THRESHOLD,TYPE
- THRESHOLD - (float) minimal percentage of identity of a BLAST hit to consider it suspicious.
- TYPE - (string) pair type, described in settings.yml
- TYPES:
  - REGULAR - two unrelated organisms
  - CLOSE - evolutionarily close species, with more stringent settings for contamination identification to reduce false positives
  - LEFT_EATS_RIGHT and RIGHT_EATS_LEFT - for situations where one sequenced organism is also present in other cultures as a food source
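Before a long run it can help to check an edited types.csv for typos. The following is a minimal sketch using Python's standard `csv` module. The organism IDs and thresholds in the sample are made up, `read_types` is a hypothetical helper, and the code tolerates a header row because the README does not say whether the real file has one.

```python
import csv
import io

FIELDS = ["LEFT_ORG_ID", "RIGHT_ORG_ID", "THRESHOLD", "TYPE"]
VALID_TYPES = {"REGULAR", "CLOSE", "LEFT_EATS_RIGHT", "RIGHT_EATS_LEFT"}


def read_types(handle):
    """Parse types.csv rows, converting THRESHOLD to float and checking TYPE."""
    rows = []
    for record in csv.DictReader(handle, fieldnames=FIELDS):
        if record["LEFT_ORG_ID"] == "LEFT_ORG_ID":
            continue  # skip a header row if the file has one
        if record["TYPE"] not in VALID_TYPES:
            raise ValueError(f"unknown pair type: {record['TYPE']!r}")
        record["THRESHOLD"] = float(record["THRESHOLD"])
        rows.append(record)
    return rows


# Made-up sample content; real files list every pair of datasets.
sample = io.StringIO(
    "LEFT_ORG_ID,RIGHT_ORG_ID,THRESHOLD,TYPE\n"
    "ORG_A,ORG_B,98.0,REGULAR\n"
    "ORG_A,ORG_C,99.5,CLOSE\n"
)
rows = read_types(sample)
print(len(rows), rows[1]["TYPE"], rows[1]["THRESHOLD"])
```

To check a real file, pass `open("types.csv", newline="")` instead of the in-memory sample.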
In the VM home folder, you will find a script run.sh, which starts the MMETSP cleanup pipeline. To start the process, simply run that script:
$ ./run.sh
The results will appear in the folder: /home/vagrant/mmetsp_data/results/
To exit the virtual machine terminal session, type:
$ exit
To stop the running virtual machine from your local computer, type:
$ vagrant halt
To view the status of the VM, type:
$ vagrant status
You can share files between the VM and your local computer by putting them in the folder that contains the Vagrantfile. They will appear in the VM in the /vagrant folder. This can be useful if you don't want to edit types.csv and settings.yml from inside the VM.
Email me and I can solve your problems and answer your questions: [email protected]
Receives a path to the datasets folder with all the .fas files (--datasets_path).
Checks whether the file name and the MMETSP name of the contigs are equal.
As a result, builds a wrong_names.csv file with the structure: file_name,name_of_contigs.
The script also ensures that all the contigs in a file belong to the same MMETSP sample.
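The check described above could look roughly like the sketch below. It is not the script's actual code: it assumes that the sample ID is a token of the form `MMETSP` followed by digits in both the file name and each FASTA header, and the helper names are invented for illustration.

```python
import re

MMETSP_ID = re.compile(r"MMETSP\d+")  # assumed shape of a sample ID


def contig_sample_ids(fasta_lines):
    """Collect the sample IDs found in the FASTA header lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            match = MMETSP_ID.search(line)
            if match is None:
                raise ValueError(f"no sample ID in header: {line.strip()}")
            ids.add(match.group(0))
    return ids


def check_dataset(file_name, fasta_lines):
    """Return a (file_name, name_of_contigs) row if the names disagree, else None."""
    ids = contig_sample_ids(fasta_lines)
    if len(ids) != 1:
        raise ValueError(f"{file_name}: expected one sample, found {len(ids)}")
    (contig_id,) = ids
    file_id = MMETSP_ID.search(file_name)
    if file_id is None or file_id.group(0) != contig_id:
        return (file_name, contig_id)
    return None


print(check_dataset("MMETSP0001.fas", [">MMETSP0002_contig_1\n", "ACGT\n"]))
print(check_dataset("MMETSP0001.fas", [">MMETSP0001_contig_1\n", "ACGT\n"]))
```

Each non-`None` result corresponds to one row of wrong_names.csv in this sketch.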
Receives a path to the wrong_names.csv file (--wrong_names_path) and a path to the datasets folder (--datasets_path).
In each file with a wrong contig name, the script replaces the MMETSP name of the contigs with the name taken from the file name.
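As a rough illustration of the renaming step, the sketch below rewrites only the FASTA header lines and leaves the sequence lines untouched. The function name `rename_contigs` and the example IDs are hypothetical; how the real script locates the name inside a header is not documented here.

```python
def rename_contigs(fasta_lines, wrong_name, name_from_file):
    """Yield FASTA lines with the wrong sample name replaced in headers only."""
    for line in fasta_lines:
        if line.startswith(">"):
            # Replace the first occurrence so the rest of the header is kept.
            line = line.replace(wrong_name, name_from_file, 1)
        yield line


original = [">MMETSP0002_contig_1\n", "ACGT\n"]
fixed = list(rename_contigs(original, "MMETSP0002", "MMETSP0001"))
print(fixed)
```

Because the function is a generator, it can stream a large .fas file line by line instead of loading it into memory.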
Replaces all the occurrences of "wrong names" in each one vs all .blastab file.
Replaces all the occurrences of "wrong names" in each all vs all .blastab file.
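For the .blastab files, a replacement restricted to the ID columns might look like the sketch below. It assumes the default BLAST tabular layout, in which the first two tab-separated columns are the query and subject IDs. Whether the real scripts limit the replacement to these columns is not stated, and `fix_blastab_line` is a made-up name.

```python
def fix_blastab_line(line, wrong_name, right_name):
    """Replace a wrong sample name in the query and subject ID columns only."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 2:
        return line.rstrip("\n")  # comment or malformed line, left unchanged
    for index in (0, 1):  # query ID and subject ID columns
        columns[index] = columns[index].replace(wrong_name, right_name)
    return "\t".join(columns)


hit = "MMETSP0002_c1\tMMETSP0005_c9\t99.1\t250\n"
print(fix_blastab_line(hit, "MMETSP0002", "MMETSP0001"))
```

Restricting the change to the ID columns avoids touching numeric fields that happen to contain the same characters.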
Checks whether the files in the provided folder contain wrong names.
$ tar cvf datasets.tar.gz datasets/*.fas
$ tar cvf one_vs_all.tar.gz one_vs_all/*.blastab
$ tar cvf all_vs_all.tar.gz all_vs_all/*.blastab
Receives the three .tar archives with:
- datasets (.fas files)
- one vs all BLAST hits (.blastab)
- all vs all BLAST hits (.blastab)
As output, it creates the prepared structure of a Decross project.
#TODO