Plasmid Database
pedroscampoy edited this page Sep 18, 2018 · 15 revisions
Follow these steps to download a reliable and complete plasmid database. The process takes several hours but only needs to be done once.
Download the plasmid list (plasmids.txt) from NCBI:
ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt
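The list can also be fetched from the command line. The following is a minimal sketch, assuming curl is installed and built with FTP support; the function name download_plasmid_list is hypothetical and not part of PlasmidID.

```shell
# Hypothetical helper: fetch the NCBI plasmid report into the current directory.
download_plasmid_list() {
    # -sS: no progress bar, but still report errors; -o: output file name
    curl -sS -o plasmids.txt \
        "ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt"
}

# Usage (needs network access):
#   download_plasmid_list && wc -l plasmids.txt
```

The sequence download step reads plasmids.txt from the current directory, so run the helper in the folder where the database will be built.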
The following command fetches each sequence and writes a raw FASTA file, plasmids.fna, with about 12,000 sequences:
for i in $(cat plasmids.txt | awk 'BEGIN{FS="\t"} (NR>2) {if ($6 ~ "N") {print $6;} else {print $7}}'); do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=$i&retmode=text&rettype=fasta"; done > plasmids.fna
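To see which accession the awk filter picks for each row, it can be tried on a small stand-in file first. The sketch below uses made-up rows and assumes, as the loop above does, that the file is tab-separated, that the first two lines are skipped, and that the accession is taken from column 6 when that field contains an "N" and from column 7 otherwise.

```shell
# Build a tiny, made-up stand-in for plasmids.txt (tab-separated).
sample=$(mktemp)
printf 'header line 1\n' > "$sample"
printf 'header line 2\n' >> "$sample"
printf 'org1\tk\tg\ts\tp1\tNC_000001.1\tAB000001.1\n' >> "$sample"
printf 'org2\tk\tg\ts\tp2\t-\tCD000002.1\n' >> "$sample"

# Same selection rule as the download loop: field 6 if it contains "N", else field 7.
ids=$(awk 'BEGIN{FS="\t"} (NR>2) {if ($6 ~ "N") {print $6} else {print $7}}' "$sample")
echo "$ids"
rm -f "$sample"
```

The first row yields its column 6 value and the second row falls back to column 7, because its column 6 is only a dash.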
From the PlasmidID folder, execute:
filter_fasta.sh -i PATH/TO/FILE/plasmids.fna -N -l gene -l partial -l putative -l protein -l hypothetical -l unnamed -o PATH/TO/FILE -n plasmids
This creates a file named plasmids_term.fasta. The -o argument sets the output directory and -n sets the base file name.
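A quick way to check the effect of this step is to count the FASTA records before and after it. The sketch below is not part of PlasmidID; count_records is a hypothetical helper, and the demo file is made up so the snippet runs without the real data.

```shell
# Count FASTA records: every record starts with a header line beginning with ">".
count_records() {
    grep -c '^>' "$1"
}

# Demo on a tiny made-up FASTA file.
demo=$(mktemp)
printf '>seq1 example plasmid\nACGT\n>seq2 hypothetical protein\nGGCC\nTTAA\n' > "$demo"
n=$(count_records "$demo")
echo "records: $n"
rm -f "$demo"
```

On the real data, running count_records on plasmids.fna and then on plasmids_term.fasta shows how many sequences were kept.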
From the PlasmidID folder, execute:
cdhit_cluster.sh -i PATH/TO/FILE/plasmids_term.fasta -p -c 100 -M 20000 -T 8
NOTE:
- The -i argument is the path to the input file (plasmids.fna for filter_fasta.sh, plasmids_term.fasta for cdhit_cluster.sh)
- The output is written to the same directory as the input
- Memory (-M) and number of threads (-T) can be adjusted to suit the computer that executes this command
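One way to pick a value for -T is to read the number of available processors from the system. This is a minimal sketch, assuming getconf is available (it is on most Linux and macOS systems); it only prints the resulting command instead of running it, and the other arguments are copied from the example above.

```shell
# Ask the system how many processors are online; fall back to 1 if that fails.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Guard against an empty or non-numeric answer.
case "$threads" in
    ''|*[!0-9]*) threads=1 ;;
esac

# Print the clustering command with the detected thread count.
echo "cdhit_cluster.sh -i PATH/TO/FILE/plasmids_term.fasta -p -c 100 -M 20000 -T $threads"
```

The memory value passed to -M still has to be chosen by hand according to the RAM of the machine.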
NOTE2:
This step is optional: PlasmidID works with any DNA database. Removing redundancy is useful because it reduces execution time. Any other clustering software can be used instead.