Plasmid Database

pedroscampoy edited this page Sep 18, 2018 · 15 revisions

Follow these steps to download a reliable and complete plasmid database. This takes several hours but only needs to be done once.

1. Download plasmid database info file:

ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt
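
Step 2 reads accession numbers from fixed column positions (6 and 7), so it can be worth checking the layout of the downloaded file first. The following is a minimal sketch; `show_columns` is a hypothetical helper name, not part of PlasmidID, and the `curl` line in the comment assumes a curl build with FTP support.

```shell
# Hypothetical helper: print the tab-separated fields of one line, numbered.
#   $1 = file to inspect
#   $2 = line number to inspect (defaults to 1)
show_columns() {
    awk -v n="${2:-1}" 'BEGIN{FS="\t"} NR==n { for (i = 1; i <= NF; i++) print i ": " $i; exit }' "$1"
}

# Usage (assumption: the report is saved as plasmids.txt in the current directory):
#   curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt
#   show_columns plasmids.txt 1    # header line
#   show_columns plasmids.txt 3    # first row used by step 2
```

If columns 6 and 7 do not hold accession numbers, the field numbers in the command of step 2 would need to be adjusted.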

2. Fetch the sequences for all accession numbers into a FASTA file using NCBI E-utilities:

This command outputs a raw FASTA file with about 12000 sequences:

for i in $(awk 'BEGIN{FS="\t"} (NR>2) {if ($6 ~ "N") {print $6;} else {print $7}}' plasmids.txt); do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=$i&retmode=text&rettype=fasta"; done > plasmids.fna
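
Because the download runs for hours, it may help to preview which accessions the loop will request before starting it. The sketch below applies the same awk rule on its own: skip the first two lines, take column 6 when it contains an "N", otherwise take column 7. The function name `pick_accessions` is hypothetical.

```shell
# Hypothetical helper: print the accession that the download loop would
# request for each data row of the tab-separated report.
pick_accessions() {
    awk 'BEGIN{FS="\t"} NR>2 { if ($6 ~ "N") print $6; else print $7 }' "$1"
}

# Usage (assumption: plasmids.txt is in the current directory):
#   pick_accessions plasmids.txt | wc -l    # number of sequences to fetch
#   pick_accessions plasmids.txt | head     # first few accessions
```

After the download finishes, comparing this count with the number of records in plasmids.fna gives a rough indication of whether any request failed.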

3. Remove entries whose headers contain unwanted terms

From the PlasmidID folder, execute:

filter_fasta.sh -i PATH/TO/FILE/plasmids.fna -N -l gene -l partial -l putative -l protein -l hypothetical -l unnamed -o PATH/TO/FILE -n plasmids

This creates a file named plasmids_term.fasta. The -o argument sets the output directory and -n sets the file name.
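
A quick sanity check is to count the records before and after filtering and to look for headers that still contain one of the terms. This is a minimal sketch with hypothetical helper names. It uses a plain case-insensitive substring match, which is an assumption and may not be exactly how filter_fasta.sh matches terms.

```shell
# Count FASTA records (header lines start with ">").
count_seqs() {
    grep -c '^>' "$1"
}

# Hypothetical helper: count headers containing any of the given terms.
#   $1 = FASTA file, remaining arguments = terms
# The body runs in a subshell so the IFS change does not leak out.
count_term_hits() (
    fasta=$1
    shift
    IFS='|'                       # "$*" joins the terms with "|" for grep -E
    grep '^>' "$fasta" | grep -ciE "$*"
)

# Usage (assumption: both files are in PATH/TO/FILE):
#   count_seqs PATH/TO/FILE/plasmids.fna
#   count_seqs PATH/TO/FILE/plasmids_term.fasta
#   count_term_hits PATH/TO/FILE/plasmids_term.fasta gene partial putative protein hypothetical unnamed
```

The second count should be lower than the first, and under the substring assumption the last command should report few or no remaining hits.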

4. Remove redundancy

From the PlasmidID folder, execute:

cdhit_cluster.sh -i PATH/TO/FILE/plasmids_term.fasta -p -c 100 -M 20000 -T 8

NOTE:

  • The -i argument is the path to the input FASTA file (plasmids_term.fasta in the command above, or plasmids.fna if step 3 was skipped)
  • The output is written to the same location as the input
  • Memory (-M) and number of threads (-T) can be adjusted to the computer that runs this command
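
The -M and -T values in the command above are only an example. As a rough sketch, they could be derived from the machine itself. This assumes a Linux system (`nproc`, `/proc/meminfo`) and that -M is given in megabytes, as in CD-HIT itself; the 80% share of RAM is an arbitrary choice, and the fallbacks reuse the example values from the command above.

```shell
# Number of available CPU cores; fall back to 8 (the example value) if nproc is missing.
threads=$(nproc 2>/dev/null || echo 8)

# About 80% of total RAM in MB. MemTotal in /proc/meminfo is reported in kB.
mem_mb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024 * 0.8}' /proc/meminfo 2>/dev/null)

# Fall back to 20000 (the example value) if /proc/meminfo could not be read.
: "${mem_mb:=20000}"

# Print the resulting command instead of running it, so it can be reviewed first.
echo "cdhit_cluster.sh -i PATH/TO/FILE/plasmids_term.fasta -p -c 100 -M $mem_mb -T $threads"
```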

NOTE2:

This step is optional; PlasmidID works with any DNA database. Removing redundancy is useful because it reduces execution time. Any other clustering software can be used instead.