Adding local fasta files to database

masikol edited this page May 26, 2023 · 6 revisions

Adding local files to database

You can add sequences from your local fasta files to the database used by "barapost-local.py". You can do it either with the -l option or by writing the file's path to hits_to_download.tsv.

You can insert a lineage into the IDs of your reference sequences in the fasta file, so that "barapost-local.py" can recognize this lineage and save it to the taxonomy file. This may be beneficial at the binning step: barapost-binning will recognize the saved taxonomy and will be able to apply "binning sensitivity" (-s option) to query sequences whose highest-scoring alignment is against one of your added reference sequences.

The lineage format is as follows:

[ANYTHING BEFORE] <Domain>;<Phylum>;<Class>;<Order>;<Family>;<Genus>;<species> [ANYTHING AFTER]

Rules:

  • All taxa except species must start with a capital letter. We aren't ignorant people, are we? :)
  • Spaces inside the lineage are NOT allowed.
  • If the lineage is located in the middle of the ID, it must be flanked by spaces.
  • There must be 6 semicolons in the lineage, since Barapost considers 7 taxonomic ranks:
    • domain
    • phylum
    • class
    • order
    • family
    • genus
    • species
  • Taxa can be omitted, but the semicolons must stay in place (see Example 2 below).
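
The rules above can be sketched as a regular expression. This is only an illustration of the format, not the parser Barapost itself uses; the names `RANKS`, `LINEAGE_RE` and `parse_lineage` are hypothetical and chosen for this example.

```python
import re

RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")

# A rank above species is either empty (omitted) or starts with a capital letter.
# No rank may contain spaces or semicolons.
_UPPER = r"(?:[A-Z][^\s;]*)?"
_SPECIES = r"[^\s;]*"

# Seven fields joined by six semicolons, flanked by whitespace or the ends of the ID.
LINEAGE_RE = re.compile(
    r"(?<!\S)" + ";".join([f"({_UPPER})"] * 6) + f";({_SPECIES})" + r"(?!\S)"
)


def parse_lineage(header):
    """Return a rank -> name dict, or None if no well-formed lineage is found."""
    match = LINEAGE_RE.search(header.lstrip(">"))
    if match is None:
        return None
    return dict(zip(RANKS, match.groups()))


if __name__ == "__main__":
    good = ">This is Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;cereus it looks great"
    bad = ">This is Bacteria; Firmicutes; Bacilli;Bacillales;Bacillus;cereus it looks great"
    print(parse_lineage(good))
    print(parse_lineage(bad))
```

In this sketch an omitted rank comes back as an empty string, and an ID that breaks any of the rules yields `None`.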

Examples:

  1. Correct sequence ID in fasta format with lineage for CP027920:
>This is Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;cereus it looks great
         ~~~~~~~~ ~~~~~~~~~~ ~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~~ ~~~~~~~~ ~~~~~~
         ^        ^          ^       ^          ^           ^        ^
         domain   phylum     class   order      family      genus    species
  2. If any taxonomic rank is absent, you can simply omit its name, as for MN908947 (phylum, class and species are absent):
>This is horrible coronavirus Viruses;;;Nidovirales;Coronaviridae;Betacoronavirus; be aware
                              ~~~~~~~   ~~~~~~~~~~~ ~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~ 
                              ^         ^           ^             ^          
                              domain    order       family        genus
  3. Incorrect sequence ID:
>This is Bacteria; Firmicutes; Bacilli;Bacillales;Bacillus;cereus it looks great
                  ^           ^                  ^
                space       space          Family omitted
                                           along with 5th
                                              semicolon.

To make this ID correct, we remove the forbidden spaces and insert a semicolon before the genus name to indicate that the family name is missing:

>This is Bacteria;Firmicutes;Bacilli;Bacillales;;Bacillus;cereus it looks great
                 ^          ^                   ^
        remove space   remove space            insert semicolon
  4. Incorrect sequence ID (accession: AB723494):
>This is Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;sp A1 it looks great
                                                                       ^
                                                          Space breaks species name.

It can be corrected by replacing the space with an underscore:

>This is Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;sp_A1 it looks great
                                                                       ^
                                                            replace space with underscore
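
If you prepare many reference files, the corrections shown in the examples can be automated. Below is a minimal sketch, assuming your taxonomy is available as a Python dict; the function name `make_header` and its parameters are hypothetical and are not part of Barapost.

```python
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def make_header(seq_id, taxonomy, note=""):
    """Build a fasta ID line carrying a 7-rank lineage.

    Missing ranks stay empty, so the lineage always has 6 semicolons.
    """
    fields = []
    for rank in RANKS:
        # Spaces are not allowed inside the lineage: replace them with underscores.
        name = "_".join(taxonomy.get(rank, "").split())
        if ";" in name:
            raise ValueError(f"{rank} name must not contain a semicolon: {name!r}")
        if name and rank != "species" and not name[0].isupper():
            raise ValueError(f"{rank} name must start with a capital letter: {name!r}")
        fields.append(name)
    lineage = ";".join(fields)
    # The lineage is flanked by spaces when text precedes or follows it.
    return " ".join(part for part in (">" + seq_id, lineage, note) if part)


if __name__ == "__main__":
    taxonomy = {
        "domain": "Bacteria", "phylum": "Firmicutes", "class": "Bacilli",
        "order": "Bacillales", "family": "Bacillaceae", "genus": "Bacillus",
        "species": "sp A1",
    }
    print(make_header("AB723494", taxonomy))
```

For the taxonomy above, the species name `sp A1` is written as `sp_A1`, matching the corrected ID in Example 4.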