Adding local fasta files to database
You can add sequences from your local fasta files to the database used by "barapost-local.py". You can do this either with the -l option or by writing the file's path to hits_to_download.tsv.
You can insert a lineage into the IDs of your reference sequences in fasta format, so that "barapost-local.py" can recognize this lineage and save it to the taxonomy file. This may be beneficial at the binning step: barapost-binning will recognize the saved taxonomy and will be able to apply "binning sensitivity" (the -s option) to query sequences whose highest-scoring alignment is against one of your added reference sequences.
The lineage format is as follows:
[ANYTHING BEFORE] <Domain>;<Phylum>;<Class>;<Order>;<Family>;<Genus>;<species> [ANYTHING AFTER]
- All taxa except species must start with a capital letter. We aren't ignorant people, are we? :)
- Spaces inside the lineage are NOT allowed.
- If the lineage is located in the middle of the ID, it must be flanked by spaces.
- There must be 6 semicolons in the lineage, since Barapost considers 7 taxonomic ranks:
  - domain
  - phylum
  - class
  - order
  - family
  - genus
  - species
- Taxa can be omitted (see the MN908947 example below).
- Correct sequence ID in fasta format with lineage for CP027920:
>This is Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;cereus it looks great
         ~~~~~~~~ ~~~~~~~~~~ ~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~~ ~~~~~~~~ ~~~~~~
            ^         ^         ^        ^           ^         ^       ^
          domain    phylum    class    order      family     genus  species
- If any taxonomic rank is absent, you can simply omit it, as for MN908947 (phylum, class and species are absent):
>This is horrible coronavirus Viruses;;;Nidovirales;Coronaviridae;Betacoronavirus; be aware
                              ~~~~~~~   ~~~~~~~~~~~ ~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~
                                 ^           ^            ^              ^
                              domain       order       family          genus
- Incorrect sequence ID:
>This is Bacteria; Firmicutes; Bacilli;Bacillales;Bacillus;cereus it looks great
                  ^           ^                  ^
                space       space        family omitted along
                                         with the 5th semicolon
To make this ID correct, remove the stray spaces and insert a semicolon before the genus name to indicate that the family name is missing:
>This is Bacteria;Firmicutes;Bacilli;Bacillales;;Bacillus;cereus it looks great
                 ^          ^                   ^
           remove space remove space      insert semicolon
- Incorrect sequence ID (accession: AB723494):
>This is Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;sp A1 it looks great
                                                                       ^
                                                          Space breaks species name.
It can be corrected with an underscore:
>This is Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;sp_A1 it looks great
                                                                       ^
                                                         replace space with underscore
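If you want to check your IDs before adding the files, the rules above can be sketched as a small Python helper. This is only an illustration of the format described on this page, not Barapost's own parser: the names `RANKS`, `LINEAGE_RE`, `find_lineage` and `format_lineage` are hypothetical, and the regular expression is an assumption derived from the rules listed here (6 semicolons, no spaces, capitalized taxa above species, lineage flanked by spaces).

```python
import re

# The 7 ranks Barapost considers, in lineage order.
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")

# Ranks above species: omitted (empty) or a capitalized name
# without spaces or semicolons.
_HIGHER = r"(?:[A-Z][^\s;]*)?"
# Species: omitted (empty) or any name without spaces or semicolons.
_SPECIES = r"[^\s;]*"

# Assumed pattern: 6 higher ranks and a species joined by exactly 6 semicolons,
# flanked by spaces or by the start/end of the ID.
LINEAGE_RE = re.compile(
    r"(?:^|\s)"
    r"(" + ";".join([_HIGHER] * 6) + ";" + _SPECIES + r")"
    r"(?=\s|$)"
)


def find_lineage(seq_id):
    """Return {rank: name} for the first lineage found in seq_id, or None."""
    match = LINEAGE_RE.search(seq_id.lstrip(">"))
    if match is None:
        return None
    return dict(zip(RANKS, match.group(1).split(";")))


def format_lineage(names):
    """Join 7 rank names (use '' for an absent rank) into a lineage string."""
    if len(names) != len(RANKS):
        raise ValueError("expected exactly 7 rank names")
    # Spaces are not allowed inside the lineage, so replace them with underscores.
    return ";".join(name.strip().replace(" ", "_") for name in names)


good = ">This is horrible coronavirus Viruses;;;Nidovirales;Coronaviridae;Betacoronavirus; be aware"
bad = ">This is Bacteria; Firmicutes; Bacilli;Bacillales;Bacillus;cereus it looks great"

print(find_lineage(good)["order"])  # Nidovirales
print(find_lineage(bad))            # None
```

Note that a check like this cannot catch a species name broken by a space: the AB723494 ID above still matches, with "sp" taken as the species. Building the lineage with `format_lineage` avoids that case, because it replaces spaces with underscores.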