The DNA library preparation and sequenging steps were made at LaCTAD sequencing facility. Details about Illumina sequenging strategies and the resulting data are described:
-
TruSeq Paired-end
- ~350 bp - average insert size
- 2 * 101 bp
- 1 Lane
- 1 Sample
Name Lane Index Sample #Reads % Bases >= Q30 DNAPE1 L006 CAGATC Drone 2 * 120,948,424 64.85 DNAPE2 L001 ACAGTG Drone 2 * 130,859,645 94.53 -
Nextera Mate-pair
- ~3 Kbp - average fragment size
- 2 * 101 bp
- 1 Lane
- 1 Sample
Name Lane Index Sample #Reads % Bases >= Q30 DNAMatePair L007 NoIndex Drone 2 * 85,513,161 92.10
The sequencing step was performed by LaCTAD and the above information was provided by MSc. Osvaldo Reis Júnior (LaCTAD).
The genome heterozygosity, repeat content, and size was evaluated from sequencing reads using a kmer-based statistical approach using GenomeScope software v. 1.0.
In order to perform the estimates we need to execute Jellyfish v. 2.2.0 (run_jellyfish.sh) for counting of k-mers in DNA. So, we first concatenated all the R1 files into one file (FVARIA_R1.fq) and all the R2 files into another file (FVARIA_R2.fq):
cat ./raw/*_R1_*.fastq > FVARIA_R1.fq
cat ./raw/*_R2_*.fastq > FVARIA_R2.fq
./run_jellyfish.sh
Jellyfish was executed using a range of k-mer sizes, from 19 to 63 with an increment of 4. The Jellyfish histogram of k-mer counts was also created with run_jellyfish.sh.
Execution of GenomeScope for all Jellyfish histogram files:
for i in `ls *.histo`; do k=`basename ${i} .histo | sed 's/mer_out//'`; genomescope.R ${i} ${k} 101 ./GenomeScope${k}; done
- Genome size
cat ./GenomeScope*/summary.txt | grep 'Genome Haploid Length' | perl -F"\s{2,}" -lane 'INIT { $min=0; $max=0; $n=0; } $n++; $F[1]=~s/\D+//g; $F[2]=~s/\D+//g; $min+=$F[1]; $max+=$F[2]; END { print $F[0],"\t",sprintf("%.2f",($min/$n)/1000000)," Mbp","\t",sprintf("%.2f", ($max/$n)/1000000)," Mbp"; }'
Output (Min./Max.):
Genome Haploid Length 345.77 Mbp 345.90 Mbp
- Heterozygosity
The average of minimum and maximum heterozygosity estimates using distinct k-mer sizes:
cat ./GenomeScope*/summary.txt | grep 'Heterozygosity' | perl -F"\s{2,}" -lane 'INIT { $min=0; $max=0; $n=0; } $n++; $F[1]=~s/\%//g; $F[2]=~s/\%//g; $min+=$F[1]; $max+=$F[2]; END { print $F[0],"\t",sprintf("%.4f",($min/$n))," %","\t",sprintf("%.4f", ($max/$n))," %"; }'
Output (Min./Max.):
Heterozygosity 0.1258 % 0.1277 %
- Genome Repeat Length
The average of minimum and maximum genome repeat length estimates using distinct k-mer sizes:
cat ./GenomeScope*/summary.txt | grep 'Genome Repeat Length' | perl -F"\s{2,}" -lane 'INIT { $min=0; $max=0; $n=0; } $n++; $F[1]=~s/\D+//g; $F[2]=~s/\D+//g; $min+=$F[1]; $max+=$F[2]; END { print $F[0],"\t",sprintf("%.2f",($min/$n)/1000000)," Mbp","\t",sprintf("%.2f", ($max/$n)/1000000)," Mbp"; }'
Output (Min./Max.):
Genome Repeat Length 76.05 Mbp 76.08 Mbp
- Model Fit
The average of minimum and maximum model fit estimates using distinct k-mer sizes:
cat ./GenomeScope*/summary.txt | grep 'Model Fit' | perl -F"\s{2,}" -lane 'INIT { $min=0; $max=0; $n=0; } $n++; $F[1]=~s/\%//g; $F[2]=~s/\%//g; $min+=$F[1]; $max+=$F[2]; END { print $F[0],"\t",sprintf("%.4f",($min/$n))," %","\t",sprintf("%.4f", ($max/$n))," %"; }'
Output (Min./Max.):
Model Fit 97.1158 % 99.3401 %
- Read Error Rate
The average of minimum and maximum read error rate estimates using distinct k-mer sizes:
cat ./GenomeScope*/summary.txt | grep 'Read Error Rate' | perl -F"\s{2,}" -lane 'INIT { $min=0; $max=0; $n=0; } $n++; $F[1]=~s/\%//g; $F[2]=~s/\%//g; $min+=$F[1]; $max+=$F[2]; END { print $F[0],"\t",sprintf("%.8f",($min/$n))," %","\t",sprintf("%.8f", ($max/$n))," %"; }'
Output (Min./Max.):
Read Error Rate 0.12025125 % 0.12025125 %
The evaluation of contamination was performed on all fastq reads using Kraken2 software v. 2.0.7-beta with a custom database, with includes not only the default databases with RefSeq complete genomes (bacteria, plasmid, viral, human, fungi, plant and protozoa species), but also other not completed genomes of bacteria, fungi and protozoa species, in addition to the sequences of NCBI nucleotides (nt) database.
kraken2 --db /usr/local/bioinfo/kraken2/DB --confidence 0.51 --threads 50 --minimum-base-quality 30 --memory-mapping --paired --report kraken2_51_report.txt --output kraken2_51_output.txt FVARIA_R1.fq FVARIA_R2.fq
Output:
Loading database information... done. 337321230 sequences (68138.89 Mbp) processed in 543.791s (37218.8 Kseq/m, 7518.21 Mbp/m). 3411421 sequences classified (1.01%) 333909809 sequences unclassified (98.99%)
perl -F"\t" -lane ' next if ($F[3] !~ /^(?:R|U\d*|D|K)$/); print join("\t", $F[0],$F[3],$F[5]);' kraken2_51_report.txt
Output:
98.99 U unclassified 1.01 R root 0.41 D Eukaryota 0.11 K Metazoa 0.00 K Fungi 0.00 K Viridiplantae 0.00 D Bacteria 0.00 D Viruses
The current Kraken2 database doesn't have genomes of Insecta class.
The software Fastqc was used to perform the data quality control. Then the paired-end reads were analyzed with software Trimmomatic v. 0.35 for dentification of adapter sequences and quality filtering. The following steps were executed with Trimmomatic:
- Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
- Remove leading low quality or N bases (below quality 3) (LEADING:3)
- Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
- Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
- Drop reads below the 100 bases long (MINLEN:100)
The mate-pair reads were analyzed using NxTrim v. 0.4.1 to
to discard low quality reads and categorise reads according to the orientation implied by the adapter location
. Thus, it exploits the particular properties of Illumina Nextera Mate Pair data and evaluates the read on both sides
of the adapter sites. The NxTrim was executed with a more aggressive adapter search (--aggressive parameter).
The NxTrim builds 'virtual libraries' of mate pairs, paired-end reads and single-ended reads. NxTrim
also trims off adapter read‐through.
-
Post processing steps:
Protocol #Reads Paired-end 234,357,438 Mate-pair 9,617,088
The pre-processing step was performed by LaCTAD and the above information was provided by MSc. Osvaldo Reis Júnior (LaCTAD).
The assembly was performed using software SPAdes v. 3.9.0 using error correction (BayesHammer module) and using 13 k-mer lengths (33, 37, 41, 45, 49, 53, 57, 61, 65, 69, 73, 77 and 81).
#Scaffolds: 9,755 Assembly size: 280,951,552 bp Average size of scaffolds: 28,800 bp N50: 81,551 (980 contigs)
The assembly step was performed by LaCTAD and the above information was provided by MSc. Osvaldo Reis Júnior (LaCTAD).
First, the raw mate-pairs reads were processed using NxTrim according to the script run_nxtrim.sh:
./run_nxtrim.sh
To execute the script run_nxtrim.sh we need the ./raw directory containing the fastq files with DNAPE and DNAMatePair prefixes.
The software BESST v. 2.2.4 was used to increase scaffolding of the genomic assembly. In order to do this, we first aligned all the processed reads to the assembled genome using HISAT2 v. 2.0.5, and then we sorted and indexed the alignments using samtools v. 1.3.1.
cd ./ref
hisat2-build genome genome.fa
cd ../
To execute the script run_hisat2.sh, first we need to create an index for the the fasta file with genome assembly (genome.fa) using software hisat2-build.
./run_hisat2.sh
The script run_hisat2.sh must be executed in a directory containing the following input directories:
- "./in" containing fastq files with DNAPE and DNAMatePair prefixes processed with run_nxtrim.sh;
- "./ref" containing hisat2 index files (*.ht2);
So, we executed BESST pipeline with sorted and indexed alignments:
mkdir ./out/
runBESST \
-q \
--orientation rf fr \
--separate_repeats \
--min_mapq 30 \
-c ./ref/genome.fa \
-g \
-f \
./out/DNAMP_sorted.bam ./out/DNAPE_sorted.bam
The final assembly was named Fvar-1.1.fa and then copied to Fvar-1.2.fa, i.e. these fasta files are the same. The genome fasta file was compressed in Fvar-1.2.fa.gz.
#Scaffolds: 2,173 Assembly size: 275,421,029 bp Average size of scaffolds: 126,747 bp N50: 467,007 (176 contigs)
The genome evaluation was made using QUAST-LG v. 5.0.2. The estimated genome size of 345 Mbp was used in the NG50 related statistics.
Fvar-0.1.fa is the same as genome.fa (first assembly with SPADes) Fvar-1.2.fa is the final assembly (using BESST)
Statistics without reference | Fvar-0.1 | Fvar-1.2 |
---|---|---|
# contigs (>= 0 bp) | 9,755 | 2,173 |
Largest contig | 603,324 | 2,258,834 |
Total length (>= 100000 bp) | 116,286,290 | 255,941,895 |
N50 | 83,201 | 470,005 |
N75 | 43,559 | 244,533 |
L50 | 946 | 174 |
L75 | 2,087 | 379 |
GC (%) | 36.72 | 36.72 |
NG50 | 61,750 | 340,840 |
NG75 | 14,801 | 87,636 |
LG50 | 1,432 | 264 |
LG75 | 4,018 | 723 |
The full report is available in HTML or TXT.
The gene prediction was made with Maker2 v. 3.00 using the F. varia assembled genome (Fvar-1.1.fa). First, we used Maker2 with the default gene prediction protocol with Ab-initio gene prediction softwares and some EST Evidences.
- EST Evidence
- ests.fa - 20,064 Expressed Sequence Tags (ESTs) of F. varia obtained from NCBI GenBank;
- altests.fa - 169,511 Expressed Sequence Tags (ESTs) of Apis mellifera obtained from NCBI GenBank;
- Protein Homology Evidence
- GCF_000002195.4_Amel_4.5_protein.faa - Proteins of Apis mellifera obtained from NCBI Genome; Moreover, the following Ab-initio gene prediction softwares in Maker2 was also configured:
- SNAP v. 2006-07-28 - configured with A.mellifera.hmm;
- Augustus v. 3.3 - configured with fly model;
- genemark v. 3.52 - eukaryotic executable;
Full details can be obtained by consulting the Maker2 configuration files (maker2_1): maker_bopts.ctl maker_evm.ctl maker_exe.ctl maker_opts.ctl
So, we executed Maker2 again with another gene evidences. In addition to the previous configuration, we added another EST evidence based on transcriptome assembly together with the previous assembly:
- Re-annotation Using MAKER Derived GFF3
- Fvar-1.1.maker.output/Fvar-1.1.all.gff - previous MAKER Derived GFF3;
- EST evidence
- transcriptXgenome_filtered_sorted.gff - F. varia assembled transcriptome (contigs) mapped on F. varia genome assembly (Fvar-1.1.fa) using UCSC Genome Browser Tools (blat, pslCDnaFilter, pslToBed, bedToGenePred, genePredToGtf) and GBrowse Tools (gtf2gff3);
Full details can be obtained by consulting the Maker2 configuration files (maker2_2): maker_bopts.ctl maker_evm.ctl maker_exe.ctl maker_opts.ctl
We create the gene feature file from the output of Maker2 (an output folder with a datastore index) using gff3_merge script to combine all the GFFs.
gff3_merge -d Fvar-1.2.maker.output/Fvar-1.2_master_datastore_index.log
We filtered out all the complementary features contained in the generated GFF file.
grep -v -P '\t(contig|repeatmasker|snap_masked|augustus_masked_match|augustus_masked|blastx|protein2genome|evm|blastn|est2genome|cdna2genome|repeatrunner|tblastx)\t' Fvar-1.2.all.gff | grep -v '^[#>]' | grep -v -P '^[ACGTN]+$' > Fvar-1.2.cleaned.gff
We also converted GFF to GTF file:
cat Fvar-1.2.cleaned.gff | gff2gtf.pl > Fvar-1.2.cleaned.gtf
So, we renamed the features using the prefix Fvar using first the rename_gtf.pl script and then the rename_gff.pl script. After renaming with rename_gff.pl the checkGFFcoords.pl script was executed to check three_prime_UTR and five_prime_UTR features and adjust the first/last exon coordinates.
rename_gtf.pl -i Fvar-1.2.cleaned.gtf -o Fvar-1.2.gtf -p Fvar
./rename_gff.pl -i Fvar-1.2.cleaned.gff -o Fvar-1.2.tmp.gff
./checkGFFcoords.pl -i1 Fvar-1.2.all.gff -i2 Fvar-1.2.tmp > Fvar-1.2.gff
rm -f ./Fvar-1.2.tmp.gff
All these steps were made for the first execution of Maker2, but only the last one was represented in the commands above.
A BUSCO analysis was performed using the Hymenoptera dataset (OBP9) considering both modes, genome and transcriptome. The analysis was also performed comparing the results among Hymenoptera species in genomes or transcriptomes.
- genome (Abbrev./Species/File) Ador Apis dorsata GCF_000469605.1_Apis_dorsata_1.3_genomic.fna Aech Acromyrmex echinatior GCF_000204515.1_Aech_3.9_genomic.fna Aflo Apis florea GCF_000184785.2_Aflo_1.0_genomic.fna Amel Apis mellifera GCF_000002195.4_Amel_4.5_genomic.fna Bimp Bombus impatiens GCF_000188095.1_BIMP_2.0_genomic.fna Bter Bombus terrestris GCF_000214255.1_Bter_1.0_genomic.fna Dnov Dufourea novaeangliae GCF_001272555.1_ASM127255v1_genomic.fna Emex Eufriesea mexicana GCF_001483705.1_ASM148370v1_genomic.fna Fvar Frieseomelitta varia Fvar-1.2.fa Hlab Habropoda laboriosa GCF_001263275.1_ASM126327v1_genomic.fna Lalb Lasioglossum albipes GCA_000346575.1_ASM34657v1_genomic.fna Mqua Melipona quadrifasciata GCA_001276565.1_ASM127656v1_genomic.fna Mrot Megachile rotundata GCF_000220905.1_MROT_1.0_genomic.fna
- transcriptome (Abbrev./Species/File) Ador Apis dorsata GCF_000469605.1_Apis_dorsata_1.3_rna.fna Aflo Apis florea GCF_000184785.2_Aflo_1.0_rna.fna Amel Apis mellifera GCF_000002195.4_Amel_4.5_rna.fna Bimp Bombus impatiens GCF_000188095.1_BIMP_2.0_rna.fna Bter Bombus terrestris GCF_000214255.1_Bter_1.0_rna.fna Dnov Dufourea novaeangliae GCF_001272555.1_ASM127255v1_rna.fna Emex Eufriesea mexicana GCF_001483705.1_ASM148370v1_rna.fna Fvar Frieseomelitta varia Fvar-1.2_rna.fa Hlab Habropoda laboriosa GCF_001263275.1_ASM126327v1_rna.fna Mqua Melipona quadrifasciata GCA_001276565.1_ASM127656v1_rna.fna Mrot Megachile rotundata GCF_000220905.1_MROT_1.0_rna.fna Nvit Nasonia vitripennis GCF_000002325.3_Nvit_2.1_rna.fna
In the manuscript, only the transcriptome mode was attached.
A coding gene function annotation was performed with eggNOG-mapper v. 2.0.0 before submission to NCBI Genome Database. The protein sequences (Fvar-1.2-proteins.fa) were extracted from GFF (Fvar-1.2.gff) using gffread software v. 0.11.7.
mkdir -p em/
sed 's/\.$//' Fvar-1.2-proteins.fa > Fvar-1.2-proteins-cleaned.fa
perl -e 'use Bio::SeqIO; my $seqin = Bio::SeqIO->new(-file=>"Fvar-1.2-proteins-cleaned.fa", -format=>"FASTA"); my $seqout = Bio::SeqIO->new(-fh=>\*STDOUT,-format=>"FASTA",-width=>1000000); while (my $seq=$seqin->next_seq() ) { $seqout->write_seq($seq); } ' | split -l 20000 -a 3 -d - em/input_file.chunk_
for f in em/*.chunk_*; do
emapper.py -m diamond --no_annot --no_file_comments --cpu 40 -i ${f} -o ${f};
done
cat em/*.chunk_*.emapper.seed_orthologs > em/input_file.emapper.seed_orthologs
emapper.py --annotate_hits_table em/input_file.emapper.seed_orthologs --no_file_comments -o em/output_file --cpu 40
The evaluation of contamination was performed on assembled genome using Kraken2 software v. 2.0.7-beta with a custom database, with includes not only the default databases with RefSeq complete genomes (bacteria, plasmid, viral, human, fungi, plant and protozoa species), but also other not completed genomes of bacteria, fungi and protozoa species, in addition to the sequences of NCBI nucleotides (nt) database.
kraken2 --db /usr/local/bioinfo/kraken2/DB --confidence 0.51 --threads 50 --minimum-base-quality 30 --memory-mapping --report kraken2_51_report.txt --output kraken2_60_output.txt Fvar-1.2.fa
Output:
Loading database information... done. 2173 sequences (275.42 Mbp) processed in 17.291s (7.5 Kseq/m, 955.69 Mbp/m). 0 sequences classified (0.00%) 2173 sequences unclassified (100.00%)
perl -F"\t" -lane ' next if ($F[3] !~ /^(?:R|U\d*|D|K)$/); print join("\t", $F[0],$F[3],$F[5]);' kraken2_51_report.txt
Output:
100.00 U unclassified
The current Kraken2 database doesn't have genomes of Insecta class.
We load F. varia genome and gene prediction coordinates to our UCSC Genome Browser instalation at LBDA. The F. varia genome can be accessed here.
We tried four approaches for mitochondrial assembly evaluation:
- NOVOPlasty v. 2.5.9 (COX2 gene of Bombus hypocrita sapporensis as seed input sequence) - default parameters and all processed reads;
- MITObim v. 1.9 (COX2 gene of Bombus hypocrita sapporensis as seed input sequence) - default parametersand only the processed reads mapped against mitochondrial bee mitochondrial genomes;
- SPAdes v. 3.6.2 (default parameters and only the processed reads mapped against mitochondrial bee mitochondrial genomes);
- Platanus v. 1.2.4 (default parameters and only the processed reads mapped against mitochondrial bee mitochondrial genomes); The approach with NOVOPlasty was selected because it was the only resulting in one contig with evidence of circularization.
The organelle-specific software NOVOPlasty v. 2.5.9 was used for mitochondrial assembly with default parameters considering COX2 gene of Bombus hypocrita sapporensis. (NC_011923) as seed input sequence. All the paired-end reads were used as input for the NOVOPlasty procedures as recommended by its developers.
NOVOPlasty2.5.9.pl -c config.txt
The configuration file for NOVOPlasty (config.txt) contains all the parameters for the execution. The final alignment file is Fvar-MT-1.2.fa.
In order to validate the assembly, we selected only the high quality paired-end reads. So, we process these reads using the PrinSeq v. 0.20.3 under more stringent threshold parameters: trimming reads using the sliding window approach, considering the mean of quality score below 28 in a window size of 3 bp sliding 1 bp to the left, in addition to filtering sequences with at least one quality score below 30.
prinseq-lite.pl -fastq ./raw/DNAPE1_CAGATC_L006_R1.fastq \
-fastq2 ./raw/DNAPE1_CAGATC_L006_R1.fastq \
-out_format 3 \
-out_good ./processed/prinseq/mitogenome/DNAPE1_CAGATC_L006.prinseq \
-min_qual_score 30 \
-out_bad null \
-qual_noscale \
-no_qual_header \
-min_len 20 \
-ns_max_p 80 \
-noniupac \
-trim_qual_right 25 \
-trim_qual_type mean \
-trim_qual_rule lt \
-trim_qual_window 3 \
-trim_qual_step 1
prinseq-lite.pl -fastq ./raw/DNAPE2_ACAGTG_L001_R1.fastq \
-fastq2 ./raw/DNAPE2_ACAGTG_L001_R2.fastq \
-out_format 3 \
-out_good ./processed/prinseq/mitogenome/DNAPE2_ACAGTG_L001.prinseq \
-min_qual_score 30 \
-out_bad null \
-qual_noscale \
-no_qual_header \
-min_len 20 \
-ns_max_p 80 \
-noniupac \
-trim_qual_right 25 \
-trim_qual_type mean \
-trim_qual_rule lt \
-trim_qual_window 3 \
-trim_qual_step 1
This process resulted in 74,938,483 paired-end reads that were aligned against the assembled mitochondrial genome using bowtie v. 0.12.7 to immediately evaluate the mapping coverage. Then, we used the REAPR v. 1.0.18 to evaluate the genome assembly accuracy with the previous mapping.
The alignment with bowtie was configured to reported only end-to-end hits with up to 1 mismatch, and with a maximum insert size of 300 for paired-end alignment.
cat ./processed/prinseq/mitogenome/DNAPE*_1.fastq > ./processed/prinseq/mitogenome/DNAPE.prinseq_1.fastq
cat ./processed/prinseq/mitogenome/DNAPE*_1.fastq > ./processed/prinseq/mitogenome/DNAPE.prinseq_2.fastq
bowtie-build ./Fvar-MT-1.2.fa Fvar-MT-1.2
bowtie -v 1 -p 8 -X 300 -a --fr ./Fvar-MT-1.2.fa \
-1 ./processed/prinseq/mitogenome/DNAPE.prinseq_1.fastq \
-2 ./processed/prinseq/mitogenome/DNAPE.prinseq_2.fastq \
-S ./bowtiexPEreads_Fvar-MT-1.2.sam
samtools view -Sb ./bowtiexPEreads_Fvar-MT-1.2.sam | samtools sort - > ./bowtiexPEreads_Fvar-MT-1.2.bam
reapr facheck ./Fvar-MT-1.2.fa
reapr preprocess ./Fvar-MT-1.2.fa ./bowtiexPEreads_Fvar-MT-1.2.bam ./reapr_out
cd ./reapr_out
reapr stats ./ 01.stats
reapr fcdrate ./ 01.stats 02.fcdrate
fcdcutoff=`tail -n 1 ./02.fcdrate.info.txt | cut -f 1`
reapr score 00.assembly.fa.gaps.gz 00.in.bam 01.stats ${fcdcutoff} 03.score
reapr break 00.assembly.fa 03.score.errors.gff.gz 04.break
reapr summary 00.assembly.fa 03.score 04.break 05.summary
According to the REAPR results, there are 57.53% of error free bases in the assembled mitochondrial genome of 15,144 bp and no other errors were identified, such as the REAPR fragment coverage distribution (FCD) error, or even an evidence of local misassemblies.
The summary report of REAPR analysis has more details about the assembly evaluation.
Furthermore, we made a comparison of the F. varia assembled genome with Apis mellifera, Bombus and Melipona mitochondrial genomes using the software MAUVE v. 2.4.0 for constructing multiple genome alignment. From this alignment we defined six genome blocks in the assembled genome according to rearrangements observed in the mitochondrial gene order compared to other high eusocial bee species. So, we used an in-house script (mappings_checker_on_blocks.pl) to support the alignments of fragments across the blocks' junctions, i.e. if the two reads of each pair aligned one in each adjacent block, thus supporting the adjacency.
mappings_checker_on_blocks.pl -b fvaria_mauve_blocks.bed -a ./bowtiexPEreads_Fvar-MT-1.2.bam > output_adjacency.txt
The blocks' coordinates are in a bed file (fvaria_mauve_blocks.bed). And the output of this evaluation is also available (mappings_checker_on_blocks.pl). The number of mapped paired-end reads (R1 and R2) accross the blocks' junctions support the current assembly and blocks organization.
The genome annotation was initially done using the software MITOS2 and later it was done manually to correctly detect the start and stop codons using ORFfinder and BLAST tools, which were also used to identify rRNAs' coordinates. The tRNAs were identified by the softwares tRNAscan-SE v. 2.0 and ARWEN using default parameters. The genome map was generated by OGDRAW v. 1.3.1 from the GenBank flat file.
We also generated a TBL file to represent the genes mapping and its organization in the genome.
- Nuclear genome
We first adjusted the GFF to satisfy NCBI genome submission requirements, such as locus_tag attribute. We also added eggNOG-mapper annotation for the coding genes.
./adj_gff.pl -i Fvar-1.2.gff -e ./output_file.emapper.annotations > Fvar-1.2-ncbi.gff
The GFF obtained with Maker2 was annotated with eggNOG-mapper v. The GFF (Fvar-1.2-ncbi.gff) was converted to Sequin data file (Fvar-1.2.sqn) and was sent to NCBI Genome database.
table2asn -M n -J -c w -euk -t ./template.sbt -gaps-min 10 -l paired-ends -i ./Fvar-1.2.fa -f ./Fvar-1.2-ncbi.gff -o ./Fvar-1.2.sqn -n "Frieseomelitta varia" -taxid 561572 -V b -T -j "[organism=Frieseomelitta varia] [topology=linear] [location=genomic] [moltype=DNA] [gcode=1] [sex=male] [country=Brazil] [dev-stage=pharate-adult] [Tech=wgs] [completedness=partial] [common=marmelada] [strand=double]"
- Mitochondrial genome
The TBL file (Fvar-MT-1.2.1.tbl) obtained with Sequin was manually adjusted (for example, the names of gene products were adjusted to follow the pattern established by NCBI), then converted to Sequin file (Fvar-MT-1.2.1.sqn) and was also sent to NCBI Genome database.
table2asn -C lbda -Z -M n -J -c 'wsdD' -euk -t ./template.sbt -gap-type scaffold -l paired-ends -i ./Fvar-MT-1.2.fa -f ./Fvar-MT-1.2.1.tbl -o ./Fvar-MT-1.2.1.sqn -n "Frieseomelitta varia" -taxid 561572 -V b -T -j "[organism=Frieseomelitta varia] [topology=circular] [location=mitochondrion] [moltype=DNA] [mgcode=5] [gcode=1] [sex=male] [country=Brazil] [dev-stage=pharate-adult] [tech=wgs] [completedness=complete] [common=marmelada] [strand=double]"