SegFinder detection pipeline. a, Schematic overview of the discovery of RdRP for RNA viruses. The inputs are fastq files from multiple meta-transcriptome libraries. rRNA, ribosomal RNA; NR, Non-Redundant Protein Sequence Database; NT, Nucleotide Sequence Database. b, The processing pipeline for the correlation calculation. L, library; C, contig. c, Schematic illustration of the filtering of segmented RNA virus clusters. Cor, correlation; TPM, transcripts per million.
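As a rough illustration of the correlation step in panel b, the sketch below computes a correlation coefficient between the abundance profiles of two contigs across libraries. The use of Pearson correlation, the function name `contig_cor`, and the example values are assumptions made for illustration; the caption only states that a correlation is calculated, and the `--cor` option later sets the threshold.

```shell
#!/usr/bin/env bash
# Hypothetical helper: Pearson correlation of two abundance vectors.
# Arguments: two space-separated lists of TPM values, one value per library.
contig_cor() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, " "); m = split(b, y, " ")
        # Both contigs need a value for every library, and at least two libraries.
        if (n != m || n < 2) { print "NA"; exit 1 }
        for (i = 1; i <= n; i++) {
            sx += x[i]; sy += y[i]
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i]
        }
        num = n * sxy - sx * sy
        den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
        # A constant profile has zero variance, so the coefficient is undefined.
        if (den == 0) { print "NA"; exit 1 }
        printf "%.3f\n", num / den
    }'
}
```

For example, `contig_cor "1 2 3 4" "2 4 6 8"` prints `1.000`, which would pass a threshold of 0.8, while two contigs with opposite abundance trends give a negative value.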
conda env create -f environment.yml
conda activate SegFinder
mkdir Seg_DB
cd Seg_DB
mkdir accession2taxid
cd accession2taxid
wget -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
wget -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz.md5
#Check file integrity
md5sum -c prot.accession2taxid.gz.md5
#Unzip the file (the output stays in Seg_DB/accession2taxid), then return to the starting directory
gunzip -c prot.accession2taxid.gz > prot.accession2taxid
cd ../..
cd Seg_DB
mkdir nr
cd nr
wget -t 0 -c https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
cd ../..
cd Seg_DB
mkdir nt
cd nt
wget -t 0 -c https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
cd ../..
#download nucl_gb.accession2taxid
cd Seg_DB/accession2taxid
wget -t 0 -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip -c nucl_gb.accession2taxid.gz > nucl_gb.accession2taxid
cd ../..
#download taxdump
mkdir Seg_DB/taxdump
cd Seg_DB/taxdump
wget -t 0 -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz
cd ../..
#download ref_viruses_rep_genomes
mkdir Seg_DB/ref_viruses_rep_genomes
cd Seg_DB/ref_viruses_rep_genomes
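#Save the download under the archive name expected by tar
wget -t 0 -c -O ref_viruses_rep_genomes.tar.gz https://figshare.com/ndownloader/files/46795402
tar -zxvf ref_viruses_rep_genomes.tar.gz
cd ../..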
#Handling nt database
python3 process_sequences.py --input Seg_DB/nt/ \
--out Seg_DB/nt/ \
--threads 40 \
--nucl_gb_accession2taxid_path Seg_DB/accession2taxid/nucl_gb.accession2taxid \
--taxdump_path Seg_DB/taxdump \
--ref_viruses_path Seg_DB/ref_viruses_rep_genomes
Note: --input is the location of the nt database.
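Before running the pipeline, it can help to confirm that the files and directories created by the commands above are in place. The following is a minimal sketch: the function name `check_seg_db` is hypothetical, and the list of paths is taken from the commands in this section, not from SegFinder itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which expected database paths are missing.
# Usage: check_seg_db Seg_DB
check_seg_db() {
    local root="$1" missing=0 p
    for p in \
        accession2taxid/prot.accession2taxid \
        accession2taxid/nucl_gb.accession2taxid \
        nr \
        nt \
        taxdump \
        ref_viruses_rep_genomes
    do
        if [ ! -e "$root/$p" ]; then
            echo "missing: $root/$p" >&2
            missing=1
        fi
    done
    # Exit status 0 means every path exists, 1 means at least one is missing.
    return "$missing"
}
```

Running `check_seg_db Seg_DB` from the directory that contains `Seg_DB` prints one line per missing path and returns a non-zero status if anything is absent. It only checks that the paths exist, not that the downloads are complete.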
./SegFinder.sh [option] --help
./SegFinder.sh --indata PATH/data \
--taxidDB Seg_DB/accession2taxid/prot.accession2taxid \
--nt_noViruses Seg_DB/nt/nt_noViruses \
--nt Seg_DB/nt \
--thread 20 \
--datatype 2 \
--method salmon \
--preprocess true \
--assemble megahit \
--nr Seg_DB/nr \
--only_rdrp_find 1
Note: file_list.txt contains the file-name prefixes. The input files must be named <prefix>_1.fq.gz and <prefix>_2.fq.gz.
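One way to produce `file_list.txt` is to derive the prefixes from the paired files themselves. The sketch below is only an example: the function name `make_file_list` is hypothetical, and it assumes one prefix per line, which the note above does not state explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the prefix of every complete read pair
# (<prefix>_1.fq.gz plus <prefix>_2.fq.gz) found in a directory.
make_file_list() {
    local dir="$1" f prefix
    for f in "$dir"/*_1.fq.gz; do
        [ -e "$f" ] || continue            # the glob matched nothing
        prefix=$(basename "$f" _1.fq.gz)
        if [ -e "$dir/${prefix}_2.fq.gz" ]; then
            echo "$prefix"
        else
            echo "skipping $prefix: no ${prefix}_2.fq.gz" >&2
        fi
    done
}
# Example: make_file_list PATH/data > file_list.txt
```

Prefixes whose second mate is missing are reported on standard error and left out of the list, so the resulting file only names libraries that satisfy the naming rule.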
./SegFinder.sh --indata PATH/data \
--taxidDB Seg_DB/accession2taxid/prot.accession2taxid \
--nt_noViruses Seg_DB/nt/nt_noViruses \
--nt Seg_DB/nt \
--thread 20 \
--rm_length 600 \
--datatype 2 \
--cor 0.8 \
--library_ID $file \
--method salmon \
--nr Seg_DB/nr
Note: $file is the input file prefix. It can be either a single prefix or a file (list.txt) that contains the prefixes. The input files must be named <prefix>_1.fq.gz and <prefix>_2.fq.gz.
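Because every prefix must correspond to a `_1.fq.gz`/`_2.fq.gz` pair, a quick check before launching the run can catch naming mistakes. This is a sketch only: `check_pairs` is a hypothetical name, and it assumes that `list.txt` holds one prefix per line and that the reads sit directly in the `--indata` directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that both mates exist for every prefix in a list.
# Usage: check_pairs list.txt PATH/data
check_pairs() {
    local list="$1" dir="$2" prefix mate status=0
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r prefix || [ -n "$prefix" ]; do
        [ -n "$prefix" ] || continue       # ignore blank lines
        for mate in 1 2; do
            if [ ! -e "$dir/${prefix}_${mate}.fq.gz" ]; then
                echo "missing: $dir/${prefix}_${mate}.fq.gz" >&2
                status=1
            fi
        done
    done < "$list"
    return "$status"
}
```

A call such as `check_pairs list.txt PATH/data && echo "all pairs present"` reports every missing file and only prints the confirmation when the whole list is complete.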