0 sequences from selection found #8

Open
chirrie opened this issue Nov 26, 2021 · 20 comments

Comments

@chirrie

chirrie commented Nov 26, 2021

Why am I getting the output below when I run the preprocessing script?

1323 sequences selected
searching fasta and writing sequences to output directory...
3679 sequences from input fasta processed
0 sequences from selection found

@jbaaijens
Collaborator

This suggests that your metadata doesn't match the sequences provided. Could also be that the GISAID formatting has changed again. Could you check whether the sequence identifiers in your fasta file are of the form <Virus name>|<Collection date>|<Submission date> as given in the metadata?
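One way to run that check is sketched below. It counts how many FASTA header lines have three `|`-separated fields with dates in the second and third positions. The function names and the date pattern are assumptions made for illustration; they are not taken from the pipeline scripts.

```python
import re

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD, since collection dates may be partial.
# This pattern is an assumption for the sketch, not a GISAID specification.
DATE_RE = re.compile(r"^\d{4}(-\d{2}){0,2}$")


def header_matches_expected(header):
    """True if header looks like <Virus name>|<Collection date>|<Submission date>."""
    fields = header.lstrip(">").strip().split("|")
    return (
        len(fields) == 3
        and DATE_RE.match(fields[1]) is not None
        and DATE_RE.match(fields[2]) is not None
    )


def count_matching_headers(lines):
    """Return (matching, total) over all header lines (those starting with '>')."""
    total = 0
    matching = 0
    for line in lines:
        if line.startswith(">"):
            total += 1
            if header_matches_expected(line):
                matching += 1
    return matching, total
```

Passing an open file handle for the FASTA file to `count_matching_headers` gives both counts; if `matching` is 0, the identifiers probably follow a different layout than the one expected here.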

@chirrie
Author

chirrie commented Nov 27, 2021

Sure. There are quite a few changes that affect the pipeline. Could you please have a look? I am using the latest script in the pipeline folder.

@jbaaijens
Collaborator

Ok I see the problem, the sequence identifiers are again formatted differently. It's an easy fix, I'll try to do it tomorrow.

@chirrie
Author

chirrie commented Nov 29, 2021

Did you get a moment to fix it?
Many thanks

@jbaaijens
Collaborator

Just made some changes, could you pull the latest commit and give it a try?

@jbaaijens
Collaborator

I just downloaded the most recent GISAID data and the formatting hasn't changed. It seems the data you have shown above is actually older, it corresponds to the format I encountered ~9 months ago. So you could try the files in the manuscript folder for processing your data. However, I strongly recommend you to download the latest version of the full GISAID database and work with the scripts in pipeline.

@chirrie
Author

chirrie commented Nov 30, 2021

The N-Content line also affects the select_sample script.

@jbaaijens
Collaborator

jbaaijens commented Nov 30, 2021

Yes, it uses this information. In the version from 9 months ago (see manuscript) we calculated N-content ourselves, but in the meantime it has become part of the GISAID metadata.
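For metadata that lacks this column, N-content can be computed from the sequences. The sketch below assumes N-content means the fraction of bases that are `N`; the exact definition used in the manuscript version is not shown in this thread, and the function name is made up for illustration.

```python
def n_content(sequence):
    """Return the fraction of bases in `sequence` that are N (case-insensitive).

    Assumed definition for this sketch: count of 'N' divided by sequence length.
    """
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)
```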

@chirrie
Author

chirrie commented Nov 30, 2021

Could you please download hcov_africa.fasta and hcov_africa.tsv and try running the scripts on them without changing anything? That is what I am using, and I am getting errors.
I downloaded them from the region-specific Auspice source files.

There is data in the manuscript folder, but what I have is from GISAID, and I think it is the latest since I downloaded it last week. I narrowed it down to specific regions because I wanted only a small dataset to test the pipeline with my data first.

@jbaaijens
Collaborator

Ah, so what you're using is not actually the full GISAID data, not even for Africa. These are Auspice files, which are used for visualisation with Nextstrain (https://docs.nextstrain.org/projects/auspice/en/stable/), and they contain very few sequences. For building a good reference set you need the full GISAID database: GISAID -> EpiCoV -> Downloads -> Download packages -> FASTA (for the sequences) and metadata (for the metadata).

@jbaaijens
Collaborator

If it helps, I can build an Africa-specific reference set and share the sequences / sequence identifiers.

@chirrie
Author

chirrie commented Nov 30, 2021 via email

@chirrie
Author

chirrie commented Nov 30, 2021

> If it helps, I can build an Africa-specific reference set and share the sequences/sequence identifiers.

I will appreciate it if I can get this.

@chirrie
Author

chirrie commented Dec 1, 2021

> I just downloaded the most recent GISAID data and the formatting hasn't changed. It seems the data you have shown above is actually older, it corresponds to the format I encountered ~9 months ago. So you could try the files in the manuscript folder for processing your data. However, I strongly recommend you to download the latest version of the full GISAID database and work with the scripts in pipeline.

I have tried downloading a small number of sequences, around 100 per lineage, but I still get "0 sequences from selection found". Could you please help with this?

@jbaaijens
Collaborator

Could you again post what your sequence identifiers look like in the fasta file?

@jbaaijens
Collaborator

> If it helps, I can build an Africa-specific reference set and share the sequences/sequence identifiers.

> I will appreciate it if I can get this.

I will build it. Can you send me an email at j.a.baaijens[at]tudelft.nl?

@chirrie
Author

chirrie commented Dec 2, 2021

> If it helps, I can build an Africa-specific reference set and share the sequences/sequence identifiers.

> I will appreciate it if I can get this.

> I will build it. Can you send me an email at j.a.baaijens[at]tudelft.nl?

I have sent you an email. Please have a look at it.

@chirrie
Author

chirrie commented Dec 2, 2021

> Could you again post what your sequence identifiers look like in the fasta file?

hCoV-19/Reunion/HCL021109894801/2021|EPI_ISL_2676670|2021-06-08
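For reference, here is a minimal sketch that labels each `|`-separated field of an identifier like the one above. It suggests that the second field is an accession ID rather than a collection date, so this identifier does not follow the `<Virus name>|<Collection date>|<Submission date>` layout asked about earlier in the thread. The patterns and the function name are illustrative assumptions, not part of the pipeline.

```python
import re

# Illustrative patterns (assumptions): GISAID accession IDs of the form
# EPI_ISL_<digits>, and dates given as YYYY, YYYY-MM, or YYYY-MM-DD.
ACCESSION_RE = re.compile(r"^EPI_ISL_\d+$")
DATE_RE = re.compile(r"^\d{4}(-\d{2}){0,2}$")


def classify_fields(identifier):
    """Label each '|'-separated field as 'accession', 'date', or 'name'."""
    labels = []
    for field in identifier.lstrip(">").strip().split("|"):
        if ACCESSION_RE.match(field):
            labels.append("accession")
        elif DATE_RE.match(field):
            labels.append("date")
        else:
            labels.append("name")
    return labels


print(classify_fields("hCoV-19/Reunion/HCL021109894801/2021|EPI_ISL_2676670|2021-06-08"))
# → ['name', 'accession', 'date']
```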

@Dipti-IISERpune

I am facing the exact same issue. I couldn't get the desired details from GISAID, so I prepared the .tsv myself using the details and accession IDs, plus a FASTA file for those sequences. It gave me the error mentioned above. Now I am trying to use the FASTA and .tsv for region Asia and country India from the GISAID download section. Will it help me run the code if I rearrange the .tsv as given in the example?

@jbaaijens
Collaborator

Unfortunately the GISAID metadata headers have changed over time, so yes, it should be resolved by renaming columns in the metadata. You could also try VLQ-nf, a nextflow implementation of our pipeline: https://github.com/rki-mf1/vlq-nf
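Renaming the columns could look roughly like the sketch below, which rewrites only the header row of a tab-separated metadata file and copies all data rows unchanged. The column names in `RENAME` are hypothetical placeholders; replace both sides with the headers in your file and the headers the scripts expect.

```python
import csv
import io

# Hypothetical mapping (assumption): header found in the downloaded file ->
# header the script expects. Check both against your actual files.
RENAME = {
    "strain": "Virus name",
    "date": "Collection date",
    "date_submitted": "Submission date",
}


def rename_columns(src, dst, mapping):
    """Copy tab-separated data from src to dst, renaming header fields via mapping."""
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to write
    writer.writerow([mapping.get(name, name) for name in header])
    writer.writerows(reader)  # data rows are copied unchanged


# In-memory usage example; with real files, pass handles opened with newline="".
source = io.StringIO("strain\tdate\tlineage\nhCoV-19/X/1/2021\t2021-06-08\tB.1\n")
target = io.StringIO()
rename_columns(source, target, RENAME)
print(target.getvalue())
```

Columns not listed in the mapping keep their original names, so the mapping only needs entries for the headers that differ.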
