Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong classification sublineages? #11

Open
LauraVP1994 opened this issue Jun 16, 2023 · 4 comments
Open

Wrong classification sublineages? #11

LauraVP1994 opened this issue Jun 16, 2023 · 4 comments

Comments

@LauraVP1994
Copy link

Dear,

We are using your tool on spiked samples and come across some discrepancies between what we expect and what we get. For example we used a bought RNA virus at 100% that should be B.1.1.7 (this result was also confirmed on GISAID by the company and we ourselves confirmed this also with the latest version of Pangolin. However with VLQ (with database constructed with recent GISAID data) we get 38.1% B.1.1.7, 20.89% Q.1 and 31.37% Q.5 as output. This is somewhat surprising to us, because this sample should be more or less pure. Moreover based on the GISAID database with the Pangolin results we defined that the difference between B.1.1.7 and Q.1 is one mutation, namely C15400T, and when looking both in our consensus results and results at low-frequency we did not find this mutation and it was 100% C at this position.

Therefore, we were wondering whether you have an explanation for this difference?

Thank you!

@jbaaijens
Copy link
Collaborator

Hi Laura,

This most likely has to do with your reference set. How did you build it?

Reference set construction has a major impact on predictions with VLQ, since we are directly comparing sequencing reads to these reference sequences. We recommend building a reference set that represents the region and time of sampling (see the README). Alternatively, we find that using the nextregions sequence selections (available for download on GISAID) also gives very good predictions.

I'd be interested to hear which reference set you used and if the issues were resolved using e.g. the nextregions selection.

Jasmijn

@LauraVP1994
Copy link
Author

LauraVP1994 commented Sep 22, 2023

For the reference set, I used sequences and metadata from GISAID that I downloaded on 8/6/23. Subsequently, I used the steps you propose:

python pipeline/preprocess_references.py -m <GISAID_metadata.tsv> -f <GISAID_sequences.fasta> -k 1000 --seed 0 --country USA -o reference_set pipeline/call_variants.sh reference_set <full_path_to_main_ref_fasta> python pipeline/select_samples.py -m <GISAID_metadata.tsv> -f <GISAID_sequences.fasta> -o reference_set --vcf reference_set/*_merged.vcf.gz --freq reference_set/*_merged.frq

PS: the first command is wrong in your documentation it should be "-o" instead of "--o". Also for nextregion the column names are wrong to process the script of preprocess_references.py. Or is this script not necessary anymore?

I have not yet tried the nextregions sequence selections. These are the same principle for the building of the reference set? Also, you can use bootstrapping, will this have an influence on the results or will it only give how sure you are of the result? And how much bootstrapping is recommended?

Thanks!

@jbaaijens
Copy link
Collaborator

Sorry, I'll fix the typo in the instructions in the README! For the nextregions sequences we do apply preprocessing, so yes, the headers would need to be adjusted. We should have the right script for this, let me find out and get back to you.

Bootstrapping does not impact the results, it only gives an impression of uncertainty in the results. In our analysis we concluded that it's not very informative, because in case of wastewater samples the uncertainty is primarily due to stochasticity in sampling, which is not reflected in the bootstrap analysis.

@LauraVP1994
Copy link
Author

Would be great if you have adapted python scripts for the construction of the database based on next_regions. We can then have a look if this improves the results. I tried adapting the csv file with the info that is available in next_regions but I get stuck at the select_samples.py script...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants