Wrong classification sublineages? #11

LauraVP1994 · 2023-06-16T07:39:24Z

Dear,

We are using your tool on spiked samples and come across some discrepancies between what we expect and what we get. For example we used a bought RNA virus at 100% that should be B.1.1.7 (this result was also confirmed on GISAID by the company and we ourselves confirmed this also with the latest version of Pangolin. However with VLQ (with database constructed with recent GISAID data) we get 38.1% B.1.1.7, 20.89% Q.1 and 31.37% Q.5 as output. This is somewhat surprising to us, because this sample should be more or less pure. Moreover based on the GISAID database with the Pangolin results we defined that the difference between B.1.1.7 and Q.1 is one mutation, namely C15400T, and when looking both in our consensus results and results at low-frequency we did not find this mutation and it was 100% C at this position.

Therefore, we were wondering whether you have an explanation for this difference?

Thank you!

jbaaijens · 2023-09-21T12:05:22Z

Hi Laura,

This most likely has to do with your reference set. How did you build it?

Reference set construction has a major impact on predictions with VLQ, since we are directly comparing sequencing reads to these reference sequences. We recommend building a reference set that represents the region and time of sampling (see the README). Alternatively, we find that using the nextregions sequence selections (available for download on GISAID) also gives very good predictions.

I'd be interested to hear which reference set you used and if the issues were resolved using e.g. the nextregions selection.

Jasmijn

LauraVP1994 · 2023-09-22T07:18:13Z

For the reference set, I used sequences and metadata from GISAID that I downloaded on 8/6/23. Subsequently, I used the steps you propose:

python pipeline/preprocess_references.py -m <GISAID_metadata.tsv> -f <GISAID_sequences.fasta> -k 1000 --seed 0 --country USA -o reference_set pipeline/call_variants.sh reference_set <full_path_to_main_ref_fasta> python pipeline/select_samples.py -m <GISAID_metadata.tsv> -f <GISAID_sequences.fasta> -o reference_set --vcf reference_set/*_merged.vcf.gz --freq reference_set/*_merged.frq

PS: the first command is wrong in your documentation it should be "-o" instead of "--o". Also for nextregion the column names are wrong to process the script of preprocess_references.py. Or is this script not necessary anymore?

I have not yet tried the nextregions sequence selections. These are the same principle for the building of the reference set? Also, you can use bootstrapping, will this have an influence on the results or will it only give how sure you are of the result? And how much bootstrapping is recommended?

Thanks!

jbaaijens · 2023-09-27T11:17:08Z

Sorry, I'll fix the typo in the instructions in the README! For the nextregions sequences we do apply preprocessing, so yes, the headers would need to be adjusted. We should have the right script for this, let me find out and get back to you.

Bootstrapping does not impact the results, it only gives an impression of uncertainty in the results. In our analysis we concluded that it's not very informative, because in case of wastewater samples the uncertainty is primarily due to stochasticity in sampling, which is not reflected in the bootstrap analysis.

LauraVP1994 · 2023-09-27T15:51:55Z

Would be great if you have adapted python scripts for the construction of the database based on next_regions. We can then have a look if this improves the results. I tried adapting the csv file with the info that is available in next_regions but I get stuck at the select_samples.py script...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong classification sublineages? #11

Wrong classification sublineages? #11

LauraVP1994 commented Jun 16, 2023

jbaaijens commented Sep 21, 2023

LauraVP1994 commented Sep 22, 2023 •

edited

Loading

jbaaijens commented Sep 27, 2023

LauraVP1994 commented Sep 27, 2023

Wrong classification sublineages? #11

Wrong classification sublineages? #11

Comments

LauraVP1994 commented Jun 16, 2023

jbaaijens commented Sep 21, 2023

LauraVP1994 commented Sep 22, 2023 • edited Loading

jbaaijens commented Sep 27, 2023

LauraVP1994 commented Sep 27, 2023

LauraVP1994 commented Sep 22, 2023 •

edited

Loading