Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outcome of O2/O9 comparison can be random #46

Open
VanOverbeeke opened this issue Apr 5, 2022 · 5 comments
Open

Outcome of O2/O9 comparison can be random #46

VanOverbeeke opened this issue Apr 5, 2022 · 5 comments

Comments

@VanOverbeeke
Copy link

In some of our WGS datasets, SeqSero2 (v1.1.1 using the default microassembly approach) returns O2 or O9 seemingly at random:

Sample Predicted antigenic profile Predicted serotype
replicate_a_1.trimmed.fastq.gz 2:l,z28:1,5 I 2:l,z28:1,5
replicate_b_1.trimmed.fastq.gz 9:l,z28:1,5 Javiana
replicate_c_1.trimmed.fastq.gz 2:l,z28:1,5 I 2:l,z28:1,5
replicate_d_1.trimmed.fastq.gz 9:l,z28:1,5 Javiana
replicate_e_1.trimmed.fastq.gz 9:l,z28:1,5 Javiana
replicate_f_1.trimmed.fastq.gz 2:l,z28:1,5 I 2:l,z28:1,5
replicate_g_1.trimmed.fastq.gz 2:l,z28:1,5 I 2:l,z28:1,5
replicate_h_1.trimmed.fastq.gz 9:l,z28:1,5 Javiana

We have traced the issue back to this decision block: https://github.com/denglab/SeqSero2/blob/master/bin/SeqSero2_package.py#L740.

The contigs stored in the special_genes dictionary are 3 relatively short contigs for O2, and 3 for O9. The contigs and the scores are identical in each run. However, when the scores for O2 and O9 hits are being compared on line 745 (see https://github.com/denglab/SeqSero2/blob/master/bin/SeqSero2_package.py#L745), only the last value of O2 is compared to the last value of O9. However, due to the random nature of iterating over dictionaries in Python3, the 'last' value in the special_genes dictionary is not always the highest. special_genes dictionary is random (as is known for some/most versions of Python3).

In the case of multiple short contigs for the O gene, that match both O2 and O9, this means that a higher scoring contig can be passed first in the iteration, followed by a lower scoring contig. The second score then replaces the first, even though the first score was higher. In a different SeqSero2 run, the lower score might be passed first in the iteration, followed by the higher score, leading to a different pair of scores to be compared on line 745 in the script.

I propose the following solution:

  • replace line 741 (if "tyr-O-9" in z:) with: if "tyr-O-9" in z and special_genes[z] > O9:
  • replace line 743 (elif "tyr-O-2" in z:) with: if "tyr-O-2" in z and special_genes[z] > O2:
@LSTUGA
Copy link
Collaborator

LSTUGA commented Jul 2, 2022

Sorry for the late response. Could you please share the accession numbers of your questionable genomes. We will look into this issue.

@VanOverbeeke
Copy link
Author

Hi, no problem. The issue occurs with FASTQ input of two of our own samples, in a custom Docker environment. I can share the anonymized samples and the Docker image over a filesharing platform. Do you have a preference for which platform?

@LSTUGA
Copy link
Collaborator

LSTUGA commented Jul 22, 2022

Hi, a Docker image would be fine. Just let me know how to access your samples. Thanks!

@VanOverbeeke
Copy link
Author

VanOverbeeke commented Jul 28, 2022 via email

@VanOverbeeke
Copy link
Author

One more question, does SeqSero2 really take absolutely raw and completely untrimmed reads as input? We have seen some improvement using completely raw input files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants