SNP calling/filtering: how to modify the ratio of reads to call a SNP and percentage of isolates containing a SNP #12

butterbee · 2025-01-13T08:42:12Z

How can I set the parameters so that a SNP is called only when it passes a particular ratio of reads and is present in a certain % of isolates in a given dataset? I'm not sure what the vSNP3 default values for these parameters are. If there are any, how can they be changed when running step1 or step2?

stuber · 2025-01-13T11:25:35Z

When running step 2, there are three threshold parameters you may be interested in changing:

-w QUAL_THRESHOLD, --qual_threshold QUAL_THRESHOLD
Optional: Minimum QUAL threshold for calling a SNP
-x N_THRESHOLD, --n_threshold N_THRESHOLD
Optional: Minimum N threshold. SNPs between this and qual_threshold are reported as N
-y MQ_THRESHOLD, --mq_threshold MQ_THRESHOLD
Optional: At least one position per group must have this minimum MQ threshold to be called.

Default values:
-w [150] --> SNP: QUAL >150
-x [50] --> N: QUAL 50-150
-y [56] --> MQ: >56
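As a rough illustration of how the two QUAL defaults above partition positions, here is a minimal Python sketch. It is not vSNP3 code: the function name `classify_position`, the "not called" label for positions below the N threshold, and the handling of values exactly at a boundary are all assumptions made for the example.

```python
def classify_position(qual, qual_threshold=150, n_threshold=50):
    """Classify a variant by its QUAL score (illustrative, not vSNP3's code).

    Defaults mirror -w (150) and -x (50) from the listing above.
    Boundary handling is an assumption, not confirmed vSNP3 behavior.
    """
    if qual > qual_threshold:
        return "SNP"          # QUAL above the -w threshold
    if qual >= n_threshold:
        return "N"            # QUAL between the -x and -w thresholds
    return "not called"       # hypothetical label for QUAL below -x


if __name__ == "__main__":
    for q in (200, 100, 10):
        print(q, classify_position(q))
```

Raising `qual_threshold` in this sketch moves more positions into the N band, which is the same trade-off to expect when passing a larger `-w` value to step 2.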

butterbee · 2025-01-13T11:54:55Z

Thanks for the clarification. This is useful in setting up the quality threshold in step2.
Re SNP filtering, I'm struggling to understand how we can adjust the parameters to include only SNPs that are present in, for example, 90% of isolates. Can we make such modifications in step1?
Also, can you clarify whether the final SNP alignment produced by step2 is the core SNP alignment (SNPs that are present in all the isolates)?
Thanks!

stuber · 2025-01-13T12:24:51Z

There is no way to select a percentage of isolates. This is not within vSNP's scope. However, after running step 2 and examining the output SNP table, if you see a group of SNPs in the table that are being called for a subgroup of samples, a position in that group of SNPs can be selected to be used as a defining SNP. Defining SNPs are found in the *define_filter.xlsx dependency file. You can locate this file by showing your reference type locations with the command: vsnp3_path_adder.py -s

There is additional information on adding a defining SNP here:
https://github.com/USDA-VS/vSNP/blob/master/docs/detailed_usage.md#adding-new-groups-or-subgroups

The final SNP alignment is not a core SNP alignment. The SNP alignments output for the designated groups only include those SNPs that are parsimony informative. When looking at a group, if the same SNP has occurred in all samples within that group, it will not be shown in the SNP table. vSNP was designed to show differences between datasets. As datasets increase and new outbreaks emerge, new defining SNPs are used to group samples into relatively small subsets so the focus can be on SNP changes specific to an outbreak.
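To make the point about shared SNPs concrete, here is a small Python sketch of the idea described above: a position where every sample in a group carries the same base tells you nothing about differences within that group, so it is dropped. This is only an illustration on hypothetical data. The function name `drop_uniform_positions` and the table layout are assumptions, and vSNP's actual criteria for what appears in a SNP table may be stricter than this.

```python
def drop_uniform_positions(table):
    """Keep only positions where the samples in a group differ.

    `table` maps a position label to a {sample_name: base} dict.
    Hypothetical layout for illustration, not vSNP's internal format.
    """
    return {
        position: calls
        for position, calls in table.items()
        if len(set(calls.values())) > 1
    }


if __name__ == "__main__":
    # Hypothetical group of three samples at three positions.
    group = {
        "chrom:1001": {"s1": "A", "s2": "A", "s3": "A"},  # same in all, dropped
        "chrom:2050": {"s1": "G", "s2": "T", "s3": "T"},  # differs, kept
        "chrom:3300": {"s1": "C", "s2": "C", "s3": "T"},  # differs, kept
    }
    print(sorted(drop_uniform_positions(group)))
```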

butterbee · 2025-01-15T13:09:15Z

Thank you very much for the clarification. It sounds like the SNP table can be used to identify SNPs that define sub-groups or strains in a given dataset when mapped against the same reference genome. On the other hand, would it be possible to use the define filter xlsx to mask SNPs in certain regions, e.g. repeat regions?

I'm interested in building a maximum likelihood phylogeny based on a core SNP alignment and computing the genetic distance between strains as SNP distance. Is there a way to use the VCF files (with zero coverage positions) generated in step1 to produce a core SNP alignment and, subsequently, a phylogenetic tree?
Re VCF files with zero coverage, these files seem to contain a large number of variants. Any recommendations on filtering out only the high quality SNPs from those VCF files?
Thanks!

stuber · 2025-01-15T13:34:28Z

Yes, you can use the define_filter.xlsx to mask SNPs in specific regions, including repeat regions. The xlsx file allows you to specify positions to exclude from SNP calling, effectively masking those areas.

There is no direct way to build a core alignment. vSNP will remove uninformative SNPs (SNPs occurring in all samples) from the comparisons. However, using the VCF files from step 1 you may be able to use other tools such as `snippy-core`.

Something like the following can filter high quality SNPs:

`bcftools view -i 'QUAL>1000' -v snps input.vcf.gz -O z -o filtered_highqual_snps.vcf.gz`
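If bcftools is not available, the same kind of filter can be approximated in plain Python. The sketch below works on uncompressed VCF text lines, keeps header lines, and keeps records whose QUAL exceeds a cutoff and whose REF and ALT alleles are all single bases. The function names are made up for this example, and the SNP test is a simplification that will not match `bcftools view -v snps` in every case (for instance at multiallelic sites that mix SNPs and indels).

```python
def is_simple_snp(ref, alt):
    """True when REF and every ALT allele is a single A/C/G/T base."""
    alleles = alt.split(",")
    return len(ref) == 1 and all(len(a) == 1 and a in "ACGT" for a in alleles)


def filter_vcf_lines(lines, min_qual=1000.0):
    """Return header lines plus SNP records with QUAL above min_qual."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)          # keep all header lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue                   # skip malformed records
        ref, alt, qual = fields[3], fields[4], fields[5]
        if qual == ".":
            continue                   # missing QUAL cannot pass the cutoff
        if float(qual) > min_qual and is_simple_snp(ref, alt):
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chrom\t100\t.\tA\tG\t1500\t.\t.\n",   # high quality SNP, kept
        "chrom\t200\t.\tC\tT\t200\t.\t.\n",    # low quality SNP, dropped
        "chrom\t300\t.\tAT\tA\t2000\t.\t.\n",  # deletion, dropped
    ]
    print("".join(filter_vcf_lines(example)), end="")
```

The `min_qual` default of 1000 simply mirrors the cutoff in the bcftools command above; a suitable value depends on your depth of coverage.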
