SNP calling/filtering: how to modify the ratio of reads to call a SNP and percentage of isolates containing a SNP #12
Comments
When running step 2, there are three threshold parameters you may be interested in changing: -w QUAL_THRESHOLD, --qual_threshold QUAL_THRESHOLD Default values: |
Thanks for the clarification. This is useful in setting up the quality threshold in step2. |
There is no way to select a percentage of isolates. This is not within vSNP's scope. However, after running step 2 and examining the output SNP table, if you see a group of SNPs in the table that are being called for a subgroup of samples, a position in that group of SNPs can be selected to be used as a defining SNP. Defining SNPs are found in the *define_filter.xlsx dependency file. You can locate this file by showing your reference type locations with the command:

There is additional information on adding a defining SNP here:

The final SNP alignment is not a core SNP alignment. The SNP alignments output for the designated groups only include those SNPs that are parsimony informative. When looking at a group, if the same SNP has occurred in all samples within that group, it will not be shown in the SNP table. vSNP was designed to show differences between datasets. As datasets increase and new outbreaks emerge, new defining SNPs are used to group samples into relatively small subsets so the focus can be on SNP changes specific to an outbreak. |
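To make the last point concrete, here is a minimal sketch of dropping positions that are identical across every sample in a group. It is not vSNP's implementation: the table layout (one row per sample, one column per position), the isolate names, and the positions are all made up for illustration. It also applies only the simple rule described above (remove invariant positions), not the stricter textbook definition of parsimony informative sites.

```shell
#!/bin/sh
# Hypothetical group table: header row of positions, then one row per sample.
workdir=$(mktemp -d)
{
  printf 'sample\t1001\t2050\t3300\n'
  printf 'isoA\tG\tT\tC\n'
  printf 'isoB\tG\tC\tC\n'
  printf 'isoC\tG\tT\tA\n'
} > "$workdir/group.tsv"

# Keep only positions where at least one sample differs from the first sample.
informative=$(awk -F'\t' '
  NR == 1 { for (i = 2; i <= NF; i++) pos[i] = $i; n = NF; next }
  NR == 2 { for (i = 2; i <= NF; i++) first[i] = $i; next }
  { for (i = 2; i <= NF; i++) if ($i != first[i]) varies[i] = 1 }
  END {
    out = ""
    for (i = 2; i <= n; i++)
      if (varies[i]) out = out (out == "" ? "" : " ") pos[i]
    print out
  }
' "$workdir/group.tsv")

echo "positions kept: $informative"
rm -rf "$workdir"
```

In this toy table, position 1001 carries the same allele in all three isolates, so it is the one that drops out of the group's table.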
Thank you very much for the clarification. It sounds like the SNP table can be used to identify SNPs that define sub-groups or strains in a given dataset when mapped against the same reference genome. On the other hand, would it be possible to use the define_filter.xlsx to mask SNPs in certain regions, e.g. repeat regions? I'm interested in building a maximum likelihood phylogeny based on a core SNP alignment and computing the genetic distance between strains through SNP distance. Is there a way to utilise the VCF files (with zero coverage positions) generated in step 1 to produce a core SNP alignment and, subsequently, a phylogenetic tree? |
Yes, you can use the define_filter.xlsx to mask SNPs in specific regions, including repeat regions. The xlsx file allows you to specify positions to exclude from SNP calling, effectively masking those areas.
There is no direct way to build a core alignment. vSNP removes uninformative SNPs (SNPs occurring in all samples) from the comparisons. However, using the VCF files from step 1, you may be able to use other tools such as `snippy-core`.
Something like the following can filter high quality SNPs.
`bcftools view -i 'QUAL>1000' -v snps input.vcf.gz -O z -o filtered_highqual_snps.vcf.gz`
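If bcftools is not at hand, roughly the same filter can be sketched with awk on an uncompressed VCF. This is only an approximation under stated assumptions: the toy records are invented, the threshold of 1000 simply mirrors the command above, and a "SNP" is taken to be a record with a single-base REF and a single-base ALT, which is simpler than how `bcftools view -v snps` handles multi-allelic sites.

```shell
#!/bin/sh
# Toy VCF with made-up records (illustration only, not vSNP output).
workdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t1500\t.\t.\n'
  printf 'chr1\t200\t.\tC\tT\t300\t.\t.\n'
  printf 'chr1\t300\t.\tG\tGA\t2000\t.\t.\n'
  printf 'chr1\t400\t.\tT\tC\t1200\t.\t.\n'
} > "$workdir/sample.vcf"

min_qual=1000   # assumed threshold, same value as in the bcftools example

# Pass header lines through; keep single-base substitutions with QUAL above the threshold.
awk -F'\t' -v q="$min_qual" '
  /^#/ { print; next }
  length($4) == 1 && length($5) == 1 && $5 != "." && $6 != "." && $6 + 0 > q
' "$workdir/sample.vcf" > "$workdir/filtered.vcf"

kept=$(grep -vc '^#' "$workdir/filtered.vcf")
kept_positions=$(grep -v '^#' "$workdir/filtered.vcf" | cut -f2 | paste -sd, -)
echo "kept $kept SNP records at positions: $kept_positions"
rm -rf "$workdir"
```

Here the record at position 200 is dropped for low QUAL and the one at position 300 is dropped because it is an insertion rather than a SNP. A suitable threshold depends on the data, so 1000 should be treated as a starting point rather than a recommendation.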
From: butterbee ***@***.***>
Date: Wednesday, January 15, 2025 at 6:09 AM
Regarding the VCF files with zero coverage: these files seem to contain a large number of variants. Do you have any recommendations for filtering out only the high-quality SNPs from those VCF files?
Thanks!
How do I set up the parameters to call a SNP only when it passes a particular ratio of reads and is present in a certain percentage of isolates in a given dataset? I'm not sure what the vSNP3 default values for these parameters are. If there are any, how can they be changed when running step 1 or step 2?