-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tool assumptions and behaviour #80
Comments
In general, I wouldn't recommend this. I would recommend regenerating the GTCs with the new manifest. You should be able to use illumina's new tool for that, DRAGEN Array: https://support.illumina.com/array/array_software/dragen-array-secondary-analysis/downloads.html. For the assumptions though if you proceed with using the old manifest:
Unfortunately, because the loci names are not actually encoded in the GTC (see here: https://github.com/Illumina/BeadArrayFiles/blob/develop/docs/GTC_File_Format_v5.pdf) and relies on an ordered match/lookup from the manifest, I don't see an easy way to treat the GTC like a dictionary with the loci names as keys. You could use https://github.com/Illumina/BeadArrayFiles to achieve that by parsing the GTCs and creating whatever data structure you want (or your own VCFs) . But you would need the original manifests for the old GTCs regardless. I still think it would be less work to regenerate the GTCs and then use either this or DRAGEN Array's gtc-to-vcf functionality. The IDATs should still work for the newer manifests (they are less sensitive to subsetting like the GTCs). Hope that helps. |
Thanks so much @jzieve for the prompt response. I agree on the issues of creating GTCs with the new manifest. My thought is to create GTCs with orignal manifest and then VCFs with the new manifest which may have fewer snps and updated annotations. I missed the point that GTCs do not have the loci_names encoded so the order is critical. So essentially, if I wish to use the new manifest to convert GTC>VCF, the new manifest should have the same snp set in the same exact order as the orignal manifest. The annotations (e.g. chrom coordinates, ref_strand etc) can be different but the snps can't be filtered out or ordering can't be changed between new and orignal manifest. I want to use DRAGEN Array's gtc-to-vcf but it's not open source and has restrictions which cannot be bypassed. For instance, it requires bpm and csv_manifest for execution. So in my case to use a different bpm and csv manifest version would need to extensively tested. Additionally, for quite a few issues, it throws an error instead of a warning, no option to Thanks a lot for the clarification. Appreciate it very much. |
Hello @jzieve May I please confirm if the impact of the warnings below is restricted to the said record? OR once a record fails anything downstream would be incorrect in vcf due to GTC order being highly important? Examples of the warnings I see in the log:
I attach the full log for your reference. Per above discussion, I tried IDAT>GTC with orignal manifest (m1). Subseqeuntly, for GTC>VCF (using GTCtoVCF) I used a different manifest (m2). m2 has updated GenomeBuild, RefStrand, SourceSeq annotations since these are missing in the m1. I ensured both m1 and m2 contains the identical number of records and in the same order as m1. I prepared m2 by a simple left join to m1 and importing the needed annotations. m2 is almost complete for needed annotations. About ~7% of records in m2 do not have the RefStrand set or the SourceSeq is fake with an understanding these records will fail and be skipped from the output. I ran GTCtoVCF with m2 and it executes fine as expected. The output VCF looks fine to me with those 7% records skipped as I expect. The log (attached) also indicates that these records failed. I am not concerned about these 7% records failed or skipped. However, since order of the gtc records is critical, I wanted to check would failure at any record would result in an incorrect genotype for all subsequent records in the manifest? Or is it a true skip such that both manifest and gtc handler will skip the failed record and move on to the next record such that both chr annotations and genotype still will be in sync and correct in the output vcf. Thank you. |
Reading through your log I see a lot of Regardless, the warnings/failed records are a "true skip" (i.e., it shouldn't impact records downstream). Though again, I would caution using different manifests for GTC creation and VCF creation. If they have the exact same number of loci and same order it should work in theory but I would imagine they don't unless they are a special manifest with only the additional annotations (SourceSeq, RefStrand, etc.). |
w.r.t DRAGEN Array vs. this tool, I understand the desire for open source but just a FYI, this tool is currently "semi-archived" (https://github.com/Illumina/GTCtoVCF?tab=readme-ov-file#semi-archived-state). Meaning, it won't be getting new development aside from critical bug fixes and will likely be fully archived at some point. [email protected] should be able to take feedback as well as respond to issues and we hope to improve DRAGEN Array's gtc-to-vcf to make the VCF creation more seamless. |
Thanks @jzieve. The reference genome is build37. The issue on the That's great that impact of warnings/failed records will be contained to the specific records and genotype calls for any subsequent records would still be accurately recorded in the vcf. In my case, I have quite a bit of old manifests/chips to deal with which do not have the RefStrand and SourceSeq columns in the orignal manifest. Some of them are custom chips too. So this seems the only feasible way forward I can think of is to use the orignal manifest (so the number of records and order is unchanged) and import the missing column from any updated manifest or external which has this information. In essence, I am creating a custom/special manifest just for this purpose keeping all the contraints in mind. My initial estimates are that I might be able to fill in the RefStrand and SourceSeq for most but not all of the records. Hence, confirmation on the impact of these failed records was important. Completely agree though. Would need to be cautious. I will test this strategy more and let you know if I run into additional questions. Thanks a lot for your help. |
Hi @jzieve,
As a fix to #78, I might need to use a different csv beadpool manifest version than the orignal manifest used to generate idat/gtc. Many of new manifests will have fewer loci than the manifest. To ensure the resulting vcf is correct, may I please confirm some assumptions of the tool:
Do you see any simple change to the code that will retrieve the genotypes from gtc based on the order in manifest (e.g. by querying through names) and not assuming the same order?
The text was updated successfully, but these errors were encountered: