Potentially a serious issue with Eigenstrat to Plink conversions #76

Open
geneanalyst opened this issue Apr 1, 2021 · 2 comments

geneanalyst commented Apr 1, 2021

This issue affects conversions from Eigenstrat to Plink using convertf and par.PED.PACKEDPED. As you're aware, Eigenstrat .snp follows the VCF convention of listing the REF allele (col 5) in the column before the ALT allele (col 6).

However, Plink .bim does the opposite where ALT is listed in col 5 and REF in col 6.

It seems that convertf is not aware of this Plink .bim convention: after I convert files from Eigenstrat .geno to Plink .bed using par.PED.PACKEDPED, the .bim file generated from the .snp still lists REF in col 5 and ALT in col 6.

So when this is merged with other Plink files, the final merged Plink file is mixed up, with REF in col 5 for some positions and in col 6 for others. This may not cause issues in Plink for minor allele frequency calculations, but once converted back to Eigenstrat .geno it may cause flawed downstream analysis.

My question: would it cause any issues with the Admixtools code if REF is in col 5 for some positions and in col 6 for others (and, correspondingly, ALT is in col 6 for some positions and in col 5 for others)?

I'm quite certain this has gone unnoticed by most researchers converting files back and forth.
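
A minimal sketch of how one might flag such positions before merging, assuming whitespace-separated .bim text with the SNP ID in col 2 and the two alleles in cols 5 and 6. The helper names `parse_bim` and `swapped_snps` and the second dataset are hypothetical, made up for illustration; this is not part of convertf or Plink.

```python
from typing import Dict, List, Tuple


def parse_bim(text: str) -> Dict[str, Tuple[str, str]]:
    """Map SNP ID -> (col 5 allele, col 6 allele) from .bim text."""
    alleles = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        alleles[fields[1]] = (fields[4], fields[5])
    return alleles


def swapped_snps(bim_a: str, bim_b: str) -> List[str]:
    """IDs of shared SNPs whose two alleles appear in opposite column order."""
    a, b = parse_bim(bim_a), parse_bim(bim_b)
    return [
        snp for snp, pair in a.items()
        if snp in b and pair[0] != pair[1] and pair == b[snp][::-1]
    ]


# .bim produced by convertf (rows from this issue)
converted = """1 rs3094315 0.02013 752566 G A
1 rs12124819 0.020242 776546 G A"""

# Hypothetical second dataset with the first SNP in the opposite order
other = """1 rs3094315 0.02013 752566 A G
1 rs12124819 0.020242 776546 G A"""

print(swapped_snps(converted, other))  # ['rs3094315']
```

Any ID reported here is a position where the two files disagree on column order, so it would be worth inspecting before a merge.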

bumblenick commented Apr 1, 2021 via email

In Admixtools you should think of column 5 as the count allele, not necessarily the reference. Similarly, PLINK also uses this as the count allele, so homozygous count is genotype 2. If you then use PLINK merge with other alleles chosen as count, maybe there is trouble. (I am not a PLINK expert.)

A trick: if possible, always have the human reference in a genotype data set. It makes sorting out these troubles much easier.

By the way, PLINK often chooses the count allele as the "majority allele", guaranteeing that different datasets use different conventions.

I do not consider this an eigenstrat bug; it's a bug if convertf eigenstrat -> PLINK -> eigenstrat is not the identity map.

Nick

geneanalyst commented Apr 2, 2021

I'm not referring to the .geno or .bed files where count alleles are stored. I'm referring to .snp and .bim. For example, this is the .snp file I just converted to Plink .bim:

.snp

       rs3094315     1        0.020130          752566 G A
      rs12124819     1        0.020242          776546 A G

.bim

1 rs3094315 0.02013 752566 G A
1 rs12124819 0.020242 776546 G A

The trouble is that Admixtools looks at this .snp and considers G as REF and A as ALT in the 1st row.
However, Plink looks at the .bim and considers A as REF and G as ALT in the 1st row.

So if this converted Plink file is merged with other Plink files that did not originate as Eigenstrat, some positions will have the correct REF/ALT and others will have REF/ALT switched. Does this present any issues when converted back to Eigenstrat? In other words, is it relevant to Admixtools whether some positions have the correct REF/ALT and others have it reversed?
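
For anyone wanting to check their own files, here is a rough sketch that compares the allele columns of a .snp file against the .bim generated from it. It assumes the column layouts shown in the example above (.snp: ID, chromosome, genetic position, physical position, then two alleles; .bim: chromosome, ID, genetic position, physical position, then two alleles). The function names are hypothetical and not part of convertf or Admixtools.

```python
from typing import Dict, Tuple

Alleles = Dict[str, Tuple[str, str]]


def read_alleles(text: str, id_col: int) -> Alleles:
    """Map SNP ID -> (col 5, col 6); id_col is 0 for .snp and 1 for .bim."""
    out = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 6:
            out[fields[id_col]] = (fields[4], fields[5])
    return out


def compare_order(snp_text: str, bim_text: str) -> Dict[str, str]:
    """Label each shared SNP as 'same', 'swapped', or 'different'."""
    snp = read_alleles(snp_text, id_col=0)
    bim = read_alleles(bim_text, id_col=1)
    report = {}
    for snp_id, pair in snp.items():
        if snp_id not in bim:
            continue
        if pair == bim[snp_id]:
            report[snp_id] = "same"
        elif pair == bim[snp_id][::-1]:
            report[snp_id] = "swapped"
        else:
            report[snp_id] = "different"
    return report


# The two rows pasted in this comment
snp_rows = """       rs3094315     1        0.020130          752566 G A
      rs12124819     1        0.020242          776546 A G"""
bim_rows = """1 rs3094315 0.02013 752566 G A
1 rs12124819 0.020242 776546 G A"""

print(compare_order(snp_rows, bim_rows))
# {'rs3094315': 'same', 'rs12124819': 'swapped'}
```

Run on the two rows as pasted above, it labels the first row as the same order in both files and the second as swapped. Running the same comparison on the .snp before conversion and the .snp after converting back would be one way to test whether the round trip is the identity map.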
