Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

convertf exporting to PACKEDPED #105

Open
zmaroti opened this issue Jun 18, 2024 · 0 comments
Open

convertf exporting to PACKEDPED #105

zmaroti opened this issue Jun 18, 2024 · 0 comments

Comments

@zmaroti
Copy link

zmaroti commented Jun 18, 2024

PLINK specification stets that in the .bim file 5th column stands for the minor (ALT) allele and 6th column stands for major (REF) allele.

https://www.cog-genomics.org/plink/1.9/formats#bim
https://www.cog-genomics.org/plink/2.0/formats#bim

In EIGENSTRAT .snp file the 5th and 6th columns are just the opposite (REF, ALT order).

Convertf generates a .bim where the 5th and 6th column is in the same order as in the EIGENSTRAT .snp file. Accordingly it also flips the GTS, so the resulting data set is coding the same alleles still. When working with only the plink data set it makes no difference, however when exporting PLINK data set to VCF format it results in a VCF file, where REF/ALT alleles are flipped (not conforming with the real REF/ALT alleles of the reference genome the data is) and also the GTs are coded accordingly, so it means the same GTs however comparability with true NGS data is problematic (can be solved in a "stupid" way only).

To show it with an exmple. If we did the following transformations:

#1) use convertf to convert AADR data to (.bed, .bim, .fam) with the approptiate par file:

convertf -p par.PACKEDANCESTRYMAP.PACKEDPED

We would get the .bim file like:

1 rs3094315 0.020130 752566 G A
1 rs12124819 0.020242 776546 A G
1 rs28765502 0.022137 832918 T C
...

Where the 5th and 6th column is the same order as in the EIGNESTRAT .snp file:

       rs3094315     1        0.020130          752566 G A
      rs12124819     1        0.020242          776546 A G
      rs28765502     1        0.022137          832918 T C

...

#2) convert the PLINK data set to VCF file:
plink --bfile AADR_1240K ----recode vcf --keep-allele-order --real-ref-allele --out AADR_1240K

(note the '-real-ref-allele' to not have the REF based on the major allele in the data set but the 6th column of the .bim file, and the '--keep-allele-order' that is best used for safety with plink1.9)

Showing the first two SNPs from the original 1KG phase III data set for samples HG00110-HG00112 (bcftools view -H --types snps -r 1:752566,1:776546 -s HG00110,HG00111,HG00112 /DATABASES/1KGphase3VCF/ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz ):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00110 HG00111 HG00112
1 752566 . G A 100 PASS AC=5;AF=0.718251;AN=6;NS=2504;DP=21293;EAS_AF=0.8839;AMR_AF=0.804;AFR_AF=0.3873;EUR_AF=0.84;SAS_AF=0.8088;AA=.|||;VT=SNP GT 1|1 1|0 1|1
1 776546 . A G 100 PASS AC=4;AF=0.0756789;AN=6;NS=2504;DP=21217;EAS_AF=0.002;AMR_AF=0.1066;AFR_AF=0.0136;EUR_AF=0.2286;SAS_AF=0.0562;AA=.|||;VT=SNP GT 1|1 0|0 1|1

The same SNPs from the AADR_1240K converted VCF for the same samples (bcftools view -H --types snps -t 1:752566,1:776546 -s 13601_HG00110.DG,13602_HG00111.DG,14780_HG00112.DG AADR_1240K.vcf):

1 752566 rs3094315 A G . . AC=1;AN=6 GT 0/0 0/1 0/0
1 776546 rs12124819 G A . . AC=2;AN=6 GT 0/0 1/1 0/0

Compared to the original REF based 1KG data in the AADR->PLINK->VCF file, the minor/major (ALT/REF) alleles are flipped, and the GTs are recoded accordingly. So technically both data describes the same genotypes. However in case convertf would use the PLINK specification, then exporting/importing to/from VCF data would be much easier as the minor/major (ALT/REF) alleles and the corresponding GTs would be same as in the GRCh37 reference.

Solution supposed to be trivial, it would be nice if you could modify convertf to comply with the PLINK specs for compatibility,

ps: I noticed that there is some convertf option that might be used here

flipstrandname: fname
fname should consist of a list of SNP IDs (1 perl line). The alleles for
this SNP will be complemented (moved to other strand). This can be useful as
preparation for mergeit.

From the description I am not sure whether this could fix the above mentioned issue (I wrote a tool earlier to deal with this issue before). However, I am sure it would save some headache, and additional non trivial steps/extra tools for an easier workflow if convertf by default would honour the PLINK specs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant