Standalone ancestry module #246

Closed
iranmdl opened this issue Feb 18, 2024 · 6 comments · Fixed by #244
Labels
user-query User queries & requests

Comments

@iranmdl

iranmdl commented Feb 18, 2024

Description of feature

Is it possible to run ancestry adjustment on a set of previously calculated scores without recalculating the scores?

The idea would be to be able to run --run_ancestry in standalone mode, giving as input the raw PGS values, reference panel, and target genotyped data.

@iranmdl iranmdl added the enhancement New feature or request label Feb 18, 2024
@nebfield
Member

This is a nice idea, but one of the important stages of the ancestry adjustment is to calculate PGS for the reference panel and target cohort using the same set of variants (the intersection), so my feeling is that recalculating PGS is almost always worth doing. What do you think @smlmbrt 🧙?

@iranmdl
Author

iranmdl commented Feb 19, 2024

Why do you need to calculate PGS for the reference panel? I understand why you need the intersection of reference/target variants for the PCA, but for PGS my understanding is that one could use all the target variants that intersect with the specific PGS variants, regardless of the reference (e.g. 1000G) panel?

@smlmbrt
Member

smlmbrt commented Feb 19, 2024

Hi @iranmdl, what you're describing is what happens in the normal mode (without --run_ancestry): it uses all the variants of the scoring file that intersect your input samples' genotypes.

To get the ancestry-adjusted Z-scores, or to compare PGS from your target to a reference panel, the PGS (the weighted sum of variants*weights) needs to be calculated on an identical set of variants. If the target PGS includes 20 high-effect, high-frequency variants that are not included in the reference panel PGS, it would bias the comparison (and the regression fit PGS ~ PCs). The intersections are cached (work/intersected), so it shouldn't recalculate every time you change scores, but some improvements are required (#239).
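
To illustrate why the variant sets need to match, here is a minimal sketch of the weighted sum computed on different variant sets. The variant IDs, weights and dosages are made up for illustration, and the `pgs` helper is hypothetical; this is not the pipeline's code.

```python
# Hypothetical effect weights from a scoring file (variant ID -> weight).
weights = {"rs1": 0.5, "rs2": -0.2, "rs3": 0.8}

# Hypothetical effect-allele dosages (0, 1 or 2) for one sample in each dataset.
target_dosages = {"rs1": 2, "rs2": 1, "rs3": 2}
reference_dosages = {"rs1": 2, "rs2": 1}  # rs3 is absent from the reference panel


def pgs(dosages, weights, variants):
    """Weighted sum of dosage * weight over the given variants."""
    return sum(dosages[v] * weights[v] for v in variants)


# Variants present in the scoring file AND in both datasets.
shared = sorted(set(weights) & set(target_dosages) & set(reference_dosages))

# Target score on every scoring-file variant it has (normal mode behaviour).
target_all = pgs(target_dosages, weights, sorted(set(weights) & set(target_dosages)))

# Scores restricted to the shared variants, which makes them comparable.
target_shared = pgs(target_dosages, weights, shared)
reference_shared = pgs(reference_dosages, weights, shared)
```

The two samples have identical genotypes at rs1 and rs2, so their scores on the shared set are equal. The all-variant target score is higher only because rs3 contributes to the target and not to the reference, which is the kind of artificial offset described above.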

If you want to just run the ancestry adjustment on your own data we have that implemented with the ancestry_analysis script in https://github.com/PGScatalog/pgscatalog_utils.

@smlmbrt smlmbrt added user-query User queries & requests and removed enhancement New feature or request labels Feb 19, 2024
@iranmdl
Author

iranmdl commented Feb 22, 2024

Thank you for your response, @smlmbrt. To confirm my understanding of your explanation: adjusting a study individual's PGS for ancestry with the --run_ancestry option involves the intersection of three variant sources: (1) the individual's genotyped variants; (2) the specific variants included in the PGS; and (3) variants from the reference panel. If I have understood correctly, a variant must be present in all three sources to be retained for both the initial and the adjusted PGS calculation? In other words, if a variant is found in (1) and (2) but not in (3), it is excluded.

Wouldn't it be more efficient to consider the intersection of all genotyped variants of the individual with those in the reference panel, irrespective of whether they are involved in the PGS of interest? This way, principal components (PCs) can be calculated just once for all variants, and then used to adjust whichever PGS is specified by the user. This approach seems like it could save time by eliminating the need to calculate PCs for each individual PGS, as these components would be based on the complete set of overlapped variants from the individual and the reference panel.

@smlmbrt
Member

smlmbrt commented Feb 22, 2024

> If I have understood correctly, a variant must be present in all three sources to be retained for both the initial and the adjusted PGS calculation? In other words, if a variant is found in (1) and (2) but not in (3), it is excluded.

Correct.
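
As a rough sketch of that rule, using made-up variant IDs for the three sources numbered (1) to (3) in the question (the set names are hypothetical, not pipeline identifiers):

```python
# Hypothetical variant IDs for each of the three sources.
target_variants = {"rs1", "rs2", "rs3", "rs4"}   # (1) genotyped in the target
score_variants = {"rs2", "rs3", "rs5"}           # (2) listed in the scoring file
reference_variants = {"rs1", "rs3", "rs5"}       # (3) present in the reference panel

# A variant is retained for scoring only if all three sources contain it.
retained = target_variants & score_variants & reference_variants

# Variants found in (1) and (2) but missing from (3) are excluded.
dropped = (target_variants & score_variants) - reference_variants
```

Here only rs3 is retained; rs2 is in the target and the scoring file but not in the reference panel, so it is dropped.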

> Wouldn't it be more efficient to consider the intersection of all genotyped variants of the individual with those in the reference panel, irrespective of whether they are involved in the PGS of interest? [...] This approach seems like it could save time by eliminating the need to calculate PCs for each individual PGS, as these components would be based on the complete set of overlapped variants from the individual and the reference panel.

Yes, this is what the pipeline actually does: the PCA is calculated from an LD-thinned subset of variants that intersect between your target genotyped variants (1) and the reference panel (3). The caching that is currently implemented should stop this from being redone on every run, but it's slightly unreliable, so we are going to improve it (#239).

@smlmbrt smlmbrt linked a pull request Feb 26, 2024 that will close this issue
7 tasks
@smlmbrt
Member

smlmbrt commented Feb 27, 2024

The caching has been fixed by @nebfield for the next release.
