Standalone ancestry module #246

iranmdl · 2024-02-18T20:46:16Z

Description of feature

Is it possible to run ancestry adjustment on a set of previously calculated scores without recalculating the scores?

The idea would be to be able to run --run_ancestry in standalone mode, giving as input the raw PGS values, reference panel, and target genotyped data.

nebfield · 2024-02-19T09:19:41Z

This is a nice idea, but one of the important stages of the ancestry adjustment is to calculate PGS for the reference panel and target cohort using the same set of variants (the intersection), so my feeling is that recalculating PGS is almost always worth doing. What do you think @smlmbrt 🧙?

iranmdl · 2024-02-19T09:38:54Z

Why do you need to calculate PGS for the reference panel? I understand why you need the intersection of reference/target variants for the PCA, but for PGS my understanding is that one could use all the target variants that intersect with the specific PGS variants, regardless of the reference (e.g. 1000G) panel?

smlmbrt · 2024-02-19T10:09:59Z

Hi @iranmdl, what you're describing is what happens in the normal mode (without --run_ancestry): it uses all the variants of the scoring file that intersect your input samples' genotypes.

To get the ancestry-adjusted Z-scores, or to compare PGS from your target to a reference panel, the PGS (weighted SUM of variants*weights) needs to be calculated on an identical set of variants. If the target PGS includes 20 high-effect, high-frequency variants that are not included in the reference panel PGS, it would bias the comparison (and the regression fit PGS ~ PCs). The intersections are cached (work/intersected) so it shouldn't re-calculate every time you change scores, but some improvements are required (#239).
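To illustrate the point about bias, here is a minimal Python sketch with made-up data. The variant IDs, weights, dosages, and the `pgs` helper are all hypothetical and are not taken from the pipeline; the sketch only shows how a score computed on extra variants stops being comparable with a reference score.

```python
# Toy scoring file: variant -> effect weight (all values are made up).
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.75}

# Effect-allele dosages (0, 1 or 2) for one target and one reference sample.
# The two samples have identical genotypes wherever both are typed,
# but rs3 is absent from the reference panel.
target_dosages = {"rs1": 2, "rs2": 1, "rs3": 2}
reference_dosages = {"rs1": 2, "rs2": 1}


def pgs(dosages, weights, variants):
    """Weighted sum of dosage * weight over the given variants."""
    return sum(dosages[v] * weights[v] for v in variants)


# Variants present in the scoring file, the target and the reference.
shared = set(weights) & set(target_dosages) & set(reference_dosages)

target_all = pgs(target_dosages, weights, set(weights) & set(target_dosages))
target_shared = pgs(target_dosages, weights, shared)
reference_shared = pgs(reference_dosages, weights, shared)

print(target_all, target_shared, reference_shared)  # 2.25 0.75 0.75
```

On the shared variants the two samples get the same score, as their genotypes suggest. Scoring the target on all of its variants shifts it by the contribution of rs3 alone, which a comparison against the reference would misread as a real difference.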

If you want to just run the ancestry adjustment on your own data we have that implemented with the ancestry_analysis script in https://github.com/PGScatalog/pgscatalog_utils.

iranmdl · 2024-02-22T09:28:28Z

Thank you for your response, @smlmbrt. To confirm my understanding of your explanation: adjusting a study individual's PGS for ancestry using the --run_ancestry option involves the intersection of three variant sources, namely (1) the individual's genotyped variants; (2) the specific variants included in the PGS; and (3) variants from the reference panel. If I have understood correctly, a variant must be present in all three sources to be retained for both the initial and the adjusted PGS calculation? In other words, if a variant is found in (1) and (2) but not in (3), it is excluded.

Wouldn't it be more efficient to consider the intersection of all genotyped variants of the individual with those in the reference panel, irrespective of whether they are involved in the PGS of interest? This way, principal components (PCs) can be calculated just once for all variants, and then used to adjust whichever PGS is specified by the user. This approach seems like it could save time by eliminating the need to calculate PCs for each individual PGS, as these components would be based on the complete set of overlapped variants from the individual and the reference panel.

smlmbrt · 2024-02-22T14:15:33Z

> If I have understood correctly, a variant must be present in all three sources to be retained for both the initial and the adjusted PGS calculation? In other words, if a variant is found in (1) and (2) but not in (3), it is excluded.

Correct.
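The retention rule confirmed here can be written as plain set operations. This is only a sketch with hypothetical variant IDs; the variable names are illustrative and do not correspond to anything in the pipeline.

```python
# Hypothetical variant IDs for the three sources discussed above.
target_variants = {"rs1", "rs2", "rs3", "rs4"}     # (1) genotyped in the target
scoring_variants = {"rs2", "rs3", "rs5"}           # (2) in the PGS scoring file
reference_variants = {"rs1", "rs3", "rs5", "rs6"}  # (3) in the reference panel

# Normal mode: scoring-file variants that are found in the target.
normal_mode = target_variants & scoring_variants

# With ancestry adjustment: the variant must also be in the reference panel.
ancestry_mode = normal_mode & reference_variants

# Found in (1) and (2) but not in (3): excluded from the adjusted calculation.
dropped = normal_mode - reference_variants

print(sorted(normal_mode), sorted(ancestry_mode), sorted(dropped))
# ['rs2', 'rs3'] ['rs3'] ['rs2']
```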

> Wouldn't it be more efficient to consider the intersection of all genotyped variants of the individual with those in the reference panel, irrespective of whether they are involved in the PGS of interest? [...] This approach seems like it could save time by eliminating the need to calculate PCs for each individual PGS, as these components would be based on the complete set of overlapped variants from the individual and the reference panel.

Yes, this is what the pipeline actually does: the PCA is calculated from an LD-thinned subset of variants that intersect between your target genotyped variants (1) and the reference panel (3). We are going to improve the caching to make sure it doesn't do this every run (#239); the caching that is currently implemented should already prevent that, but it's slightly unreliable.
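As a rough sketch of why the PCs only need to be computed once: they depend on the variants shared by the target and the reference panel, not on any particular score, so the same PC values can be reused to adjust several PGS. The example below assumes a single PC, made-up numbers, and a simple residual Z-score; the helper names `fit_line` and `adjust` are hypothetical, and this is not the pipeline's implementation.

```python
from statistics import mean, stdev

# Hypothetical PC1 values, computed once per sample and independent of any score.
ref_pc1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
target_pc1 = [1.5]

# Two different scores for the same samples (made-up values).
ref_scores = {
    "PGS_A": [0.9, 2.1, 2.9, 4.2, 4.9],
    "PGS_B": [5.0, 4.1, 2.9, 2.1, 0.9],
}
target_scores = {"PGS_A": [5.0], "PGS_B": [1.0]}


def fit_line(x, y):
    """Ordinary least squares fit of y ~ x; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def adjust(ref_pc, ref_pgs, tgt_pc, tgt_pgs):
    """Fit PGS ~ PC1 in the reference, then scale target residuals by the
    standard deviation of the reference residuals (whose mean is zero)."""
    slope, intercept = fit_line(ref_pc, ref_pgs)
    ref_resid = [y - (intercept + slope * x) for x, y in zip(ref_pc, ref_pgs)]
    sd = stdev(ref_resid)
    return [(y - (intercept + slope * x)) / sd for x, y in zip(tgt_pc, tgt_pgs)]


# The same PC values are reused for every score; only the regression is refit.
adjusted = {
    name: adjust(ref_pc1, ref_scores[name], target_pc1, target_scores[name])
    for name in ref_scores
}
print(adjusted)
```

Only the cheap regression step is repeated per score. The expensive part, the PCA on the shared variants, happens once, which is why caching it matters.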

smlmbrt · 2024-02-27T10:23:07Z

The caching has been fixed by @nebfield in the next release.

iranmdl added the enhancement New feature or request label Feb 18, 2024

smlmbrt added user-query User queries & requests and removed enhancement New feature or request labels Feb 19, 2024

smlmbrt linked a pull request Feb 26, 2024 that will close this issue: v2.0.0-alpha.5 #244 (merged)

nebfield closed this as completed in #244 Mar 19, 2024
