PGScatalog · nebfield · Feb 28, 2024 · Feb 26, 2024
diff --git a/docs/getting-started.rst b/docs/getting-started.rst
@@ -194,4 +194,8 @@ requirements to run on these smaller datasets are:
 For information on how to run the pipelines on larger datasets/computers/job-schedulers,
 see :ref:`big job`.
 
+If you are running the pipeline multiple times on the same dataset (e.g. with different sets of
+PGS) you can speed the pipeline up by caching the genotype harmonization and ancestry steps;
+see :ref:`cache`.
+
 If you are using an newer Mac computer with an M-series chip, see :ref:`arm`.
diff --git a/docs/how-to/cache.rst b/docs/how-to/cache.rst
@@ -0,0 +1,21 @@
+.. _cache:
+
+How do I speed up `pgsc_calc` computation times and avoid re-running code?
+==========================================================================
+
+If you intend to run `pgsc_calc` multiple times on the same target samples (e.g.
+on different sets of PGS, or with different variant matching flags) it is worth caching
+the results of pipeline steps that do not change between runs:
+
+- Genotype harmonization (variant relabeling steps)
+- Steps of `--run_ancestry` that match variants between the target and reference panel and
+  generate PCA loadings that can be used to adjust the PGS for ancestry.
+
+To do this you must specify a directory that can store this information across runs using the
+`--genotypes_cache` flag to the nextflow command (also see :ref:`param ref`). Future runs of the
+pipeline that use the same cache directory should then skip these steps and run only the
+steps needed to calculate new PGS. This is slightly different from using the `-resume option in
+nextflow <https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html>`_, which mainly checks the
+`work` directory and is more often used for restarting the pipeline when a specific step has failed
+(e.g. for exceeding memory limits).
+
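A minimal sketch of how two runs might share one cache directory, based on the how-to text above. Only `--genotypes_cache` comes from this change; the `docker` profile, `samplesheet.csv`, the two score accessions, and the cache location are placeholder assumptions. The script only prints the commands rather than launching Nextflow, so it can be checked without a pipeline installation.

```shell
#!/bin/sh
# Sketch only: prints the commands for two runs that reuse one cache directory.
# Everything except --genotypes_cache is an illustrative placeholder.
set -eu

# Hypothetical cache location; any directory that persists between runs should do.
CACHE_DIR="${CACHE_DIR:-$PWD/genotypes_cache}"

# Build the command line for one run. Only the score accession ($1) changes
# between runs; the samplesheet and cache directory stay the same.
pgsc_calc_cmd() {
    printf 'nextflow run pgscatalog/pgsc_calc -profile docker --input samplesheet.csv --pgs_id %s --genotypes_cache %s\n' \
        "$1" "$CACHE_DIR"
}

# First run: expected to fill the cache while harmonizing genotypes.
pgsc_calc_cmd PGS001229

# Later run on the same samples with a different score: same cache directory.
pgsc_calc_cmd PGS000018
```

Under these assumptions, the first command would populate the cache and the second, pointing at the same directory, should skip genotype harmonization as the how-to describes. Piping the printed lines to `sh` would execute them.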
diff --git a/docs/how-to/index.rst b/docs/how-to/index.rst
@@ -14,6 +14,7 @@ Calculating polygenic scores
    calculate_pgscatalog
    multiple
    ancestry
+   cache
 
 Making genomic data and scorefiles compatible
 ---------------------------------------------

diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -55,7 +55,7 @@
                 "genotypes_cache": {
                     "type": "string",
                     "default": "None",
-                    "description": "A path to a directory that should contain relabelled genotypes",
+                    "description": "Path to a directory that can store relabelled genotypes (and the reference panel intersections and PCA with --run_ancestry) to speed up new PGS calculations on previously harmonized samples",
                     "format": "directory-path"
                 },
                 "outdir": {