- KAGE2 released, which adds support for structural variation genotyping.
- GLIMPSE can now be run directly through KAGE, and this is our recommended way of running KAGE (see the section about running KAGE with GLIMPSE below). Our tests show that this gives much higher accuracy than running KAGE alone, even for structural variation.
KAGE is a tool for efficiently genotyping short SNPs and indels from short genomic reads.
As of version 0.1.11, KAGE also supports GPU acceleration (referred to as GKAGE) and is able to genotype a human sample in minutes with a GPU having as little as 4 GB of RAM. See the guide for running with GPU acceleration further down.
A manuscript describing the method can be found here.
KAGE requires Python 3 and is tested to work with versions 3.8, 3.9, and 3.10. Currently, Python 3.11 and 3.12 do not work because some dependencies do not support them, but this will likely be fixed in the future.
KAGE can be installed using Pip:
pip install kage-genotyper
Test that the installation worked:
kage test
The above will perform genotyping on some dummy data and should finish without any errors.
You will need:
- A reference genome in fasta format
- A set of variants with genotypes of known individuals in VCF format (`.vcf` or `.vcf.gz`)
Variants should be biallelic (you can easily convert them to biallelic with `bcftools norm`). Structural variants are supported, but note that all variants must have actual sequences in the ref and alt fields.
Genotypes should be phased (e.g. `0|0`, `0|1` or `1|1`) and there should ideally be few missing genotypes (e.g. `.|.` or `.`). If there are structural variants present, KAGE will prioritize those, meaning that accuracy on SNPs and indels may be lower (especially for SNPs and indels that are covered by SVs). If your aim is to only genotype SNPs and indels, you should not include SVs in your VCF.
Building an index is somewhat time consuming, but only needs to be done once for each set of variants you want to genotype. Indexing time scales approximately linearly with the number of variants and the size of the reference genome. Creating an index of the Draft Human Pangenome takes approximately a day. It's always a good idea to start out with a smaller set of variants, e.g. a single chromosome, to see if things work as expected. Feel free to ask us if you are having trouble making an index (we are happy to try to help make it for you) or if you are unsure whether KAGE will work on your data.
kage index -r reference.fa -v variants.vcf.gz -o index -k 31
The above command will create an `index.npz` file.
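To try indexing on a smaller set first, the VCF can be restricted to a single chromosome. This is a minimal sketch, not a KAGE feature: `keep_chromosome` and `write_subset` are hypothetical helpers, and the file names in the usage note are placeholders. It only filters the VCF; whether the reference should be restricted as well is not covered here.

```python
import gzip


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed file in text mode."""
    path = str(path)
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def keep_chromosome(lines, chromosome):
    """Yield all header lines plus the records whose CHROM column equals `chromosome`."""
    for line in lines:
        if line.startswith("#") or line.split("\t", 1)[0] == chromosome:
            yield line


def write_subset(input_vcf, output_vcf, chromosome):
    """Write a single-chromosome copy of `input_vcf` to `output_vcf`."""
    # Note: a ".gz" output is written with plain gzip, not bgzip.
    with open_text(input_vcf) as source, open_text(output_vcf, "wt") as target:
        target.writelines(keep_chromosome(source, chromosome))
```

For example, `write_subset("variants.vcf.gz", "variants.chr21.vcf", "chr21")` would produce a smaller file to pass to `kage index -v`. The chromosome name must match the naming used in the VCF (`21` versus `chr21`).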
Genotyping with KAGE is extremely fast once you have an index:
kage genotype -i index -r reads.fq.gz -t 16 --average-coverage 30 -k 31 -o genotypes.vcf
Note:
- `-k` must be set to the same value that was used when creating the index
- `--average-coverage` should be set to the expected average coverage of your input reads (doesn't need to be exact)
- KAGE puts data and arrays in shared memory to speed up computation. It automatically frees this memory when finished, but if KAGE gets accidentally killed or stops before finishing, you might end up with allocated memory not being freed. You can free this memory by calling `kage free_memory`.
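Because `-k` must match between indexing and genotyping, it can help to keep the value in one place when scripting both steps. The sketch below is a hypothetical wrapper, not part of KAGE. It only uses the flags shown in the commands above, and `run_kage` calls `kage free_memory` if a run fails or is interrupted. How `kage free_memory` interacts with other KAGE runs on the same machine is not something this sketch accounts for.

```python
import shlex
import subprocess

KMER_SIZE = 31  # single shared value, passed as -k to both steps


def index_command(reference, variants, index_prefix, k=KMER_SIZE):
    """Build the `kage index` command as an argument list."""
    return ["kage", "index", "-r", str(reference), "-v", str(variants),
            "-o", str(index_prefix), "-k", str(k)]


def genotype_command(index_prefix, reads, output_vcf, average_coverage,
                     threads=16, k=KMER_SIZE):
    """Build the `kage genotype` command as an argument list."""
    return ["kage", "genotype", "-i", str(index_prefix), "-r", str(reads),
            "-t", str(threads), "--average-coverage", str(average_coverage),
            "-k", str(k), "-o", str(output_vcf)]


def run_kage(command):
    """Run a KAGE command and free shared memory if it fails or is interrupted."""
    try:
        subprocess.run(command, check=True)
    except (subprocess.CalledProcessError, KeyboardInterrupt):
        subprocess.run(["kage", "free_memory"], check=False)
        raise


# Print the commands instead of running them, to show what would be executed.
print(shlex.join(index_command("reference.fa", "variants.vcf.gz", "index")))
print(shlex.join(genotype_command("index", "reads.fq.gz", "genotypes.vcf",
                                  average_coverage=30)))
```

A wrapper like this cannot react if the wrapper process itself is killed outright, so running `kage free_memory` manually remains the fallback.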
KAGE uses information from the population to improve accuracy, somewhat similarly to imputation. However, the model used by KAGE is quite simple. It works well for SNPs and indels, but for SVs, we have found that using GLIMPSE for the imputation step works much better. To run KAGE with GLIMPSE instead of the built-in KAGE imputation, simply add `--glimpse variants.vcf.gz` when running `kage genotype`. KAGE will automatically install GLIMPSE by downloading binaries and run GLIMPSE for you. Expect a somewhat longer runtime, but not by much.
Note: GLIMPSE requires that you have BCFTools installed.
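Since GLIMPSE depends on BCFtools, a script can check for it before starting a run. The following is a minimal sketch with a hypothetical helper, `glimpse_arguments`, that returns the extra arguments for `kage genotype` only if a `bcftools` executable is found on the `PATH`. The `which` parameter exists only to make the function easy to test.

```python
import shutil


def glimpse_arguments(panel_vcf, which=shutil.which):
    """Return the `--glimpse` arguments, or raise if bcftools is not on PATH."""
    if which("bcftools") is None:
        raise RuntimeError("bcftools was not found on PATH, but GLIMPSE requires it")
    return ["--glimpse", str(panel_vcf)]
```

The returned list can be appended to the argument list of a `kage genotype` call, for example `["kage", "genotype", ...] + glimpse_arguments("variants.vcf.gz")`.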
As of version 0.1.11, KAGE supports GPU-acceleration for GPUs supporting the CUDA-interface. You will need to have CUDA installed on your system along with CuPy (not automatically installed as part of KAGE). Follow these steps to run KAGE with GPU-support:
- Make sure you have CUDA installed and working.
- Install a CuPy version that matches your CUDA installation. You will need at least version 12.0 of the cupy pip package.
- Install Cucounter (`pip install kage-cucounter`).
- Run kage with `--gpu True` to tell KAGE to use the GPU.
Note: GKAGE has been tested to work with GPUs with 4 GB of RAM. Depending on the size of your index, you might need more GPU memory. Let us know if you run into any problems; there are some tricks that can be used to reduce memory usage.
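Before adding `--gpu True`, it can be useful to confirm that CuPy is importable and can see a CUDA device. This is a minimal sketch, not part of KAGE: `gpu_prerequisites` is a hypothetical helper, and it checks CuPy only (it does not verify the Cucounter installation or the amount of GPU memory).

```python
import importlib.util


def gpu_prerequisites():
    """Report whether CuPy is installed and how many CUDA devices it can see."""
    report = {
        "cupy_installed": importlib.util.find_spec("cupy") is not None,
        "cuda_devices": 0,
    }
    if report["cupy_installed"]:
        try:
            import cupy

            report["cupy_version"] = cupy.__version__
            report["cuda_devices"] = int(cupy.cuda.runtime.getDeviceCount())
        except Exception as error:  # e.g. CUDA driver or runtime problems
            report["error"] = str(error)
    return report


print(gpu_prerequisites())
```

If `cupy_installed` is false or `cuda_devices` is 0, the CUDA and CuPy installation steps above are the first place to look.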
Recent changes:
- KAGE2 released. Structural variation genotyping should now work well, and KAGE can be run with GLIMPSE directly.
- November 20 2023: Indexing process rewritten and experimental support for structural variation.
- January 30 2023: Release of GPU support (version 0.0.30).
- October 7 2022: Minor update. Now using BioNumPy to parse input files and hash kmers.
- June 2022: Release of version for manuscript in Genome Biology
Please post an issue or email [email protected] if you encounter any problems or have any questions.