fake-vcf generates fake VCF files for testing purposes.
The project is still a work in progress and was originally created for the DeepRVAT project.
git clone https://github.com/endast/fake-vcf.git
cd fake-vcf
make poetry-download
make install
If you want compressed output to be written with bgzip instead of gzip, use make install-all. This installs the optional dependencies.
git clone https://github.com/endast/fake-vcf.git
cd fake-vcf
make poetry-download
make install-all
By default fake-vcf writes to stdout:
poetry run fake-vcf generate -s 2 -r 2
##fileformat=VCFv4.2
##source=VCFake 0.2.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##contig=<ID=chr1>
##reference=ftp://ftp.example.com/sample.fa
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S0000001 S0000002
chr1 63 rs143 C A 96 PASS DP=10;AF=0.5;NS=2 GT 0|0 0|0
chr1 71 rs31 A T 37 PASS DP=10;AF=0.5;NS=2 GT 0|0 0|0
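The output above follows the usual VCF layout: `##` meta lines, a single `#CHROM` column header, then one record per variant. As a rough sketch of how a test might inspect such output (this is not part of fake-vcf; the `split_vcf` helper is a hypothetical name, and the sample text assumes tab-separated columns as in real VCF files), the three parts can be separated in Python like this:

```python
def split_vcf(text):
    """Split VCF text into (meta lines, column names, records)."""
    meta, columns, records = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#"):
            # Column header line, e.g. "#CHROM\tPOS\t..."
            columns = line.lstrip("#").split("\t")
        elif line:
            # Data line: pair each value with its column name.
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


# Shortened copy of the output above, with tab-separated columns.
SAMPLE = "\n".join([
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS0000001\tS0000002",
    "chr1\t63\trs143\tC\tA\t96\tPASS\tDP=10;AF=0.5;NS=2\tGT\t0|0\t0|0",
    "chr1\t71\trs31\tA\tT\t37\tPASS\tDP=10;AF=0.5;NS=2\tGT\t0|0\t0|0",
])

meta, columns, records = split_vcf(SAMPLE)
print(len(meta), columns[-2:], records[0]["POS"])
```

With -s 2 -r 2 one would expect two sample columns and two records, which is what the sketch checks for on the shortened sample.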
You can write to a vcf file by piping the output to a file:
poetry run fake-vcf generate -s 2 -r 2 > fake_file.vcf
ls -lah
total 1
-rw-r--r-- 1 magnus staff 682B Jul 28 16:48 fake_file.vcf
Or let the script write to a file directly using -o:
poetry run fake-vcf generate -s 2 -r 2 -o fake_file.vcf
Writing to file fake_file.vcf
(No compression)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 50942.96it/s]
Done, data written to fake_file.vcf
ls -lah
total 1
-rw-r--r-- 1 magnus staff 682B Jul 28 16:48 fake_file.vcf
If you want the file compressed, add .gz to the file name (if you installed using install-all the file will be compressed using bgzip, otherwise using gzip):
poetry run fake-vcf generate -s 2 -r 2 -o fake_file.vcf.gz
Writing to file fake_file.vcf
(No compression)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 50942.96it/s]
Done, data written to fake_file.vcf
ls -lah
total 2
-rw-r--r-- 1 magnus staff 682B Jul 28 16:56 fake_file.vcf
-rw-r--r-- 1 magnus staff 436B Jul 28 16:57 fake_file.vcf.gz
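Because both variants end in .gz, it can be useful to check which compression a file actually got. bgzip writes BGZF, a series of gzip-compatible blocks whose header carries an extra subfield with the identifier `BC`, while plain gzip output normally has no such subfield. The following is a minimal sketch of that check, not something fake-vcf provides; `is_bgzf` is a hypothetical helper name.

```python
import gzip
import struct

# The 28-byte empty BGZF block that bgzip appends as an end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 424302001b00 0300 00000000 00000000"
)


def is_bgzf(header):
    """Return True if the bytes start with a BGZF (bgzip) block header."""
    if len(header) < 12 or header[:2] != b"\x1f\x8b":
        return False  # not gzip at all
    if not header[3] & 0x04:
        return False  # FEXTRA flag not set, so plain gzip
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    # Walk the extra subfields looking for the 'BC' identifier used by BGZF.
    pos = 0
    while pos + 4 <= len(extra):
        subfield_id = extra[pos:pos + 2]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if subfield_id == b"BC" and slen == 2:
            return True
        pos += 4 + slen
    return False


plain = gzip.compress(b"##fileformat=VCFv4.2\n")
print(is_bgzf(plain), is_bgzf(BGZF_EOF))
```

To apply it to a generated file, read the first few bytes in binary mode, for example `is_bgzf(open("fake_file.vcf.gz", "rb").read(64))`.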
You can also pipe the output to bgzip (or gzip) to compress it.
poetry run fake-vcf generate -s 2 -r 2 | bgzip > fake_file.vcf.gz
ls -lah
total 1
-rw-r--r-- 1 magnus staff 716 Jan 30 13:38 bgzip.chr.vcf.gz
To see all options use --help:
poetry run fake-vcf generate --help
Usage: fake-vcf generate [OPTIONS]
Generate fake VCF data
Args: fake_vcf_path (Path): Path to fake VCF file or None to write to standard output. num_rows (int): Number of rows. num_samples (int): Number of samples. chromosome (str): Chromosome identifier. seed (int): Random seed for reproducibility. sample_prefix (str): Prefix for sample
names. phased (bool): Simulate phased genotypes. large_format (bool): Write large format VCF. print_version (bool): Flag to print the version of the fake-vcf package. reference_dir (Path): Path to directory containing imported reference_data.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --fake_vcf_path -o PATH Path to fake vcf file. If the path ends with .gz the file will be gzipped. [default: None] │
│ --num_rows -r INTEGER Nr rows to generate (variants) [default: 10] │
│ --num_samples -s INTEGER Nr of num_samples to generate. [default: 10] │
│ --chromosome -c TEXT chromosome default chr1 [default: chr1] │
│ --seed INTEGER Random seed to use, default none. [default: None] │
│ --sample_prefix -p TEXT Sample prefix ex: SAM => SAM0000001 SAM0000002 [default: S] │
│ --phased --no-phased Simulate phased [default: phased] │
│ --large-format --no-large-format Write large format vcf [default: large-format] │
│ --version -v Prints the version of the fake-vcf package. │
│ --reference-dir-path -f PATH Path to imported refernce directory. [default: None] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
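When fake-vcf is driven from a test suite, it can help to assemble the command line from the options listed above. The sketch below is an illustration only: `build_generate_command` is a hypothetical helper, it assumes the tool is run through `poetry run` as in the examples here, and it uses only flags shown in the help text (-r, -s, -c, --seed, -o).

```python
import shlex


def build_generate_command(num_rows=10, num_samples=10, chromosome="chr1",
                           seed=None, output=None):
    """Build the argument list for `poetry run fake-vcf generate`."""
    cmd = ["poetry", "run", "fake-vcf", "generate",
           "-r", str(num_rows), "-s", str(num_samples), "-c", chromosome]
    if seed is not None:
        # A fixed seed is documented as making the output reproducible.
        cmd += ["--seed", str(seed)]
    if output is not None:
        # A path ending in .gz asks fake-vcf for compressed output.
        cmd += ["-o", str(output)]
    return cmd


cmd = build_generate_command(num_rows=2, num_samples=2, seed=42,
                             output="fake_file.vcf.gz")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting issues with unusual file names.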
If you want to use a fasta file as a reference when generating fake vcf files, use the fake-vcf import-reference command to prepare the data for use with fake-vcf generate.
poetry run fake-vcf import-reference --help
Usage: fake-vcf import-reference [OPTIONS] REFERENCE_FILE_PATH
REFERENCE_STORAGE_PATH
Import reference fasta file and extract specified chromosomes if provided.
Parameters: reference_file_path (Path): Path to reference fasta file. reference_storage_path (Path): Where to store the references. included_chromosomes (Optional[List[str]], optional): List of chromosomes to extract from reference. If not specified, all will be imported.
Example: To import a reference file and extract specific chromosomes: ``` vcf_reference_import("path/to/reference.fasta", "output/directory", included_chromosomes=["chr1", "chr2"]) ```
To import a reference file without extracting specific chromosomes: ``` vcf_reference_import("path/to/reference.fasta", "output/directory") ```
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * reference_file_path PATH Path to reference fasta file. [default: None] [required] │
│ * reference_storage_path PATH Where to store the references. [default: None] [required] │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --included_chromosomes -c TEXT List of chromosomes to extract from reference, if not specified all will be imported [default: None] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
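To make the idea of extracting only some chromosomes concrete, here is a small sketch of filtering FASTA records by name. It is not how fake-vcf imports or stores references (that format is not described here); `read_fasta` is a hypothetical helper, and it assumes the chromosome name is the first word on each `>` header line.

```python
def read_fasta(lines, included_chromosomes=None):
    """Return {name: sequence} for FASTA records, optionally filtered by name."""
    wanted = set(included_chromosomes) if included_chromosomes else None
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'.
            name = (line[1:].split() or [""])[0]
            if wanted is not None and name not in wanted:
                name = None  # skip the sequence lines of this record
            else:
                sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {n: "".join(parts) for n, parts in sequences.items()}


FASTA = """>chr1 example
ACGT
ACGT
>chr2
GGCC
>chrM
TTAA
"""

print(read_fasta(FASTA.splitlines(), included_chromosomes=["chr1", "chr2"]))
```

Leaving `included_chromosomes` unset keeps every record, mirroring the documented behaviour that all chromosomes are imported when none are specified.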
This project is licensed under the terms of the MIT license.
See LICENSE for more details.
@misc{fake-vcf,
author = {Magnus Wahlberg},
title = { A fake vcf file generator },
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/endast/fake-vcf}}
}