Releases: googlegenomics/gcp-variant-transforms
Release v0.11.0
Highlights
This release fixes several high-priority items and updates various dependencies.
New Features / Improvements
- Updated Beam to 2.37.0.
- Added support for a `--pipeline-mode` flag to force the pipeline mode.
- Updated the VEP image to 104 and moved to the Life Sciences API.
- To improve running in restricted environments, we now ship a custom Dataflow runner Docker image built with the dependencies needed to run Variant Transforms already in it, removing the need to reach out to the internet.
- Added support for custom service accounts.
- Added sample-optimized tables, which should noticeably decrease load time.
- Various documentation fixes.
Release v0.10.0
Highlights
With this release we have officially migrated Variant Transforms to Python 3.7; starting from October 7th, 2020, Dataflow will halt its support for Python 2.
As part of the migration, we switched to the fastavro package as the primary Avro schema parsing library. The combination of these two changes has resulted in a 40-70% cost reduction for running Variant Transforms.
If you are using VT directly from GitHub, please refer to our docs on how to set up a Python 3 virtual environment.
New Features / Improvements
- Upgraded Beam to 2.24.0.
- Added functionality to attempt installing pysam multiple times on the workers to avoid failures due to missing packages.
Release v0.9.0
Highlights
In this release, we offer a new schema for output BigQuery tables. The new schema utilizes BigQuery's integer range partitioning, which significantly reduces query costs. We also allow users to store BigQuery tables that are highly optimized for sample lookup queries, such as:
Find all variants of Patient X
Note this release contains backwards incompatible changes. Please see details below.
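As an illustration, a sample-lookup query of the kind the new tables optimize for might be sketched as below. The project, dataset, table, and column names are assumptions for illustration only, not the actual output schema:

```python
# All project, dataset, table, and column names below are assumptions for
# illustration; consult the Variant Transforms docs for the actual schema.
patient_x_sample_id = 1234567890  # hypothetical hashed sample id

query = f"""
SELECT reference_name, start_position, reference_bases
FROM `my-project.my_dataset.patients__sample_lookup`
WHERE sample_id = {patient_x_sample_id}
"""
# Filtering on the integer-range-partitioned column (assumed here to be
# sample_id in the sample-lookup-optimized copy) lets BigQuery prune
# partitions, which is where the query-cost reduction comes from.
print(query)
```

With the `google-cloud-bigquery` client, such a string would be passed to `client.query(...)`.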
New Features / Improvements
- By default, one BigQuery table is created per chromosome; each table is integer range partitioned.
- Output tables have suffixes such as `__chr1`, `__chr2`, etc. The set of output tables can be changed by modifying the sharding config file.
- `call.name` is replaced with `call.sample_id`, where `sample_id` is the hash of the sample name.
  - In cases where multiple VCF files have the same sample `name`, the file path can be included in the hash value to distinguish between samples.
- An extra BQ table with the `__sample_info` suffix is created. This table contains the mapping from `sample_id` to `sample_name` and `vcf_file_path`.
  - We also include an `ingestion_datetime` column in the sample info table to record the ingestion datetime of each VCF file.
- 1-based coordinates are used by default for genomic indexing, to make BigQuery tables more compatible with VCF files.
- If `--append` is set, we ensure all expected output tables already exist before we append to them.
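The sample-name hashing described above can be sketched as follows. This is an illustrative Python sketch, not Variant Transforms' actual hash function; it only demonstrates the with/without-file-path distinction:

```python
import hashlib
from typing import Optional

def sample_id(sample_name: str, vcf_file_path: Optional[str] = None) -> int:
    """Illustrative 64-bit sample id derived from the sample name.

    NOT Variant Transforms' actual hash function; it only sketches the
    WITHOUT_FILE_PATH vs. WITH_FILE_PATH behavior described above.
    """
    # WITH_FILE_PATH mixes the file path into the hash so that samples with
    # the same name in different VCF files get distinct ids.
    key = sample_name if vcf_file_path is None else f"{vcf_file_path}|{sample_name}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Same sample name, different files: equal without the path, distinct with it.
print(sample_id("NA12878"))
print(sample_id("NA12878", "gs://bucket/cohort1.vcf"))
```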
New flags

`vcf_to_bq`:
- `--sample_lookup_optimized_output_table`: stores a second copy of variants in BigQuery tables that are optimized for sample lookup queries. This feature is particularly useful when the input VCF file contains joint genotyped samples.
- `--keep_intermediate_avro_files`: stores the intermediate Avro files in your temp directory on the GCS bucket.
- `--use_1_based_coordinate`: by default the start position will be 1-based and the end position will be inclusive. You can set this flag to False to use 0-based coordinates.
- `--sample_name_encoding`: determines the way `sample_id` is hashed. The default value is `WITHOUT_FILE_PATH`. If set to `WITH_FILE_PATH`, then `sample_id` will be a hash of `[vcf_file_path, sample_name]`.
- `--sharding_config_path`: replaces `--partition_config_path`.

`bq_to_vcf`:
- `--bq_uses_1_based_coordinate`: set to False if `--use_1_based_coordinate` was set to False when generating the BQ tables and, hence, start positions are 0-based.
- `--sample_names`: replaces `--call_names`.
- `--preserve_sample_order`: replaces `--preserve_call_names_order`.

`docker run` flags:
- All of the following flags are required:
  - `--project`
  - `--regions`
  - `--temp_location`
- If you need to run Variant Transforms in a subnetwork using private IP addresses:
  - `--subnetwork ${CUSTOM_SUBNETWORK}`
  - `--use_public_ips false`
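Putting the required flags together, an invocation might look like the sketch below. The project, region, bucket, and subnetwork values are placeholders, and the final docker command is echoed rather than executed; see the documentation for the exact entry-point syntax:

```shell
# Placeholder values -- substitute your own project, region, and GCS bucket.
PROJECT=my-project
REGION=us-central1
TEMP_LOCATION=gs://my-bucket/temp

# The three required flags, plus the optional private-IP networking flags.
COMMAND="vcf_to_bq \
  --project ${PROJECT} \
  --regions ${REGION} \
  --temp_location ${TEMP_LOCATION} \
  --subnetwork my-subnetwork \
  --use_public_ips false"

# Echoed instead of executed, so this sketch has no side effects.
echo docker run gcr.io/cloud-lifesciences/gcp-variant-transforms "${COMMAND}"
```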
Deprecated flags
The following flags are deprecated and will be removed in the next release:
- `--optimize_for_large_inputs`: sharding is now done by default for all inputs.
- `--num_bigquery_write_shards`: we now use the Avro sink in the Dataflow pipeline.
- `--output_avro_path`: replaced with `--keep_intermediate_avro_files`.
- `--reference_names`: you can achieve the same goal by modifying the default sharding config file.
Underlying improvements
- Switched our default VCF parser from PyVCF to pysam.
- Updated Beam to 2.22.
- Changed the launcher VM to g1-small to reduce the overall cost of running VT.
Breaking Changes
- By default, 1-based coordinates are used for genomic indexing. We use the same default value for `bq_to_vcf`, so `VCF -> BigQuery -> VCF` with default flags should work.
- `call.name` is replaced with `call.sample_id`.
- `--partition_config_path` is replaced with `--sharding_config_path`.
- The sharding config YAML format has changed.
- Output table names cannot contain `__`, because we reserve this string for separating the table base name from the suffixes that we read from the sharding config file.

The following flags have been removed in this release:
- `--vcf_parser`
- `--partition_config_path`: replaced with `--sharding_config_path`
- `--call_names`: replaced with `--sample_names`
- `--preserve_call_names_order`: replaced with `--preserve_sample_order`
Release v0.8.1
Release v0.8.0
Main changes since last release:
- The Docker images have been moved to a new location: gcr.io/cloud-lifesciences/gcp-variant-transforms. Subsequent releases will also be published at this URI. Please update your command by replacing gcr.io/gcp-variant-transforms/gcp-variant-transforms with gcr.io/cloud-lifesciences/gcp-variant-transforms.
- Added support for loading BGZF files (alpha release). The index files (.gz.tbi or .bgz.tbi) must be in the same directory as the compressed VCF files.
Release v0.7.1
- Modify the docker container entry point script to execute variant transform commands directly if they are passed as arguments (for backwards compatibility).
Release v0.7.0
- Added functionality to ingest a list of input patterns from a file, using the `--input_file` flag.
- Added support for using the `docker run` command to run the pipeline. Running with the Google Genomics Pipelines API has been deprecated. Check out the documentation for more details.
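For example, the input-patterns file is just a plain-text list, one pattern per line (paths below are hypothetical, and the final command is echoed rather than executed):

```shell
# Hypothetical contents of an input list file: one input pattern per line.
cat > /tmp/input_patterns.txt <<'EOF'
gs://my-bucket/cohort1/*.vcf
gs://my-bucket/cohort2/*.vcf
EOF

# The pipeline would then read the patterns from the file instead of
# taking a single --input_pattern (echoed here, not executed).
echo vcf_to_bq --input_file /tmp/input_patterns.txt
```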
Release v0.6.1
This is a patch release that makes the following improvements:
- Fixed some issues in BQ to VCF pipeline (Issue #446, Issue #447).
- Improved the robustness of annotation pipeline.
Release v0.6.0
This release mainly improves the VEP annotation performance. Check out the documentation for more details.
Release v0.5.1
This release mostly contains usability improvements (better argument validation and an option to allow Number=A fields with too few values) and internal cleanups. It also raises the BigQuery row limit from 10 MB to 100 MB.
The main new feature is the option of writing to Avro files using the `--output_avro_path` flag.