
Releases: googlegenomics/gcp-variant-transforms

Release v0.11.0

31 Mar 21:01
47844f7

Highlights

This release fixes several high-priority items and updates various dependencies.

New Features / Improvements

Release v0.10.0

23 Sep 21:17
bd19a80

Highlights

With this release we have officially migrated Variant Transforms to Python 3.7; starting from October 7th, 2020, Dataflow will halt its support for Python 2.

As part of the migration, we switched to the fastavro package as the primary Avro schema parsing library. The combination of these two changes has resulted in a 40-70% cost reduction when running Variant Transforms.

If you are using VT directly from GitHub, please refer to our docs on how to set up a Python 3 virtual environment.

New Features / Improvements

  • Upgraded Beam to 2.24.0.
  • Added functionality to attempt installing pysam multiple times on the workers to avoid failures due to missing packages.

Release v0.9.0

09 Jul 23:17
ee3b767

Highlights

In this release, we offer a new schema for output BigQuery tables. The new schema utilizes BigQuery's integer range partitioning, which significantly reduces query costs. We also allow users to store BigQuery tables that are highly optimized for sample lookup queries, such as:

Find all variants of Patient X

Note that this release contains backwards-incompatible changes. Please see details below.
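As a sketch, a sample-lookup query against an optimized table might look like the following. The project, dataset, and table names here are hypothetical, as is the concrete sample_id value; in practice the sample_id for a given sample name comes from the __sample_info table.

```python
# Hypothetical project/dataset/table names; the real schema of the
# sample-lookup optimized table is described in the Variant Transforms docs.
sample_id = 7509340937152672737  # hypothetical sample_id hash value

query = f"""
SELECT *
FROM `my-project.my_dataset.my_variants_sample_lookup`
WHERE sample_id = {sample_id}
"""
print(query)
```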

New Features / Improvements

  • By default, one BigQuery table per chromosome is created; each table is integer range partitioned.
    • Output tables have suffixes such as __chr1, __chr2, …
    • Output tables can be changed by modifying the sharding config file.
  • call.name is replaced with call.sample_id, where sample_id is the hash of the sample name.
    • In cases where multiple VCF files have the same sample name, the file path can be included in the hash value to distinguish between samples.
  • An extra BQ table with the __sample_info suffix is created. This table contains the mapping from sample_id to sample_name and vcf_file_path.
    • We also include an ingestion_datetime column in the sample info table to record the ingestion datetime of each VCF file.
  • 1-based coordinates are used by default for genomic indexing to make BigQuery tables more compatible with VCF files.
  • If --append is set, we ensure all expected output tables already exist before we append to them.
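The coordinate change above is the usual conversion between 0-based, half-open intervals and 1-based, inclusive intervals. A minimal sketch (function names are ours, not part of the pipeline):

```python
def to_1_based(start_0, end_0):
    """Convert a 0-based, half-open [start, end) interval to 1-based, inclusive."""
    return start_0 + 1, end_0

def to_0_based(start_1, end_1):
    """Convert a 1-based, inclusive interval back to 0-based, half-open."""
    return start_1 - 1, end_1

# A variant covering the 100th base of a chromosome:
# 0-based half-open [99, 100) corresponds to 1-based inclusive (100, 100).
print(to_1_based(99, 100))  # -> (100, 100)
```

Note that the end position is numerically unchanged: a half-open end is exclusive, so it already equals the 1-based inclusive end.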

New flags

  • vcf_to_bq:
    • --sample_lookup_optimized_output_table: to store a second copy of variants in BigQuery tables that are optimized for sample lookup queries. This feature is particularly useful when the input VCF file contains joint genotyped samples.
    • --keep_intermediate_avro_files: to store intermediate Avro files in your temp directory on GCS bucket.
    • --use_1_based_coordinate: By default, the start position will be 1-based and the end position will be inclusive. You can set this flag to False to use 0-based coordinates.
    • --sample_name_encoding: determines the way sample_id is hashed. Default value is WITHOUT_FILE_PATH. If set to WITH_FILE_PATH, then sample_id will be a hash of [vcf_file_path, sample_name].
    • --sharding_config_path: replaces --partition_config_path.
  • bq_to_vcf:
    • --bq_uses_1_based_coordinate: set to False if --use_1_based_coordinate was set to False when generating the BQ tables (i.e., start positions are 0-based).
    • --sample_names: replaces --call_names.
    • --preserve_sample_order: replaces --preserve_call_names_order.
  • docker run flags:
    • All the following flags are required:
      • --project
      • --regions
      • --temp_location
    • If you need to run Variant Transforms in a subnetwork using private IP addresses:
      • --subnetwork ${CUSTOM_SUBNETWORK}
      • --use_public_ips false
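The required-flag rule above can be sketched as a small validation helper. The helper name is ours; the flag names are the ones listed above.

```python
# Flags the docker run wrapper requires, per the release notes above.
REQUIRED_FLAGS = ("--project", "--regions", "--temp_location")

def missing_required_flags(argv):
    """Return the required flags that are absent from argv (hypothetical helper)."""
    present = {arg.split("=", 1)[0] for arg in argv if arg.startswith("--")}
    return [flag for flag in REQUIRED_FLAGS if flag not in present]

args = ["--project=my-project", "--regions=us-central1",
        "--temp_location=gs://my-bucket/temp",
        "--subnetwork=my-subnet", "--use_public_ips", "false"]
print(missing_required_flags(args))  # -> []
```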

Deprecated flags

The following flags are deprecated and will be removed in the next release:

  • --optimize_for_large_inputs: because sharding is done by default for all inputs.
  • --num_bigquery_write_shards: because we are using the Avro sink in the Dataflow pipeline.
  • --output_avro_path: replaced with --keep_intermediate_avro_files.
  • --reference_names: you can achieve the same goal by modifying the default sharding config file.

Underlying improvements

  • Switched our default VCF parser from PyVCF to PySam.
  • Updated to Beam 2.22.
  • The launcher VM was changed to g1-small to reduce the overall cost of running VT.

Breaking Changes

  • By default, 1-based coordinates are used for genomic indexing. The same default is used for bq_to_vcf, so VCF -> BigQuery -> VCF with default flags should work.
  • call.name is replaced with call.sample_id.
  • --partition_config_path is replaced with --sharding_config_path.
  • The sharding config YAML format has changed.
  • Output table names cannot contain __ because we reserve this string for separating the table base name from the suffixes that we read from the sharding config file.
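The reserved __ separator can be sketched as follows. The helper name is ours; the suffixes come from the sharding config file.

```python
def sharded_table_name(base_name, suffix):
    """Join a base table name and a shard suffix with the reserved '__' separator."""
    if "__" in base_name:
        # '__' is reserved for splitting the base name from the shard suffix.
        raise ValueError("output table name cannot contain '__'")
    return f"{base_name}__{suffix}"

print(sharded_table_name("my_variants", "chr1"))  # -> my_variants__chr1
```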

The following flags have been removed in this release:

  • --vcf_parser
  • --partition_config_path: replaced with --sharding_config_path
  • --call_names: replaced with --sample_names
  • --preserve_call_names_order: replaced with --preserve_sample_order
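The call.name -> call.sample_id change and the two --sample_name_encoding modes described above can be illustrated with a sketch. Note that this is NOT the actual hash function Variant Transforms uses; a SHA-256 prefix stands in here only to show how including the file path changes the resulting id.

```python
import hashlib

def sample_id(sample_name, vcf_file_path=None):
    """Stand-in hash illustrating WITHOUT_FILE_PATH vs WITH_FILE_PATH encoding.

    With the default WITHOUT_FILE_PATH, only the sample name is hashed;
    with WITH_FILE_PATH, the file path is hashed in as well, so identical
    sample names in different VCF files get distinct ids.
    """
    key = sample_name if vcf_file_path is None else f"{vcf_file_path}|{sample_name}"
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

# Same sample name always hashes identically WITHOUT_FILE_PATH...
assert sample_id("NA12878") == sample_id("NA12878")
# ...but differs per file when the path is included (WITH_FILE_PATH):
assert sample_id("NA12878", "gs://b/a.vcf") != sample_id("NA12878", "gs://b/b.vcf")
```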

Release v0.8.1

20 Aug 16:57
d0b77a6

This is a patch release that makes the following improvements:

  • Updated to Beam 2.14.0 to fix the issue of Dataflow jobs getting stuck (#519).
  • Fixed bugs in loading BGZF (#500, #507).

Release v0.8.0

25 Jun 18:44
7796d40

Main changes since last release:

  • The docker images have moved to a new location: gcr.io/cloud-lifesciences/gcp-variant-transforms. Subsequent releases will also be published at this URI. Please update your command by replacing gcr.io/gcp-variant-transforms/gcp-variant-transforms with gcr.io/cloud-lifesciences/gcp-variant-transforms.
  • Support loading BGZF files (alpha release). The index files (.gz.tbi or .bgz.tbi) must be in the same directory as the compressed VCF files.

Release v0.7.1

23 Apr 19:44
daf34df
  • Modify the docker container entry point script to execute Variant Transforms commands directly if they are passed as arguments (for backwards compatibility).

Release v0.7.0

11 Apr 21:52
40b4b03
  • Provide functionality to ingest a list of input patterns from a file, using the --input_file flag.
  • Support using docker run command to run the pipeline. Running with Google Genomics Pipelines API has been deprecated. Check out the documentation for more details.

Release v0.6.1

14 Mar 18:49

This is a patch release that makes the following improvements:

  • Fixed some issues in BQ to VCF pipeline (Issue #446, Issue #447).
  • Improved the robustness of annotation pipeline.

Release v0.6.0

21 Feb 15:14
1ff9484

This release mainly improves the VEP annotation performance. Check out the documentation for more details.

Release v0.5.1

28 Nov 16:47
6f089ab

This release mostly contains usability improvements (better argument validation and an option to allow Number=A fields with too few values) and internal cleanups. It also updates the BigQuery row limit from 10 MB to 100 MB.

The main new feature is the option of writing into Avro files using the --output_avro_path flag.