V0.2.0 (#12)

- Automated and updated gender prediction - deprecated gender split references, unified references - code clean up WARNING: - Breaks backwards compatibility with NPZ files generated with previous versions. A conversion script will be provided to convert older NPZ files.
CenterForMedicalGeneticsGhent · Jun 28, 2018 · 5a51436 · 5a51436
1 parent 74d3777
commit 5a51436
Show file tree

Hide file tree

Showing 9 changed files with 532 additions and 352 deletions.
diff --git a/README.md b/README.md
@@ -7,8 +7,8 @@ as is the case with every tool, WISECONDOR has limitations of its own: the Stouf
 dealing with large amounts of aberrations, the algorithm is extremely slow (24h) when using small bin sizes (15 kb) and
 sex chromosomes are not included in the analysis. Here, I present WisecondorX, an evolved WISECONDOR that aims at dealing with
 previous difficulties. Main adaptations include the additional (and consistent) analysis of the X and Y chromosomes,
-a CBS-based segmentation technique and a custom plotter, resulting in overall superior results and significantly lower computing times,
-allowing daily diagnostic use. WisecondorX should be applicable not only to NIPT, but also to PGD, FFPE, LQB, ... etc.
+a weighted CBS-based segmentation technique and a custom plotter, resulting in overall superior results and significantly lower computing times,
+allowing daily diagnostic use. WisecondorX is meant to be applicable not only to NIPT, but also to gDNA, PGD, FFPE, LQB, ... etc.
 
 # Manual
 
@@ -44,26 +44,23 @@ python setup.py install
 
 ### Running WisecondorX
 
-There are three main stages for using WisecondorX:
-- Converting .bam files to .npz files (both reference and test samples)
-- Creating a reference (using reference .npz files)  
+There are three main stages (converting, reference creating & predicting) for using WisecondorX:  
+- Convert .bam files to .npz files (both reference and test samples)  
+- Create a reference (using reference .npz files)  
     - **Important notes**
-        - Reference samples should be divided in two distinct groups, one for males and one for females. This is required to correctly
-        normalize the X and/or Y chromosome.  
-        - When the female reference is given to the [`predict`](#stage-3-predict-cnas) function, chromosome X will be analysed;
-        when on the other hand the male reference is used, chromosomes X & Y are analysed. This regardless of the gender of the test case,
-        although I would **never** advice to use a female reference and a male test case, or vice versa &mdash; this because numerous Y reads
-        wrongly map the X chromosome. Using a matching reference, the latter is accounted for.
-        - For NIPT, exclusively a female reference should be created. This implies that for NIPT, WisecondorX is not able
-        to analyse the Y chromosome. Furthermore, obtaining consistent shifts in the X chromosome is only possible when the reference
-        is created using pregnancies of female fetuses only, irrespective of the gender of your test cases. When this cannot
-        be achieved, you risk blacklisting the entire X chromosome due to its variability because of fetal fraction dependence.  
-        - It is of paramount importance that the reference set consists of exclusively healthy samples that originate from the same 
-        sequencer, mapper, reference genome, type of material, ... etc, as the test samples. As a rule of thumb, think of
-        all laboratory and in silico pre-processing steps: the more sources of bias that can be omitted, the better.  
-        - Try to include at least 50 samples per reference. The more the better, yet, from 200 on it is
-        unlikely to observe additional improvement concerning normalization.  
-- Predicting CNAs (using the reference and test .npz cases of interest)
+        - WisecondorX will internally generate a male and female gonosomal reference. It is advisable that both male and female
+        samples are represented in the reference set. If e.g. no male samples are included, the Y chromosome will not be
+        part of the analysis when testing male cases.  
+        - For NIPT analysis, an important exception on previous rule holds: only pregnancies of female feti should be used to 
+        generate the reference. This implies that for NIPT, WisecondorX is not able to analyse the Y chromosome. Additionally,
+        this goes without saying, make sure you do not manually annotate samples as male during the convert phase.  
+        - It is of paramount importance that the reference set consists of exclusively healthy samples that originate from
+        the same sequencer, mapper, reference genome, type of material, ... etc, as the test samples. As a rule of thumb,
+        think of all laboratory and in silico pre-processing steps: the more sources of bias that can be omitted,
+        the better.  
+        - Try to include at least 50 samples per reference. The more the better, yet, from 300 on it is unlikely to observe
+        additional improvement concerning normalization.  
+- Predict CNAs (using the reference and test .npz cases of interest)  
 
 ### Stage (1) Convert .bam to .npz
 
@@ -72,11 +69,14 @@ There are three main stages for using WisecondorX:
 WisecondorX convert input.bam output.npz [--optional arguments]
 ```
 
-<br>Optional argument<br><br> | Function
+<br>Optional argument &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Function  
 :--- | :---  
 `--binsize x` | Size per bin in bp, the reference bin size should be a multiple of this value (default: x=5e3)  
 `--retdist x` | Max amount of bp's between reads to consider them part of the same tower (default: x=4)  
 `--retthres x` | Threshold for a group of reads to be considered a tower. These will be removed (default: x=4)  
+`--gender x` | When not used (which is recommended), WisecondorX will predict the gender. Manually assigning gender could be useful for highly aberrant samples (choices: x=F, x=M)  
+`--gonmapr x` | Represents the overall mappability ratio between X and Y. Concerning short single-end read mapping, an X bin is twice (default) as mappable compared to a Y bin. Used to predict gender (default: x=2)  
+
 
 &rarr; Bash recipe (example for NIPT) at `./pipeline/convert.sh`
 
@@ -89,24 +89,12 @@ WisecondorX newref reference_input_dir/*.npz reference_output.npz [--optional ar
 
 <br>Optional argument<br><br> | Function
 :--- | :---  
-`--gender x` | The gender of the samples at the `reference_input_dir`, female (F) or male (M) (default: x=F)  
 `--binsize x` | Size per bin in bp, defines the resolution of the output (default: x=1e5)  
 `--refsize x` | Amount of reference locations per target (default: x=300)  
 `--cpus x` | Number of threads requested (default: x=1)  
 
 &rarr; Bash recipe (example for NIPT) at `./pipeline/newref.sh`
 
-##### When the gender is not known, WisecondorX can predict it
-
-```bash
-
-WisecondorX gender input.npz [--optional arguments]
-```
-
-<br>Optional argument &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Function
-:--- | :---  
-`--cutoff x` | Y-read permille cut-off: above is male, below is female. Note that for NIPT, this will always return 'female' (default: x=3.5; optimized for mapping as [described above](#mapping))  
-
 ### Stage (3) Predict CNAs  
 
 ```bash
@@ -131,26 +119,41 @@ WisecondorX predict test_input.npz reference_input.npz output_id [--optional arg
 # Parameters
 
 The default parameters are optimized for shallow whole-genome sequencing data (0.1x - 1x depth; sWGS) and reference bin sizes 
-ranging from 50 to 200 kb. When increasing the reference bin size (`--binsize`), I recommend lowering the reference locations 
+ranging from 50 to 500 kb. When increasing the reference bin size (`--binsize`), I recommend lowering the reference locations 
 per target (`--refsize`) and the minimum amount of sensible reference bins per target bin (`--minrefbins`). Further note that a
-reference bin size lower than 15 kb is not advisable, unless a higher sequencing depth was used.  
+reference bin size lower than 15 kb is not advisable.  
 **Important note**  
 Concerning the vast majority of applications, the `--alpha` parameter should not be tweaked. The `--beta` parameter on the contrary
 should depend on your type of analysis. For NIPT, its default value should be fine. However, for gDNA, when mosaicisms are of no interest,
 it could be increased to its maximum, being 1. When the fetal (NIPT) or tumor (LQB, fresh material, FFPE, ...) fraction is known, this parameter is optimally
 close to this fraction. If you have any doubts about this argument, a default `--beta` should still be fine when a good and large reference set was created,
-irrespective the type of analysis.  
+irrespective of the type of analysis.  
 
 # Underlying algorithm
 
 To understand the underlying algorithm, I highly recommend reading [Straver et al (2014)](https://www.ncbi.nlm.nih.gov/pubmed/24170809); and its
 update shortly introduced in [Huijsdens-van Amsterdam et al (2018)](https://www.nature.com/articles/gim201832.epdf).
-Some adaptations to this algorithm have been made, e.g. additional variance stabilization (log2) on final ratios, removal of
-less useful plot and Stouffer's z-score codes, addition of the X and Y chromosomes, inclusion of CBS, table and plot codes, and &mdash; last but not least &mdash;
-restrictions on within-sample referencing, an important feature for NIPT:  
+Numerous adaptations to this algorithm have been made, yet the central principles remain. Changes include e.g. the inclusion of a gender
+prediction algorithm, gender handling prior to normalization (ultimately enabling X and Y predictions), extensive
+blacklisting options, inclusion of a weighted CBS algorithm, improved codes for output tables and plots, and &mdash; last but
+not least &mdash; restrictions on within-sample referencing, an important feature for NIPT:  
 
 ![Alt text](./figures/within-sample-normalization-2.png?raw=true "Within-sample normalization in WisecondorX")
 
+# Additional scripts & features
+
+Some files might not be compatible between versions. A small script (`fix_convert_npz.py`) enables reformatting
+.npz files resulting from the `convert` stage to files that are compatible with the newest version. This can also be used
+to transform original WISECONDOR .npz files. Note that the `newref` function might require a re-run to make `predict` functional
+between versions.  
+
+```bash
+
+fix.convert.npz.py input.npz output.npz
+```
+
+To get the (predicted) gender of a sample, one can use `WisecondorX gender input.npz`.  
+
 # Dependencies
 
 - R (v3.4) packages

diff --git a/pipeline/newref.sh b/pipeline/newref.sh
@@ -4,9 +4,8 @@
 CORES=6
 INPUT_DIR="path/to/convert.npz" # existing (non-empty) folder, containing reference .npz files
 OUTPUT_DIR="path/to/newref.npz" # existing (empty) folder
-REF_SIZES="15 50 100 200 500 1000" # space separated list (kb)
+REF_SIZES="1000 500 200 100" # space separated list (kb)
 RELEASE="hg38" # reference used to create bam files (solely used for reference filename)
-GENDER="F" # gender of the of the cases at NPZ_INPUT_DIR
 
 # SCRIPT
 
@@ -15,6 +14,6 @@ do
     echo "Creating reference at bins size ${REF} kb"
 
     WisecondorX newref ${INPUT_DIR}/*.npz \
-    ${OUTPUT_DIR}/reference.${RELEASE}.${GENDER}.${REF}kb.npz \
-    -binsize ${REF}000 -cpus ${CORES} -gender ${GENDER}
+    ${OUTPUT_DIR}/reference.${RELEASE}.${REF}kb.npz \
+    --binsize ${REF}000 --cpus ${CORES}
 done
diff --git a/pipeline/predict.sh b/pipeline/predict.sh
@@ -6,13 +6,13 @@ NPZ_FILES="path/to/samples.txt" # cases that will be tested versus the given ref
 # ID_1 path/to/convert.npz/ID_1.npz
 # ID_2 path/to/convert.npz/ID_2.npz
 # ...
-REF="path/to/newref.npz/reference.hg38.F.50kb.npz" # the cases in the NPZ_FILES document are thus expected to be female
+REF="path/to/newref.npz/reference.hg38.50kb.npz"
 OUTPUT_DIR="path/to/predict.output" # existing output folder
 
 
 # OPTIONAL PARAMETERS
 
-OPT="-plot -bed"
+OPT="--plot --bed"
 
 # SCRIPT
 

diff --git a/scripts/fix_convert_npz.py b/scripts/fix_convert_npz.py
@@ -0,0 +1,66 @@
+#!/usr/bin/env python
+
+import argparse
+import numpy as np
+
+def toolConvert():
+    # Added for compatibility with with original WiSeCondor files
+    return True
+
+def get_gender(sample):
+    tot_reads = float(sum([sum(sample[str(x)]) for x in range(1, 25)]))
+    x_reads = float(sum(sample["23"]))
+    x_len = float(len(sample["23"]))
+    y_reads = float(sum(sample["24"]))
+    y_len = float(len(sample["24"]))
+
+    X = (x_reads / tot_reads) / x_len * 0.5
+    Y = (y_reads / tot_reads) / y_len
+
+    # X/Y               = ?
+    # 1/1 (MALE)        = 1
+    # 2/noise (FEMALE)  = [4,8]
+    # cut-off 3 -- should be robust vs noise and mosaic large subchromosomal duplication/deletions
+    if X/Y < 3:
+        return "M"
+    else:
+        return "F"
+
+
+def reformat(sample):
+    for chrom in sample.keys():
+        data = sample[chrom]
+        if chrom == "X":
+            chrom = "23"
+            sample.pop("X")
+        if chrom == "Y":
+            chrom = "24"
+            sample.pop("Y")
+        sample[chrom] = data
+    return sample
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description="\
+            This script reformats samples to make them backwards compatible between WisecondorX versions. \
+            It can also be used to transform .npz files from the original WISECONDOR to WisecondorX. \
+            Note that is only compatible with .npz files resulting from the convert version, \
+            other functions (newref, predict) should be re-run."
+                                     )
+    parser.add_argument('in_npz', help='Input npz from Wisecondor or WisecondorX <0.2.0')
+    parser.add_argument('out_npz', help='Output npz')
+
+    args = parser.parse_args()
+
+    npz = np.load(args.in_npz)
+    sample = npz["sample"].item()
+    sample = reformat(sample)
+    gender = get_gender(sample)
+
+    if "binsize" in npz.keys():
+        binsize = npz["binsize"].item()
+    else:
+        binsize = npz["arguments"].item()["binsize"]
+
+    np.savez_compressed(args.out_npz, binsize=binsize, sample=sample, gender=gender, quality=npz["quality"].item())
+    print("Succes!")
diff --git a/setup.py b/setup.py
@@ -1,7 +1,7 @@
 #! /usr/bin/env python
 from setuptools import setup, find_packages
 
-version = '0.1.2dev'
+version = '0.2.0'
 dl_version = 'master' if 'dev' in version else '{}'.format(version)
 
 setup(
@@ -28,6 +28,7 @@
     entry_points={
         'console_scripts': ['WisecondorX = wisecondorX.main:main']
     },
+    scripts=['scripts/fix_convert_npz.py'],
     classifiers=[
         'Development Status :: 3 - Alpha',
         'Environment :: Console',

diff --git a/wisecondorX/include/CBS.R b/wisecondorX/include/CBS.R
@@ -5,30 +5,16 @@
 arguments = "
 
 --infile          - (mandatory argument) input .json file
---sexchroms       - (mandatory argument) which sex chromosomes will be analyzed? (X or XY)
 --outfile         - (mandatory argument) output .json file
---alpha           - (mandatory argument) p-value for segmentation
 
 "
 
 args <- commandArgs(TRUE)
 
 
-if (all(c(args[1], args[3], args[5], args[7]) %in% c("--infile", "--sexchroms", "--outfile", "--alpha"))){
+if (all(c(args[1], args[3]) %in% c("--infile", "--outfile"))){
   in.file <- paste0(args[which(args == "--infile")+1])
-  sex.chrom <- paste0(args[which(args == "--sexchroms")+1])
   out.file <- paste0(args[which(args == "--outfile")+1])
-  alpha <- paste0(args[which(args == "--alpha")+1])
-}
-
-# -----
-# param
-# -----
-
-if (sex.chrom == "X"){
-    exclude.chr = c(24)
-} else {
-    exclude.chr = c()
 }
 
 # -----
@@ -48,8 +34,15 @@ input <- read_json(in.file)
 ratio <- as.numeric(unlist(input$results_r))
 weights <- as.numeric(unlist(input$weights))
 
-chrs = c(1:24)
-chrs = chrs[which(!(chrs  %in% exclude.chr))]
+gender <- input$reference_gender
+alpha <- input$alpha
+
+if (gender == "M"){
+    chrs = 1:24
+} else {
+    chrs = 1:23
+}
+
 bins.per.chr <- sapply(chrs, FUN = function(x) length(unlist(input$results_r[x])))
 chr.end.pos <- c(0)
 for (chr in chrs){