Skip to content

Commit

Permalink
V0.2.0 (#12)
Browse files Browse the repository at this point in the history
- Automated and updated gender prediction
- deprecated gender split references, unified references
- code clean up

WARNING:
- Breaks backwards compatibility with NPZ files generated with previous versions. A conversion script will be provided to convert older NPZ files.
  • Loading branch information
matthdsm authored Jun 28, 2018
1 parent 74d3777 commit 5a51436
Show file tree
Hide file tree
Showing 9 changed files with 532 additions and 352 deletions.
83 changes: 43 additions & 40 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,8 @@ as is the case with every tool, WISECONDOR has limitations of its own: the Stouf
dealing with large amounts of aberrations, the algorithm is extremely slow (24h) when using small bin sizes (15 kb) and
sex chromosomes are not included in the analysis. Here, I present WisecondorX, an evolved WISECONDOR that aims at dealing with
previous difficulties. Main adaptations include the additional (and consistent) analysis of the X and Y chromosomes,
a CBS-based segmentation technique and a custom plotter, resulting in overall superior results and significantly lower computing times,
allowing daily diagnostic use. WisecondorX should be applicable not only to NIPT, but also to PGD, FFPE, LQB, ... etc.
a weighted CBS-based segmentation technique and a custom plotter, resulting in overall superior results and significantly lower computing times,
allowing daily diagnostic use. WisecondorX is meant to be applicable not only to NIPT, but also to gDNA, PGD, FFPE, LQB, ... etc.

# Manual

Expand Down Expand Up @@ -44,26 +44,23 @@ python setup.py install

### Running WisecondorX

There are three main stages for using WisecondorX:
- Converting .bam files to .npz files (both reference and test samples)
- Creating a reference (using reference .npz files)
There are three main stages (converting, reference creating & predicting) for using WisecondorX:
- Convert .bam files to .npz files (both reference and test samples)
- Create a reference (using reference .npz files)
- **Important notes**
- Reference samples should be divided in two distinct groups, one for males and one for females. This is required to correctly
normalize the X and/or Y chromosome.
- When the female reference is given to the [`predict`](#stage-3-predict-cnas) function, chromosome X will be analysed;
when on the other hand the male reference is used, chromosomes X & Y are analysed. This regardless of the gender of the test case,
although I would **never** advice to use a female reference and a male test case, or vice versa — this because numerous Y reads
wrongly map the X chromosome. Using a matching reference, the latter is accounted for.
- For NIPT, exclusively a female reference should be created. This implies that for NIPT, WisecondorX is not able
to analyse the Y chromosome. Furthermore, obtaining consistent shifts in the X chromosome is only possible when the reference
is created using pregnancies of female fetuses only, irrespective of the gender of your test cases. When this cannot
be achieved, you risk blacklisting the entire X chromosome due to its variability because of fetal fraction dependence.
- It is of paramount importance that the reference set consists of exclusively healthy samples that originate from the same
sequencer, mapper, reference genome, type of material, ... etc, as the test samples. As a rule of thumb, think of
all laboratory and in silico pre-processing steps: the more sources of bias that can be omitted, the better.
- Try to include at least 50 samples per reference. The more the better, yet, from 200 on it is
unlikely to observe additional improvement concerning normalization.
- Predicting CNAs (using the reference and test .npz cases of interest)
- WisecondorX will internally generate a male and female gonosomal reference. It is advisable that both male and female
samples are represented in the reference set. If e.g. no male samples are included, the Y chromosome will not be
part of the analysis when testing male cases.
- For NIPT analysis, an important exception on previous rule holds: only pregnancies of female feti should be used to
generate the reference. This implies that for NIPT, WisecondorX is not able to analyse the Y chromosome. Additionally,
this goes without saying, make sure you do not manually annotate samples as male during the convert phase.
- It is of paramount importance that the reference set consists of exclusively healthy samples that originate from
the same sequencer, mapper, reference genome, type of material, ... etc, as the test samples. As a rule of thumb,
think of all laboratory and in silico pre-processing steps: the more sources of bias that can be omitted,
the better.
- Try to include at least 50 samples per reference. The more the better, yet, from 300 on it is unlikely to observe
additional improvement concerning normalization.
- Predict CNAs (using the reference and test .npz cases of interest)

### Stage (1) Convert .bam to .npz

Expand All @@ -72,11 +69,14 @@ There are three main stages for using WisecondorX:
WisecondorX convert input.bam output.npz [--optional arguments]
```

<br>Optional argument<br><br> | Function
<br>Optional argument &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Function
:--- | :---
`--binsize x` | Size per bin in bp, the reference bin size should be a multiple of this value (default: x=5e3)
`--retdist x` | Max amount of bp's between reads to consider them part of the same tower (default: x=4)
`--retthres x` | Threshold for a group of reads to be considered a tower. These will be removed (default: x=4)
`--gender x` | When not used (which is recommended), WisecondorX will predict the gender. Manually assigning gender could be useful for highly aberrant samples (choices: x=F, x=M)
`--gonmapr x` | Represents the overall mappability ratio between X and Y. Concerning short single-end read mapping, an X bin is twice (default) as mappable compared to a Y bin. Used to predict gender (default: x=2)


&rarr; Bash recipe (example for NIPT) at `./pipeline/convert.sh`

Expand All @@ -89,24 +89,12 @@ WisecondorX newref reference_input_dir/*.npz reference_output.npz [--optional ar

<br>Optional argument<br><br> | Function
:--- | :---
`--gender x` | The gender of the samples at the `reference_input_dir`, female (F) or male (M) (default: x=F)
`--binsize x` | Size per bin in bp, defines the resolution of the output (default: x=1e5)
`--refsize x` | Amount of reference locations per target (default: x=300)
`--cpus x` | Number of threads requested (default: x=1)

&rarr; Bash recipe (example for NIPT) at `./pipeline/newref.sh`

##### When the gender is not known, WisecondorX can predict it

```bash

WisecondorX gender input.npz [--optional arguments]
```

<br>Optional argument &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Function
:--- | :---
`--cutoff x` | Y-read permille cut-off: above is male, below is female. Note that for NIPT, this will always return 'female' (default: x=3.5; optimized for mapping as [described above](#mapping))

### Stage (3) Predict CNAs

```bash
Expand All @@ -131,26 +119,41 @@ WisecondorX predict test_input.npz reference_input.npz output_id [--optional arg
# Parameters

The default parameters are optimized for shallow whole-genome sequencing data (0.1x - 1x depth; sWGS) and reference bin sizes
ranging from 50 to 200 kb. When increasing the reference bin size (`--binsize`), I recommend lowering the reference locations
ranging from 50 to 500 kb. When increasing the reference bin size (`--binsize`), I recommend lowering the reference locations
per target (`--refsize`) and the minimum amount of sensible reference bins per target bin (`--minrefbins`). Further note that a
reference bin size lower than 15 kb is not advisable, unless a higher sequencing depth was used.
reference bin size lower than 15 kb is not advisable.
**Important note**
Concerning the vast majority of applications, the `--alpha` parameter should not be tweaked. The `--beta` parameter on the contrary
should depend on your type of analysis. For NIPT, its default value should be fine. However, for gDNA, when mosaicisms are of no interest,
it could be increased to its maximum, being 1. When the fetal (NIPT) or tumor (LQB, fresh material, FFPE, ...) fraction is known, this parameter is optimally
close to this fraction. If you have any doubts about this argument, a default `--beta` should still be fine when a good and large reference set was created,
irrespective the type of analysis.
irrespective of the type of analysis.

# Underlying algorithm

To understand the underlying algorithm, I highly recommend reading [Straver et al (2014)](https://www.ncbi.nlm.nih.gov/pubmed/24170809); and its
update shortly introduced in [Huijsdens-van Amsterdam et al (2018)](https://www.nature.com/articles/gim201832.epdf).
Some adaptations to this algorithm have been made, e.g. additional variance stabilization (log2) on final ratios, removal of
less useful plot and Stouffer's z-score codes, addition of the X and Y chromosomes, inclusion of CBS, table and plot codes, and &mdash; last but not least &mdash;
restrictions on within-sample referencing, an important feature for NIPT:
Numerous adaptations to this algorithm have been made, yet the central principles remain. Changes include e.g. the inclusion of a gender
prediction algorithm, gender handling prior to normalization (ultimately enabling X and Y predictions), extensive
blacklisting options, inclusion of a weighted CBS algorithm, improved codes for output tables and plots, and &mdash; last but
not least &mdash; restrictions on within-sample referencing, an important feature for NIPT:

![Alt text](./figures/within-sample-normalization-2.png?raw=true "Within-sample normalization in WisecondorX")

# Additional scripts & features

Some files might not be compatible between versions. A small script (`fix_convert_npz.py`) enables reformatting
.npz files resulting from the `convert` stage to files that are compatible with the newest version. This can also be used
to transform original WISECONDOR .npz files. Note that the `newref` function might require a re-run to make `predict` functional
between versions.

```bash

fix.convert.npz.py input.npz output.npz
```

To get the (predicted) gender of a sample, one can use `WisecondorX gender input.npz`.

# Dependencies

- R (v3.4) packages
Expand Down
7 changes: 3 additions & 4 deletions pipeline/newref.sh
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,8 @@
CORES=6
INPUT_DIR="path/to/convert.npz" # existing (non-empty) folder, containing reference .npz files
OUTPUT_DIR="path/to/newref.npz" # existing (empty) folder
REF_SIZES="15 50 100 200 500 1000" # space separated list (kb)
REF_SIZES="1000 500 200 100" # space separated list (kb)
RELEASE="hg38" # reference used to create bam files (solely used for reference filename)
GENDER="F" # gender of the of the cases at NPZ_INPUT_DIR

# SCRIPT

Expand All @@ -15,6 +14,6 @@ do
echo "Creating reference at bins size ${REF} kb"

WisecondorX newref ${INPUT_DIR}/*.npz \
${OUTPUT_DIR}/reference.${RELEASE}.${GENDER}.${REF}kb.npz \
-binsize ${REF}000 -cpus ${CORES} -gender ${GENDER}
${OUTPUT_DIR}/reference.${RELEASE}.${REF}kb.npz \
--binsize ${REF}000 --cpus ${CORES}
done
4 changes: 2 additions & 2 deletions pipeline/predict.sh
Original file line number Diff line number Diff line change
Expand Up @@ -6,13 +6,13 @@ NPZ_FILES="path/to/samples.txt" # cases that will be tested versus the given ref
# ID_1 path/to/convert.npz/ID_1.npz
# ID_2 path/to/convert.npz/ID_2.npz
# ...
REF="path/to/newref.npz/reference.hg38.F.50kb.npz" # the cases in the NPZ_FILES document are thus expected to be female
REF="path/to/newref.npz/reference.hg38.50kb.npz"
OUTPUT_DIR="path/to/predict.output" # existing output folder


# OPTIONAL PARAMETERS

OPT="-plot -bed"
OPT="--plot --bed"

# SCRIPT

Expand Down
66 changes: 66 additions & 0 deletions scripts/fix_convert_npz.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
#!/usr/bin/env python

import argparse
import numpy as np

def toolConvert():
# Added for compatibility with with original WiSeCondor files
return True

def get_gender(sample):
tot_reads = float(sum([sum(sample[str(x)]) for x in range(1, 25)]))
x_reads = float(sum(sample["23"]))
x_len = float(len(sample["23"]))
y_reads = float(sum(sample["24"]))
y_len = float(len(sample["24"]))

X = (x_reads / tot_reads) / x_len * 0.5
Y = (y_reads / tot_reads) / y_len

# X/Y = ?
# 1/1 (MALE) = 1
# 2/noise (FEMALE) = [4,8]
# cut-off 3 -- should be robust vs noise and mosaic large subchromosomal duplication/deletions
if X/Y < 3:
return "M"
else:
return "F"


def reformat(sample):
for chrom in sample.keys():
data = sample[chrom]
if chrom == "X":
chrom = "23"
sample.pop("X")
if chrom == "Y":
chrom = "24"
sample.pop("Y")
sample[chrom] = data
return sample


if __name__ == '__main__':
parser = argparse.ArgumentParser(description="\
This script reformats samples to make them backwards compatible between WisecondorX versions. \
It can also be used to transform .npz files from the original WISECONDOR to WisecondorX. \
Note that is only compatible with .npz files resulting from the convert version, \
other functions (newref, predict) should be re-run."
)
parser.add_argument('in_npz', help='Input npz from Wisecondor or WisecondorX <0.2.0')
parser.add_argument('out_npz', help='Output npz')

args = parser.parse_args()

npz = np.load(args.in_npz)
sample = npz["sample"].item()
sample = reformat(sample)
gender = get_gender(sample)

if "binsize" in npz.keys():
binsize = npz["binsize"].item()
else:
binsize = npz["arguments"].item()["binsize"]

np.savez_compressed(args.out_npz, binsize=binsize, sample=sample, gender=gender, quality=npz["quality"].item())
print("Succes!")
3 changes: 2 additions & 1 deletion setup.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
#! /usr/bin/env python
from setuptools import setup, find_packages

version = '0.1.2dev'
version = '0.2.0'
dl_version = 'master' if 'dev' in version else '{}'.format(version)

setup(
Expand All @@ -28,6 +28,7 @@
entry_points={
'console_scripts': ['WisecondorX = wisecondorX.main:main']
},
scripts=['scripts/fix_convert_npz.py'],
classifiers=[
'Development Status :: 3 - Alpha',
'Environment :: Console',
Expand Down
27 changes: 10 additions & 17 deletions wisecondorX/include/CBS.R
Original file line number Diff line number Diff line change
Expand Up @@ -5,30 +5,16 @@
arguments = "
--infile - (mandatory argument) input .json file
--sexchroms - (mandatory argument) which sex chromosomes will be analyzed? (X or XY)
--outfile - (mandatory argument) output .json file
--alpha - (mandatory argument) p-value for segmentation
"

args <- commandArgs(TRUE)


if (all(c(args[1], args[3], args[5], args[7]) %in% c("--infile", "--sexchroms", "--outfile", "--alpha"))){
if (all(c(args[1], args[3]) %in% c("--infile", "--outfile"))){
in.file <- paste0(args[which(args == "--infile")+1])
sex.chrom <- paste0(args[which(args == "--sexchroms")+1])
out.file <- paste0(args[which(args == "--outfile")+1])
alpha <- paste0(args[which(args == "--alpha")+1])
}

# -----
# param
# -----

if (sex.chrom == "X"){
exclude.chr = c(24)
} else {
exclude.chr = c()
}

# -----
Expand All @@ -48,8 +34,15 @@ input <- read_json(in.file)
ratio <- as.numeric(unlist(input$results_r))
weights <- as.numeric(unlist(input$weights))

chrs = c(1:24)
chrs = chrs[which(!(chrs %in% exclude.chr))]
gender <- input$reference_gender
alpha <- input$alpha

if (gender == "M"){
chrs = 1:24
} else {
chrs = 1:23
}

bins.per.chr <- sapply(chrs, FUN = function(x) length(unlist(input$results_r[x])))
chr.end.pos <- c(0)
for (chr in chrs){
Expand Down
Loading

0 comments on commit 5a51436

Please sign in to comment.