Skip to content

Test for a molecular clock based on the phylogenetic tree inferred from single-cell DNA sequenzing data

License

Notifications You must be signed in to change notification settings

cbg-ethz/scSomMerClock

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Single-Cell SOMatic MolEculaR Clock testing

This repo contains a Poisson Tree (PT) Test for the existence of a somatic clock in single-cell phylogenies. In short, it tests if different cell lineages evolve at a similar rate, accumulating mutations according to a molecular clock. As input the test requires a mutation matrix, a phylogeny of contemporaneously sampled cells, and error rates.

This repo contains scripts for running

  • the PT Test
  • and, in a subfolder (AnalysisPipelines), scripts for
    • the processing of real scDNA-seq data
    • the analysis of real scDNA-seq data
    • the simulation of scDNA-seq data (via coalescent)
    • the analysis and plotting of simulated scDNA-seq data

Installation

Requirements

  • python3.X:
    • ete3
    • numpy
    • pandas
    • scipy

The requirements cant be installed using pip:

python -m pip install ete3 pandas scipy

Usage

The PT test can be run with the following shell command:

python run_PT_test.py <VCF_FILE> <NEWICK_TREE_FILE> [-o] [-excl] [-incl] [-w] [-FN] [-FP]

Input files

The PT test requires two input files:

  • Called variants in VCF format (VCF info), where each sample is a cell
  • An inferred phylogenetic tree in newick format (cell names need to be the same as in the VCF).

Note

Trees can be inferred, for example, with CellPhy or infSCITE; both outputs are compatible with the PT test

Optional Arguments

  • -o <str>, Output file. Default = <VCF_FILE>.poissonTree_LRT.tsv.
  • -excl <str>, Regex pattern for samples/cells to exclude. Default = none.
  • -incl <str>, Regex pattern for samples/cells to include. If set, only these samples/cells are included. Default = all cells.
  • -w <list of int>, Maximum weight values. Default = 100, 200, ..., 1000'.
  • -FN <float>, Estimated FN rate (for CellPhy and infSCITE: inferred from .log/stdout file).
  • -FP <float>, Estimated FP rate (for CellPhy and infSCITE: inferred from .log/stdout file).

Example

To run the PT test on the simulated data in the example_data folder, execute

python run_PT_test.py example_data example_data/data_simulated_clock.vcf.gz example_data/data_simulated_clock.raxml.bestTree

or

python run_PT_test.py example_data example_data/data_simulated_noclock.vcf.gz example_data/data_simulated_noclock.raxml.bestTree

The former data is simulated under a molecular clock, the later with a deviation from the clock (evolutionary rate amplified by 5x in a subtree)

Note

FN and FP rate are inferred from the .raxml.log file