
NoisET* NOIse sampling learning & Expansion detection of T-cell receptors using Bayesian inference.

High-throughput sequencing of T- and B-cell receptors makes it possible to track immune repertoires across time, in different tissues, in acute and chronic diseases or in healthy individuals. However, quantitative comparison between repertoires is confounded by variability in the read count of each receptor clonotype due to sampling, library preparation, and expression noise. We present NoisET, an easy-to-use Python package that implements and generalizes the Bayesian method previously developed in Puelma Touzel et al, 2020. It can be used to learn experimental noise models for repertoire sequencing from replicates, and to detect responding clones following a stimulus. The package was tested on different repertoire sequencing technologies and datasets. The NoisET package is described here.

* NoisET should be pronounced as "noisettes" (i.e. hazelnuts in French).


Extensive documentation can be found here.

Installation

Python 3

NoisET is Python 3.6 software. It is available on PyPI and can be downloaded and installed through pip:

$ pip install noisets

Note: data pre-processing, diversity estimates and generation of neutral TCR clonal dynamics are not yet available when NoisET is installed with pip. To use these features, install from source with the sudo command below.

To install NoisET and try the tutorial displayed in this GitHub repository, git clone the repository into your working environment. In a terminal, go to the NoisET directory and run the following command:

$ sudo python setup.py install

If you do not have the following Python libraries, which NoisET relies on (numpy, pandas, matplotlib, seaborn, scipy, scikit-learn), first try installing the dependencies separately with the following commands:

python -m pip install -U pip
python -m pip install -U matplotlib
pip install numpy
pip install pandas
pip install scipy
pip install seaborn
pip install -U scikit-learn

Documentation

A tutorial is available at https://github.com/mbensouda/NoisET_tutorial . Three commands are available to use :

  • noiset-noise To infer Null noise model: NoisET first function (1)
  • noiset-nullgenerator To qualitatively check consistency of NoisET first function
  • noiset-detection To detect responding clones to a stimulus: NoisET second function (2)

All options are described by typing one of the previous commands followed by --help or -h. Options are also described in this README. Notebooks are also available.

1/ Data pre-processing

The Python package makes it possible to manipulate longitudinal RepSeq data and find the relevant time points to compare when studying TCR repertoire dynamics after an acute stimulus. The example notebook reproduces the analysis published in https://elifesciences.org/articles/63502 (PCA analysis of clonal abundance trajectories) and provides additional tools to manipulate longitudinal RepSeq data. Go check : here.

data_pre = ns.longitudinal_analysis(patient, data_folder)

This object has the following methods:

.import_clones(args)

to import all the clonotypes of a given patient and store them in a dictionary. It also returns the list of ordered time points of the longitudinal dataset.

Parameters
----------
patient : str
    The ID of the patient
data_folder : str
    The name of the folder to find data

Returns
-------
dictionary
    a dictionary of data_frames giving all the samples of the patient.
   
numpy vector
    a vector containing all the RepSeq sampling times ordered.
.get_top_clones_set(args)

to get the nucleotide sequences of the n_top_clones most abundant TCRs of the patient of interest at every time point.

Parameters
----------
n_top_clones : int
    the number of most abundant TCR clones to extract from each RepSeq sample

Returns
-------
list of str
    list of TCR nucleotide sequences of each RepSeq sample
.build_traj_frame(args) 

to build a dataframe with the abundance trajectories of the n_top_clones TCRs of the patient of interest at every time point.

Parameters
----------
top_clones : list
    list of TCR nucleotide sequences for which to build the trajectories dataframe; it is the output of .get_top_clones_set()

Returns
-------
data-frame
    abundance trajectories of the n_top_clones TCRs of the patient of interest at every time point
.PCA_traj(args)

to get pca and clustering objects as in the scikit-learn-PCA and scikit-learn-clustering.

Parameters
----------
n_top_clones : int
    the number of most abundant TCR clones to extract from each RepSeq sample

Returns
-------
scikit learn objects

Other methods to manipulate and visualize longitudinal RepSeq abundance data are provided.
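
To make the logic of the two steps above concrete, here is a minimal standalone sketch of selecting top clones and building their trajectories. It does not use NoisET and is not its implementation: the function names, the toy sequences and the choice to take the union of the top clones across time points are assumptions made for illustration.

```python
def top_clones_set(samples, n_top_clones):
    """Union of the n_top_clones most abundant sequences of every time point."""
    top = set()
    for counts in samples.values():
        # Rank the sequences of this sample by decreasing read count.
        ranked = sorted(counts, key=counts.get, reverse=True)
        top.update(ranked[:n_top_clones])
    return sorted(top)


def build_trajectories(samples, clones):
    """Frequency of each clone at every ordered time point (0 if absent)."""
    times = sorted(samples)
    totals = {t: sum(samples[t].values()) for t in times}
    traj = {seq: [samples[t].get(seq, 0) / totals[t] for t in times]
            for seq in clones}
    return times, traj


# Toy data (hypothetical): time point in days -> {CDR3 nucleotide sequence: count}
samples = {
    0: {"TGTGCC": 50, "TGTAGC": 30, "TGTGCA": 20},
    15: {"TGTGCC": 10, "TGTAGC": 60, "TGTTTT": 30},
}
clones = top_clones_set(samples, n_top_clones=2)
times, traj = build_trajectories(samples, clones)
for seq in clones:
    print(seq, traj[seq])
```

A table of this shape (one row per clone, one column per time point) is the kind of input on which a PCA of abundance trajectories can then be run.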

2/ Infer noise model

A/ Command line

To infer null noise model: NoisET first function (1), use the command noiset-noise

At the command prompt, type:

$ noiset-noise --path 'DATA_REPO/' --f1 'FILENAME1_X_REP1' --f2 'FILENAME2_X_REP2' --(noisemodel)

Several options are needed to learn noise model from two replicate samples associated to one individual at a specific time point:

1/ Data information:

  • --path 'PATHTODATA': set path to data file
  • --f1 'FILENAME1_X_REP1': filename for individual X replicate 1
  • --f2 'FILENAME2_X_REP2': filename for individual X replicate 2

If the features of your TCR CDR3 clonal populations (i.e. clonal fractions, clonal counts, clonal nucleotide CDR3 sequences and clonal amino acid sequences) have column names other than ('Clone fraction', 'Clone count', 'N. Seq. CDR3', 'AA. Seq. CDR3'), you can specify the names directly by using:

  • --specify
  • --freq 'frequency' : Column label associated to clonal fraction
  • --counts 'counts': Column label associated to clonal count
  • --ntCDR3 'ntCDR3': Column label associated to clonal CDR3 nucleotides sequence
  • --AACDR3 'AACDR3': Column label associated to clonal CDR3 amino acid sequence

2/ Choice of noise model: (parameters meaning described in Methods section)

  • --NBPoisson: Negative Binomial + Poisson Noise Model - 5 parameters
  • --NB: Negative Binomial - 4 parameters
  • --Poisson: Poisson - 2 parameters

3/ Example:

At the command prompt, type:

$ noiset-noise --path 'data_examples/' --f1 'Q1_0_F1_.txt.gz' --f2 'Q1_0_F2_.txt.gz' --NB

This command line learns the four parameters of the negative binomial null noise model --NB for individual Q1 at day 0. A '.txt' file is created in the working directory: it contains 5, 4 or 2 parameters depending on the noise model (NBPoisson, NB or Poisson). In this example, it is a four-parameter table (already created in the data_examples repository). You can run the previous examples using the data (Q1 day 0 / day 15) provided in the data_examples folder - data from Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins, Pogorelyy et al, PNAS.

4/ Example with --specify:

At the command prompt, type:

$ noiset-noise --path 'data_examples/' --f1 'replicate_1_1.tsv.gz' --f2 'replicate_1_2.tsv.gz' --specify --freq 'frequencyCount' --counts 'count' --ntCDR3 'nucleotide' --AACDR3 'aminoAcid' --NB

As previously this command enables us to learn four parameters associated to negative binomial null noise model --NB for one individual in cohort produced in Model to improve specificity for identification of clinically-relevant expanded T cells in peripheral blood, Rytlewski et al, PLOS ONE.
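
If you need to learn a noise model for many replicate pairs, it can be convenient to assemble the command from a script. The sketch below only builds the argument list from the options documented above; the helper name `noiset_noise_command` is hypothetical and nothing is executed.

```python
import shlex

MODELS = ("NBPoisson", "NB", "Poisson")  # model flags listed in this README


def noiset_noise_command(path, f1, f2, model, columns=None):
    """Return the noiset-noise argument list for one pair of replicates."""
    if model not in MODELS:
        raise ValueError("unknown noise model: " + model)
    args = ["noiset-noise", "--path", path, "--f1", f1, "--f2", f2]
    if columns is not None:
        # Custom column labels, in the order used in the --specify example.
        args.append("--specify")
        for flag in ("freq", "counts", "ntCDR3", "AACDR3"):
            args += ["--" + flag, columns[flag]]
    args.append("--" + model)
    return args


cmd = noiset_noise_command("data_examples/", "Q1_0_F1_.txt.gz",
                           "Q1_0_F2_.txt.gz", "NB")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(a) for a in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once NoisET is installed.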

B/ Python package

For Python users, it is possible to use NoisET as a package importing it as mentioned before. A jupyter notebook explaining the use of all the functions of interest is provided: NoisET example - Null model learning.ipynb

import noisets
from noisets import noisettes as ns

You can download the Jupyter notebook and modify it with your own PATHTODATA / datafile specificities.

3/ Diversity estimator:

Once the noise model has been learnt in a first step, it can be used to estimate diversity. Go check : here

null_model = ns.Noise_Model()
null_model.diversity_estimate(args)

Compute the diversity estimate from data and the inferred noise model.

Parameters
----------
df : data-frame 
    The data-frame which has been used to learn the noise model
paras : numpy array
    vector containing the noise parameters
noise_model : int
    choice of noise model 

Returns
-------
float
    diversity estimate from the noise model inference.

4/ Generate synthetic data from null model learning:

To qualitatively check the consistency of NoisET first function (1) with experiments, or for other reasons, it can be useful to generate synthetic replicates from the null model (described in Methods section). One can also generate the dynamics of healthy RepSeq samples using the noise model learnt in a first step and giving the time scales of the repertoire turnover dynamics as defined in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. Check here.

A/ Command line

To generate synthetic TCR RepSeq data replicates having chosen sampling noise characteristics, use the command noiset-nullgenerator

$ noiset-nullgenerator --(noise-model) --nullpara 'NULLPARAS' --NreadsI float --NreadsII float --Nclones float --output 'SYNTHETICDATA'  

1/ Choice of noise model:

The user must choose one of the three possible models for the probability that a TCR has an empirical count n knowing that its true frequency is f, P(n|f): a Poisson distribution --Poisson, a negative binomial distribution --NB, or a two-step model combining a negative binomial and a Poisson distribution --NBPoisson. n is the empirical clone size and depends on the experimental protocol. For each P(n|f), a set of parameters is learnt.

  • --NBPoisson: Negative Binomial + Poisson Noise Model - 5 parameters described in Puelma Touzel et al, 2020: power-law exponent of the clonotype frequency distribution 'alph_rho', minimum of the clonotype frequency distribution 'fmin', 'beta' and 'alpha', parameters of the negative binomial distribution constraining the mean and variance of the P(m|f) distribution (m being the number of cells associated with a clonotype in the experimental sample), and 'm_total', the total number of cells in the sample of interest.
  • --NB: Negative Binomial - 4 parameters: power-law exponent of the clonotype frequency distribution 'alph_rho' (same ansatz as in Puelma Touzel et al, 2020), minimum of the clonotype frequency distribution 'fmin', 'beta' and 'alpha', parameters of the negative binomial distribution constraining the mean and variance of the P(n|f) distribution: $NB(fN_{reads}, fN_{reads} + \beta (fN_{reads})^{\alpha})$, where the two arguments are the mean and the variance. ($N_{reads}$ is the total number of reads in the sample of interest.)
  • --Poisson: Poisson - 2 parameters: power-law exponent of the clonotype frequency distribution 'alph_rho' (same ansatz as in Puelma Touzel et al, 2020) and minimum of the clonotype frequency distribution 'fmin'. P(n|f) is a Poisson distribution of parameter $fN_{reads}$. ($N_{reads}$ is the total number of reads in the sample of interest.)
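
As a worked illustration of the Poisson and negative binomial choices for P(n|f), the sketch below evaluates both distributions with the standard library only. It assumes the negative binomial is specified by its mean f·Nreads and its variance f·Nreads + beta·(f·Nreads)^alpha; the function names and the numerical values of f, Nreads, alpha and beta are made up for the example and are not NoisET defaults.

```python
import math


def poisson_log_pmf(n, mean):
    """Log-probability of observing n reads under a Poisson with the given mean."""
    return n * math.log(mean) - mean - math.lgamma(n + 1)


def nb_log_pmf(n, mean, var):
    """Log-probability of n under a negative binomial with given mean and variance.

    Requires var > mean. Uses the (r, p) parametrization with
    mean = r (1 - p) / p and var = r (1 - p) / p**2.
    """
    p = mean / var
    r = mean * mean / (var - mean)
    return (math.lgamma(n + r) - math.lgamma(r) - math.lgamma(n + 1)
            + r * math.log(p) + n * math.log(1.0 - p))


def read_count_log_pmf(n, f, n_reads, alpha, beta):
    """log P(n|f) for the negative binomial model (assumed mean/variance form)."""
    mean = f * n_reads
    var = mean + beta * mean ** alpha
    return nb_log_pmf(n, mean, var)


# Illustrative values only: a clone of frequency 1e-4 sequenced with 1e5 reads.
f, n_reads, alpha, beta = 1e-4, 1e5, 1.5, 0.5
for n in (0, 5, 10, 20):
    print(n,
          math.exp(poisson_log_pmf(n, f * n_reads)),
          math.exp(read_count_log_pmf(n, f, n_reads, alpha, beta)))
```

With the same mean, the negative binomial puts more weight on very small and very large counts than the Poisson, which is how it accounts for noise beyond pure sampling.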

2/ Specify learnt parameters:

  • --nullpara 'PATHTOFOLDER/NULLPARAS.txt': parameters learnt thanks to NoisET function (1)
    !!! Make sure the noise model matches the content of the null parameter file: 5 parameters for --NBPoisson, 4 parameters for --NB and 2 parameters for --Poisson.

3/ Sequencing properties of data:

  • --NreadsI NNNN: total number of reads in first replicate - it should match the actual data. In the example below, it is the sum of 'Clone count' in 'Q1_0_F1_.txt.gz'.
  • --NreadsII NNNN: total number of reads in second replicate - it should match the actual data. In the example below, it is the sum of 'Clone count' in 'Q1_0_F2_.txt.gz'.
  • --Nclones NNNN: total number of clones in the union of the two replicates - it should match the actual data. In the example below, it is the number of distinct clones found across the two replicates 'Q1_0_F1_.txt.gz' and 'Q1_0_F2_.txt.gz'.
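
The three numbers above can be computed directly from the two replicate tables. The following sketch shows the arithmetic on toy data; it assumes each replicate has already been reduced to a mapping from CDR3 nucleotide sequence to read count, and the helper name `sequencing_totals` is hypothetical.

```python
def sequencing_totals(rep1, rep2):
    """Return (NreadsI, NreadsII, Nclones) for two replicates.

    rep1 and rep2 map a CDR3 nucleotide sequence to its read count.
    Nclones counts the distinct sequences seen in at least one replicate.
    """
    n_reads_1 = sum(rep1.values())
    n_reads_2 = sum(rep2.values())
    n_clones = len(set(rep1) | set(rep2))
    return n_reads_1, n_reads_2, n_clones


# Toy replicates (made-up sequences and counts).
rep1 = {"TGTGCC": 5, "TGTAGC": 3}
rep2 = {"TGTGCC": 4, "TGTTTT": 1}
print(sequencing_totals(rep1, rep2))
```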

4/ Output file

--output 'SYNTHETICDATA': name of the output file where you can find the synthetic data set.

At the command prompt, type

$ noiset-nullgenerator --NB --nullpara 'data_examples/nullpara1.txt' --NreadsI 829578 --NreadsII 954389 --Nclones 776247 --output 'test'  

Running this line creates a 'synthetic_test.csv' file with four columns: 'Clone_count_1', 'Clone_count_2', 'Clone_fraction_1' and 'Clone_fraction_2', respectively the synthetic read counts and frequencies that you would have found in an experimental sample with the same learnt parameters 'nullpara1.txt', 'NreadsI', 'NreadsII' and 'Nclones'.

B/ Python package

For Python users, it is possible to use NoisET as a package importing it as mentioned before. A jupyter notebook explaining the use of all the functions of interest is provided: NoisET example - Null model learning.ipynb

import noisets
from noisets import noisettes as ns

You can download the Jupyter notebook and modify it with your own PATHTODATA / datafile specificities - visualization tools are also provided.

cl_rep_gen_gDNA = ns.Generator()

To generate synthetic TCR RepSeq data replicates having chosen sampling noise characteristics.

cl_rep_gen_gDNA.gen_synthetic_data_Null(args)
Parameters
----------
paras : numpy array
    parameters of the noise model
noise_model : int
    choice of noise model 0: Poisson, 1: negative Binomial, 2: negative Binomial + Poisson
NreadsI : float
    total number of reads in first replicate
NreadsII : float
    total number of reads in second replicate
Nsamp : float
    total number of clones in union of two replicates

Returns
-------
data-frame - csv file
    the output is a csv file of columns : 'Clone_count_1' (first replicate) 'Clone_count_2' (second replicate) and the frequency counterparts 'Clone_fraction_1', and 'Clone_fraction_2'
cl_neutral_dyn = ns.Generator()
cl_neutral_dyn.generate_trajectories(args)

To generate synthetic neutral dynamics of TCR RepSeq data.

Parameters
----------
paras_1 : numpy array
    parameters of the noise model that has been learnt at time_1
paras_2 : numpy array
    parameters of the noise model that has been learnt at time_2
method : str
    'negative_binomial' or 'poisson'
tau : float
    first time-scale parameter of the dynamics
theta : float
    second time-scale parameter of the dynamics
t_ime : float
    number of years between the two synthetic samplings (between time_1 and time_2)
filename : str
    name of the file in which the dataframe is stored


Returns
-------
data-frame - csv file
    the output is a csv file of columns : 'Clone_count_1' (at time_1) 'Clone_count_2' (at time_2) and the frequency counterparts 'Clone_frequency_1' and 'Clone_frequency_2'

5/ Detect responding clones:

Detects responding clones to a stimulus: NoisET second function (2)

A/ Command line

To detect responding clones from two RepSeq data at time_1 and time_2, use the command noiset-detection

$ noiset-detection --(noisemodel)  --nullpara1 'FILEFORPARAS1' --nullpara2 'FILEFORPARAS2' --path 'REPO/' --f1 'FILENAME_TIME_1' --f2 'FILENAME_TIME_2' --pval float --smedthresh float --output 'DETECTIONDATA' 

Several options are needed to detect responding clones between two samples taken from one individual at two time points:

1/ Choice of noise model:

  • --NBPoisson: Negative Binomial + Poisson Noise Model - 5 parameters
  • --NB: Negative Binomial - 4 parameters
  • --Poisson: Poisson - 2 parameters

2/ Specify learnt parameters for both time points:

(The same parameters can be used for both time points if replicates are not available, but this should be done with care, as mentioned in [ARTICLE].)

  • --nullpara1 'PATH/FOLDER/NULLPARAS1.txt': parameters learnt thanks to NoisET function (1) for time 1
  • --nullpara2 'PATH/FOLDER/NULLPARAS2.txt': parameters learnt thanks to NoisET function (1) for time 2

!!! Make sure the noise model matches the content of the null parameter files: 5 parameters for --NBPoisson, 4 parameters for --NB and 2 parameters for --Poisson.

3/ Data information:

  • --path 'PATHTODATA': set path to data file
  • --f1 'FILENAME1_X_time1': filename for individual X time 1
  • --f2 'FILENAME2_X_time2': filename for individual X time 2

If the features of your TCR CDR3 clonal populations (i.e. clonal fractions, clonal counts, clonal nucleotide CDR3 sequences and clonal amino acid sequences) have column names other than ('Clone fraction', 'Clone count', 'N. Seq. CDR3', 'AA. Seq. CDR3'), you can specify the names by using:

  • --specify
  • --freq 'frequency' : Column label associated to clonal fraction
  • --counts 'counts': Column label associated to clonal count
  • --ntCDR3 'ntCDR3': Column label associated to clonal CDR3 nucleotides sequence
  • --AACDR3 'AACDR3': Column label associated to clonal CDR3 amino acid sequence

4/ Detection thresholds: (More details in Methods section).

  • --pval XXX : p-value threshold for the expansion/contraction - use 0.05 as a default value.
  • --smedthresh XXX : log fold change median threshold for the expansion/contraction - use 0 as a default value.
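
The two thresholds can be read as a joint filter on per-clone statistics. The sketch below illustrates that reading for expansion only, under assumptions: a clone is kept when its p-value is below the p-value threshold and its median log-fold change is above the median threshold. The keys 'p_value' and 's_med', the toy rows and the exact comparison rule are hypothetical and may differ from what NoisET does.

```python
def select_expanded(rows, pval=0.05, smedthresh=0.0):
    """Keep the clones that pass both thresholds (assumed selection rule)."""
    return [row for row in rows
            if row["p_value"] < pval and row["s_med"] > smedthresh]


# Toy per-clone statistics (made-up values).
rows = [
    {"CDR3_nt": "TGTGCC", "s_med": 2.1, "p_value": 0.001},
    {"CDR3_nt": "TGTAGC", "s_med": 0.4, "p_value": 0.300},
    {"CDR3_nt": "TGTTTT", "s_med": -1.5, "p_value": 0.010},
]
expanded = select_expanded(rows, pval=0.05, smedthresh=0.0)
print([row["CDR3_nt"] for row in expanded])
```

Raising the median threshold restricts the selection to clones with a larger inferred fold change, while lowering the p-value threshold makes the selection more conservative.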

5/ Output file

--output 'DETECTIONDATA': name of the output file (.csv) where you can find a list of the putative responding clones with statistics features. (More details in Methods section).

At the command prompt, type

$ noiset-detection --NB  --nullpara1 'data_examples/nullpara1.txt' --nullpara2 'data_examples/nullpara1.txt' --path 'data_examples/' --f1 'Q1_0_F1_.txt.gz' --f2 'Q1_15_F1_.txt.gz' --pval 0.05 --smedthresh 0 --output 'detection' 

Output: table containing all putative detected clones with statistics features about the log-fold change variable s; for a more theoretical description, see Puelma Touzel et al, 2020.

B/ Python package

For Python users, it is possible to use NoisET as a package importing it as mentioned before. A jupyter notebook explaining the use of all the functions of interest is provided: NoisET example - detection responding clones.ipynb

import noisets
from noisets import noisettes as ns

expansion = ns.Expansion_Model()
expansion.expansion_table(args)

To detect expanded clones from longitudinal data-set

Parameters
----------
outpath  : str
    Name of the directory where to store the output table
paras_1  : numpy array
    parameters of the noise model that has been learnt at time_1
paras_2  : numpy array
    parameters of the noise model that has been learnt at time_2
df       : pandas dataframe 
    pandas dataframe merging the two RepSeq data at time_1 and time_2

noise_model : int
    choice of noise model 0: Poisson, 1: negative Binomial, 2: negative Binomial + Poisson  

pval_threshold : float
    P-value threshold to detect and discriminate if a TCR clone has expanded 

smed_threshold : float
    median of the log-fold change threshold to detect if a TCR clone has expanded 

Returns
-------
data-frame - csv file
    the output is a csv file of columns : $s_{1-low}$, $s_{2-med}$, $s_{3-high}$, $s_{max}$, $\bar{s}$, $f_1$, $f_2$, $n_1$, $n_2$, 'CDR3_nt', 'CDR3_AA' and '$p$-value' 

The posterior log-fold change distribution computed after optimizing equation 10 is used to compute the dynamics of each particular clone population (or frequency). Here we explain the different features displayed in the output file 'detectionQ1_0_F1_.txt.gzQ1_15_F1_.txt.gztop_expanded.csv' (from the noiset-detection example command line).

See the "Identifying clones" paragraph of [Puelma Touzel et al, 2020].

You can download a Jupyter notebook and modify it with your own PATHTODATA / datafile specificities - visualization tools are also provided.

Contact

Any issues or questions should be addressed to us.

LICENSE

Free use of NoisET is granted under the terms of the GNU General Public License version 3 (GPLv3).