High-throughput sequencing of T- and B-cell receptors makes it possible to track immune repertoires across time, in different tissues, in acute and chronic diseases or in healthy individuals. However, quantitative comparison between repertoires is confounded by variability in the read count of each receptor clonotype due to sampling, library preparation, and expression noise. We present an easy-to-use Python package, NoisET, that implements and generalizes a Bayesian method previously developed in Puelma Touzel et al, 2020. It can be used to learn experimental noise models for repertoire sequencing from replicates, and to detect responding clones following a stimulus. The package was tested on different repertoire sequencing technologies and datasets. The NoisET package is described here.
* NoisET should be pronounced as "noisettes" (ie hazelnuts in French).
Extensive documentation can be found here.
Python 3
NoisET is a Python 3.6 software package. It is available on PyPI and can be downloaded and installed through pip:
$ pip install noisets
Note that data pre-processing, diversity estimates and generation of neutral TCR clonal dynamics are not yet available when NoisET is installed with pip. For these features, use only the sudo command below.
To install NoisET and try the tutorial displayed in this GitHub repository, clone the repository (git clone) into your working environment. Using the terminal, go to the NoisET directory and run the following command:
$ sudo python setup.py install
If you do not have the following Python libraries, which NoisET relies on (numpy, pandas, matplotlib, seaborn, scipy, scikit-learn), first try to install the dependencies separately with the following commands:
python -m pip install -U pip
python -m pip install -U matplotlib
pip install numpy
pip install pandas
pip install seaborn
pip install scipy
pip install -U scikit-learn
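A quick way to see which of the libraries listed above are already importable is sketched below. It uses only the standard library; the helper name `missing_dependencies` is an illustrative assumption and is not part of NoisET.

```python
from importlib.util import find_spec

# Import names of the libraries listed above
# (scikit-learn is imported under the name "sklearn").
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "scipy", "sklearn"]


def missing_dependencies(modules=REQUIRED):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_dependencies()
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All listed libraries are importable.")
```

Any name reported as missing can then be installed with the corresponding pip command.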
A tutorial is available at https://github.com/mbensouda/NoisET_tutorial . Three commands are available:
noiset-noise
To infer the null noise model: NoisET first function (1)
noiset-nullgenerator
To qualitatively check the consistency of NoisET first function
noiset-detection
To detect responding clones to a stimulus: NoisET second function (2)
All options are described by typing one of the previous commands followed by --help or -h. Options are also described in the rest of this README.
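The help text can also be retrieved from Python. The sketch below is only a convenience wrapper (the function name `show_help` is hypothetical); it first checks that the command is on the PATH, so it does nothing harmful when NoisET is not installed.

```python
import shutil
import subprocess

COMMANDS = ["noiset-noise", "noiset-nullgenerator", "noiset-detection"]


def show_help(command):
    """Run `<command> --help` if the command is on PATH; return its text, else None."""
    if shutil.which(command) is None:
        return None  # the command-line tool is not installed in this environment
    result = subprocess.run([command, "--help"], capture_output=True, text=True)
    return result.stdout


for command in COMMANDS:
    text = show_help(command)
    print(command, "->", "not found" if text is None else "help available")
```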
Notebooks are also available.
The Python package makes it possible to manipulate longitudinal RepSeq data and to find the relevant time points to compare when studying TCR repertoire dynamics after an acute stimulus. The example notebook reproduces the analysis published in https://elifesciences.org/articles/63502 (PCA analysis of clonal abundance trajectories) and provides additional tools to manipulate longitudinal RepSeq data. Go check: here.
data_pre = ns.longitudinal_analysis(patient, data_folder)
This object has the following methods:
.import_clones(args)
to import all the clonotypes of a given patient and store them in a dictionary. It also returns the list of ordered time points of the longitudinal dataset.
Parameters
----------
patient : str
The ID of the patient
data_folder : str
The name of the folder to find data
Returns
-------
dictionary
a dictionary of data_frames giving all the samples of the patient.
numpy vector
a vector containing all the RepSeq sampling times ordered.
.get_top_clones_set(args)
to get the n_top_clones TCR nucleotide sequences of the patient of interest at every time point.
Parameters
----------
n_top_clones : int
the number of most abundant TCR clones to extract from each RepSeq sample
Returns
-------
list of str
list of TCR nucleotide sequences of each RepSeq sample
.build_traj_frame(args)
to build a dataframe with the abundance trajectories of the n_top_clones TCRs of the patient of interest at every time point
Parameters
----------
top_clones : list
list of TCR nucleotide sequences for which to build the trajectories dataframe; it is the output of .get_top_clones_set()
Returns
-------
data-frame
abundance trajectories of the n_top_clones TCRs of the patient of interest at every time point
.PCA_traj(args)
to get PCA and clustering objects as in scikit-learn PCA and scikit-learn clustering.
Parameters
----------
n_top_clones : int
the number of most abundant TCR clones to extract from each RepSeq sample
Returns
-------
scikit learn objects
Other methods to manipulate and visualize longitudinal RepSeq abundance data are provided.
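The idea behind the methods above (collect the top clones of every time point, build their abundance trajectories, then reduce them with PCA) can be illustrated with a small stand-alone sketch. This is not the NoisET implementation: the data are made up, the function names are hypothetical, and the PCA is done with a plain NumPy SVD instead of scikit-learn objects.

```python
import numpy as np

# Toy longitudinal data: {time point: {CDR3 nucleotide sequence: clone count}}.
# Sequences and counts are invented for illustration.
samples = {
    0: {"TGTGCC": 50, "TGTAGC": 30, "TGTGCA": 5},
    15: {"TGTGCC": 40, "TGTAGC": 300, "TGTTCA": 20},
    45: {"TGTGCC": 45, "TGTAGC": 90, "TGTTCA": 2},
}


def top_clones_set(samples, n_top_clones):
    """Union of the n_top_clones most abundant sequences of every time point."""
    top = set()
    for counts in samples.values():
        ranked = sorted(counts, key=counts.get, reverse=True)
        top.update(ranked[:n_top_clones])
    return sorted(top)


def trajectory_matrix(samples, clones):
    """Rows are clones, columns are ordered time points, entries are frequencies."""
    times = sorted(samples)
    matrix = np.zeros((len(clones), len(times)))
    for j, t in enumerate(times):
        total = sum(samples[t].values())
        for i, clone in enumerate(clones):
            matrix[i, j] = samples[t].get(clone, 0) / total
    return times, matrix


def pca_scores(matrix, n_components=2):
    """Project centred trajectories on their leading principal components (SVD)."""
    centred = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


clones = top_clones_set(samples, n_top_clones=2)
times, traj = trajectory_matrix(samples, clones)
scores = pca_scores(traj)
print(clones, times, traj.shape, scores.shape)
```

Clones absent from a sample are given a frequency of 0 here, which is a simplifying choice of this sketch.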
To infer null noise model: NoisET first function (1), use the command noiset-noise
At the command prompt, type:
$ noiset-noise --path 'DATA_REPO/' --f1 'FILENAME1_X_REP1' --f2 'FILENAME2_X_REP2' --(noisemodel)
Several options are needed to learn the noise model from two replicate samples associated with one individual at a specific time point:
--path 'PATHTODATA'
: set path to data file
--f1 'FILENAME1_X_REP1'
: filename for individual X replicate 1
--f2 'FILENAME2_X_REP2'
: filename for individual X replicate 2
If the features of your TCR CDR3 clonal populations (i.e. clonal fractions, clonal counts, clonal CDR3 nucleotide sequences and clonal CDR3 amino acid sequences) have column names other than ('Clone fraction', 'Clone count', 'N. Seq. CDR3', 'AA. Seq. CDR3'), you can specify the names directly by using:
--specify
--freq 'frequency'
: Column label associated with clonal fraction
--counts 'counts'
: Column label associated with clonal count
--ntCDR3 'ntCDR3'
: Column label associated with clonal CDR3 nucleotide sequence
--AACDR3 'AACDR3'
: Column label associated with clonal CDR3 amino acid sequence
--NBPoisson
: Negative Binomial + Poisson Noise Model - 5 parameters
--NB
: Negative Binomial - 4 parameters
--Poisson
: Poisson - 2 parameters
At the command prompt, type:
$ noiset-noise --path 'data_examples/' --f1 'Q1_0_F1_.txt.gz' --f2 'Q1_0_F2_.txt.gz' --NB
This command line learns the four parameters associated with the negative binomial null noise model --NB for individual Q1 at day 0.
A '.txt' file is created in the working directory: it contains a table of 5, 4 or 2 parameters, depending on the noise model (NBP, NB or Poisson). In this example, it is a four-parameter table (already created in the data_examples repository).
You can run the previous examples using the data (Q1 day 0 / day 15) provided in the data_examples folder. The data come from Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins, Pogorelyy et al, PNAS.
At the command prompt, type:
$ noiset-noise --path 'data_examples/' --f1 'replicate_1_1.tsv.gz' --f2 'replicate_1_2.tsv.gz' --specify --freq 'frequencyCount' --counts 'count' --ntCDR3 'nucleotide' --AACDR3 'aminoAcid' --NB
As previously, this command learns the four parameters associated with the negative binomial null noise model --NB for one individual in the cohort produced in Model to improve specificity for identification of clinically-relevant expanded T cells in peripheral blood, Rytlewski et al, PLOS ONE.
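When many samples have to be processed, it can help to assemble the command line programmatically. The sketch below only builds the argument list from the options described above; the helper name `noiset_noise_command` and its keyword arguments are illustrative assumptions, not part of NoisET.

```python
import shlex


def noiset_noise_command(path, f1, f2, model="NB", columns=None):
    """Assemble the noiset-noise argument list from the documented options.

    `columns`, if given, maps the --specify sub-options (freq, counts, ntCDR3,
    AACDR3) to the column labels used in the data files.
    """
    if model not in ("NBPoisson", "NB", "Poisson"):
        raise ValueError("model must be 'NBPoisson', 'NB' or 'Poisson'")
    argv = ["noiset-noise", "--path", path, "--f1", f1, "--f2", f2]
    if columns:
        argv.append("--specify")
        for option in ("freq", "counts", "ntCDR3", "AACDR3"):
            argv += ["--" + option, columns[option]]
    argv.append("--" + model)
    return argv


argv = noiset_noise_command(
    "data_examples/",
    "replicate_1_1.tsv.gz",
    "replicate_1_2.tsv.gz",
    columns={"freq": "frequencyCount", "counts": "count",
             "ntCDR3": "nucleotide", "AACDR3": "aminoAcid"},
)
# Print a shell-ready version of the command.
print(" ".join(shlex.quote(arg) for arg in argv))
```

The returned list can be passed as is to `subprocess.run(argv)` once NoisET is installed.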
Python users can also use NoisET as a package by importing it as shown below. A Jupyter notebook explaining the use of all the functions of interest is provided: NoisET example - Null model learning.ipynb
import noisets
from noisets import noisettes as ns
You can download the Jupyter notebook and modify it with your own PATHTODATA / datafile specificities.
Once the noise model has been learnt in a first step, it can be used to compute a diversity estimate. Go check: here
null_model = ns.Noise_Model()
null_model.diversity_estimate(args)
Compute the diversity estimate from the data and the inferred noise model.
Parameters
----------
df : data-frame
The data-frame which has been used to learn the noise model
paras : numpy array
vector containing the noise parameters
noise_model : int
choice of noise model
Returns
-------
float
diversity estimate from the noise model inference.
To qualitatively check the consistency of NoisET first function (1) with experiments, or for other reasons, it can be useful to generate synthetic replicates from the null model (described in the Methods section). One can also generate the dynamics of healthy RepSeq samples by using the noise model learnt in a first step and giving the time scales of the repertoire turnover dynamics, as defined in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. Check here.
To generate synthetic TCR RepSeq data replicates having chosen sampling noise characteristics, use the command noiset-nullgenerator
$ noiset-nullgenerator --(noise-model) --nullpara 'NULLPARAS' --NreadsI float --NreadsII float --Nclones float --output 'SYNTHETICDATA'
The user must choose one of the three possible models for the probability that a TCR has an empirical count n knowing that its true frequency is f, P(n|f): a Poisson distribution --Poisson, a negative binomial distribution --NB, or a two-step model combining a negative binomial and a Poisson distribution --NBPoisson. n is the empirical clone size and depends on the experimental protocol.
For each P(n|f), a set of parameters is learnt.
--NBPoisson
: Negative Binomial + Poisson Noise Model - 5 parameters, described in Puelma Touzel et al, 2020: power-law exponent of the clonotype frequency distribution 'alph_rho', minimum of the clonotype frequency distribution 'fmin', 'beta' and 'alpha', parameters of the negative binomial distribution constraining the mean and variance of the P(m|f) distribution (m being the number of cells associated with a clonotype in the experimental sample), and 'm_total', the total number of cells in the sample of interest.
--NB
: Negative Binomial - 4 parameters: power-law exponent of the clonotype frequency distribution (same ansatz as in Puelma Touzel et al, 2020) 'alph_rho', minimum of the clonotype frequency distribution 'fmin', 'beta' and 'alpha', parameters of the negative binomial distribution constraining the mean and variance of the P(n|f) distribution: NB(f Nreads, f Nreads + beta (f Nreads)^alpha). (Nreads is the total number of reads in the sample of interest.)
--Poisson
: Poisson - 2 parameters: power-law exponent of the clonotype frequency distribution (same ansatz as in Puelma Touzel et al, 2020) 'alph_rho' and minimum of the clonotype frequency distribution 'fmin'. P(n|f) is a Poisson distribution of parameter f Nreads. (Nreads is the total number of reads in the sample of interest.)
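To make the role of P(n|f) concrete, the sketch below draws a read count for a clone of frequency f under the Poisson and negative binomial models. It reads the negative binomial as having mean f·Nreads and variance f·Nreads + beta·(f·Nreads)^alpha, which is an interpretation of the description above; the function names and numerical values are illustrative and not taken from NoisET.

```python
import numpy as np


def nb_parameters(mean, variance):
    """Convert (mean, variance) into NumPy's negative binomial (n, p) parameters."""
    if variance <= mean:
        raise ValueError("a negative binomial needs variance > mean")
    p = mean / variance
    n = mean ** 2 / (variance - mean)
    return n, p


def sample_counts(f, n_reads, model, alpha=None, beta=None, rng=None):
    """Draw one read count n for a clone of frequency f under P(n|f)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = f * n_reads
    if model == "Poisson":
        return rng.poisson(mean)
    if model == "NB":
        # Assumed variance law: mean + beta * mean**alpha.
        variance = mean + beta * mean ** alpha
        n, p = nb_parameters(mean, variance)
        return rng.negative_binomial(n, p)
    raise ValueError("this sketch covers only 'Poisson' and 'NB'")


rng = np.random.default_rng(0)
# Illustrative values only; real parameters come from noiset-noise.
print(sample_counts(1e-4, 1_000_000, "Poisson", rng=rng))
print(sample_counts(1e-4, 1_000_000, "NB", alpha=1.5, beta=0.5, rng=rng))
```

The two-step NBPoisson model is left out of this sketch because it also involves the number of cells m, which is not modelled here.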
--nullpara 'PATHTOFOLDER/NULLPARAS.txt'
: parameters learnt with NoisET function (1)
!!! Make sure that the noise model matches the content of the null parameter file: 5 parameters for --NBPoisson, 4 parameters for --NB and 2 parameters for --Poisson.
--NreadsI NNNN
: total number of reads in the first replicate; it should match the actual data. In the example below, it is the sum of 'Clone count' in 'Q1_0_F1_.txt.gz'.
--NreadsII NNNN
: total number of reads in the second replicate; it should match the actual data. In the example below, it is the sum of 'Clone count' in 'Q1_0_F2_.txt.gz'.
--Nclones NNNN
: total number of clones in the union of the two replicates; it should match the actual data. In the example below, it is the number of clones present in the union of both replicates: 'Q1_0_F1_.txt.gz' and 'Q1_0_F2_.txt.gz'.
--output 'SYNTHETICDATA'
: name of the output file where you can find the synthetic data set.
At the command prompt, type
$ noiset-nullgenerator --NB --nullpara 'data_examples/nullpara1.txt' --NreadsI 829578 --NreadsII 954389 --Nclones 776247 --output 'test'
Running this line creates a 'synthetic_test.csv' file with four columns: 'Clone_count_1', 'Clone_count_2', 'Clone_fraction_1', 'Clone_fraction_2', respectively the synthetic read counts and frequencies that you would have found in an experimental sample with the same learnt parameters 'nullpara1.txt', 'NreadsI', 'NreadsII' and 'Nclones'.
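The values passed to --NreadsI, --NreadsII and --Nclones can be computed from the replicate tables themselves. The sketch below shows one way to do it with the standard library on two toy tab-separated tables that use the default column labels; the helper names are hypothetical and the toy data are invented.

```python
import csv
import io

# Two toy replicate tables (tab-separated) with the default column labels.
REP1 = "Clone count\tN. Seq. CDR3\n10\tTGTGCC\n3\tTGTAGC\n"
REP2 = "Clone count\tN. Seq. CDR3\n7\tTGTGCC\n1\tTGTTCA\n"


def read_counts(handle, count_col="Clone count", seq_col="N. Seq. CDR3"):
    """Return {nucleotide sequence: summed read count} for one replicate."""
    counts = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row[seq_col]] = counts.get(row[seq_col], 0) + int(row[count_col])
    return counts


def generator_sizes(counts_1, counts_2):
    """Total reads per replicate and number of clones in the union of both."""
    n_reads_1 = sum(counts_1.values())
    n_reads_2 = sum(counts_2.values())
    n_clones = len(set(counts_1) | set(counts_2))
    return n_reads_1, n_reads_2, n_clones


sizes = generator_sizes(read_counts(io.StringIO(REP1)),
                        read_counts(io.StringIO(REP2)))
print(sizes)
```

For real gzipped files, the `io.StringIO` handles can be replaced with `gzip.open(path, "rt")`.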
Python users can also use NoisET as a package by importing it as shown below. A Jupyter notebook explaining the use of all the functions of interest is provided: NoisET example - Null model learning.ipynb
import noisets
from noisets import noisettes as ns
You can download the Jupyter notebook and modify it with your own PATHTODATA / datafile specificities - visualization tools are also provided.
cl_rep_gen_gDNA = ns.Generator()
To generate synthetic TCR RepSeq data replicates having chosen sampling noise characteristics.
cl_rep_gen_gDNA.gen_synthetic_data_Null(args)
Parameters
----------
paras : numpy array
parameters of the noise model
noise_model : int
choice of noise model 0: Poisson, 1: negative Binomial, 2: negative Binomial + Poisson
NreadsI : float
total number of reads in first replicate
NreadsII : float
total number of reads in second replicate
Nsamp : float
total number of clones in union of two replicates
Returns
-------
data-frame - csv file
the output is a csv file of columns : 'Clone_count_1' (first replicate) 'Clone_count_2' (second replicate) and the frequency counterparts 'Clone_fraction_1', and 'Clone_fraction_2'
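As an intuition for what a synthetic null replicate pair looks like, here is a deliberately simplified, Poisson-only sketch: clone frequencies are drawn from a power law with exponent alph_rho above fmin, and each replicate is an independent Poisson sampling of the same frequencies. It does not reproduce `gen_synthetic_data_Null` (for instance, frequencies are not constrained to sum to one), and all function names are hypothetical.

```python
import numpy as np


def sample_power_law(n_clones, alph_rho, fmin, rng):
    """Inverse-CDF sampling of frequencies with density ~ f**(-alph_rho) on [fmin, 1]."""
    if alph_rho == 1:
        raise ValueError("this sketch assumes alph_rho != 1")
    u = rng.random(n_clones)
    a = 1.0 - alph_rho
    return (fmin ** a + u * (1.0 - fmin ** a)) ** (1.0 / a)


def synthetic_replicates(alph_rho, fmin, n_reads_1, n_reads_2, n_clones, seed=0):
    """Poisson-only toy version of a null replicate pair."""
    rng = np.random.default_rng(seed)
    f = sample_power_law(n_clones, alph_rho, fmin, rng)
    # Both replicates sample the same underlying frequencies independently.
    count_1 = rng.poisson(f * n_reads_1)
    count_2 = rng.poisson(f * n_reads_2)
    return {
        "Clone_count_1": count_1,
        "Clone_count_2": count_2,
        "Clone_fraction_1": count_1 / max(count_1.sum(), 1),
        "Clone_fraction_2": count_2 / max(count_2.sum(), 1),
    }


# Illustrative parameter values, not learnt from data.
table = synthetic_replicates(alph_rho=2.0, fmin=1e-6,
                             n_reads_1=1e5, n_reads_2=1e5, n_clones=1000)
print({name: column[:5] for name, column in table.items()})
```

Plotting 'Clone_count_1' against 'Clone_count_2' gives a quick visual impression of replicate-to-replicate variability under pure sampling noise.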
cl_neutral_dyn = ns.Generator()
cl_neutral_dyn.generate_trajectories(args)
To generate synthetic neutral dynamics of TCR RepSeq data.
Parameters
----------
paras_1 : numpy array
parameters of the noise model that has been learnt at time_1
paras_2 : numpy array
parameters of the noise model that has been learnt at time_2
method : str
'negative_binomial' or 'poisson'
tau : float
first time-scale parameter of the dynamics
theta : float
second time-scale parameter of the dynamics
t_ime : float
number of years between the two synthetic samplings (between time_1 and time_2)
filename : str
name of the file in which the dataframe is stored
Returns
-------
data-frame - csv file
the output is a csv file of columns : 'Clone_count_1' (at time_1) 'Clone_count_2' (at time_2) and the frequency counterparts 'Clone_frequency_1' and 'Clone_frequency_2'
Detect responding clones to a stimulus: NoisET second function (2)
To detect responding clones from two RepSeq data at time_1 and time_2, use the command noiset-detection
$ noiset-detection --(noisemodel) --nullpara1 'FILEFORPARAS1' --nullpara2 'FILEFORPARAS2' --path 'REPO/' --f1 'FILENAME_TIME_1' --f2 'FILENAME_TIME_2' --pval float --smedthresh float --output 'DETECTIONDATA'
Several options are needed to detect responding clones between two samples of one individual taken at two time points:
--NBPoisson
: Negative Binomial + Poisson Noise Model - 5 parameters
--NB
: Negative Binomial - 4 parameters
--Poisson
: Poisson - 2 parameters
(they can be the same for both time points if replicates are not available, but this should be used carefully, as mentioned in [ARTICLE])
--nullpara1 'PATH/FOLDER/NULLPARAS1.txt'
: parameters learnt with NoisET function (1) for time 1
--nullpara2 'PATH/FOLDER/NULLPARAS2.txt'
: parameters learnt with NoisET function (1) for time 2
!!! Make sure that the noise model matches the content of the null parameter files: 5 parameters for --NBPoisson, 4 parameters for --NB and 2 parameters for --Poisson.
--path 'PATHTODATA'
: set path to data file
--f1 'FILENAME1_X_time1'
: filename for individual X time 1
--f2 'FILENAME2_X_time2'
: filename for individual X time 2
If the features of your TCR CDR3 clonal populations (i.e. clonal fractions, clonal counts, clonal CDR3 nucleotide sequences and clonal CDR3 amino acid sequences) have column names other than ('Clone fraction', 'Clone count', 'N. Seq. CDR3', 'AA. Seq. CDR3'), you can specify the names by using:
--specify
--freq 'frequency'
: Column label associated with clonal fraction
--counts 'counts'
: Column label associated with clonal count
--ntCDR3 'ntCDR3'
: Column label associated with clonal CDR3 nucleotide sequence
--AACDR3 'AACDR3'
: Column label associated with clonal CDR3 amino acid sequence
--pval XXX
: p-value threshold for the expansion/contraction - use 0.05 as a default value.
--smedthresh XXX
: log fold change median threshold for the expansion/contraction - use 0 as a default value.
--output 'DETECTIONDATA'
: name of the output file (.csv) containing the list of putative responding clones with their statistical features. (More details in the Methods section.)
At the command prompt, type
$ noiset-detection --NB --nullpara1 'data_examples/nullpara1.txt' --nullpara2 'data_examples/nullpara1.txt' --path 'data_examples/' --f1 'Q1_0_F1_.txt.gz' --f2 'Q1_15_F1_.txt.gz' --pval 0.05 --smedthresh 0 --output 'detection'
Output: a table containing all putative detected clones with statistical features of the log-fold change variable s; for a more theoretical description, see Puelma Touzel et al, 2020.
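The two thresholds used in the command above can be thought of as a row filter on a table of per-clone statistics: a clone is kept when its p-value is below --pval and its median log-fold change is above --smedthresh. The sketch below illustrates this reading on invented rows; the dictionary keys and the function name are hypothetical, and the exact selection rule applied by NoisET may differ.

```python
def select_expanded(rows, pval_threshold=0.05, smed_threshold=0.0):
    """Keep clones with a small p-value and a median log-fold change above threshold."""
    return [
        row for row in rows
        if row["p_value"] < pval_threshold and row["s_med"] > smed_threshold
    ]


# Invented per-clone statistics, for illustration only.
rows = [
    {"CDR3_nt": "TGTGCC", "s_med": 2.1, "p_value": 0.001},
    {"CDR3_nt": "TGTAGC", "s_med": 0.3, "p_value": 0.40},
    {"CDR3_nt": "TGTTCA", "s_med": -1.5, "p_value": 0.01},
]

expanded = select_expanded(rows)
print([row["CDR3_nt"] for row in expanded])
```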
Python users can also use NoisET as a package by importing it as shown below. A Jupyter notebook explaining the use of all the functions of interest is provided: NoisET example - detection responding clones.ipynb
import noisets
from noisets import noisettes as ns
expansion = ns.Expansion_Model()
expansion.expansion_table(args)
To detect expanded clones from longitudinal data-set
Parameters
----------
outpath : str
Name of the directory where to store the output table
paras_1 : numpy array
parameters of the noise model that has been learnt at time_1
paras_2 : numpy array
parameters of the noise model that has been learnt at time_2
df : pandas dataframe
pandas dataframe merging the two RepSeq data at time_1 and time_2
noise_model : int
choice of noise model 0: Poisson, 1: negative Binomial, 2: negative Binomial + Poisson
pval_threshold : float
P-value threshold to detect and discriminate if a TCR clone has expanded
smed_threshold : float
median of the log-fold change threshold to detect if a TCR clone has expanded
Returns
-------
data-frame - csv file
the output is a csv file of columns : $s_{1-low}$, $s_{2-med}$, $s_{3-high}$, $s_{max}$, $\bar{s}$, $f_1$, $f_2$, $n_1$, $n_2$, 'CDR3_nt', 'CDR3_AA' and '$p$-value'
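The `df` parameter above is described as a merge of the two RepSeq tables. A minimal stand-alone sketch of such a merge is given below, using plain dictionaries instead of pandas: clones are matched on their CDR3 nucleotide sequence, and a clone absent from one time point gets a count of 0 there. The function name is hypothetical and the zero-filling convention is an assumption of this sketch.

```python
def merge_time_points(counts_1, counts_2):
    """Outer-join two {sequence: count} tables on the CDR3 nucleotide sequence."""
    total_1 = sum(counts_1.values())
    total_2 = sum(counts_2.values())
    merged = []
    for seq in sorted(set(counts_1) | set(counts_2)):
        n_1 = counts_1.get(seq, 0)  # 0 when the clone is unseen at time_1
        n_2 = counts_2.get(seq, 0)  # 0 when the clone is unseen at time_2
        merged.append({
            "CDR3_nt": seq,
            "n_1": n_1, "n_2": n_2,
            "f_1": n_1 / total_1, "f_2": n_2 / total_2,
        })
    return merged


# Invented counts at two time points.
time_1 = {"TGTGCC": 10, "TGTAGC": 3}
time_2 = {"TGTGCC": 7, "TGTTCA": 1}

merged = merge_time_points(time_1, time_2)
for row in merged:
    print(row)
```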
The posterior log-fold change distribution computed after optimizing equation 10 is used to compute the dynamics of each particular clone population (or frequency). Here we explain the different features displayed in the output file 'detectionQ1_0_F1_.txt.gzQ1_15_F1_.txt.gztop_expanded.csv' (noiset-detection example command line).
See the Identifying clones paragraph of [Puelma Touzel et al, 2020].
You can download a Jupyter notebook and modify it with your own PATHTODATA / datafile specificities - visualization tools are also provided.
Any issues or questions should be addressed to us.
Free use of NoisET is granted under the terms of the GNU General Public License version 3 (GPLv3).