# RfamGen

We tested all the following commands on macOS (v12.1) and Linux (kernel v3.10).

## Install with conda

```shell
conda env create -f requirements_rfamgen.yaml
conda activate rfamgen
```

RfamGen was developed with Python 3.7 and PyTorch 1.8, but probably works with newer versions. Most processes use the following packages; install them individually if necessary.

- pytorch
- numpy
- pandas
- biopython
- sklearn
- nltk
- infernal
- viennarna
- h5py

All the following commands are examples using RF00234. Computation times on an Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz are given in comments.

## Prepare data

Data preparation is based on a CM file, the covariance-model format used by the Infernal software. Here, we use a CM file downloaded from the Rfam database. If you want to build your own, you can generate one with Infernal's `cmbuild` command. `cmbuild` requires SS_cons (a consensus secondary structure) in a Stockholm-format alignment file; SS_cons can be added manually or by consensus structure estimation tools such as R-scape.

```shell
# Optional: clean the sequences from the Rfam DB.
python scripts/get_tidy_sequences_from_fasta_rfam.py \
--seed_file ./datasets/Rfam.seed \
--rfam RF00234 \
--output_dir datasets/RF00234 \
--cpu 1
```
Help of `scripts/make_onehot_from_traceback.py`:

```
usage: make_onehot_from_traceback.py [-h] [--fasta FASTA]
                                     [--traceback TRACEBACK] --cmfile CMFILE
                                     [--cpu CPU]

optional arguments:
  -h, --help            show this help message and exit
  --fasta FASTA         path to fasta file. Fasta file is automatically
                        aligned to cmfile and its traceback will be converted
                        to onehot.
  --traceback TRACEBACK
                        path to gzipped tracebackfile
  --cmfile CMFILE       path to cm file
  --cpu CPU             CPU cores for cmalign program. (default: 4)
```
```shell
python scripts/make_onehot_from_traceback.py \
--fasta datasets/RF00234/RF00234_unique_seed_removed.fa \
--cmfile ./datasets/RF00234/RF00234.cm \
--cpu 1
# real    3m24.649s
# user    3m0.349s
# sys     0m3.531s
```
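For intuition, one-hot encoding maps each symbol to a unit vector. A minimal NumPy sketch over a plain RNA alphabet (an assumption for illustration only; the actual script encodes cmalign traceback states, which carry more information than raw bases):

```python
import numpy as np

# Hypothetical alphabet for illustration; the real script encodes CM
# traceback states rather than raw nucleotides.
ALPHABET = "ACGU"
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(seq: str) -> np.ndarray:
    """Encode an RNA string as a (len(seq), 4) one-hot matrix."""
    mat = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, char in enumerate(seq):
        mat[pos, CHAR_TO_IDX[char]] = 1.0
    return mat

x = one_hot("ACGU")
```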

After conversion to one-hot representations, split the data into train/valid/test sets.

```shell
python scripts/split_onehot_train_valid_test.py \
-i datasets/RF00234/RF00234_unique_seed_removed_notrunc_traceback_onehot_cm.h5 \
--train_ratio 0.7 --random_state 42
```
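The split can be sketched as a seeded random permutation. The 70% training fraction follows `--train_ratio 0.7`; splitting the remainder evenly into valid/test is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)          # mirrors --random_state 42
X = np.arange(100)                       # stand-in for the one-hot array in the .h5 file

idx = rng.permutation(len(X))            # shuffle once, then slice
n_train = int(0.7 * len(X))
n_valid = int(0.15 * len(X))             # assumed even valid/test split of the rest
X_train = X[idx[:n_train]]
X_valid = X[idx[n_train:n_train + n_valid]]
X_test = X[idx[n_train + n_valid:]]
```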

## Generate weights

You can generate a sequence-weight file from the training data prepared above [Marks, 2011; Morcos, 2011]. Typically the threshold is 0.01–0.2.
The computation time depends heavily on the data size.

```shell
python scripts/generate_weight.py \
--mode cm \
-i datasets/RF00234/RF00234_unique_seed_removed_notrunc_traceback_onehot_cm_train.h5 \
--threshold 0.1
# real    1m57.974s
# user    1m7.283s
# sys     1m15.031s
```
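The reweighting scheme of [Marks, 2011; Morcos, 2011] down-weights redundant sequences: sequence *i* gets weight 1/n_i, where n_i counts sequences whose identity to *i* is at least 1 − threshold. A minimal NumPy sketch on one-hot data (the exact distance metric used by `generate_weight.py` is an assumption here):

```python
import numpy as np

def sequence_weights(onehot: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """onehot: (N, L, A). weight_i = 1 / #{j : identity(i, j) >= 1 - threshold}."""
    n, length, _ = onehot.shape
    flat = onehot.reshape(n, -1)
    # Fraction of positions at which two one-hot sequences agree.
    identity = flat @ flat.T / length
    neighbors = (identity >= 1.0 - threshold).sum(axis=1)  # includes self
    return 1.0 / neighbors

# Toy data: three identical sequences share weight; the distinct one keeps 1.
a = np.zeros((4, 4), dtype=np.float32); a[:, 0] = 1.0  # "AAAA"
b = np.zeros((4, 4), dtype=np.float32); b[:, 1] = 1.0  # "CCCC"
w = sequence_weights(np.stack([a, a, a, b]))
```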

## Train

`train.py` trains a VAE model, writes a config file recording the training settings, and optionally writes log/checkpoint files.

Help of `scripts/train.py`:

```
usage: train.py [-h] --data_dir DATA_DIR --X_train X_TRAIN --w_train W_TRAIN
                --X_valid X_VALID --w_valid W_VALID [--suffix SUFFIX]
                [--hidden HIDDEN] [--z_dim Z_DIM] [--stride STRIDE]
                [--ker1 KER1] [--ch1 CH1] [--ker2 KER2] [--ch2 CH2]
                [--ker3 KER3] [--ch3 CH3] [--beta BETA] [--use_anneal]
                [--anneal_saturate_rate ANNEAL_SATURATE_RATE]
                [--anneal_rate ANNEAL_RATE] [--batch_size BATCH_SIZE]
                [--epoch EPOCH] [-lr LEARNING_RATE] [--clip CLIP]
                [--only_training] [--use_early_stopping]
                [--tolerance TOLERANCE] [--save_ckpt] [--ckpt_iter CKPT_ITER]
                [--random_seed RANDOM_SEED] [--log] [--log_dir LOG_DIR]
                [--print_every PRINT_EVERY]

optional arguments:
  -h, --help            show this help message and exit
  --data_dir DATA_DIR   directory containing X_train etc.
  --X_train X_TRAIN     file name of training data
  --w_train W_TRAIN     file name of weight for training data
  --X_valid X_VALID     file name of validation data
  --w_valid W_VALID     file name of weight for validation data
  --suffix SUFFIX       suffix of output file name
  --hidden HIDDEN       dimension of hidden space flanking latent space
                        (default: 128)
  --z_dim Z_DIM         dimension of latent space (default: 16)
  --stride STRIDE       stride parameter of all convs (default: 1)
  --ker1 KER1           kernel size parameter of 1st conv (default: 5)
  --ch1 CH1             channel num parameter of 1st conv (default: 5)
  --ker2 KER2           kernel parameter of 2nd conv (default: 5)
  --ch2 CH2             channel num parameter of 2nd conv (default: 5)
  --ker3 KER3           kernel parameter of 3rd conv (default: 7)
  --ch3 CH3             channel num parameter of 3rd conv (default: 8)
  --beta BETA           maximum weight of KL divergence
  --use_anneal          flag for cyclic annealing for KL divergence
  --anneal_saturate_rate ANNEAL_SATURATE_RATE
                        maximum rate for cyclic annealing (default: 0.4)
  --anneal_rate ANNEAL_RATE
                        number of iteration for one increment of annealing
                        (default: 1)
  --batch_size BATCH_SIZE
                        batch size (default: 8)
  --epoch EPOCH         maximum epochs (default: 200)
  -lr LEARNING_RATE, --learning_rate LEARNING_RATE
                        learning rate (default: 1e-3)
  --clip CLIP           clipping value (default: 20)
  --only_training       only training
  --use_early_stopping  flag for early stopping
  --tolerance TOLERANCE
                        tolerance for early stopping (default: 3)
  --save_ckpt           flag for checkpoint saving
  --ckpt_iter CKPT_ITER
                        save checkpoint every this iteration (default: 3)
  --random_seed RANDOM_SEED
  --log                 flag for log
  --log_dir LOG_DIR     directory for log output
  --print_every PRINT_EVERY
                        iteration num to print log of learning (default: 20)
```
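`--use_anneal` cycles the KL weight β to mitigate posterior collapse. A minimal sketch of one common cyclic schedule (ramp β linearly from 0 to its maximum over the first fraction of each cycle, then hold); the exact schedule implemented by `train.py` is an assumption here:

```python
def cyclic_beta(iteration: int, beta_max: float, cycle_len: int,
                saturate_rate: float = 0.4) -> float:
    """KL weight under cyclic annealing: within each cycle of length
    cycle_len, ramp linearly from 0 to beta_max over the first
    `saturate_rate` fraction of iterations, then hold at beta_max."""
    pos = (iteration % cycle_len) / cycle_len   # position within this cycle
    ramp = min(pos / saturate_rate, 1.0)
    return beta_max * ramp
```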

Example:

```shell
python scripts/train.py \
--data_dir datasets/RF00234 \
--X_train RF00234_unique_seed_removed_notrunc_traceback_onehot_cm_train.h5 \
--w_train RF00234_unique_seed_removed_notrunc_traceback_onehot_cm_train_weight_threshold0p1.h5 \
--X_valid RF00234_unique_seed_removed_notrunc_traceback_onehot_cm_valid.h5 \
--w_valid RF00234_unique_seed_removed_notrunc_traceback_onehot_cm_valid_weight_threshold0p1.h5 \
--epoch 3 \
--beta 1e-3 --use_anneal --use_early_stopping \
--log --log_dir ./outputs/RF00234

# real    0m16.900s
# user    0m12.500s
# sys     0m3.155s
```

## Generate sequences by sampling in the latent space

`scripts/sampling_from_gauss.py` generates sequences by sampling from a normal distribution in the latent space. You need to provide the model parameters (checkpoint), the training config, and the CM file.
Generation runs very fast unless the CM is very long.

```shell
python scripts/sampling_from_gauss.py \
--config ./outputs/RF00234/config.yaml \
--ckpt ./outputs/RF00234/model_epoch3.pt \
--cmfile ./datasets/RF00234/RF00234.cm \
--outfasta ./outputs/RF00234/sampled_5seq.fa \
--n_samples 5

# real    0m7.447s
# user    0m6.201s
# sys     0m0.635s
```
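Conceptually, sampling draws z ~ N(0, I) in the latent space and decodes it into per-position logits. A minimal NumPy sketch with a toy stand-in decoder (the real decoder is the trained VAE, and the script maps its output back to sequences via the CM):

```python
import numpy as np

def sample_latent(n_samples: int, z_dim: int = 16, seed: int = 0) -> np.ndarray:
    """Draw latent vectors from a standard normal (z_dim=16 matches --z_dim's default)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_samples, z_dim))

def decode_to_indices(z: np.ndarray, logits_fn) -> np.ndarray:
    """Toy decode: produce (N, L, A) logits, then take the argmax symbol per position."""
    logits = logits_fn(z)
    return logits.argmax(axis=-1)

# Hypothetical stand-in decoder: a random linear map to (N, L=8, A=4) logits.
rng = np.random.default_rng(1)
W = rng.standard_normal((16, 8 * 4))
z = sample_latent(5)
idx = decode_to_indices(z, lambda z: (z @ W).reshape(-1, 8, 4))
```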

## Bayesian estimation of cleavage kinetics

### Install with conda

```shell
conda env create -f requirements_bayesiankin.yaml
```

See `/notebooks/bayesiankinetics/kinetics_analysis_bypyro.ipynb` for more details.
Estimating the kinetics of 2,000 sequences × 2 replicates with SVI takes less than one day. The NGS data are available under SRA accession PRJNA1044007.