- Should be deployed on Linux with Python 3.8.
- Main requirements:
- To use GPU, please install the GPU version of
Download the source code.
Should be deployed on Linux.
Python environment preparation
We provide three packed conda environments so that users can set up the Python dependencies with Anaconda.
# run these commands against your own conda envs directory, usually `~/anaconda3/envs` by default.
mkdir ~/anaconda3/envs/xmol
tar -zxvf ./_conda_envs/xmol.tar.gz -C ~/anaconda3/envs/xmol
mkdir ~/anaconda3/envs/esm2
tar -zxvf ./_conda_envs/esm2.tar.gz -C ~/anaconda3/envs/esm2
mkdir ~/anaconda3/envs/iTarget
tar -zxvf ./_conda_envs/iTarget.tar.gz -C ~/anaconda3/envs/iTarget
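The six commands above repeat one pattern per environment, so they can be sketched as a loop. This is a minimal sketch, not part of the repository: `unpack_envs` is a hypothetical helper name, and it assumes the three archives are named `xmol.tar.gz`, `esm2.tar.gz` and `iTarget.tar.gz` as shown above.

```shell
#!/bin/sh
# Sketch only: unpack each packed environment into its own directory.
# unpack_envs is a hypothetical helper, not a script shipped with the project.
unpack_envs() {
    archive_dir=$1   # directory holding the archives, e.g. ./_conda_envs
    envs_dir=$2      # conda envs directory, e.g. ~/anaconda3/envs
    for name in xmol esm2 iTarget; do
        archive="$archive_dir/$name.tar.gz"
        # Stop early with a clear message if an archive is missing.
        if [ ! -f "$archive" ]; then
            echo "missing archive: $archive" >&2
            return 1
        fi
        mkdir -p "$envs_dir/$name" || return 1
        tar -zxf "$archive" -C "$envs_dir/$name" || return 1
    done
}

# Example call (commented out so that sourcing this file has no side effects):
# unpack_envs ./_conda_envs "$HOME/anaconda3/envs"
```

Using `mkdir -p` makes the helper safe to re-run when a target directory already exists.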
1. Prepare LLM representations for proteins and compounds using large language models (ESM-2 and X-MOL in this study)
python _data_preprocess.py
# the produced '{type}_drugs.csv' and '{type}_prots.csv' files are the inputs for step 1.2
cd ./_ForFeatures/esm2/ # for proteins
cd ./_ForFeatures/xmol/ # for compounds
# after generating the representations, return to the project root path
2.1 For template images, move the LLM feature files produced in step 1.2 to the working path ./data/original_data/scale/
mv ./_ForFeatures/esm2/data/{--esm2type}/{--datatype}/{--datatype}_all-data-merge-prot.csv ./data/original_data/scale/
# for proteins' template, by default, {--esm2type}='esm2_t36_3B_UR50D', {--datatype}='uniprot'
mv ./_ForFeatures/xmol/FT_to_embedding/data/for_output/{--datatype}_all-data-merge-drug.csv ./data/original_data/scale/
# for compounds' template, by default, {--datatype}='fullchembl'
2.2 The feature files moved to ./data/original_data/scale/ should be renamed so that the same {--scale_source} replaces {--datatype} in both file names, matching the corresponding setting in the downstream script 0_feadist.sh. Here we use 'uniprot+fullchembl' as an example, which results in two files in ./data/original_data/scale/: uniprot+fullchembl_all-data-merge-prot.csv and uniprot+fullchembl_all-data-merge-drug.csv.
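The renaming in step 2.2 can be sketched as a small shell function. This is only an illustration under assumptions: `rename_scale_files` is a hypothetical helper name, and the file name pattern `{--datatype}_all-data-merge-prot.csv` / `{--datatype}_all-data-merge-drug.csv` is taken from step 2.1.

```shell
#!/bin/sh
# Sketch only: give both template feature files the same {--scale_source} prefix.
# rename_scale_files is a hypothetical helper, not a script shipped with the project.
rename_scale_files() {
    scale_dir=$1      # e.g. ./data/original_data/scale
    scale_source=$2   # e.g. uniprot+fullchembl
    prot_type=$3      # {--datatype} used for proteins, e.g. uniprot
    drug_type=$4      # {--datatype} used for compounds, e.g. fullchembl
    mv "$scale_dir/${prot_type}_all-data-merge-prot.csv" \
       "$scale_dir/${scale_source}_all-data-merge-prot.csv" || return 1
    mv "$scale_dir/${drug_type}_all-data-merge-drug.csv" \
       "$scale_dir/${scale_source}_all-data-merge-drug.csv" || return 1
}

# Example call with the defaults named in this README (commented out on purpose):
# rename_scale_files ./data/original_data/scale uniprot+fullchembl uniprot fullchembl
```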
2.3 For feature images, move the produced LLM feature files to the working path ./data/original_data/
mv ./_ForFeatures/esm2/data/{--esm2type}/{--datatype}/{--datatype}_all-data-merge-prot.csv ./data/original_data/
# for proteins' features, by default, {--esm2type}='esm2_t36_3B_UR50D', {--datatype}='example' or user-defined
mv ./_ForFeatures/xmol/FT_to_embedding/data/for_output/{--datatype}_all-data-merge-drug.csv ./data/original_data/
# for compounds' features, by default, {--datatype}='example' or user-defined
cd bashes
conda activate iTarget
# calculate feature distance
sh 0_feadist.sh # by default, {--scale_method}='standard', {--scale_source}='uniprot+fullchembl'
# copy calculated configs to work path
cp ../data/processed_data/drug_fea/scale/standard/*.cfg ./feamap/config/trans_from_{--scale_source}/
cp ../data/processed_data/protein_fea/scale/standard/*.cfg ./feamap/config/trans_from_{--scale_source}/
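The two `cp` commands above fail if the target directory does not exist yet. A minimal sketch that creates it first is shown below; `copy_scale_configs` is a hypothetical helper name, and the directory layout is assumed to be exactly the one used in the commands above.

```shell
#!/bin/sh
# Sketch only: copy the calculated .cfg files for drugs and proteins into
# trans_from_{--scale_source}, creating that directory if needed.
# copy_scale_configs is a hypothetical helper, not a script shipped with the project.
copy_scale_configs() {
    processed_dir=$1            # e.g. ../data/processed_data
    config_root=$2              # e.g. ./feamap/config
    scale_source=$3             # e.g. uniprot+fullchembl
    scale_method=${4:-standard} # defaults to 'standard', as in 0_feadist.sh
    dest="$config_root/trans_from_$scale_source"
    mkdir -p "$dest" || return 1
    for kind in drug_fea protein_fea; do
        src="$processed_dir/$kind/scale/$scale_method"
        # cp returns non-zero when the glob matches nothing, so a missing
        # config set is reported instead of being silently skipped.
        cp "$src"/*.cfg "$dest"/ || {
            echo "no .cfg files copied from $src" >&2
            return 1
        }
    done
}

# Example call from inside ./bashes (commented out on purpose):
# copy_scale_configs ../data/processed_data ./feamap/config uniprot+fullchembl
```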
sh 1_trans_drug.sh # for compounds, by default, {--scale_method}='standard', {--disttype}='uniprot+fullchembl', {--source}='example' or user-defined
sh 1_trans_prot.sh # for proteins, by default, {--scale_method}='standard', {--disttype}='uniprot+fullchembl', {--source}='example' or user-defined
sh 2_split_cvdata.sh
# optional, or you can directly prepare files following the examples in `./data/processed_data/split_cvdata/`.
# This step is not required for the bindingdb benchmark, whose split was already done in step 1.1
sh 3_train_cv.sh # by default, {--kfold_num}=5, {--task}='cv', {--n_epochs}=128, {--gpu}=0, {--batch_size}=512, {--lr}=5e-4, {--monitor}='auc_val', {--source}='example'
The manuscript is currently under peer review. Should you have any questions, please contact Dr. Zhang at [email protected] and Dr. Mou at [email protected]