Commit 2118e66: protTrans deepgoplus

remkv6 committed Nov 14, 2024
1 parent ac16efd commit 2118e66
Showing 4 changed files with 354 additions and 2 deletions.
@@ -67,7 +67,11 @@ export PATH="/work/gif3/masonbrink/Software/:$PATH"

**I had to modify the update.py script to get around the check for the UniProt header, which must no longer exist.**
```
vi update.py
# Release information
Current version is 1.0.21. The model in the current release was trained using the Gene Ontology
released on 2024-09-08 and the SwissProt data with version 2024_05.
python update.py
#allow check for header to pass
if last_release_version == new_release_version:
```
343 changes: 343 additions & 0 deletions dataAnalysis/GenomeAnnotation/Protein_Classification_with_ProtTrans.md
---
title: "ProtTrans: Using AI to classify proteins"
layout: single
author: Rick Masonbrink
author_profile: true
header:
overlay_color: "444444"
overlay_image: /assets/images/dna.jpg
---


# Create a ProtTrans pipeline to differentiate kinases from other proteins
### Introduction to ProtTrans for Bioinformatics Applications

ProtTrans is a powerful tool that combines cutting-edge artificial intelligence (AI) with protein sequence analysis. It enables bioinformaticians to analyze and predict protein functions, structures, and classifications in ways that were previously unattainable. Whether you're working with kinases, functional domains, or evolutionary questions, ProtTrans opens up new possibilities.

### What is ProtTrans?

ProtTrans is a transformer-based deep learning model designed to analyze protein sequences. It’s inspired by natural language processing (NLP) tools like BERT (Bidirectional Encoder Representations from Transformers) and adapts them for biological sequences. The model is trained on massive datasets of protein sequences from databases like UniProt, learning patterns, relationships, and evolutionary signals encoded in the amino acid sequences. These learned patterns are used to generate embeddings—numerical representations of proteins that capture their functional and structural properties.

### How ProtTrans Works and Where AI Fits in

**Deep Learning for Embeddings**
* ProtTrans uses transformers, a type of deep learning AI, to generate protein embeddings.
* This step is analogous to converting raw text into dense numerical representations in NLP. Here, proteins are the "text," and the embeddings capture biochemical and functional properties.

**Machine Learning for Downstream Tasks**
* After generating embeddings, you can feed them into traditional machine learning algorithms like logistic regression, random forests, or neural networks for classification, clustering, or prediction tasks (see the sketch after this list).
* Example: Classifying proteins as kinases or non-kinases using a supervised model.

**Data Preprocessing**
* AI-based preprocessing, like scaling and dimensionality reduction, ensures the embeddings are suitable for downstream machine learning models.

Each step combines insights from AI with domain knowledge, enabling you to tackle complex biological questions effectively.
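
To make the hand-off between the two stages concrete, here is a minimal sketch of the second stage in isolation, using random stand-in vectors in place of real ProtTrans embeddings (all names and shapes here are illustrative only):

```
#toy example: embeddings in, class labels out
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for ProtTrans output: 20 "proteins", each a 1024-dimensional embedding
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1024))
# Stand-in labels: 1 = kinase, 0 = non-kinase
y = rng.integers(0, 2, size=20)

# Any classical classifier can consume the embeddings directly
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))  # predicted class labels for the first three "proteins"
```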

### Applications of ProtTrans

ProtTrans has a wide range of applications in bioinformatics, including but not limited to:

* Protein Function Prediction:
* Determine the function of uncharacterized proteins by analyzing their sequence embeddings.

* Protein Classification:
* Classify proteins into functional categories, such as kinases, receptors, or transporters.

* Protein Structure and Folding:
* Leverage embeddings to predict structural features like secondary structure or intrinsic disorder.

* Variant Effect Prediction:
* Study the effects of mutations by comparing embeddings of wild-type and mutant sequences.

* Drug Discovery and Target Identification:
* Identify potential therapeutic targets by analyzing protein families, domains, or functional sites.

* Comparative Genomics:
* Use embeddings for clustering and similarity analysis across species to study protein evolution and conservation.

* Pathway Analysis:
* Annotate proteins involved in specific biochemical pathways by integrating ProtTrans outputs with pathway databases.

### Why Use ProtTrans

* Scalability:
* Analyze thousands to millions of sequences efficiently.

* Flexibility:
* Apply embeddings to diverse machine learning workflows for predictions, clustering, or annotations.

* State-of-the-Art AI:
* ProtTrans leverages deep learning, ensuring high accuracy and biological relevance.

* Biological Insights:
* Embeddings capture complex relationships in protein sequences that go beyond traditional sequence alignment methods.


### Learning Goals

* Understand how ProtTrans uses AI to generate protein embeddings.
* Learn how to integrate these embeddings into machine learning workflows for specific biological questions.
* Gain insights into the versatility of ProtTrans for bioinformatics research.


### Installation of ProtTrans
```
#create a python virtual environment
python3 -m venv ProtTrans_pyenv
#activate the environment
source ProtTrans_pyenv/bin/activate
#upgrade pip, then install the packages the scripts below import
pip install --upgrade pip
pip install -q transformers torch scikit-learn pandas numpy
```
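
As an optional sanity check, the packages the pipeline scripts import can be verified from inside the activated environment (a minimal sketch; nothing beyond the installs above is assumed):

```
#sanity check: confirm the packages used by the scripts below are importable
import torch
import transformers
import sklearn
import pandas
import numpy
print("torch", torch.__version__, "| transformers", transformers.__version__)
```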

### Create a dataset for training
Here I just went to Uniprot.org to find proteins that were representative kinases from UniRef (manually reviewed). Since we have been working with Arabidopsis, I kept with that lineage by extracting only proteins from the Brassicaceae. To avoid biasing the model, we want to keep both protein types at roughly equal representation.

* Since kinases are a minute fraction of all proteins, I only extracted kinases that belonged to UniRef clusters with at least 8 sequences per cluster.
```
#number of kinase sequences
grep -c ">" KinasesTax3700Uniprot.fasta
58918
```
* For non-kinases, I grabbed the representative protein of each Brassicaceae cluster that had at least 8 members and whose cluster name did not contain the word kinase.
```
#number of nonkinases
grep -c ">" CleanNamesNonKinasesTax3700Uniprot.fasta
77216
```
* Remove the extra header information from the fasta files, which can drastically affect the time needed to embed the proteins.
```
awk '{print $1}' uniref_NOT_name_kinase_AND_count_8_T_2024_11_11.fasta >CleanNamesNonKinasesTax3700Uniprot.fasta
awk '{print $1}' uniref_taxonomy_id_3700_AND_name_kin_2024_11_08.fasta >CleanNamesKinasesTax3700Uniprot.fasta
cat CleanNames*fasta >ProteinDataset.fasta
```
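
One caveat before embedding: generate_embeddings.py below reads FASTA records line by line and assumes each sequence occupies a single line, while UniProt downloads are often wrapped at 60 characters per line. A small helper to collapse wrapped records first (the script name and arguments are illustrative, not part of the original pipeline):

```
#linearize_fasta.py - collapse wrapped FASTA records to one line per sequence
#usage: python linearize_fasta.py ProteinDataset.fasta ProteinDatasetLinear.fasta
import sys

with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
    seq = []
    for line in fin:
        if line.startswith(">"):
            # flush the previous record's sequence before writing the new header
            if seq:
                fout.write("".join(seq) + "\n")
            fout.write(line)
            seq = []
        else:
            seq.append(line.strip())
    # flush the final record
    if seq:
        fout.write("".join(seq) + "\n")
```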




### Embed the proteins with the best pretrained protein model provided


**generate_embeddings.py**
```
import torch
from transformers import AutoTokenizer, AutoModel
import pickle
import argparse

# Input parameters
parser = argparse.ArgumentParser(description="Generate ProtTrans embeddings")
parser.add_argument('--input', type=str, required=True, help="Input FASTA file")
parser.add_argument('--output', type=str, required=True, help="Output pickle file for embeddings")
args = parser.parse_args()

# The tokenizer converts the protein sequence into tokens, one token per amino acid,
# for prot_bert_bfd, a deep learning transformer model.
def generate_embeddings(input_fasta, output_pkl):
    tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
    model = AutoModel.from_pretrained("Rostlab/prot_bert_bfd")
    # This dictionary will store protein IDs as keys and embeddings as values
    embeddings = {}
    # Read sequences line by line. If the line starts with ">", extract the ID;
    # otherwise, read the amino acid sequence (assumes single-line FASTA records).
    with open(input_fasta) as f:
        for line in f:
            if line.startswith(">"):
                prot_id = line.strip().lstrip(">")
            else:
                # prot_bert_bfd expects residues separated by spaces, so insert them
                sequence = " ".join(line.strip())
                # The protein sequence is tokenized into input features suitable for the
                # transformer model. return_tensors="pt" returns PyTorch tensors, the required format.
                inputs = tokenizer(sequence, return_tensors="pt")
                # torch.no_grad() prevents computation of gradients; the model processes the
                # tokens and outputs hidden states for each amino acid.
                with torch.no_grad():
                    # last_hidden_state is the output of the last transformer layer, the learned
                    # representation for each token. mean(1) averages the token embeddings along the
                    # sequence length dimension, condensing the data into a single vector per protein.
                    # squeeze().numpy() converts the PyTorch tensor into a NumPy array.
                    embedding = model(**inputs).last_hidden_state.mean(1).squeeze().numpy()
                # The protein ID is the key; the computed embedding (numerical vector) is the value.
                embeddings[prot_id] = embedding
    # Save embeddings to file
    with open(output_pkl, "wb") as f:
        pickle.dump(embeddings, f)

generate_embeddings(args.input, args.output)
```




**Run generate_embeddings.py**
```
#embed the training dataset. This is the most time-consuming step of the pipeline and can take >24 hours for 100k proteins, so I will provide this output.
python generate_embeddings.py --input ProteinDataset.fasta --output embeddings.pkl
#embed the dataset in which you want to identify kinases. Keep it small, maybe 100 proteins, so embedding doesn't take too long for the workshop.
python generate_embeddings.py --input HeadCleanTN10FinalManualAnnotation_proteins.fasta --output embeddings4PredictionHead.pkl
```
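
Since embedding dominates the runtime, a CUDA-capable GPU helps enormously. A minimal sketch of the same embedding computation with the model and inputs moved to the GPU when one is present (assumes a CUDA-enabled PyTorch install; the example sequence is made up):

```
#gpu sketch: same embedding computation as generate_embeddings.py, on a GPU when available
import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
model = AutoModel.from_pretrained("Rostlab/prot_bert_bfd").to(device).eval()

sequence = " ".join("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")  # hypothetical example protein
inputs = tokenizer(sequence, return_tensors="pt").to(device)
with torch.no_grad():
    embedding = model(**inputs).last_hidden_state.mean(1).squeeze().cpu().numpy()
print(embedding.shape)  # (1024,) for prot_bert_bfd
```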

### Create labels for classification
```
#create the header
echo -e "Protein_ID\tLabel" >labels.tsv
grep ">" CleanNamesKinasesTax3700Uniprot.fasta |awk '{print $1"\tkinase"}' |sed 's/>//g' >>labels.tsv
grep ">" CleanNamesNonKinasesTax3700Uniprot.fasta |awk '{print $1"\tnon-kinase"}' |sed 's/>//g' >>labels.tsv
```
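
Because train_classifier.py silently drops any labeled protein without a matching embedding, it is worth confirming that the IDs in labels.tsv line up with the keys in embeddings.pkl before training (a short check using the filenames created above):

```
#check that label IDs match embedding keys
import pickle
import pandas as pd

with open("embeddings.pkl", "rb") as f:
    embeddings = pickle.load(f)
labels = pd.read_csv("labels.tsv", sep="\t")

matched = labels["Protein_ID"].isin(embeddings.keys()).sum()
print(f"{matched} of {len(labels)} labeled proteins have embeddings")
```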

**train_classifier.py**
```
import pickle
import pandas as pd
import argparse
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Argument parser for shell inputs
parser = argparse.ArgumentParser(description="Train a classifier to identify kinases")
parser.add_argument('--embeddings', type=str, required=True, help="Path to embeddings pickle file")
parser.add_argument('--labels', type=str, required=True, help="Path to labels file (TSV format)")
parser.add_argument('--output', type=str, required=True, help="Path to save trained classifier pickle file")
parser.add_argument('--scaler_output', type=str, required=True, help="Path to save the scaler pickle file")
args = parser.parse_args()

# Load embeddings and labels, keeping only labels that have a matching embedding
with open(args.embeddings, "rb") as f:
    embeddings = pickle.load(f)
labels_df = pd.read_csv(args.labels, sep="\t")
labels_df = labels_df[labels_df["Protein_ID"].isin(embeddings.keys())]

# Build the feature matrix and labels, converting kinase/non-kinase to ones and zeros
X = [embeddings[prot_id] for prot_id in labels_df["Protein_ID"]]
y = labels_df["Label"].apply(lambda x: 1 if x == "kinase" else 0).values

# Print the distribution of the two classes to record whether your dataset was balanced
num_kinases = sum(y)
num_non_kinases = len(y) - num_kinases
print(f"Class Distribution: {num_kinases} kinases, {num_non_kinases} non-kinases")

# Split the data into training and testing sets to evaluate model performance.
# random_state=42 ensures reproducibility by fixing the randomness.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the feature matrix. This standardizes the embeddings by removing the mean and
# scaling to unit variance. The scaler computes the mean and standard deviation on the
# training set and applies the same transformation to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a logistic regression model. class_weight="balanced" handles class imbalance;
# max_iter is raised so lbfgs converges on the high-dimensional embeddings.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train_scaled, y_train)

# Evaluate the model with accuracy and a classification report (precision, recall, F1-score, support)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Save the trained classifier
with open(args.output, "wb") as f:
    pickle.dump(model, f)
print(f"Classifier saved as {args.output}")

# Save the scaler so prediction can apply the identical transformation
with open(args.scaler_output, "wb") as f:
    pickle.dump(scaler, f)
print(f"Scaler saved as {args.scaler_output}")

# Print model coefficients
print("Model coefficients (weights):")
print(model.coef_)

# Print feature matrix stats
print("Feature matrix stats:")
print(f"Mean: {np.mean(X)}, Variance: {np.var(X)}")

# Confusion matrix: row 0 = true non-kinases, row 1 = true kinases;
# column 0 = predicted non-kinase, column 1 = predicted kinase
print("Confusion Matrix")
print("[non-kinase correctly identified, non-kinase misclassified as kinase]")
print("[kinase misclassified as non-kinase, kinase correctly identified]")
print(confusion_matrix(y_test, y_pred))
```

**Train the classifier**
```
#Here we use the embeddings from generate_embeddings.py and the labels we created to train the classifier.
python train_classifier.py --embeddings embeddings.pkl --labels labels.tsv --output logistic_regression_model.pkl --scaler_output scaler.pkl
```
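
scikit-learn orders the confusion matrix with true labels as rows and predicted labels as columns (0 = non-kinase, 1 = kinase), so the four cells can be unpacked into named counts. A small illustration with toy values in place of the real y_test and y_pred:

```
#unpack the 2x2 confusion matrix printed by train_classifier.py
from sklearn.metrics import confusion_matrix

y_test = [0, 0, 1, 1, 1]  # toy true labels
y_pred = [0, 1, 1, 1, 0]  # toy predictions
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"non-kinases correct: {tn}, non-kinases called kinase: {fp}")
print(f"kinases called non-kinase: {fn}, kinases correct: {tp}")
```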

### Predict kinases from your dataset

**predict_kinases.py**
```
import pickle
import argparse
import numpy as np

# Argument parser to specify the embeddings, classifier, and scaler files
parser = argparse.ArgumentParser(description="Predict kinases using a trained classifier")
parser.add_argument('--embeddings', type=str, required=True, help="Embeddings pickle file for prediction")
parser.add_argument('--model', type=str, required=True, help="Trained classifier pickle file")
parser.add_argument('--scaler', type=str, required=True, help="Scaler pickle file saved during training")
parser.add_argument('--output', type=str, required=True, help="Output file to save predictions")
args = parser.parse_args()

# Load the new embeddings
with open(args.embeddings, "rb") as f:
    embeddings = pickle.load(f)

# Load the trained classifier
with open(args.model, "rb") as f:
    model = pickle.load(f)

# Load the scaler fit on the training data
with open(args.scaler, "rb") as f:
    scaler = pickle.load(f)

# Prepare the embeddings for prediction:
# convert the embeddings dictionary into a matrix of feature vectors
X_new = np.array(list(embeddings.values()))

# Scale the new embeddings (X_new) with the same transformation used in training
X_new_scaled = scaler.transform(X_new)

# Predict the probability of the kinase class for each protein
predictions = {}
y_pred_probs = model.predict_proba(X_new_scaled)

# Classify each protein based on the probability that it is a kinase
for prot_id, prob in zip(embeddings.keys(), y_pred_probs[:, 1]):
    # Classify based on the probability threshold
    pred = "kinase" if prob > 0.6 else "non-kinase"
    predictions[prot_id] = (pred, prob)

# Print the prediction and the probability for the kinase class
print("Protein\tPrediction\tKinase Probability")
for prot_id, (pred, prob) in predictions.items():
    print(f"{prot_id}\t{pred}\t{prob:.4f}")

# Write predictions to the specified output file
with open(args.output, "w") as f:
    for prot_id, (label, prob) in predictions.items():
        f.write(f"{prot_id}\t{label}\t{prob:.4f}\n")
print(f"Predictions saved to {args.output}")
```

**Run the prediction**
```
python predict_kinases.py --embeddings embeddings4PredictionHead.pkl --model logistic_regression_model.pkl --scaler scaler.pkl --output predictions.tsv
```
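
predictions.tsv has three columns (protein ID, predicted class, kinase probability), so the most confident kinase calls are easy to pull out; a short sketch (the top-10 cutoff is arbitrary):

```
#list the predicted kinases sorted by probability, highest first
import csv

with open("predictions.tsv") as f:
    rows = [r for r in csv.reader(f, delimiter="\t") if r[1] == "kinase"]
rows.sort(key=lambda r: float(r[2]), reverse=True)
for prot_id, label, prob in rows[:10]:
    print(prot_id, prob)
```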

[Back to the Assembly and Annotation Index page](annotation_and_assembly_index.md)
@@ -61,4 +61,6 @@ header:
* [Introduction to Braker2 Gene Prediction](Intro_to_Braker2.md)
* [Tutorial for NCBI PGAP](PGAP_tutorial.md)
* [Motif Identification and Finding with MEME and FIMO](MEME_Motif_Finding_In_Genomes.md)
* [Assign GO Terms to Proteins using Deep Learning with DeepGoPlus](DeepGoPlus_AI_Functional_Prediction_of_Proteins.md)
* [How to Identify the Secreted Protein from an Annotation and Predict Subcellular Localization](Secreted_Protein_Prediction_with_SignalP_and_TMHMM.md)
* [How to Functionally Classify Proteins using ProtTrans, a Kinase Example](Protein_Classification_with_ProtTrans.md)
3 changes: 3 additions & 0 deletions index.md.bk
@@ -64,7 +64,10 @@
* [Intro to Maker Gene Prediction](dataAnalysis/GenomeAnnotation/Intro_To_Maker.md)
* [Intro to Braker2 Gene Prediction](dataAnalysis/GenomeAnnotation/Intro_to_Braker2.md)
* [Motif Identification and Finding with MEME and FIMO](dataAnalysis/GenomeAnnotation/MEME_Motif_Finding_In_Genomes.md)
* [Assign GO Terms to Proteins using Deep Learning with DeepGoPlus](dataAnalysis/GenomeAnnotation/DeepGoPlus_AI_Functional_Prediction_of_Proteins.md)
* [How to identify the secreted protein from an annotation and predict subcellular localization](dataAnalysis/GenomeAnnotation/Secreted_Protein_Prediction_with_SignalP_and_TMHMM.md)
* [How to Functionally Classify Proteins using ProtTrans, a Kinase Example](dataAnalysis/GenomeAnnotation/Protein_Classification_with_ProtTrans.md)

* RNA-Seq Analysis
* [RNA-Seq Example with a Genome Assembly](dataAnalysis/RNA-Seq/RNA-SeqIntro/RNAseq-using-a-genome.md): Arabidopsis with a Genome
* [RNA-Seq Example without a Genome Assembly](dataAnalysis/RNA-Seq/RNA-SeqIntro/RNAseq-without-a-genome.md): Arabidopsis without a Genome
