Commit 2118e66: protTrans deepgoplus

remkv6 committed Nov 14, 2024
1 parent ac16efd commit 2118e66
Showing 4 changed files with 354 additions and 2 deletions.
@@ -67,7 +67,11 @@ export PATH="/work/gif3/masonbrink/Software/:$PATH"

**I had to modify the update.py script to get around the check for the UniProt header, which must no longer exist.**
```
vi update.py
# Release information
Current version is 1.0.21. The model in the current release was trained using the Gene Ontology
released on 2024-09-08 and the SwissProt data with version 2024_05.
python update.py
#allow check for header to pass
if last_release_version == new_release_version:
```
343 changes: 343 additions & 0 deletions dataAnalysis/GenomeAnnotation/Protein_Classification_with_ProtTrans.md
---
title: "ProtTrans: Using AI to classify proteins"
layout: single
author: Rick Masonbrink
author_profile: true
header:
overlay_color: "444444"
overlay_image: /assets/images/dna.jpg
---


# Create a ProtTrans pipeline to differentiate kinases from other proteins
### Introduction to ProtTrans for Bioinformatics Applications

ProtTrans is a powerful tool that combines cutting-edge artificial intelligence (AI) with protein sequence analysis. It enables bioinformaticians to analyze and predict protein functions, structures, and classifications in ways that were previously unattainable. Whether you're working with kinases, functional domains, or evolutionary questions, ProtTrans opens up new possibilities.

### What is ProtTrans?

ProtTrans is a transformer-based deep learning model designed to analyze protein sequences. It’s inspired by natural language processing (NLP) tools like BERT (Bidirectional Encoder Representations from Transformers) and adapts them for biological sequences. The model is trained on massive datasets of protein sequences from databases like UniProt, learning patterns, relationships, and evolutionary signals encoded in the amino acid sequences. These learned patterns are used to generate embeddings—numerical representations of proteins that capture their functional and structural properties.

### How ProtTrans Works and Where AI Fits in

**Deep Learning for Embeddings**
* ProtTrans uses transformers, a type of deep learning AI, to generate protein embeddings.
* This step is analogous to converting raw text into dense numerical representations in NLP. Here, proteins are the "text," and the embeddings capture biochemical and functional properties.

**Machine Learning for Downstream Tasks**
* After generating embeddings, you can feed them into traditional machine learning algorithms like logistic regression, random forests, or neural networks for classification, clustering, or prediction tasks (see the sketch after this list).
* Example: Classifying proteins as kinases or non-kinases using a supervised model.

**Data Preprocessing**
* AI-based preprocessing, like scaling and dimensionality reduction, ensures the embeddings are suitable for downstream machine learning models.

Each step combines insights from AI with domain knowledge, enabling you to tackle complex biological questions effectively.
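
To make the hand-off between the two stages concrete, here is a minimal sketch of the second stage in isolation, using random stand-in vectors in place of real ProtTrans embeddings (all names and shapes here are illustrative only):

```
#toy example: embeddings in, class labels out
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for ProtTrans output: 20 "proteins", each a 1024-dimensional embedding
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1024))
# Stand-in labels: 1 = kinase, 0 = non-kinase
y = rng.integers(0, 2, size=20)

# Any classical classifier can consume the embeddings directly
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))  # predicted class labels for the first three "proteins"
```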

### Applications of ProtTrans

ProtTrans has a wide range of applications in bioinformatics, including but not limited to:

* Protein Function Prediction:
* Determine the function of uncharacterized proteins by analyzing their sequence embeddings.

* Protein Classification:
* Classify proteins into functional categories, such as kinases, receptors, or transporters.

* Protein Structure and Folding:
* Leverage embeddings to predict structural features like secondary structure or intrinsic disorder.

* Variant Effect Prediction:
* Study the effects of mutations by comparing embeddings of wild-type and mutant sequences.

* Drug Discovery and Target Identification:
* Identify potential therapeutic targets by analyzing protein families, domains, or functional sites.

* Comparative Genomics:
* Use embeddings for clustering and similarity analysis across species to study protein evolution and conservation.

* Pathway Analysis:
* Annotate proteins involved in specific biochemical pathways by integrating ProtTrans outputs with pathway databases.

### Why Use ProtTrans

* Scalability:
* Analyze thousands to millions of sequences efficiently.

* Flexibility:
* Apply embeddings to diverse machine learning workflows for predictions, clustering, or annotations.

* State-of-the-Art AI:
* ProtTrans leverages deep learning, ensuring high accuracy and biological relevance.

* Biological Insights:
* Embeddings capture complex relationships in protein sequences that go beyond traditional sequence alignment methods.


### Learning Goals

* Understand how ProtTrans uses AI to generate protein embeddings.
* Learn how to integrate these embeddings into machine learning workflows for specific biological questions.
* Gain insights into the versatility of ProtTrans for bioinformatics research.


### Installation of ProtTrans
```
#create a python virtual environment
python3 -m venv ProtTrans_pyenv
#activate the environment
source ProtTrans_pyenv/bin/activate
#upgrade pip, then install the packages the scripts below import
pip install --upgrade pip
pip install -q transformers torch scikit-learn pandas numpy
```
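
As an optional sanity check, the packages the pipeline scripts import can be verified from inside the activated environment (a minimal sketch; nothing beyond the installs above is assumed):

```
#sanity check: confirm the packages used by the scripts below are importable
import torch
import transformers
import sklearn
import pandas
import numpy
print("torch", torch.__version__, "| transformers", transformers.__version__)
```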

### Create a dataset for training
Here I just went to Uniprot.org to find proteins that were representative kinases from UniRef (manually reviewed). Since we have been working with Arabidopsis, I kept with that lineage by extracting only proteins from the Brassicaceae. To avoid biasing the model, we want to keep both protein types at roughly equal representation.

* Since kinases are a minute fraction of all proteins, I only extracted kinases that belonged to UniRef clusters with at least 8 sequences per cluster.
```
#number of kinase sequences
grep -c ">" KinasesTax3700Uniprot.fasta
58918
```
* For non-kinases, I grabbed the representative protein of each Brassicaceae cluster that had at least 8 members and whose cluster name did not contain the word kinase.
```
#number of nonkinases
grep -c ">" CleanNamesNonKinasesTax3700Uniprot.fasta
77216
```
* Remove the extra header information from the fasta files, which can drastically affect the time needed to embed the proteins.
```
awk '{print $1}' uniref_NOT_name_kinase_AND_count_8_T_2024_11_11.fasta >CleanNamesNonKinasesTax3700Uniprot.fasta
awk '{print $1}' uniref_taxonomy_id_3700_AND_name_kin_2024_11_08.fasta >CleanNamesKinasesTax3700Uniprot.fasta
cat CleanNames*fasta >ProteinDataset.fasta
```
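
One caveat before embedding: generate_embeddings.py below reads FASTA records line by line and assumes each sequence occupies a single line, while UniProt downloads are often wrapped at 60 characters per line. A small helper to collapse wrapped records first (the script name and arguments are illustrative, not part of the original pipeline):

```
#linearize_fasta.py - collapse wrapped FASTA records to one line per sequence
#usage: python linearize_fasta.py ProteinDataset.fasta ProteinDatasetLinear.fasta
import sys

with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
    seq = []
    for line in fin:
        if line.startswith(">"):
            # flush the previous record's sequence before writing the new header
            if seq:
                fout.write("".join(seq) + "\n")
            fout.write(line)
            seq = []
        else:
            seq.append(line.strip())
    # flush the final record
    if seq:
        fout.write("".join(seq) + "\n")
```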




### Embed the proteins with the best pretrained protein model provided


**generate_embeddings.py**
```
import torch
from transformers import AutoTokenizer, AutoModel
import pickle
import argparse

# Input parameters
parser = argparse.ArgumentParser(description="Generate ProtTrans embeddings")
parser.add_argument('--input', type=str, required=True, help="Input FASTA file")
parser.add_argument('--output', type=str, required=True, help="Output pickle file for embeddings")
args = parser.parse_args()

# The tokenizer converts the protein sequence into tokens, one token per amino acid,
# for prot_bert_bfd, a deep learning transformer model.
def generate_embeddings(input_fasta, output_pkl):
    tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
    model = AutoModel.from_pretrained("Rostlab/prot_bert_bfd")
    # This dictionary will store protein IDs as keys and embeddings as values
    embeddings = {}
    # Read sequences line by line. If the line starts with ">", extract the ID;
    # otherwise, read the amino acid sequence (assumes single-line FASTA records).
    with open(input_fasta) as f:
        for line in f:
            if line.startswith(">"):
                prot_id = line.strip().lstrip(">")
            else:
                # prot_bert_bfd expects residues separated by spaces, so insert them
                sequence = " ".join(line.strip())
                # The protein sequence is tokenized into input features suitable for the
                # transformer model. return_tensors="pt" returns PyTorch tensors, the required format.
                inputs = tokenizer(sequence, return_tensors="pt")
                # torch.no_grad() prevents computation of gradients; the model processes the
                # tokens and outputs hidden states for each amino acid.
                with torch.no_grad():
                    # last_hidden_state is the output of the last transformer layer, the learned
                    # representation for each token. mean(1) averages the token embeddings along the
                    # sequence length dimension, condensing the data into a single vector per protein.
                    # squeeze().numpy() converts the PyTorch tensor into a NumPy array.
                    embedding = model(**inputs).last_hidden_state.mean(1).squeeze().numpy()
                # The protein ID is the key; the computed embedding (numerical vector) is the value.
                embeddings[prot_id] = embedding
    # Save embeddings to file
    with open(output_pkl, "wb") as f:
        pickle.dump(embeddings, f)

generate_embeddings(args.input, args.output)
```




**Run generate_embeddings.py**
```
#embed the training dataset. This is the most time-consuming step of the pipeline and can take >24 hours for 100k proteins, so I will provide this output.
python generate_embeddings.py --input ProteinDataset.fasta --output embeddings.pkl
#embed the dataset in which you want to identify kinases. Keep it small, maybe 100 proteins, so embedding doesn't take too long for the workshop.
python generate_embeddings.py --input HeadCleanTN10FinalManualAnnotation_proteins.fasta --output embeddings4PredictionHead.pkl
```
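
Since embedding dominates the runtime, a CUDA-capable GPU helps enormously. A minimal sketch of the same embedding computation with the model and inputs moved to the GPU when one is present (assumes a CUDA-enabled PyTorch install; the example sequence is made up):

```
#gpu sketch: same embedding computation as generate_embeddings.py, on a GPU when available
import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
model = AutoModel.from_pretrained("Rostlab/prot_bert_bfd").to(device).eval()

sequence = " ".join("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")  # hypothetical example protein
inputs = tokenizer(sequence, return_tensors="pt").to(device)
with torch.no_grad():
    embedding = model(**inputs).last_hidden_state.mean(1).squeeze().cpu().numpy()
print(embedding.shape)  # (1024,) for prot_bert_bfd
```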

### Create labels for classification
```
#create the header
echo -e "Protein_ID\tLabel" >labels.tsv
grep ">" CleanNamesKinasesTax3700Uniprot.fasta |awk '{print $1"\tkinase"}' |sed 's/>//g' >>labels.tsv
grep ">" CleanNamesNonKinasesTax3700Uniprot.fasta |awk '{print $1"\tnon-kinase"}' |sed 's/>//g' >>labels.tsv
```
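
Because train_classifier.py silently drops any labeled protein without a matching embedding, it is worth confirming that the IDs in labels.tsv line up with the keys in embeddings.pkl before training (a short check using the filenames created above):

```
#check that label IDs match embedding keys
import pickle
import pandas as pd

with open("embeddings.pkl", "rb") as f:
    embeddings = pickle.load(f)
labels = pd.read_csv("labels.tsv", sep="\t")

matched = labels["Protein_ID"].isin(embeddings.keys()).sum()
print(f"{matched} of {len(labels)} labeled proteins have embeddings")
```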

**train_classifier.py**
```
import pickle
import pandas as pd
import argparse
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Argument parser for shell inputs
parser = argparse.ArgumentParser(description="Train a classifier to identify kinases")
parser.add_argument('--embeddings', type=str, required=True, help="Path to embeddings pickle file")
parser.add_argument('--labels', type=str, required=True, help="Path to labels file (TSV format)")
parser.add_argument('--output', type=str, required=True, help="Path to save trained classifier pickle file")
parser.add_argument('--scaler_output', type=str, required=True, help="Path to save the scaler pickle file")
args = parser.parse_args()

# Load embeddings and labels, keeping only labels that have a matching embedding
with open(args.embeddings, "rb") as f:
    embeddings = pickle.load(f)
labels_df = pd.read_csv(args.labels, sep="\t")
labels_df = labels_df[labels_df["Protein_ID"].isin(embeddings.keys())]

# Build the feature matrix and labels, converting kinase/non-kinase to ones and zeros
X = [embeddings[prot_id] for prot_id in labels_df["Protein_ID"]]
y = labels_df["Label"].apply(lambda x: 1 if x == "kinase" else 0).values

# Print the distribution of the two classes to record whether your dataset was balanced
num_kinases = sum(y)
num_non_kinases = len(y) - num_kinases
print(f"Class Distribution: {num_kinases} kinases, {num_non_kinases} non-kinases")

# Split the data into training and testing sets to evaluate model performance.
# random_state=42 ensures reproducibility by fixing the randomness.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the feature matrix. This standardizes the embeddings by removing the mean and
# scaling to unit variance. The scaler computes the mean and standard deviation on the
# training set and applies the same transformation to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a logistic regression model. class_weight="balanced" handles class imbalance;
# max_iter is raised so lbfgs converges on the high-dimensional embeddings.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train_scaled, y_train)

# Evaluate the model with accuracy and a classification report (precision, recall, F1-score, support)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Save the trained classifier
with open(args.output, "wb") as f:
    pickle.dump(model, f)
print(f"Classifier saved as {args.output}")

# Save the scaler so prediction can apply the identical transformation
with open(args.scaler_output, "wb") as f:
    pickle.dump(scaler, f)
print(f"Scaler saved as {args.scaler_output}")

# Print model coefficients
print("Model coefficients (weights):")
print(model.coef_)

# Print feature matrix stats
print("Feature matrix stats:")
print(f"Mean: {np.mean(X)}, Variance: {np.var(X)}")

# Confusion matrix: row 0 = true non-kinases, row 1 = true kinases;
# column 0 = predicted non-kinase, column 1 = predicted kinase
print("Confusion Matrix")
print("[non-kinase correctly identified, non-kinase misclassified as kinase]")
print("[kinase misclassified as non-kinase, kinase correctly identified]")
print(confusion_matrix(y_test, y_pred))
```

**Train the classifier**
```
#Here we use the embeddings from generate_embeddings.py and the labels we created to train the classifier.
python train_classifier.py --embeddings embeddings.pkl --labels labels.tsv --output logistic_regression_model.pkl --scaler_output scaler.pkl
```
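
scikit-learn orders the confusion matrix with true labels as rows and predicted labels as columns (0 = non-kinase, 1 = kinase), so the four cells can be unpacked into named counts. A small illustration with toy values in place of the real y_test and y_pred:

```
#unpack the 2x2 confusion matrix printed by train_classifier.py
from sklearn.metrics import confusion_matrix

y_test = [0, 0, 1, 1, 1]  # toy true labels
y_pred = [0, 1, 1, 1, 0]  # toy predictions
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"non-kinases correct: {tn}, non-kinases called kinase: {fp}")
print(f"kinases called non-kinase: {fn}, kinases correct: {tp}")
```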

### Predict kinases from your dataset

**predict_kinases.py**
```
import pickle
import argparse
import numpy as np

# Argument parser to specify the embeddings, classifier, and scaler files
parser = argparse.ArgumentParser(description="Predict kinases using a trained classifier")
parser.add_argument('--embeddings', type=str, required=True, help="Embeddings pickle file for prediction")
parser.add_argument('--model', type=str, required=True, help="Trained classifier pickle file")
parser.add_argument('--scaler', type=str, required=True, help="Scaler pickle file saved during training")
parser.add_argument('--output', type=str, required=True, help="Output file to save predictions")
args = parser.parse_args()

# Load the new embeddings
with open(args.embeddings, "rb") as f:
    embeddings = pickle.load(f)

# Load the trained classifier
with open(args.model, "rb") as f:
    model = pickle.load(f)

# Load the scaler fit on the training data
with open(args.scaler, "rb") as f:
    scaler = pickle.load(f)

# Prepare the embeddings for prediction:
# convert the embeddings dictionary into a matrix of feature vectors
X_new = np.array(list(embeddings.values()))

# Scale the new embeddings (X_new) with the same transformation used in training
X_new_scaled = scaler.transform(X_new)

# Predict the probability of the kinase class for each protein
predictions = {}
y_pred_probs = model.predict_proba(X_new_scaled)

# Classify each protein based on the probability that it is a kinase
for prot_id, prob in zip(embeddings.keys(), y_pred_probs[:, 1]):
    # Classify based on the probability threshold
    pred = "kinase" if prob > 0.6 else "non-kinase"
    predictions[prot_id] = (pred, prob)

# Print the prediction and the probability for the kinase class
print("Protein\tPrediction\tKinase Probability")
for prot_id, (pred, prob) in predictions.items():
    print(f"{prot_id}\t{pred}\t{prob:.4f}")

# Write predictions to the specified output file
with open(args.output, "w") as f:
    for prot_id, (label, prob) in predictions.items():
        f.write(f"{prot_id}\t{label}\t{prob:.4f}\n")
print(f"Predictions saved to {args.output}")
```

**Run the prediction**
```
python predict_kinases.py --embeddings embeddings4PredictionHead.pkl --model logistic_regression_model.pkl --scaler scaler.pkl --output predictions.tsv
```
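
predictions.tsv has three columns (protein ID, predicted class, kinase probability), so the most confident kinase calls are easy to pull out; a short sketch (the top-10 cutoff is arbitrary):

```
#list the predicted kinases sorted by probability, highest first
import csv

with open("predictions.tsv") as f:
    rows = [r for r in csv.reader(f, delimiter="\t") if r[1] == "kinase"]
rows.sort(key=lambda r: float(r[2]), reverse=True)
for prot_id, label, prob in rows[:10]:
    print(prot_id, prob)
```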

[Back to the Assembly and Annotation Index page](annotation_and_assembly_index.md)
@@ -61,4 +61,6 @@ header:
* [Introduction to Braker2 Gene Prediction](Intro_to_Braker2.md)
* [Tutorial for NCBI PGAP](PGAP_tutorial.md)
* [Motif Identification and Finding with MEME and FIMO](MEME_Motif_Finding_In_Genomes.md)
* [Assign GO Terms to Proteins using Deep Learning with DeepGoPlus](DeepGoPlus_AI_Functional_Prediction_of_Proteins.md)
* [How to Identify the Secreted Protein from an Annotation and Predict Subcellular Localization](Secreted_Protein_Prediction_with_SignalP_and_TMHMM.md)
* [How to Functionally Classify Proteins using ProtTrans, a Kinase Example](Protein_Classification_with_ProtTrans.md)
3 changes: 3 additions & 0 deletions index.md.bk
@@ -64,7 +64,10 @@
* [Intro to Maker Gene Prediction](dataAnalysis/GenomeAnnotation/Intro_To_Maker.md)
* [Intro to Braker2 Gene Prediction](dataAnalysis/GenomeAnnotation/Intro_to_Braker2.md)
* [Motif Identification and Finding with MEME and FIMO](dataAnalysis/GenomeAnnotation/MEME_Motif_Finding_In_Genomes.md)
* [Assign GO Terms to Proteins using Deep Learning with DeepGoPlus](dataAnalysis/GenomeAnnotation/DeepGoPlus_AI_Functional_Prediction_of_Proteins.md)
* [How to identify the secreted protein from an annotation and predict subcellular localization](dataAnalysis/GenomeAnnotation/Secreted_Protein_Prediction_with_SignalP_and_TMHMM.md)
* [How to Functionally Classify Proteins using ProtTrans, a Kinase Example](dataAnalysis/GenomeAnnotation/Protein_Classification_with_ProtTrans.md)

* RNA-Seq Analysis
* [RNA-Seq Example with a Genome Assembly](dataAnalysis/RNA-Seq/RNA-SeqIntro/RNAseq-using-a-genome.md): Arabidopsis with a Genome
* [RNA-Seq Example without a Genome Assembly](dataAnalysis/RNA-Seq/RNA-SeqIntro/RNAseq-without-a-genome.md): Arabidopsis without a Genome
