This project applies transformers-based language modeling and graph neural network-based link prediction techniques to assist UML modelers. The objective is to leverage natural language labels in UML models and the graph structure of UML models to learn patterns for various tasks.
- a. Given a UML class and its relations with neighboring classes, predict the name of the abstract class.
- b. Given the abstract classes and relations of a UML class, predict the name of the UML class.
- Predict the "stereotype" of a UML class in an OntoUML, considering the name of the class, its neighboring classes, and relationships.
- Given a UML model with missing links between the classes, predict the missing links.
Partial model completion is achieved in two stages: Pretraining and Finetuning.
In the pretraining stage, a `UMLGPT` model is trained on graph data as a next token prediction task. Nodes are represented as strings in the format `<node name> <relations> <super types>`. The model learns generalized patterns in the UML model data.
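To make the `<node name> <relations> <super types>` format concrete, here is a hedged illustration; the class, relation, and super-type names below are invented for the example and do not come from the dataset:

```python
# Hypothetical example of the "<node name> <relations> <super types>" serialization.
# All names here are made up for illustration; the real dataset uses its own labels.
node_name = "LibraryMember"
relations = ["borrows Book", "memberOf Library"]
super_types = ["Person"]

node_string = f"{node_name} {' '.join(relations)} {' '.join(super_types)}"
print(node_string)
# LibraryMember borrows Book memberOf Library Person
```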
The trained `UMLGPT` model is then used for a sequence classification task via the `UMLGPTClassifier` class. Given the string representation of a node, the model is finetuned to predict either the super type of the node or the node name. For example, in the abstract class prediction task, the model predicts the abstract class label of a UML class from its `[entity relations]` string representation.
Language models tokenize strings into vectors of numbers using tokenizers. This project supports both pretrained language model (PLM) tokenizers and a custom `VocabTokenizer` class built from the node strings data. The `VocabTokenizer` class is implemented similarly to the Hugging Face `AutoTokenizer` class to maintain a consistent API.
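As a minimal sketch of what such a word-level tokenizer could look like (this is an illustration only; the project's actual `VocabTokenizer` may be implemented differently, and only the call convention is meant to mirror Hugging Face tokenizers):

```python
# Minimal word-level tokenizer sketch mirroring the Hugging Face call convention.
# Illustrative only; the repository's VocabTokenizer may differ in detail.
class SimpleWordTokenizer:
    def __init__(self, node_strings, pad_token="<pad>", unk_token="<unk>"):
        words = {w for s in node_strings for w in s.split()}
        self.vocab = {pad_token: 0, unk_token: 1}
        for w in sorted(words):
            self.vocab[w] = len(self.vocab)
        self.pad_id = self.vocab[pad_token]
        self.unk_id = self.vocab[unk_token]

    def __call__(self, text, max_length=16):
        ids = [self.vocab.get(w, self.unk_id) for w in text.split()][:max_length]
        attention_mask = [1] * len(ids)
        # Pad to max_length so batches can be stacked into tensors.
        ids += [self.pad_id] * (max_length - len(ids))
        attention_mask += [0] * (max_length - len(attention_mask))
        return {"input_ids": ids, "attention_mask": attention_mask}

tokenizer = SimpleWordTokenizer(["LibraryMember borrows Book Person"])
print(tokenizer("LibraryMember borrows Book"))
```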
The `UMLGPTTrainer` class implements a trainer that takes a `UMLGPT` or `UMLGPTClassifier` model and trains it for the next token prediction or sequence classification task.
This task employs the same classes (`UMLGPT` or `UMLGPTClassifier`) as the first task to predict the UML class stereotype.
The link prediction task uses node embeddings learned by the `UMLGPT` model during the pretraining phase. The `GNNModel` takes these node embeddings and is trained together with a multi-layer perceptron (MLP) predictor to detect whether a link exists between two nodes.
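A minimal sketch of an MLP link predictor that scores a pair of node embeddings is shown below; this is illustrative and not necessarily identical to the repository's `MLPPredictor`:

```python
import torch
import torch.nn as nn

class PairMLPPredictor(nn.Module):
    """Scores whether an edge exists between two nodes given their embeddings.
    Illustrative sketch; the repository's MLPPredictor may differ."""
    def __init__(self, embed_dim, hidden_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, src_emb, dst_emb):
        # Concatenate source and destination embeddings and output one logit per pair.
        return self.mlp(torch.cat([src_emb, dst_emb], dim=-1)).squeeze(-1)

# Example: score 4 candidate edges using 256-dimensional node embeddings.
predictor = PairMLPPredictor(embed_dim=256)
scores = predictor(torch.randn(4, 256), torch.randn(4, 256))
```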
For the first two tasks (sequence classification), the chosen metrics are:
- MRR - Mean Reciprocal Rank
- Hits@k with k = 1, 3, 5, 10
- Precision
For the third task, link prediction, the chosen metric is `roc_auc_score`.
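For reference, MRR and Hits@k over per-example classification logits can be computed as sketched below; this is a generic illustration, not the repository's metrics code:

```python
import numpy as np

def mrr_and_hits(logits, labels, ks=(1, 3, 5, 10)):
    """logits: (n_examples, n_classes); labels: (n_examples,) true class ids.
    Returns MRR and Hits@k based on the rank of the true class."""
    # Rank of the true class = 1 + number of classes scored strictly higher.
    true_scores = logits[np.arange(len(labels)), labels]
    ranks = 1 + (logits > true_scores[:, None]).sum(axis=1)
    mrr = float(np.mean(1.0 / ranks))
    hits = {f"hits@{k}": float(np.mean(ranks <= k)) for k in ks}
    return mrr, hits

mrr, hits = mrr_and_hits(np.random.randn(100, 50), np.random.randint(0, 50, size=100))
```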
| Task No. | Metric Name | Target | Achieved | Based on |
|---|---|---|---|---|
| 1 | Hits@10 | 0.48 | 0.496 | BERT sequence classifier |
| 2 | Precision | 0.64 | 0.83 | Tasks 1 and 2 |
| 3 | roc_auc_score | 0.860 | 0.829 | GPT-2-based embeddings as opposed to UMLGPT embeddings |
For the first case, note that the linked work has the same problem statement, i.e., predicting the class name given a partial model, and reports an EClassifier Recall@10 score of 0.4258 and a Local Context Recall@10 score of 0.458. However, for the former, the comparison cannot be truly valid because the data has not been made public, and for the latter, there are several issues with the implementation design, so even though a Recall@10 of 0.458 is reported, the actual score should be poorer. For instance, they use masked language modeling to predict masked class names given a graph-traversal representation of a UML model as a string, e.g.:

`( <MODEL> ( <CLS> ( <NAME> ACMEFile ) ( <ASSOCS> ( ACMEEntry entries ) ) ) ( <CLS> ( <NAME> <mask> ) ) )`

They mask the class name `ACMEEntry` and train the model to predict it. However, the class name already appears in a relation earlier in the same string, which significantly threatens the validity of their results.
For the second case, one of the baselines is my previous work, which used word2vec embeddings and GNN-based node classification to predict the OntoUML stereotype. The second baseline is an ontological-rules-based approach to infer the stereotypes. For the rule-based approach, the Hits@1 score is fairly poor (less than 30%), although its Hits@3 score is better. In both cases, my approach, which uses embeddings from a pretrained language model and finetunes them for the sequence classification task, outperforms both works.
For the last case, I could not find any baseline in academia. However, I compared two kinds of embeddings: on one hand, embeddings from a pretrained language model, and on the other, embeddings from the UMLGPT model implemented in this project. Both sets of embeddings are used to train the GNN model implemented in this project, and it turns out that the pretrained embeddings marginally outperform the UMLGPT embeddings.
This file encapsulates essential methods and classes for creating PyTorch datasets, extracting data from UML models, and converting them into strings.
This script processes the .ecore files of the UML models dataset, converting and storing them in .pkl format.
This file encompasses all implemented PyTorch `nn.Module` classes, such as UMLGPT, UMLGPTClassifier, MLPPredictor, and GNNModel. Additionally, it includes the `nn.Module` classes required to create transformer blocks.
This module houses all necessary methods and classes for extracting data from the JSON files of OntoUML models within the `datasets` folder.
All command-line argument parameters are contained and explained in this file.
This script executes the pretraining phase on UML models data.
This script performs sequence classification on UML models, predicting the UML class name or the UML abstract class of a UML class. It supports specifying a pretrained tokenizer or a custom vocab tokenizer. The model for classification can be an untrained UMLGPT, a pretrained UMLGPT model, or any model from the Hugging Face library.
This file contains methods to calculate metrics on the predictions.
This script executes sequence classification on UML models to predict the OntoUML stereotype of a UML class or relation. The tokenizer is always from a pretrained language model, and a custom tokenizer is not yet implemented for this case. The model for classification can be an untrained UMLGPT, a pretrained UMLGPT model, or any model from the Hugging Face library. Link to ontouml_classification.py
This script is used to execute link prediction between graphs on UML models. The tokenizer is always from a pretrained language model, and a custom tokenizer is not yet implemented for this case. Link to link_prediction.py
This file is used to create node triples, i.e., UML class, relation, and abstract class triples for UML models.
This script is used to create node triples for OntoUML classes. In this case, the node triple contains information not only about the immediate neighbors but about all nodes up to a distance `d`, as specified by the `distance` argument.
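A sketch of collecting all classes within a given distance of a node, e.g., with networkx (the graph and class names here are illustrative, not from the dataset):

```python
import networkx as nx

def neighborhood_within_distance(graph, node, distance):
    """Return all nodes whose shortest-path distance from `node` is <= distance.
    Illustrative helper; the repository's triple-generation code may differ."""
    lengths = nx.single_source_shortest_path_length(graph, node, cutoff=distance)
    return sorted(n for n in lengths if n != node)

# Toy UML-like graph with invented class names.
g = nx.Graph()
g.add_edges_from([("Order", "Customer"), ("Customer", "Address"), ("Order", "Product")])
print(neighborhood_within_distance(g, "Order", distance=2))
# ['Address', 'Customer', 'Product']
```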
This file specifies all the trainers for the three different tasks.
Install dependencies: the dependencies are listed in the requirements.txt file and can be installed by executing:
pip install -r requirements.txt
All Run Configurations for the three tasks:
- PLM = pretrained language model
- Word tokenizer = tokenizer generated by the custom `VocabTokenizer` class
All the parameters are explained in the parameters.py file.
python pretraining.py --tokenizer=bert-base-cased --gpt_model=uml-gpt --num_layers=6 --num_heads=8 --embed_dim=256 --batch_size=128 --lr=0.0001 --num_epochs=1
python pretraining.py --tokenizer=word --gpt_model=uml-gpt --num_layers=6 --num_heads=8 --embed_dim=256 --batch_size=128 --lr=0.0001 --num_epochs=1
python pretraining.py --gpt_model=gpt2 --batch_size=128 --lr=0.00001 --num_epochs=2
python uml_classification.py --tokenizer=bert-base-cased --classification_model=uml-gpt --num_layers=6 --num_heads=8 --embed_dim=256 --batch_size=128 --lr=0.0001 --num_epochs=1 --class_type=super_type
python uml_classification.py --tokenizer=bert-base-cased --classification_model=uml-gpt --num_layers=6 --num_heads=8 --embed_dim=256 --batch_size=128 --lr=0.0001 --num_epochs=1 --class_type=entity
python uml_classification.py --tokenizer=word --classification_model=uml-gpt --num_layers=6 --num_heads=8 --embed_dim=256 --batch_size=128 --lr=0.0001 --num_epochs=1 --class_type=super_type
python uml_classification.py --tokenizer=word --classification_model=uml-gpt --num_layers=6 --num_heads=8 --embed_dim=256 --batch_size=128 --lr=0.0001 --num_epochs=1 --class_type=entity
python uml_classification.py --classification_model=bert-base-cased --batch_size=128 --lr=0.00001 --num_epochs=1 --class_type=super_type
python uml_classification.py --classification_model=bert-base-cased --batch_size=128 --lr=0.00001 --num_epochs=1 --class_type=entity
The tokenizer used here should be the same as the tokenizer used for pretraining.
python uml_classification.py --classification_model=uml-gpt --from_pretrained=models/pre_uml-gpt_tok=bert-base-cased/best_model.pt --tokenizer=bert-base-cased --batch_size=128 --lr=0.0001 --num_epochs=1 --class_type=super_type
python uml_classification.py --classification_model=uml-gpt --from_pretrained=models/pre_uml-gpt_tok=bert-base-cased/best_model.pt --tokenizer=bert-base-cased --batch_size=128 --lr=0.0001 --num_epochs=1 --class_type=entity
The tokenizer used here should be the same as the tokenizer used for pretraining.
python uml_classification.py --classification_model=uml-gpt --from_pretrained=models/pre_uml-gpt_tok=word/best_model.pt --tokenizer=word --batch_size=128 --lr=0.0001 --num_epochs=1 --class_type=super_type
python uml_classification.py --classification_model=uml-gpt --from_pretrained=models/pre_uml-gpt_tok=word/best_model.pt --tokenizer=word --batch_size=128 --lr=0.0001 --num_epochs=1 --class_type=entity
python uml_classification.py --classification_model=uml-gpt --from_pretrained=models/pre_gpt2/best_model --tokenizer=word --batch_size=128 --lr=0.00001 --num_epochs=1 --class_type=super_type
python uml_classification.py --classification_model=uml-gpt --from_pretrained=models/pre_gpt2/best_model --tokenizer=word --batch_size=128 --lr=0.00001 --num_epochs=1 --class_type=entity
python ontouml_classification.py --data_dir=datasets/ontoumlModels --classification_model=uml-gpt --num_layers=6 --num_heads=8 --embed_dim=256 --num_epochs=1
python link_prediction.py --from_pretrained=models/pre_uml-gpt_tok=bert-base-cased/best_model.pt --num_layers=2 --embed_dim=256 --tokenizer=bert-base-cased
The tokenizer used here should be the same as the tokenizer used for pretraining.
python link_prediction.py --from_pretrained=models/pre_gpt2/best_model --num_layers=2 --embed_dim=256 --tokenizer=gpt2