AST-Probe - Codebase and data

Paper. AST-Probe: Recovering abstract syntax trees from hidden representations of pre-trained language models

Link. https://arxiv.org/abs/2206.11719

Conference. 37th IEEE/ACM International Conference on Automated Software Engineering (ASE 2022)

Installation 🛠️

  1. Clone the repository.
git clone https://github.com/martin-wey/ast-probe
cd ast-probe
  2. Create a python3 virtual environment and install requirements.txt.
python3 -m venv <name_of_your_env>
source <name_of_your_env>/bin/activate
pip install -r requirements.txt

We ran our experiments using Python 3.9.10. The requirements.txt pins the PyTorch and CUDA versions we used to produce the results reported in the paper.

  3. Install all tree-sitter grammars (a sketch of what the build script does is included at the end of this section):
mkdir grammars
cd grammars
git clone https://github.com/tree-sitter/tree-sitter-python.git
git clone https://github.com/tree-sitter/tree-sitter-javascript.git
git clone https://github.com/tree-sitter/tree-sitter-go.git
git clone https://github.com/tree-sitter/tree-sitter-php.git
git clone https://github.com/tree-sitter/tree-sitter-ruby.git
git clone https://github.com/tree-sitter/tree-sitter-java.git
cd ..
python src/data/build_grammars.py
  4. Add the project directory to your Python path.
export PYTHONPATH="${PYTHONPATH}:${HOME}/ast-probe/"
  5. [Optional] Execute all tests.
python -m unittest discover
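
A sketch of what the grammar build step does: src/data/build_grammars.py compiles the cloned grammars so they can be loaded from Python. The snippet below is only an illustration of such a step, assuming an older py-tree-sitter release that still provides Language.build_library; the output path grammars/languages.so is an assumption, not necessarily what the script uses.

# Illustrative sketch only (assumed, not the actual build_grammars.py code):
# compile the cloned grammars into a single shared library with py-tree-sitter.
from tree_sitter import Language

Language.build_library(
    # output path for the compiled shared library (assumed)
    "grammars/languages.so",
    # grammar repositories cloned in step 3
    [
        "grammars/tree-sitter-python",
        "grammars/tree-sitter-javascript",
        "grammars/tree-sitter-go",
        "grammars/tree-sitter-php",
        "grammars/tree-sitter-ruby",
        "grammars/tree-sitter-java",
    ],
)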

Running the probe 🚀

  1. Dataset generation.
python src/dataset_generator.py --download --lang python
python src/dataset_generator.py --lang javascript
python src/dataset_generator.py --lang go

The script dataset_generator.py with the argument --download downloads the CodeSearchNet dataset, filters code snippets, and extracts 20,000 samples for training, 4,000 for testing, and 2,000 for validation. The filtering criteria are the following:

  • We filter out code snippets that have a length > 512 after tokenization.
  • We remove code snippets that cannot be parsed by tree-sitter.
  • We remove code snippets containing syntax errors.
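
For illustration, the sketch below shows how these three criteria could be implemented with a HuggingFace tokenizer and py-tree-sitter; the model name, library path, and function are assumptions, not the actual dataset_generator.py code.

# Illustrative sketch of the filtering criteria (assumed, not the actual script code).
from tree_sitter import Language, Parser
from transformers import AutoTokenizer

PY_LANGUAGE = Language("grammars/languages.so", "python")  # assumed library path
parser = Parser()
parser.set_language(PY_LANGUAGE)
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")  # assumed tokenizer

def keep_snippet(code: str) -> bool:
    # Criterion 1: drop snippets longer than 512 tokens after tokenization.
    if len(tokenizer.tokenize(code)) > 512:
        return False
    # Criteria 2 and 3: drop snippets that tree-sitter cannot parse
    # or whose parse tree contains syntax errors.
    tree = parser.parse(bytes(code, "utf8"))
    return not tree.root_node.has_error
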
  2. Train the AST-Probe.
python src/main.py \
  --do_train \
  --run_name <folder_run_name> \
  --pretrained_model_name_or_path <hugging_face_model> \
  --model_type <model_type> \
  --lang <lang> \
  --layer <layer> \
  --rank <rank>

The script main.py is in charge of training the probe. The main arguments are the following:

  • --do_train: if you want to train a probe classifier.
  • --run_name: indicates the name of the folder where the log, model and results will be stored.
  • --pretrained_model_name_or_path: the pre-trained model's id in the HuggingFace Hub. e.g., microsoft/codebert-base, roberta-base, Salesforce/codet5-base, etc.
  • --model_type: the model architecture. Currently, we only support roberta or t5.
  • --lang: programming language. Currently, we only support python, javascript or go.
  • --layer: the layer of the transformer model to probe (see the sketch below). Normally, it goes from 0 to 12. If the pre-trained model is huggingface/CodeBERTa-small-v1, then this argument should range between 0 and 6.
  • --rank: dimension of the syntactic subspace.
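
To clarify what --layer selects, the sketch below shows how per-layer hidden states can be obtained with the HuggingFace transformers API; it is an assumed illustration, not the actual probing code in src/main.py.

# Illustrative sketch (assumed): extract the hidden states of a given layer.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding layer; hidden_states[5] is the output of layer 5.
layer = 5
hidden = outputs.hidden_states[layer]  # shape: (1, sequence_length, 768)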

As a result of this script, a folder runs/folder_run_name will be generated. This folder contains three files:

  • info.log: log file.
  • pytorch_model.bin: the serialized probing model, i.e., the basis of the syntactic subspace and the vectors C and U.
  • metrics.log: a serialized dictionary that contains the training losses, the validation losses, the precision, recall, and F1 score on the test set. You can use python -m pickle runs/folder_run_name/metrics.log to check the metrics for the run.

Here is an example of the usage of this script:

python src/main.py \
  --do_train \
  --run_name codebert_python_5_128 \
  --pretrained_model_name_or_path microsoft/codebert-base \
  --model_type roberta \
  --lang python \
  --layer 5 \
  --rank 128

This command trains a 128-dimensional probe over the output embeddings of the 5th layer of CodeBERT using the Python dataset. After running this command, the folder runs/codebert_python_5_128 is created.
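
As a quick check of this example run, metrics.log can also be loaded directly in Python; the dictionary keys are assumptions based on the file description above, not guaranteed names.

# Illustrative sketch (assumed): inspect the serialized metrics of a run.
import pickle

with open("runs/codebert_python_5_128/metrics.log", "rb") as f:
    metrics = pickle.load(f)

# Key names are assumptions; print the available keys first.
print(metrics.keys())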


Replicating the experiments of the paper

To replicate the experiments included in the paper, we provide two scripts that run everything.

  • run_experiments_rq123.py: to replicate the results of RQ1, RQ2 and RQ3.
  • run_experiments_rq4.py: to replicate the results of RQ4.

You may have to change a few things in these two scripts, such as CUDA_VISIBLE_DEVICES, i.e., the GPU used by PyTorch. The scripts generate the results of all experiments in separate folders, as described in the previous section of this README.

After running the experiments (i.e., replicating the RQs), you can generate plots similar to the ones included in the paper using the plot_graphs.py script by specifying the base path of the run directories: python plot_graphs.py --run_dir ./runs.


Hardware specifications

In principle, all the experiments of this paper can be reproduced by following the instructions above. For completeness, we also provide the specifications of the hardware and OS we used to obtain the results included in the paper.

OS: Gentoo Linux
OS release: Gentoo Base System 2.8
Kernel: Linux 5.15.41-gentoo-x86_64
GPU: NVIDIA GeForce RTX 3090
CUDA version: 11.3

Unfortunately, our probe cannot be run without at least one GPU. We also cannot guarantee that the scripts work with GPUs other than NVIDIA's.


You can cite our work if you find this repository or the paper useful:

@misc{hernandez-ast-probe-2022,
  doi = {10.48550/ARXIV.2206.11719},
  url = {https://arxiv.org/abs/2206.11719},
  author = {López, José Antonio Hernández and Weyssow, Martin and Cuadrado, Jesús Sánchez and Sahraoui, Houari},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Programming Languages (cs.PL), Software Engineering (cs.SE), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {AST-Probe: Recovering abstract syntax trees from hidden representations of pre-trained language models},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
