Guide for replicating the results of the paper "Project-Level Encoding for Neural Source Code Summarization of Subroutines", accepted at ICPC 2021, Madrid, Spain (held virtually due to the pandemic).

Arxiv - https://arxiv.org/abs/2103.11599

Step 0 - Dataset building

We began with the main Java dataset of 2.1M methods, as well as the complete 50 million method extended set requested from LeClair et al. (http://leclair.tech/data/funcom/).

The dataset was filtered for duplicates, and project context was then constructed from the extended set using the scripts in the "builder" folder.
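
As a rough illustration of the two operations described above (duplicate filtering and grouping methods by project), the following sketch may help. It is not the code in the "builder" folder, which remains the authoritative implementation. The record fields `project_id` and `code`, and the whitespace-normalized exact-match rule for duplicates, are assumptions made for this example only.

```python
from collections import defaultdict


def normalize(code):
    """Collapse whitespace so trivially reformatted copies compare equal."""
    return " ".join(code.split())


def filter_duplicates(methods):
    """Keep the first occurrence of each distinct (normalized) method body."""
    seen = set()
    unique = []
    for m in methods:
        key = normalize(m["code"])
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique


def group_by_project(methods):
    """Map each (hypothetical) project_id to the methods belonging to it."""
    projects = defaultdict(list)
    for m in methods:
        projects[m["project_id"]].append(m)
    return dict(projects)


# Made-up records, for illustration only.
sample = [
    {"project_id": 1, "code": "int a() { return 1; }"},
    {"project_id": 1, "code": "int  a()  { return 1; }"},  # whitespace-only duplicate
    {"project_id": 2, "code": "int b() { return 2; }"},
]
unique_methods = filter_duplicates(sample)
context = group_by_project(unique_methods)
```

Grouping after deduplication means each project's context only contains methods that survived the filter; whether the actual scripts work in this order beyond what the sentence above states is not something this sketch can confirm.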

We provide the compiled dataset as well as the scripts used to compile it. This data can be found at: https://drive.google.com/drive/folders/10r-I5C76e-1vZuY5gqYQQNKgAY5MOJpg?usp=sharing

Step 1 - Training

To avoid path errors or the need to edit the scripts, create the directory /nfs/projects and clone this git repository into it. Download and unpack all data from the data link in Step 0 into this directory as well. Create the directory outdir with 4 subdirectories: outdir/{models, histories, viz, predictions}. Use requirements.txt to bring your Python 3.x virtual environment in sync with our setup; venv is preferred.

Common issues that might arise from updating an existing venv, and their solutions:

GPU not recognized: check that your GPU's cuDNN/CUDA versions and other drivers are compatible with the installed keras and tf versions.
Tf unable to allocate tensor: uninstall tensorflow, then update tensorflow-gpu only. Note that we have not tested our setup with tf 2.x.
keras "learning rate" error: cleanly uninstall keras and install keras 2.3.1 (pip upgrade is broken for this dependency, so it will not work).
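
The setup described above can be checked with a small script. This is only a sketch: the helper names `make_outdir`, `installed_version`, and `report_versions` are hypothetical and not part of this repository, and the path in the usage comment is the one assumed by the commands in this guide.

```python
from importlib import metadata
from pathlib import Path

SUBDIRS = ("models", "histories", "viz", "predictions")


def make_outdir(repo_root):
    """Create outdir and its four subdirectories under repo_root."""
    created = []
    for name in SUBDIRS:
        path = Path(repo_root) / "outdir" / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages=("keras", "tensorflow", "tensorflow-gpu")):
    """Collect the installed versions of the packages named in the notes above."""
    return {name: installed_version(name) for name in packages}


# Example usage:
# make_outdir("/nfs/projects/projcon")
# print(report_versions())  # keras should report 2.3.1 to match the note above
```

Reading versions through package metadata avoids importing tensorflow or keras, so the check still runs when the GPU or driver setup is broken.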

To train the most basic project-level context model, use the following command:

time python3 train.py --model-type=attendgru-pc --batch-size=50 --epochs=10 --datfile=dataset_random.pkl --gpu=0

Note: use --datfile=dataset_3Drandom.pkl for the code2seq and graph2seq models, or for any custom models using AST graphs that you might want to test. This applies to the --datfile argument of all scripts in this project.
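
The choice of --datfile can be sketched as a small helper that assembles the training command. The flags are the ones shown in the command above; the helper names and the assumption that "code2seq" and "graph2seq" are the exact --model-type values for the AST-graph models are ours, so check train.py before relying on them.

```python
import shlex

# Assumption: these strings are the --model-type names of the AST-graph models.
AST_GRAPH_MODELS = {"code2seq", "graph2seq"}


def datfile_for(model_type):
    """Pick the dataset file according to the note above."""
    if model_type in AST_GRAPH_MODELS:
        return "dataset_3Drandom.pkl"
    return "dataset_random.pkl"


def build_train_command(model_type, batch_size=50, epochs=10, gpu=0):
    """Return the training command as an argument list (suitable for subprocess.run)."""
    return [
        "python3", "train.py",
        f"--model-type={model_type}",
        f"--batch-size={batch_size}",
        f"--epochs={epochs}",
        f"--datfile={datfile_for(model_type)}",
        f"--gpu={gpu}",
    ]


cmd = build_train_command("attendgru-pc")
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) from the repository directory would start training; printing it first makes it easy to compare against the command given above.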

Scripts for Actionwords from S. Haque et al. (https://github.com/actionwords/actionwords), used for the RQ2 table, can be found in the firstwords folder and largely follow the same pattern as these scripts (predictions are provided as well).

Step 2 - Predictions

The training output printed to the screen shows the epoch at which the model converges, that is, when validation accuracy stops increasing noticeably, or just before it starts to decrease and validation loss goes up. Once that epoch is identified, run the following script, replacing the file in this example with the trained model's epoch and timestamp.

python3 predict.py /nfs/projects/projcon/outdir/models/attendgru-pc_E09_random_1608163249.h5 --datfile=dataset_random.pkl --gpu=0
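
One way to make the epoch choice less manual is sketched below. It uses the lowest validation loss as a simple proxy for the convergence criterion described above, which is a simplification of ours. The file name pattern is inferred from the example path in the predict command, and the loss values are made up for illustration.

```python
def pick_epoch(val_loss):
    """Return the 1-based epoch with the lowest validation loss."""
    if not val_loss:
        raise ValueError("val_loss must contain at least one epoch")
    best_index = min(range(len(val_loss)), key=lambda i: val_loss[i])
    return best_index + 1


def model_filename(model_type, epoch, timestamp, dataset="random"):
    """Build a name following the pattern seen in the example command (an inference)."""
    return f"{model_type}_E{epoch:02d}_{dataset}_{timestamp}.h5"


# Made-up validation losses for ten epochs, for illustration only.
losses = [3.1, 2.6, 2.3, 2.1, 2.0, 1.95, 1.92, 1.90, 1.89, 1.93]
epoch = pick_epoch(losses)
name = model_filename("attendgru-pc", epoch, 1608163249)
```

With these numbers the loss starts rising after epoch 9, so the sketch selects the E09 file. In practice, also look at validation accuracy as the text above suggests.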

Predicted comments for all models are provided in the predictions folder.

Step 3 - Ensemble Predictions

A script that runs ensembles using the mean predictions of two models is also provided. After identifying the two best-performing model files, run it with this simple modification of the predict command:

python3 predict_ensemble.py /nfs/projects/projcon/outdir/models/attendgru_E10_random_1609946700.h5 /nfs/projects/projcon/outdir/models/attendgru-pc_E09_random_1608163249.h5 --datfile=dataset_random.pkl --gpu=0

Predicted comments for all ensembles are provided in the predictions folder.
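
The idea of a mean ensemble can be illustrated for a single decoding step as follows. This is a minimal sketch that assumes each model outputs a probability distribution over the same vocabulary; it is not taken from predict_ensemble.py, and the numbers are toy values.

```python
def mean_ensemble(probs_a, probs_b):
    """Average two models' probability distributions element-wise."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same vocabulary")
    return [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy distributions over a 3-word vocabulary for one decoding step.
model_a = [0.6, 0.3, 0.1]  # prefers word 0
model_b = [0.2, 0.7, 0.1]  # prefers word 1
averaged = mean_ensemble(model_a, model_b)
next_word = argmax(averaged)
```

Here the averaged distribution favors word 1, because the second model is more confident in it than the first model is in word 0.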

Step 4 - Metrics

BLEU and ROUGE scores, as well as a comparison script to isolate the maximum improvement, are provided as bleu.py, rougemetric.py, and bleucompare.py. All of them can be run with similar commands:

 python3 rougemetric.py /nfs/projects/projcon/outdir/predictions/predict-attendgru_E10_random_1609946700-attendgru-pc_E09_random_1608163249.txt

 python3 bleu.py /nfs/projects/projcon/outdir/predictions/predict-attendgru_E10_random_1609946700-attendgru-pc_E09_random_1608163249.txt
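
For readers unfamiliar with what the BLEU score measures, the sketch below computes a plain sentence-level BLEU with clipped n-gram precision (n = 1 to 4) and a brevity penalty. It is a textbook-style illustration without smoothing, not the implementation in bleu.py, so its numbers should not be expected to match the scripts above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(ref, hyp, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


def sentence_bleu(ref, hyp, max_n=4):
    """Unsmoothed BLEU: geometric mean of precisions times the brevity penalty."""
    precisions = [modified_precision(ref, hyp, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision makes the geometric mean zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_mean)


# Toy reference and prediction, for illustration only.
reference = "returns the name of the user".split()
prediction = "returns the name of the user".split()
score = sentence_bleu(reference, prediction)
```

An exact match scores 1.0, and a prediction sharing no words with the reference scores 0.0. Corpus-level BLEU, as usually reported in papers, aggregates n-gram counts over all samples rather than averaging sentence scores.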
