This repo serves as the official implementation of the ACL 2021 Findings paper "Code Summarization with Structure-induced Transformer".
If you have any questions, feel free to email me.
```
pip install -r requirements.txt
```
For Python, we follow the pipeline in https://github.com/wanyao1992/code_summarization_public.
For Java, we fetch the data from https://github.com/xing-hu/TL-CodeSum.
In the paper, we wrote the scripts to parse code into ASTs on our own, but it is a tough task. We are trying to find a cleaner way to do this and will then rerun SiT on it.
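For a rough sense of what AST parsing looks like on the Python side, here is a minimal sketch using only the standard `ast` module; this is not the script used in the paper, just an illustration of turning a snippet into a tree of nodes:

```python
import ast

# Parse a small Python snippet into an AST.
code = "def add(a, b):\n    return a + b\n"
tree = ast.parse(code)

# Walk the tree and print each node type with its direct children.
for node in ast.walk(tree):
    children = [type(child).__name__ for child in ast.iter_child_nodes(node)]
    print(type(node).__name__, "->", children)
```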
To just reproduce the results, you can download the data we used directly from here and put both `python` and `java` in the `data` directory.
The `adjacency` files are too large to load on my personal server, so I allocate a GUID for each code snippet in `.guid` and retrieve them one by one. What you need to do is:
```
cd sit3
unzip adjacency.zip
```
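As a minimal sketch of the one-by-one retrieval idea, the snippet below assumes the unzipped archive stores one adjacency matrix per snippet as a `.npy` file named by its GUID; the paths `python.guid` and `adjacency/` are illustrative, so check the actual layout after unzipping:

```python
import os

import numpy as np


def load_adjacency(guid, adjacency_dir="adjacency"):
    """Load the adjacency matrix for one code snippet by its GUID.

    Hypothetical layout: one .npy file per snippet, named by its GUID.
    """
    return np.load(os.path.join(adjacency_dir, f"{guid}.npy"))


# Stream GUIDs from the .guid file and fetch adjacency matrices one at a time,
# so the full set never has to sit in memory at once.
with open("python.guid") as f:  # file name is illustrative
    for guid in (line.strip() for line in f):
        adj = load_adjacency(guid)
        # ... hand `adj` to the data loader for this snippet ...
```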
**Training**
```
cd main
python train.py --dataset_name python --model_name YOUR_MODEL_NAME
```
You can follow the training log with:
```
vi ../modelx/YOUR_MODEL_NAME.txt
```
In the paper, we train SiT for 150 epochs. For example, on Java:
```
01/18/2021 01:12:25 PM: [ dev valid official: Epoch = 150 | bleu = 44.89 | rouge_l = 55.25 | Precision = 61.14 | Recall = 57.81 | F1 = 56.95 | examples = 8714 | valid time = 58.93 (s) ]
```
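If you want to pull the validation numbers out of the log programmatically, a small sketch like the following works for entries in the format shown above (the log path is just an example):

```python
import re

# Matches "Epoch = 150 | bleu = 44.89 | rouge_l = 55.25 | ..." style entries.
PATTERN = re.compile(r"Epoch = (\d+) \| bleu = ([\d.]+) \| rouge_l = ([\d.]+)")

with open("../modelx/YOUR_MODEL_NAME.txt") as f:
    for line in f:
        match = PATTERN.search(line)
        if match:
            epoch, bleu, rouge_l = match.groups()
            print(f"epoch {epoch}: BLEU {bleu}, ROUGE-L {rouge_l}")
```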
**Testing**
```
python test.py --dataset_name python --beam_size 5 --model_name YOUR_MODEL_NAME
```
**Issue**
For Python, we do not follow the original data split in Wei's paper and consequently rerun both SiT and the Transformer baseline on our own split. This is a potential drawback of the paper when comparing against other LSTM baselines. If you want the original split, please refer to https://github.com/GoneZ5/SCRIPT. Thank you.
Acknowledgement: The implementation is based on https://github.com/wasiahmad/NeuralCodeSum.
```
@inproceedings{hongqiu2021summarization,
  author    = {Wu, Hongqiu and Zhao, Hai and Zhang, Min},
  booktitle = {Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
  title     = {Code Summarization with Structure-induced Transformer},
  year      = {2021}
}
```