The model in this repository does not generate correct code for arbitrary Python questions. It is trained on a small dataset and produces good programs for questions similar to those in the data. However, it still does not handle indentation properly for very long programs: after a ':' the model correctly predicts the indentation level for the next line, but it fails to capture which statements belong at which indentation level, especially in long programs.
- Remove all comments (# and '''...'''): this reduces the vocabulary size and makes the problem simpler for the model.
- Replace tabs with spaces: this keeps the indentation scheme consistent within a file, especially for files mixing 4-, 3-, and 2-space indentation.
- Replace multi-line variable declarations: we use Python's own tokenizer, which had problems with declarations spanning multiple lines.
- Remove duplicate question-answer pairs: the original data contained many duplicate questions and Python solutions submitted as the same assignment by different team members. After removing duplicates, about 3100 unique question-answer pairs remained out of the original 4600+.
- Tokenization: as mentioned above, we use Python's own tokenizer. One problem is that it treats string literals, such as the one in print('Akash Kumar'), as a single string token 'Akash Kumar', which unnecessarily inflates the vocabulary. We therefore tokenize string literals as characters to improve the model's versatility.
- Format with consistent indentation: the data used multiple indentation schemes, so we detect the indent width and replace each level with a single '\t' to keep sequences shorter. A simplified sketch of these preprocessing steps is shown after this list.
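Below is a minimal sketch of the comment removal, indentation normalisation, and character-level tokenization of string literals described above, built on Python's tokenize module. The function names, the default indent width, and the order of steps are illustrative assumptions; docstring removal and multi-line declaration handling are omitted.

```python
import io
import tokenize

def strip_comments(src: str) -> str:
    # Blank out '#' comments using Python's own tokenizer
    # (docstring removal is handled separately in the real pipeline).
    lines = src.splitlines(keepends=True)
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start
            lines[row - 1] = lines[row - 1][:col].rstrip(" ") + "\n"
    return "".join(lines)

def normalise_indent(src: str, indent_width: int = 4) -> str:
    # Replace tabs with spaces, then collapse each indent level to a single '\t'
    # so sequences stay short regardless of the original 2/3/4-space scheme.
    out = []
    for line in src.splitlines():
        expanded = line.replace("\t", " " * indent_width)
        body = expanded.lstrip(" ")
        level = (len(expanded) - len(body)) // indent_width
        out.append("\t" * level + body)
    return "\n".join(out)

def to_tokens(src: str) -> list:
    # Tokenize with Python's tokenizer, splitting string literals into characters
    # so arbitrary strings like 'Akash Kumar' do not blow up the vocabulary.
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.STRING:
            tokens.extend(tok.string)   # extend with individual characters
        elif tok.string:
            tokens.append(tok.string)
    return tokens
```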
We primarily used cross entropy as our loss function. We experimented with an additional penalty for code that fails to execute, but:
- The penalty-based loss made training very slow, because for every output in a batch we had to execute the generated script to get the result.
- The model didn't learn. Since there is no way for the parameters to receive gradients with respect to the actual execution of the scripts, we multiplied the loss by a separate constant. This distorts the gradient values, so naturally it didn't work, but we wanted to see whether we could get at least some rudimentary learning and tune it through the punishment_constant hyperparameter (sketched below).
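A minimal sketch of this execution-penalty idea, assuming PyTorch: the per-example cross entropy is scaled by punishment_constant whenever the decoded script fails to run. The default value, the subprocess timeout, and the function name are illustrative, not the repository's exact settings.

```python
import subprocess
import sys
import torch
import torch.nn.functional as F

def penalised_loss(logits, targets, generated_scripts, punishment_constant=2.0):
    # Per-example cross entropy: logits (B, T, V), targets (B, T).
    ce = F.cross_entropy(logits.transpose(1, 2), targets,
                         reduction="none").mean(dim=1)   # shape (B,)

    scale = torch.ones_like(ce)
    for i, script in enumerate(generated_scripts):
        try:
            # Executing every sample in every batch is what made training slow.
            subprocess.run([sys.executable, "-c", script], timeout=2,
                           capture_output=True, check=True)
        except Exception:
            # Execution failed: scale in the (non-differentiable) penalty.
            scale[i] = punishment_constant
    return (ce * scale).mean()
```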
We created Python embeddings using the CONALA mined dataset, which consists of 590763 Python snippets. We train a decoder-only transformer architecture in an autoregressive manner; the task is simply to predict the next token given the previous tokens. We train embeddings for a total of 15018 tokens, obtained by running Python's built-in tokenizer on the CONALA mined dataset.
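A minimal sketch of this embedding pretraining, assuming PyTorch. Only vocab_size=15018 comes from the description above; the layer sizes, learning rate, and batch shape are illustrative, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    # Decoder-only stack: a TransformerEncoder run with a causal mask.
    def __init__(self, vocab_size=15018, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                        # tokens: (B, T)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.lm_head(h)                        # (B, T, vocab)

# One autoregressive training step: predict token t+1 from tokens up to t.
model = TinyCausalLM()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
batch = torch.randint(0, 15018, (8, 64))              # placeholder token ids
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   batch[:, 1:].reshape(-1))
loss.backward()
opt.step()
# After pretraining, model.embed.weight holds the Python token embeddings.
```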
In addition to the 3100 examples from the original data, we add another 2800 examples from the conala-train and conala-test datasets. These datasets have the same format: a natural language prompt and the corresponding Python snippet.
The architecture is the same as in the paper "Attention Is All You Need": an encoder-decoder model in which the encoder processes the natural language prompt and the decoder generates the code using self-attention and multi-headed cross-attention over the encoder outputs.
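A minimal sketch of such an encoder-decoder setup using PyTorch's nn.Transformer. All hyperparameters here are illustrative, and positional encodings are again omitted for brevity.

```python
import torch
import torch.nn as nn

class Prompt2Code(nn.Module):
    def __init__(self, nl_vocab, code_vocab, d_model=256):
        super().__init__()
        self.nl_embed = nn.Embedding(nl_vocab, d_model)
        self.code_embed = nn.Embedding(code_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, code_vocab)

    def forward(self, prompt_ids, code_ids):
        # Encoder reads the prompt; decoder attends to it while generating code.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(code_ids.size(1))
        h = self.transformer(self.nl_embed(prompt_ids),
                             self.code_embed(code_ids),
                             tgt_mask=tgt_mask)
        return self.out(h)                            # (B, T_code, code_vocab)
```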
We used the ROUGE-L metric, which matches the longest common subsequence. Code has a fixed structure in which each snippet builds on the ones before it. In machine translation, the same words can appear at the beginning or at the end of a sentence and still convey the same meaning, so n-gram-based metrics make sense; in code, an n-gram appearing just anywhere does not, which is why we chose ROUGE-L. It scores the match of the longest common subsequence between the target and generated code. We achieve a maximum ROUGE-L score of 15.8 on the validation set. Refer to this file for the code.
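For reference, ROUGE-L can be computed from the longest common subsequence as sketched below; the beta weighting of recall over precision follows the usual ROUGE formulation, and its value here is an assumption rather than the repository's setting.

```python
def rouge_l(reference, candidate, beta=1.2):
    """ROUGE-L F-score between two token sequences via their LCS."""
    m, n = len(reference), len(candidate)
    # Dynamic-programming table for the longest-common-subsequence length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference[i - 1] == candidate[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```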
Here are the different experiments we did, and their corresponding files.
- Vanilla encoder-decoder architecture with word vocab: the first experiment, a simple encoder-decoder architecture using the Python tokenizer and no pretrained embeddings.
- Vanilla encoder-decoder architecture with char vocab: the same as above, but using a character vocabulary for the decoder. We noticed that the decoder outputs lacked spaces between statements such as 'def gdc(x,y)'.
- Penalty for wrong execution: as discussed earlier, the model did not train well with the extra execution penalty because it deflected the gradients from their true direction.
- Training Python embeddings: here we trained decoder embeddings for Python tokens using the 590763 instances of mined data in the conala-mined dataset. The embeddings and their corresponding vocab are in the data folder.
- Conala data with original data: similar to experiment 1, but trained on additional data from the CONALA train and test files.
- Conala data with original data and Python embeddings: the setup we use to report our results. We train on a total of 5937 unique question-answer pairs along with the pretrained embeddings from experiment 4.
Check the file here for better-formatted example outputs. Refer to this file for the code.
Reference paper: https://arxiv.org/pdf/2002.05442.pdf