This is an attempt to use transformers and self-attention to convert English descriptions into Python code.
We will be using this pre-curated dataset to train our transformer model. The format of the data is as follows:
# English Description 1
<Python Code 1>
# English Description 2
<Python Code 2>
# English Description 3
<Python Code 3>
.
.
.
Each English description/question starts with a '#' and is followed by its corresponding Python code. Each data point comprises a question and its corresponding Python code. We can therefore check the first character of each line to detect the start of the next data point: all lines between two lines that start with a '#' form the Python solution.
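
The splitting rule described above can be sketched as follows. This is a minimal illustration, not the project's actual loader: the function name `parse_dataset` and the record keys `question` and `solution` are assumptions made for the example.

```python
def parse_dataset(lines):
    """Split raw lines into records with a question and its solution.

    A line whose first character is '#' starts a new record; every
    following line up to the next '#' line belongs to its solution.
    """
    records = []
    for line in lines:
        if line.startswith("#"):
            # Drop the leading '#' and surrounding whitespace from the question.
            records.append({"question": line[1:].strip(), "solution": ""})
        elif records:
            # Lines before the first '#' are ignored; all others extend the
            # solution of the most recent question.
            records[-1]["solution"] += line
    return records


sample = [
    "# write a python function to add two numbers\n",
    "def add_two_numbers(num1, num2):\n",
    "    sum = num1 + num2\n",
    "    return sum\n",
    "# write a program to print hello\n",
    "print('hello')\n",
]
records = parse_dataset(sample)
```

One consequence of this rule is that a comment placed at column 0 inside a solution would be read as the start of a new question, so the data has to be free of such lines for the split to be exact.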
To parse the Python code further, we use Python's source code tokenizer, which deals effectively with code syntax and indentation (spaces and tabs).
Since we have only 5000 data points, we use data augmentation to increase the size of our dataset. While tokenizing the Python code, we randomly mask the names of some variables (with 'var_1', 'var_2', etc.) so that the model we train does not merely fixate on how the variables are named and instead tries to learn the inherent logic and syntax of the Python code.
For example, consider the following program:
def add_two_numbers (num1 ,num2 ):
    sum =num1 +num2
    return sum
We can replace some of the above variables to create new data points. The following are valid augmentations:
def add_two_numbers (var_1 ,num2 ):
    sum =var_1 +num2
    return sum

def add_two_numbers (num1 ,var_1 ):
    sum =num1 +var_1
    return sum

def add_two_numbers (var_1 ,var_2 ):
    sum =var_1 +var_2
    return sum
In the above example, we have therefore expanded a single data point into 3 more data points using our random variable replacement technique.
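
A rough sketch of how such random variable replacement could be done on top of Python's `tokenize` module is shown below. It is an illustration under stated assumptions, not the project's actual augmentation code: the function name `mask_variables`, the `mask_prob` and `seed` parameters, and the rule for deciding which names count as variables are all hypothetical.

```python
import builtins
import io
import keyword
import random
import tokenize

# Names that are never renamed: keywords and builtins such as print or sum.
ALWAYS_PROTECTED = set(keyword.kwlist) | set(dir(builtins))
# A name directly after one of these tokens is a definition or an import.
DEFINITION_PREFIXES = {"def", "class", "import", "from", "as"}
OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}


def mask_variables(code, mask_prob=0.3, seed=None):
    """Rename a random subset of variables to var_1, var_2, ... consistently."""
    rng = random.Random(seed)
    tokens = list(tokenize.generate_tokens(io.StringIO(code).readline))

    # Pass 1: collect names that should not be touched (assumed heuristic).
    protected = set(ALWAYS_PROTECTED)
    depth = 0
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1].string if i > 0 else ""
        nxt = tokens[i + 1].string if i + 1 < len(tokens) else ""
        if tok.type == tokenize.OP and tok.string in OPENERS:
            depth += 1
        elif tok.type == tokenize.OP and tok.string in CLOSERS:
            depth -= 1
        elif tok.type == tokenize.NAME:
            if prev in DEFINITION_PREFIXES:
                protected.add(tok.string)      # function, class or module name
            elif depth > 0 and nxt == "=":
                protected.add(tok.string)      # keyword argument or default

    # Pass 2: decide once per name whether to mask it, then rewrite tokens.
    replacements = {}
    masked_count = 0
    out = []
    for i, tok in enumerate(tokens):
        text = tok.string
        prev = tokens[i - 1].string if i > 0 else ""
        is_variable = (
            tok.type == tokenize.NAME
            and text not in protected
            and prev != "."                    # leave attribute names alone
        )
        if is_variable:
            if text not in replacements:
                if rng.random() < mask_prob:
                    masked_count += 1
                    replacements[text] = f"var_{masked_count}"
                else:
                    replacements[text] = text
            text = replacements[text]
        out.append((tok.type, text))
    # Two-element tuples make untokenize emit its own spacing.
    return tokenize.untokenize(out)


example = "def add_two_numbers(num1, num2):\n    sum = num1 + num2\n    return sum\n"
masked = mask_variables(example, mask_prob=1.0)
```

Because the decision is made once per distinct name, every occurrence of a masked variable receives the same placeholder, which keeps the augmented program consistent. The heuristic is deliberately conservative and still has gaps: names referenced only inside string literals (for example in f-strings on older Python versions) are not renamed along with the variable.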
We will be using the transformer model as explained in this blog to perform sequence-to-sequence learning on our dataset. Here we treat the English description/question as our source (SRC) and the corresponding Python code as the target (TRG) for training.
We use spaCy's default tokenizer to tokenize our SRC sequence.
SRC = [' ', 'write', 'a', 'python', 'function', 'to', 'add', 'two', 'user', 'provided', 'numbers', 'and', 'return', 'the', 'sum']
We use Python's source code tokenizer to tokenize our TRG. The tokenizer returns several attributes for each token. We keep only the token type and the corresponding string, in the form of a tuple (i.e., (token_type_int, token_string)), as the final token. Our TRG is a sequence of such tuples.
TRG = [(57, 'utf-8'), (1, 'def'), (1, 'add_two_numbers'), (53, '('), (1, 'num1'), (53, ','), (1, 'var_1'), (53, ')'), (53, ':'), (4, '\n'), (5, ' '), (1, 'sum'), (53, '='), (1, 'num1'), (53, '+'), (1, 'var_1'), (4, '\n'), (1, 'return'), (1, 'sum'), (4, ''), (6, ''), (0, '')]
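
A sequence of this shape can be produced with the standard library's `tokenize` module, as in the sketch below. The helper name `tokenize_python` is an assumption for illustration. Note that the integer token types are not stable across Python versions, so the numbers printed on a given interpreter may differ from those shown above; `tokenize.tok_name` maps them back to readable names.

```python
import io
import tokenize


def tokenize_python(code):
    """Return a list of (token_type_int, token_string) tuples for `code`."""
    readline = io.BytesIO(code.encode("utf-8")).readline
    return [(tok.type, tok.string) for tok in tokenize.tokenize(readline)]


code = "def add_two_numbers(num1, var_1):\n    sum = num1 + var_1\n    return sum\n"
trg = tokenize_python(code)

# Readable view of the same sequence, independent of the integer values.
named = [(tokenize.tok_name[tok_type], text) for tok_type, text in trg]

# The tuples are enough to rebuild equivalent source code.
restored = tokenize.untokenize(trg).decode("utf-8")
```

Rebuilding from two-element tuples lets `untokenize` choose its own spacing, which is why reconstructed code has spacing such as `def add_two_numbers (num1 ,var_1 ):` while remaining valid Python.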
We have used augmentations in our dataset to mask variable names. This means that our model can predict a variety of names for a particular variable, and all of them are correct as long as the predictions are consistent throughout the code. Our training labels are therefore not fully certain, so it makes more sense to treat them as correct with probability 1 - smooth_eps and incorrect otherwise. This is what label smoothing does. By adding label smoothing to the cross-entropy loss, we ensure that the model does not become too confident in predicting variable names that can be replaced via augmentations.
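
The effect of label smoothing on the loss can be illustrated with a small, dependency-free sketch. It is not the training code used here: the function name `smoothed_cross_entropy` is hypothetical, and the scheme of spreading smooth_eps equally over the non-target classes is an assumption (some implementations spread it over all classes instead).

```python
import math


def smoothed_cross_entropy(logits, target, smooth_eps=0.1):
    """Cross-entropy between softmax(logits) and a smoothed one-hot target.

    Expects at least two classes.
    """
    num_classes = len(logits)
    # Numerically stable log-softmax.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_norm for x in logits]
    # Assumed scheme: 1 - smooth_eps on the labelled class, with the
    # remaining smooth_eps shared equally by the other classes.
    off_value = smooth_eps / (num_classes - 1)
    loss = 0.0
    for index, log_p in enumerate(log_probs):
        weight = 1.0 - smooth_eps if index == target else off_value
        loss -= weight * log_p
    return loss


logits = [8.0, 0.0, 0.0, 0.0]  # a very confident prediction for class 0
hard = smoothed_cross_entropy(logits, target=0, smooth_eps=0.0)
soft = smoothed_cross_entropy(logits, target=0, smooth_eps=0.1)
```

With `smooth_eps=0.0` the function reduces to ordinary cross-entropy, and the confident prediction gets a loss close to zero. With smoothing, the same prediction is penalised for assigning almost no probability to the alternatives, so `soft` is noticeably larger than `hard`.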
We monitor the training loss and the validation loss to decide when training is complete. The model with the minimum validation loss is used as the final trained model.
It is important to note that label smoothing leads to much higher loss values than those of models trained without it. This is expected, as we do not intend the model to be certain about its label predictions. This is particularly the case for variables, as there can be multiple correct options as long as the predictions are consistent throughout the target code sequence.
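
The higher loss values follow directly from the form of the smoothed target. As a short worked derivation, assume a vocabulary of K tokens, a true label y, a smoothing value 0 < ε < 1 (smooth_eps), and a target distribution q that spreads ε equally over the K - 1 other tokens (this particular scheme is an assumption). The loss for a predicted distribution p is then bounded below by the entropy of q:

```latex
\begin{aligned}
q_i &= \begin{cases} 1-\epsilon, & i = y \\[4pt] \dfrac{\epsilon}{K-1}, & i \neq y \end{cases} \\[6pt]
\mathcal{L}(p) &= -\sum_{i=1}^{K} q_i \log p_i \;=\; H(q) + \mathrm{KL}(q \,\|\, p) \;\ge\; H(q) \\[6pt]
H(q) &= -(1-\epsilon)\log(1-\epsilon) \;-\; \epsilon \log\frac{\epsilon}{K-1}
\end{aligned}
```

Without smoothing the bound is zero, whereas with smoothing it is strictly positive and grows with the vocabulary size, so even a model whose predictions match q exactly reports a non-zero loss.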
Input:
"program to sort a list of dictionaries by key"
Output:
var_1 ={'Nikhil':{'roll':24 ,'marks':17 },
'Akshat':{'roll':54 ,'marks':12 },
'Akash':{'roll':15 },'marks':15 }}
sort_key ='marks'
res ='marks'
res =var_2 (test_dict .items (),key =lambda x :x [1 ][sort_key ])
print ("The sorted dictionary by marks is : "+str (res ))
Input:
"function to merge two lists"
Output:
def merge_lists (l1 :list ,var_1 :list ):
    return list (zip (l1 ,var_1 ))
Input:
"program to find gcd"
Output:
def compute_hcf (x ,y ):
    while (y ):
        x ,y =y ,x %y
    return x
Input:
"program to calculate simple interest"
Output:
princ_amount =float (input (" Please Enter the Principal Amount : "))
var_1 =float (input (" Please Enter the Rate Of Interest : "))
var_2 =float (input (" Please Enter Time period in Years : "))
simple_interest =(princ_amount *var_1 *var_2 )/100
print ("\nSimple Interest for Principal Amount {0} = {1}".format (princ_amount ,simple_interest ))
Input:
"function to sum odd elements of list"
Output:
def sum_odd_elements (l :list ):
    return sum ([i for i in l if i %2 ==1 ])
Input:
"program to multiply integers in a list"
Output:
def multiplyList (myList ):
    result =1
    for x in myList :
        result =result *x
    return result
Input:
"program to reverse a string"
Output:
var_1 ='Today is bad day'
var_1 [::-1 ]
Input:
"function to find reverse of a string"
Output:
def revStr (inp ):
    inp =inp [::-1 ]
    return inp
Input:
"program to find the length of tuple"
Output:
thistuple =("apple","banana","cherry")
print (len (thistuple ))
Input:
"program to find the area of a square"
Output:
side =5
print (f'Area: {side * side}')
Input:
"program to print epoch timestamp"
Output:
import time
import datetime
str1 ="20/01/2020"
element =datetime .strptime (str1 ,"%d/%m/%Y")
timestamp =datetime .timestamp (element )
print (timestamp )
Input:
"program to find ascii value of charecter"
Output:
var_1 ='5'
print (ord (var_1 ))
Input:
"function to find fibonacci sequence"
Output:
def recur_fibo (n ):
    if n <=1 :
        return n
    else :
        return (recur_fibo (n -1 )+recur_fibo (n -2 ))
Input:
"function to find largest element in a dictionary"
Output:
def dic_largest (var_1 ):
    return max (var_1 .values ())
Input:
"program to convert list of tuples into a dictionary"
Output:
var_1 =[("x",1 ),("x",2 ),("y",1 ),("y",1 ),("y",2 ),("z",1 )]
d ={}
for a ,var_2 in var_1 :
    d .setdefault (a ,[]).append (var_2 )
print (d )
Input:
"function to convert to binary"
Output:
def convertToBinary (n ):
    if n >1 :
        convertToBinary (n //2 )
    print (n %2 ,end ='')
Input:
"program to implement a linked list"
Output:
class Node :
    def __init__ (self ,data ):
        self .data =data
        self .next =None
class LinkedList :
    def __init__ (self ):
        self .head =None
    def reverse (self ):
        prev =None
        current =self .next
        while (current is not None ):
            next =current .next
            current =current =current
            current =current
            current =current
        self .next =current
    def push (self ,new_data ):
        next =Node (new_data )
        self .next =self .next =self .next
    def printList (self ):
        temp =self .head =temp
        while (temp ):
            print (temp )
            temp =temp =temp .next
llist =LinkedList ()
llist .push (20 )
llist .push (4 )
llist .push (15 )
llist .push (85 )
print ("Given Linked List")
llist .printList ()
llist .printList ()
llist .printList ()
llist .printList ("\nReversed Linked List")
llist .printList ()
Input:
"program to add lists elementwise"
Output:
var_1 =[1 ,2 ,3 ,4 ,5 ]
list2 =[5 ,6 ,7 ]
final =[a +b for a in var_1 for b in var_1 ]
print (f"sum of every pair of numbers from two lists:{final}")
Input:
"program to find common values between two sets"
Output:
var_1 ={10 ,20 ,30 ,40 ,50 }
set2 ={60 ,70 ,80 ,90 ,10 }
if var_1 ==1 :
    print ("Two sets have no items in common")
else :
    print ("Two sets have items in common")
print (var_1 .intersection (y ))
Input:
"program to find number of unique values in a list"
Output:
input_list =[1 ,2 ,2 ,5 ,4 ,4 ,3 ,8 ]
var_1 =[]
for var_2 in input_list :
    if var_2 not in var_1 :
        var_1 +=1
print ("No of unique items are:",var_1 )
Input:
"function to remove empty lists from a list of lists"
Output:
def removeEmptyList (li ):
    res =[ele for ele in li if ele !=[]]
    return res
Input:
"write a function to capitalize a string"
Output:
def capitalize (text ):
    return text .title ()
Input:
"write a function to find the area of a circle"
Output:
def circle_area (r ):
    return 22 /7 *r **2
Input:
"write a python program to merge two dictionaries"
Output:
x ={'key1':'val1','key2':'val2'}
y ={'key3':'val3','key4':'val4'}
z ={**x ,**y }# z = x | y
Input:
"write a function to find factorial"
Output:
def factorial (n ):
    if n ==0 :
        return 1
    else :
        return n *factorial (n -1 )