The code `tokval = " ".join(tokval.split())` normalizes all whitespace inside a token. Is this appropriate for a benchmark intended to measure next-token prediction? Shouldn't whitespace be kept, so the benchmark measures the model's ability to predict the next token in "normal" code? Here is the location in the pre-processing script: https://github.com/microsoft/CodeXGLUE/blob/ac74a62802a0dd159b3258c78a2df8ad36cdf2b9/Code-Code/CodeCompletion-token/dataset/py150/preprocess.py#L53C17-L53C50
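To illustrate the concern, here is a minimal sketch of what that line does to a string token whose internal whitespace is meaningful (the example token value is hypothetical, but any token containing newlines, tabs, or runs of spaces is collapsed the same way):

```python
# A hypothetical multi-line string token as the tokenizer might emit it.
tokval = '"""line one\n    line two"""'

# The normalization from preprocess.py: split on any whitespace run,
# then re-join with single spaces. Newlines and indentation are lost.
normalized = " ".join(tokval.split())

print(normalized)  # '"""line one line two"""'
```

A model evaluated on the normalized stream never has to predict the original newlines or indentation, which is what makes the reported accuracy potentially optimistic.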
"Line level code completion task shares the train/dev dataset with token level completion" so it might have more impact there - giving overly optimistic results..
Maybe a dedicated token could be used in the pre-processing to distinguish between spaces that separate tokens and spaces that are part of the structure of the code?
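One way this suggestion could look, as a rough sketch: replace structural whitespace with explicit placeholder tokens before collapsing, so that the information survives normalization and remains predictable. The placeholder names `<NEWLINE>` and `<TAB>`, and the helper `mark_whitespace`, are invented here for illustration, not part of the CodeXGLUE pipeline:

```python
def mark_whitespace(tokval):
    # Turn structural whitespace into explicit, predictable tokens
    # (placeholder names are hypothetical).
    return tokval.replace("\n", " <NEWLINE> ").replace("\t", " <TAB> ")

# Same hypothetical token as before, now with markers inserted
# before the whitespace-collapsing step from preprocess.py.
tokval = '"""line one\n\tline two"""'
marked = " ".join(mark_whitespace(tokval).split())

print(marked)  # '"""line one <NEWLINE> <TAB> line two"""'
```

With this scheme the normalization still produces a single-space-separated stream, but the model is now asked to predict where newlines and indentation occur, rather than having them erased from the task.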