Code-Code CodeCompletion-token not tokenizing all spaces #190

markNZed · 2024-11-16T09:16:44Z

The line `tokval = " ".join(tokval.split())` has the effect of normalizing whitespace. Is this normal for a benchmark intended to measure next-token prediction? Should whitespace be kept, so that the benchmark measures the model's ability to predict the next token in "normal" code? Here is the location in the pre-processing script:

https://github.com/microsoft/CodeXGLUE/blob/ac74a62802a0dd159b3258c78a2df8ad36cdf2b9/Code-Code/CodeCompletion-token/dataset/py150/preprocess.py#L53C17-L53C50
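To make the effect concrete, here is a minimal sketch that applies the same expression to a few token values. The helper name `normalize_whitespace` and the sample strings are only illustrative and are not taken from the script.

```python
def normalize_whitespace(tokval: str) -> str:
    # Same expression as in the issue: split on any run of whitespace
    # (spaces, tabs, newlines) and re-join with single spaces.
    return " ".join(tokval.split())


# Illustrative token values (assumptions, not from the py150 data).
samples = [
    '"a  b"',                          # string literal with two spaces
    '"""line one\n    line two"""',    # multi-line string with indentation
    "name",                            # token without whitespace is unchanged
]

for tok in samples:
    print(repr(tok), "->", repr(normalize_whitespace(tok)))
```

Any run of spaces, tabs, or newlines collapses to a single space, so the original spacing cannot be recovered from the pre-processed data.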

Since the "Line level code completion task shares the train/dev dataset with token level completion", the normalization might have more impact there, giving overly optimistic results.

Maybe the token should be used in the pre-processing to distinguish between spaces used to separate tokens and spaces in the structure of the code?
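One possible way to keep the two kinds of spaces apart is sketched below. It assumes hypothetical marker strings (`<SP>`, `<TAB>`, `<NL>`) that are not part of the CodeXGLUE script: whitespace inside a token is replaced by a visible marker, so a plain space can only mean "token separator".

```python
# Hypothetical markers for whitespace that occurs inside a token.
# These names are assumptions for illustration, not CodeXGLUE vocabulary.
_MARKERS = {" ": "<SP>", "\t": "<TAB>", "\n": "<NL>"}


def encode_token(tokval: str) -> str:
    # Replace each in-token whitespace character with its marker,
    # one marker per character, so runs of spaces keep their length.
    return "".join(_MARKERS.get(ch, ch) for ch in tokval)


def decode_token(encoded: str) -> str:
    # Reverse the mapping. This is only lossless if the original token
    # did not already contain one of the marker strings.
    for ch, marker in _MARKERS.items():
        encoded = encoded.replace(marker, ch)
    return encoded


tokens = ["x", "=", '"a  b"']
line = " ".join(encode_token(t) for t in tokens)
print(line)
restored = [decode_token(t) for t in line.split(" ")]
print(restored == tokens)
```

This only covers spaces, tabs, and newlines, and it would misread a token that literally contains one of the marker strings, so a real change would need an escaping rule as well.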
