Dataset #2

Open
villmow opened this issue Aug 6, 2018 · 12 comments

Comments

@villmow

villmow commented Aug 6, 2018

Hi guys, awesome project. Would you mind releasing the original training and testing dataset, without any pickled or preprocessed files?

@0bserver07

0bserver07 commented Dec 4, 2018

@villmow The files can be downloaded from Gdrive or Baidu.

@matanpugach

@0bserver07 @guxd Could you share the raw method names and descriptions?

@villmow
Author

villmow commented Dec 4, 2018

@matanpugach That's what I meant!

I already downloaded the files from Gdrive, but they are preprocessed. Everyone using your dataset is limited to your features (token, api sequence, name tokens) and a vocabulary limit of 10,000. One cannot restore the original "RawCode" - "Documentation" mapping from your dataset, to - for example - try new features.
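To illustrate why a capped vocabulary cannot be inverted, here is a minimal sketch. It is not the project's actual preprocessing: the `<unk>` placeholder, the helper names `build_vocab` and `encode`, and the toy corpus are all assumptions made for illustration, and the limit is set to 2 instead of 10,000 to keep the example small.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary tokens


def build_vocab(sequences, limit):
    """Keep only the `limit` most frequent tokens across all sequences."""
    counts = Counter(tok for seq in sequences for tok in seq)
    return {tok for tok, _ in counts.most_common(limit)}


def encode(seq, vocab):
    """Replace every token outside the vocabulary with the UNK placeholder."""
    return [tok if tok in vocab else UNK for tok in seq]


# Toy corpus standing in for tokenized method names (assumed data).
corpus = [
    ["read", "file", "lines"],
    ["read", "file", "bytes"],
    ["parse", "json", "file"],
]

vocab = build_vocab(corpus, limit=2)
encoded = [encode(seq, vocab) for seq in corpus]

for original, reduced in zip(corpus, encoded):
    print(original, "->", reduced)
```

The first two sequences collapse to the same encoded form, so the original tokens cannot be recovered from the preprocessed data alone.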

@wanyao1992

Hi @villmow,
I have the same requirement. Did you get the raw datasets without preprocessing from @guxd?

@gauravkoradiya

I have raw Python data. How can I preprocess it the same way you did for Java? I would like to build code search for Python code.

@guxd
Owner

guxd commented Apr 2, 2019

@gauravkoradiya You should use a Python code parser. Python provides an ast module that supports parsing.
Here is a sample project that parses Python code into ASTs:
https://github.com/fyrestone/pycode_similar
You can find more on GitHub.
After that, you need to convert the ASTs into call sequences.
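As a rough sketch of the two steps described above (parse with ast, then turn each function's AST into a call sequence), something like the following could serve as a starting point. It is not the preprocessing used for the Java dataset: the record layout, the helper names `callee_name`, `CallCollector`, and `extract_functions`, and the choice to split names on underscores are assumptions for illustration only.

```python
import ast


def callee_name(call):
    """Return a dotted name for the callee of an ast.Call, or None."""
    parts = []
    target = call.func
    # Walk attribute chains such as handle.read from right to left.
    while isinstance(target, ast.Attribute):
        parts.append(target.attr)
        target = target.value
    if isinstance(target, ast.Name):
        parts.append(target.id)
    return ".".join(reversed(parts)) if parts else None


class CallCollector(ast.NodeVisitor):
    """Collect callee names, visiting inner calls before outer ones."""

    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        self.generic_visit(node)  # descend first so nested calls come first
        name = callee_name(node)
        if name:
            self.calls.append(name)


def extract_functions(source):
    """Return one record per function found in the given source string."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            collector = CallCollector()
            collector.visit(node)
            records.append({
                "name_tokens": [t for t in node.name.split("_") if t],
                "api_sequence": collector.calls,
                "description": ast.get_docstring(node),
            })
    return records


SAMPLE = '''
def read_lines(path):
    """Read a text file and return its lines."""
    with open(path) as handle:
        text = handle.read()
    return text.splitlines()
'''

records = extract_functions(SAMPLE)
for record in records:
    print(record)
```

Note that calls inside nested functions are also attributed to the enclosing function in this sketch, and receiver types are not resolved, so `handle.read` is recorded by variable name, not by class.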

@gauravkoradiya

Awesome, thank you. I got it.

@hoogang

hoogang commented May 21, 2019

Could you share the original datasets without any pickled or preprocessed files?

@jackalhan

I agree with @hoogang. @guxd, would you please release your raw or original datasets without any pickled or preprocessed files?
Thank you.

@LeeSureman

It's a pity the authors have not released the original dataset.

@guxd
Owner

guxd commented Jun 6, 2023

The raw code datasets are available at /pytorch/train.rawcode.rar

9 participants