GitHub Diffs #31
Comments
What will be the filtering criteria for repositories we're going to index for scraping diffs?
Yes, these seem like sensible criteria; I think that should be everything we need.
By length criteria, do you mean the Length of
I meant the length of the combined data, but after checking with Louis we decided this doesn't need to be filtered, because the constraint is too variable and model-dependent. So the criteria you mention above should be fine on their own.
Updated to remove Python-specific stuff, to allow for scraping all languages.
We also need to include only diffs that modify files, not ones that delete files or create new ones. We should also filter out unhelpful commit messages, such as ones with fewer than a few words.
Discussed this with Joel, and we think that diffs which create files could be useful at some point in the future, and potentially those which delete files too: not necessarily for ELM replication, but for training refactoring models. Since this dataset could be used on several possible projects, I think it will help long term not to remove these from the scrape. Filtering out unhelpful commit messages seems good, but I can think of some scenarios where short commit messages are helpful, so we need to decide carefully how to do that.
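One way to keep create/delete diffs in the scrape while still making them easy to filter later is to label each record instead of dropping it. The sketch below is only an illustration of that idea: the function names and the `MIN_WORDS` threshold are hypothetical, and it assumes each patch is a single-file unified diff in git's format, where a created file has `--- /dev/null` as its old path and a deleted file has `+++ /dev/null` as its new path.

```python
MIN_WORDS = 3  # hypothetical threshold; the thread has not settled on a value


def classify_change(patch_text: str) -> str:
    """Label a single-file unified diff as 'create', 'delete', or 'modify'."""
    for line in patch_text.splitlines():
        if line.startswith("--- /dev/null"):
            return "create"  # no old version of the file exists
        if line.startswith("+++ /dev/null"):
            return "delete"  # no new version of the file exists
        if line.startswith("@@"):
            break  # file headers are over once the first hunk starts
    return "modify"


def is_short_message(message: str, min_words: int = MIN_WORDS) -> bool:
    """Flag commit messages with fewer than `min_words` whitespace-separated words."""
    return len(message.split()) < min_words
```

Storing these as extra columns (rather than filtering at scrape time) leaves the decision about short but helpful messages to each downstream project.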
GitHub Diffs
Description
Dataset is on BigQuery as a table of commit hashes and messages.
Procedure
From commit hash and message, produce dict containing:
For each commit, this requires downloading the files as they are after the changes and applying the reverse patch to obtain the files as they were before the changes.
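The reverse-patch step can be illustrated with a small pure-Python sketch. This is not the code from the gist; it is a minimal example that assumes a single-file unified diff and an "after" file that matches the patch exactly (no fuzzy matching). The name `reverse_apply` is hypothetical. In practice a tool such as `git apply -R` or `patch -R` would likely be more robust.

```python
import re

# Matches hunk headers such as "@@ -1,3 +1,4 @@"; a missing length means 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def reverse_apply(after_text: str, patch_text: str) -> str:
    """Recover the 'before' file from the 'after' file and a unified diff."""
    lines = after_text.splitlines()
    hunks = []
    current = None
    for line in patch_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            length = int(match.group(4)) if match.group(4) is not None else 1
            current = {"start": int(match.group(3)), "len": length,
                       "before": [], "after": []}
            hunks.append(current)
        elif current is None or line.startswith("\\"):
            continue  # file headers, or "\ No newline at end of file"
        elif line.startswith("+"):
            current["after"].append(line[1:])
        elif line.startswith("-"):
            current["before"].append(line[1:])
        else:
            # Context line (leading space, or an empty line with it stripped).
            current["before"].append(line[1:])
            current["after"].append(line[1:])

    # Work from the last hunk upwards so earlier line numbers stay valid.
    for hunk in reversed(hunks):
        # With a zero-length range, the start is the line before the change.
        idx = hunk["start"] - 1 if hunk["len"] > 0 else hunk["start"]
        if lines[idx:idx + hunk["len"]] != hunk["after"]:
            raise ValueError("patch does not match the 'after' file")
        lines[idx:idx + hunk["len"]] = hunk["before"]

    result = "\n".join(lines)
    return result + "\n" if after_text.endswith("\n") else result
```

Each hunk's context and `+` lines describe the file after the commit, and its context and `-` lines describe the file before it, so swapping one block for the other undoes the change.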
We also need to decide on a suitable length threshold to filter on, since we need to include most or all of the before file in the context window, which significantly restricts the number of lines a file can have.
Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a
dataset.py
base classes
Example
Give an example of the columns and data: