GitHub Diffs #31

Open
5 of 6 tasks
herbiebradley opened this issue Sep 27, 2022 · 7 comments
Labels
dataset-request Request for addition of new dataset

Comments

@herbiebradley
herbiebradley commented Sep 27, 2022

GitHub Diffs

Description

Dataset is on BigQuery as a table of commit hashes and messages.

Procedure

From the commit hash and message, produce a dict containing:

  • Raw files before changes
  • Commit message
  • Diff file

For each commit, this requires downloading the files as they are after the change and applying the reverse patch to recover the files as they were before it.
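
As a rough illustration of the reverse-patch step, here is a minimal sketch that rebuilds the "before" lines from the "after" lines plus a unified diff. It is not the code from the linked gist; `reverse_apply` and the sample data are hypothetical, and it assumes a well-formed single-file unified diff. In practice an existing tool such as `patch -R` or `git apply -R` may be the safer choice.

```python
import re

# Matches hunk headers such as "@@ -3,5 +3,5 @@" (a missing count defaults to 1).
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def reverse_apply(after_lines, diff_lines):
    """Rebuild the pre-commit lines from the post-commit lines and a unified diff.

    Hypothetical helper: both arguments are lists of lines without trailing
    newlines, and the diff is assumed to describe exactly one file.
    """
    before = []
    pos = 0  # index of the next unconsumed line in after_lines
    i = 0
    while i < len(diff_lines):
        match = HUNK_HEADER.match(diff_lines[i])
        i += 1
        if match is None:
            continue  # file headers ("---", "+++") and other non-hunk lines
        old_left = int(match.group(2) or 1)
        new_left = int(match.group(4) or 1)
        # With a zero count the start names the line *before* the hunk.
        new_start = int(match.group(3))
        hunk_start = new_start - 1 if new_left else new_start
        # Lines between hunks are identical in both versions.
        before.extend(after_lines[pos:hunk_start])
        pos = hunk_start
        while (old_left > 0 or new_left > 0) and i < len(diff_lines):
            line = diff_lines[i]
            i += 1
            tag, text = line[:1], line[1:]
            if tag == "\\":  # "\ No newline at end of file"
                continue
            if tag == "+":  # added by the commit: absent from the before file
                pos += 1
                new_left -= 1
            elif tag == "-":  # removed by the commit: restore it
                before.append(text)
                old_left -= 1
            else:  # context line: present in both versions
                before.append(text)
                pos += 1
                old_left -= 1
                new_left -= 1
    before.extend(after_lines[pos:])
    return before


# Made-up sample data, loosely modelled on a version bump in a setup.py.
after = [
    "from setuptools import setup",
    "",
    "setup(",
    "  name = 'demo',",
    "  version = '0.26.3',",
    "  license='MIT',",
    ")",
]
diff = [
    "--- a/setup.py",
    "+++ b/setup.py",
    "@@ -3,5 +3,5 @@",
    " setup(",
    "   name = 'demo',",
    "-  version = '0.26.1',",
    "+  version = '0.26.3',",
    "   license='MIT',",
    " )",
]
before = reverse_apply(after, diff)
record = {"before_file": before, "commit_message": "Change version", "diff": diff}
```

The resulting `record` mirrors the three fields listed above; the exact schema used by the real pipeline may differ.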

We also need to decide on a suitable length threshold to filter on: most or all of the before file has to fit in the context window, which significantly restricts how many lines a file can have.

Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a

  • Minimal working example
  • Decide on length threshold
  • parquet output
  • Inherit from dataset.py base classes
  • Parallel processing
  • Bitbucket modifications - see Bitbucket diffs #5

Example

Give an example of the columns and data:

| before_file | commit_message | diff |
| --- | --- | --- |
| ['from setuptools import setup, find_packages\n', '\n', 'setup(\n', ... ] | Change version | [{'addition_count': 1, 'deletion_count': 1, 'hunks': [[[3, 7], [3, 7], '', ' setup(', " name = 'denoising-diffusion-pytorch',", ' packages = find_packages(),', "- version = '0.26.1',", "+ version = '0.26.3',", " license='MIT',", " description = 'Denoising Diffusion Probabilistic " "Models - Pytorch',", " author = 'Phil Wang',"]], 'patch_info': <PatchInfo: diff --git a/setup.py b/setup.py>, 'src_file': 'a/setup.py', 'tgt_file': 'b/setup.py'}] |

@herbiebradley herbiebradley added the dataset-request Request for addition of new dataset label Sep 27, 2022
@reshinthadithyan
Collaborator

reshinthadithyan commented Sep 28, 2022

What will be the filtering criteria for repositories we're going to index for scraping diffs?

>10 GitHub stars
>2 commits
Must have a liberal license
Exclude forks

cc @ncoop57, @herbiebradley
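
The four criteria above could be expressed as a simple predicate over repository metadata. This is only a sketch: the `Repo` fields, the `keep_repo` name, and the license allow-list are assumptions made for illustration, since the thread does not say which licenses count as liberal or how the metadata is stored.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: an illustrative allow-list, not a list agreed on in this thread.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}


@dataclass
class Repo:
    """Hypothetical repository metadata record."""
    name: str
    stars: int
    commits: int
    license: Optional[str]
    is_fork: bool


def keep_repo(repo: Repo) -> bool:
    """Apply the proposed criteria: >10 stars, >2 commits, liberal license, no forks."""
    return (
        repo.stars > 10
        and repo.commits > 2
        and not repo.is_fork
        and (repo.license or "").lower() in PERMISSIVE_LICENSES
    )


# Made-up sample repositories.
repos = [
    Repo("popular-lib", stars=250, commits=400, license="MIT", is_fork=False),
    Repo("tiny-toy", stars=3, commits=12, license="MIT", is_fork=False),
    Repo("someones-fork", stars=40, commits=90, license="Apache-2.0", is_fork=True),
    Repo("unlicensed", stars=99, commits=50, license=None, is_fork=False),
]
kept = [repo.name for repo in repos if keep_repo(repo)]
```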

@herbiebradley
Author

herbiebradley commented Sep 28, 2022

Yes, these seem like sensible criteria; I think that should be everything we need.

@reshinthadithyan
Collaborator

reshinthadithyan commented Sep 28, 2022

By length criteria, do you mean the length of commit_message?
If so, the table has a commit message column, so we can query with length constraints.

@herbiebradley
Author

I meant the length of the combined data, but after checking with Louis we decided this doesn't need to be filtered, because the constraint is too variable and model-dependent.

So the criteria you mention above should be sufficient on their own.

@herbiebradley
Author

Updated to remove the Python-specific parts, to allow for scraping all languages.

@ncoop57
Collaborator

ncoop57 commented Sep 28, 2022

We also need to include only diffs that modify files, not ones that delete files or create new ones. We should also filter out unhelpful commit messages, such as ones with fewer than a few words.

@herbiebradley
Author

> We also need to include only diffs that modify files, not ones that delete files or create new ones. We should also filter out unhelpful commit messages, such as ones with fewer than a few words.

Discussed this with Joel and we think that at least diffs which create files could be useful at some point in the future and potentially those which delete files too - not necessarily for ELM replication but for training refactoring models. Since this dataset could be used on several possible projects, I think it will help long term to not remove these from the scrape.

Filtering out unhelpful commit messages seems good, but I can think of some scenarios where short commit messages are helpful, so we need to decide carefully how to do that.
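
One way to keep created and deleted files in the scrape while still allowing later filtering is to tag each diff with its change type instead of dropping it. The sketch below relies on the unified diff convention that a created file has `/dev/null` as its source and a deleted file has `/dev/null` as its target. The function names and the `min_words` threshold are hypothetical, and the word-count heuristic is exactly the kind of rule that misfires on short but useful messages.

```python
def change_type(src_file: str, tgt_file: str) -> str:
    """Classify a file diff as "create", "delete" or "modify".

    Hypothetical helper: expects paths as they appear in the "---" and "+++"
    headers of a unified diff, e.g. "a/setup.py" and "b/setup.py".
    """
    if src_file == "/dev/null":
        return "create"
    if tgt_file == "/dev/null":
        return "delete"
    return "modify"


def is_informative(message: str, min_words: int = 3) -> bool:
    """Naive message filter; min_words is an arbitrary illustrative threshold."""
    return len(message.split()) >= min_words


# Made-up sample records with the src/tgt fields shown in the example table.
samples = [
    {"src_file": "a/setup.py", "tgt_file": "b/setup.py", "commit_message": "Change version"},
    {"src_file": "/dev/null", "tgt_file": "b/README.md", "commit_message": "Add a README with usage notes"},
    {"src_file": "a/old.py", "tgt_file": "/dev/null", "commit_message": "wip"},
]
tagged = [
    {
        **sample,
        "change_type": change_type(sample["src_file"], sample["tgt_file"]),
        "informative": is_informative(sample["commit_message"]),
    }
    for sample in samples
]
```

Note that "Change version", the message from the example row in this issue, falls below a three-word threshold even though it describes the change accurately, which is why the rule needs care.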

@reshinthadithyan reshinthadithyan moved this to Sprint-1 in Pile V2 Oct 3, 2022
@PhungVanDuy PhungVanDuy mentioned this issue Oct 3, 2022
4 tasks
@PhungVanDuy PhungVanDuy moved this from Sprint-1 to In Progress in Pile V2 Oct 4, 2022
@PhungVanDuy PhungVanDuy moved this from In Progress to Sprint-1 in Pile V2 Oct 4, 2022
@ncoop57 ncoop57 moved this from Sprint-1 to In Progress in Pile V2 Nov 11, 2022
@reshinthadithyan reshinthadithyan moved this from In Progress to Done in Pile V2 Jan 28, 2023
Projects
Status: Done
3 participants