GitHub Diffs #31

herbiebradley · 2022-09-27T23:42:59Z

GitHub Diffs

Description

Dataset is on BigQuery as a table of commit hashes and messages.

Procedure

From commit hash and message, produce dict containing:

Raw files before changes
Commit message
Diff file

This requires, for each commit, downloading the files after changes and applying the reverse patch to obtain the files before changes.
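To make the reverse-patch step concrete, here is a minimal sketch in pure Python. It assumes a single-file unified diff and file contents already split into lines without trailing newlines. The function name `reverse_apply` is hypothetical and is not taken from the gist linked in this issue. A real scraper would more likely use a patch library or `git` itself.

```python
import re

# Matches unified diff hunk headers such as "@@ -3,7 +3,7 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def reverse_apply(after_lines, diff_text):
    """Rebuild the pre-commit file from the post-commit file and its diff.

    Assumption: diff_text is a unified diff touching exactly one file.
    """
    before = []
    pos = 0          # index of the next unread line in after_lines
    in_hunk = False  # becomes True once the first hunk header is seen
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            tgt_start = int(match.group(3))
            tgt_len = int(match.group(4)) if match.group(4) is not None else 1
            # With a zero-length target range the header names the line
            # *before* the change, so that line is still unchanged context.
            start_idx = tgt_start - 1 if tgt_len > 0 else tgt_start
            before.extend(after_lines[pos:start_idx])
            pos = start_idx
            in_hunk = True
        elif not in_hunk:
            continue  # skip "diff --git", "---" and "+++" header lines
        elif line.startswith("+"):
            pos += 1  # line added by the commit: absent from the before file
        elif line.startswith("-"):
            before.append(line[1:])  # line removed by the commit: restore it
        elif line.startswith(" "):
            before.append(after_lines[pos])  # unchanged context line
            pos += 1
        # Lines starting with "\" ("No newline at end of file") are ignored.
    before.extend(after_lines[pos:])  # copy everything after the last hunk
    return before
```

Applied to a diff like the `setup.py` example in this issue, the function returns the file with the old `version` line restored and the new one dropped.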

We also need to decide on a suitable length threshold to filter on, since we need to include most or all of the before file in the context window, which significantly restricts the number of lines a file can have.

Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a

Minimal working example
Decide on length threshold
parquet output
Inherit from dataset.py base classes
Parallel processing
Bitbucket modifications - see Bitbucket diffs #5

Example

Give an example of the columns and data:

before_file	commit_message	diff
['from setuptools import setup, find_packages\n', '\n', 'setup(\n', ... ]	Change version	[{'addition_count': 1, 'deletion_count': 1, 'hunks': [[[3, 7], [3, 7], '', ' setup(', " name = 'denoising-diffusion-pytorch',", ' packages = find_packages(),', "- version = '0.26.1',", "+ version = '0.26.3',", " license='MIT',", " description = 'Denoising Diffusion Probabilistic " "Models - Pytorch',", " author = 'Phil Wang',"]], 'patch_info': <PatchInfo: diff --git a/setup.py b/setup.py>, 'src_file': 'a/setup.py', 'tgt_file': 'b/setup.py'}]


reshinthadithyan · 2022-09-28T01:59:52Z

What will be the filtering criteria for repositories we're going to index for scraping diffs?

>10 GitHub stars
>2 commits
Must have a liberal license
Exclude forks
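As a rough sketch, the four criteria above could be expressed as a single predicate. The `RepoInfo` fields and the `PERMISSIVE_LICENSES` set are assumptions made for illustration. The actual BigQuery schema and the exact list of acceptable licenses are not specified in this thread.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: an illustrative set of SPDX identifiers treated as "liberal".
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}


@dataclass
class RepoInfo:
    """Hypothetical per-repository metadata record."""
    name: str
    stars: int
    commits: int
    license: Optional[str]
    is_fork: bool


def keep_repo(repo: RepoInfo) -> bool:
    """Return True if the repository passes all four proposed criteria."""
    has_liberal_license = (repo.license or "").lower() in PERMISSIVE_LICENSES
    return (
        repo.stars > 10          # >10 GitHub stars
        and repo.commits > 2     # >2 commits
        and has_liberal_license  # must have a liberal license
        and not repo.is_fork     # exclude forks
    )
```

Keeping the check in one small function makes it easy to adjust the thresholds later without touching the scraping code.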

cc @ncoop57, @herbiebradley

herbiebradley · 2022-09-28T10:53:20Z

Yes, these seem like sensible criteria. I think that should be everything we need.

reshinthadithyan · 2022-09-28T12:52:13Z

By length criteria, do you mean the length of commit_message?
If that's the case, the table has a commit message column, so we can query with length constraints.

herbiebradley · 2022-09-28T13:41:15Z

I meant the length of the combined data, but after checking with Louis we decided this doesn't need to be filtered, because the constraint is too variable and model-dependent.

So the criteria you mention above should be fine alone.

herbiebradley · 2022-09-28T19:27:51Z

Updated to remove Python-specific stuff, to allow for scraping all languages.

ncoop57 · 2022-09-28T21:53:12Z

We also need to include only diffs that modify files, not ones that delete files or create new ones. We should also filter out unhelpful commit messages, such as ones with fewer than a few words.

herbiebradley · 2022-09-28T22:31:53Z

> We also need to include only diffs that modify files, not ones that delete files or create new ones. We should also filter out unhelpful commit messages, such as ones with fewer than a few words.

Discussed this with Joel and we think that at least diffs which create files could be useful at some point in the future and potentially those which delete files too - not necessarily for ELM replication but for training refactoring models. Since this dataset could be used on several possible projects, I think it will help long term to not remove these from the scrape.
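One way to keep everything in the scrape while still letting downstream projects filter is to tag each file diff with its change type. The sketch below assumes the per-file dict has the `src_file` and `tgt_file` keys shown in the example above, and that created or deleted files use `/dev/null` on the missing side, as unified diffs conventionally do. The function name `change_type` is hypothetical.

```python
def change_type(file_diff: dict) -> str:
    """Label a per-file diff as 'create', 'delete' or 'modify'.

    Assumption: a missing side is recorded as None or '/dev/null'.
    """
    missing = (None, "/dev/null")
    if file_diff.get("src_file") in missing:
        return "create"  # no source file: the commit created it
    if file_diff.get("tgt_file") in missing:
        return "delete"  # no target file: the commit deleted it
    return "modify"
```

ELM replication could then select only `modify` rows, while refactoring work could use all three.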

Filtering out unhelpful commit messages seems good, but I can think of some scenarios where short commit messages are helpful, so we need to decide carefully how to do that.
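As a starting point for that discussion, here is one possible heuristic, not a decided policy: reject empty messages and a small denylist of generic one-word messages, and otherwise require a minimum word count on the first line. The denylist contents, the `min_words` default, and the function name `is_informative` are all assumptions chosen for illustration.

```python
# Assumption: an illustrative denylist of messages that carry no information.
GENERIC_MESSAGES = {"update", "updates", "fix", "fixes", "wip", "minor", "changes", "commit"}


def is_informative(message: str, min_words: int = 2) -> bool:
    """Heuristically decide whether a commit message is worth keeping."""
    stripped = message.strip()
    if not stripped:
        return False
    first_line = stripped.splitlines()[0]
    words = first_line.split()
    # Normalise for the denylist check: lowercase, no trailing punctuation.
    normalised = first_line.lower().rstrip(".!")
    if normalised in GENERIC_MESSAGES:
        return False
    return len(words) >= min_words
```

With these defaults, a short but specific message like "Change version" from the example above is kept, while "update" or "wip" is dropped.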

herbiebradley added the dataset-request Request for addition of new dataset label Sep 27, 2022

herbiebradley mentioned this issue Sep 27, 2022

Bitbucket diffs #5

Open

reshinthadithyan moved this to Sprint-1 in Pile V2 Oct 3, 2022

reshinthadithyan added this to Pile V2 Oct 3, 2022

PhungVanDuy mentioned this issue Oct 3, 2022

gitlab #4

Open

4 tasks

PhungVanDuy moved this from Sprint-1 to In Progress in Pile V2 Oct 4, 2022

PhungVanDuy moved this from In Progress to Sprint-1 in Pile V2 Oct 4, 2022

herbiebradley mentioned this issue Oct 7, 2022

GitHub Diffs #36

Open

ncoop57 moved this from Sprint-1 to In Progress in Pile V2 Nov 11, 2022

reshinthadithyan moved this from In Progress to Done in Pile V2 Jan 28, 2023
