Bitbucket Code #34
I got started here using the Google Drive parquet file, but ran into some questions before I could make much more progress.

```python
import os
import shutil
import pyarrow.parquet as pq
from urllib.request import urlopen

bitbucket_repos = pq.read_table(
    "bitbucket_version_1.parquet",
    columns=["full_name", "mainbranch", "description", "uuid"],
).to_pandas()
# Extract the branch name from the nested mainbranch struct (None stays None).
bitbucket_repos["mainbranch"] = bitbucket_repos["mainbranch"].apply(lambda x: x and x["name"])

def download_repo(repo):
    main_branch = repo["mainbranch"]
    full_name = repo["full_name"]
    repo_uuid = repo["uuid"]
    zip_link = "https://bitbucket.org/" + full_name + "/get/" + main_branch + ".zip"
    zip_path = "./" + main_branch + ".zip"
    try:
        # Save the archive to disk before unpacking it.
        with urlopen(zip_link) as response, open(zip_path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
        shutil.unpack_archive(zip_path, extract_dir="./" + repo_uuid)
        return repo
    except Exception:
        return None

# For now, all this does is ensure the existence of a license file.
# Question: is it good enough to look for a substring of known licenses
# to ensure that the repo we're scraping has the license we would expect?
# Is there any prior art here?
def open_license(repo):
    for root, dirs, files in os.walk("./" + repo["uuid"]):
        for name in files:
            if name in ("LICENSE", "LICENSE.txt", "LICENSE.md"):
                return True
    return False

def traverse_repo(repo):
    for root, dirs, files in os.walk("./" + repo["uuid"]):
        for name in files:
            print(name)
```
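On the substring question above, one minimal sketch: keep a hand-maintained map of distinctive marker phrases (the license IDs and phrases below are illustrative assumptions; a real list would be built from the full SPDX license texts) and scan any license file found in the repo for a match.

```python
import os

# Hypothetical marker phrases; a real list would cover many more licenses,
# typically sourced from the canonical SPDX license texts.
LICENSE_MARKERS = {
    "MIT": "Permission is hereby granted, free of charge",
    "Apache-2.0": "Licensed under the Apache License, Version 2.0",
    "BSD-3-Clause": "Redistribution and use in source and binary forms",
}

def detect_license(repo_dir):
    """Return the ID of the first license whose marker phrase appears
    in a LICENSE file under repo_dir, or None if nothing matches."""
    for root, dirs, files in os.walk(repo_dir):
        for name in files:
            if name in ("LICENSE", "LICENSE.txt", "LICENSE.md"):
                with open(os.path.join(root, name), errors="ignore") as f:
                    text = f.read()
                for license_id, marker in LICENSE_MARKERS.items():
                    if marker in text:
                        return license_id
    return None
```

This is a fairly crude heuristic; modified license texts or multi-license repos would need fuzzier matching.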
Hi @CamdenClark, thank you for considering picking this one up.
We are currently focusing on a release based on a prioritized list of datasets, so decisions about Bitbucket may not be clear at this point. To make things less confusing, take your time and start with other, more explicit datasets like #4 or #33.
Thanks for the quick response! I will focus on GitLab instead.
Title
Dataset URL - here
Does the dataset exist in a scraped format?
URL if Yes - here
Description
Got 1,261,420 repos from Bitbucket that we can download. The data includes the following fields per repo: ['type', 'full_name', 'links', 'name', 'slug', 'description', 'scm', 'website', 'owner', 'workspace', 'is_private', 'project', 'fork_policy', 'created_on', 'updated_on', 'size', 'language', 'has_issues', 'has_wiki', 'uuid', 'mainbranch', 'override_settings', 'parent'].
Procedure
Tests
Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.
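A sketch of what such a unit test might look like (the column names here follow the columns used in the scraping code above; the exact schema and `dummy_dataset.parquet` contents are assumptions, not the project's settled format):

```python
import pandas as pd

# Columns we expect the collection code to produce for every repo.
EXPECTED_COLUMNS = {"full_name", "mainbranch", "description", "uuid"}

def check_dummy_dataset(df):
    """Validate that a dummy dataset has the expected columns and rows."""
    assert EXPECTED_COLUMNS.issubset(df.columns), "missing expected columns"
    assert len(df) >= 1, "need at least one example row"
    assert df["full_name"].notna().all(), "every repo needs a full_name"
    assert df["uuid"].notna().all(), "every repo needs a uuid"
    return True

def test_dummy_dataset():
    # In the real test this would be pd.read_parquet("dummy_dataset.parquet").
    df = pd.DataFrame(
        {
            "full_name": ["example/repo"],
            "mainbranch": ["main"],
            "description": ["an example repo"],
            "uuid": ["{1234}"],
        }
    )
    assert check_dummy_dataset(df)
```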
Give an example of the columns and data: