What is the "hash" in the metadata dataset exactly, are you sure it is SHA256? #39

Open
venthur opened this issue Jan 19, 2024 · 1 comment

Comments

venthur commented Jan 19, 2024

According to the docs, the metadata dataset covering every file uploaded to PyPI, i.e. the parquet files listed in https://github.com/pypi-data/data/raw/main/links/dataset.txt, contains a SHA256 hash. However, the docs do not describe how the hash is calculated.

When trying to verify that the SHA256 is calculated over the respective file itself, I encountered some issues:

  • Your hash is too short for a SHA256; it has the same length as a SHA1, though.
  • However, when I calculate the SHA1 of a downloaded file, it does not match yours (neither does the SHA256).
  • Two files in your dataset that share the same hash also have the same SHA1 hash on my end; however, my hashes and yours differ.

Can you explain which hash you are using, and whether you are hashing the contents of the file linked to via the path?

Thank you very much for the awesome dataset!

Member

orf commented Aug 30, 2024

Hey! Thanks for the kind words @venthur. Sorry about the late reply: I had these messages filtered out.

It's been a while since I looked at this project, but you're right: it's not a SHA256 hash. I dug into where I generate this, and for some reason I chose to use Oid::hash_object here, from libgit2.

That was... an unfortunate decision, and I can't see why I chose to do it that way. Parts of this project were pretty experimental, and I was pretty much learning Rust at the time.

So it's going to be a SHA1 hash of blob ${length}\0${content} (the same value Git computes as a blob's object ID), which is actually really inconvenient.
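For anyone who wants to reproduce these values locally, here is a minimal Python sketch. It assumes the hash really is the standard Git blob ID as described above, i.e. SHA1 over the header "blob <length>\0" followed by the raw bytes. The function names are illustrative and not part of this project.

```python
import hashlib
import os


def git_blob_sha1(content: bytes) -> str:
    """Return the hex SHA1 that Git assigns to a blob with this content."""
    # Git prefixes the content with "blob <decimal byte length>\0" before hashing.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def git_blob_sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Same hash, streamed from disk so large files need not fit in memory."""
    size = os.path.getsize(path)
    digest = hashlib.sha1(b"blob " + str(size).encode("ascii") + b"\0")
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # The empty blob has a well-known ID in every Git repository.
    print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The result should match what `git hash-object <file>` prints for the same bytes, which gives an independent way to check a value from the dataset.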

I'll try and think of a way to rectify this.
