Capture more PyPI-specific and dependency metadata about packages #12

Open
sethmlarson opened this issue Sep 5, 2023 · 1 comment

sethmlarson commented Sep 5, 2023

Hello @orf, I absolutely love https://py-code.org! Thank you for creating this service.

I manually maintain my own dataset about Python packages available on PyPI (but more around dependency metadata and PyPI-specific information like maintainers). Do you have any interest in supporting these use-cases? Would happily stop maintaining my own dataset and point to py-code if this information is made available (your dataset is much more automated and has a nice frontend ✨)

Let me know what you think, and thanks again!

sethmlarson changed the title from "Capture more PyPI-specific and metadata about packages" to "Capture more PyPI-specific and dependency metadata about packages" Sep 5, 2023
Member

orf commented Sep 5, 2023

Hey! I absolutely do. Something like this is the next phase of the "pypi-data cinematic universe". I have some of this raw data already captured from PyPI, but it seems you have enriched it a bit.

Right now we have a few disconnected pieces that we can jam together to do cool things:

  1. We have the raw pypi JSON data on releases
  2. We have all the code
  3. We have metadata on the contents of pypi archives

With this you can:

  1. Find the unique git OIDs of all files named some-interesting-file-name.py, or of files matching some other pattern
  2. Fetch and parse the contents of those files to extract some interesting metrics, producing a mapping of {git_oid: stats}
  3. Turn the mapping of {git_oid: stats} to {(project_name, project_version): stats} using the git_oid and the datasets in this repo
  4. Turn {(project_name, project_version): stats} into anything, by joining the (project_name, project_version) on another dataset (like yours)
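
Steps 3 and 4 can be sketched roughly like this. This is only an illustration, not the actual tooling: the OIDs, project names, the `file_index` rows and the `release_metadata` dict are all made up, and the real datasets have their own schemas. It also assumes the stats are numeric counts that can be summed per release.

```python
from collections import defaultdict

# Hypothetical inputs, invented for illustration only.
# Step 2 output: stats keyed by git OID.
oid_stats = {
    "oid-aaa": {"classes": 3},
    "oid-bbb": {"classes": 1},
}

# Rows linking a file's git OID to the release that ships it.
file_index = [
    ("oid-aaa", "example-project", "1.0"),
    ("oid-aaa", "example-project", "1.1"),  # unchanged file, same OID in both releases
    ("oid-bbb", "example-project", "1.1"),
    ("oid-ccc", "other-project", "0.1"),    # no stats were computed for this OID
]

# Some other dataset keyed by (project_name, project_version).
release_metadata = {
    ("example-project", "1.0"): {"maintainer": "alice"},
    ("example-project", "1.1"): {"maintainer": "alice"},
}


def stats_by_release(oid_stats, file_index):
    """Step 3: turn {git_oid: stats} into {(name, version): stats} by summing counts."""
    totals = defaultdict(lambda: defaultdict(int))
    for oid, name, version in file_index:
        for key, value in oid_stats.get(oid, {}).items():
            totals[(name, version)][key] += value
    return {release: dict(stats) for release, stats in totals.items()}


def join_metadata(release_stats, release_metadata):
    """Step 4: attach extra per-release metadata; releases without metadata keep only stats."""
    return {
        release: {**stats, **release_metadata.get(release, {})}
        for release, stats in release_stats.items()
    }


release_stats = stats_by_release(oid_stats, file_index)
joined = join_metadata(release_stats, release_metadata)
```

Because identical files share one OID, the expensive parsing in step 2 only happens once per unique file, and the per-release numbers fall out of a cheap join.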

So with this we could parse all .py files, count the number of classes, and plot "classes written over time, segmented by PyPI trove classifier/other pypi metadata/number of downloads/maintainer/whatever".
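
For the "count the number of classes" part, a minimal sketch using the standard library `ast` module could look like the following. The function name `count_classes` is just a placeholder, and it assumes the file contents are already available as text. Plenty of files on PyPI will not parse with the running interpreter (Python 2 syntax, templates and so on), so the sketch returns `None` for those rather than failing.

```python
import ast


def count_classes(source):
    """Return the number of class definitions in the source, or None if it cannot be parsed."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # Unparseable input, e.g. Python 2 syntax or stray null bytes.
        return None
    # ast.walk visits every node, so nested and function-local classes are counted too.
    return sum(isinstance(node, ast.ClassDef) for node in ast.walk(tree))


sample = '''
class A:
    class Inner:
        pass

def f():
    class Local:
        pass
'''

print(count_classes(sample))  # 3
```

The result per file would be the `stats` value stored against that file's git OID.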

The problem is that this is all disconnected and a bit shit. I want this to be relatively seamless because I'm sick of doing it manually 😂.

I'm working on a CLI tool to handle steps 1, 2 and 3 for users, but step 4 is pretty interesting.

Perhaps we could take the pypi-json-data dataset, enrich it a bit and provide it in some format that can be used as part of this workflow?

That data could also be explorable via py-code.org; I've been thinking of adding some info from pypi-json-data to the site. Not sure what format it should be in, though.
