Capture more PyPI-specific and dependency metadata about packages #12

sethmlarson · 2023-09-05T14:54:18Z

Hello @orf, I absolutely love https://py-code.org! Thank you for creating this service.

I manually maintain my own dataset about Python packages available on PyPI (but more around dependency metadata and PyPI-specific information like maintainers). Do you have any interest in supporting these use-cases? Would happily stop maintaining my own dataset and point to py-code if this information is made available (your dataset is much more automated and has a nice frontend ✨)

Let me know what you think, and thanks again!

orf · 2023-09-05T16:01:35Z

Hey! I absolutely do, something like this is the next phase of the "pypi-data cinematic universe". I have some of this raw data already captured from PyPI, but it seems you have enriched it a bit.

Right now we have a few disconnected pieces that we can jam together to do cool things:

- We have the raw PyPI JSON data on releases
- We have all the code
- We have metadata on the contents of PyPI archives

With this you can:

1. Find the unique git OIDs of all `some-interesting-file-name.py` files, or others by a specific pattern
2. Fetch and parse the contents of those files to extract some interesting metrics, producing a mapping of `{git_oid: stats}`
3. Turn the mapping of `{git_oid: stats}` to `{(project_name, project_version): stats}` using the `git_oid` and the datasets in this repo
4. Turn `{(project_name, project_version): stats}` into anything, by joining the `(project_name, project_version)` on another dataset (like yours)
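Steps 3 and 4 can be sketched with plain dictionaries. This is only a rough illustration: the `oid_index` rows, the `maintainers` table, the sample values, and both helper names are hypothetical stand-ins, not the actual schema of the datasets in this repo. One release usually contains many files, so the sketch collects a list of stats per release.

```python
from collections import defaultdict


def stats_by_release(oid_stats, oid_index):
    """Step 3: turn {git_oid: stats} into {(project_name, project_version): [stats, ...]}.

    `oid_index` is assumed to be an iterable of
    (git_oid, project_name, project_version) rows (hypothetical layout).
    """
    releases = defaultdict(list)
    for git_oid, name, version in oid_index:
        if git_oid in oid_stats:
            releases[(name, version)].append(oid_stats[git_oid])
    return dict(releases)


def join_on_release(release_stats, other):
    """Step 4: inner join with another dataset keyed by (project_name, project_version)."""
    return {
        key: (stats, other[key])
        for key, stats in release_stats.items()
        if key in other
    }


# Made-up sample data, for illustration only.
oid_stats = {"aaa111": {"classes": 3}, "bbb222": {"classes": 1}}
oid_index = [
    ("aaa111", "example-project", "1.0"),
    ("aaa111", "example-project", "1.1"),  # unchanged file, same OID in both releases
    ("bbb222", "example-project", "1.1"),
]
maintainers = {("example-project", "1.1"): ["alice"]}

per_release = stats_by_release(oid_stats, oid_index)
joined = join_on_release(per_release, maintainers)
```

Because identical file contents share one git OID, each file only needs to be parsed once in step 2, no matter how many releases ship it.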

So with this we could parse all .py files, count the number of classes, and plot "classes written over time, segmented by PyPI trove classifier/other pypi metadata/number of downloads/maintainer/whatever".
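As a rough idea of what the "count the number of classes" metric could look like for a single file, here is a minimal sketch using the standard library `ast` module. The function name `count_classes` and the choice to return `None` for unparseable files are assumptions for illustration, not part of any existing tool.

```python
import ast


def count_classes(source):
    """Return the number of class definitions in `source`, or None if it does not parse.

    Nested and function-local classes are counted too, since ast.walk
    visits every node in the tree.
    """
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # Files on PyPI are not guaranteed to be valid for the running
        # interpreter (e.g. Python 2 syntax), so skip them instead of crashing.
        return None
    return sum(isinstance(node, ast.ClassDef) for node in ast.walk(tree))


sample = '''
class A:
    class Inner:
        pass

def f():
    class Local:
        pass
'''

n_classes = count_classes(sample)
broken = count_classes("class :")
```

The result of a function like this would be the `stats` value in the `{git_oid: stats}` mapping from step 2.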

The problem is that this is all disconnected and a bit shit. I want this to be relatively seamless because I'm sick of doing it manually 😂.

I'm working on a CLI tool to handle steps 1, 2 and 3 for users, but step 4 is pretty interesting.

Perhaps we could take the pypi-json-data dataset, enrich it a bit and provide it in some format that can be used as part of this workflow?

That data could also be explorable via py-code.org. I've been thinking of adding some info from pypi-json-data to the site, though I'm not sure what format it should be in.

sethmlarson changed the title ~~Capture more PyPI-specific and metadata about packages~~ Capture more PyPI-specific and dependency metadata about packages Sep 5, 2023
