BigQuery bigquery-public-data.pypi.distribution_metadata missing data #16008

Open
mensfeld opened this issue May 24, 2024 · 9 comments
Labels
bug 🐛 requires triaging maintainers need to do initial inspection of issue

Comments

@mensfeld

Running this query:

SELECT 
  name,
  version,
  summary,
  description
FROM 
  `bigquery-public-data.pypi.distribution_metadata`
WHERE 
  name = 'virtualenv'
ORDER BY 
  version;

misses several new versions released in April and May (see https://pypi.org/project/virtualenv/#history). The same happens for some other packages.

Describe the bug

Information for all versions should be available in BigQuery.

Expected behavior

I would expect them to be available in BigQuery (allowing for eventual consistency, of course).

To Reproduce

Run in BigQuery:

SELECT 
  name,
  version,
  summary,
  description
FROM 
  `bigquery-public-data.pypi.distribution_metadata`
WHERE 
  name = 'virtualenv'
ORDER BY 
  version;

and see versions are missing.
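
One way to make "versions are missing" concrete is to diff the version list returned by the query against the versions listed on PyPI. The snippet below is only a minimal sketch: the helper names and the sample version lists are illustrative assumptions, not real query output. It also sorts numerically, because `ORDER BY version` compares versions as text if the column is a string (assumed here), which would put `20.9.0` after `20.25.1`.

```python
def version_key(version):
    """Sort key for purely numeric dotted versions such as '20.25.1'.

    Assumption: no pre-release or local segments; real PyPI versions may
    need a full PEP 440 parser instead of this simple split.
    """
    return tuple(int(part) for part in version.split("."))


def missing_versions(expected, actual):
    """Return versions present in `expected` but absent from `actual`."""
    return sorted(set(expected) - set(actual), key=version_key)


# Illustrative data only, not real query output.
on_pypi = ["20.25.1", "20.25.2", "20.25.3", "20.26.0", "20.9.0"]
in_bigquery = ["20.9.0", "20.25.1"]

print(missing_versions(on_pypi, in_bigquery))
```

With these sample lists, the helper reports the three versions that appear only in `on_pypi`, ordered numerically rather than as text.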

@ewdurbin
Member

The task that ensures consistency was disabled due to poor performance in... 2021 🙃

#10256

It was never re-enabled as far as I can tell, because the contributor never returned to address the issue.

For triage, I have manually run this task. Can you confirm whether the data now looks consistent?

@mensfeld
Author

@ewdurbin was all the data synced? That is, should all the historical gaps be filled now?

When I query virtualenv I'm still missing 20.25.2+ versions (anything newer).

Is there any other endpoint to get the recent releases data?
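
As a possible stopgap for the question above, PyPI's per-project JSON API at `https://pypi.org/pypi/<project>/json` returns a `releases` mapping keyed by version string. The sketch below is a minimal example under that assumption; the helper names are hypothetical and the network call is kept in a separate function so the parsing logic can be checked offline.

```python
import json
from urllib.request import urlopen


def pypi_json_url(project):
    """Build the JSON API URL for a project (helper name is illustrative)."""
    return f"https://pypi.org/pypi/{project}/json"


def release_versions(payload):
    """Extract version strings from a decoded JSON API payload.

    Returns them in plain string order; an empty list if 'releases' is absent.
    """
    return sorted(payload.get("releases", {}))


def fetch_release_versions(project):
    """Fetch and parse the release list for one project (performs a network call)."""
    with urlopen(pypi_json_url(project)) as response:
        return release_versions(json.load(response))
```

Calling `fetch_release_versions("virtualenv")` would return the versions currently known to PyPI, which can then be compared with what the BigQuery table contains.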

@mensfeld
Author

@ewdurbin I'm still not seeing the newer releases of virtualenv in the BigQuery dataset :(

@ewdurbin
Member

ewdurbin commented Jun 3, 2024

Hmmm, unclear what the issue is. @di are you familiar with why the sync wouldn't capture past releases?

@di
Member

di commented Jun 3, 2024

That's not the job that inserts new metadata; that job just syncs missing metadata if insertion fails for some reason.

Insertion of new metadata happens on upload: https://github.com/pypi/warehouse/blob/main/warehouse/forklift/legacy.py#L1222-L1223

The timeline here is suspiciously close to when we did some migrations on these schemas. My guess is that update_bigquery_release_files is failing and we're unaware.

@ewdurbin
Member

ewdurbin commented Jun 3, 2024

So sync_bigquery_release_files is not the bulk equivalent of update_bigquery_release_files?

@di
Member

di commented Jun 3, 2024

It is, but it shouldn't be necessary anymore: metadata should be reliably inserted on upload (but it appears it no longer is).
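
A rough mental model of the two jobs discussed in this exchange is sketched below. This is not Warehouse's code: the function names, the `filename` key, and the dict standing in for the BigQuery table are all hypothetical, chosen only to show a per-upload insert next to a bulk backfill of whatever is missing.

```python
def update_on_upload(destination, row):
    """Per-upload path: insert the metadata row for one newly uploaded file."""
    destination[row["filename"]] = row


def sync_missing(source_rows, destination):
    """Bulk path: insert every source row whose key is absent from destination.

    Returns the list of keys that had to be backfilled, in source order.
    """
    added = []
    for row in source_rows:
        if row["filename"] not in destination:
            destination[row["filename"]] = row
            added.append(row["filename"])
    return added


# Hypothetical data: one file was inserted on upload, one was never inserted.
table = {}
update_on_upload(table, {"filename": "example-1.0.tar.gz"})
backfilled = sync_missing(
    [{"filename": "example-1.0.tar.gz"}, {"filename": "example-1.1.tar.gz"}],
    table,
)
print(backfilled)
```

In this model both paths end in the same write to the destination, so if that write is what fails, the bulk job would be expected to fail in the same way as the per-upload job.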

@ewdurbin
Member

ewdurbin commented Jun 3, 2024

Hm, okay. I ran sync_bigquery_release_files in an attempt to triage, and it seems it didn't bulk load the missing info. It seems this needs some more investigation.

@di
Member

di commented Jun 3, 2024

It's probably failing for the same reason the individual job is failing, I would guess!
