Fernando approved restoring the archived database to recover downloads_id values for Nov/Dec 2021. @rahulbot asked:
Is there a quick way to audit all the other historical CSV files to see if we need download_ids from any other periods as well?
I wrote a script to enumerate objects in the mediacloud-database... buckets, requesting only the first 128 bytes of each object, and saving only the first line to a disk file.
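For anyone who wants to repeat the audit, here is a minimal sketch of that kind of script, not the one actually used. It assumes a boto3-style S3 client is passed in (`list_objects_v2` paginator plus `get_object` with a `Range` header); the function names, the comma-separated header assumption, and the bucket argument are all illustrative. It does not try to detect summary files, and a header longer than 128 bytes would be truncated.

```python
def first_line(client, bucket, key, nbytes=128):
    """Fetch only the first `nbytes` bytes of an object and return its first line."""
    resp = client.get_object(Bucket=bucket, Key=key, Range=f"bytes=0-{nbytes - 1}")
    head = resp["Body"].read()
    # Keep everything before the first newline; tolerate CRLF and bad bytes.
    return head.split(b"\n", 1)[0].decode("utf-8", errors="replace").rstrip("\r")


def headers_missing_column(client, bucket, column="downloads_id"):
    """Yield (key, header) for each object whose first line lacks `column`."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj["Size"] == 0:
                continue  # a ranged GET on an empty object is rejected by S3
            header = first_line(client, bucket, obj["Key"])
            # Assumption: the header is a plain comma-separated list of names.
            names = [name.strip().strip('"') for name in header.split(",")]
            if column not in names:
                yield obj["Key"], header
```

To run it against a real bucket, something like `headers_missing_column(boto3.client("s3"), bucket_name)` should work, with `bucket_name` set to each of the buckets in question.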
Here is what I found, looking for first lines which lack downloads_id, and ignoring summary files
(leaving analysis/discussion to follow-ups):
So, all but nine days in 2021 (previously not known about) are covered by backups of the final PG database (Epoch F)?
(Or dumps of the C or E epochs, if we had them)
My gloss after meeting with the whole indexer team is that we want to move ahead with restoring from the Epoch F backup, which should cover the maximal range here (sans those nine days in the middle of November).
I checked the distinct dates for stories from 2008 to 2020 in the restored database B against the CSV files we have in S3 under s3://mediacloud-files/${year}. The finding is that for every distinct date where we have a story (based on collect_date), there is a corresponding CSV file on S3 (with a _v1 or _v2 prefix). See db_vs_s3_comparison.csv.
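A rough sketch of how a date-coverage check like this could be expressed, in case it is useful for re-running on other years. It is not the comparison that produced the CSV above: it assumes each S3 key embeds the date as YYYY-MM-DD somewhere in its name, which is a guess about the naming scheme, and the function names are made up for illustration.

```python
import re
from datetime import date

# Assumption: keys contain the date as YYYY-MM-DD (hypothetical naming scheme).
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def dates_in_keys(keys):
    """Collect the set of dates that appear in a list of S3 object keys."""
    found = set()
    for key in keys:
        m = DATE_RE.search(key)
        if m:
            found.add(date(int(m[1]), int(m[2]), int(m[3])))
    return found


def dates_without_csv(db_dates, keys):
    """Return the sorted dates present in the database but missing from S3."""
    return sorted(set(db_dates) - dates_in_keys(keys))
```

Here `db_dates` would be the distinct collect_date values from the restored database and `keys` the object listing for a given year; an empty result means every date is covered.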