regenerate "historical" CSV files that don't have downloads_id #329

Open
philbudne opened this issue Aug 20, 2024 · 4 comments

@philbudne
Contributor

Fernando approved retrieving the archived database to recover the downloads_ids for Nov/Dec 2021.
@rahulbot asked:

Is there a quick way to audit all the other historical CSV files to see if we need download_ids from any other periods as well?

I wrote a script to enumerate objects in the mediacloud-database... buckets, requesting only the first 128 bytes of each object, and saving only the first line to a disk file.
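A minimal sketch of what such an audit could look like, for readers who want to repeat it. This is not the script that was actually run: it assumes the AWS CLI is installed and has read access, and the function names (`first_line`, `audit_bucket`) and the local directory layout are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes the AWS CLI is installed and can read the bucket.
# Function names and local layout are illustrative, not from the original script.

# Keep only the first line of a (possibly truncated) chunk read from stdin.
first_line() {
    head -n 1
}

# Fetch the first 128 bytes of every object in bucket $1 and save the first
# line under ./$1/<key>. Keys containing tabs or newlines are not handled.
audit_bucket() {
    bucket=$1
    chunk=$(mktemp) || return 1
    aws s3api list-objects-v2 --bucket "$bucket" \
        --query 'Contents[].Key' --output text |
    tr '\t' '\n' |
    while IFS= read -r key; do
        [ -n "$key" ] || continue
        mkdir -p "$bucket/$(dirname "$key")"
        # --range asks S3 for bytes 0..127 only; the JSON metadata is discarded.
        aws s3api get-object --bucket "$bucket" --key "$key" \
            --range bytes=0-127 "$chunk" </dev/null >/dev/null &&
            first_line <"$chunk" >"$bucket/$key"
    done
    rm -f "$chunk"
}

# Usage (bucket name as it appears in the listing in this issue):
#   audit_bucket mediacloud-database-files
```

Saving each first line under a path that mirrors the object key is what makes a later `find ... | xargs grep` over the local tree possible.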

Here is what I found, looking for first lines which lack downloads_id, and ignoring summary files
(leaving analysis/discussion to follow-ups):

pbudne@angwin:~/s3-audit$ find mediacloud-database-* -name \*csv | xargs grep -v downloads_id | sort | egrep -v 'summar(y_|ies)|database_b.csv'
mediacloud-database-c-files/csv_files/2021_11_12.csv:https://expert.ru/doc-list/rss/
mediacloud-database-c-files/csv_files/2021_11_13.csv:https://www.dharitri.com/feed/
mediacloud-database-c-files/csv_files/2021_11_14.csv:https://www.casilinanews.it/feed
mediacloud-database-c-files/csv_files/2021_11_15.csv:https://www.mercurynews.com/feed/
mediacloud-database-c-files/csv_files/2021_11_16.csv:https://www.monacomatin.mc/rss
mediacloud-database-c-files/csv_files/2021_11_17.csv:https://www.diariodebatepregon.com/rss/home.xml
mediacloud-database-c-files/csv_files/2021_11_18.csv:http://avenueskhabar.com.np/feed/
mediacloud-database-c-files/csv_files/2021_11_20.csv:http://cms-delivery-mia.terra.com/feeder/public/articles/20e07ef2795b2310VgnVCM3000009af154d0RCRD.rss
mediacloud-database-c-files/csv_files/2021_11_21.csv:https://www.ikz-online.de/?
mediacloud-database-files/2013/stories_2013-03-21.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2013/stories_2013-05-04.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2013/stories_2013-05-13.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2013/stories_2013-08-06.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2013/stories_2013-09-02.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2013/stories_2017-12-01.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2014/stories_2014-10-24.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2017/stories_2017-09-22.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2017/stories_2017-09-28.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2017/stories_2017-12-01.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-10.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-11.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-12.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-13.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-14.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-15.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-16.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-17.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-18.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-19.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-20.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-21.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-22.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-23.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-24.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-25.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-26.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-27.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-28.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-29.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-30.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-31.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-10.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-11.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-12.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-13.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-14.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-15.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-16.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-17.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-18.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-19.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-20.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-21.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-22.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-23.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-24.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-25.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-26.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-27.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-28.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-29.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-30.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-6.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-7.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-8.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-9.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-23.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-24.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-25.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-26.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-27.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-28.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-29.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-30.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-01.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-02.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-03.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-04.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-05.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-06.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-07.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-08.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-09.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-10.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-11.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-12.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-13.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-14.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-15.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-16.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-17.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-18.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-19.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-20.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-21.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-22.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-23.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-24.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-25.csv:collect_date,stories_id,media_id,url
@philbudne
Contributor Author

TO BE VERIFIED! The DB Epoch chart always makes me dizzy!

Dates with missing downloads_id (with applicable epochs in parens):

  • 2021/11/12-2021/11/21 (B)
  • 2021/11/23-2021/12/25 (C/E/F)
  • 2019/04/06-2019/04/30, 2017/12/01, 2017/09/28, 2017/09/22, 2014/10/24, 2013/03/21, 2013/05/04, 2013/05/13, 2013/08/06, 2013/09/02 (A-F)

So, all but nine days in 2021 (previously not known about) are covered by backups of the final PG database (Epoch F)?
(Or dumps of the C or E epochs, if we had them)
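
One way to double-check the date ranges listed above is to collapse the grep output mechanically rather than by eye. The sketch below is not part of the original audit. It assumes the `path:first_line` record format shown in the listing earlier in this issue and file names ending in a `YYYY-MM-DD` or `YYYY_MM_DD` date (single-digit days allowed); the function name `collapse_dates` is hypothetical.

```shell
#!/bin/sh
# Sketch: turn grep-style "path:first_line" records on stdin into runs of
# consecutive dates, one "start-end" (or single date) per output line.
collapse_dates() {
    # 1. Pull year, month, day out of the file name; drop lines that do not match.
    sed -nE 's#^[^:]*[/_]([0-9]{4})[-_]([0-9]{1,2})[-_]([0-9]{1,2})\.csv:.*#\1 \2 \3#p' |
    # 2. Prefix each date with its Julian day number so that calendar-adjacent
    #    days differ by exactly 1, even across month boundaries.
    awk '{
        y = $1; m = $2; d = $3
        a = int((14 - m) / 12); yy = y + 4800 - a; mm = m + 12 * a - 3
        jdn = d + int((153 * mm + 2) / 5) + 365 * yy \
              + int(yy / 4) - int(yy / 100) + int(yy / 400) - 32045
        printf "%d %04d/%02d/%02d\n", jdn, y, m, d
    }' |
    # 3. Order by day number and drop duplicate dates.
    sort -n -u |
    # 4. Merge consecutive day numbers into ranges.
    awk '
        function emit() { if (first == last) print first; else print first "-" last }
        NR == 1        { first = $2; last = $2; prev = $1; next }
        $1 == prev + 1 { last = $2; prev = $1; next }
                       { emit(); first = $2; last = $2; prev = $1 }
        END            { if (NR > 0) emit() }
    '
}
```

Piping the saved output of the `find ... | xargs grep -v downloads_id` command through `collapse_dates` prints one line per run of consecutive days, which can then be compared against the bullets.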

@pgulley
Member

pgulley commented Aug 21, 2024

My gloss after meeting with all of the indexer team was that we want to move ahead with restoring from the Epoch F backup, which should cover the maximal range here (sans those nine days in the middle of November).

@thepsalmist
Contributor

I checked the distinct dates of stories from 2020 back to 2008 in the restored database B against the CSV files we have in S3 under s3://mediacloud-files/${year}. The finding is that for every distinct date on which we have a story (based on collect_date), there is a corresponding CSV file on S3 (_v1 or _v2 prefix).
db_vs_s3_comparison.csv
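
For anyone repeating this comparison, the check boils down to a set difference between dates seen in the database and dates that have a file. The following is a rough sketch, not the procedure actually used. It assumes two hypothetical input files: one with a distinct `YYYY-MM-DD` collect date per line (exported from the database), and one with an S3 key per line where each key contains a `YYYY-MM-DD` date. The real key naming may differ.

```shell
#!/bin/sh
# Sketch: print every database date that has no matching S3 key.
#   $1 = file with one YYYY-MM-DD date per line (assumed DB export)
#   $2 = file with one S3 key per line (assumed to embed a YYYY-MM-DD date)
missing_dates() {
    db_dates=$1
    s3_keys=$2
    tmp=$(mktemp -d) || return 1
    sort -u "$db_dates" >"$tmp/db"
    # Extract the date embedded in each key, then de-duplicate (_v1/_v2 variants).
    grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' "$s3_keys" | sort -u >"$tmp/s3"
    # comm -23 keeps lines that appear only in the first (database) list.
    comm -23 "$tmp/db" "$tmp/s3"
    rm -rf "$tmp"
}
```

Empty output means every date in the database export has at least one file, which matches the finding reported above.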

@philbudne
Contributor Author

philbudne commented Oct 1, 2024 via email
