Migrate enrich_date and enrich_earliest_date enrichments #197

Closed
amywieliczka opened this issue Nov 8, 2022 · 3 comments · Fixed by #1090

amywieliczka commented Nov 8, 2022

Should be migrated as methods on the Record class in metadata_mapper/mapper.py

Related to #196
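
As a rough illustration of what "methods on the Record class" could look like, here is a minimal sketch. The `Record` class below is a hypothetical stand-in, not the real class in metadata_mapper/mapper.py; the `mapped_data` attribute, the `earliest_year` field, and the year-extraction logic are all assumptions made for the example and do not reflect the legacy enrichment's actual behavior.

```python
import re


class Record:
    """Hypothetical stand-in for the Record class in metadata_mapper/mapper.py."""

    def __init__(self, mapped_data):
        # Assumption: the record keeps its mapped metadata in a plain dict.
        self.mapped_data = mapped_data

    def enrich_earliest_date(self):
        """Placeholder logic: store the earliest four-digit year found in `date`.

        The real enrichment is more involved; this only shows the method shape.
        """
        dates = self.mapped_data.get("date") or []
        if isinstance(dates, str):
            dates = [dates]
        years = [int(y) for d in dates for y in re.findall(r"\b\d{4}\b", d)]
        if years:
            # `earliest_year` is an invented field name for this sketch.
            self.mapped_data["earliest_year"] = min(years)
        return self  # returning self lets enrichments be chained


record = Record({"date": ["1925-1930", "circa 1919"]}).enrich_earliest_date()
print(record.mapped_data["earliest_year"])
```

Returning `self` from each enrichment is one way to let several enrichments be chained on the same record.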

amywieliczka added this to the Rikolti MVP Milestone #3 milestone Nov 8, 2022
amywieliczka self-assigned this Nov 8, 2022
amywieliczka removed this from the Rikolti MVP Milestone #3 milestone Nov 15, 2022
christinklez changed the title from "Migrate enrich_date and enrich_earliest_date enrichments" to "[for later] Migrate enrich_date and enrich_earliest_date enrichments" Jan 9, 2023
christinklez changed the title from "[for later] Migrate enrich_date and enrich_earliest_date enrichments" to "[as time allows] Migrate enrich_date and enrich_earliest_date enrichments" Jan 9, 2023
amywieliczka changed the title from "[as time allows] Migrate enrich_date and enrich_earliest_date enrichments" to "Migrate enrich_date and enrich_earliest_date enrichments" Jul 23, 2024

amywieliczka commented Jul 23, 2024

I performed some analysis of the usage of enrich_earliest_date and enrich_date enrichment functions, and then migrated these enrichments from calisphere-legacy-harvester/dpla-ingestion as part of this PR: #1090

Next steps:

  • So far this is an untested migration of the enrichments: no harvests have been run through the migrated enrichment functions. We need to harvest some collections in a local dev environment to make sure the code works. Unfortunately, my aws-mwaa-local-environment is borked, and I haven't been able to get a dev environment working again to test these code updates.
  • Once we know the code works, we also need to make sure it does what it's supposed to. As far as I can tell, the options are:
    • Get the validator working again (may need to resolve whatever error is thrown in MWAA) against Solr (beanstalk solr is still alive and kicking) and Couch (may need to start up ingest-couch-prd, ingest-front-prd) so that we can validate the facet_decade, sort_date_start, and sort_date_end fields for records. Maybe even get this working in deployed MWAA, so @aturner and @christinklez can validate these fields for 5-10 sample collections.
    • Run 5-10 collections through a re-harvest process and QA those collections on calisphere-stage, with a special eye on the decade facet, the date metadata value, and how sort_date_start/sort_date_end behave when sorting.
    • Maybe write a one-off date validation script to compare this handful of date-related fields (date, facet_decade, sort_date_start, sort_date_end), though the validator already does a really good job of comparing records and is already designed to talk to Solr.
  • Once we've determined that the code mostly works and mostly does what it's supposed to (even if it's not an exhaustive test), we need to choose either a) a migration path to update records already in the index or b) a re-harvesting plan that progressively picks up decade data as collections are re-harvested.
    • Probably a re-harvesting plan is best, following standard QA practices that include a special glance at the decade facet for collections as we re-harvest them. Could possibly even just restart harvests from the most recently fetched metadata by clearing the mapping tasks onward in the harvest DAG - this would prevent inadvertently refreshing anyone's metadata that they did not want refreshed yet.
  • [No further work needed] Update the Calisphere Prod UI to facet and filter on the facet_decade field instead of the date field (Stage ucldc/public_interface#405)
  • [No further work needed] Confirmed that Calisphere UI already uses sort_date_start and sort_date_end to support the sort options - no changes needed for this, though as more records have these fields, we'll get better sorts.
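
For the one-off date validation script option mentioned in the list above, a minimal sketch of the comparison step might look like the following. It assumes the legacy and migrated records are already available as plain dicts; the `diff_date_fields` helper and the sample values are hypothetical, and fetching records from Solr or from Rikolti output is left out.

```python
# Date-related fields named in this issue.
DATE_FIELDS = ("date", "facet_decade", "sort_date_start", "sort_date_end")


def diff_date_fields(legacy, migrated, fields=DATE_FIELDS):
    """Return {field: (legacy_value, migrated_value)} for fields that differ."""
    diffs = {}
    for field in fields:
        old, new = legacy.get(field), migrated.get(field)
        if old != new:
            diffs[field] = (old, new)
    return diffs


# Hypothetical sample records, for illustration only.
legacy_record = {
    "date": ["1925"],
    "facet_decade": ["1920s"],
    "sort_date_start": "1925-01-01T00:00:00Z",
    "sort_date_end": "1925-12-31T00:00:00Z",
}
migrated_record = {
    "date": ["1925"],
    "sort_date_start": "1925-01-01T00:00:00Z",
    "sort_date_end": "1925-12-31T00:00:00Z",
}

for field, (old, new) in diff_date_fields(legacy_record, migrated_record).items():
    print(f"{field}: legacy={old!r} migrated={new!r}")
```

A missing field shows up as `None` on the side where it is absent, which makes records that never received decade data easy to spot.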

amywieliczka linked a pull request Jul 23, 2024 that will close this issue

amywieliczka commented Aug 30, 2024

Completed since last update:

  • Got the code actually working and running in my local mwaa runner (fixed my local mwaa runner)
  • Got the validator working again
    • Made validator conditional on the presence of UCLDC_SOLR_URL and UCLDC_COUCH_URL
    • Commented out is_shown_at and is_shown_by validations (these validated against couch, which is no longer up)
  • Ran the validator for 27 test collections - filter on Calisphere Solr Mapper Type = No
  • Using those 27 test collections, created a set of test cases for enrich_date, enrich_earliest_date, get_facet_decades, make_sort_dates, and unpack_display_date. To run these tests, run pytest metadata_mapper from within the rikolti folder.
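
The conditional validation described above could be sketched roughly as follows. Only the two environment variable names come from this issue; `validation_enabled`, `maybe_validate`, and the `validate` callable are hypothetical names for illustration, not the actual Rikolti validator code.

```python
import os

# Environment variables named in this issue.
REQUIRED_ENV = ("UCLDC_SOLR_URL", "UCLDC_COUCH_URL")


def validation_enabled(env=None):
    """True only when every required variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in REQUIRED_ENV)


def maybe_validate(collection_id, validate, env=None):
    """Run `validate(collection_id)` if configured; otherwise skip and return None."""
    if not validation_enabled(env):
        return None
    return validate(collection_id)


# Usage with an explicit mapping instead of the real environment:
configured = {"UCLDC_SOLR_URL": "https://solr.example", "UCLDC_COUCH_URL": "https://couch.example"}
print(maybe_validate(123, lambda cid: f"validated {cid}", env=configured))
print(maybe_validate(123, lambda cid: f"validated {cid}", env={}))
```

Passing the environment in as a mapping keeps the check easy to exercise in tests without touching the real process environment.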

Next steps:

  • Code Review of the PR
  • Deployment to MWAA (this will involve a new pip install & an update to startup.sh) & verify conditional validation logic is working well
  • Create a re-harvesting plan to progressively pick up and QA date data. (Could potentially re-harvest from fetched metadata stored in our s3 bucket)

Later steps
