feat(ingest): Use augur curate to curate country metadata #3015

Draft
wants to merge 36 commits into
base: main

Conversation

anna-parker
Contributor

@anna-parker anna-parker commented Oct 17, 2024

resolves #

preview URL: https://add-country-metadata.loculus.org/

Summary

Prompted by work in #2991, this PR uses augur curate to format country metadata.

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test
micromamba activate pp-integrity
cd tests/regression-testing 
snakemake results/ebola-zaire.meta.format-authors.add-country-metadata.diff 
snakemake results/cchf.meta.format-authors.add-country-metadata.diff 
snakemake results/west-nile.meta.format-authors.add-country-metadata.diff 

anna-parker and others added 28 commits October 10, 2024 18:05
…ata from NCBI Virus (#2990)

* Use raw jsonl instead of generated tsv when ingesting data from NCBI virus

* Do not require authors list to end in ';', capitalize names correctly.

* Add tests for capitalization

* Add a warning if author list might be in wrong format

* Add ascii specific warning

* Add tests for warnings and errors

* Only capitalize if full authors string is upper case

* Properly capitalize initial

* Move titlecase option to ingest only - add ingest tests
@anna-parker anna-parker added the preview Triggers a deployment to argocd label Oct 17, 2024
@anna-parker anna-parker changed the title Add_country_metadata feat(ingest): Use augur curate to curate country metadata Oct 17, 2024
@anna-parker
Contributor Author

anna-parker commented Oct 17, 2024

Sadly, after further testing, this no longer seems like a good approach:

  1. It requires a Python downgrade and a tedious refactor of our code.
  2. augur curate splits the geoLocCountry field from NCBI into country: division, location. Where the resulting values are wrong, it only curates them if there is an exact match in the config. For example, a misspelled entry such as Conneticut/Hartford is corrected only if that exact pair is in the config; any other pair with the same misspelling (Conneticut/xxx) is left as-is unless it is listed too. It also misses cases where the state has been replaced with an abbreviation, where the division and location fields are not clearly split, or where the order of division and location has been inverted, again unless that exact case is specified in the config. E.g. these states show up in the location field:
image
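
To make the exact-match limitation concrete, here is a minimal Python sketch. It is not augur curate's implementation or its config format; `RULES`, `split_geo_loc_country`, and `curate` are hypothetical names, and the sample strings are invented for illustration. It only mimics the behaviour described above: a record is corrected when its full (country, division, location) triple appears in the rules, and is passed through unchanged otherwise.

```python
# Hypothetical exact-match rule table (NOT augur's real config format):
# a (country, division, location) triple maps to its corrected triple.
RULES = {
    ("USA", "Conneticut", "Hartford"): ("USA", "Connecticut", "Hartford"),
}


def split_geo_loc_country(raw):
    """Split 'country: division, location' into a 3-tuple; missing parts become ''."""
    country, _, rest = raw.partition(":")
    division, _, location = rest.partition(",")
    return country.strip(), division.strip(), location.strip()


def curate(raw, rules=RULES):
    """Return the corrected triple if the exact triple is in rules, else the split input."""
    key = split_geo_loc_country(raw)
    return rules.get(key, key)


if __name__ == "__main__":
    samples = [
        "USA: Conneticut, Hartford",   # exact match in RULES: corrected
        "USA: Conneticut, New Haven",  # same typo, different location: untouched
        "USA: CT, Hartford",           # abbreviation: untouched
        "USA: Hartford, Connecticut",  # division/location inverted: untouched
    ]
    for s in samples:
        print(s, "->", curate(s))
```

Under these assumptions, only the first sample is fixed; the other three would each need their own rule, which is the maintenance burden the comment above is pointing at.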

@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Oct 18, 2024
@anna-parker anna-parker added the preview Triggers a deployment to argocd label Oct 20, 2024
@anna-parker anna-parker force-pushed the format_authors branch 2 times, most recently from c3cb79c to 6e05226 Compare October 22, 2024 09:38
@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Oct 24, 2024
Base automatically changed from format_authors to main October 29, 2024 12:16