feat(ingest): Format ingested country: division into geoLocCountry, geoLocAdmin1, geoLocAdmin2 #2991

Closed
wants to merge 68 commits into from

Conversation

anna-parker
Contributor

@anna-parker anna-parker commented Oct 11, 2024

resolves #

preview URL: https://format-country.loculus.org/

Summary

We have mentioned in the past that it would be good to break down NCBI's geoloc field. It has the format country: division, where division can usually be split further into geoLocAdmin1 and geoLocAdmin2.

Sadly, the format of the division field is inconsistent. For example, for the US some samples have state, city, others have city, state, and yet others do not separate the state with a comma or use the state abbreviation. This means that to extract more specific details we will need to do string matching. It also became clear to me that NIH does not clean its sequence metadata, judging by how often Conneticut occurs. Astrahan is another example of this. I have now added a list of common misspellings; in future we could potentially use fuzzy matching for this.
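As a rough illustration of the splitting and misspelling step, here is a minimal Python sketch. The function names and the `COMMON_MISSPELLINGS` table are hypothetical and not taken from this PR; the two entries are just the examples named above, and the corrected spellings are my assumption.

```python
# Hypothetical lookup table of known misspellings -> corrected spelling.
# Only the two examples mentioned in the description are included here.
COMMON_MISSPELLINGS = {
    "Conneticut": "Connecticut",
    "Astrahan": "Astrakhan",
}


def split_geo_loc(geo_loc: str) -> tuple[str, str]:
    """Split 'country: division' into (country, division).

    The division is an empty string when no ':' is present.
    """
    country, _, division = geo_loc.partition(":")
    return country.strip(), division.strip()


def fix_misspellings(division: str) -> str:
    """Replace every known misspelling that occurs in the division string."""
    for wrong, right in COMMON_MISSPELLINGS.items():
        division = division.replace(wrong, right)
    return division
```

A fixed lookup table like this only catches spellings that have already been seen, which is why fuzzy matching is the natural follow-up.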

I got ChatGPT to write up YAML lists of all states/regions/provinces for the countries where we have a lot of samples with geoLocAdmin1 information.

By default, for unmapped countries I keep the mapping as is: division is mapped to geoLocAdmin1.

For mapped countries, I put the state (or region/province) in geoLocAdmin1 if the division contains the state name. I also put the entire division string into geoLocAdmin2, unless it is an exact match for the state, in which case I do not add it to geoLocAdmin2 (to avoid duplication). If no state matches the division, I put the entire division string in geoLocAdmin2.
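The mapping rule described above could be sketched roughly as follows. This is not the code from this PR: `map_division` and `states_by_country` are hypothetical names, case-insensitive comparison is an assumption, and trying longer state names first (so that "West Virginia" wins over "Virginia") is a design choice of this sketch only.

```python
from typing import Optional


def map_division(
    country: str,
    division: str,
    states_by_country: dict[str, list[str]],
) -> tuple[Optional[str], Optional[str]]:
    """Return (geoLocAdmin1, geoLocAdmin2) for one country/division pair."""
    states = states_by_country.get(country)
    if states is None:
        # Unmapped country: keep the old behaviour, division -> geoLocAdmin1.
        return division, None

    folded = division.casefold()
    # Assumption: check longer names first so a state name that is a
    # substring of another state name does not shadow it.
    for state in sorted(states, key=len, reverse=True):
        if folded == state.casefold():
            # Exact match: do not duplicate the value in geoLocAdmin2.
            return state, None
        if state.casefold() in folded:
            # Division contains the state: keep the full string as geoLocAdmin2.
            return state, division

    # Mapped country but no state matched: everything goes to geoLocAdmin2.
    return None, division
```

Plain substring matching can still misfire when a state name appears inside an unrelated place name, so the result should be treated as a best guess.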

Potential Alternatives

Screenshot

Previous list of geoLocAdmin1 fields: [two screenshots, not reproduced]
New formatting: [two screenshots, not reproduced]

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Oct 11, 2024
@anna-parker anna-parker changed the title Format_country WIP: Format_country Oct 11, 2024
@anna-parker anna-parker changed the title WIP: Format_country feat(ingest): Format ingested country: division into geoLocCountry, geoLocAdmin1, geoLocAdmin2 Oct 11, 2024
@anna-parker
Contributor Author

anna-parker commented Oct 14, 2024

  1. Download https://raw.githubusercontent.com/nextstrain/ncov-ingest/refs/heads/master/source-data/gisaid_geoLocationRules.tsv
  2. Install augur (I needed to downgrade Python to 3.11.10)
  3. Run augur curate. This relies on the metadata being a TSV in the format produced by dataformat tsv - either use the commands below or modify format_ncbi_dataset_report some more.
dataformat tsv virus-genome --package results/ncbi_dataset.zip > results/metadata.tsv 
augur curate parse-genbank-location --metadata results/metadata.tsv --location-field='Geographic Location' --output-metadata results/curated.tsv
augur curate apply-geolocation-rules --metadata results/curated.tsv --output-metadata results/curated_rules.tsv  --geolocation-rules=gisaid_geoLocationRules.tsv --region-field='Geographic Region'

The division and location columns in the output still need to be mapped to geoLocAdmin1 and geoLocAdmin2 - this should be done in a separate rule and then joined with the other metadata fields, maybe in prepare data.
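A minimal sketch of that remaining mapping step, assuming the curated TSV contains columns named division and location as mentioned above. The function name `add_geo_loc_admin_fields` and the `accession` column in the usage example are hypothetical, and this only copies the two columns; it does not do the join with the other metadata.

```python
import csv
import io
from typing import TextIO


def add_geo_loc_admin_fields(tsv_in: TextIO, tsv_out: TextIO) -> None:
    """Copy division -> geoLocAdmin1 and location -> geoLocAdmin2 in a TSV."""
    reader = csv.DictReader(tsv_in, delimiter="\t")
    fieldnames = list(reader.fieldnames or []) + ["geoLocAdmin1", "geoLocAdmin2"]
    writer = csv.DictWriter(
        tsv_out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Missing columns fall back to an empty string.
        row["geoLocAdmin1"] = row.get("division", "")
        row["geoLocAdmin2"] = row.get("location", "")
        writer.writerow(row)


if __name__ == "__main__":
    # In-memory example; real use would pass open file handles instead.
    source = io.StringIO("accession\tdivision\tlocation\nA1\tCalifornia\tSan Diego\n")
    target = io.StringIO()
    add_geo_loc_admin_fields(source, target)
    print(target.getvalue(), end="")
```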

Base automatically changed from change_how_ingest_ingests to format_authors October 17, 2024 11:57
@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Oct 17, 2024
@corneliusroemer corneliusroemer force-pushed the format_authors branch 2 times, most recently from 533f051 to f480d67 Compare October 18, 2024 14:58
@anna-parker
Contributor Author

Closing as I improved this in #3026 by using pycountry for official ISO standard lists of geoLocAdmin1 divisions for every country, and additionally used fuzzywuzzy for fuzzy matching.
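To illustrate the fuzzy-matching idea only, here is a small sketch that uses Python's standard-library difflib as a stand-in. It is not the approach in #3026: it uses neither pycountry nor fuzzywuzzy, the function name `fuzzy_match_admin1` is hypothetical, the list of names is passed in by the caller, and the 0.85 cutoff is an arbitrary assumption.

```python
from difflib import get_close_matches
from typing import Optional


def fuzzy_match_admin1(
    candidate: str,
    admin1_names: list[str],
    cutoff: float = 0.85,  # assumed threshold, would need tuning on real data
) -> Optional[str]:
    """Return the closest official name for candidate, or None if none is close."""
    # Compare case-insensitively but return the name in its original casing.
    lookup = {name.casefold(): name for name in admin1_names}
    matches = get_close_matches(
        candidate.casefold(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None
```

With a list such as ["Connecticut", "California", "Colorado"], the misspelling Conneticut resolves to Connecticut, while an unrelated string returns None.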
