Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated UN ILO topic generation script, added variable groupings #4242

Merged
merged 12 commits into from
May 16, 2024

Conversation

dwnoble
Copy link
Contributor

@dwnoble dwnoble commented May 15, 2024

  • Updated UN ILO topic generation script
  • Added dc/g/Custom_UN variable grouping root to un staging environment
  • Fixed some typos in undata who variable groupings
  • Updated un staging ILO bt version
  • added undata_ilo search index
  • added undata_ilo embeddings and topic cache

@dwnoble dwnoble requested review from pradh and shifucun May 15, 2024 05:30
@dwnoble dwnoble requested a review from n-h-diaz May 15, 2024 21:40
Copy link
Contributor

@pradh pradh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool!!

undata_ilo_ft:
store: MEMORY
embeddings: gs://datcom-nl-models/embeddings_undata_ilo_2024_05_15_11_18_05.ft_final_v20230717230459.all-MiniLM-L6-v2.csv
model: ft-final-v20230717230459-all-MiniLM-L6-v2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: when you sync, you might also need to add this to

Copy link
Contributor Author

@dwnoble dwnoble May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@@ -49,6 +49,7 @@ class DCNames(str, Enum):
SDG_DC = 'sdg'
SDG_MINI_DC = 'sdgmini'
UNDATA_DC = 'undata'
UNDATA_ILO_DC = 'undata_ilo'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add this to SPECIAL_DC_LIST

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

nodes = generate([_UNDATA_ILO_ROOT],
vars,
REMOVE_SVG_PREFIX,
skip_assert=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment for why we skip_assert? I suppose we want to resolve that (in the next iteration of variable groupings?)

Copy link
Contributor Author

@dwnoble dwnoble May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment & TODO to remove the skip_assert

_UNDATA_ILO_NL_DESCRIPTIONS_FILE = os.path.join(
_TMP_DIR, 'undata_ilo_nl_descriptions.csv')

REMOVE_SVG_PREFIX = "Custom_"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could consider making this a flag like "--use_custom_dc_groups" or something that can then set "Custom_". So when we run UNData post launch, we don't need it.

Deferring this to later is OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout- added a comment here for now, and can address this when moving this data to production


REMOVE_SVG_PREFIX = "Custom_"

API_ROOT = "https://staging.unsdg.datacommons.org"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this too should not be hardcoded, but passed via flag? Deferring to later is OK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 added a comment.

def download_svg_recursive(svgs: List[str],
nodes: Dict[str, Dict],
keep_vars: Set[str],
remove_svg_prefix=""):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I regret not wrapping this whole thing in a class :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After working with this script , maybe we can refactor it to a class and to take in a yaml config file for the various sdg/who/ilo configurations

@@ -27,6 +27,10 @@ indexes:
store: MEMORY
embeddings: gs://datcom-nl-models/embeddings_undata_2024_03_20_11_01_12.ft_final_v20230717230459.all-MiniLM-L6-v2.csv
model: ft-final-v20230717230459-all-MiniLM-L6-v2
undata_ilo_ft:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As validation, should we have an e2e test like

def test_detection_basic_undata(self):
self.run_detection('detection_api_undata_idx', ['Health in USA'],
test='unittest',
idx='undata_ft')
?

(And later when the data has been added to main DC, we can do fulfill as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@pradh
Copy link
Contributor

pradh commented May 16, 2024

I see your comments, but not the commit?

…environments. Added detect tests for undata_ilo index. pr feedback
@dwnoble
Copy link
Contributor Author

dwnoble commented May 16, 2024

I see your comments, but not the commit?

Commented a little too early- just pushed

@pradh pradh merged commit dd5bdc5 into datacommonsorg:master May 16, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants