Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve / fix naming in NL server #4263

Merged
merged 13 commits into from
May 24, 2024
Merged

Conversation

shifucun
Copy link
Contributor

@shifucun shifucun commented May 22, 2024

This PR aims to greatly disambiguate code logic by using distinct naming for various object. Currently there are 10+ names like embedding_xxx for class, object, dict and files with a lot of clashes.

With this change:

  • Use catalog to replace embeddings_config object and represent the all available (may not be used) model/index config
  • Use registry to replace embedddings_map object which holds all the model/client objects
  • Use env to replace embeddings_spec which is the additional config per environment
  • Make all the field naming the same across yaml file and dataclass object. This will help for parsing and error checking.
  • Use A and A_dict to differentiate dataclass object vs yaml dict.
  • Move attribute model part of registry

@shifucun shifucun requested a review from pradh May 22, 2024 20:08
@shifucun shifucun changed the title Improve / fix naming in NL server Improve / fix naming in NL server [DO NOT MERGE BEFORE PROD RELEASE] May 22, 2024
@shifucun shifucun changed the title Improve / fix naming in NL server [DO NOT MERGE BEFORE PROD RELEASE] Improve / fix naming in NL server May 23, 2024
@shifucun shifucun requested a review from keyurva May 23, 2024 04:27
Copy link
Contributor

@pradh pradh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful change! Thank you for breaking it up so that it is easy to review!

default_indexes:
- medium_ft
enabled_indexes:
- medium_ft
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intend to remove others?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended to only keep medium_ft for these instances. For world bank, do you think we should keep sdg and undata?

Copy link
Contributor

@pradh pradh May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since worldbank is part of un, maybe keep it there, could come handy for any demos, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

_LOCAL_ENV_VALUES_PATH: str = f'{Path(__file__).parent.parent}/deploy/helm_charts/envs/autopush.yaml'


@dataclass
class EmbeddingsSpec:
default_index: str
class Env:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add class comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

EMBEDDINGS_SPEC_KEY: str = 'EMBEDDINGS_SPEC'
REGISTRY_KEY: str = 'NL_REGISTRY'
CATALOG_KEY: str = 'NL_CATALOG'
ENV_KEY: str = 'NL_ENV'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A one-line comment on what each is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -33,57 +34,59 @@


#
# A map from specific index to embeddings stores and models.
# A class to hold embeddings stores and models.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more precise definition, and distinguish it from Env and Catalog?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

# Defines one embeddings index config.
@dataclass
class EmbeddingsConfig:
# This represents the full catalog of models and indexes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... a subset of them may not be enabled by the Env.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -47,13 +48,12 @@ def healthz():
if default_embeddings_loaded:
return 'OK', 200

default_index_type = current_app.config[
config.EMBEDDINGS_SPEC_KEY].default_index
default_index_type = current_app.config[config.ENV_KEY].default_indexes[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[0] may not be safe? LHS has a check before access.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, updated the check.

@shifucun
Copy link
Contributor Author

thanks for the review!

@shifucun shifucun enabled auto-merge (squash) May 24, 2024 05:25
logging.warning('Health Check Failed: Default index name empty!')
nl_env = current_app.config[config.ENV_KEY]
if not nl_env.default_indexes:
logging.error('Health Check Failed: Default index name empty!')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are warnings below, so change them all or not?

(original rational for warning was I thought this can happen while booting up, i think i saw this crash in custom dc - #4240)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. Let me update here in next PR. I hope we don't need to make a real prediction call in /healthz then all logic here can be simplified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, makes sense!

@shifucun shifucun merged commit 32dc95b into datacommonsorg:master May 24, 2024
9 checks passed
Copy link
Contributor

@keyurva keyurva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always a fan of using good names - thanks!

@@ -142,9 +143,9 @@ def generate_embeddings_yaml(model_info: utils.ModelConfig,
"version": 1,
"indexes": {
"custom_ft": {
"embeddings": embeddings_csv_handler.abspath(),
"embeddings_path": embeddings_csv_handler.abspath(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be cool if at some point, changes to the yaml format don't require changes to the custom embeddings tool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think update the generated yaml file should do that, so running code that generate v10 yaml file with _v10.yaml. and the server would look for _v10.yaml and not read _v9.yaml (which can not parse).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants