-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use shared/lib in build embeddings tool and run this tool as a module #4254
Conversation
_SHARED_LIB_DIR = os.path.join(_THIS_DIR, "..", "..", "..", "shared", "lib") | ||
sys.path.append(_SHARED_LIB_DIR) | ||
import gcs # type: ignore | ||
from shared.lib import gcs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so much better!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed!
@@ -67,7 +67,8 @@ def load_data(): | |||
# Build custom embeddings. | |||
command2 = [ | |||
'python', | |||
'build_custom_dc_embeddings.py', | |||
'-m', | |||
'tools.nl.embeddings.build_custom_dc_embeddings', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Dockerfile will also need to be updated and validated that it builds correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated and verified with docker build locally.
python3 -m tools.nl.embeddings.build_custom_dc_embeddings "$@" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This script is not used anywhere. So we can delete it. If you do, please update the commands in the doc: https://github.com/datacommonsorg/website/blob/master/tools/nl/embeddings/build_custom_dc_embeddings.md
Also, can you run these commands as modules to verify that the functionality works as expected: #4216 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verified commands can run. I have also tested with local setup that talks to GCS storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the entire doc uses this script, so will keep as is. As I plan to consolidate the build embedding tool later, can remove at that time.
…to build-embedding-lib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing cleanup!!
autogen_dfs.append(pd.read_csv(autogen_csv).fillna("")) | ||
if autogen_dfs: | ||
df_svs = pd.concat([df_svs] + autogen_dfs) | ||
df_svs = df_svs.drop_duplicates(subset=utils.DCID_COL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're getting rid of autogen files support, should we also delete them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, they have been deleted already...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool!
Thanks for review! |
This makes it possible to use shared libraries (more to come) and common config processing in NL server.
With this, more GCS download functions could be removed.
Also remove autogen input support since no autogen descriptions exist anymore.