Use shared/lib in build embeddings tool and run this tool as a module #4254

Merged: 8 commits merged into datacommonsorg:master on May 22, 2024

@shifucun shifucun commented May 21, 2024

This makes it possible to use shared libraries (more to come) and common config processing in the NL server.

With this, more GCS download functions can be removed.

This also removes autogen input support, since no autogen descriptions exist anymore.

@shifucun shifucun requested review from keyurva and pradh May 21, 2024 05:37
@shifucun shifucun changed the title from "Make build embeddings tool a module and able to import from shared/lib" to "Use shared/lib in build embeddings tool and run this tool as a module" May 21, 2024
-_SHARED_LIB_DIR = os.path.join(_THIS_DIR, "..", "..", "..", "shared", "lib")
-sys.path.append(_SHARED_LIB_DIR)
-import gcs # type: ignore
+from shared.lib import gcs
Contributor

This is so much better!

Contributor Author

indeed!
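
For readers skimming this thread, here is a minimal sketch of why the package-style import resolves once the tool is run as a module. It builds a throwaway directory tree and runs the same file two ways. The layout (`tools/demo/build.py`) and the `describe()` helper are hypothetical stand-ins, not the real repository contents; only the `from shared.lib import gcs` line comes from the diff above.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def make_demo_repo(root):
    """Create a tiny, hypothetical repo layout with a shared library and a tool."""
    files = {
        "shared/__init__.py": "",
        "shared/lib/__init__.py": "",
        # Stand-in for the real gcs helper; describe() is an invented function.
        "shared/lib/gcs.py": "def describe():\n    return 'gcs helper loaded'\n",
        "tools/__init__.py": "",
        "tools/demo/__init__.py": "",
        # Same import style as the new line in the diff above.
        "tools/demo/build.py": "from shared.lib import gcs\nprint(gcs.describe())\n",
    }
    for rel_path, content in files.items():
        path = Path(root) / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)


def run(cmd, cwd):
    """Run cmd in cwd, ignoring any inherited Python path settings."""
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}
    return subprocess.run(cmd, cwd=cwd, env=env, capture_output=True, text=True)


def demo():
    """Return (module_run, script_run) results for the two invocation styles."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_demo_repo(root)
        # With -m, the current directory (the repo root) is put on sys.path,
        # so the top-level "shared" package is importable.
        as_module = run([sys.executable, "-m", "tools.demo.build"], cwd=root)
        # Run as a plain script, only the script's own directory is on sys.path,
        # so "shared" cannot be found without a sys.path workaround.
        as_script = run([sys.executable, "build.py"], cwd=root / "tools" / "demo")
    return as_module, as_script


if __name__ == "__main__":
    as_module, as_script = demo()
    print("module run exit code:", as_module.returncode)
    print("script run exit code:", as_script.returncode)
```

The module run succeeds because it is started from the repository root; the direct script run fails with `ModuleNotFoundError`, which is the situation the removed `sys.path.append` lines were working around.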

@@ -67,7 +67,8 @@ def load_data():
 # Build custom embeddings.
 command2 = [
 'python',
-'build_custom_dc_embeddings.py',
+'-m',
+'tools.nl.embeddings.build_custom_dc_embeddings',
Contributor

The Dockerfile will also need to be updated, and we should validate that it still builds correctly.

Contributor Author

Updated and verified with docker build locally.
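
As an illustration of the command change in the hunk above, the sketch below builds a `-m` style argument list. The module path comes from the diff; the helper names `module_command` and `run_module` are hypothetical, and using `sys.executable` instead of the literal `'python'` is a substitution made here, not what the PR does. The working directory matters: the command has to be started from the directory that contains the top-level packages (and the Docker image needs the same layout), which is why the Dockerfile had to be revisited.

```python
import subprocess
import sys


def module_command(module, args=()):
    """Return an argv list that runs `module` with the current interpreter via -m."""
    return [sys.executable, "-m", module, *args]


def run_module(module, args=(), cwd=None):
    """Run a module from `cwd` and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(
        module_command(module, args),
        cwd=cwd,
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Module path taken from the diff; the command is only printed here,
    # since running it requires the real repository checkout.
    print(module_command("tools.nl.embeddings.build_custom_dc_embeddings"))

    # The same helper exercised on a standard-library module that is always present.
    result = run_module("platform")
    print(result.stdout.strip())
```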

python3 -m tools.nl.embeddings.build_custom_dc_embeddings "$@"
Contributor

This script is not used anywhere, so we can delete it. If you do, please update the commands in the doc: https://github.com/datacommonsorg/website/blob/master/tools/nl/embeddings/build_custom_dc_embeddings.md

Also, can you run these commands as modules to verify that the functionality works as expected? See #4216 (comment).

Contributor Author

Verified that the commands run. I have also tested with a local setup that talks to GCS.

Contributor Author

I found that the entire doc uses this script, so I will keep it as is. I plan to consolidate the build embeddings tool later and can remove the script at that time.

@shifucun shifucun requested a review from keyurva May 21, 2024 23:20
Contributor

@pradh pradh left a comment

Amazing cleanup!!

autogen_dfs.append(pd.read_csv(autogen_csv).fillna(""))
if autogen_dfs:
df_svs = pd.concat([df_svs] + autogen_dfs)
df_svs = df_svs.drop_duplicates(subset=utils.DCID_COL)
Contributor

If we're getting rid of autogen files support, should we also delete them?

Contributor Author

Yes, they have already been deleted.

Contributor

@keyurva keyurva left a comment

Very cool!

@shifucun shifucun enabled auto-merge (squash) May 22, 2024 00:03
@shifucun
Contributor Author

Thanks for the review!

@shifucun shifucun merged commit c44216a into datacommonsorg:master May 22, 2024
8 of 9 checks passed
3 participants