Add generic GCS download functions and use them in NL server/tools #4252
Very cool simplification!
shared/lib/gcs.py
Outdated
def maybe_download(gcs_path: str,
                   local_path_prefix='/tmp',
nit: Any reason why local_path_prefix here and local_path in the other methods? Perhaps being more explicit and calling it something like local_parent_dir may be clearer.
Trying to be consistent with the naming, so local_path(*) means the download destination. I guess local_path_root may be better here?
That works.
+1 to having root in the name. local_path_root or local_root sounds good.
Fixed
Very nice refactor!!
    raise ValueError(f"Invalid GCS path: {gcs_path}")
  bucket_name, blob_name = get_path_parts(gcs_path)
  local_path = os.path.join(local_path_prefix, bucket_name, blob_name)
  if os.path.exists(local_path):
When running locally, if the gcs_path is a directory, the check below can be useful: I've noticed that after a reboot the files in /tmp/ are deleted but the directory still exists.
  if os.path.exists(local_path) and len(os.listdir(local_path)) > 0:
    # When running locally, we may already have downloaded the path.
    # But sometimes after restart, the directories in `/tmp` become
    # empty, so ensure that's not the case.
    return local_path
added check
  if os.path.exists(local_path):
    return local_path
  if download_blob_by_path(gcs_path, local_path, use_anonymous_client):
    return local_path
Why not use download_blob here given you have bucket_name and blob_name?
good point
If download_blob_by_path is no longer used, can get rid of it?
This would be a general function. It may be worth keeping, though, to save the user from parsing the gs path into bucket and blob.
  blobs = bucket.list_blobs(prefix=blob_name)
  count = 0
  for blob in blobs:
    if blob.name.endswith("/"):
Comment what this means?
Done
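For readers following the thread, here is a rough sketch of what the `endswith("/")` check is presumably guarding against. It assumes the usual GCS behavior where a "folder" can appear in a listing as a zero-byte placeholder object whose name ends in "/". The function name plan_downloads and its parameters are hypothetical and not part of this PR; it works on plain blob names so it runs without a GCS client.

```python
import os
import posixpath
from typing import Iterable, List, Tuple


def plan_downloads(blob_names: Iterable[str], prefix: str,
                   local_dir: str) -> List[Tuple[str, str]]:
  """Maps blob names under `prefix` to local file paths (hypothetical helper)."""
  plan = []
  for name in blob_names:
    # Assumption: names ending in "/" are folder placeholders, not real
    # files, so there is nothing to download for them.
    if name.endswith("/"):
      continue
    # Path of the blob relative to the prefix. If the prefix is the blob
    # itself (a single file), fall back to its base name.
    relative = name[len(prefix):].lstrip("/") or posixpath.basename(name)
    plan.append((name, os.path.join(local_dir, relative)))
  return plan
```

Skipping the placeholder avoids trying to write a file at a path that should be a directory.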
def download_file(bucket: str,
                  filename: str,
                  use_anonymous_client: bool = False) -> str:
def download_blob(bucket_name: str,
This is a nice general function!
import shared.lib.gcs as gcs


class TestGCSFunctions(unittest.TestCase):
Add some comments on what each test is testing?
Done
server/lib/nl/common/bad_words.py
Outdated
  local_file = gcs.download_file(bucket=GLOBAL_CONFIG_BUCKET,
                                 filename=BAD_WORDS_FILE)
  local_file = gcs.maybe_download(
      f'gs://{GLOBAL_CONFIG_BUCKET}/{BAD_WORDS_FILE}')
Instead of hardcoding gs://, should we have a make_path(bucket, suffix) helper?
Done
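A minimal sketch of how a make_path helper could pair with the get_path_parts function seen in the diff, so callers never spell out "gs://" themselves. The bodies below are assumptions for illustration, not the code merged in this PR. is_gcs_path and _GCS_PREFIX are hypothetical names, and the ValueError message is copied from the diff.

```python
from typing import Tuple

_GCS_PREFIX = "gs://"  # hypothetical module constant


def is_gcs_path(path: str) -> bool:
  """True if `path` looks like gs://<bucket>[/<blob>] (assumed rule)."""
  return path.startswith(_GCS_PREFIX) and len(path) > len(_GCS_PREFIX)


def make_path(bucket_name: str, blob_name: str) -> str:
  """Builds a GCS path from a bucket name and a blob name."""
  return f"{_GCS_PREFIX}{bucket_name}/{blob_name}"


def get_path_parts(gcs_path: str) -> Tuple[str, str]:
  """Splits a GCS path into (bucket_name, blob_name)."""
  if not is_gcs_path(gcs_path):
    raise ValueError(f"Invalid GCS path: {gcs_path}")
  # Everything up to the first "/" is the bucket; the rest is the blob name.
  bucket_name, _, blob_name = gcs_path[len(_GCS_PREFIX):].partition("/")
  return bucket_name, blob_name
```

With helpers like these, the call in bad_words.py could read gcs.maybe_download(gcs.make_path(GLOBAL_CONFIG_BUCKET, BAD_WORDS_FILE)), and the two functions round-trip each other.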
shared/lib/gcs.py
Outdated
    raise ValueError(f"Invalid GCS path: {gcs_path}")
  bucket_name, blob_name = get_path_parts(gcs_path)
  local_path = os.path.join(local_path_root, bucket_name, blob_name)
  if os.path.exists(local_path) and len(os.listdir(local_path)) > 0:
Oh, one quick check: does os.listdir(file_path) work? If not, we might need another condition.
Yes, we need to check that it is a dir first.
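To make the concern concrete: os.listdir raises NotADirectoryError when given a regular file, so the cache check has to branch on the path type. One possible shape is sketched below. The helper name is_cached is hypothetical, and the final code in the PR may structure this differently.

```python
import os


def is_cached(local_path: str) -> bool:
  """Returns True if a previous download at `local_path` looks usable."""
  if os.path.isdir(local_path):
    # After a restart, directories under /tmp may survive while their
    # contents are gone, so an empty directory does not count as cached.
    return len(os.listdir(local_path)) > 0
  # A regular file counts as cached; a missing path does not.
  return os.path.isfile(local_path)
```

maybe_download could then return local_path early only when is_cached(local_path) is true, and otherwise fall through to the download.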
Thanks for the review!
Added 3 generic GCS functions:
Use these functions throughout NL apps and remove unused functions.