
Add generic GCS download functions and use them in NL server/tools #4252

Merged: 6 commits into datacommonsorg:master on May 21, 2024

Conversation

shifucun
Contributor

@shifucun shifucun commented May 20, 2024

Added 3 generic GCS functions:

  • download_blob(), which downloads a blob (a file or a folder) given a bucket name, a blob name, and a local path.
  • download_blob_by_path(), a wrapper around download_blob() that takes a GCS path.
  • maybe_download(), which downloads the blob under 'local_path_prefix', preserving the GCS path structure, and skips the download if the file already exists.

Use these functions throughout NL apps and remove unused functions.
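For reference, the path handling these functions share can be sketched as below. get_path_parts() and the '/tmp' default come from the diff snippets in this thread; the exact parsing details are illustrative, and local_download_path() is a hypothetical name for the layout computation done inside maybe_download(), not an actual function in the PR.

```python
import os


def is_gcs_path(path: str) -> bool:
  # A GCS path looks like 'gs://<bucket>/<blob>'.
  return path.startswith('gs://')


def get_path_parts(gcs_path: str):
  # 'gs://bucket/dir/file' -> ('bucket', 'dir/file')
  bucket_name, _, blob_name = gcs_path[len('gs://'):].partition('/')
  return bucket_name, blob_name


def local_download_path(gcs_path: str, local_path_prefix='/tmp') -> str:
  # Mirrors the GCS path structure under the local root, e.g.
  # 'gs://b/d/f' -> '/tmp/b/d/f'.
  if not is_gcs_path(gcs_path):
    raise ValueError(f"Invalid GCS path: {gcs_path}")
  bucket_name, blob_name = get_path_parts(gcs_path)
  return os.path.join(local_path_prefix, bucket_name, blob_name)
```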

@shifucun shifucun requested review from keyurva and pradh May 20, 2024 18:26
Contributor

@keyurva keyurva left a comment

Very cool simplification!



def maybe_download(gcs_path: str,
                   local_path_prefix='/tmp',
Contributor

nit: Any reason why it's local_path_prefix here and local_path in the other methods? Perhaps being more explicit and calling it something like local_parent_dir would be clearer.

Contributor Author

Trying to be consistent with the naming, so local_path(*) means the download destination. I guess local_path_root may be better here?

Contributor

That works.

Contributor

+1 to having root in the name. local_path_root or local_root sounds good.

Contributor Author

Fixed

@shifucun shifucun enabled auto-merge (squash) May 20, 2024 22:36
Contributor

@pradh pradh left a comment

Very nice refactor!!

    raise ValueError(f"Invalid GCS path: {gcs_path}")
  bucket_name, blob_name = get_path_parts(gcs_path)
  local_path = os.path.join(local_path_prefix, bucket_name, blob_name)
  if os.path.exists(local_path):
Contributor

When running locally, if the gcs_path is a directory, the check below can be useful: after a reboot, I've noticed the files in /tmp/ are deleted but the directory still exists.

  if os.path.exists(local_path) and len(os.listdir(local_path)) > 0:
    # When running locally, we may already have downloaded the path.
    # But sometimes after restart, the directories in `/tmp` become
    # empty, so ensure that's not the case.
    return local_path

Contributor Author

added check

  if os.path.exists(local_path):
    return local_path
  if download_blob_by_path(gcs_path, local_path, use_anonymous_client):
    return local_path
Contributor

Why not use download_blob here given you have bucket_name and blob_name?

Contributor Author

good point

Contributor

If download_blob_by_path is no longer used, can get rid of it?

Contributor Author

@shifucun shifucun May 20, 2024

This is meant to be a general function. It may be worth keeping to save the user from parsing the gs:// path into bucket and blob, though.

  blobs = bucket.list_blobs(prefix=blob_name)
  count = 0
  for blob in blobs:
    if blob.name.endswith("/"):
Contributor

Comment what this means?

Contributor Author

Done

def download_file(bucket: str,
                  filename: str,
                  use_anonymous_client: bool = False) -> str:
def download_blob(bucket_name: str,
Contributor

This is a nice general function!

import shared.lib.gcs as gcs


class TestGCSFunctions(unittest.TestCase):
Contributor

Add some comments on what each test is testing?

Contributor Author

Done

local_file = gcs.download_file(bucket=GLOBAL_CONFIG_BUCKET,
                               filename=BAD_WORDS_FILE)
local_file = gcs.maybe_download(
    f'gs://{GLOBAL_CONFIG_BUCKET}/{BAD_WORDS_FILE}')
Contributor

Instead of hardcoding gs://, should we have a make_path(bucket, suffix) helper?

Contributor Author

Done
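A minimal sketch of what such a helper might look like. The make_path name comes from the suggestion above; the exact signature and docstring here are assumptions, not the implementation that landed.

```python
def make_path(bucket: str, suffix: str) -> str:
  """Builds a 'gs://' path from a bucket name and an object suffix.

  Avoids hardcoding the 'gs://' scheme at call sites (hypothetical sketch).
  """
  return f'gs://{bucket}/{suffix}'
```

A call site like the one above would then become
`gcs.maybe_download(gcs.make_path(GLOBAL_CONFIG_BUCKET, BAD_WORDS_FILE))`.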

  if os.path.exists(local_path):
    return local_path
  if download_blob_by_path(gcs_path, local_path, use_anonymous_client):
    return local_path
Contributor

If download_blob_by_path is no longer used, can get rid of it?

    raise ValueError(f"Invalid GCS path: {gcs_path}")
  bucket_name, blob_name = get_path_parts(gcs_path)
  local_path = os.path.join(local_path_root, bucket_name, blob_name)
  if os.path.exists(local_path) and len(os.listdir(local_path)) > 0:
Contributor

Oh, one quick check: does os.listdir() work when local_path is a file? If not, this might need another condition.

Contributor Author

Yes, we need to check that it is a directory first.

@shifucun
Contributor Author

Thanks for the review!

@shifucun shifucun merged commit 78122be into datacommonsorg:master May 21, 2024
8 of 9 checks passed