Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

failing to write to S3 bucket without read permission #473

Open
pconstr opened this issue Sep 18, 2024 · 5 comments
Open

failing to write to S3 bucket without read permission #473

pconstr opened this issue Sep 18, 2024 · 5 comments

Comments

@pconstr
Copy link

pconstr commented Sep 18, 2024

I've got a use case where I'd like to use cloudpathlib to write to a bucket where the user has permissions for put_object but not for get_object.

Currently (0.19) I get an exception when opening as it appears cloudpathlib attempts to refresh the cache:

   with (target / f"{k}.csv").open("w") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/cloudpathlib/cloudpath.py", line 663, in open
    self._refresh_cache(force_overwrite_from_cloud=force_overwrite_from_cloud)
  File "/usr/local/lib/python3.11/dist-packages/cloudpathlib/cloudpath.py", line 1220, in _refresh_cache
    stats = self.stat()
            ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/cloudpathlib/s3/s3path.py", line 61, in stat
    meta = self.client._get_metadata(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/cloudpathlib/s3/s3client.py", line 139, in _get_metadata
    data = self.s3.ObjectSummary(cloud_path.bucket, cloud_path.key).get(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/boto3/resources/factory.py", line 581, in do_action
    response = action(self, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/botocore/client.py", line 569, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/botocore/client.py", line 1023, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: User: arn:aws:iam::************:user/************ is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::********.csv" because no identity-based policy allows the s3:GetObject action

Is there a way to work around this?
If there isn't but it's something that could be changed, I'd be happy to work on a patch given some direction as to what would be acceptable.

@pjbull
Copy link
Member

pjbull commented Sep 19, 2024

Ah, interesting. What happens (and/or what do you expect to happen) if the file already exists?

We could potentially add ClientError and a check for the "AccessDenied" string here:
https://github.com/drivendataorg/cloudpathlib/blob/master/cloudpathlib/s3/s3path.py#L62

Then we will raise NoStat in this scenario and _refresh_cache will assume the file doesn't exist and keep rolling:
https://github.com/drivendataorg/cloudpathlib/blob/master/cloudpathlib/cloudpath.py#L1219-L1224

@pconstr
Copy link
Author

pconstr commented Sep 19, 2024

I'm doing open("w") so I expect it to overwrite (eventually) whatever was there regardless of whether or not I can tell if there was something already. I also expect the cache not to be updated, and if I do a read later I expect that read to fail even if I just wrote to it immediately before, because I don't have access.

Currently I'm working around this by just doing:

        with StringIO() as f:
            v.to_csv(f)
            s = f.getvalue()
        dest_path = target / f"{k}.csv"
        # can't use dest_path.open() because it needs get_object permissions to write
        client = boto3.client('s3')
        client.put_object(Bucket=dest_path.bucket, Key=dest_path.key, Body=s)

It's not too bad, I can still get some value out of CloudPath, but I need to be aware of the use cases where I may not have read access and use this instead, and that's annoying.

@pjbull
Copy link
Member

pjbull commented Sep 20, 2024

Makes sense. We're open to taking a PR that implements the approach I suggested above. May need some additional fixes/testing, but should be a good first start.

@pconstr
Copy link
Author

pconstr commented Sep 22, 2024

OK. It seems to me that implementing this properly according to that suggestion will require some context information in stat() in order to know if an error resulting from lack of read access ought to be translated to NoStatError and then tolerated or bubbled up. I reckon we don't want to end up with translated exceptions for other uses of stat() such as in connection with open("r") - we would still want a ClientError bubbling up with all the cloud-specific details.
I guess that means adding a parameter to stat() or something similar.
That interface change entails changing all the implementations. In any case write-only permissions are possible pretty much everywhere.
I can see that's not going to be a small PR. Am I missing something?

@pjbull
Copy link
Member

pjbull commented Sep 23, 2024

I reckon we don't want to end up with translated exceptions for other uses of stat() such as in connection with open("r") - we would still want a ClientError bubbling up with all the cloud-specific details.

I think it might actually be ok just to raise NoStatError and keep going in the access denied scenario.

I believe that the scenarios are either:

  • User is calling stat. They get NoStatError either way, which let's them know we can't get the stats. We could include the reason in the message or attach a property like parent_exception in case someone needs to check it.
  • User is calling a read/write function; we check stat for the cache, NoStatError there gets ignored, then we try to actually read/write the file and no access error occurs on that call and is properly surfaced to the user.

Are there other scenarios that seem like they may cause a problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants