
Allow bucket to be mounted to an arbitrary logical path #2196

Open
tedgin opened this issue May 14, 2024 · 6 comments
tedgin commented May 14, 2024

I'm requesting that the iRODS S3 storage resource plugin be able to mount (attach, graft) a bucket to an arbitrary logical path in an iRODS zone.

Currently, a bucket path, e.g., /my_bucket/, is mounted at the zone's logical root, e.g., /zone/. This means a data object added at /zone/home/user/object gets the key home/user/object in /my_bucket/. This is fine when the bucket is used as a zone-wide storage resource and its data is primarily accessed through iRODS. If the data is primarily accessed outside of iRODS but occasionally still needs to be reached through iRODS, it is inconvenient to force an object in the bucket either to be prefixed with something like home/user or to be accessed in iRODS at the base of the zone, i.e., /zone/object.
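To make the current behavior concrete, here's a minimal sketch of the zone-root mapping described above. The function name and structure are illustrative only, not the plugin's actual internals:

```python
# Hypothetical sketch: under the current behavior, the S3 object key is
# the logical path with the zone prefix stripped off.

def logical_to_s3_key(logical_path: str, zone: str) -> str:
    """Map an iRODS logical path to an S3 key by removing the zone prefix."""
    prefix = "/" + zone + "/"
    if not logical_path.startswith(prefix):
        raise ValueError(f"{logical_path} is not in zone {zone}")
    return logical_path[len(prefix):]

print(logical_to_s3_key("/zone/home/user/object", "zone"))  # home/user/object
```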

Pretend that /my_bucket/ already contains thousands of objects when an iRODS storage resource is created for it, and that mature workflows add and access objects in the bucket outside of iRODS following specific naming conventions. Renaming the existing objects so they don't show up directly under the zone would be difficult, and it gets worse if one of the S3 objects has the name of an existing iRODS collection or data object, like home or home/tedgin/teletubbies.jpg. If the bucket path could be mounted to an arbitrary logical path, e.g., /zone/home/project/s3-bucket/, the existing S3 objects wouldn't need to be renamed.

Allowing a bucket to be mounted to an arbitrary logical path also opens up the possibility of a user or project accessing data in an S3 bucket that they own (and pay for) from within an iRODS zone, without the bucket becoming usable by everyone else in the zone. Supporting this, however, is outside the scope of this feature request.

tedgin commented May 14, 2024

One way of satisfying this feature request would be to create a new archive naming policy, e.g., chroot. This naming policy would require that the logical mount path be provided as part of the S3 storage resource definition. This could be done using a context variable, e.g., ROOT_COLL. Here's an example iadmin command for creating one of these S3 resources.

iadmin mkresc \
   myBucketResc \
   s3 \
   "$(hostname)":/my_bucket/prefix/in/bucket \
   'S3_DEFAULT_HOSTNAME=s3.us-east-1.amazonaws.com;S3_AUTH_FILE=/var/lib/irods/my_bucket.keypair;ARCHIVE_NAMING_POLICY=chroot;ROOT_COLL=/zone/home/project/s3-bucket'
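
A chroot policy along these lines could rebase logical paths from ROOT_COLL onto the in-bucket prefix. The following is a sketch under that assumption; the function and parameter names are illustrative and are not taken from the plugin:

```python
# Hypothetical sketch of the proposed chroot naming policy: the logical
# path is rebased from ROOT_COLL onto the bucket's physical prefix.

def chroot_logical_to_s3_key(logical_path: str, root_coll: str,
                             bucket_prefix: str) -> str:
    """Rebase a logical path under ROOT_COLL onto the in-bucket prefix."""
    root = root_coll.rstrip("/") + "/"
    if not logical_path.startswith(root):
        raise ValueError(f"{logical_path} is outside {root_coll}")
    relative = logical_path[len(root):]
    return "/".join(p for p in (bucket_prefix.strip("/"), relative) if p)

key = chroot_logical_to_s3_key(
    "/zone/home/project/s3-bucket/data.csv",
    "/zone/home/project/s3-bucket",
    "prefix/in/bucket")
print(key)  # prefix/in/bucket/data.csv
```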

Here's an implementation of this for version 4.2.11: main...tedgin:irods_resource_plugin_s3:main

korydraughn (Contributor) commented:

Very interesting. We'll look into it following UGM.

alanking (Contributor) commented:

For posterity...

This was discussed at length during the May 2024 S3 Working Group. Minutes have not yet been published.


trel commented May 22, 2024

Now available...
https://github.com/irods-contrib/irods_working_group_s3/blob/main/20240503-minutes.md


trel commented Jun 7, 2024

I think this is a subset of today's functionality of the S3 plugin.

This is a restriction of which logical_path(s) are allowed to be stored on this resource.

So... potentially a new context string setting...
...;LOGICAL_PATHS=/logical_path/1::/logical_path/2::/logical_path/3;...
OR
...;LOGICAL_PATH=/logical_path/1;LOGICAL_PATH=/logical_path/2;LOGICAL_PATH=/logical_path/3;...

Could be enough to let us implement the requested feature and 'pin' a resource to a certain subset of the logical namespace.
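
A pinning check like the one proposed could be sketched as follows, assuming the '::'-delimited LOGICAL_PATHS form; the context key, delimiter, and function names come from the proposal above and are not an existing plugin feature:

```python
# Hypothetical sketch: parse a '::'-delimited LOGICAL_PATHS context value
# and reject logical paths that fall outside the pinned subtrees.

def parse_logical_paths(context_value: str) -> list:
    """Split the context value into normalized allowed prefixes."""
    return [p.rstrip("/") for p in context_value.split("::") if p]

def is_pinned(logical_path: str, allowed: list) -> bool:
    """True if logical_path is, or sits under, one of the allowed prefixes."""
    return any(logical_path == a or logical_path.startswith(a + "/")
               for a in allowed)

allowed = parse_logical_paths("/logical_path/1::/logical_path/2::/logical_path/3")
print(is_pinned("/logical_path/2/data.csv", allowed))  # True
print(is_pinned("/zone/home/other", allowed))          # False
```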

If there is existing data in a newly 'mounted' bucket, it would need to be 'scanned' or 'registered' for that data to be visible via the catalog. Could be via Lambda, could be ingest tool, etc.


trel commented Jun 7, 2024

But that new idea wouldn't allow management or updating of the physical paths within the bucket itself.
