-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
User-based ObjectStore #4840
User-based ObjectStore #4840
Conversation
I started a branch with fixes here: https://github.com/jgoecks/galaxy/tree/UserBasedObjectStore2 Specifically, there are fixes for anonymous access. I can't seem to find your fork to initiate a pull request however—perhaps because your repo is restricted somehow and/or is so far behind the main repo? |
@VJalili I still cannot find your fork to make a PR against. I'll try to look into this more soon. |
@jgoecks I applied the changes you made on your branch on this branch. |
It will be nice to have user-based storage. @VJalili Wonder whether you use sql tables to manage user and corresponding storages. I haven't looked deep into this project yet, but my first feeling is to build a table on top of current storage management system. |
@dannon Thank you! I guess that has fixed it as all tests passed locally. |
@dannon I think the patch works fine for integration tests, but it breaks CI unit tests. |
@VJalili Ahh, sure enough. I was laser focused on that one issue, let me dig deeper since there's more to the picture here. Yeah, the error here has popped up again: I'll try to figure out how we're getting an hda handle that's no longer bound. |
The orphan HDA handle is the issue causing the integration test's failure; I guess that is happening when Galaxy is writing metadata to a file. |
Thanks for refactoring the concept of ownership out of the dataset instance level (HDA/LDDA) and for the integration tests. These are serious improvements I believe. Can you add an integration test of copying data on storage media between users? I assume based on the reading if a user copies my data and then I delete the storage media - the data will disappear for the user but I want that verified and stated explicitly with a test case. Is that fair? |
@jmchilton as per the challenges using this feature for shared data may introduce (e.g., authorization issues), last we decided to postpone the ability of using this feature for shared data. Do you think we should add some warnings for users who attempt to use this feature for share data? |
Yes, ideally. I'm not sure yet if that should be required for this PR but that is a good idea in general if we're going to impose that restriction. |
@jmchilton I disabled sharing for user storage media (a history that contains a dataset stored on a user-owned storage, cannot be shared); please see a272454. Any other thoughts? |
@@ -1402,6 +1429,7 @@ def _set_object_store_ids(self, job): | |||
# afterward. State below needs to happen the same way. | |||
for dataset_assoc in job.output_datasets + job.output_library_datasets: | |||
dataset = dataset_assoc.dataset | |||
self.__assign_media(job, dataset.dataset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've spent months of my life trying to optimize this process of initializing the output datasets. Can we have some property on app that we can check to see if this method would ever doing anything - and skip it if there is no possibility of assigning media?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! this config property is added that if disabled, this method will not do anything. Would that be addressing your concerns?
One sticking point at a time - I think User. _calculate_or_set_disk_usage is going to be a problem here right? Like as soon as the user's disk usage is recalculated - all the dataset usage for all the attached disk is going to be added to the user's quota even though you very carefully prevented it from being initially added. I'm trying to work on this in the context of creating like scratch storage object stores - I think what we need is more abstractions around quota calculation that ties them closer to object stores and is extensible for applications like this. I'll see if I can come up with something. |
Not used yet in Galaxy core yet, but useful for applications where you want object store selection to be based on user in some way. This code was taken from galaxyproject#4840. Part of this is trying to reduce the number of files that branch touches to make review easier - but I'm confident this extension point is good regardless. Also it makes it clear we need to keep the user object in the picture when assigning the object store ID in the future.
@@ -1057,7 +1063,12 @@ def purge_deleted_datasets(self, trans): | |||
if not hda.deleted or hda.purged: | |||
continue | |||
if trans.user: | |||
trans.user.adjust_total_disk_usage(-hda.quota_amount(trans.user)) | |||
if not hda.dataset.has_active_storage_media(): | |||
trans.user.adjust_total_disk_usage(-hda.quota_amount(trans.user)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can get around duplicating this code in the controllers with #10208.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That other PR has been merged so I think this should be rebased now along with the fix for https://github.com/galaxyproject/galaxy/pull/4840/files#r486551671.
"""Sets and gets the size of the data on disk""" | ||
return self.dataset.set_size(**kwds) | ||
"""Sets the size of the data on disk""" | ||
self.dataset.set_size(**kwds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a broken rebase or is it important to drop the return here?
@@ -359,7 +359,7 @@ def execute(self, tool, trans, incoming=None, return_job=False, set_output_hid=T | |||
# datasets first, then create the associations | |||
parent_to_child_pairs = [] | |||
child_dataset_names = set() | |||
object_store_populator = ObjectStorePopulator(app) | |||
object_store_populator = ObjectStorePopulator(app, user=trans.user) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open a new PR with these changes - https://github.com/galaxyproject/galaxy/compare/dev...jmchilton:user_objectstore_populator?expand=1. This continues a theme along with #10208 and #10212 of trying to establish Galaxy abstractions that restrict the code needed to implement this functionality just to object store, quota, and model code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure: #10231
@@ -969,7 +969,13 @@ def _populate_restricted(self, trans, user, histories, send_to_users, action, se | |||
else: | |||
# Only deal with datasets that have not been purged | |||
for hda in history.activatable_datasets: | |||
if trans.app.security_agent.can_access_dataset(send_to_user.all_roles(), hda.dataset): | |||
if len(hda.dataset.storage_media_associations) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not sufficient to prevent sharing at all I don't think. There are other paths to share datasets that don't hit this controller, library datasets should be prohibited from being such datasets, etc...
I think #10840 is what we want. It is much more general - it allows any objectstore to be marked as private - and it is much more comprehensive in how it prevents sharing. It has test cases, it prevents such datasets from even showing up where say importing history datasets into libraries, etc...
I think this portion of the PR should be dropped when that other PR is merged and instead just ensure that your user based objectstores are marked as private - I think better APIs and UIs will pretty cleanly fallout from that.
# exception(s). | ||
if state == JOB_READY and self.app.config.enable_quotas and \ | ||
(job.user is not None and | ||
(job.user.active_storage_media is None or not job.user.has_active_storage_media())): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be redone on top of #10221 - which I think abstracts out the quota checking into nice optimizable functions. I think rather than checking if has_active_storage_media we should build on the abstractions in that PR to just ask if the configured objectstore we're talking to has quota left and then we can disable quota on objectstores that use storage media. The ability to disable quota on an objectstore is included in that PR.
Went with alternate implementation #18127 |
(1) Initially started at #4314; (2) all the commits of this PR were squashed into a single commit on Dec 11, 2019, the history of the changes are preserved via this branch.
Introduction
This PR extends Galaxy's
ObjectStore
to enable users to bring-their-own-resources: users can plug a media (e.g., Amazon S3 bucket) on which Galaxy will persist their datasets.Motivations
Highlights
order
andquota
attribute to each, and Galaxy will use them based on the givenorder
and will fall from one to another if their quota limit is reached;order
attribute of storage media, users can use both instance-wide storage and their own media. For instance, they can direct Galaxy to use the instance-wide storage until their quota limit is reached (e.g., 250GB on Galaxy Main), then use their own media for the rest of their data storage needs.Hierarchical
ObjectStore; hence, it is functional only ifHierarchical
ObjectStore is configure. However, the hierarchy is applied instance-wide only, and does not affect user’s plugged media configuration;What's next?
We're aiming to keep this PR "minimally functional"; hence, features such as ability to mount a cloud-based storage and user interfaces will be implemented in subsequent PRs.
How to use
POST
a payload as the following to the/api/storage_media
(you may use Postman to send API requests):A_PATH_ON_LOCAL_DISK
; e.g.,:. └── d └── b └── 1 └── dataset_db1b29ae-524a-46c1-af8d-e3e9e6861a4e.dat