User-based ObjectStore #4840

Closed
wants to merge 20 commits into from

VJalili
Member

@VJalili VJalili commented Oct 20, 2017

(1) Initially started at #4314; (2) all the commits of this PR were squashed into a single commit on Dec 11, 2019; the history of the changes is preserved via this branch.

Introduction

This PR extends Galaxy's ObjectStore to let users bring their own resources: users can plug in a storage media (e.g., an Amazon S3 bucket) on which Galaxy will persist their datasets.

Motivations

  • unlimited storage: users on Galaxy instances with limited storage resources (e.g., a storage quota) can potentially have unlimited storage by plugging their own (cloud-based) storage into Galaxy;
  • data sharing: having datasets generated by Galaxy stored on a user’s cloud-based storage makes it easier to share analysis results with collaborators;
  • flexible persistence location: members of different labs using a common Galaxy instance hosted at their institute can have their data stored on their lab’s network-attached storage (NAS).

Highlights

  • For users without a plugged storage media, Galaxy will continue to use the instance-wide configuration for their data storage needs;
  • A user's storage media (e.g., an S3 bucket) will be used for that user's data storage needs only, and will not be accessed for other users' storage needs;
  • A storage media can be a local path or an Amazon S3 bucket;
  • Users can plug multiple media (e.g., two different local paths and three Amazon S3 buckets) and assign an order and a quota attribute to each; Galaxy will use them in the given order and will fall back from one to the next when its quota limit is reached;
  • Leveraging the order attribute of storage media, users can use both instance-wide storage and their own media. For instance, they can direct Galaxy to use the instance-wide storage until their quota limit is reached (e.g., 250GB on Galaxy Main), then use their own media for the rest of their data storage needs;
  • Storage media are defined using Galaxy’s cloud authorization model; hence, Galaxy does not ask for users’ credentials;
  • This PR implements all the necessary models, managers, functions, and APIs; there will be a separate PR for the UI;
  • The functionality builds on the Hierarchical ObjectStore; hence, it works only if the Hierarchical ObjectStore is configured. However, the hierarchy is applied instance-wide only and does not affect a user’s plugged media configuration;
  • Each storage media has its own staging path (mainly used for the S3 backend), independent of the instance-wide ObjectStore and of other storage media; an admin can define a default staging path.
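
The order-and-quota fallback described in the highlights can be sketched as follows. This is a minimal illustration, not the code in this PR: `StorageMedia` and `pick_media` are hypothetical names, and the sketch assumes that a lower `order` value is tried first and that returning `None` means "fall back to the instance-wide ObjectStore".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StorageMedia:
    """Hypothetical stand-in for one user-plugged media (not Galaxy's model)."""
    name: str
    order: int          # position in the user's preference list (assumed: lower first)
    quota: float        # maximum amount of data this media may hold
    usage: float = 0.0  # amount of data already stored on it

    def has_room_for(self, size: float) -> bool:
        return self.usage + size <= self.quota


def pick_media(media: List[StorageMedia], size: float) -> Optional[StorageMedia]:
    """Return the first media, by order, that can still hold `size`.

    Returns None when every media is full; the caller would then use the
    instance-wide storage (an assumption of this sketch).
    """
    for candidate in sorted(media, key=lambda m: m.order):
        if candidate.has_room_for(size):
            return candidate
    return None
```

For example, with a nearly full local path at order 1 and an S3 bucket at order 2, a small dataset still lands on the local path, while a larger one falls through to the bucket.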

What's next?

We're aiming to keep this PR "minimally functional"; hence, features such as the ability to mount cloud-based storage and user interfaces will be implemented in subsequent PRs.

How to use

  1. Configure the objectstore to use the hierarchical backend; e.g.:
<?xml version="1.0"?>
<object_store type="hierarchical">
    <backends>
        <object_store type="distributed" id="primary" order="0">
            <backends>
                <backend id="files1" type="disk" weight="1">
                    <files_dir path="database/files1"/>
                    <extra_dir type="temp" path="database/tmp1"/>
                    <extra_dir type="job_work" path="database/job_working_directory1"/>
                </backend>
                <backend id="files2" type="disk" weight="1">
                    <files_dir path="database/files2"/>
                    <extra_dir type="temp" path="database/tmp2"/>
                    <extra_dir type="job_work" path="database/job_working_directory2"/>
                </backend>
            </backends>
        </object_store>
        <object_store type="disk" id="secondary" order="1">
            <files_dir path="database/files3"/>
            <extra_dir type="temp" path="database/tmp3"/>
            <extra_dir type="job_work" path="database/job_working_directory3"/>
        </object_store>
    </backends>
</object_store>
  2. Log in and get your API key;
  3. POST a payload such as the following to /api/storage_media (you may use Postman to send API requests):
{
    "category": "local",
    "path": "A_PATH_ON_LOCAL_DISK",
    "order": "1",
    "quota": "1000.0",
    "usage": "0.0"
}
  4. Any dataset you then create will be stored under A_PATH_ON_LOCAL_DISK; e.g.:
.
└── d
    └── b
        └── 1
            └── dataset_db1b29ae-524a-46c1-af8d-e3e9e6861a4e.dat
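
As an alternative to Postman, the POST step above could be scripted. The sketch below only builds the request with Python's standard library; `build_storage_media_request` is a hypothetical helper, and passing the API key in an `x-api-key` header is an assumption about how the Galaxy instance authenticates API calls.

```python
import json
import urllib.request


def build_storage_media_request(galaxy_url: str, api_key: str, path: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/storage_media."""
    # Same fields as the example payload in the steps above.
    payload = {
        "category": "local",
        "path": path,
        "order": "1",
        "quota": "1000.0",
        "usage": "0.0",
    }
    return urllib.request.Request(
        url=galaxy_url.rstrip("/") + "/api/storage_media",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the instance accepts the API key via this header.
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(build_storage_media_request(url, key, path))` against a running instance.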

@jgoecks
Contributor

jgoecks commented Oct 24, 2017

I started a branch with fixes here: https://github.com/jgoecks/galaxy/tree/UserBasedObjectStore2

Specifically, there are fixes for anonymous access. I can't seem to find your fork to initiate a pull request however—perhaps because your repo is restricted somehow and/or is so far behind the main repo?

@VJalili
Member Author

VJalili commented Oct 24, 2017

Thanks for the updates, @jgoecks. Please see if you can make a PR against this branch; if not, I can update this branch. Besides, I guess we could avoid your last commit.

@jgoecks
Contributor

jgoecks commented Oct 24, 2017

@VJalili I still cannot find your fork to make a PR against. I'll try to look into this more soon.

@VJalili
Member Author

VJalili commented Oct 24, 2017

@jgoecks I applied the changes you made on your branch on this branch.

@VJalili VJalili changed the title from "[WIP] User-based object store" to "[WIP] User-based ObjectStore" Nov 3, 2017
@VJalili VJalili mentioned this pull request Dec 7, 2017
@qiagu
Contributor

qiagu commented Mar 23, 2018

It would be nice to have user-based storage. @VJalili I wonder whether you use SQL tables to manage users and their corresponding storage. I haven't looked deeply into this project yet, but my first instinct would be to build a table on top of the current storage management system.

@dannon
Member

dannon commented Mar 5, 2019

@VJalili I opened a PR that I think will fix tests for this PR. It's actually, I think, an issue we have always had and it was just never surfaced until this PR.

VJalili#6

@VJalili
Member Author

VJalili commented Mar 5, 2019

@dannon Thank you! I guess that has fixed it as all tests passed locally.

@VJalili
Member Author

VJalili commented Mar 5, 2019

@dannon I think the patch works fine for integration tests, but it breaks CI unit tests.

@dannon
Member

dannon commented Mar 5, 2019

@VJalili Ahh, sure enough. I was laser focused on that one issue, let me dig deeper since there's more to the picture here.

Yeah, the error here has popped up again:
Parent instance <HistoryDatasetAssociation at 0x7fe8e45b0250> is not bound to a Session; lazy load operation of attribute 'history' cannot proceed

I'll try to figure out how we're getting an hda handle that's no longer bound.

@VJalili
Member Author

VJalili commented Mar 5, 2019

The orphan HDA handle is the issue causing the integration test's failure; I guess that is happening when Galaxy is writing metadata to a file.

@VJalili VJalili changed the title from "[WIP] User-based ObjectStore" to "User-based ObjectStore" Mar 12, 2019
@galaxybot galaxybot added this to the 19.09 milestone Jul 29, 2019
@jmchilton
Member

Thanks for refactoring the concept of ownership out of the dataset instance level (HDA/LDDA) and for the integration tests. These are serious improvements I believe.

Can you add an integration test of copying data on storage media between users? I assume based on the reading if a user copies my data and then I delete the storage media - the data will disappear for the user but I want that verified and stated explicitly with a test case. Is that fair?

@VJalili
Member Author

VJalili commented Jul 29, 2019

@jmchilton given the challenges that using this feature for shared data may introduce (e.g., authorization issues), we previously decided to postpone support for shared data. Do you think we should add some warnings for users who attempt to use this feature for shared data?

@jmchilton
Member

Do you think we should add some warnings for users who attempt to use this feature for shared data?

Yes, ideally. I'm not sure yet if that should be required for this PR but that is a good idea in general if we're going to impose that restriction.

@VJalili
Member Author

VJalili commented Aug 5, 2019

@jmchilton I disabled sharing for user storage media (a history that contains a dataset stored on user-owned storage cannot be shared); please see a272454. Any other thoughts?

@@ -1402,6 +1429,7 @@ def _set_object_store_ids(self, job):
         # afterward. State below needs to happen the same way.
         for dataset_assoc in job.output_datasets + job.output_library_datasets:
             dataset = dataset_assoc.dataset
+            self.__assign_media(job, dataset.dataset)
Member

I've spent months of my life trying to optimize this process of initializing the output datasets. Can we have some property on app that we can check to see if this method would ever do anything - and skip it if there is no possibility of assigning media?

Member Author

Sure! I added a config property; if it is disabled, this method will not do anything. Would that address your concerns?
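
The guard discussed in this review thread could look roughly like the sketch below. It is an illustration only: `OutputInitializer` and the `enable_user_media` flag are hypothetical names, since the actual config option is not named in this conversation.

```python
class OutputInitializer:
    """Hypothetical stand-in for the job code under review."""

    def __init__(self, enable_user_media: bool):
        # Assumed config flag; the real option name is not given in this thread.
        self.enable_user_media = enable_user_media
        self.media_assignments = 0

    def _assign_media(self, dataset) -> None:
        # Placeholder for the per-dataset work the reviewer wants to avoid.
        self.media_assignments += 1

    def set_object_store_ids(self, output_datasets) -> None:
        # Read the flag once, outside the hot loop, so instances that never
        # use storage media pay no per-dataset cost.
        assign = self.enable_user_media
        for dataset in output_datasets:
            if assign:
                self._assign_media(dataset)
```

With the flag off, the loop over output datasets never calls the media-assignment code at all.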

@mvdbeek mvdbeek removed this from the 20.09 milestone Sep 8, 2020
@galaxybot galaxybot added this to the 20.09 milestone Sep 8, 2020
@jmchilton
Member

One sticking point at a time - I think User._calculate_or_set_disk_usage is going to be a problem here right? Like as soon as the user's disk usage is recalculated - all the dataset usage for all the attached disks is going to be added to the user's quota even though you very carefully prevented it from being initially added.

I'm trying to work on this in the context of creating like scratch storage object stores - I think what we need is more abstractions around quota calculation that ties them closer to object stores and is extensible for applications like this. I'll see if I can come up with something.
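
The concern above is that incremental accounting and full recalculation must apply the same exclusion rule. A minimal sketch of that invariant, with hypothetical names (`DatasetRecord`, `counts_toward_quota`) that are not Galaxy's model or API:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DatasetRecord:
    """Hypothetical dataset summary; not Galaxy's Dataset model."""
    size: int
    on_user_media: bool = False


def counts_toward_quota(dataset: DatasetRecord) -> bool:
    # Single shared rule: data on a user's own media is not charged
    # against the instance quota.
    return not dataset.on_user_media


def usage_after_adding(current_usage: int, dataset: DatasetRecord) -> int:
    """Incremental path, used when a dataset is created."""
    return current_usage + dataset.size if counts_toward_quota(dataset) else current_usage


def recalculate_usage(datasets: Iterable[DatasetRecord]) -> int:
    """Full recalculation path; must apply the same rule as the incremental path."""
    return sum(d.size for d in datasets if counts_toward_quota(d))
```

If the recalculation path summed every dataset instead, the two paths would disagree as soon as usage was recomputed, which is the failure mode described in the comment.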

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Sep 10, 2020
Not used in Galaxy core yet, but useful for applications where you want object store selection to be based on user in some way. This code was taken from galaxyproject#4840. Part of this is trying to reduce the number of files that branch touches to make review easier - but I'm confident this extension point is good regardless. Also it makes it clear we need to keep the user object in the picture when assigning the object store ID in the future.
@@ -1057,7 +1063,12 @@ def purge_deleted_datasets(self, trans):
             if not hda.deleted or hda.purged:
                 continue
             if trans.user:
-                trans.user.adjust_total_disk_usage(-hda.quota_amount(trans.user))
+                if not hda.dataset.has_active_storage_media():
+                    trans.user.adjust_total_disk_usage(-hda.quota_amount(trans.user))
Member

I think we can get around duplicating this code in the controllers with #10208.

Member

That other PR has been merged so I think this should be rebased now along with the fix for https://github.com/galaxyproject/galaxy/pull/4840/files#r486551671.

-        """Sets and gets the size of the data on disk"""
-        return self.dataset.set_size(**kwds)
+        """Sets the size of the data on disk"""
+        self.dataset.set_size(**kwds)
Member

Is this a broken rebase or is it important to drop the return here?

@@ -359,7 +359,7 @@ def execute(self, tool, trans, incoming=None, return_job=False, set_output_hid=T
         # datasets first, then create the associations
         parent_to_child_pairs = []
         child_dataset_names = set()
-        object_store_populator = ObjectStorePopulator(app)
+        object_store_populator = ObjectStorePopulator(app, user=trans.user)
Member

Can you open a new PR with these changes - https://github.com/galaxyproject/galaxy/compare/dev...jmchilton:user_objectstore_populator?expand=1. This continues a theme along with #10208 and #10212 of trying to establish Galaxy abstractions that restrict the code needed to implement this functionality just to object store, quota, and model code.

Member Author

@VJalili VJalili Sep 15, 2020

Sure: #10231

@@ -969,7 +969,13 @@ def _populate_restricted(self, trans, user, histories, send_to_users, action, se
else:
# Only deal with datasets that have not been purged
for hda in history.activatable_datasets:
if trans.app.security_agent.can_access_dataset(send_to_user.all_roles(), hda.dataset):
if len(hda.dataset.storage_media_associations) > 0:
Member

@jmchilton jmchilton Dec 15, 2020

This is not sufficient to prevent sharing at all I don't think. There are other paths to share datasets that don't hit this controller, library datasets should be prohibited from being such datasets, etc...

I think #10840 is what we want. It is much more general - it allows any objectstore to be marked as private - and it is much more comprehensive in how it prevents sharing. It has test cases, it prevents such datasets from even showing up when, say, importing history datasets into libraries, etc...

I think this portion of the PR should be dropped when that other PR is merged and instead just ensure that your user based objectstores are marked as private - I think better APIs and UIs will pretty cleanly fall out from that.

# exception(s).
if state == JOB_READY and self.app.config.enable_quotas and \
(job.user is not None and
(job.user.active_storage_media is None or not job.user.has_active_storage_media())):
Member

I think this should be redone on top of #10221 - which I think abstracts out the quota checking into nice optimizable functions. I think rather than checking if has_active_storage_media we should build on the abstractions in that PR to just ask if the configured objectstore we're talking to has quota left and then we can disable quota on objectstores that use storage media. The ability to disable quota on an objectstore is included in that PR.

@jmchilton
Member

Went with alternate implementation #18127

@jmchilton jmchilton closed this Jun 21, 2024