Hierarchical upload API optimized for folders & collections. #5220
Conversation
@jmchilton this is super cool, and such a YAML file turned out to be super useful in our training community. Looking forward to these nice enhancements. What do you think about adding permissions to this schema as well, so that I can specify the group/role during upload?
Making permissions part of the schema would make the 'bag' not portable though, since it would rely on a given instance's namespace. Or did you mean something like a 'set access to the bag to these four users on import' feature?
So a syntax for bag imports currently would look something like the sketch below, with variants for direct path imports, server_dir imports, uploads, etc. At any rate, I would imagine one would stash the roles needed for library creation in the destination part - or however we would identify the different kinds of roles and the roles themselves - so I think this could make sense. The destination is pretty independent from what is being imported. Another thing is that existing library folders can be targeted currently (it is actually how I started implementing this), so certainly this can be used to upload bags or zips or files or directories to existing folders with custom permissions as well. I have a TODO list for this PR that definitely has permissions on it - but it might be relegated to a follow-up issue unless someone feels strongly this needs to be in the first attempt.
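A rough, purely illustrative sketch of the idea being discussed - not the actual snippets from the comment. The role-related keys in particular are hypothetical (permissions stayed on the TODO list rather than landing in this PR), and the items_from/src/url keys are assumed to follow the schema used elsewhere in this thread; the names and URL are placeholders:

```python
# Hypothetical payload sketch: import a BagIt archive into a new library,
# with role assignments stashed in the destination. The "access_roles" and
# "manage_roles" keys are NOT an implemented schema - they only illustrate
# where such settings could live.
bag_import_target = {
    "destination": {
        "type": "library",
        "name": "Sequencing Project 12",
        # Hypothetical permission settings, expressed as role names or emails:
        "access_roles": ["lab-group-a"],
        "manage_roles": ["admin@example.org"],
    },
    # Treat the fetched archive as a BagIt bag and expand its payload into
    # library folder contents (assumed field names; URL is a placeholder).
    "items_from": "bagit_archive",
    "src": "url",
    "url": "https://example.org/exports/project-12-bag.zip",
}
```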
I see, so the roles here would be user emails, which can be portable to some extent? Edit: because the bag does not know about groups and custom role names/ids.
There are two things - the API and the YAML format. They correspond pretty directly, and certainly role ids or role names in the API make a lot of sense. I wouldn't really expect sample YAML files with library descriptions to be portable if they defined such roles though. Perhaps CLI parameters for setting those up, or a GUI that you can paste the generic config into and then customize.
That is what I was thinking. Load your bag definition, and then you would be able to choose instance-specific settings like permissions, parent folder, etc. before you click the big 'Import' button.
lib/galaxy/workflow/modules.py
Outdated
@@ -809,6 +809,9 @@ def execute(self, trans, progress, invocation_step, use_cached_job=False):
        invocation = invocation_step.workflow_invocation
        step = invocation_step.workflow_step
        tool = trans.app.toolbox.get_tool(step.tool_id, tool_version=step.tool_version)
        if tool.is_workflow_compatible:
if not?
Yes - thanks!
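In other words, the agreed change flips the condition from the excerpt above. The surrounding names (trans, step) come from that excerpt, and the exception in the body is only a placeholder since the remaining added lines aren't shown:

```python
# Continuation of the excerpt from lib/galaxy/workflow/modules.py above,
# with the condition negated as agreed in the review. The raised exception
# is a placeholder - the actual error handling isn't visible in the excerpt.
tool = trans.app.toolbox.get_tool(step.tool_id, tool_version=step.tool_version)
if not tool.is_workflow_compatible:
    raise Exception("Tool %s is not workflow compatible and cannot be executed "
                    "as part of a workflow invocation." % step.tool_id)
```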
@anatskiy So I hit a few problems getting your example to work, but I've come up with solutions and test cases. I was right that this broke down quite a bit when I merged #5609 back into this branch, but that breakage should all be fixed with 6252a9e. Then I noticed that Galaxy doesn't sniff CSV files consistently depending on how you are uploading - so I opened #5643 for the existing uploader (which is included in this PR) and added a fix for this new uploader. Finally, I added c695112, which has a bunch more tests for the new uploader, including CSV tests derived from your file. So

    destination:
      type: hdas
    name: Some dataset
    description: some description
    items:
      - name: Dataset 1
        src: url
        url: http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv

should now upload just fine. The one caveat I did learn is that your file has carriage returns in it, which Python's csv sniffing doesn't know how to parse by default, so this file ends up as a txt input. This isn't a problem for the regular uploader because it always converts carriage returns to POSIX newlines. This API doesn't do that by default because it is less eager to mess with your data - but it can be switched on. So to get this to sniff as CSV you can use:

    destination:
      type: hdas
    name: Some dataset
    description: some description
    items:
      - name: Dataset 1
        src: url
        url: http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv
        to_posix_lines: true

or just

    destination:
      type: hdas
    name: Some dataset
    description: some description
    items:
      - name: Dataset 1
        src: url
        url: http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv
        to_posix_lines: true
        ext: csv

Thanks again for trying it out, hope these fixes help.
@jmchilton thanks a lot! It works perfectly now! I have just one question: is it possible to upload data with just a POST request, or does such an API not exist?
@jmchilton if you can resolve the conflict please - I think we are ready to merge.
…on_util. I need to use this from upload code.
Allows describing hierarchical data in JSON or inferring structure from archives or directories. Datasets or archive sources can be specified via uploads, URLs, paths (if admin && allow_path_paste), library_import_dir/user_library_import_dir, and/or FTP imports. Unlike existing API endpoints, a mix of these on a per-file basis is allowed, and they work seamlessly between libraries and histories. Supported "archives" include gzip, zip, BagIt directories, and BagIt archives (with fetching and validation of downloads). The existing upload API endpoint is quite rough to work with, both in terms of adding parameters (e.g. the file type and dbkey handling in #4563 was difficult to implement, terribly hacky, and should seemingly have been trivial) and in terms of building requests (one needs to build a tool form - not describe sensible inputs in JSON). This API is built to be intelligible from an API standpoint instead of being constrained to the older-style tool form. Additionally, it is built with hierarchical data in mind, in a way that would not be at all easy by enhancing tool form components we don't even render. This implements #5159, though much simpler YAML descriptions of data libraries should be possible that map basically onto the API descriptions. We can replace the data library script in Ephemeris https://github.com/galaxyproject/ephemeris/blob/master/ephemeris/setup_data_libraries.py with one that converts a simple YAML file into an API call and allows many new options for free. In future PRs I'll add filtering options to this and it will serve as the backend to #4733.
In the case of data-fetch there is extra validation done, so this is somewhat important.
A concern is that they sometimes will get deleted in production with default settings - see #5361.
Trying to improve the user experience of the rule-based uploader by placing HDAs and HDCAs in the history at the outset, so the history panel can poll them and we can turn them red if the upload fails. From Marius' PR review:

> I can see that a job launched in my logs, but it failed and there were no visual indications of this in the UI

Not every HDA can be pre-created - for example, datasets read from a zip file are still created on the backend. Likewise, HDCAs that don't define a collection type up front cannot be pre-created (for instance when the type is inferred from a folder structure). Library items aren't pre-created at all in this commit. There is room to pre-create more, but I think this is an atomic commit as it stands, and it will hopefully improve the user experience for the rule-based uploader considerably.
- Remove seemingly unneeded hack in upload_common.
- Remove stray debug statement.
- Add more comments in the output collection code related to different destination types.
- Restructure if/else in data_fetch to avoid assertion with constant.
Previously, sniffing would happen on the original file (before carriage-return and space-to-tab conversions were applied) if in_place was false, and on the converted file if it was true.
@bgruening Awesome - thanks for the review. I've rebased and resolved the conflicts. @anatskiy Yeah - it is possible via the API - that publication script doesn't work that way though. There are examples in the test code - https://github.com/galaxyproject/galaxy/pull/5220/files#diff-a1e38277f26f32a21c0878b6b640accbR191. So any dictionary in the API that can support, for instance, … It doesn't yet support …
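A minimal sketch of such a POST using the requests library, assuming the new endpoint is exposed at /api/tools/fetch and that attached files are referenced with src "files" in the order they are posted - both assumptions worth checking against the test code linked above. The Galaxy URL, API key, history id, and file name are placeholders:

```python
# Sketch: upload a local file to a history with a single POST request using
# the hierarchical "fetch" payload described in this PR.
# Assumptions (check against the linked test code): the /api/tools/fetch route,
# the "files_0|file_data" attachment convention, and src "files" matching
# attached files by position.
import json
import requests

GALAXY_URL = "https://usegalaxy.example.org"  # placeholder
API_KEY = "your-api-key"                      # placeholder
HISTORY_ID = "abcdef0123456789"               # placeholder

targets = [{
    "destination": {"type": "hdas"},
    "items": [
        {"src": "files", "name": "Dataset 1", "ext": "csv", "to_posix_lines": True},
    ],
}]

payload = {
    "key": API_KEY,
    "history_id": HISTORY_ID,
    "targets": json.dumps(targets),
}

with open("SalesJan2009.csv", "rb") as fh:
    response = requests.post(
        "%s/api/tools/fetch" % GALAXY_URL,
        data=payload,
        files={"files_0|file_data": fh},
    )
response.raise_for_status()
print(response.json())
```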
Cool thanks a lot @jmchilton.
This new API endpoint allows describing hierarchical data in JSON or inferring structure from archives or directories.
Datasets or archive sources can be specified via uploads, URLs, paths (if admin && allow_path_paste), library_import_dir/user_library_import_dir, and/or FTP imports. Unlike the existing library API endpoint, a mix of these on a per file source basis is allowed.
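As a rough sketch of what that per-file-source mixing looks like in a single request target - the src values and keys follow the conventions described in this PR as I understand them, and the names, URL, and paths are placeholders:

```python
# Sketch: one target whose items mix a URL fetch, an FTP import, and a
# server-side path in the same list. Names, URL, and paths are placeholders;
# "path" requires admin && allow_path_paste, "ftp_import" requires FTP uploads
# to be configured for the user.
mixed_target = {
    "destination": {"type": "hdas"},
    "items": [
        {"src": "url", "url": "https://example.org/data/sample1.fastq.gz",
         "name": "Sample 1"},
        {"src": "ftp_import", "ftp_path": "sample2.fastq.gz",
         "name": "Sample 2"},
        {"src": "path", "path": "/data/shared/sample3.fastq.gz",
         "name": "Sample 3"},
    ],
}
```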
Supported "archives" include gzip, zip, BagIt directories, and BagIt archives (with fetching and validation of downloads).
The existing upload API endpoint is quite rough to work with, both in terms of adding parameters (e.g. the file type and dbkey handling in #4563 was difficult to implement, terribly hacky, and should seemingly have been trivial) and in terms of building requests (one needs to build a tool form - not describe sensible inputs in JSON). This API is built to be intelligible from an API standpoint instead of being constrained to the older-style tool form. Additionally, it is built with hierarchical data in mind, in a way that would not be at all easy by enhancing tool form components we don't even render.
This implements #5159 by allowing BagIt directories and BagIt archives (e.g. zip or tar.gz files containing BagIt directories) to be specified as the source of items for a library folder. A much simpler YAML description of data libraries can also be translated directly into API calls against the new endpoint. To demonstrate this I've included an example script, fetch_to_library.py, along with an example input file.
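The actual example file shipped with the PR isn't reproduced in this excerpt; as a hedged illustration of its general shape, a library-creating target with a nested folder might look roughly like this, shown as the JSON-style structure the YAML maps onto (library name, folder name, URLs, and ext values are placeholders):

```python
# Sketch: create a new library containing one dataset at the top level and a
# nested folder of reads. All names, URLs, and extensions are placeholders;
# nested folders are expressed here as items that themselves contain items.
library_target = {
    "destination": {
        "type": "library",
        "name": "Example Training Data",
        "description": "Reference data for a hypothetical tutorial",
    },
    "items": [
        {"name": "reference.fa",
         "src": "url",
         "url": "https://example.org/data/reference.fa",
         "ext": "fasta"},
        {"name": "Reads",
         "items": [
             {"name": "sample1.fastq.gz",
              "src": "url",
              "url": "https://example.org/data/sample1.fastq.gz",
              "ext": "fastqsanger.gz"},
         ]},
    ],
}
```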
This example demonstrates creating a library via this endpoint, but that destination element could be used to create an HDCA in a history or populate the contents of an existing library folder just as easily.
Once this lands in a stable release, I'll port a cleaned-up version of this example script as a replacement for, or augmentation of, the data library script in Ephemeris. Unlike the script in Ephemeris and the existing library folder upload options, this handles much more metadata and allows the creation of nested folders instead of just flat lists of contents.
In future PRs I'll add filtering options to this and it will serve as the backend to #4733.
Implements #4734.