Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hierarchical upload API optimized for folders & collections. #5220

Merged
merged 19 commits into from
Mar 9, 2018

Conversation

jmchilton
Copy link
Member

@jmchilton jmchilton commented Dec 14, 2017

This new API endpoint allows describing hierarchical data in JSON or inferring structure from archives or directories.

Datasets or archive sources can be specified via uploads, URLs, paths (if admin && allow_path_paste), library_import_dir/user_library_import_dir, and/or FTP imports. Unlike the existing library API endpoint, a mix of these on a per file source basis is allowed.

Supported "archives" include gzip, zip, bagit directories, bagit achives (with fetching and validations of downloads).

The existing upload API endpoint is quite rough to work with both in terms of adding parameters (e.g. the file type and dbkey handling in #4563 was difficult to implement, terribly hacky, and should seemingly have been trivial) and in terms of building requests (one needs to build a tool form - not describe sensible inputs in JSON). This API is built to be intelligible from an API standpoint instead of being constrained to the older style tool form. Additionally it built with hierarchical data in mind in a way that would not be easy at all enhancing the tool form components we don't even render.

This implements #5159 by allowing BagIt directories and BagIt archives (e.g. zip or tar.gz files containing BagIt direcotries) to specified as the source of items for a library folder. though a much simpler YAML description of data libraries can be translated directly into API calls into the new endpoint. To demonstrate this I've included an example script fetch_to_library.py that implements this. The example input file included demonstrates this:

destination:
  type: library
  name: Training Material
  description: Data for selected tutorials from https://training.galaxyproject.org.
items:
  - name: Quality Control
    description: |
      Data for sequence quality control tutorial at http://galaxyproject.github.io/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html.
      10.5281/zenodo.61771
    items:
      - src: url
        url: https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_1.fastq
        name: GSM461178_untreat_paired_subset_1
        ext: fastqsanger
        info: Untreated subseq of GSM461178 from 10.1186/s12864-017-3692-8
      - src: url
        url: https://zenodo.org/record/61771/files/GSM461182_untreat_single_subset.fastq
        name: GSM461182_untreat_single_subset
        ext: fastqsanger
        info: Untreated subseq of GSM461182 from 10.1186/s12864-017-3692-8
  - name: Small RNA-Seq
    description: |
      Data for small RNA-seq tutorial available at http://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/srna/tutorial.html
      10.5281/zenodo.826906
    items:
      - src: url
        url: https://zenodo.org/record/826906/files/Symp_RNAi_sRNA-seq_rep1_downsampled.fastqsanger.gz
        name: Symp RNAi sRNA Rep1
        ext: fastqsanger.gz
        info: Downsample rep1 from 10.1186/s12864-017-3692-8
      - src: url
        url: https://zenodo.org/record/826906/files/Symp_RNAi_sRNA-seq_rep2_downsampled.fastqsanger.gz
        name: Symp RNAi sRNA Rep2
        ext: fastqsanger.gz
        info: Downsample rep2 from 10.1186/s12864-017-3692-8
      - src: url
        url: https://zenodo.org/record/826906/files/Symp_RNAi_sRNA-seq_rep3_downsampled.fastqsanger.gz
        name: Symp RNAi sRNA Rep3
        ext: fastqsanger.gz
        info: Downsample rep3 from 10.1186/s12864-017-3692-8

This example demonstrates creating a library via this endpoint, but that destination element could be used to create an HDCA in a history or populate the contents of an existing library folder just as easily.

Once this lands in a stable release - I'll port a cleaned up version of this example script as a replacement or augmentation to the data library script in Ephemeris. Unlike the script in Ephemeris and the existing library folder upload options - this handles much more metadata and allows the creation of nested folders instead of just flat lists of contents.

In future PRs I'll add filtering options to this and it will serve as the backend to #4733.

Implements #4734.

@bgruening
Copy link
Member

@jmchilton this is super cool and such a yaml file turned out to be super useful in our training community. Looking forward to these nice enhancements.

What do you think about adding permissions also to this schema, so that I can specify during upload the group/role?

@jmchilton jmchilton force-pushed the upload_2.0 branch 3 times, most recently from ba2f029 to 24d15bb Compare December 20, 2017 14:23
@martenson
Copy link
Member

What do you think about adding permissions also to this schema, so that I can specify during upload the group/role?

Making permissions part of the schema would make the 'bag' not portable though since it would relay on given instance's namespace. Or did you mean something like 'set access to the bag to these four users on import' feature?

@jmchilton
Copy link
Member Author

jmchilton commented Dec 20, 2017

So a syntax for bag imports currently would be:

destination:
  type: library
  name: Training Material
  description: Data for selected tutorials from https://training.galaxyproject.org.
items_from: bagit_archive
src: url
url: http://example.org/coolbag.zip

or

destination:
  type: library
  name: Training Material
  description: Data for selected tutorials from https://training.galaxyproject.org.
items_from: bagit_archive
src: ftp_import
ftp_path: relative_ftp_path/to/coolbag.zip

or some other stuff for direct path imports, server_dir imports, uploads, etc... At any rate I would imagine one would stash the roles needed for library creation in the destination part:

destination:
  type: library
  name: Training Material
  description: Data for selected tutorials from https://training.galaxyproject.org.
  roles:
    admin:
      - name_of_role
    read: 
      - name_or_ro_role
items_from: bagit_archive
src: ftp_import
ftp_path: relative_ftp_path/to/coolbag.zip

Or however we would identify the different kinds of roles and the roles themselves - so I think this could make sense. The destination is pretty independent from what is being imported. Another thing is that existing library folders can be targeted currently (it is actually how I started implementing this).

destination:
  type: library_folder
  library_folder_id: folder_id
items_from: bagit_archive
src: ftp_import
ftp_path: relative_ftp_path/to/coolbag.zip

so certainly this can be used to upload bags or zips or files or directories to existing folders with custom permissions as well.

I have a TODO list for this PR that definitely has permissions on it - but it might be relegated to a follow up issue unless someone feels strongly this needs to be in the first attempt.

@martenson
Copy link
Member

martenson commented Dec 20, 2017

I see, so the roles here would be user emails which can be portable to some extent?

edit: because the bag does not know about groups and custom roles names/ids

@jmchilton
Copy link
Member Author

I see, so the roles here would be user emails which can be portable to some extent?

There are two things - the API and the the YAML format. They corresponding pretty directly and certainly role ids or role names in the API make a lot of sense. I wouldn't really expect for instance sample YAML files with library descriptions to be portable if they defined such roles though. Perhaps CLI parameters for settings those up or a GUI that you can paste the generic config into and then customize.

@martenson
Copy link
Member

GUI that you can paste the generic config into and then customize.

That is what I was thinking. Load your bag definition and then you would have an ability to choose instance specific settings like permissions, parent folder etc. before you click the big 'Import' button.

@martenson martenson added this to the 18.05 milestone Jan 2, 2018
@jmchilton jmchilton force-pushed the upload_2.0 branch 3 times, most recently from 2d693b7 to 73ff3df Compare January 6, 2018 19:04
@@ -809,6 +809,9 @@ def execute(self, trans, progress, invocation_step, use_cached_job=False):
invocation = invocation_step.workflow_invocation
step = invocation_step.workflow_step
tool = trans.app.toolbox.get_tool(step.tool_id, tool_version=step.tool_version)
if tool.is_workflow_compatible:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - thanks!

@jmchilton jmchilton force-pushed the upload_2.0 branch 2 times, most recently from f056df4 to b48841b Compare January 9, 2018 20:24
@jmchilton
Copy link
Member Author

jmchilton commented Mar 7, 2018

@anatskiy So I hit a few problems getting your example to work but I've come up with solutions and test cases:

I was right that this broke down quite a bit when I merged #5609 back into this branch, but that breakage should all be fixed with 6252a9e.

Then I noticed that Galaxy doesn't consistently sniff CSV files in the same way depending on how you are uploading - so I opened #5643 for the existing uploader (which is included in this PR) and added a fix for this new uploader.

Finally I added in c695112 which has a bunch more tests for the new uploader including CSV tests derived from your file.

So

destination:
  type: hdas
  name: Some dataset
  description: some description
items:
  - name: Dataset 1
    src: url 
    url: http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv

should now upload just fine. The one caveat I did learn is that your file has carriage returns in it - that Python doesn't by default know how to parse when sniffing as a csv, so this file ends up as a txt input. This isn't a problem for the regular uploader because it always converts carriage returns to posix newlines. This API doesn't do that by default because it is less eager to mess with your data - but it can be switched on. So to get this to sniff as CSV you can use:

destination:
  type: hdas
  name: Some dataset
  description: some description
items:
  - name: Dataset 1
    src: url 
    url: http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv
    to_posix_lines: true

or just

destination:
  type: hdas
  name: Some dataset
  description: some description
items:
  - name: Dataset 1
    src: url 
    url: http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv
    to_posix_lines: true
    ext: csv

Thanks again for trying it out, hope these fixes help.

@anatskiy
Copy link
Contributor

anatskiy commented Mar 8, 2018

@jmchilton thanks a lot! It works perfectly now!

I have just one question. Is it possible to upload data with just a POST request? Or such API doesn't exist?

@bgruening
Copy link
Member

@jmchilton if you can resolve the conflict please I think we are ready to merge.

jmchilton added 19 commits March 8, 2018 10:32
…on_util.

I need to use this from upload code.
Allows describing hierarchical data in JSON or inferring structure from archives or directories.

Datasets or archive sources can be specified via uploads, URLs, paths (if admin && allow_path_paste), library_import_dir/user_library_import_dir, and/or FTP imports. Unlike existing API endpoints, a mix of these on a per file basis is allowed and they work seemlessly between libraries and histories.

Supported "archives" include gzip, zip, bagit directories, bagit achives (with fetching and validations of downloads).

The existing upload API endpoint is quite rough to work with both in terms of adding parameters (e.g. the file type and dbkey hanlding in 4563 was difficult to implement, terribly hacky, and should seemingly have been trivial) and in terms of building requests (one needs to build a tool form - not describe sensible inputs in JSON). This API is built to be intelligable from an API standpoint instead of being constrained to the older style tool form. Additionally it built with hierarchical data in mind in a way that would not be easy at all enhancing the tool form components we don't even render.

This implements 5159 though much simpler YAML descriptions of data libraries should be possible basically as the API descriptions. We can replace the data library script in Ephemeris https://github.com/galaxyproject/ephemeris/blob/master/ephemeris/setup_data_libraries.py with one that converts a simple YAML file into an API call and allows many new options for free.

In future PRs I'll add filtering options to this and it will serve as the backend to 4733.
In the case of data-fetch there is extra validation that is done so this is somewhat important.
Concerning that they sometimes will get deleted in production with default settings - see 5361.
Trying to improve the user experience of this rule based uploader by placing HDAs and HDCAs in the history at the outset that the history panel can poll and that we can turn red if the upload fails.

From Marius' PR review:

> I can see that a job launched in my logs, but it failed and there were no visual indications of this in the UI

Not every HDA for instance can be created, for example if reading them from a zip file for instance that happens on the backend still. Likewise if HDCAs don't define a collection type up front they cannot be pre-created (if for instance that is inferred from a folder structure). Library things aren't precreated at all in this commit. There is room to pre-create more but I think this is an atomic commit as it is now and it will hopefully improve the user experience for the rule based uploader considerably.
- Remove seemingly unneeded hack in upload_common.
- Remove stray debug statement.
- Add more comments in the output collection code related to different destination types.
- Restructure if/else in data_fetch to avoid assertion with constant.
Previously sniffing would happen on the original file (before carriage returns and tabular spaces were converted) if in_place was false and on the converted file if it was true.
@jmchilton
Copy link
Member Author

@bgruening Awesome - thanks for the review. I've rebased and resolved the conflicts.

@anatskiy Yeah - it is possible via the API - that publication script doesn't work that way though. There are examples in the test code - https://github.com/galaxyproject/galaxy/pull/5220/files#diff-a1e38277f26f32a21c0878b6b640accbR191.

So any dictionary in the API that can support for instance {"src": "path", "path": "/path/to"} or {"src": "url", "url": "http://example.com"} can also support {"src": "files"}. This means treat the multi-part upload parameters files_0, files_1, etc... as an iterator and grab "the next one". So the number of instances of {"src": "files"} appearing should match the number of files attached via multi-part posts.

It doesn't yet support {"src": "paste"} - so you can't put the content directly in a simple non-multi-part POST I think. It will need that at some point to serve as the main upload GUI backend, but it also doesn't yet tackle composite files which will also be needed someday. I think the structure of the JSON though will allow both of those more naturally than the older endpoint.

@bgruening
Copy link
Member

Cool thanks a lot @jmchilton.
ping @sneumann and @pcm32 this will make you happy I suppose :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants