Implement a subset of the CWL Draft 3 tool format. #1

Closed
wants to merge 227 commits into from

Conversation

jmchilton

CWL Support:

  • Implemented integer, long, float, double, boolean, and File parameters, and arrays thereof, as well as ["null", <simple_type>] union parameters and Any-type parameters. More complex unions of datatypes are still unsupported (unions of two or more non-null parameters, unions of ["null", Any], etc.).
  • Draft 3 CreateFileRequirements are supported (see the test_rename test case).
  • Draft 3 InlineJavascriptRequirement is supported for defining output files (see the test_cat3 test case).
  • EnvVarRequirements are supported (see the test_env_tool1 and test_env_tool2 test cases).
  • Secondary files are supported at least partially; see the index1 and showindex1 CWL tools created to verify this, as well as the test_index1 test case.
  • Docker integration is only partial (a simple docker pull is supported), so cat3-tool.cwl works, for example. The full semantics of CWL Docker support have yet to be implemented. The remaining work is straightforward and tracked in the meta-issue More Refined Docker Support for Tools galaxyproject/galaxy#1684.
  • Expression tools are supported (see parseInt-tool test case).
  • Non-File CWL outputs are represented as expression.json files. Traditionally Galaxy hasn't supported non-File outputs from tools, but CWL Galaxy has work in progress on native Galaxy support for such outputs (Add Expression Tools to Galaxy #27).

Implementation Notes:

  • CWL secondary files are stored in a __secondary_files__ directory inside the dataset's extra_files_path directory.
  • The tool execution API has been extended with an inputs_representation parameter that can now be set to "cwl". The cwl representation for running tools corresponds to the CWL job JSON format, with {"class": "File", "path": "/path/to/file"} inputs replaced by {"src": "hda", "id": "<dataset_id>"}. Code for building these requests from CWL job JSON is available in the test class.
  • Since the CWL <-> Galaxy parameter translation may change over time (for instance, if Galaxy develops or refines parameter classes), the CWL state and CWL state version are tracked in the database. The hope is that for reruns, etc., the Galaxy state could be updated from an older version to a newer one.
  • CWL allows output parameters to be either File or non-File and determined at runtime, so galaxy.json is used to dynamically adjust output extension as needed for non-File parameters.
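
The input rewriting described for inputs_representation could look roughly like the sketch below. This is an illustration only, not the code from the test class: the function name to_galaxy_cwl_inputs and the path_to_hda_id lookup are hypothetical, and it assumes each file has already been uploaded so that its dataset id is known.

```python
def to_galaxy_cwl_inputs(cwl_job, path_to_hda_id):
    """Recursively replace CWL File objects with Galaxy HDA references."""
    if isinstance(cwl_job, dict):
        if cwl_job.get("class") == "File":
            # Hypothetical lookup: local path -> Galaxy dataset id.
            return {"src": "hda", "id": path_to_hda_id[cwl_job["path"]]}
        return {k: to_galaxy_cwl_inputs(v, path_to_hda_id) for k, v in cwl_job.items()}
    if isinstance(cwl_job, list):
        return [to_galaxy_cwl_inputs(v, path_to_hda_id) for v in cwl_job]
    # Non-File values (ints, strings, booleans, ...) pass through unchanged.
    return cwl_job


job = {"file1": {"class": "File", "path": "/path/to/file"}, "numbering": True}
request_inputs = to_galaxy_cwl_inputs(job, {"/path/to/file": "<dataset_id>"})
```

The resulting dictionary would then be sent as the tool inputs together with inputs_representation set to "cwl".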

Implementation Description:

The reference implementation Python library (mainly developed by Peter Amstutz - https://github.com/common-workflow-language/common-workflow-language/tree/master/reference) is used to load tool files ending with .json or .cwl and proxy objects are created to adapt these tools to Galaxy representations. In particular input and output descriptions are loaded from the tool.

When the tool is submitted, a specialized tool class is used to build a cwltool-compatible job description from the supplied Galaxy inputs, and the CWL reference implementation is used to generate a CWL reference implementation Job object. A command line is generated from this Job object.

As a result, Galaxy largely does not need to worry about the details of command-line adapters, expressions, etc.

Galaxy writes a description of the CWL job that it can reload to the job working directory. After the process is complete (on the Galaxy compute server, but outside the Docker container) this representation is reloaded and the dynamic outputs are discovered and moved to fixed locations as expected by Galaxy. CWL allows for much more expressive output locations than Galaxy, for better or worse, and this step uses cwltool to adapt CWL to Galaxy outputs.
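
The write, reload, and relocate cycle above can be sketched as follows. This is a simplified stand-in, not the actual implementation: the file name cwl_job.json, both function names, and the discovered mapping are assumptions made for illustration, and in the real code path the output discovery is delegated to cwltool rather than passed in.

```python
import json
import os
import shutil
import tempfile

JOB_DESCRIPTION = "cwl_job.json"  # hypothetical file name


def write_job_description(working_dir, description):
    """Persist the job description before the job runs."""
    with open(os.path.join(working_dir, JOB_DESCRIPTION), "w") as fh:
        json.dump(description, fh)


def collect_outputs(working_dir, discovered, fixed_paths):
    """Reload the description and move each output to its fixed Galaxy path."""
    with open(os.path.join(working_dir, JOB_DESCRIPTION)) as fh:
        description = json.load(fh)
    moved = {}
    for name in description["outputs"]:
        # `discovered` stands in for the locations cwltool would report.
        shutil.move(discovered[name], fixed_paths[name])
        moved[name] = fixed_paths[name]
    return moved


work = tempfile.mkdtemp()
write_job_description(work, {"outputs": ["output_file"]})
produced = os.path.join(work, "anything.txt")
with open(produced, "w") as fh:
    fh.write("hello\n")
target = os.path.join(work, "galaxy_dataset_1.dat")
moved = collect_outputs(work, {"output_file": produced}, {"output_file": target})
```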

Currently all File outputs are sniffed to determine a Galaxy datatype. CWL draft 3 allows refinement of this, which remains work to be done.

  1. CWL should support EDAM declaration of types, and Galaxy should provide a mapping to core datatypes so that sniffing can be skipped if types are found.
  2. For finer-grained control within Galaxy, extensions to CWL should allow setting actual Galaxy output types on outputs. (The distinction between fastq and fastqsanger in Galaxy is very important, for instance.)
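
One possible shape for the two proposals above is sketched here. None of this exists in the branch: the galaxy_ext field, the resolve_extension function, and the mapping table are hypothetical, and the EDAM identifiers are illustrative entries that should be checked against the EDAM ontology before use.

```python
# Illustrative EDAM format IRI -> Galaxy extension mapping (assumed entries).
EDAM_TO_GALAXY = {
    "http://edamontology.org/format_1929": "fasta",
    "http://edamontology.org/format_1930": "fastq",
    "http://edamontology.org/format_2572": "bam",
}


def resolve_extension(output, sniff):
    """Pick a Galaxy extension for a CWL File output, sniffing only as a last resort."""
    # Proposal 2: an explicit Galaxy type (hypothetical CWL extension field) wins.
    if output.get("galaxy_ext"):
        return output["galaxy_ext"]
    # Proposal 1: a declared EDAM format lets Galaxy skip sniffing.
    mapped = EDAM_TO_GALAXY.get(output.get("format"))
    if mapped:
        return mapped
    # Current behavior: fall back to sniffing the file contents.
    return sniff(output["path"])
```

An explicit type takes precedence here because EDAM alone cannot express Galaxy-specific distinctions such as fastq versus fastqsanger.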

Testing:

% git clone https://github.com/common-workflow-language/galaxy.git
% cd galaxy
% git checkout cwl
% virtualenv .venv
% . .venv/bin/activate
% pip install cwltool

Start Galaxy.

% GALAXY_RUN_WITH_TEST_TOOLS=1 run.sh --reload

Open http://localhost:8080/ and see the CWL test tools (along with all Galaxy test tools) in the left-hand tool panel.

To go a step further and actually run CWL jobs within their designated Docker containers, copy the following minimal Galaxy job configuration file to config/job_conf.xml. (Adjust the docker_sudo parameter based on how you execute Docker).

https://gist.github.com/jmchilton/3997fa471d1b4c556966

Run API tests demonstrating the various CWL demo tools with the following command.

./run_tests.sh -api test/api/test_tools_cwl.py

Issues

Work remaining on CWL support for Galaxy is tracked at https://github.com/common-workflow-language/galaxy/issues.

@jmchilton
Author

This will be rebased frequently but this standing PR should represent the current progress of the cwl branch. Any help would be appreciated, will work on building a TODO list of what needs to be done.

@jmchilton jmchilton force-pushed the cwl branch 5 times, most recently from 1315239 to 3bd8aad Compare October 22, 2015 18:00
mr-c added a commit that referenced this pull request Oct 22, 2015
Revert most of the 79-char linewrapping in objectstore
@jmchilton jmchilton force-pushed the cwl branch 2 times, most recently from 21666c9 to 6d66270 Compare November 2, 2015 22:20
jmchilton pushed a commit that referenced this pull request Nov 2, 2015
@jmchilton jmchilton force-pushed the cwl branch 3 times, most recently from c2f64ae to 22f4234 Compare November 6, 2015 04:50
@jmchilton jmchilton force-pushed the cwl branch 6 times, most recently from 714442e to 8706584 Compare November 16, 2015 01:28
@jmchilton jmchilton force-pushed the cwl branch 10 times, most recently from 6c814b2 to 9befc72 Compare December 2, 2015 03:41
jmchilton added 11 commits June 13, 2016 14:18
This special class of tools leverages the infrastructure for tool inputs, tool state tracking, tool module for workflows, tool API, etc... without actually producing command-line jobs. Instead these tools are provided the input model objects and are expected to produce output model objects directly. This provides an opportunity

The first driving use case for these tools is also included - namely tools that allow zipping and unzipping paired collections. These tools can be mapped over lists (e.g. list:paired to (list, list) or the inverse) using much of the existing infrastructure for tools. Test cases are included that validate that these work with mapping operations and in workflows.

The most obvious advantage of these versus traditional tools that do the same thing is that the data isn't copied on disk - new HDAs are created directly from the source datasets.
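
The list:paired to (list, list) mapping can be pictured with plain Python structures. This is only a sketch of the data shape, not the model-tool code: the function names are hypothetical, and dictionaries with forward and reverse keys stand in for paired collections whose elements reference existing datasets.

```python
def unzip_pairs(list_paired):
    """Split a list of pairs into (forward list, reverse list)."""
    forward = [pair["forward"] for pair in list_paired]
    reverse = [pair["reverse"] for pair in list_paired]
    return forward, reverse


def zip_pairs(forward, reverse):
    """Inverse operation: combine two equal-length lists into a list of pairs."""
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse lists must have the same length")
    return [{"forward": f, "reverse": r} for f, r in zip(forward, reverse)]


samples = [
    {"forward": "s1_R1", "reverse": "s1_R2"},
    {"forward": "s2_R1", "reverse": "s2_R2"},
]
fwd, rev = unzip_pairs(samples)
```

Only references are rearranged in this sketch, which mirrors the point that no data is copied on disk.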

Testing:

This PR includes various API test cases for functionality, these can be run with the following command:

```
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_unzip_collection
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_zip_inputs
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_zip_list_inputs
./run_tests.sh -api test/api/test_workflows.py:WorkflowsApiTestCase.test_workflow_run_zip_collections
```
This differs from a traditional tool in that its inputs don't need to be in an 'ok' state and instead of creating new datasets and duplicating data on disk, new HDAs are created from the existing datasets.
Testing:

```
./run_tests.sh -framework -id __FLATTEN__
```
The user is prompted for a JavaScript expression, which is in turn run once per dataset in a list and used as a filter. If the JavaScript evaluates to a Python truthy value, the HDA is copied into the output dataset (without duplicating the data on disk).

The JavaScript expression is supplied various HDA attributes in the environment (currently all metadata values, file_size, file_ext, and dbkey). The supplied test case filters out datasets that do not contain an even number of lines.
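
The filtering semantics can be approximated as below. To keep the sketch self-contained, a Python callable is swapped in for the user-supplied JavaScript expression; the function names, the dictionary-based HDA representation, and the data_lines metadata key are assumptions made for illustration.

```python
def hda_environment(hda):
    """Build the variables exposed to the expression for one dataset."""
    env = dict(hda.get("metadata", {}))
    env.update(file_size=hda["file_size"], file_ext=hda["file_ext"], dbkey=hda["dbkey"])
    return env


def filter_list(hdas, expression):
    """Keep the datasets for which the expression is truthy."""
    # `expression` is a Python stand-in for the JavaScript the user would enter.
    return [hda for hda in hdas if expression(hda_environment(hda))]


hdas = [
    {"name": "a", "file_size": 8, "file_ext": "txt", "dbkey": "?", "metadata": {"data_lines": 2}},
    {"name": "b", "file_size": 12, "file_ext": "txt", "dbkey": "?", "metadata": {"data_lines": 3}},
]
# Mirrors the described test case: keep datasets with an even number of lines.
kept = filter_list(hdas, lambda env: env["data_lines"] % 2 == 0)
```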

Testing:

```
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_filter_0
```
... for loading tool actions that require it.
Takes in a list dataset collection and produces a list of lists keying the outer list on a user supplied function. This reuses the JavaScript expression code used by the filter model tool.
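
A rough picture of the grouping step follows. As with the filter sketch, a Python callable stands in for the JavaScript expression, and group_list and the element dictionaries are hypothetical names used only for illustration.

```python
from collections import OrderedDict


def group_list(elements, key_function):
    """Turn a flat list into an ordered mapping of key -> inner list."""
    groups = OrderedDict()
    for element in elements:
        # Keys are stringified so they can serve as outer-list identifiers.
        groups.setdefault(str(key_function(element)), []).append(element)
    return groups


elements = [
    {"name": "a.txt", "file_ext": "txt"},
    {"name": "b.bed", "file_ext": "bed"},
    {"name": "c.txt", "file_ext": "txt"},
]
grouped = group_list(elements, lambda e: e["file_ext"])
```

Each key becomes one element of the outer list, and its value is the inner list of matching datasets.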

Testing:

```
./run_tests.sh -framework -id __GROUP__
```
 - Introduce models and an API for creating tools dynamically.
 - Use Galaxy's testing-only YAML based representation of tools to prototype this.
 - Extend Format 2 workflow definitions to allow embedding tools directly into workflows, either directly or using a CWL-style @import syntax.

Testing:

Test cases demonstrating tools can be imported (only by admins) and are runnable are included with this commit. More test cases regarding workflow use of dynamic tools and Format 2 workflow definition extensions are also included.

These tests can be run with the following commands:

```
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_nonadmin_users_cannot_create_tools
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_dynamic_tool_1
./run_tests.sh -api test/api/test_workflows.py:WorkflowsApiTestCase.test_import_export_dynamic
./run_tests.sh -api test/api/test_workflows_from_yaml.py:WorkflowsFromYamlApiTestCase.test_workflow_embed_tool
./run_tests.sh -api test/api/test_workflows_from_yaml.py:WorkflowsFromYamlApiTestCase.test_workflow_import_tool
```
- Tool definition language and plumbing and datatype for expressing expressions as jobs.
- Allow connecting expression tools to parameters in workflows, will delay evaluation of workflow so calculated value
- Example test expression tools for testing and demonstration.
- [WIP] Workflow expression module to allow users to specify arbitrary expressions.
CWL Support:
--------------

 - Implemented integer, long, float, double, boolean, and File parameters, and arrays thereof, as well as some simple unions of these parameters and Any-type parameters. More complex unions of datatypes are still unsupported.
 - Draft 3 ``CreateFileRequirement``s are supported (see the ``test_rename`` test case).
 - Draft 3 ``InlineJavascriptRequirement`` is supported for defining output files (see the ``test_cat3`` test case).
 - ``EnvVarRequirement``s are supported (see the ``test_env_tool1`` and ``test_env_tool2`` test cases).
 - Secondary files are supported at least partially; see the ``index1`` and ``showindex1`` CWL tools created to verify this, as well as the ``test_index1`` test case.
 - Docker integration is only partial (a simple docker pull is supported), so ``cat3-tool.cwl`` works, for example. The full semantics of CWL Docker support have yet to be implemented. The remaining work is straightforward and tracked in the meta-issue galaxyproject#1684.
 - Expression tools are supported (see the ``parseInt-tool`` test case).
 - Non-File CWL outputs are represented as ``expression.json`` files. Traditionally Galaxy hasn't supported non-File outputs from tools, but CWL Galaxy has work in progress on native Galaxy support for such outputs (#27).

Implementation Notes:
----------------------

 - CWL secondary files are stored in a ``__secondary_files__`` directory inside the dataset's extra_files_path directory.
 - The tool execution API has been extended with an ``inputs_representation`` parameter that can now be set to "cwl". The ``cwl`` representation for running tools corresponds to the CWL job JSON format, with {"class": "File", "path": "/path/to/file"} inputs replaced by {"src": "hda", "id": "<dataset_id>"}. Code for building these requests from CWL job JSON is available in the test class.
 - Since the CWL <-> Galaxy parameter translation may change over time (for instance, if Galaxy develops or refines parameter classes), the CWL state and CWL state version are tracked in the database. The hope is that for reruns, etc., the Galaxy state could be updated from an older version to a newer one.
 - CWL allows output parameters to be either ``File`` or non-``File`` and determined at runtime, so ``galaxy.json`` is used to dynamically adjust output extension as needed for non-``File`` parameters.

Implementation Description:
-----------------------------

The reference implementation Python library (mainly developed by Peter Amstutz - https://github.com/common-workflow-language/common-workflow-language/tree/master/reference) is used to load tool files ending with ``.json`` or ``.cwl`` and proxy objects are created to adapt these tools to Galaxy representations. In particular input and output descriptions are loaded from the tool.

When the tool is submitted, a specialized tool class is used to build a cwltool-compatible job description from the supplied Galaxy inputs, and the CWL reference implementation is used to generate a CWL reference implementation Job object. A command line is generated from this Job object.

As a result, Galaxy largely does not need to worry about the details of command-line adapters, expressions, etc.

Galaxy writes a description of the CWL job that it can reload to the job working directory. After the process is complete (on the Galaxy compute server, but outside the Docker container) this representation is reloaded and the dynamic outputs are discovered and moved to fixed locations as expected by Galaxy. CWL allows for much more expressive output locations than Galaxy, for better or worse, and this step uses cwltool to adapt CWL to Galaxy outputs.

Currently all ``File`` outputs are sniffed to determine a Galaxy datatype. CWL draft 3 allows refinement of this, which remains work to be done.

  1) CWL should support EDAM declaration of types, and Galaxy should provide a mapping to core datatypes so that sniffing can be skipped if types are found.
  2) For finer-grained control within Galaxy, extensions to CWL should allow setting actual Galaxy output types on outputs. (The distinction between fastq and fastqsanger in Galaxy is very important, for instance.)

Testing:
---------------------

    % git clone https://github.com/common-workflow-language/galaxy.git
    % cd galaxy
    % git checkout cwl
    % virtualenv .venv
    % . .venv/bin/activate
    % pip install cwltool

Start Galaxy.

    % GALAXY_RUN_WITH_TEST_TOOLS=1 run.sh --reload

Open http://localhost:8080/ and see the CWL test tools (along with all Galaxy test tools) in the left-hand tool panel.

To go a step further and actually run CWL jobs within their designated Docker containers, copy the following minimal Galaxy job configuration file to ``config/job_conf.xml``. (Adjust the ``docker_sudo`` parameter based on how you execute Docker).

https://gist.github.com/jmchilton/3997fa471d1b4c556966

Run API tests demonstrating the various CWL demo tools with the following command.

```
./run_tests.sh -api test/api/test_tools_cwl.py
```

Issues
---------------------------------

Work remaining on CWL support for Galaxy is tracked at https://github.com/common-workflow-language/galaxy/issues.

Refactor toward workflow support.
@mr-c mr-c changed the title Implementat a subset of the CWL Draft 3 tool format. Implement a subset of the CWL Draft 3 tool format. Sep 30, 2016
jmchilton pushed a commit that referenced this pull request Mar 6, 2017
Add fastq(*).bz2 datatypes and converters
@jmchilton
Author

This PR has been superseded by #47 - work will now be tracked in the cwl-1.0 branch.

@jmchilton jmchilton closed this Mar 8, 2017
jmchilton pushed a commit that referenced this pull request Jul 13, 2017
jmchilton pushed a commit that referenced this pull request Aug 14, 2017
Swap two more spots initializing TagManagers to new session
jmchilton pushed a commit that referenced this pull request Jan 14, 2018
Small javascript fixes for askomics IE integration
jmchilton pushed a commit that referenced this pull request Jan 14, 2018
jmchilton pushed a commit that referenced this pull request Apr 20, 2018
Update from Galaxyproject repo
jmchilton pushed a commit that referenced this pull request Jul 1, 2018
Need to set default HOME/TMP before env_setup_commands
jmchilton pushed a commit that referenced this pull request Mar 16, 2019
Add tool test for metadata_in_range
nsoranzo pushed a commit that referenced this pull request Dec 29, 2020
nsoranzo pushed a commit that referenced this pull request Jan 19, 2021
nsoranzo pushed a commit that referenced this pull request Apr 2, 2024
* Rename `delta` to `eps` for float-based assertions

* Fix tests
nsoranzo pushed a commit that referenced this pull request Dec 5, 2024
nsoranzo pushed a commit that referenced this pull request Dec 5, 2024
nsoranzo pushed a commit that referenced this pull request Dec 5, 2024