Merge pull request #88 from ELIXIR-Belgium/dev

Adding ISA-JSON support as input file.
usegalaxy-eu · Dec 18, 2023 · cd3e8a9
2 parents d540934 + abf7b73
commit cd3e8a9

Showing 42 changed files with 16,576 additions and 66 deletions.
diff --git a/.gitignore b/.gitignore
@@ -2,4 +2,4 @@
 .secret.yml
 build/
 ena_upload_cli.egg-info/
-ena_upload/__pycache__/
+__pycache__/
diff --git a/README.md b/README.md
@@ -7,18 +7,18 @@
 
 # ENA upload tool
 
-This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the excel spreadsheets that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/) and can be added to you own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).
+This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the Excel spreadsheets that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programmatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client-side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/) and can be added to your own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).
 
 ## Overview
 
-The metadata should be provided in separate tables corresponding to the following ENA objects:
+The metadata should be provided in separate tables or files carrying similar information corresponding to the following ENA objects:
 
 * STUDY
 * SAMPLE
 * EXPERIMENT
 * RUN
 
-The program to perform the following actions:
+You can set the tool to perform the following actions:
 
 * add: add an object to the archive
 * modify: modify an object in the archive
@@ -29,11 +29,15 @@ After a successful submission, new tsv tables will be generated with the ENA acc
 
 ## Tool dependencies
 
-* python 3.5+ including following packages:
+* Python 3.7+ including the following packages:
   * Genshi
   * lxml
   * pandas
   * requests
+  * pyyaml
+  * openpyxl
+  * jsonschema
+
 
 ## Installation
 
@@ -60,12 +64,14 @@ All supported arguments:
   --experiment EXPERIMENT
                         table of EXPERIMENT object
   --run RUN             table of RUN object
-  --data [FILE [FILE ...]]
-                        data for submission
+  --data [FILE ...]     data for submission
   --center CENTER_NAME  specific to your Webin account
   --checklist CHECKLIST
                         specify the sample checklist with following pattern: ERC0000XX, Default: ERC000011
   --xlsx XLSX           filled in excel template with metadata
+  --isa_json ISA_JSON   BETA: ISA json describing the ENA objects
+  --isa_assay_stream ISA_ASSAY_STREAM
+                        BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams
   --auto_action         BETA: detect automatically which action (add or modify) to apply when the action column is not given
   --tool TOOL_NAME      specify the name of the tool this submission is done with. Default: ena-upload-cli
   --tool_version TOOL_VERSION
@@ -88,7 +94,7 @@ To avoid exposing your credentials through the terminal history, it is recommend
 
 ### ENA sample checklists
 
-You can specify ENA sample checklist using the `--checklist` parameter. By default the ENA default sample checklist is used supporting the minimum information required for the sample (ERC000011). The supported checklists are listed on the [ENA website](https://www.ebi.ac.uk/ena/browser/checklists). This website will also describe which Field Names you have to use in the header of your sample tsv table. The Field Names will be automatically mapped in the outputted xml if the correct `--checklist` parameter is given.
+You can specify the ENA sample checklist using the `--checklist` parameter. By default, the ENA default sample checklist (ERC000011) is used, which covers the minimum information required for a sample. The supported checklists are listed on our [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates).
 
 #### Fixed sample columns
 
@@ -104,55 +110,11 @@ The command line tool will automatically fetch the correct scientific name based
 
 #### Viral submissions
 
-If you want to submit viral samples you can use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist by adding `ERC000033` to the checklist parameter. Check out our [viral example command](#test-the-tool) as demonstration. Please use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist on the website of ENA to know which values are allowed/possible in the `restricted text` and `text choice` fields.
+If you want to submit viral samples you can use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist by passing `ERC000033` to the `--checklist` parameter. Check out our [viral example command](#test-the-tool) as a demonstration. Please use the [ENA virus pathogen](https://github.com/ELIXIR-Belgium/ENA-metadata-templates/tree/main/templates/ERC000033) checklist in our template repo to know what is allowed/possible in the `Controlled vocabulary` fields.
 
 ### ENA study, experiment and run tables
 
-Here we list all the possible columns one can have in its study, experiment or run table along with its cardinality and controlled vocabulary (CV).
-Currently we refer to the [ENA Webin](https://wwwdev.ebi.ac.uk/ena/submit/webin/) to discover which values are allowed when a controlled vocabulary is used, but this will change in the future.
-
-#### Study tsv table
-
-| Name of column | Cardinality | Documentation | CV |
-|---|---|---|---|
-| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. |  |
-| title | mandatory | Title of the study as would be used in a publication. |  |
-| study_type | mandatory | The STUDY_TYPE presents a controlled vocabulary for expressing the overall purpose of the study. | yes |
-| study_abstract | mandatory | Briefly describes the goals, purpose, and scope of the Study.  This need not be listed if it can be inherited from a referenced publication. |  |
-| center_project_name | optional | Submitter defined project name.  This field is intended for backward tracking of the study record to the submitter's LIMS. |  |
-| study_description | optional | More extensive free-form description of the study. |  |
-| pubmed_id | optional | Link to publication related to this study. |  |
-
-#### Experiment tsv table
-
-| Name of column | Cardinality | Documentation | CV |
-|---|---|---|---|
-| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. |  |
-| title | mandatory | Short text that can be used to call out experiment records in searches or in displays. |  |
-| study_alias | mandatory | Identifies the parent study. |  |
-| sample_alias | mandatory | Pick a sample to associate this experiment with. The sample may be an individual or a pool, depending on how it is specified. |  |
-| design_description | mandatory | Goal and setup of the individual library including library was constructed. |  |
-| spot_descriptor | optional | The SPOT_DESCRIPTOR specifies how to decode the individual reads of interest from the monolithic spot sequence. The spot descriptor contains aspects of the experimental design, platform, and processing information. There will be two methods of specification: one will be an index into a table of typical decodings, the other being an exact specification. This construct is needed for loading data and for interpreting the loaded runs. It can be omitted if the loader can infer read layout (from multiple input files or from one input files). |  |
-| library_name | optional | The submitter's name for this library. |  |
-| library_layout | mandatory | LIBRARY_LAYOUT specifies whether to expect single, paired, or other configuration of reads. In the case of paired reads, information about the relative distance and orientation is specified. | yes |
-| insert_size | mandatory | Relative distance. |  |
-| library_strategy | mandatory | Sequencing technique intended for this library | yes |
-| library_source | mandatory | The LIBRARY_SOURCE specifies the type of source material that is being sequenced. | yes |
-| library_selection | mandatory | Method used to enrich the target in the sequence library preparation | yes |
-| platform | mandatory | The PLATFORM record selects which sequencing platform and platform-specific runtime parameters. This will be determined by the Center. | yes |
-| instrument_model | mandatory | Model of the sequencing instrument. | yes |
-| library_construction_protocol | optional | Free form text describing the protocol by which the sequencing library was constructed. |  |
-
-
-#### Run tsv table
-
-| Name of column | Cardinality | Documentation | CV |
-|---|---|---|---|
-| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. |  |
-| experiment_alias | mandatory | Identifies the parent experiment. |  |
-| file_name | mandatory | The name or relative pathname of a run data file. |  |
-| file_type | mandatory | The run data file model. | yes |
-| file_checksum | optional | Checksum of uncompressed file. If not given, the checksum will be calculated based on the data files specified in the --data option |  |
+Please check out the [template](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) of your checklist to discover which attributes are mandatory for the study, experiment and run ENA objects.
 
 
 ### Dev instance
@@ -176,7 +138,7 @@ There are two ways of submitting only a selection of objects to ENA. This is han
 | sample_alias_5 |        | sample_title_2 | 2697049  | sample_description_2 |
 
 
-> IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, not rows will be submitted! Either leave out the column or add to every row the corect action.
+> IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, no rows will be submitted! Either leave out the column or add the correct action to every row you want to submit.
 
 
 ### Using Excel templates
@@ -215,7 +177,7 @@ By default the updated tables after submission will have the action `added` in t
 ## Tool overview
 
 **inputs**:
-* metadata tables/excelsheet
+* metadata tables/excelsheet/isa_json
   * examples in `example_table` and on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) for excel sheets
   * (optional) define actions in **status** column e.g. `add`, `modify`, `cancel`, `release` (when not given the whole table is submitted)
   * to perform bulk submission of all objects, the `aliases ids` in different ENA objects should be in the association where alias ids in experiment object link all objects together
@@ -262,6 +224,11 @@ By default the updated tables after submission will have the action `added` in t
   ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx 
   ```
 
+* **Using an ISA JSON**
+  ```
+  ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --secret .secret.yml --isa_json tests/test_data/simple_test_case_v2.json --isa_assay_stream "Ena stream 1"
+  ```
+
 * **Release submission**
   ```
   ena-upload-cli --action release --center 'your_center_name' --study example_tables/ENA_template_studies_release.tsv --dev --secret .secret.yml

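Reviewer note on the new `--isa_json` / `--isa_assay_stream` options documented above: the following is a minimal sketch of how `argparse` handles them, mirroring the `add_argument` calls and the "requires" check this PR adds to `process_args()`. The helper names `build_parser` and `parse`, the `prog` string, and the shortened help texts are hypothetical, and the file-existence check is left out so the sketch runs without any input file.

```python
import argparse


def build_parser():
    # Hypothetical stand-alone parser holding only the two new options.
    parser = argparse.ArgumentParser(prog="ena-upload-cli-sketch")
    parser.add_argument('--isa_json',
                        help='ISA json describing the ENA objects')
    parser.add_argument('--isa_assay_stream',
                        nargs='*',
                        help='assay stream(s) that hold the ENA information')
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Same rule as in the PR: an ISA JSON needs at least the stream flag.
    if args.isa_json and args.isa_assay_stream is None:
        parser.error("--isa_json requires --isa_assay_stream")
    return args


args = parse(['--isa_json', 'simple_test_case_v2.json',
              '--isa_assay_stream', 'Ena stream 1', 'Ena stream 2'])
print(args.isa_assay_stream)
```

Because `nargs='*'` is used, passing the bare flag `--isa_assay_stream` without values produces an empty list rather than `None`, so the `is None` check does not catch that case; whether that should be rejected is a design decision for the authors.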
diff --git a/ena_upload/_version.py b/ena_upload/_version.py
@@ -1 +1 @@
-__version__ = "0.6.4"
+__version__ = "0.7.0"
diff --git a/ena_upload/ena_upload.py b/ena_upload/ena_upload.py
@@ -12,6 +12,7 @@
 import hashlib
 import ftplib
 import requests
+import json
 import uuid
 import numpy as np
 import re
@@ -21,6 +22,8 @@
 import tempfile
 from ena_upload._version import __version__
 from ena_upload.check_remote import remote_check
+from ena_upload.json_parsing.ena_submission import EnaSubmission
+
 
 SCHEMA_TYPES = ['study', 'experiment', 'run', 'sample']
 
@@ -371,7 +374,7 @@ def get_taxon_id(scientific_name):
         taxon_id = r.json()[0]['taxId']
         return taxon_id
     except ValueError:
-        msg = f'Oops, no taxon ID avaible for {scientific_name}. Is it a valid scientific name?'
+        msg = f'Oops, no taxon ID available for {scientific_name}. Is it a valid scientific name?'
         sys.exit(msg)
 
 
@@ -390,7 +393,7 @@ def get_scientific_name(taxon_id):
         taxon_id = r.json()['scientificName']
         return taxon_id
     except ValueError:
-        msg = f'Oops, no scientific name avaible for {taxon_id}. Is it a valid taxon_id?'
+        msg = f'Oops, no scientific name available for {taxon_id}. Is it a valid taxon_id?'
         sys.exit(msg)
 
 
@@ -413,16 +416,15 @@ def submit_data(file_paths, password, webin_id):
 
     except IOError as ioe:
         print(ioe)
-        print("ERROR: could not connect to the ftp server.\
+        sys.exit("ERROR: could not connect to the ftp server.\
                Please check your login details.")
-        sys.exit()
     for filename, path in file_paths.items():
         print(f'uploading {path}')
         try:
             print(ftps.storbinary(f'STOR {filename}', open(path, 'rb')))
         except BaseException as err:
             print(f"ERROR: {err}")
-            print("ERROR: If your connection times out at this stage, it propably is because of a firewall that is in place. FTP is used in passive mode and connection will be opened to one of the ports: 40000 and 50000.")
+            print("ERROR: If your connection times out at this stage, it is probably because of a firewall that is in place. FTP is used in passive mode and a connection will be opened to one of the ports: 40000 and 50000.")
             raise
     print(ftps.quit())
 
@@ -699,7 +701,7 @@ def process_args():
 
     parser.add_argument('--data',
                         nargs='*',
-                        help='data for submission',
+                        help='data for submission, this can be a list of files',
                         metavar='FILE')
 
     parser.add_argument('--center',
@@ -712,6 +714,13 @@ def process_args():
 
     parser.add_argument('--xlsx',
                         help='filled in excel template with metadata')
+
+    parser.add_argument('--isa_json',
+                        help='BETA: ISA json describing the ENA objects')
+
+    parser.add_argument('--isa_assay_stream',
+                        nargs='*',
+                        help='BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams')
 
     parser.add_argument('--auto_action',
                         action="store_true",
@@ -749,7 +758,7 @@ def process_args():
 
     # check if any table is given
     tables = set([args.study, args.sample, args.experiment, args.run])
-    if tables == {None} and not args.xlsx:
+    if tables == {None} and not args.xlsx and not args.isa_json:
         parser.error('Requires at least one table for submission')
 
     # check if .secret file exists
@@ -764,6 +773,14 @@ def process_args():
             msg = f"Oops, the file {args.xlsx} does not exist"
             parser.error(msg)
 
+    # check if ISA json file exists
+    if args.isa_json:
+        if not os.path.isfile(args.isa_json):
+            msg = f"Oops, the file {args.isa_json} does not exist"
+            parser.error(msg)
+        if args.isa_assay_stream is None:
+            parser.error("--isa_json requires --isa_assay_stream")
+
     # check if data is given when adding a 'run' table
     if (not args.no_data_upload and args.run and args.action.upper() not in ['RELEASE', 'CANCEL']) or (not args.no_data_upload and args.xlsx and args.action.upper() not in ['RELEASE', 'CANCEL']):
         if args.data is None:
@@ -816,6 +833,8 @@ def main():
     secret = args.secret
     draft = args.draft
     xlsx = args.xlsx
+    isa_json_file = args.isa_json
+    isa_assay_stream = args.isa_assay_stream
     auto_action = args.auto_action
 
     with open(secret, 'r') as secret_file:
@@ -857,6 +876,25 @@ def main():
             schema_dataframe[schema] = xl_sheet
             path = os.path.dirname(os.path.abspath(xlsx))
             schema_tables[schema] = f"{path}/ENA_template_{schema}.tsv"
+    elif isa_json_file:
+        # Read json file
+        with open(isa_json_file, 'r') as json_file:
+            isa_json = json.load(json_file)
+
+        schema_tables = {}
+        schema_dataframe = {}
+        required_assays = []
+        for stream in isa_assay_stream:
+            required_assays.append({"assay_stream": stream})
+        submission = EnaSubmission.from_isa_json(isa_json, required_assays)
+        submission_dataframes = submission.generate_dataframes()
+        for schema, df in submission_dataframes.items():
+            schema_dataframe[schema] = check_columns(
+                df, schema, action, dev, auto_action)
+            path = os.path.dirname(os.path.abspath(isa_json_file))
+            schema_tables[schema] = f"{path}/ENA_template_{schema}.tsv"
+
+
     else:
         # collect the schema with table input from command-line
         schema_tables = collect_tables(args)

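Reviewer note on the new `elif isa_json_file:` branch in `main()`: the two pieces of bookkeeping it performs (wrapping each assay stream name in a dict, and placing the generated `ENA_template_<schema>.tsv` files next to the ISA-JSON input) can be sketched in isolation as below. The helper names `build_required_assays` and `table_path` are hypothetical and not part of this PR; the call to `EnaSubmission.from_isa_json` is deliberately left out because its implementation is not shown in this part of the diff.

```python
import os

SCHEMA_TYPES = ['study', 'experiment', 'run', 'sample']


def build_required_assays(streams):
    """Wrap each assay stream name in the dict shape used in main()."""
    return [{"assay_stream": stream} for stream in streams]


def table_path(isa_json_file, schema):
    """Build the output table path in the folder of the ISA-JSON file."""
    folder = os.path.dirname(os.path.abspath(isa_json_file))
    return f"{folder}/ENA_template_{schema}.tsv"


required_assays = build_required_assays(["Ena stream 1"])
paths = {schema: table_path("tests/test_data/simple_test_case_v2.json", schema)
         for schema in SCHEMA_TYPES}

print(required_assays)
for schema, path in paths.items():
    print(schema, path)
```

The paths are resolved against the current working directory through `os.path.abspath`, so the printed locations depend on where the command is run.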
diff --git a/ena_upload/json_parsing/__init__.py b/ena_upload/json_parsing/__init__.py