Merge pull request #88 from ELIXIR-Belgium/dev
Adding ISA-JSON support as input file.
bedroesb authored Dec 18, 2023
2 parents d540934 + abf7b73 commit cd3e8a9
Showing 42 changed files with 16,576 additions and 66 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -2,4 +2,4 @@
.secret.yml
build/
ena_upload_cli.egg-info/
ena_upload/__pycache__/
__pycache__/
77 changes: 22 additions & 55 deletions README.md
@@ -7,18 +7,18 @@

# ENA upload tool

This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the excel spreadsheets that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/) and can be added to you own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).
This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the excel spreadsheets that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programmatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/) and can be added to your own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).

## Overview

The metadata should be provided in separate tables corresponding to the following ENA objects:
The metadata should be provided in separate tables or files carrying similar information corresponding to the following ENA objects:

* STUDY
* SAMPLE
* EXPERIMENT
* RUN

The program to perform the following actions:
You can set the tool to perform the following actions:

* add: add an object to the archive
* modify: modify an object in the archive
@@ -29,11 +29,15 @@ After a successful submission, new tsv tables will be generated with the ENA acc

## Tool dependencies

* python 3.5+ including following packages:
* python 3.7+ including following packages:
* Genshi
* lxml
* pandas
* requests
* pyyaml
* openpyxl
* jsonschema


## Installation

@@ -60,12 +64,14 @@ All supported arguments:
--experiment EXPERIMENT
table of EXPERIMENT object
--run RUN table of RUN object
--data [FILE [FILE ...]]
data for submission
--data [FILE ...] data for submission
--center CENTER_NAME specific to your Webin account
--checklist CHECKLIST
specify the sample checklist with following pattern: ERC0000XX, Default: ERC000011
--xlsx XLSX filled in excel template with metadata
--isa_json ISA_JSON BETA: ISA json describing the ENA objects
--isa_assay_stream ISA_ASSAY_STREAM
BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams
--auto_action BETA: detect automatically which action (add or modify) to apply when the action column is not given
--tool TOOL_NAME specify the name of the tool this submission is done with. Default: ena-upload-cli
--tool_version TOOL_VERSION
@@ -88,7 +94,7 @@ To avoid exposing your credentials through the terminal history, it is recommend

### ENA sample checklists

You can specify ENA sample checklist using the `--checklist` parameter. By default the ENA default sample checklist is used supporting the minimum information required for the sample (ERC000011). The supported checklists are listed on the [ENA website](https://www.ebi.ac.uk/ena/browser/checklists). This website will also describe which Field Names you have to use in the header of your sample tsv table. The Field Names will be automatically mapped in the outputted xml if the correct `--checklist` parameter is given.
You can specify ENA sample checklist using the `--checklist` parameter. By default the ENA default sample checklist is used supporting the minimum information required for the sample (ERC000011). The supported checklists are listed on our [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates).

#### Fixed sample columns

@@ -104,55 +110,11 @@ The command line tool will automatically fetch the correct scientific name based

#### Viral submissions

If you want to submit viral samples you can use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist by adding `ERC000033` to the checklist parameter. Check out our [viral example command](#test-the-tool) as demonstration. Please use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist on the website of ENA to know which values are allowed/possible in the `restricted text` and `text choice` fields.
If you want to submit viral samples you can use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist by adding `ERC000033` to the checklist parameter. Check out our [viral example command](#test-the-tool) as demonstration. Please use the [ENA virus pathogen](https://github.com/ELIXIR-Belgium/ENA-metadata-templates/tree/main/templates/ERC000033) checklist in our template repo to know what is allowed/possible in the `Controlled vocabulary` fields.

### ENA study, experiment and run tables

Here we list all the possible columns one can have in its study, experiment or run table along with its cardinality and controlled vocabulary (CV).
Currently we refer to the [ENA Webin](https://wwwdev.ebi.ac.uk/ena/submit/webin/) to discover which values are allowed when a controlled vocabulary is used, but this will change in the future.

#### Study tsv table

| Name of column | Cardinality | Documentation | CV |
|---|---|---|---|
| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. | |
| title | mandatory | Title of the study as would be used in a publication. | |
| study_type | mandatory | The STUDY_TYPE presents a controlled vocabulary for expressing the overall purpose of the study. | yes |
| study_abstract | mandatory | Briefly describes the goals, purpose, and scope of the Study. This need not be listed if it can be inherited from a referenced publication. | |
| center_project_name | optional | Submitter defined project name. This field is intended for backward tracking of the study record to the submitter's LIMS. | |
| study_description | optional | More extensive free-form description of the study. | |
| pubmed_id | optional | Link to publication related to this study. | |

#### Experiment tsv table

| Name of column | Cardinality | Documentation | CV |
|---|---|---|---|
| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. | |
| title | mandatory | Short text that can be used to call out experiment records in searches or in displays. | |
| study_alias | mandatory | Identifies the parent study. | |
| sample_alias | mandatory | Pick a sample to associate this experiment with. The sample may be an individual or a pool, depending on how it is specified. | |
| design_description | mandatory | Goal and setup of the individual library including library was constructed. | |
| spot_descriptor | optional | The SPOT_DESCRIPTOR specifies how to decode the individual reads of interest from the monolithic spot sequence. The spot descriptor contains aspects of the experimental design, platform, and processing information. There will be two methods of specification: one will be an index into a table of typical decodings, the other being an exact specification. This construct is needed for loading data and for interpreting the loaded runs. It can be omitted if the loader can infer read layout (from multiple input files or from one input files). | |
| library_name | optional | The submitter's name for this library. | |
| library_layout | mandatory | LIBRARY_LAYOUT specifies whether to expect single, paired, or other configuration of reads. In the case of paired reads, information about the relative distance and orientation is specified. | yes |
| insert_size | mandatory | Relative distance. | |
| library_strategy | mandatory | Sequencing technique intended for this library | yes |
| library_source | mandatory | The LIBRARY_SOURCE specifies the type of source material that is being sequenced. | yes |
| library_selection | mandatory | Method used to enrich the target in the sequence library preparation | yes |
| platform | mandatory | The PLATFORM record selects which sequencing platform and platform-specific runtime parameters. This will be determined by the Center. | yes |
| instrument_model | mandatory | Model of the sequencing instrument. | yes |
| library_construction_protocol | optional | Free form text describing the protocol by which the sequencing library was constructed. | |


#### Run tsv table

| Name of column | Cardinality | Documentation | CV |
|---|---|---|---|
| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. | |
| experiment_alias | mandatory | Identifies the parent experiment. | |
| file_name | mandatory | The name or relative pathname of a run data file. | |
| file_type | mandatory | The run data file model. | yes |
| file_checksum | optional | Checksum of uncompressed file. If not given, the checksum will be calculated based on the data files specified in the --data option | |
Please check out the [template](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) of your checklist to discover which attributes are mandatory for the study, experiment and run ENA object.


### Dev instance
@@ -176,7 +138,7 @@ There are two ways of submitting only a selection of objects to ENA. This is han
| sample_alias_5 | | sample_title_2 | 2697049 | sample_description_2 |


> IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, not rows will be submitted! Either leave out the column or add to every row the corect action.
> IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, no rows will be submitted! Either leave out the column or add to every row you want to submit the correct action.

### Using Excel templates
@@ -215,7 +177,7 @@ By default the updated tables after submission will have the action `added` in t
## Tool overview

**inputs**:
* metadata tables/excelsheet
* metadata tables/excelsheet/isa_json
* examples in `example_table` and on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) for excel sheets
* (optional) define actions in **status** column e.g. `add`, `modify`, `cancel`, `release` (when not given the whole table is submitted)
* to perform bulk submission of all objects, the `aliases ids` in different ENA objects should be in the association where alias ids in experiment object link all objects together
@@ -262,6 +224,11 @@ By default the updated tables after submission will have the action `added` in t
ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx
```

* **Using an ISA JSON**
```
ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --secret .secret.yml --isa_json tests/test_data/simple_test_case_v2.json --isa_assay_stream "Ena stream 1"
```

* **Release submission**
```
ena-upload-cli --action release --center 'your_center_name' --study example_tables/ENA_template_studies_release.tsv --dev --secret .secret.yml
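
The two new README options, `--isa_json` and `--isa_assay_stream`, can be exercised in isolation. The following is a minimal sketch, not the tool's actual parser: `build_parser` and `parse_isa_args` are hypothetical helper names, only the two ISA options are modelled, and the file-existence check that the commit also performs is left out. It shows that `nargs='*'` collects several stream names into a list and that `--isa_json` without `--isa_assay_stream` is rejected.

```python
import argparse


def build_parser():
    # Hypothetical stand-in parser: only the two ISA options from this commit.
    parser = argparse.ArgumentParser(prog="ena-upload-cli-sketch")
    parser.add_argument("--isa_json",
                        help="ISA JSON describing the ENA objects")
    parser.add_argument("--isa_assay_stream",
                        nargs="*",
                        help="assay stream(s) that hold the ENA information")
    return parser


def parse_isa_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Same rule as the check added to process_args():
    # an ISA JSON input is only usable together with a stream selection.
    if args.isa_json and args.isa_assay_stream is None:
        parser.error("--isa_json requires --isa_assay_stream")
    return args


args = parse_isa_args(["--isa_json", "submission.json",
                       "--isa_assay_stream", "Ena stream 1", "Ena stream 2"])
print(args.isa_assay_stream)  # ['Ena stream 1', 'Ena stream 2']
```

One consequence of `nargs='*'` worth keeping in mind: passing `--isa_assay_stream` with no values yields an empty list rather than `None`, so an `is None` check of this kind lets it through.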
2 changes: 1 addition & 1 deletion ena_upload/_version.py
@@ -1 +1 @@
__version__ = "0.6.4"
__version__ = "0.7.0"
52 changes: 45 additions & 7 deletions ena_upload/ena_upload.py
@@ -12,6 +12,7 @@
import hashlib
import ftplib
import requests
import json
import uuid
import numpy as np
import re
@@ -21,6 +22,8 @@
import tempfile
from ena_upload._version import __version__
from ena_upload.check_remote import remote_check
from ena_upload.json_parsing.ena_submission import EnaSubmission


SCHEMA_TYPES = ['study', 'experiment', 'run', 'sample']

@@ -371,7 +374,7 @@ def get_taxon_id(scientific_name):
taxon_id = r.json()[0]['taxId']
return taxon_id
except ValueError:
msg = f'Oops, no taxon ID avaible for {scientific_name}. Is it a valid scientific name?'
msg = f'Oops, no taxon ID available for {scientific_name}. Is it a valid scientific name?'
sys.exit(msg)


@@ -390,7 +393,7 @@ def get_scientific_name(taxon_id):
taxon_id = r.json()['scientificName']
return taxon_id
except ValueError:
msg = f'Oops, no scientific name avaible for {taxon_id}. Is it a valid taxon_id?'
msg = f'Oops, no scientific name available for {taxon_id}. Is it a valid taxon_id?'
sys.exit(msg)


@@ -413,16 +416,15 @@ def submit_data(file_paths, password, webin_id):

except IOError as ioe:
print(ioe)
print("ERROR: could not connect to the ftp server.\
sys.exit("ERROR: could not connect to the ftp server.\
Please check your login details.")
sys.exit()
for filename, path in file_paths.items():
print(f'uploading {path}')
try:
print(ftps.storbinary(f'STOR {filename}', open(path, 'rb')))
except BaseException as err:
print(f"ERROR: {err}")
print("ERROR: If your connection times out at this stage, it propably is because of a firewall that is in place. FTP is used in passive mode and connection will be opened to one of the ports: 40000 and 50000.")
print("ERROR: If your connection times out at this stage, it probably is because of a firewall that is in place. FTP is used in passive mode and connection will be opened to one of the ports: 40000 and 50000.")
raise
print(ftps.quit())

@@ -699,7 +701,7 @@ def process_args():

parser.add_argument('--data',
nargs='*',
help='data for submission',
help='data for submission, this can be a list of files',
metavar='FILE')

parser.add_argument('--center',
@@ -712,6 +714,13 @@ def process_args():

parser.add_argument('--xlsx',
help='filled in excel template with metadata')

parser.add_argument('--isa_json',
help='BETA: ISA json describing the ENA objects')

parser.add_argument('--isa_assay_stream',
nargs='*',
help='BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams')

parser.add_argument('--auto_action',
action="store_true",
@@ -749,7 +758,7 @@ def process_args():

# check if any table is given
tables = set([args.study, args.sample, args.experiment, args.run])
if tables == {None} and not args.xlsx:
if tables == {None} and not args.xlsx and not args.isa_json:
parser.error('Requires at least one table for submission')

# check if .secret file exists
@@ -764,6 +773,14 @@ def process_args():
msg = f"Oops, the file {args.xlsx} does not exist"
parser.error(msg)

# check if ISA json file exists
if args.isa_json:
if not os.path.isfile(args.isa_json):
msg = f"Oops, the file {args.isa_json} does not exist"
parser.error(msg)
if args.isa_assay_stream is None :
parser.error("--isa_json requires --isa_assay_stream")

# check if data is given when adding a 'run' table
if (not args.no_data_upload and args.run and args.action.upper() not in ['RELEASE', 'CANCEL']) or (not args.no_data_upload and args.xlsx and args.action.upper() not in ['RELEASE', 'CANCEL']):
if args.data is None:
@@ -816,6 +833,8 @@ def main():
secret = args.secret
draft = args.draft
xlsx = args.xlsx
isa_json_file = args.isa_json
isa_assay_stream = args.isa_assay_stream
auto_action = args.auto_action

with open(secret, 'r') as secret_file:
@@ -857,6 +876,25 @@ def main():
schema_dataframe[schema] = xl_sheet
path = os.path.dirname(os.path.abspath(xlsx))
schema_tables[schema] = f"{path}/ENA_template_{schema}.tsv"
elif isa_json_file:
# Read json file
with open(isa_json_file, 'r') as json_file:
isa_json = json.load(json_file)

schema_tables = {}
schema_dataframe = {}
required_assays = []
for stream in isa_assay_stream:
required_assays.append({"assay_stream": stream})
submission = EnaSubmission.from_isa_json(isa_json, required_assays)
submission_dataframes = submission.generate_dataframes()
for schema, df in submission_dataframes.items():
schema_dataframe[schema] = check_columns(
df, schema, action, dev, auto_action)
path = os.path.dirname(os.path.abspath(isa_json_file))
schema_tables[schema] = f"{path}/ENA_template_{schema}.tsv"


else:
# collect the schema with table input from command-line
schema_tables = collect_tables(args)
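
The new `elif isa_json_file:` branch in `main()` does two small pieces of bookkeeping around the `EnaSubmission` call: it wraps each stream name in a dict and it derives the output table paths from the location of the ISA JSON file. A rough, self-contained sketch of just those two steps follows. The helper names `build_required_assays` and `table_paths` are made up for illustration and are not part of `ena_upload`; `EnaSubmission` itself is deliberately not called here.

```python
import os


def build_required_assays(streams):
    # One {"assay_stream": name} entry per requested stream,
    # equivalent to the append loop in the ISA JSON branch of main().
    return [{"assay_stream": stream} for stream in streams]


def table_paths(isa_json_file, schemas):
    # The generated tsv tables are written next to the ISA JSON file,
    # using the same ENA_template_<schema>.tsv naming as the xlsx branch.
    path = os.path.dirname(os.path.abspath(isa_json_file))
    return {schema: f"{path}/ENA_template_{schema}.tsv" for schema in schemas}


required_assays = build_required_assays(["Ena stream 1"])
paths = table_paths("tests/test_data/simple_test_case_v2.json",
                    ["study", "sample"])

print(required_assays)                      # [{'assay_stream': 'Ena stream 1'}]
print(os.path.basename(paths["study"]))     # ENA_template_study.tsv
```

In the commit, the list built by the first step is passed as the second argument to `EnaSubmission.from_isa_json`, and the schema names come from the keys of the dataframes returned by `generate_dataframes()`.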
Empty file.