Commit
Merge branch 'develop' into qc_report2
siebrenf authored Sep 13, 2023
2 parents 4d899b4 + 3edc246 commit d840a71
Showing 13 changed files with 267 additions and 44 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,10 @@ All changed fall under either one of these types: `Added`, `Changed`, `Deprecate

## [Unreleased]

### Added

- download samples directly from ENCODE by assay (ENCSR) and file (ENCFF) accession IDs.

### Changed

- the init, run, and explain commands display the supported workflows in their --help
6 changes: 3 additions & 3 deletions docs/content/gettingstarted.md
@@ -7,8 +7,8 @@
Download and install [mambaforge](https://github.com/conda-forge/miniforge) if not yet installed:

```console
user@comp:~$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
user@comp:~$ bash miniconda.sh # (make sure to say **yes** when asked to run conda init)
user@comp:~$ wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh -O mambaforge.sh
user@comp:~$ bash mambaforge.sh # (make sure to say **yes** when asked to initialize conda)
user@comp:~$ source ~/.bashrc
```

@@ -25,7 +25,7 @@ user@comp:~$ conda config --add channels conda-forge
The most straightforward way to install seq2science is with [conda](https://docs.continuum.io/anaconda/) using the [bioconda](https://bioconda.github.io/) channel. To install seq2science in a fresh environment using bioconda:

```console
(base) user@comp:~$ conda create -n seq2science seq2science
(base) user@comp:~$ mamba create -n seq2science seq2science
```

This should install the *newest* official release of seq2science.
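After the environment is created, it has to be activated before use. A quick sanity check could look like this (using the environment name from the command above):

```console
(base) user@comp:~$ conda activate seq2science
(seq2science) user@comp:~$ seq2science --help
```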
2 changes: 2 additions & 0 deletions docs/content/seq2science_profiles.md
@@ -3,6 +3,8 @@
You might have custom settings that you always use, e.g. when running seq2science on a cluster.
For these cases we made generic profiles, which can be found here: https://github.com/vanheeringen-lab/seq2science-profiles

Support for profiles is **experimental**, and might not always work as intended. Please let us know if you come across any issues.

Running seq2science after installation of a profile is as easy as:

```
28 changes: 16 additions & 12 deletions docs/content/workflows/download_fastq.md
@@ -9,9 +9,9 @@ Downloading public data in bulk from the NCBI, ENA, and DDBJ databases has never

#### Download SRA file

The four most popular databases that store sequencing data are National Center for Biotechnology Information (NCBI), the European Nucleotide Archive (ENA), the DNA Data Bank of Japan (DDBJ), and the Genome Sequence Archive (GSA).
Only ENA and GSA store the actual fastq files, and DDBJ and NCBI store the raw data (as a sra file) from which a fastq can be derived.
For this reason for each sample will first be checked if it can be downloaded from ENA.
The five most popular databases that store sequencing data are the National Center for Biotechnology Information (NCBI), the European Nucleotide Archive (ENA), the DNA Data Bank of Japan (DDBJ), the Genome Sequence Archive (GSA), and the ENCODE project (ENCODE).
ENA, ENCODE, and GSA store the actual fastq files, while DDBJ and NCBI store the raw data (as an SRA file) from which a fastq can be derived.
For this reason, seq2science first checks for each sample on DDBJ and NCBI whether it can be downloaded directly from ENA as a fastq.
Otherwise the sample is downloaded in its raw format. To convert this data to a fastq it has to be "*dumped*".
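As a rough illustration of this per-sample decision, here is a simplified sketch (not the actual pipeline code); the field name `ena_fastq_ftp` matches the metadata key seq2science stores internally:

```python
# Simplified sketch of the per-run download decision, not seq2science's real implementation.
def download_route(run_metadata: dict) -> str:
    """Pick the download route for a single run based on its (simplified) metadata."""
    if run_metadata.get("ena_fastq_ftp"):
        # a fastq is hosted on ENA: download it directly
        return "ENA fastq download"
    # no direct fastq available: fetch the raw SRA file and "dump" it to fastq
    return "SRA download + fastq dump"


print(download_route({"ena_fastq_ftp": None}))  # -> "SRA download + fastq dump"
```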

### Filling out the samples.tsv
@@ -22,22 +22,26 @@ As an example, the `samples.tsv` could look something like this:

```
sample
ERX123 <-- EBI ENA experiment
ERR321 <-- EBI ENA run
GSMabc <-- GEO sample
SRX456 <-- SRA experiment
SRRxzy <-- SRA run
DRX890 <-- DDBJ experiment
DRR098 <-- DDBJ run
CRX123 <-- GSA experiment
CRX123 <-- GSA experiment
DRX890 <-- DDBJ experiment
DRR098 <-- DDBJ run
ENCSR765 <-- ENCODE assay
ENCFF432 <-- ENCODE fastq file
ERX123 <-- EBI ENA experiment
ERR321 <-- EBI ENA run
GSMabc <-- GEO sample
SRX456 <-- SRA experiment
SRRxzy <-- SRA run
```

#### Sample column

When downloading fastq files there is only one column in the samples.tsv.
This is the sample column, where each sample is specified.
Samples are specified by their accession id (e.g. GSM2837484).
(Accepted formats start with "GSM", "SRR", "SRX", "DRR", "DRX", "ERR" "ERX", or "CRX")
(Accepted formats start with "CRX", "DRR", "DRX", "ENCFF", "ENCSR", "ERR", "ERX", "GSM", "SRR", or "SRX")

When the specified ENCODE fastq file belongs to a paired-end sequencing run, both mates will be downloaded. The files are named after the sample, with R1 and R2 matching ENCODE's mate assignment.
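For reference, the pairing information comes from the public ENCODE JSON API. A minimal sketch of such a lookup is shown below; the accession is a placeholder, and the fields read here (`run_type`, `href`, `paired_with`) are the same ones seq2science uses internally:

```python
import json
import urllib.request

# Minimal sketch of an ENCODE file lookup; ENCFF000XXX is a placeholder accession.
accession = "ENCFF000XXX"
url = f"https://www.encodeproject.org/files/{accession}/?format=json"
with urllib.request.urlopen(url) as response:
    info = json.loads(response.read().decode("utf-8"))

print(info["run_type"])                                 # "single-ended" or "paired-ended"
print("https://www.encodeproject.org" + info["href"])   # direct fastq download link
if info["run_type"] == "paired-ended":
    print(info["paired_with"])                          # path of the mate file, e.g. "/files/ENCFF.../"
```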

#### Final notes

3 changes: 2 additions & 1 deletion seq2science/cli.py
@@ -188,7 +188,8 @@ def seq2science_parser(workflows_dir="./seq2science/workflows/"):
"-p",
"--profile",
metavar="PROFILE NAME",
help="Use a seq2science profile. Profiles can be taken from: https://github.com/s2s-profiles",
help="Use a seq2science profile. Profiles can be taken from: https://github.com/vanheeringen-lab/seq2science-profiles. "
"Profiles are an experimental feature and might not always work as intended. Please let us know if something doesn't work properly!",
)
subparser.add_argument(
"-c",
20 changes: 11 additions & 9 deletions seq2science/rules/configuration_generic.smk
@@ -447,7 +447,7 @@ def parse_pysradb():
sampledict.update(samples2metadata(missing_samples, config, logger))

pickle.dump(
{k: v for k, v in sampledict.items() if k.startswith(("ERR", "ERX", "SRR", "SRX", "GSM", "DRX", "DRR"))},
{k: v for k, v in sampledict.items() if k.startswith(("ERR", "ERX", "SRR", "SRX", "GSM", "DRX", "DRR", "CRX", "ENCSR", "ENCFF"))},
open(pysradb_cache, "wb"),
)

@@ -476,31 +476,31 @@ def parse_pysradb():
if (values["layout"] == "PAIRED") and values.get("ena_fastq_ftp") is not None
for run in values["runs"]
]
gsa_single_end = [
gsa_or_encode_single_end = [
run
for values in sampledict.values()
if (values["layout"] == "SINGLE") and values.get("gsa_fastq_http") is not None
if (values["layout"] == "SINGLE") and values.get("gsa_fastq_http", values.get("encode_fastq_http")) is not None
for run in values["runs"]
]
gsa_paired_end = [
gsa_or_encode_paired_end = [
run
for values in sampledict.values()
if (values["layout"] == "PAIRED") and values.get("gsa_fastq_http") is not None
if (values["layout"] == "PAIRED") and values.get("gsa_fastq_http", values.get("encode_fastq_http")) is not None
for run in values["runs"]
]
sra_single_end = [
run
for values in sampledict.values()
if (values["layout"] == "SINGLE")
for run in values.get("runs", [])
if (run not in ena_single_end and run not in gsa_paired_end)
if (run not in ena_single_end and run not in gsa_or_encode_single_end)
]
sra_paired_end = [
run
for values in sampledict.values()
if (values["layout"] == "PAIRED")
for run in values.get("runs", [])
if (run not in ena_paired_end and run not in gsa_paired_end)
if (run not in ena_paired_end and run not in gsa_or_encode_paired_end)
]

# get download link per run
@@ -515,6 +515,8 @@ def parse_pysradb():
run2download[run] = values["ena_fastq_ftp"][run]
if "gsa_fastq_http" in values:
run2download[run] = values["gsa_fastq_http"][run]
if "encode_fastq_http" in values:
run2download[run] = values["encode_fastq_http"][run]

# if samples are merged add the layout of the technical replicate to the config
failed_samples = dict()
@@ -539,9 +541,9 @@ def parse_pysradb():
logger.error("\n")
os._exit(1) # noqa

return sampledict, ena_single_end, ena_paired_end, gsa_single_end, gsa_paired_end, sra_single_end, sra_paired_end, run2download, pysradb_cache_lock
return sampledict, ena_single_end, ena_paired_end, gsa_or_encode_single_end, gsa_or_encode_paired_end, sra_single_end, sra_paired_end, run2download, pysradb_cache_lock

SAMPLEDICT, ENA_SINGLE_END, ENA_PAIRED_END, GSA_SINGLE_END, GSA_PAIRED_END, SRA_SINGLE_END, SRA_PAIRED_END, RUN2DOWNLOAD, PYSRADB_CACHE_LOCK = parse_pysradb()
SAMPLEDICT, ENA_SINGLE_END, ENA_PAIRED_END, GSA_OR_ENCODE_SINGLE_END, GSA_OR_ENCODE_PAIRED_END, SRA_SINGLE_END, SRA_PAIRED_END, RUN2DOWNLOAD, PYSRADB_CACHE_LOCK = parse_pysradb()
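# Rough illustration (simplified, not taken from the code above) of what these objects end up holding:
#   SAMPLEDICT["ENCSR765..."] -> {"layout": ..., "runs": [...], "encode_fastq_http": {...}}
#   RUN2DOWNLOAD["ERR....."]  -> ENA link(s) taken from "ena_fastq_ftp"
#   RUN2DOWNLOAD["CRR....."]  -> GSA link(s) taken from "gsa_fastq_http"
#   RUN2DOWNLOAD["ENCFF..."]  -> ENCODE link(s) taken from "encode_fastq_http"
# Runs without a direct fastq link are left to the sra2fastq_* rules instead.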

# workflow

20 changes: 10 additions & 10 deletions seq2science/rules/get_fastq.smk
@@ -1,7 +1,7 @@
"""
all rules/logic related to downloading public fastqs should be here.
"""
localrules: run2sra, ena2fastq_SE, ena2fastq_PE, gsa2fastq_SE, gsa2fastq_PE, runs2sample
localrules: run2sra, ena2fastq_SE, ena2fastq_PE, gsa_or_encode2fastq_SE, gsa_or_encode2fastq_PE, runs2sample


import os
@@ -71,12 +71,12 @@ rule run2sra:


# ENA > SRA
ruleorder: ena2fastq_SE > gsa2fastq_SE > sra2fastq_SE
ruleorder: ena2fastq_PE > gsa2fastq_PE > sra2fastq_PE
ruleorder: ena2fastq_SE > gsa_or_encode2fastq_SE > sra2fastq_SE
ruleorder: ena2fastq_PE > gsa_or_encode2fastq_PE > sra2fastq_PE
# PE > SE
ruleorder: ena2fastq_PE > ena2fastq_SE
ruleorder: sra2fastq_PE > sra2fastq_SE
ruleorder: gsa2fastq_PE > gsa2fastq_SE
ruleorder: gsa_or_encode2fastq_PE > gsa_or_encode2fastq_SE


rule sra2fastq_SE:
@@ -234,9 +234,9 @@ rule ena2fastq_PE:
shell("exit 1 >> {log} 2>&1")


rule gsa2fastq_SE:
rule gsa_or_encode2fastq_SE:
"""
Download single-end fastq files from the GSA.
Download single-end fastq files from the GSA or ENCODE DCC.
"""
output:
temp(expand("{fastq_dir}/runs/{{run}}.{fqsuffix}.gz", **config)),
@@ -247,7 +247,7 @@ rule gsa2fastq_SE:
benchmark:
expand("{benchmark_dir}/gsa2fastq_SE/{{run}}.benchmark.txt", **config)[0]
wildcard_constraints:
run="|".join(GSA_SINGLE_END) if len(GSA_SINGLE_END) else "$a",
run="|".join(GSA_OR_ENCODE_SINGLE_END) if len(GSA_OR_ENCODE_SINGLE_END) else "$a",
priority: 1
retries: 2
run:
@@ -261,9 +261,9 @@ rule gsa2fastq_PE:
shell("exit 1 >> {log} 2>&1")


rule gsa2fastq_PE:
rule gsa_or_encode2fastq_PE:
"""
Download paired-end fastq files from the GSA.
Download paired-end fastq files from the GSA or ENCODE DCC.
"""
output:
temp(expand("{fastq_dir}/runs/{{run}}_{fqext}.{fqsuffix}.gz", **config)),
@@ -274,7 +274,7 @@ rule gsa2fastq_PE:
benchmark:
expand("{benchmark_dir}/gsa2fastq_PE/{{run}}.benchmark.txt", **config)[0]
wildcard_constraints:
run="|".join(GSA_PAIRED_END) if len(GSA_PAIRED_END) else "$a",
run="|".join(GSA_OR_ENCODE_PAIRED_END) if len(GSA_OR_ENCODE_PAIRED_END) else "$a",
priority: 1
retries: 2
run:
117 changes: 112 additions & 5 deletions seq2science/util.py
@@ -6,6 +6,7 @@
import os
import sys
import glob
import json
import time
import colorsys
import tempfile
@@ -128,7 +129,7 @@ def samples2metadata_local(samples: List[str], config: dict, logger) -> dict:
):
SAMPLEDICT[sample] = dict()
SAMPLEDICT[sample]["layout"] = "PAIRED"
elif sample.startswith(("GSM", "DRX", "ERX", "SRX", "DRR", "ERR", "SRR", "CRX")):
elif sample.startswith(("GSM", "DRX", "ERX", "SRX", "DRR", "ERR", "SRR", "CRX", "ENCSR", "ENCFF")):
continue
else:
extend_msg = ""
@@ -143,7 +144,7 @@ def samples2metadata_local(samples: List[str], config: dict, logger) -> dict:
f"We checked directory '{config['fastq_dir']}' "
f"for gzipped files starting with '{sample}' and containing '{config['fqsuffix']}'.\n"
+ extend_msg
+ f"Since the sample did not start with either GSM, SRX, SRR, ERR, and DRR we "
+ f"Since the sample did not start with either GSM, SRX, SRR, ERX, ERR, DRX, DRR, CRX, ENCSR, or ENCFF we "
f"couldn't find it online..\n"
)
os._exit(1) # noqa
@@ -207,7 +208,7 @@ def samples2metadata_gsa(samples: List[str], logger) -> dict:
"runs": ["CRR1234", "CRR4321"],
"gsa_fastq_http": {CRR1234: [link_1, link_2], ...,
"SRR5678": {"layout": "SINGLE",
"CRR5678": {"layout": "SINGLE",
"runs": ["CRR5678"],
gsa_fastq_http: {CRR5678: [link_1]},
...
@@ -362,6 +363,110 @@ def samples2metadata_sra(samples: List[str], logger) -> dict:
return sampledict


def samples2metadata_encode(samples: List[str], logger) -> dict:
    """
    Get the metadata of ENCODE assay (ENCSR) and file (ENCFF) accessions.
    """
    metadata = {}
    for sample in samples:
        if sample.startswith("ENCSR"):
            metadata.update(ENCSR2metadata(sample))
        elif sample.startswith("ENCFF"):
            metadata.update(ENCFF2metadata(sample))
        else:
            assert False
    return metadata


def ENCFF2metadata(sample):
    """
    Get the metadata (layout, runs and download links) of a single ENCODE fastq file (ENCFF) accession.
    """
    url = f"https://www.encodeproject.org/files/{sample}/?format=json"
    response = urllib.request.urlopen(urllib.request.Request(url)).read()
    response = json.loads(response.decode('utf-8'))

    assert response["file_format"] == "fastq"

    metadata = {sample: {}}

    if response["run_type"] == "single-ended":
        metadata[sample]["layout"] = "SINGLE"
        # keyed by run, like the paired-end case, so the downstream run2download lookup can index it
        metadata[sample]["encode_fastq_http"] = {sample: "https://www.encodeproject.org" + response["href"]}
        metadata[sample]["runs"] = [sample]
    elif response["run_type"] == "paired-ended":
        metadata[sample]["layout"] = "PAIRED"
        mate = response["paired_with"].split("/")[2]
        mate_url = f"https://www.encodeproject.org/files/{mate}/?format=json"
        mate_response = urllib.request.urlopen(urllib.request.Request(mate_url)).read()
        mate_response = json.loads(mate_response.decode('utf-8'))

        # always store the download links in R1, R2 order
        if response["paired_end"] == "1":
            metadata[sample]["encode_fastq_http"] = {
                sample: [
                    "https://www.encodeproject.org" + response["href"],
                    "https://www.encodeproject.org" + mate_response["href"],
                ]
            }
        else:
            metadata[sample]["encode_fastq_http"] = {
                sample: [
                    "https://www.encodeproject.org" + mate_response["href"],
                    "https://www.encodeproject.org" + response["href"],
                ]
            }

        metadata[sample]["runs"] = [sample]
    else:
        assert False
    return metadata


def ENCSR2metadata(sample):
    """
    Get the metadata of an ENCODE assay (ENCSR) accession.
    An assay can consist of multiple isogenic and technical replicates.
    """
    url = f"https://www.encodeproject.org/search/?type=File&dataset=/experiments/{sample}/&file_format=fastq&format=json&frame=object"
    response = urllib.request.urlopen(urllib.request.Request(url)).read()
    response = json.loads(response.decode('utf-8'))

    metadata = {sample: {}}

    # check if all run types are the same
    assert len({file["run_type"] for file in response["@graph"]}) == 1
    layout = response["@graph"][0]["run_type"]

    if layout == "single-ended":
        metadata[sample]["layout"] = "SINGLE"
        metadata[sample]["encode_fastq_http"] = {
            x["accession"]: "https://www.encodeproject.org" + x["href"] for x in response["@graph"]
        }
        metadata[sample]["runs"] = list(metadata[sample]["encode_fastq_http"].keys())

    elif layout == "paired-ended":
        graph = {x["accession"]: x for x in response["@graph"]}

        metadata[sample]["layout"] = "PAIRED"
        metadata[sample]["runs"] = [x["accession"] for x in response["@graph"] if x["paired_end"] == "1"]

        # for each R1 run, store the R1 and R2 download links in order
        metadata[sample]["encode_fastq_http"] = {
            run: [
                "https://www.encodeproject.org" + graph[run]["href"],
                "https://www.encodeproject.org" + graph[graph[run]["paired_with"].split("/")[2]]["href"],
            ]
            for run in metadata[sample]["runs"]
        }
    else:
        assert False

    return metadata
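# Example (placeholder accessions) of the shapes returned by the two helpers above:
#
#   ENCFF2metadata("ENCFF000AAA")   # a paired-end fastq file
#   -> {"ENCFF000AAA": {"layout": "PAIRED",
#                       "runs": ["ENCFF000AAA"],
#                       "encode_fastq_http": {"ENCFF000AAA": ["<R1 url>", "<R2 url>"]}}}
#
#   ENCSR2metadata("ENCSR000BBB")   # a paired-end assay; one run per R1 fastq in the assay
#   -> {"ENCSR000BBB": {"layout": "PAIRED",
#                       "runs": ["ENCFF000CCC", "ENCFF000DDD"],
#                       "encode_fastq_http": {"ENCFF000CCC": ["<R1 url>", "<R2 url>"],
#                                             "ENCFF000DDD": ["<R1 url>", "<R2 url>"]}}}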


def samples2metadata(samples: List[str], config: dict, logger) -> dict:
local_samples = samples2metadata_local(samples, config, logger)
public_samples = [sample for sample in samples if sample not in local_samples.keys()]
@@ -371,7 +476,9 @@ def samples2metadata(samples: List[str], config: dict, logger) -> dict:

gsa_samples = [sample for sample in public_samples if sample.startswith("CRX")]
gsa_samples = samples2metadata_gsa(gsa_samples, logger)
rest_public_samples = [sample for sample in public_samples if sample not in gsa_samples]
encode_samples = [sample for sample in public_samples if sample.startswith(("ENCSR", "ENCFF"))]
encode_samples = samples2metadata_encode(encode_samples, logger)
rest_public_samples = [sample for sample in public_samples if sample not in (list(gsa_samples.keys()) + list(encode_samples.keys()))]

# chop public samples into smaller chunks, doing large queries results into
# pysradb decode errors..
@@ -383,7 +490,7 @@ def samples2metadata(samples: List[str], config: dict, logger) -> dict:
# just to be sure sleep in between to not go over our API limit
time.sleep(1)

return {**local_samples, **sra_samples, **gsa_samples}
return {**local_samples, **sra_samples, **gsa_samples, **encode_samples}


def sieve_bam(configdict):